Descriptions

RH Parser Project

Paula S. Newman. 2007. RH: A Retro-Hybrid Parser. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2007). pdf

This paper describes a hybrid symbolic natural language parser architecture with two primary layers. The first layer is a chunker, i.e., an analyzer that identifies syntactic chunks such as basic noun phrases (e.g., the happy child); the second layer combines the chunks.

The combination provides good performance because (a) chunking can be done more efficiently than more general parsing, using different techniques, and (b) given larger chunks to combine, the subsequent phase has fewer alternatives to check. The latter also allows for a simple, effective preference method for selecting "best" parses.

The chunker consists of the very fast XIP chunker software, built at the Xerox research facility in Grenoble, France, together with a chunker grammar (a collection of rules) for English. Chunker rules are applied in sequence to build chunks, and, once constructed, a chunk is undone only by some final fixup rules applied at the end of the analysis. Importantly for subsequent processing, the chunker allows lexical entries and chunks to be tagged with various features (e.g., marking that a word denotes a place). The XIP chunker distribution used in RH included an initial chunker grammar for English, which was extensively modified and extended as part of the RH parser development.

The second layer of RH is an ATN-like parser, augmented by many constraints and preference tests, which combines the chunks to form larger units and then full syntactic parses for sentences. The innovative, but simple, preference scoring system (see the preference-system-specific paper) utilizes both syntactic and semantic criteria to select the best syntactic parse. At the time of publication, the parser was as fast as or faster than most other published parsers, and its accuracy was close and continuing to improve.

Paula S. Newman. 2007. Symbolic Preference Using Simple Scoring. In Proceedings of the International Conference on Parsing Technologies (IWPT 2007), Prague, 83-92. pdf

This paper details the second layer of the RH parser, which combines basic chunks supplied by the first, chunker layer (see summary paper), and focuses on the preference scoring aspect. Unlike earlier preference schemes, the scoring is very simple. During the parse, each new attachment (e.g., the attachment of a PP within a VP) is given a score, which is either -1 (dispreferred), 0 (no preference), or +1 (preferred), based on both syntactic and semantic grounds. The best parse is the one with the highest cumulative score (with low attachment preferred among equally high-scoring final parses). As the parse progresses, similar partial parses with lower scores over the same extents are discarded, a critical contributor to parser efficiency.
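To make the scheme concrete, here is a minimal sketch (in Python) of cumulative scoring with same-extent pruning. The data structures and the tie-breaking detail are illustrative assumptions, not the parser's actual implementation:

    from dataclasses import dataclass

    @dataclass
    class Partial:
        start: int   # index of the first chunk covered
        end: int     # index one past the last chunk covered
        score: int   # cumulative attachment score (a sum of -1/0/+1 steps)
        depth: int   # attachment depth; here, greater depth ~ lower attachment

    # Partial parses covering the same extent compete; only the best survives.
    best_by_span = {}

    def add_partial(p):
        key = (p.start, p.end)
        kept = best_by_span.get(key)
        # Ties are broken in favor of low attachment (greater depth here).
        if kept is None or (p.score, p.depth) > (kept.score, kept.depth):
            best_by_span[key] = p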

The simplicity of the scoring system is enabled at least in part by the fact that its job is to assign scores to combinations of relatively long sequences with clearly defined properties, so the effects of assigning scores are more easily predictable, and less fine-tuning is needed. Also, in testing and improving the parser, errors attributable to scoring are easily traced to their source.

Paula S. Newman. 2005. TextTree Construction for Parser and Treebank Development. In Proceedings of the Workshop on Software, 2005 Conference of the Association for Computational Linguistics (ACL 2005), Ann Arbor, Michigan. pdf

The TextTree representation was developed during the RH parser project to expedite parser development. TextTrees are formed by converting parser output trees into unlabeled indented strings with minimal bracketing. An example TextTree is

    They
    must have
        a clear delineation
            of
                [roles,
                missions,
                and
                authority].

Reading a TextTree for a correct parse is similar to reading ordinary text, but reading a TextTree for an incorrect parse is jarring. This property allows a file of TextTrees to be read rapidly to review the results of parsing a long document, in order to find (and then analyze) problem areas. See paper for examples and for the TextTree construction method.

TextTrees are constructed by a parser-independent mechanism that accepts, as input, parser output trees expressed in a standard form but with grammar-specific category labels. The formatting is guided by user-specified rules, one per category, expressed as <category, directive> pairs, e.g., <ROOT, Align>.
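A rough sketch (in Python) of the directive-driven formatting idea. The tree representation, and the directives other than Align, are invented for illustration; see the paper for the actual construction method:

    # Hypothetical directives: Align starts each child on its own line at a
    # deeper indent; Inline emits a node's words on a single line.
    DIRECTIVES = {"ROOT": "Align", "VP": "Align", "PP": "Align", "NP": "Inline"}

    def leaves(node):
        _, children = node
        out = []
        for child in children:
            out.extend([child] if isinstance(child, str) else leaves(child))
        return out

    def text_tree(node, indent=0):
        """node = (category, children); children are nodes or word strings."""
        category, children = node
        if DIRECTIVES.get(category, "Inline") == "Inline":
            return [" " * indent + " ".join(leaves(node))]
        lines = []
        for child in children:
            if isinstance(child, str):
                lines.append(" " * indent + child)
            else:
                lines.extend(text_tree(child, indent + 4))
        return lines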

Paula S. Newman. 2007. Grammars and Programming Languages: To Further Narrow the Gap. In Proceedings of the GEAF 2007 Workshop, Stanford. pdf

The combination of a symbolic natural language grammar and the software that interprets it can be viewed as a component of an application or system in the same way as the combination of a program in some language (e.g., Java) and its interpreter can be viewed as such a component. Thus, just as it is appropriate to look at a programming language and its processor from the point of view of amenability to testing and debugging, it is reasonable to do the same with respect to a grammar formalism and its interpreter.

This paper describes tools associated with the RH (Retro-Hybrid) parser that facilitate grammar testing and debugging and are, to varying extents, more broadly applicable. The paper also suggests a new approach to improving parser efficiency using constrained inputs, based in part on one of the RH debugging tools.

MailContent Project

Paula S. Newman. 2002. Exploring discussion lists: steps and directions. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2002), Portland, OR, 126-134. pdf

The MailContent project was inspired by direct personal experience. Circa 1999 the author was asked to participate in a number of standards groups having long histories and accompanying large email-based forums, too large to permit easily selecting the threads and subthreads that contained the most important background and/or immediately pertinent information. So, shortly thereafter, she began to consider alternative presentation mechanisms for email forums.

Many of the major results are described in this paper. They include a number of ways of presenting overall forum content, and new thread visualizations. Overall forum content is provided via several types of enhanced subject listings, giving an immediate appreciation of the most important threads and subject matter.

The new thread visualizations consist of narrow trees and treetables. Narrow trees address the problem that the standard representation of long message/response sequences, in which responses are serially indented, leads to thread representations extending far off the right side of a window. This problem is avoided by a different but clear set of conventions... see paper for method and examples. Treetables provide an immediate appreciation of the structure of a discussion, e.g.,

[Image: a motorcycle thread displayed as a treetable]

(Note: For the various types of displays, message abbreviations are obtained using finite-state-based analyses of message structure to isolate substantive content from other material such as quoted passages, greetings, etc. Some details are described in related patents by the author.)

Paula S. Newman. 2002. Email archive overviews using subject indexes. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2002), Minneapolis, Minnesota. pdf

While most facilities of the MailContent project relate to the presentation of individual threads, this paper discusses a more global aspect: the formation of table-of-contents-like information for full email corpora. This is achieved by grouping threads appearing to have similar subject matter. In the method used, subject lines are first scanned to select candidate subjects, specifically terms not occurring in a list of the 5000 or so most frequent words in the language. The candidate subjects are then ranked based on their relative importance in the archive, which is influenced by the number of thread subjects in which they occur, and by the lengths of those threads. A partial example index obtained from a Python-related archive is:


[Image: partial subject index from a Python-related archive]
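A minimal sketch (in Python) of the selection and ranking steps just described. The particular weighting used here is an invented stand-in for the paper's actual ranking:

    from collections import defaultdict

    def rank_candidate_subjects(threads, frequent_words):
        """threads: (subject line, thread length) pairs for an archive;
        frequent_words: the ~5000 most frequent words in the language.
        Candidate terms are subject-line words outside the frequent list,
        ranked by occurrence across thread subjects and by thread length
        (the weighting below is illustrative only)."""
        score = defaultdict(float)
        for subject, length in threads:
            for term in set(subject.lower().split()):
                if term not in frequent_words:
                    score[term] += 1 + length
        return sorted(score, key=score.get, reverse=True)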

Paula S. Newman, John C. Blitzer. 2003. Summarizing archived discussions: a beginning. In Proceedings of the 8th International Conference on Intelligent User Interfaces (IUI 2003), Miami. pdf

This paper describes the method used in the MailContent project to obtain thread summaries, approximately divided into subtopics. Messages are first grouped into connected subtrees by forming a word vector for each message, and then clustering the vectors. (A critical aspect of message vector formation is the adjustment of messages to avoid distortions in lexical distance due to differences in quoting style.)
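As a rough sketch (in Python) of the grouping step, assume a message joins its parent's cluster when their adjusted word vectors are sufficiently similar, so clusters remain connected subtrees. The similarity threshold and traversal order are illustrative choices, not the paper's exact clustering method:

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def cluster_thread(vectors, parent, threshold=0.3):
        """vectors: message id -> word Counter (already adjusted for
        quoting style); parent: message id -> parent id (None at root).
        Assumes ids are listed in tree order (parents before children)."""
        cluster = {}
        for mid in vectors:
            p = parent[mid]
            if p is not None and cosine(vectors[mid], vectors[p]) >= threshold:
                cluster[mid] = cluster[p]   # join the parent's cluster
            else:
                cluster[mid] = mid          # start a new cluster here
        return cluster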

Then two types of summaries (respectively developed by the two authors) are extracted for each cluster: a shorter overview and a longer treatment. Both types of summary use parameters such as semantic centrality and readability. For example:

Group 3: thing people sell good buy year time difference give bike (23 msgs)
   ... Looks like I’ve got a winter project on my hands and Im beginning to realize I didn’t underpay the lady.
Group 4: market pay item worth buyer seller position prepare price agree (17 msgs)
   ... Hm. I take the view that something is worth what someone is prepared to pay for it.

Polle Zellweger, Anne Mangen, Paula S. Newman. 2002. Reading and Writing Fluid Hypertext Narratives. In Proceedings of the Conference on Hypertext and Hypermedia (HT 2002), College Park, MD, 45-54. pdf

This paper describes an adaptation of treetables for building alternate versions of short narratives. On returning from a sabbatical and seeing the email-forum-oriented treetables for the first time, Dr. Zellweger realized that a similar approach might be preferable to an earlier one for story building and visualization. This paper describes that earlier method, and then the treetable-related adaptation.

In the adaptation, the basic treetable design is modified. In both kinds of treetables, each column represents a single path from the root to the end of a branch. In the original treetables, preservation and representation of structure is of major importance, so elements at the same depth are aligned horizontally, and their content abbreviated if necessary (and viewable in full via controls). However, in the "fluid hypertext" approach, each column represents one version of a story, and includes all text of that version. The adaptation is named “unaligned treetables” because, to allow complete stories to be viewed conveniently, contributions at the same depth but along different branches are not “aligned” horizontally. Suitable controls are provided to explore the displays, for example to focus on specific sequences.

XLE Project

Ronald M. Kaplan, Paula S. Newman. 1997. Lexical Resource Reconciliation in the Xerox Linguistic Environment. In Proceedings of the Workshop on Computational Environments for Grammar Development and Linguistic Engineering, Madrid. pdf

The Xerox Linguistic Environment (XLE) succeeded the initial Common Lisp implementation of the LFG formalism, and contained many changes. This paper focuses on facilities for specifying lexical preprocessing and the lexicon proper.

“Lexical preprocessing” includes both tokenization, which splits input texts into words and punctuation, and morphological analysis, which divides words into components. For example, “books” might have the two analyses “book +Npl” and “book +VSg +VPres”.

In the Lisp implementation, lexical preprocessing was specified by sets of explicit rules, along with the grammar rules and the lexicon proper, and alternatives could be combined via a specification called a “configuration”. In XLE, configurations instead provided for the use of multiple external components for lexical preprocessing. In the resulting scheme, morphological processing might be specified as, e.g.,

ANALYZE USEFIRST:
          morph-override1
          main-morph
ANALYZE USEALL:
          morph-add1
          morph-add2

USEFIRST serves to provide corrections to large-scale analyzers: for a particular word, only the results of the first listed analyzer to find any analysis for that word are used. USEALL combines the results of all the listed analyzers, extending the analyses to add words or entire topics.
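The semantics of the two directives can be sketched as follows (a Python paraphrase; the function names and the analyzer interface are invented for illustration):

    def analyze_usefirst(word, analyzers):
        """Results of the first analyzer that finds any analysis; lets a
        small override analyzer correct a large-scale one listed after it."""
        for analyzer in analyzers:          # e.g., [morph_override1, main_morph]
            analyses = analyzer(word)
            if analyses:
                return analyses
        return []

    def analyze_useall(word, analyzers):
        """Union of all the listed analyzers' results, extending coverage."""
        analyses = []
        for analyzer in analyzers:          # e.g., [morph_add1, morph_add2]
            analyses.extend(analyzer(word))
        return analyses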

Specification of the lexicon itself was augmented to facilitate the assembly and correction of large LFG lexicons. A full LFG lexicon entry for a word, e.g., “down”, might look like:

down  P    BASE  @PREP;
      ADV  BASE  @DIRADV

It has two subentries, specifying “down” as either a P BASE or an ADV BASE (@PREP and @DIRADV are macro-like labels for collections of properties). Originally, lexicon configurations could only add or replace full entries (e.g., the entire entry for “down”). The augmented specification provided fine-grained, flexible modifications at the subentry level. This was especially necessary for verbs, where different subentries, implying different meanings and dependent types, are needed in many languages.
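A sketch of the difference, using an invented dictionary representation of entries and subentries (the @PREP-REVISED label is hypothetical):

    # An entry maps a word to subentries keyed by category.
    lexicon = {"down": {"P": "@PREP", "ADV": "@DIRADV"}}

    # Replacing the full entry (the original mechanism) would also drop
    # the ADV subentry; a subentry-level patch touches only one reading.
    def patch_subentry(lexicon, word, category, properties):
        lexicon.setdefault(word, {})[category] = properties

    patch_subentry(lexicon, "down", "P", "@PREP-REVISED")  # hypothetical label
    assert lexicon["down"]["ADV"] == "@DIRADV"             # other reading intact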

Paula S. Newman. 1996. Representing 2P Clitic Placement in LFG. In Proceedings of the LFG '96 Conference, Rank Xerox, Grenoble. pdf

This paper was first written as a term paper for an audited Stanford graduate course in computational linguistics. It describes how LFG grammars might be extended to cope with clitic motion phenomena. In a little more detail: in some languages, certain very small unstressed words, like auxiliary verbs and prepositions, can occur in different places in a sentence, based on what some consider to be “prosodic rules” (roughly, rules based on auditory considerations) rather than grammar rules. There are many studies, relating to different languages, of these possibly prosodic rules.

This paper proposes several alternative means of incorporating provisions for prosodically-positioned clitics into LFG, a formal syntactic grammar, and dealing with those provisions in parsing and generating texts using the resulting grammar.

LASC MT Project

Paula S. Newman. 1990. Symmetric slot grammar (SSG): a bi-directional design for MT. In Proceedings of the Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Language, 11-13 June 1990, Linguistics Research Center, University of Texas, 145-157. pdf

Machine translation (MT) research in the 1980s focused on symbolic processing, and thus on issues of rules and representations. The LASC/MT project at IBM’s Los Angeles Scientific Center (c. 1987-1990) gradually arrived at an interesting but ambitious design. Unfortunately, the center closed before implementation could proceed.

Because the expected MT system operation was documented largely in technical reports and IBM-sponsored or internal conference papers, it is briefly outlined here. The evolved system design consisted of (a) a grammar formalism intended for bi-directional use (i.e., for both parsing and generation), (b) a unilingual lexicon for each language handled by an instance of the system, and (c) a single concept lexicon spanning those languages.

The grammar formalism, which is the focus of this paper, was bidirectional, i.e., intended for use in both parsing and generation. It is described as a collection of modifications to the slot grammar developed by Michael McCord of IBM Research. Slot grammar nicely separates dominance information (e.g., a verb dominates its object) from precedence information (e.g., English active verbs generally precede their objects). The modifications further separated the two kinds of information, and re-expressed some grammatical information as features, for example, that a reference is definite.

These modifications were intended to allow the precedence data for an input to be ignored after parsing. Then the dominance information, after lexically-based transformations, could be used as the dominance part of the target language representation. The paper describes the modifications in detail, including the use of “zones” for detecting and expressing discourse-oriented features.
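For illustration, an invented sketch of the kind of order-free representation this implies; the slot names and features here are made up, and the actual formalism differs:

    # After parsing "The child read books": dominance plus features only;
    # word order (precedence) has been discarded. Generation would
    # re-linearize this structure with the target language's own
    # precedence rules, after lexical transfer.
    parse = {
        "head": "read",
        "features": {"tense": "past"},
        "dependents": [
            {"slot": "subj", "head": "child",
             "features": {"definite": True, "number": "sg"}, "dependents": []},
            {"slot": "obj", "head": "book",
             "features": {"definite": False, "number": "pl"}, "dependents": []},
        ],
    }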

To continue with the system design outline: after parsing, the transfer aspect of the system was carried out via the single concept lexicon mentioned above. Each entry in a unilingual lexicon had an associated entry in the concept lexicon, constructed or connected when the unilingual lexicon entry was created. The concept lexicon entry, in turn, might relate to the target lexicon in one of several ways, detailed in the paper.

(Two notes. First, the concept lexicon was also a suitable place for ontological information, i.e., is-a information, useful in (unilingual) disambiguation. Also, the design underwent further refinement after a study of Chinese; see the description of the paper An Approach to Paraphrase in MT Systems Based on a Study of Chinese Verbs, further below.)

Paula S. Newman. 1990. Towards Convenient Bi-Directional Grammar Formalisms. In Proceedings of the 13th International Conference on Computational Linguistics (COLING 90), Helsinki, 294-298. pdf

This short paper focuses on the advantages of ID/LP grammars for parsing and generation. In ID/LP grammars, immediate dominance relationships (e.g., a verb dominates its object) are specified by different rules than linear precedence (i.e., ordering) relationships.

The paper is overly laborious, examining some implications of using a non-ID/LP grammar for generation. However, in the overall design of the MT system, the use of an ID/LP grammar allowed the results of ID rules, modified by mostly word-triggered translation rules, to serve as generation input. For an overview of the MT project system design, see the description of the "Symmetric Slot Grammar" paper, above.

Bonnie Chiu, Paula S. Newman, Shelley Smith. 1989. An Approach to Paraphrase in MT Systems Based on a Study of Chinese Verbs. In Proceedings of the 1989 International Conference on Computer Processing for Chinese and Oriental Languages. pdf

(Note: This paper was accepted to the above conference, but we later received notice that the conference, scheduled to take place in Changsha a few months after Tiananmen Square, had been cancelled. There are some web references to a proceedings, and I am trying to find out more about that. In the meantime, I have stored a copy of the paper locally.)

This paper motivates and describes some changes needed to the initial MT system design developed at IBM’s Los Angeles Scientific Center, based on a study of Chinese verbs.

The system design consisted of (a) a grammar formalism intended for bi-directional use (i.e., for parsing and generation), (b) a unilingual lexicon for each language to be handled by an instance of the system, and (c) a common lexicon spanning those languages.

During translation, parsing produced a dependency representation, in which each head word was connected to its dependent content words, and many grammatical words (e.g., determiners) and affixes (e.g., for plurals) were represented by features.

A word was identified by a reference to its unilingual lexicon entry, which, in turn, was linked to an entry in the common lexicon for the MT system instance. And, in the reverse direction, each entry in the common lexicon was linked to all the unilingual entries to which it directly translated, possibly with the addition of some unilingually required material.

After parsing, the more substantive aspects of transfer consisted of modifications to the parser output, made when certain conditions, detailed in the paper, held.

This design was developed primarily in relation to Romance languages, for which the above provisions seemed adequate. Then a study of its utility for translating to and from Chinese indicated the need for a better approach to structural changes. The paper describes and analyzes many examples of translating Chinese motion verbs, and then outlines a direction for a systematic treatment of the required structural changes using a feature-based typology.

Paula S. Newman. 1988. Combinatorial Disambiguation. In Proceedings of the 2nd Applied Natural Language Processing Conference (ANLP 1988), Austin, TX, 243-252. pdf

This paper describes a means of resolving parsing ambiguities. The approach is an ambitious one, aiming to simultaneously resolve both structural and sense ambiguities. Structural ambiguities are ones of head-dependent relationships; for example, in He gave a present to Mary, to Mary is a dependent of the phrasal head give, while in He saw a letter to Mary, to Mary is a dependent of letter. Sense ambiguities are ones of meaning; for example, letter in the foregoing example can be a missive or a member of an alphabet.

Unfortunately, there can be a huge number of combinations of possibilities. Even in the short sentence He shot some bucks with a rifle, shot can mean fired a weapon or wasted, bucks can be animals or dollars, and with can specify an instrument (of shooting) or an accompanier (of either the shooter or the things shot).

The paper shows how this potential combinatorial explosion can be addressed using a variant of the A* search optimization method. The possible head-dependent pairings are first organized into choice point sets. Each such set represents a dependent word, and contains pairs of its senses with senses of its possible heads. For example, a pair within the with set would be <with = instrument, shoot = waste>. The method requires that all these pairs be given a score; for example, an expected head-dependent pair like <to, give> receives a high score because give expects a modifier headed by to.

The choice point sets are conceptually organized into a tree whose levels represent the choice points, and choice points with the largest differences in scores are placed highest in the tree. Then a variant of A* (see paper for details) can be used to minimize the number of tree nodes actually examined.
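An illustrative reconstruction (in Python) of the search: choice points are ordered by score spread, and a best-first search uses an optimistic bound on the remaining score, so the first complete assignment popped is optimal. This sketches the general technique, not the paper's exact algorithm:

    import heapq

    def best_assignment(choice_points):
        """choice_points: one list of (label, score) alternatives per
        dependent word; returns the highest-scoring combination."""
        # Largest score spread first, mirroring the tree organization.
        cps = sorted(choice_points,
                     key=lambda alts: (max(s for _, s in alts) -
                                       min(s for _, s in alts)),
                     reverse=True)
        # best_rest[i]: most score obtainable from choice points i..end,
        # an optimistic (admissible) bound on any completion.
        best_rest = [0.0] * (len(cps) + 1)
        for i in range(len(cps) - 1, -1, -1):
            best_rest[i] = best_rest[i + 1] + max(s for _, s in cps[i])
        heap = [(-best_rest[0], 0, 0.0, [])]   # (bound, depth, score, choices)
        while heap:
            _, depth, score, choices = heapq.heappop(heap)
            if depth == len(cps):
                return choices, score          # optimal by admissibility
            for label, s in cps[depth]:
                g = score + s
                heapq.heappush(heap, (-(g + best_rest[depth + 1]),
                                      depth + 1, g, choices + [label]))
        return [], 0.0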

Later work by the author (see 2007 papers) suggests that structural ambiguities can be handled with less sense information, and that many sense ambiguities may be left to be resolved as the application requires.

IDE Project

Paula S. Newman. 1982. Towards an Integrated Development Environment. In IBM Systems Journal 21(1), 1982. pdf

The Integrated Development Environment project at IBM’s Los Angeles Scientific Center was concerned with two related types of integration: vertical integration of tools for requirements gathering, specification, and implementation, and horizontal integration of specification and programming languages with their data references.

This paper is primarily concerned with vertical integration. In the mid-1970s data processing was increasingly supporting many business activities, and thus there were many software-engineering-related research efforts devoted to understanding how to collect and combine data and processing requirements, how to express designs, and how to expedite implementation. This gave rise to a multiplicity of unrelated tools, a disconnection that inhibited their utility.

The paper motivates and suggests a direction for increased attention to integration of these tools, including:
   -   The use of a coherent system repository for system description data at all levels.
   -   The use of an expressive data model that shares characteristics of the E/R and relational models for that repository and for application data as well.
   -   The use of closely related design documentation languages and programming languages, thus allowing the use of similar concepts and providing inter-level comparability.
   -   The use of partial and full application generator capabilities.

(For more on the intended data model and language, see “The Data Model of IDE”, below.)

Paula S. Newman. 1985. The Data Model of IDE: A Value Network. In Proceedings of the Fourth International Conference on Entity-Relationship Approach (ER 1985). Chicago, 246-255 pdf

The Integrated Development Environment project at IBM’s Los Angeles Scientific Center was concerned with the complexity of enterprise system specification and implementation. One cause of the complexity was seen to be the use of different data models for (a) “conceptual” models of enterprise data, (b) their implementation in data bases, and (c) program-local data.

This paper describes a data model that shares characteristics of the E/R and relational models and was intended for all the above purposes. The paper also discusses aspects of the programming language, PL/IDE, so as to demonstrate the adequacy, convenience, and power of the model. The data model is ultimately based on work by the author circa 1976, with William Kent, on an ad-tech IBM project related to data dictionaries. The programming language is more fully described in an IBM Scientific Center Report G32-27094, June 1980, which may become available online.

In a nutshell, in PL/IDE both program-local and data base data are organized into scalars, tuples, and sets of scalars or tuples. Scalars and the elements of tuples must be elements of either "built-in" sets (e.g., integers, reals, strings...) or other, more dynamically constructed, ones.

For concreteness, some aspects of the language, as it relates to data, are illustrated here. For example, for data access, given a set "Course", containing tuples such as <"Math", 404>, with tuple elements from the sets "Subject" and "integer", and a set "Lecturer" of tuples such as <<"Math", 404>, "J. Smith">, with elements from the sets "Course" and "Person":

  -  The expressions Lecturer(<"Math", 404>) and Lecturer(<"Math", 404>, ?) would be equivalent and equal to "J. Smith".
  -  Lecturer(?, "J. Smith") would be the set of all courses for which "J. Smith" is the lecturer.

Query and assignment forms are inspired by the non-data-base-related language SETL, by Jacob Schwartz. As examples of assignment forms:

  -  Lecturer(<"History", 101>) = "W. Jones"
  -  Course += <"Biology", 200>

Iteration is over query-specified sets. Data base references are identical to local ones, except that the referenced data base names are prefixed to the set names, e.g., MyDB.Lecturer.
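A rough Python analog of the example queries and assignments; the dictionary-and-set representation is invented here, with PL/IDE's own syntax shown above:

    # Course is a set of tuples; Lecturer maps Course tuples to Persons.
    Course = {("Math", 404), ("History", 101)}
    Lecturer = {("Math", 404): "J. Smith"}

    Lecturer[("Math", 404)]                 # Lecturer(<"Math", 404>) -> "J. Smith"

    Lecturer[("History", 101)] = "W. Jones" # Lecturer(<"History", 101>) = "W. Jones"
    Course |= {("Biology", 200)}            # Course += <"Biology", 200>

    # Iteration over a query-specified set, here Lecturer(?, "J. Smith"):
    for course in {c for c, p in Lecturer.items() if p == "J. Smith"}:
        print(course)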

The paper also introduces additional aspects of the data model and related language, and compares the data model to other similar ones.

Paula S. Newman. 1983. IF-THEN-ELSE, again. In SIGPLAN Notices 18(12), December 1983, 106-111. pdf

A brief note suggesting and illustrating the use of a "SELECT" statement (described in the note) in place of nested IFs, for reasons of readability and comprehensibility.
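For flavor, a contrast sketched in Python; the note's SELECT is its own construct, so Python's match statement (3.10+) stands in here, and the example itself is invented:

    # Nested IFs bury the four cases in control structure:
    def shipping_nested(weight, express):
        if express:
            if weight > 10:
                return 30
            else:
                return 20
        else:
            if weight > 10:
                return 15
            else:
                return 5

    # A flat SELECT-like form lists the cases and their conditions directly:
    def shipping_select(weight, express):
        match (express, weight > 10):
            case (True, True):   return 30
            case (True, False):  return 20
            case (False, True):  return 15
            case (False, False): return 5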