Project description

Scientific objectives

When we use grammatical functions like subject and direct object in the description of a language, how do we know, for instance, that a subject in Georgian is the same as a subject in Norwegian? Also, when linguists use predicate-argument structures to describe the meaning of sentences, is there a way of deriving those objectively by comparing structures across languages? The present project aims to come closer to an answer to those questions through a new multilingual approach using computational grammars and tools. Computer implementation is required if a grammar is to be both formally precise and broad enough to cover a large portion of a language, and both properties are necessary to make the grammar properly testable. Based on principles employed in an existing grammar for Norwegian, such grammars are being constructed and will be further developed for the typologically diverse languages Georgian, Tigrinya and Dutch. When grammar writing is anchored in a common framework with theory-imposed constraints, differences in grammatical analyses across languages are likely to reflect real differences between languages, rather than accidentally different descriptive strategies among the grammarians.

Furthermore, such parallel grammars will allow the construction of parallel treebanks, which are databases of linguistic analyses of translated texts in which not only words but also phrases are linked to their translations. From such parallel treebanks the project hopes to derive crosslinguistically valid insights into the nature of grammatical functions and the way grammar expresses meaning. The resulting grammars and treebanking methods will be relevant for theoretical, computational and applied linguistics. The insights to be gained and the resources to be produced will be of interest to information technology applications such as information retrieval, machine translation and natural-language-based search.

Project plan

The present pilot phase prepares the ground for a larger project, for which we will apply for funding through the Research Council’s FRIHUM program. The long-term goal is to explore a new method for studying linguistic diversity and commonality through the application of parallel grammars of a complexity only made possible by computational implementation, in combination with the emerging technique of parallel treebanks. This represents a novel approach allowing a broader empirical basis for the analysis of typologically different languages within a common formal framework. The research themes to be addressed can be summarized under the following three points.

  1. We will develop four parallel grammars with the aim of contributing to the development of language resources for the important but less-studied languages Tigrinya and Georgian, and of studying the possibility of porting information from an existing grammar (Norwegian) to facilitate the development of new grammars, both typologically distant (Tigrinya, Georgian) and typologically close (Dutch). In the pilot project we will focus on achieving coverage for parallel test suites of central grammatical constructions for Tigrinya, Georgian and Dutch.
  2. We will investigate how the existence of parallel grammars can facilitate the development of parallel treebanks as an important resource for both basic research and practical applications, and develop a number of experimental parallel treebanks in the process. In the pilot project the basic architecture for parallel treebanking will be put in place.
  3. The ultimate goal is to use the experimental treebanks to explore a number of hypotheses concerning grammatical parallelism and multilingual semantics.

Scientific methodology

Our research group has participated in the Parallel Grammar Project (ParGram) since 1999, primarily with the Norwegian grammar NorGram, which has now achieved substantial coverage, but recently also with grammars of Tigrinya and Georgian. The focus of ParGram is to achieve ‘parallel grammars’ in the sense that the grammars employ the theoretical framework of Lexical-Functional Grammar (LFG) in the same way and hence become maximally comparable. Parallel grammars describe common properties of languages in the same way and differ only to the extent that the languages are irreducibly different.

The grammars are being developed on a computer platform called the Xerox Linguistic Environment (XLE), which implements the LFG framework and allows efficient automatic analysis and generation of text based on the developed grammars. The platform is a state-of-the-art tool that is being continually developed and is used in both fundamental research and advanced commercial development (cf. Powerset Inc.). In the LOGON and TREPIL projects we have extended the platform with a Web interface (see picture 1) and the LFG Parsebanker (see pictures 2 and 3). These tools are language- and grammar-independent and can therefore be used for the diverse languages in the proposed project. These computational tools enable the linguist to combine adherence to formalized theoretical assumptions about language with wide-coverage, data-driven language description. The result is a far broader basis than before for testing precise linguistic hypotheses about grammar and semantics.

Grammatical description in LFG is on two levels: c(onstituent)-structure (representing phrasal organization in the form of a tree) and f(unctional)-structure (representing grammatical functions such as SUBJ and OBJ, as well as basic predicate-argument structures). The ParGram project concentrates on parallelism only in the f-structures across languages, a leading question being in what circumstances we have to conclude that there are irreducible differences between the f-structures for semantically corresponding sentences of two given languages.
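The idea of comparing f-structures across languages can be illustrated with a minimal sketch. It assumes that an f-structure is represented as a nested Python dictionary; the attribute inventory, the helper names gf_skeleton and differing_paths, and the toy analyses of Norwegian "Hunden bjeffer" and Dutch "De hond blaft" ("The dog barks") are all hypothetical simplifications for illustration, not output of XLE or of any ParGram grammar.

```python
# Toy f-structures as nested dicts. The attribute names and values below
# are illustrative assumptions, not the output of an actual grammar.
GRAMMATICAL_FUNCTIONS = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}


def gf_skeleton(fstr):
    """Keep only the grammatical-function attributes of an f-structure."""
    return {
        attr: gf_skeleton(value)
        for attr, value in fstr.items()
        if attr in GRAMMATICAL_FUNCTIONS and isinstance(value, dict)
    }


def differing_paths(a, b, prefix=()):
    """List the function paths found in only one of two skeletons."""
    paths = []
    for attr in sorted(set(a) | set(b)):
        path = prefix + (attr,)
        if attr not in a or attr not in b:
            paths.append(path)
        else:
            paths.extend(differing_paths(a[attr], b[attr], path))
    return paths


# Norwegian marks definiteness on the noun, Dutch with a determiner; in
# this toy encoding that difference lives outside the function skeleton.
norwegian = {
    "PRED": "bjeffe<SUBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "hund", "DEF": "+", "NUM": "sg"},
}
dutch = {
    "PRED": "blaffen<SUBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "hond", "NUM": "sg", "SPEC": {"DET": "de"}},
}

if __name__ == "__main__":
    print(differing_paths(gf_skeleton(norwegian), gf_skeleton(dutch)))
```

In this sketch the two analyses differ in their feature details but share the same skeleton of grammatical functions, so no differing paths are reported. A pair of analyses in which one language uses OBJ where the other uses OBL would be reported as differing at exactly those two paths.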

We intend to use LFG theory to construct grammars that are at once formalized and comprehensive. Our use of the LFG framework is motivated by the fact that it is a substantial theory about the class of possible human languages, and not just a tool for grammatical description. While the basic projection formalism for co-description allows a wide range of c-structures to be associated with the same f-structure, and vice versa, including pairings that are implausible linguistically, recent LFG research has in addition proposed strong universal constraints on the possible relationship between the two kinds of structures, implying empirical claims about the limits of possible variation among languages. A central contribution to this research is Bresnan’s development of a theory of how the relationship between c-structures and f-structures is constrained. By basing our grammars on Bresnan’s proposals we intend to extend the notion of parallel grammars from just considering f-structure, as in ParGram, to encompass c-structure as well. A consequence of adopting such common principles is that we will approach a situation where categorial and configurational differences between analyses of translationally corresponding sentences will always reflect genuine differences among the languages, and never just arbitrarily different principles of analysis among the grammarians. This will be relevant for the second theme in our research, parallel treebanking.

We intend to develop several parallel treebanks in the course of the full project. To guide grammar development in the pilot phase, we will start with parallel test suites that both correspond translationally and cover most basic constructions of the languages involved. These will help us make sure similar constructions are implemented in a parallel fashion, and will serve as a test bed for the implementation of linking extensions to the LFG Parsebanker tool.

The basic task of any grammatical theory is to account for the way in which form and meaning are linked in natural languages, e.g. showing how specific phrases in sentences pick out the participants and their roles in the described situations (agent, patient, beneficiary etc.). Within LFG the link between syntactic phrases and participant (or semantic) roles is mediated by an inventory of syntactic functions like SUBJ, OBJ, XCOMP, etc. According to Lexical Mapping Theory (LMT), a given semantic role is partly characterized by the set of syntactic functions to which it can map, in addition to other criteria, such as case marking. A parallel treebank allows us to recover interlingual predicates based on translational correspondences among verbs, as well as translational correspondences among the syntactic arguments of such corresponding verbs. The translational correspondences among the syntactic arguments give further information about the set of syntactic functions which is crosslinguistically available for a given argument. Our hypothesis is that LMT will allow the derivation of (possibly underspecified) information about the semantic role of an argument from such sets of alternative syntactic functions. In a sense, semantic roles would then be labeled by their sets of alternative syntactic expressions in a way analogous to the way in which alternative translations express the semantic properties of words in the Semantic Mirrors approach. If the hypothesis is confirmed, this would be a highly interesting result both theoretically (as it provides an intersubjectively testable basis for semantic categorization) and practically (as it would facilitate the automatic derivation of semantic information from texts).
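The first step of this idea, collecting the set of syntactic functions that is crosslinguistically attested for each argument of an interlingual predicate, can be sketched as follows. This is only an illustration under stated assumptions: the link format, the function name function_sets, the predicate label GIVE, and the function labels assigned to Norwegian "Hun ga ham boka" and Dutch "Zij gaf het boek aan hem" ("She gave him the book") are hypothetical and simplified, and are not taken from the project's grammars or treebanks. The sketch does not implement Lexical Mapping Theory itself; it only gathers the sets from which LMT-based inference would start.

```python
from collections import defaultdict

# Hypothetical argument links from a parallel treebank. Each entry is
# (interlingual predicate, argument slot, language code, syntactic function).
# The function labels are simplified assumptions made for illustration.
ARGUMENT_LINKS = [
    ("GIVE", "arg1", "nob", "SUBJ"),    # hun
    ("GIVE", "arg2", "nob", "OBJ-TH"),  # boka
    ("GIVE", "arg3", "nob", "OBJ"),     # ham
    ("GIVE", "arg1", "nld", "SUBJ"),    # zij
    ("GIVE", "arg2", "nld", "OBJ"),     # het boek
    ("GIVE", "arg3", "nld", "OBL"),     # aan hem
]


def function_sets(links):
    """Map each (predicate, slot) pair to its set of attested functions."""
    sets = defaultdict(set)
    for predicate, slot, _language, function in links:
        sets[(predicate, slot)].add(function)
    return {key: frozenset(value) for key, value in sets.items()}


if __name__ == "__main__":
    for key, functions in sorted(function_sets(ARGUMENT_LINKS).items()):
        print(key, sorted(functions))
```

In this toy data the three argument slots end up with three different sets of functions, which is the kind of distinguishing label the hypothesis relies on. Real treebank data would of course contain many more languages, verbs and alternations per slot.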