
2 Grammar specification language

In this chapter, we describe the grammar specification language using an illustrative example: an enhanced version of the example grammar presented in (Duchier and Debusmann 2001), which handles several phenomena associated with the German verb cluster. The grammar file is included in this package under the name "grammar-acl.dg". We encourage users to use the extension "dg" for all grammar files.

2.1 Uses

Here you define what components of the parser the grammar uses:

defuses {id lp}
I.e. we use the ID and the LP-component. The ID-component must be used, whereas the LP-component is optional. To write a grammar that does not include word order constraints (i.e. only uses the ID-component), you would write:
defuses {id}
Note: The default is that both the ID and the LP-components are used.

New: from version 1.2, you can also use the TH-component (th).

2.2 Types

Here we define the types which will be used in the grammar:

deftypes
{
  EDGELABELID : {det subject object vinf vpast zuvinf zu}
  EDGELABELLP : {df mf vc xf zuf}
  NODELABELLP : {d n v z}

  PERSON      : {1 2 3}
  GENDER      : {masc fem neut}
  NUMBER      : {sg pl}
  DEF         : {def indef undef}
  CASE        : {nom gen dat acc}
  AGR         : PERSON * GENDER * NUMBER * DEF * CASE
}

Types are named by variables, i.e. by identifiers beginning with an uppercase letter, e.g. EDGELABELID. Variables are defined as a set of constants which must begin with a lowercase letter. Here, EDGELABELID is defined as a set of grammatical roles. AGR is defined as the cartesian product of PERSON, GENDER, NUMBER, DEF and CASE. Note that each element of a domain is a typed constant, e.g. det has type EDGELABELID. As a consequence, you must use distinct constants for different domains. Also note that the type EDGELABELID must be defined for each grammar. EDGELABELLP and NODELABELLP must be defined in each grammar that uses the LP-component, and EDGELABELTH in each grammar that uses the TH-component.
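To make the size of the product type concrete, the following Python sketch enumerates AGR from the five domains listed above. It only models the type definitions in Python and is not part of the grammar language or the parser; representing constants as Python strings is an assumption made for illustration.

```python
from itertools import product

# Domains copied from the deftypes section above (constants as strings).
PERSON = ["1", "2", "3"]
GENDER = ["masc", "fem", "neut"]
NUMBER = ["sg", "pl"]
DEF = ["def", "indef", "undef"]
CASE = ["nom", "gen", "dat", "acc"]

# AGR : PERSON * GENDER * NUMBER * DEF * CASE
AGR = set(product(PERSON, GENDER, NUMBER, DEF, CASE))

print(len(AGR))  # 3 * 3 * 2 * 3 * 4 = 216
print(("3", "fem", "sg", "def", "nom") in AGR)  # True
```

Each element of AGR is one agreement tuple, so a feature of type AGR set ranges over subsets of these 216 tuples.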

2.3 Lexical entry

Next, we define the features of a lexical entry.

defentry
{
  edgeID     : EDGELABELID set
  valencyID  : EDGELABELID valency
  edgeLP     : EDGELABELLP set
  nodeLP     : NODELABELLP set
  valencyLP  : EDGELABELLP valency
  blocks     : EDGELABELID aset

  agrs       : AGR set
}

Each feature is typed, and set, aset and valency are builtin type constructors. For example, EDGELABELID set denotes a set of grammatical roles. The difference between set and aset affects maximal values and inheritance, which we explain in 2.8.

In each grammar, the features edgeID and valencyID are obligatory. Each grammar that uses the LP-component must include edgeLP, nodeLP, valencyLP and blocks. Each grammar using the TH-component must include edgeTH, valencyTH, link, raisedsubj and blocksTH.

The features edgeID, edgeLP, nodeLP, blocks, edgeTH, raisedsubj and blocksTH must be either of type set or aset. valencyID, valencyLP and valencyTH must be of type valency and link of type EDGELABELTH -> EDGELABELID aset.

2.4 Total order

The union of the sets EDGELABELLP and NODELABELLP must be totally ordered. We specify the order as follows:

{
  z zuf d df n mf vc v xf
}
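As a sanity check on the order above, the following Python sketch verifies that it lists every label from EDGELABELLP and NODELABELLP exactly once. The helper names is_total_order and precedes are hypothetical and exist only in this illustration; they are not functions of the parser.

```python
# Label sets copied from the deftypes section, order copied from above.
EDGELABELLP = {"df", "mf", "vc", "xf", "zuf"}
NODELABELLP = {"d", "n", "v", "z"}
ORDER = "z zuf d df n mf vc v xf".split()


def is_total_order(order, *label_sets):
    """True if `order` lists every label of the given sets exactly once."""
    labels = set().union(*label_sets)
    return len(order) == len(set(order)) and set(order) == labels


def precedes(a, b, order=ORDER):
    """True if label `a` comes before label `b` in the order."""
    return order.index(a) < order.index(b)


print(is_total_order(ORDER, EDGELABELLP, NODELABELLP))  # True
print(precedes("mf", "vc"))  # True
```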

2.5 Signs

For each word, there is a corresponding sign with the following internal structure:

sign(
   lex : o(index: Index
           word : Word
           entry: Entry)
   id  : NodeID
   lp  : NodeLP
   th  : NodeTH
   attribute : AttributeRecord)

where Index is the index of the selected entry, Word the string corresponding to the node and Entry the selected lexical entry itself. NodeID holds the information for the occurrence of the node in the ID tree. If the LP-component is used, NodeLP bears the corresponding information for its occurrence in the LP tree and (if the TH-component is used) NodeTH in the TH graph.

AttributeRecord is a record that holds additional attributes which are introduced as follows:

defattributes
{
  agr        : AGR
}

The constraints that define these attributes are then specified as follows:

defnode
{
  _[agr] in _.lex.entry.agrs
}

In other words, each one of these attributes is introduced in order to pick one of the values licensed by the lexical entry.

2.6 Edge constraints

We stipulate edge constraints next. First, in the ID tree:

defedges id {
  det {
    _[agr] = ^[agr]
  }

  subject {
    _[agr] = ^[agr]
    _[agr] in $ nom
  }

  object {
    _[agr] in $ acc
  }
}

For example, this states that for an edge labeled det to be licensed, the daughter must agree with its mother (i.e. _[agr] = ^[agr]). _ denotes the 'current' node, and ^ its head. The notation _[agr] is equivalent to _.attribute.agr and is merely supported for convenience.
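The meaning of the three ID edge constraints can be sketched in Python as checks on already chosen agreement values. This is a simplified reading: the parser presumably enforces these as constraints during search rather than testing fixed values, and the Agr tuple and the function names are hypothetical.

```python
from collections import namedtuple

# One agreement tuple: PERSON * GENDER * NUMBER * DEF * CASE.
Agr = namedtuple("Agr", "person gender number definiteness case")


def det_ok(daughter_agr, mother_agr):
    # det { _[agr] = ^[agr] }
    return daughter_agr == mother_agr


def subject_ok(daughter_agr, mother_agr):
    # subject { _[agr] = ^[agr]   _[agr] in $ nom }
    return daughter_agr == mother_agr and daughter_agr.case == "nom"


def object_ok(daughter_agr, mother_agr):
    # object { _[agr] in $ acc }
    return daughter_agr.case == "acc"


nominative = Agr("3", "masc", "sg", "def", "nom")
accusative = Agr("3", "masc", "sg", "def", "acc")
print(subject_ok(nominative, nominative))  # True
print(object_ok(accusative, nominative))   # True
```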

We can define similar constraints for edges in the LP tree. In the example grammar, we do not define any constraints for edges in the LP tree:

defedges lp {
  __ { }
}

Here, __ matches any edge label.

2.7 Distribution

For the parser, we must specify a distribution strategy. Currently, we can specify the sequence of features on which to perform labeling:

defdistribute
{
  _.id.mothers
  _.id.daughterSet
  _.lp.mothers
  _.lp.daughterSet
  _.lp.nodeLP
  _.lp.pos
}

This says to first perform labeling on the ID mothers, then on the ID daughter sets, the LP mothers, the LP daughter sets, the node labels and the position.

2.8 Lexicon

2.8.1 Lexical types

Finally, we need to specify a lexicon. The lexicon can be specified on the basis of lexical types which can be combined using lexical inheritance to obtain lexical entries. In "grammar-acl.dg", finite verbs inherit from the following lexical type:

defword t_fin {
  edgeID    : {}
  valencyID : {subject}
  edgeLP    : {}
  nodeLP    : {v}
  valencyLP : {mf* xf?}
  blocks    : {det subject object vinf vpast zuvinf zu}
}

This lexical type indicates that the set of accepted roles (edgeID) of a finite verb is empty and that finite verbs always subcategorize for a subject through their role valency (valencyID). The set of accepted fields is empty and the set of accepted node labels includes only v. By its field valency (lexical attribute valencyLP), a finite verb offers a Mittelfeld (mf) and an extraposition field (xf). It blocks the set of all roles.

Valency (i.e. the attributes valencyID and valencyLP) is specified using wildcard notation: e.g. subject indicates that exactly one syntactic dependent with edge label subject is required. xf? indicates that at most one topological dependent with edge label xf is permitted and mf* that any number of dependents with edge label mf are permitted. One or more dependents are indicated by a +.
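The wildcard notation can be read as a lower and an upper bound on the number of dependents per edge label. The Python sketch below checks a list of outgoing edge labels against a valency under two assumptions that are not stated in this chapter: a label absent from the valency licenses no dependents, and the marker "!" is a hypothetical internal tag for a bare label such as subject.

```python
# Cardinality bounds (min, max); None means unbounded.
BOUNDS = {"!": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_valency_item(item):
    """Split 'mf*' into ('mf', '*'); a bare label such as 'subject' gets '!'."""
    if item[-1] in "*?+":
        return item[:-1], item[-1]
    return item, "!"


def satisfies(valency, daughter_labels):
    """Check the outgoing edge labels of one node against its valency."""
    spec = dict(parse_valency_item(item) for item in valency)
    # Assumption: labels that the valency does not mention are not licensed.
    if any(label not in spec for label in daughter_labels):
        return False
    for label, card in spec.items():
        lo, hi = BOUNDS[card]
        n = daughter_labels.count(label)
        if n < lo or (hi is not None and n > hi):
            return False
    return True


print(satisfies(["mf*", "xf?"], ["mf", "mf"]))  # True
print(satisfies(["mf*", "xf?"], ["xf", "xf"]))  # False
print(satisfies(["subject"], []))               # False
```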

As in the example, we can omit lexical attributes. An omitted attribute is assigned its maximal value which depends on the attribute's type. If the attribute is of type set, the maximal value is its range. Hence in the specification of the lexical type t_fin above, the omitted attribute agrs is assigned the set AGR of all agreement tuples. If the omitted lexical attribute is of type aset or of type valency, its maximal value is the empty set.

Transitive verbs inherit from the following lexical type:

defword t_tr {
  valencyID : {object}
}

t_tr only specifies the valencyID-attribute, stating that an object is required. All the other lexical attributes are assigned their maximal values.

2.8.2 Lexical entries

Here is how we obtain the lexical entry for the word liebt, using lexical inheritance:

defword liebt t_fin t_tr {
  agrs      : $ 3 & sg & nom
}

The lexical entry for liebt defines only the value of the lexical attribute agrs and the other lexical attributes are assigned their maximal values. In the specification of the agrs-attribute, the prefix operator $ introduces a set generator which is a boolean expression that generates values for the corresponding type. For example, $ 3 & sg & nom denotes the set of agreement tuples that are 3rd person, singular and nominative. In addition, the lexical entry inherits from the lexical types t_fin and t_tr, stating that it is both a finite verb and a transitive verb.
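The effect of the set generator can be sketched in Python as a filter over all agreement tuples. Only conjunction with & is modelled here, since that is the only operator shown in this chapter; the function name generate is hypothetical. The filter relies on the rule from 2.2 that constants are distinct across domains, so each constant identifies its own position in the tuple.

```python
from itertools import product

# Domains copied from the deftypes section (constants as strings).
DOMAINS = {
    "PERSON": ["1", "2", "3"],
    "GENDER": ["masc", "fem", "neut"],
    "NUMBER": ["sg", "pl"],
    "DEF": ["def", "indef", "undef"],
    "CASE": ["nom", "gen", "dat", "acc"],
}
AGR = set(product(*DOMAINS.values()))


def generate(*constants):
    """Model `$ c1 & c2 & ...`: all AGR tuples that contain every constant."""
    return {t for t in AGR if all(c in t for c in constants)}


agrs_liebt = generate("3", "sg", "nom")
print(len(agrs_liebt))  # 9: gender (3) and definiteness (3) remain free
```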

There can of course be several entries for one word form. Lexical entries have to be escaped using quotation characters if an identically named constant is defined in the deftypes-section. In "grammar-acl.dg", we do not have to escape liebt because no such constant is defined there. However, if there were a lexical entry for the word subject, we would have to escape it and write 'subject' instead. Further, this implementation does not distinguish between lexical types and lexical entries: both are defined in exactly the same way. It is however convenient to distinguish them notationally, so we adopt the convention of prefixing lexical types with t_.

2.8.3 Lexical inheritance

Lexical inheritance proceeds differently for each lexical attribute depending on its type. It amounts to set intersection if the lexical attribute is of type set and to set union if it is of type aset or valency. For instance, this is how the lexical entry for liebt is obtained:

attribute : type                 t_fin        t_tr               liebt              liebt t_fin t_tr
edgeID    : EDGELABELID set      {}           EDGELABELID (max)  EDGELABELID (max)  {}
valencyID : EDGELABELID valency  {subject}    {object}           {} (max)           {subject object}
edgeLP    : EDGELABELLP set      {}           EDGELABELLP (max)  EDGELABELLP (max)  {}
nodeLP    : NODELABELLP set      {v}          {d n v z} (max)    {d n v z} (max)    {v}
valencyLP : EDGELABELLP valency  {mf* xf?}    {} (max)           {} (max)           {mf* xf?}
blocks    : EDGELABELID aset     EDGELABELID  {} (max)           {} (max)           EDGELABELID
agrs      : AGR set              AGR (max)    AGR (max)          $ 3 & sg & nom     $ 3 & sg & nom

In the table above, we display the lexical attributes in the leftmost column. The second column from the left indicates the values of these lexical attributes specified by the lexical type t_fin, the third those specified by t_tr and the fourth those specified by lexical entry liebt. We display the resulting values in the rightmost column. Values annotated with (max) are omitted in the respective lexical specification and are therefore assigned their maximal values. As can be seen from the example, inheritance amounts to set intersection for the lexical attributes with type set, i.e. edgeID, edgeLP, nodeLP and agrs. Lexical inheritance amounts to set union for the lexical attributes with type aset and valency, i.e. valencyID, valencyLP and blocks.
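The inheritance scheme just described can be sketched in Python as follows. The sketch covers all features except agrs, treats valencies as plain unions (sufficient here because no edge label occurs twice with different cardinalities), and uses hypothetical helper names complete and inherit; it models the description in this section, not the parser's actual code.

```python
EDGELABELID = frozenset("det subject object vinf vpast zuvinf zu".split())
EDGELABELLP = frozenset("df mf vc xf zuf".split())
NODELABELLP = frozenset("d n v z".split())

# feature -> (kind, range); "set" intersects, "aset" and "valency" union.
FEATURES = {
    "edgeID": ("set", EDGELABELID),
    "valencyID": ("valency", None),
    "edgeLP": ("set", EDGELABELLP),
    "nodeLP": ("set", NODELABELLP),
    "valencyLP": ("valency", None),
    "blocks": ("aset", EDGELABELID),
}


def complete(spec):
    """Fill omitted features with their maximal value."""
    full = {}
    for feature, (kind, rng) in FEATURES.items():
        if feature in spec:
            full[feature] = frozenset(spec[feature])
        else:
            full[feature] = rng if kind == "set" else frozenset()
    return full


def inherit(*specs):
    """Combine specifications: intersection for set, union otherwise."""
    result = complete({})
    for spec in map(complete, specs):
        for feature, (kind, _) in FEATURES.items():
            if kind == "set":
                result[feature] = result[feature] & spec[feature]
            else:
                result[feature] = result[feature] | spec[feature]
    return result


t_fin = {"edgeID": [], "valencyID": ["subject"], "edgeLP": [],
         "nodeLP": ["v"], "valencyLP": ["mf*", "xf?"], "blocks": EDGELABELID}
t_tr = {"valencyID": ["object"]}
liebt = inherit(t_fin, t_tr, {})  # liebt itself only specifies agrs

print(sorted(liebt["valencyID"]))  # ['object', 'subject']
print(sorted(liebt["nodeLP"]))     # ['v']
```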

Notice that inheritance proceeds slightly differently for valencies than for normal accumulative set lattices. If two elements with the same edge label but with different cardinality are to be combined, only the most specific of the two is contained in the resulting set. The order of specificity is as follows for an edge label r:

(r not in the valency set) < r* < r? < r+ < r

If, for instance, we combine the valency set {subj? adv*} with the valency set {subj adv?}, the result is not {subj? subj adv* adv?} but {subj adv?}, because subj is more specific than subj? and adv? is more specific than adv*.
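The combination rule for valencies can be sketched in Python by ranking the cardinality markers according to the specificity order given above and keeping the highest-ranked variant of each label. The names RANK, split and combine are hypothetical and serve only this illustration.

```python
# Specificity rank following the order above:
# (absent) < r* < r? < r+ < r
RANK = {"*": 1, "?": 2, "+": 3, "": 4}


def split(item):
    """Split 'adv*' into ('adv', '*'); a bare label gets the marker ''."""
    return (item[:-1], item[-1]) if item[-1] in "*?+" else (item, "")


def combine(*valencies):
    """Union of valency sets, keeping the most specific item per label."""
    best = {}
    for valency in valencies:
        for item in valency:
            label, card = split(item)
            if label not in best or RANK[card] > RANK[best[label]]:
                best[label] = card
    return {label + card for label, card in best.items()}


print(sorted(combine({"subj?", "adv*"}, {"subj", "adv?"})))  # ['adv?', 'subj']
```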
