vinci preliminary concepts

Strings, Keywords and Identifiers

The language for which vinci is generating utterances is called the object language. The language, or set of notations, used to describe the object language is called the metalanguage.

As is common in computing, sequences of letters and digits are called strings. When strings in the object language appear in the metalanguage, they are enclosed in double quotes: "cat".

The following strings, if they occur outside double quotes, are reserved as keywords:

Note that vinci does distinguish between uppercase and lowercase letters. In a few cases, such as RULE and rule, both forms are accepted. The alternative forms are retained for the sake of long-existing files, but we recommend using the forms shown here.

Identifiers are strings of letters, digits and the underscore occurring outside double quotes and, with one exception, not starting with a digit. They are used for naming items in the metalanguage: attribute types and values, attribute variables, word categories, the tags of lexical pointers, morphology rules and tables, tree nodes and terminal nodes, and so on. (The exception relates to the names of morphology rules which were once numbered rather than named.)

In some cases there is no clash if the same name is used for two different kinds of item. This is not recommended. Nor is the use of underscore as the first symbol, since we may want to add extra keywords in the future.
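The shape of an identifier can be sketched as a small check. This is only an illustration in Python, not part of vinci: the function name is_identifier and its morphology_rule flag are hypothetical, "letter" is assumed to mean an ASCII letter or an ISO 8859-1 accented letter, and neither the reserved keywords nor any character exceptions are modelled.

```python
import re

# Assumption: a "letter" is an ASCII letter or an ISO 8859-1 accented letter.
LETTERS = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

# Ordinary identifiers: letters, digits and underscore, not starting with a digit.
NAME = re.compile(f"[{LETTERS}_][{LETTERS}0-9_]*")
# The one exception: morphology rule names may start with a digit.
MORPHOLOGY_RULE_NAME = re.compile(f"[{LETTERS}0-9_]+")


def is_identifier(text: str, morphology_rule: bool = False) -> bool:
    """Return True if `text` has the shape of an identifier (hypothetical helper)."""
    pattern = MORPHOLOGY_RULE_NAME if morphology_rule else NAME
    return pattern.fullmatch(text) is not None
```

Under these assumptions, is_identifier("dirobj") is true, is_identifier("2nd") is false, and is_identifier("2nd", morphology_rule=True) is true. A name such as _tmp passes the check even though a leading underscore is discouraged.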

Identifiers have no structure and no inherent meaning. Thus, an identifier dirobj has no relation to identifiers dir and obj. If a certain attribute value is noted to cause "s" to be added to some English nouns and "were", rather than "was", to appear in some past tenses, it probably denotes the English plural, regardless of whether it is called plur, zyzzt or (misleadingly) fem. Indeed, if one substitutes a new identifier for an existing one systematically throughout a description, it does not affect the object language.

As noted in the Overview page, the following identifiers play a special part as rule names in the syntax:

These are reserved identifiers, not keywords. Though this may seem to be splitting hairs, it affects the point at which possible errors are detected, and therefore, the error messages which may be displayed. Keywords are "tokenized", i.e. converted to distinct tokens, at a very early stage of language installation. Misspelt keywords commonly damage the whole structure of rules, and trigger error messages during installation. Misspelt reserved identifiers cause problems only during sentence generation, and except for ROOT may not lead to a detected error. They may just give bad results: an ANSWER sentence not generated, a preselection not carried out, etc.

Our Naming Conventions

The authors observe certain naming conventions, not required by vinci, but helpful when reading our language descriptions. They are followed in this Manual:

We also follow conventions in naming the files which make up a language description. These will be mentioned at the appropriate time.

Characters and Character Sets

ivi/vinci assumes that the character set in use for the program follows the ASCII standard, extended to ISO 8859-1 (Latin 1) if European accented characters are in use.

A displayable character is any which has an on-screen representation (as distinct from TAB and other control characters, which do not).

A letter is one of the alphabetical subset of these, including the European accented characters.

With a few exceptions, any letter (including accented ones) may be used in vinci identifiers, and any displayable character as letters of the object language (i.e. within double quotes).

The exceptions are as follows:

Patterns, Matches and Searches

When a vinci operation requires two objects to be matched, vinci commonly allows one of them to be a pattern rather than a fixed object. For example, if the objects are strings, it may allow a pattern like "me*t", where * is permitted to match any substring, including the empty one (the one having no characters). So this pattern matches "meat", "meant", "met", "merit" and a host of others.

The symbol * in this context is a wildcard or wildstar.

Other examples:

    "t*"    a string beginning "t"
    "*ing"  a string ending "ing"
    "*"     any string

There may be several wildstars in the same string (but they should not be adjacent): "b*tt*r", but not "b**r".

If vinci needs to determine which substrings match each *, we must be aware that, if there is more than one *, the result may be ambiguous. For example, "b*an*a" matches "banana" in two different ways.
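The ambiguity can be made concrete with a short sketch that enumerates every way a wildstar pattern matches a string. It is written in Python purely for illustration; the function wildcard_matches is hypothetical and says nothing about how vinci itself performs or resolves such matches.

```python
def wildcard_matches(pattern: str, text: str) -> list:
    """Return every way `pattern` matches `text`.

    Each result is a tuple holding the substring bound to each * in order.
    An empty list means the pattern does not match at all.
    """
    parts = pattern.split("*")  # literal pieces between the wildstars
    if not text.startswith(parts[0]):
        return []
    results = []

    def walk(index: int, pos: int, bound: tuple) -> None:
        # index: next literal piece to place; pos: current position in text
        if index == len(parts):
            if pos == len(text):  # the whole text must be consumed
                results.append(bound)
            return
        literal = parts[index]
        # The wildstar before this literal absorbs text[pos:start].
        for start in range(pos, len(text) - len(literal) + 1):
            if text.startswith(literal, start):
                walk(index + 1, start + len(literal), bound + (text[pos:start],))

    walk(1, len(parts[0]), ())
    return results


print(wildcard_matches("me*t", "meant"))     # [('an',)]
print(wildcard_matches("b*an*a", "banana"))  # [('', 'an'), ('an', '')]
```

The second call returns two bindings: either the first * is empty and the second matches "an", or the other way round. A pattern with a single wildstar can never be ambiguous in this sense.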

When a match involves attribute values, vinci often uses an attribute type as a wildcard representing any value of the type.

When searching a list to locate (or fail to locate) an object on it, vinci may again allow the object to be a pattern. The pattern may match several objects on the list.
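A list search with a pattern might be sketched as follows, again in Python and only as an illustration. The helpers compile_pattern and search are hypothetical names, and the word list is invented for the example; vinci's own lists hold richer objects than plain strings.

```python
import re


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a wildstar pattern into a regular expression.

    Literal pieces are escaped so that only * acts as a wildcard.
    """
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces), re.DOTALL)


def search(pattern: str, items: list) -> list:
    """Return every item on the list that the pattern matches, in list order."""
    compiled = compile_pattern(pattern)
    return [item for item in items if compiled.fullmatch(item)]


words = ["sing", "sang", "ring", "tin", "king"]
print(search("*ing", words))  # ['sing', 'ring', 'king']
print(search("t*", words))    # ['tin']
print(search("z*", words))    # []
```

The first search shows one pattern matching several objects on the list; the last shows a search that fails to locate anything.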

As we have noted elsewhere, a terminal node is really a pattern, a lexical search pattern, which may identify several lexicon entries matching it.

Guarded Rules

Guarded rules appear in three different vinci contexts: context-free rules, syntax transformations and morphology rules. We discuss them here to avoid repeating the same information in three places.

A guarded rule consists of a sequence of subrules, each having a guard, i.e. a condition, and an action. Thus, it has a form such as:

    guard1 : action1;
    guard2 : action2;
    guard3 : action3;

When a guarded rule is to be obeyed, the guards are tested in turn until one is found to be true. The action of the rule is simply the action corresponding to the first true guard. If no guard is true, the rule causes no action or returns no value, according to context. Optionally, the final subrule may be a default, marked by having no guard or a guard which is always true, like the wildcard *. If this default is present, its action is certain to occur if no other does.
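The evaluation order just described can be sketched in a few lines of Python. This is an illustration of the general idea only: apply_guarded_rule is a hypothetical name, a guard of None stands for the default, and the toy pluralization rule is invented for the example and is not vinci syntax.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# A subrule pairs a guard with an action; a guard of None marks the default.
Subrule = Tuple[Optional[Callable[[Any], bool]], Callable[[Any], Any]]


def apply_guarded_rule(subrules: Sequence[Subrule], context: Any) -> Any:
    """Run the action of the first subrule whose guard is true.

    Returns None when no guard is true and there is no default.
    """
    for guard, action in subrules:
        if guard is None or guard(context):
            return action(context)
    return None


# Toy rule (an assumption for illustration, not a real morphology rule).
toy_plural = [
    (lambda w: w.endswith("s"), lambda w: w + "es"),
    (lambda w: w.endswith("y"), lambda w: w[:-1] + "ies"),
    (None, lambda w: w + "s"),  # default: no guard
]

print(apply_guarded_rule(toy_plural, "bus"))       # buses
print(apply_guarded_rule(toy_plural, "city"))      # cities
print(apply_guarded_rule(toy_plural, "cat"))       # cats
print(apply_guarded_rule(toy_plural[:2], "cat"))   # None
```

The last call drops the default subrule, so no guard is true for "cat" and the rule returns no value.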

Programmers who have not met such rules before should note that they generalize if_then_else, if_then and switch_case statements, each of which is a special case of a guarded rule.