vinci: preliminary concepts

Strings, Keywords and Identifiers

The language for which vinci is generating utterances is called the object language. The language, or sets of notation, used to describe the object language is called the metalanguage.

Sequences of letters and digits, as is common in the computing field, are called strings. When strings in the object language appear in the metalanguage, they are enclosed in double quotes: "cat".

The following strings, if they occur outside double quotes, are reserved as keywords:

CHOOSE
INHERIT
SELECT
RULE
TRANSFORMATION
PRIORITY
_pre_
_in_
_is_
_and_
_makes_
_ilt_
_lsp_
_rhs_

Note that vinci distinguishes between uppercase and lowercase letters. In a few limited cases, such as RULE and rule, both forms are accepted. The alternative forms are retained for the sake of long-existing files, but it is recommended to use the forms shown here.

Identifiers are strings of letters, digits and the underscore occurring outside double quotes and, with one exception, not starting with a digit. They are used for naming items in the metalanguage: attribute types and values, attribute variables, word categories, the tags of lexical pointers, morphology rules and tables, tree nodes and terminal nodes, and so on. (The exception relates to the names of morphology rules which were once numbered rather than named.)
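The definition above can be sketched as a small check, here in Python. The function name, the morphology_rule flag and the exact range of accented letters are assumptions made for this illustration; they are not part of vinci.

```python
import re

# Assumption: "letters" means the ASCII letters plus the Latin-1 accented
# letters (0xC0-0xFE, without the multiplication and division signs).
LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FE"

# Ordinary identifiers: letters, digits and underscore, not starting with a digit.
IDENTIFIER = re.compile(rf"[{LETTER}_][{LETTER}0-9_]*")
# Morphology rule names are the one exception: they may start with a digit.
RULE_NAME = re.compile(rf"[{LETTER}0-9_]+")


def is_identifier(name, morphology_rule=False):
    """Return True if name has the shape of an identifier (hypothetical helper)."""
    pattern = RULE_NAME if morphology_rule else IDENTIFIER
    return pattern.fullmatch(name) is not None


print(is_identifier("dirobj"))                   # an ordinary identifier
print(is_identifier("2p"))                       # starts with a digit
print(is_identifier("2p", morphology_rule=True)) # allowed for morphology rules
```

The sketch checks shape only; it does not test whether the string is one of the keywords listed earlier.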

In some cases there is no clash if the same name is used for two different kinds of item. This is not recommended. Nor is the use of underscore as the first symbol, since we may want to add extra keywords in the future.

Identifiers have no structure and no inherent meaning. Thus, an identifier dirobj has no relation to identifiers dir and obj. If a certain attribute value is noted to cause "s" to be added to some English nouns and "were", rather than "was", to appear in some past tenses, it probably denotes the English plural, regardless of whether it is called plur, zyzzt or (misleadingly) fem. Indeed, if one substitutes a new identifier for an existing one systematically throughout a description, it does not affect the object language.

As noted in the Overview page, the following identifiers play a special part as rule names in the syntax:

ROOT
PRESELECT
QUESTION
ANSWER
R_3, R_4, ... R_20

These are reserved identifiers, not keywords. Though this may seem to be splitting hairs, it affects the point at which possible errors are detected, and therefore, the error messages which may be displayed. Keywords are "tokenized", i.e. converted to distinct tokens, at a very early stage of language installation. Misspelt keywords commonly damage the whole structure of rules, and trigger error messages during installation. Misspelt reserved identifiers cause problems only during sentence generation, and except for ROOT may not lead to a detected error. They may just give bad results: an ANSWER sentence not generated, a preselection not carried out, etc.
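The distinction can be pictured with a short Python sketch. The classify function is hypothetical and only models the word lists given on this page; it does not reproduce vinci's tokenizer, nor the few lowercase alternatives such as rule.

```python
# Keywords, as listed earlier on this page.
KEYWORDS = {
    "CHOOSE", "INHERIT", "SELECT", "RULE", "TRANSFORMATION", "PRIORITY",
    "_pre_", "_in_", "_is_", "_and_", "_makes_", "_ilt_", "_lsp_", "_rhs_",
}

# Reserved identifiers: ROOT, PRESELECT, QUESTION, ANSWER and R_3 ... R_20.
RESERVED_IDENTIFIERS = {"ROOT", "PRESELECT", "QUESTION", "ANSWER"} | {
    f"R_{n}" for n in range(3, 21)
}


def classify(token):
    """Say how a token outside double quotes would be treated (illustrative)."""
    if token in KEYWORDS:
        return "keyword"
    if token in RESERVED_IDENTIFIERS:
        return "reserved identifier"
    return "identifier"


for token in ("RULE", "ANSWER", "ANSWR", "R_21"):
    print(token, "->", classify(token))
```

A misspelt reserved identifier such as ANSWR is classified as an ordinary identifier, which is why the mistake surfaces only at generation time, if at all.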

Our Naming Conventions

The authors observe certain naming conventions, not required by vinci, but helpful when reading our language descriptions. They are followed in this Manual:

metavariables (i.e. names of tree and terminal nodes) and syntax transformation rules: capitals (NP, N, NEG, ...)
attribute types: capital initial (Number, Gender, ...)
attribute values and lexical pointer tags: lowercase and digits (sing, plur, p2, ...)
attribute variables: capital initial and one lowercase (Nu, Ge, ...)
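
Since vinci does not enforce these conventions, a reader may find it useful to check them mechanically. The following Python sketch guesses the kind of item from the shape of its name. The function and the precise patterns are assumptions (in particular, that an attribute type has at least two lowercase letters after its capital); the conventions above are stated only informally.

```python
import re

# Each entry pairs a kind of item with the shape its names conventionally take.
CONVENTIONS = [
    ("metavariable or syntax transformation", re.compile(r"[A-Z]+")),
    ("attribute variable", re.compile(r"[A-Z][a-z]")),
    ("attribute type", re.compile(r"[A-Z][a-z]{2,}")),  # assumed minimum length
    ("attribute value or lexical pointer tag", re.compile(r"[a-z][a-z0-9]*")),
]


def guess_kind(name):
    """Return the kind suggested by the naming conventions, or None."""
    for kind, shape in CONVENTIONS:
        if shape.fullmatch(name):
            return kind
    return None


for name in ("NP", "Number", "Nu", "p2"):
    print(name, "->", guess_kind(name))
```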

We also follow conventions in naming the files which make up a language description. These will be mentioned at the appropriate time.

Characters and Character Sets

ivi/vinci assumes that the character set in use for the program follows the ASCII standard, extended to ISO 8859-1 (Latin 1) if European accented characters are in use.

A displayable character is any which has an on-screen representation (as distinct from TAB and other control characters, which do not).

A letter is one of the alphabetical subset of these, including the European accented characters.

With a few exceptions, any letter (including accented ones) may be used in vinci identifiers, and any displayable character as letters of the object language (i.e. within double quotes).

The exceptions are as follows:

ivi/vinci uses byte 255 (y-umlaut) as an internal marker. This should not appear in a file, whether within double quotes or not. The same may possibly apply to bytes 254 (Icelandic thorn, lowercase) and 253 (y-acute).
The following have a special role in vinci files, and should not be used as object language letters: | { } "
The symbols *, ^ (circumflex) and ` (backquote) in an object language string are interpreted as the wildstar, the space-eater and the capitalizer respectively. (See later.) They should not be used as normal letters in the object language.
The ivi Editor, in its reformat operation, regards the minus symbol as a hyphen. This may affect vinci output if minus is used as an object language letter, and the output is reformatted by ivi.

ivi, incidentally, uses only ASCII characters (which include only unaccented letters) in its commands and error messages. vinci does so too in the fixed parts of messages, but may also output identifiers and object language strings. It should, therefore, be possible to use fonts with the Cyrillic or Greek alphabets if these only replace the Latin-1 accented characters. The restriction on byte 255 still applies. (If y-umlaut, or any other character represented by byte 255, is essential, it must be represented in ivi/vinci by a different byte. ivi has a feature which allows this to be displayed by y-umlaut, or whatever, on the screen, but it will have to be converted if it is to be used in any other program.)
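
The restriction on byte 255, and the possible restriction on bytes 254 and 253, can be checked before a file is given to ivi/vinci. The Python sketch below is one way to do so. The function name and the "error"/"warning" labels are our own and do not correspond to any ivi/vinci message.

```python
# Byte 255 is definitely reserved; 254 and 253 only possibly so.
SEVERITY = {255: "error", 254: "warning", 253: "warning"}


def find_reserved_bytes(data):
    """Return (offset, byte, severity) for each reserved byte in data."""
    return [(i, b, SEVERITY[b]) for i, b in enumerate(data) if b in SEVERITY]


# A Latin-1 sample containing e-acute (allowed) and y-umlaut (byte 255).
sample = "caf\u00e9 \u00ff".encode("latin-1")
print(find_reserved_bytes(sample))
```

To check a file, read it in binary mode and pass the bytes to the function.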

Patterns, Matches and Searches

When a vinci operation requires two objects to be matched, vinci commonly allows one of them to be a pattern rather than a fixed object. For example, if the objects are strings, it may allow a pattern like "me*t", where * is permitted to match any substring, including the empty one (the one having no characters). So this pattern matches "meat", "meant", "met", "merit" and a host of others.

The symbol * in this context is a wildcard or wildstar.

Other examples:

    "t*"    a string beginning "t"
    "*ing"  a string ending "ing"
    "*"     any string

There may be several wildstars in the same string (but they should not be adjacent): "b*tt*r", but not "b**r".
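
For readers who know regular expressions, a wildstar behaves like the expression ".*". The Python sketch below translates a wildstar pattern on that basis. It illustrates the matching rule only and is not vinci's implementation. Raising an error for adjacent wildstars is our own choice, since the text says only that they should not occur.

```python
import re


def wildstar_to_regex(pattern):
    """Translate a wildstar pattern into a compiled regular expression."""
    if "**" in pattern:
        raise ValueError("adjacent wildstars are not allowed")
    # Everything between wildstars is literal text; each * becomes ".*".
    literal_pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(literal_pieces), re.DOTALL)


def matches(pattern, text):
    """Return True if the whole of text matches the wildstar pattern."""
    return wildstar_to_regex(pattern).fullmatch(text) is not None


for word in ("meat", "meant", "met", "merit", "mean"):
    print(word, matches("me*t", word))
```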

If vinci needs to determine which substrings match each *, we must be aware that, if there is more than one *, the result may be ambiguous. For example, "b*an*a" matches "banana" in two different ways.
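
The ambiguity can be made concrete by listing every way the wildstars can be bound. The Python sketch below does this by simple backtracking. The function is hypothetical, assumes no adjacent wildstars, and says nothing about which binding vinci itself would choose.

```python
def wildstar_bindings(pattern, text):
    """Yield each tuple of substrings that the wildstars can stand for."""
    pieces = pattern.split("*")  # the literal text between wildstars

    def walk(index, pos, bound):
        piece = pieces[index]
        if index == len(pieces) - 1:
            # The last literal piece must end the text without overlapping
            # what has already been consumed.
            start = len(text) - len(piece)
            if start >= pos and text.endswith(piece):
                yield bound + (text[pos:start],)
            return
        # Try every position at which this literal piece occurs.
        start = text.find(piece, pos)
        while start != -1:
            yield from walk(index + 1, start + len(piece), bound + (text[pos:start],))
            start = text.find(piece, start + 1)

    if len(pieces) == 1:
        if pattern == text:  # no wildstar: an exact match binds nothing
            yield ()
    elif text.startswith(pieces[0]):
        yield from walk(1, len(pieces[0]), ())


print(list(wildstar_bindings("b*an*a", "banana")))
# prints [('', 'an'), ('an', '')]
```

The two tuples are the two readings of "banana": either the first wildstar is empty and the second stands for "an", or the reverse.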

When a match involves attribute values, vinci often uses an attribute type as a wildcard representing any value of the type.
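
As a rough picture of this kind of match, consider the Python sketch below. The table of attribute types is invented for the illustration (the value masc, in particular, is hypothetical), and the function does not reproduce vinci's attribute matching in detail.

```python
# Hypothetical attribute declarations: each type with its possible values.
ATTRIBUTE_TYPES = {
    "Number": {"sing", "plur"},
    "Gender": {"masc", "fem"},
}


def attribute_matches(pattern, value):
    """A type name matches any value of that type; a value matches only itself."""
    if pattern in ATTRIBUTE_TYPES:
        return value in ATTRIBUTE_TYPES[pattern]
    return pattern == value


print(attribute_matches("Number", "plur"))  # type used as a wildcard
print(attribute_matches("Number", "fem"))   # fem is not a Number value
print(attribute_matches("sing", "sing"))    # fixed value
```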

When searching a list to locate (or fail to locate) an object on it, vinci may again allow the object to be a pattern. The pattern may match several objects on the list.

As we have noted elsewhere, a terminal node is really a pattern, a lexical search pattern, which may identify several lexicon entries matching it.

Guarded Rules

Guarded rules appear in three different vinci contexts: context-free rules, syntax transformations and morphology rules. We discuss them here to avoid repeating the same information in three places.

A guarded rule consists of a sequence of subrules, each having a guard, i.e. a condition, and an action. Thus, it has a form such as:

    guard1 : action1;
    guard2 : action2;
    guard3 : action3;
    %

When a guarded rule is to be obeyed, the guards are tested in turn until one is found to be true. The action of the rule is simply the action corresponding to the first true guard. If no guard is true, the rule causes no action or returns no value, according to context. Optionally, the final subrule may be a default, marked by having no guard or a guard which is always true, like the wildcard *. If this default is present, its action is certain to occur if no other does.
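
The evaluation order just described can be sketched in Python. The representation of a rule as a list of (guard, action) pairs, with None standing for a missing guard, is an assumption made for this illustration, and the toy plural rule is not taken from any vinci language description.

```python
def apply_guarded_rule(subrules, item):
    """Perform the action of the first subrule whose guard is true.

    A guard of None marks a default subrule, which always applies.
    Returns None if no guard is true.
    """
    for guard, action in subrules:
        if guard is None or guard(item):
            return action(item)
    return None


# A toy rule: the first true guard wins, so "ox" never reaches the second subrule.
PLURAL = [
    (lambda w: w == "ox", lambda w: "oxen"),
    (lambda w: w.endswith(("s", "x", "ch", "sh")), lambda w: w + "es"),
    (None, lambda w: w + "s"),  # default: no guard
]

for word in ("ox", "box", "cat"):
    print(word, "->", apply_guarded_rule(PLURAL, word))
```

Without the final default subrule, the rule would return no value for "cat".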

Programmers who have not met such rules before should note that they generalize the if_then_else, if_then and switch_case statements, each of which is a special case of a guarded rule.