er;
efine :parser <laration-part>ame>rguments-part>
        <ions-part>
        <initions-part>
        <ens-part>
        <es-part>
    enddefine;

-11 syntax interface to the LALR(1) parser generator.

Contents

ect headings to return to index

Introduction

define_parser is a *define_form which simplifies the construction of parsing procedures for LALR(1) languages. Its design is influenced heavily by other similar systems, most notably the Unix yacc utility, but integrated into the Poplog environment.

define-form allows the parser to be built around a grammar for the language: this is used to generate automatically the basic parsing engine, leaving the programmer free to concentrate on writing the action code which determines what the parser does once it has identified the particular language constructs. Parsers built in this way are quicker and simpler to write than those done by hand, especially if an appropriate grammar is already available from elsewhere (perhaps as part of a language specification); they are also easier to read, understand and modify at a later date. The technique has been proven over many years by the success of yacc and its derivatives.

define-form is built directly on the parser-generator utilities defined by LIB * LR_PARSER. Users who have not encountered similar systems before might like to consult the file HELP * LR_PARSER to get some background information about the LALR(1) system and how the basic parsing engine works. It should be noted that LALR(1) grammars are of most use in describing formal languages (including programming languages, command interpreters and the like) rather than natural languages.

e all define-forms, the define_parser syntax generates a procedure at the point at which it is compiled. You can, however, arrange for a Pop-11 program to be generated simultaneously which can be loaded at a later time to recreate the parsing procedure without using this library. This gives the maximum flexibility between development and delivery. Other features of the system include a choice of parsing engines (one for tracing and one for fast execution) and independence of the parser from the lexical representation of grammar tokens.

Loading the Library

order to use the define_parser syntax, your program must include the line:

er;

define-form will then be autoloaded on first use. You should mark and load this line now if you want to compile the examples in this file.

An Example

following example defines a parser for terms in the lambda calculus and illustrates the main features of the define-form. The definition is based around a grammar for lambda terms, the same as that used in the Lambda1 example from HELP * LR_PARSER.

er Lambda();
)' VAR:isword
   exp ::=
                \ VAR:x . exp:e      { [ABS ^x ^e] }
            |   appexp
            ;
   appexp ::=
                appexp:e1 aexp:e2    { [APP ^e1 ^e2] }
            |   aexp
            ;
   aexp ::=
                VAR:x                { [VAR ^x] }
            |   '(' exp:e ')'        { e }
            ;

tokens part of the definition declares the terminal symbols of the grammar, and defines recognisers which allow the parser to identify instances of the tokens in its input: the first four tokens are taken literally, but the VAR token is declared to be any other item satisfying the test isword. The quotes around the tokens '(' and ')' are necessary because unquoted these have a special meaning as part of the define-form syntax.

rules part simultaneously declares the non-terminal symbols, the start symbol and all the productions of the grammar. Each production can have a Pop-11 action (written inside the {} brackets) associated with it, which is run automatically when the rule is reduced. Symbols in a rule can have names attached which are bound to the symbol values during execution of the action code. The actions shown here build a parse tree reflecting the abstract syntax of lambda terms.

should try marking and loading this definition now. If it seems to take a long time, that could be because the define_parser library has to be autoloaded on first use. Compiling the definition a second time should be considerably quicker.

result of the definition is a procedure:

** <cedure Lambda>

run the parser, we need to provide it with a source of input. To keep the parser independent of any particular input format, input is obtained from a procedure which generates a sequence of items; these items are mapped to grammar tokens by means of the recognisers declared in the tokens part of the definition. By default, the parser will look for an input procedure called Lambda_input (which is, of course, the grammar name with _input added to the end) which is declared automatically by the define-form, and initialised to readitem:

=> ** <cedure readitem>

ng readitem means that the parser fits comfortably with standard Pop-11 input conventions, but the input procedure can be redefined as appropriate and for demonstration purposes we shall define it to work on single characters, stopping at the end of a line:

efine Lambda_input();
        lvars c = cucharin();
        while c == `\s` or c == `\t` do
            ;;; skip leading spaces
            cucharin() ->
        endwhile;
        if c == `\n` or c == termin then
            ;;; end of input
            termin;
        elseif strmember(c, '\\.()') or islowercode(c) then
            ;;; legal item
            consword(c, 1);
        else
            ;;; illegal item
            consstring(c, 1);
        endif;
    enddefine;

can then try some examples: to run these, you must mark and load both the line containing the procedure call, and the subsequent line of input.

t; x ** [VAR x]
t; \x.\y.x ** [ABS x [ABS y [VAR x]]]
t; (\x.xx)(\x.xx) ** [APP [ABS x [APP [VAR x] [VAR x]]] [ABS x [APP [VAR x] [VAR x]]]]

the input is syntactically incorrect and can't be parsed, the parser gives up with an error:

t; f(xy))
PARSE ERROR ;;; INVOLVING: )

same happens if the input contains something the parser doesn't recognise as a token -- in this case, a string:

t; f(f0)
PARSE ERROR ;;; INVOLVING: '0'

Structure of the Define Form

parser define-form has the syntax:

efine :parser <laration-part>ame>rguments-part>
        <ions-part>
        <initions-part>
        <ens-part>
        <es-part>
    enddefine

s defines a parsing procedure called <e>t may also, depending on any options specified, define a second, input procedure called:

<laration-part>ntrols the declaration of both identifiers.

<uments-part>ecifies any arguments to the parse procedure.

<ions-part>n be omitted. If present, it contains various options for controlling the behaviour of the parser generator.

<initions-part>n also be omitted. If present, it contains ordinary Pop-11 code which is compiled inside the procedure body. This is meant for the definition of local variables and procedures used within any subsequent action code.

<ens-part> compulsory. It declares the terminal symbols of the grammar and defines recognisers for them so that the parser can identify instances of tokens in its input.

<es-part> also compulsory. It declares the non-terminal symbols, start symbol and productions of the grammar and defines actions to be invoked whenever a rule is reduced during parsing.

define-form is read using the standard Pop-11 itemiser, so the usual itemisation rules apply. In particular, comments can appear anywhere. Macro-expansion is not done, except of course in those sections which are compiled as ordinary Pop-11 code.

The Declaration and Arguments Parts

<larations-part> the define-form controls the declaration of both the parser itself and the subsidiary input procedure (where necessary). It consists of any meaningful combination of the usual declarators:

ant lconstant vars lvars procedure

HELP * DEFINE if the meaning of any of these is unclear. The actual declaration is done with pop11_define_declare so any attributes not specified explicitly in the <laration-part>herit the defaults pertaining to a standard definition.

<uments-part>clares any arguments expected by the parser and can be omitted if there are none. It has the general form:

list>

re the <ument-list> a comma-separated sequence of arguments. Each argument is normally just a word, which is implicitly declared as lvars within the procedure body, but this default declaration can be overridden by prefixing the argument with an explicit declaration made up of declarators from the set:

dlocal vars procedure

<uments-part> superficially similar to the argument list of a standard procedure definition, but with two significant differences:

The Options Part

<ions-part> the define-form has the general structure:

ptions
        <ion-sequence>
    endoptions

<ion-sequence> a possibly empty list of options, where each option has one of the forms:

;ue>

option <e> always a word; the <ue>f present) is typically a single item (word, string or number) but depends on the particular option. Adjacent options can be separated by commas.

ions specified in the <ions-part>ply only to the definition in which they occur. You can select options which apply globally to all parser definitions by assigning to the variable define_parser_options. If set, this must be a list of options, where a simple option is denoted by its name, and an option taking a value is denoted as a sub-list:

e]

ions from this global list are processed after those specified in the <ions-part> take precedence.

available options are summarised here, but many are discussed in more detail at relevant points throughout the text:

icts = <gt;<gt;
    Declares the number of conflicts expected in the parser: <gt; the
    number of SHIFT/REDUCE conflicts and <gt; the number of
    REDUCE/REDUCE conflicts. <gt;d <gt;st both be non-negative
    integers.
 = <e>
    Declares an alternative name for the parser's input procedure. The
    <e>gument must be a word; if omitted, it defaults to ""
    See the section headed Input to the Parser below.
    Passed on to the call to lr_build causing the generated parser to be
    added into the lr_parser property table keyed under the procedure
    name.
r = <e>
    Declares a name for the parser tables generated by lr_build. The
    <e>gument must be a word and is declared lconstant within
    the parser procedure, making it accessible from user code in the
    <initions-part>d the <es-part> the definition. If the
    <e>gument is omitted, it defaults to the word "r"
am = <ename>
    Writes a program to <ename>ich will recreate the parser
    without using the define-form, requiring less time and space to
    compile. The <ename>gument can be a word or string; a word
    will have the extension '.p' added automatically. If omitted, the
    <ename>faults to the procedure name. See the section headed
    Generating a Program File below.
nput
    Declares that the parser's input procedure has the call form
() ->tem, token_n)
generation of the input map. See the section headed Input to the Parser below.
t = <ename>
    Calls the procedure lr_report to write a report on the parser to
    <ename>he <ename>gument can be a word or string; on VMS
    systems, a word will have the extension '.lis' added automatically.
    If omitted, the <ename>faults to the procedure name.
ts = <t;
    Declares the default multiplicity of non-terminal symbols, i.e. the
    number of values expected from all symbols not explicitly qualified.
    The value <t;st be a non-negative integer.
ode
    Used in combination with LIB * SHOWCODE. If set, then the variable
    pop_show_code is set <e>ile the parser definition is compiled.
_checks
    Adds code to each user action to check that the number of results
    returned by the action matches that which the parser expects from
    the corresponding non-terminal symbol.
 = <bol>
    Declares <bol> the start symbol of the grammar: <bol>st
    be subsequently declared as a non-terminal symbol. The default start
    symbol is the first non-terminal declared.
n = <e>
    Declares an alternative name for the end-of-input marker. The <e>
    argument must be a word; if omitted, it defaults to "n"
    the section headed Input to the Parser below.
    Uses the lr_trace implementation of the LR(1) parsing algorithm in
    place of standard lr_parse. This produces the same results (albeit
    more slowly) but also generates a trace of the parse in an output
    file.

The Definitions Part

<initions-part> the define-form has the general structure:

efinitions
        <11-statement-sequence>
    enddefinitions

Pop-11 code is compiled in the usual way within the context of the parse procedure. This provides a place for defining local variables and procedures which are to be used in subsequent recogniser or action code, or for declaring dynamic local variables and actions.

The Tokens Part

<ens-part> the define-form has the general structure:

okens
        <en-definitions>
    endtokens

<en-definitions>nsists of a sequence of token definitions. This may be empty: it is possible to have a grammar with no terminal symbols, although its behaviour would be predictable.

imple token definition has one of three forms:

t;11-expression>
t;11-expression>

e complex definitions involving precedences are possible: these are described later. Adjacent token definitions are normally separated by a comma but this can be omitted for definitions consisting of the <en> alone.

h form of definition declares the item <en> a terminal symbol of the grammar. A token is always a word. You can write tokens inside string quotes, in which case they are read as strings but converted immediately to words using consword. There are three possible reasons for using quoted tokens:

   certain words are special to the parser syntax form and simply
        can't be used as tokens unquoted; the full list of these special
        words is:
ptions definitions enddefinitions tokens endtokens rules endrules , ; : = ::= | ( ) { }
   quoted tokens are read as strings, and so can contain characters
        which aren't normally allowed in words, e.g:
'...' '\n'
   using quotes consistently can be useful in distinguishing
        terminal and non-terminal symbols in grammar rules, e.g:
ession 'do' block 'endwhile'
, the tokens are the same written with or without quotes.

oken definition also defines a recogniser for the token it declares: the recogniser is simply a test which the parser can apply to an input item to determine whether it corresponds to that token.

the simplest definition form:

token is its own recogniser: the input item is compared against the token itself using "

the other definition forms:

t;11-expression> <en><11-expression>

<11-expression>rt is evaluated in the usual way. In the " case, the result of the expression is compared against the input item using the standard equality procedure; in the "se, the result itself is expected to be a procedure which is applied to the item.

ogniser expressions are evaluated at run-time, each time the recogniser is applied, so be wary of using tests such as:

r(% [a b c d] %)

re the closure will be repeatedly constructed. For efficiency, use the compile-time evaluation brackets:

define a local procedure in the <initions-part>

ognisers are applied to an input item in the order in which they were declared, and the first one returning a non-false result determines the token to which the item corresponds. More details about how input items are classified are given in the section headed Input to the Parser below.

The Rules Part

<es-part> the define-form has the general structure:

ules
        <e-definitions>
    endrules

<e-definitions>nsists of a sequence of rule definitions. This may not be empty: the grammar must have at least one non-terminal symbol with at least one rule.

basic form of a rule definition is:

ide>= <ht-hand-side>ction>

left-hand-side of a rule has two possible forms:

lt;tiplicity>

each case, the item <bol> declared as a non-terminal symbol of the grammar. The first non-terminal symbol declared is taken to be the start symbol of the grammar. The rules for non-terminal symbols are the same as those for tokens: symbols are always words, but may be written in string quotes where necessary. It is an error for a symbol previously declared as a token to be redeclared as a non-terminal symbol.

<tiplicity>rt of the left-hand-side is a non-negative integer which specifies the multiplicity of the symbol, i.e, how many results an instance of the symbol is expected to return. The default multiplicity is normally 1, but can be changed with the option results.

same symbol can appear on the left-hand-side of more than one rule but the multiplicity must be the same in each case. Where two or more adjacent rules share the same left-hand-side they can be abbreviated using the syntax:

ide>= <ht-hand-side-1>ction-1>
; The right-hand-side of a rule consists of a sequence of symbols, both terminal and non-terminal. It may be empty to indicate a null production. The right-hand-side can contain references to non-terminal symbols which have not yet been declared: this is quite common, since grammars are very often written in "own"e. It is an error if any of these symbols remain undeclared once all the rules have been read. A symbol occurring on the right-hand-side of a rule can be qualified with a list of identifiers: <bol>( <1>..., <n> (n >) If there is just a single identifier, the parentheses can be omitted: <bol><gt; The identifiers name the values associated with the symbol for use in the <ion>rt of the rule. The number of identifiers must match the multiplicity of the symbol. As a special case, the item "n be used as a placeholder to indicate the presence of a value which is not wanted: in the example phrase : (first, _, last) the symbol phrase is expected to return three results, but only the first and the last are interesting enough to be named: the middle one is discarded. A qualifying identifier list which consists only of "ems can be abbreviated to a non-negative integer count: <bol><t; This indicates that the symbol has <t;lues, but none of them are wanted. As before, <t;st agree with the declared multiplicity of the symbol. An unqualified symbol behaves as if it were qualified with its multiplicity: all its values are ignored. For this to work, the multiplicity must be already known, either because the symbol has been explicitly declared or (if it's a forward reference) because it has the default multiplicity. Symbols from the left- and right-hand-sides of a rule, stripped of their qualifiers, together define a production of the grammar which is added to the rule set. The <ion>rt of a rule is optional. If present, it has the form: { <11-statement-sequence> The brackets ({}) are part of the define-form syntax and have nothing to do with their use in Pop-11 as vector constructors. The action code is compiled as a lexical block, so that any lexical variables or constants declared within the action code are not visible elsewhere: variables which are to be common to all actions must be declared in the <initions-part> the define-form. The action code is evaluated each time the corresponding grammar rule is reduced by the parser. Its purpose is to compute the value(s) of the rule's left-hand-side symbol. The number of results which the action code returns MUST match the multiplicity of the left-hand-side symbol: this is a frequent source of error which cannot be checked for at compile-time, but you can use the option stack_checks to enable run-time checking. The action code can refer to the values of symbols from the right-hand-side of the rule by means of their qualifying identifiers: these are automatically declared as lexical variables within the action code block. A terminal symbol always carries a single value, which is the input item read for that token; a non-terminal symbol carries 0 or more values (depending on its multiplicity) computed by earlier reduce actions. The final result of a call to the parser is whatever is returned by the action code associated with the final reduction of the start symbol. If the <ion>rt of a rule is missing, the default action is to return the first <t;lues from the right-hand-side of the rule, where <t; the multiplicity of the left-hand-side. If the right-hand-side returns fewer then <t;lues, this will cause the error: MISHAP - TOO FEW RESULTS RETURNED BY RIGHT-HAND-SIDE Some examples: (1) exp ::= \ VAR:x . exp:e { [ABS ^x ^e] } ; This declares the symbol exp as a non-terminal symbol and declares the production: exp -->VAR . exp Because the exp symbol on the left-hand-side has no qualifier, it is assumed to be of multiplicity 1. The right-hand-side of the rule consists of four symbols, of which the values of the literal tokens '\' and '.' are ignored, but the values of the VAR and exp symbols are assigned to the lexical variables x and e respectively. The action code for the rule constructs a list containing these two values: this is a single result, correctly matching the multiplicity of the exp symbol. (2) exp ::= appexp ; This (re)declares the non-terminal symbol exp and the grammar rule: exp -->pexp There is no action code, so the default action applies: the value returned by appexp becomes the value of exp. (3) aexp ::= '(' exp:e ')' { e } ; This declares the non-terminal aexp with multiplicity 1, and the grammar rule: aexp -->' exp ')' The explicit but trivial action is necessary, because without it, the value of aexp would default to the first value taken from the right-hand-side: the value of '('. Note ==== Grammar rules can, of course, be recursive, in the sense that the right-hand-side of a rule can include its own left-hand-side symbol. The parser has a distinct preference for so-called left-recursive rules, i.e. ones where the left-hand-side symbol appears as the first symbol on the right-hand-side. Often there will be no choice how to write a particular rule, but consider the case of a non-terminal describing a simple sequence of items, for which there are two possible formulations: item_seq_A ::= /* empty */ | item_seq_A item /* left-recursive */ ; item_seq_B ::= /* empty */ | item item_seq_B /* right-recursive */ ; Semantically, these are equivalent in that they both derive the same sentences, but the parser can read the left-recursive item_seq_A in constant space, where reading an item_seq_B requires space proportional to the number of items in the sequence.

Input to the Parser

The parser obtains its input from a procedure having the general call form: input() ->em The type of item returned by the input is not constrained in any way. The parser consumes items returned by successive calls to this procedure up to some special end-of-input marker (usually <min>r until it encounters an error. The input procedure is only ever referenced by name, and the default name for a parser called <e> <e>put You can define this procedure for yourself, but if there is no declaration for it active at the time the parser definition is compiled, then it will be declared automatically using the same <laration-part> as the parser itself and given the initial value of readitem. You can choose a different name for the input procedure with the option: input = <ut-name> where the <ut-name>st be a word. In this case, you must provide a definition for the procedure yourself, since the define_parser syntax will not declare an input procedure with anything other than the default name. The <ut-name>ould be at least declared (if not defined) before the <ens-part> the parser definition is read; one possible place for this is the <initions-part> the define-form: define :parser Lambda(); options input = tokenise endoptions definitions define lconstant tokenise() ->em; lvars item, c = cucharin(); /* ... continues as above */ enddefine; enddefinitions ... enddefine; Alternatively, you can parameterise the parser on its input by including the input procedure name in the <uments-part> the definition. In this case, of course, the parser must be applied to an appropriate input procedure each time it is called. The parser maps input items onto grammar tokens using the recognisers declared in the <ens-part> the define-form. These are compiled into a separate procedure called the input map. The input map applies the base input procedure to obtain the next input item; this is compared first against the end-of-input marker (using "to determine if the input is complete, and if not, the item is then tested against each of the recognisers in turn, and the first to succeed determines the grammar token to which the item corresponds. Tokens are represented as small integer numbers: 0 stands for the end-of-input marker and other tokens are counted from 1 in their order of declaration. If an input item fails all the recogniser tests, its token number is returned as -1 which will provoke a subsequent error from the parser. The general call form of the input map is: input_map() ->tem, token_n) where item is as returned by the base input procedure and token_n is its corresponding token number. The input map generated for the Lambda grammar would look something like: define Lambda_input_map() ->tem, token_n); lvars item = Lambda_input(), token_n; if item == termin then 0; ;;; end-of-input elseif item == "en 1; ;;; \ elseif item == "en 2; ;;; . elseif item == "en 3; ;;; '(' elseif item == "en 4; ;;; ')' elseif isword(item) then 5; ;;; VAR else -1; ;;; unrecognised endif ->ken_n; enddefine; The test against the end-of-input marker is always done first. The use of <min> indicate the end of input is conventional, but you can specify a different marker with the option: termin = <-of-input-name> where the <-of-input-name>st be a word and should be declared before the <ens-part> the define-form is read. The subsequent recogniser tests are always applied in the strict order in which they were declared in the <ens-part>his is a defined property of the define_parser syntax and is important in cases such as the above where the isword test for a VAR token would be equally satisfied by the other special tokens, if applied first. A rearrangement of the Lambda <ens-part> read: tokens VAR:isword; \ . '(' ')' endtokens would be incorrect, since only the VAR token would ever be recognised. Use of the input map is a convenience to allow off-the-shelf itemisers such as readitem to be integrated easily with the parser generator. Many languages, however, require special-purpose lexical analysers which will typically classify items down to token level, making the extra level of interpretation imposed by the input map an unnecessary expense. You can disable use of the input map by specifying the option: raw_input This acts as a declaration that the input procedure itself has the call form: input() ->tem, token_n) making the input map redundant. In this case, any explicit recognisers included in the <ens-part>e ignored -- or rather, are compiled with pop_syntax_only set to <e> which allows the same definition to be compiled for either input form. Obviously, when using this option, the token numbers returned by the input must agree with the order in which the tokens themselves are declared in the <ens-part> the definition. There is no way of ensuring this automatically, so some care is needed.

Structure of the Parser

The define_parser syntax form creates a parsing procedure using the parser-generation utilities described in REF * LR_PARSER. The resulting procedure is equivalent to an ordinary Pop-11 procedure with the general outline: define <laration>ame><uments> /* 1 */ <ument-declarations> lconstant <ser>lr_build(<e>grammar> /* 2 */ <al-definitions> /* 3 */ define lconstant <ut-p>->tem, token_n); /* 5 */ lvars item, token_n; ... enddefine; define lconstant <uce-p> rule_n */); /* 6 */ go_on /* rule_n */ to ... enddefine; lr_parse(<ut-p>reduce-p>parser> /* 4 */ enddefine; The principal components of this definition are as follows:
    procedure <e> as originally given; its <laration>d the <uments>d their declarations are synthesised in an obvious way from the <larations-part>d <uments-part> the define-form.
  1. <ens-part>d <es-part> the define-form declare the symbols and productions of a grammar. These are passed at compile- time to the procedure lr_build -- with the procedure <e>so used as the grammar name -- and the resulting parser structure which encodes the parsing tables for the grammar is bound to an lconstant identifier within the procedure body. The default name for the parser is a private one, but you can specify an alternative name with the option:
parser = <ser-name> where the <ser-name>st be a word. This allows the parser tables to be referenced from Pop-11 code in the subsequent <initions-part>d <es-part>his is particularly useful for error reporting: see the section on Error Handling below.
    local definitions are simply copied from the <initions-part> of the define-form. This code is the first to be executed when the procedure is applied, so can be used to initialise the parse.
  1. main action of the procedure lies in the final call to lr_parse which interprets the parsing tables constructed at (2). The interpreter requires two additional procedure arguments defined locally at (5) &. No code is executed after this call returns -- except, of course, for any dlocal exit actions included in the local definitions -- so this determines the result of the parse.
The option: trace causes the trace interpreter lr_trace to be used instead of the standard lr_parse: this uses the same parsing algorithm, so produces the same results, but will additionally generate a trace of its actions, usually in a Ved window.
    input procedure provides the input source for the parse: this is simply the input map discussed above. If the option raw_input is specified, this local definition is omitted and the base input procedure is passed directly to lr_parse.
  1. reduce procedure is called once for each reduction in the parse. This is generated from the action code in the <es-part> the define-form and consists of a single go_on which switches on the rule number to select the appropriate action-code block. Code is added to the start of each action to declare the identifiers named on the right-hand-side of the rule and initialise them with items popped from the stack.
You can elect to have the program text for the parsing procedure written out to a separate file so that it can be compiled independently at a later date: this is described more fully in the section headed Generating a Program File below.

Additional Features

These are more advanced features which can be ignored initially, but which can significantly improve more complex applications.

Intermediate Actions in Rules

The full syntax of rules allows for intermediate actions to be interspersed with symbols on the right-hand-side: <t-hand-side>= <ion-1>ymbol-1>. <ion-n>ymbol-n>ction> ; Each intermediate action adds to the grammar a new non-terminal symbol with a single empty rule as follows: <t-hand-side>= '{1}' <bol-1>. '{n}' <bol-n>ction> ; '{1}' ::= <ion-1> ... '{n}' ::= <ion-n> The new symbols are kept distinct from any explicitly declared and will appear as '{1}', '{2}', ... etc. in the parser report file. The empty rules can introduce conflicts into the parser, so it's a good idea to get the grammar as complete as possible before adding any intermediate actions which might confuse things. Intermediate actions have the same form as the final actions already discussed. The action code is executed when it is encountered by the parser during the reading of the rule. Such actions are useful for checking values just read, or for initialising the next symbol to be read. The code can refer to named values occurring to the left in the rule, but not to those on the right since they haven't been read at the time the code is executed. Intermediate actions can modify results returned by symbols or actions on the left, and the changed values will be seen by any subsequent actions occurring on the right. For example, in the rule: binding ::= var:id { declare(id) -> } '=' expression:e { ... id ... } the reference to id in the final action will see the value returned by the previous call of declare. Intermediate actions can also return additional results, indicated by qualifying the action with an identifier list in the same way as for a symbol: <ion>( <1>.., <n> Unqualified actions are always assumed to return no results, regardless of the defaults applying to non-terminal symbols.

Meta Symbols and Rules

Meta symbols and rules provide a macro-like facility for abbreviating common forms in grammars -- a user-definable meta-notation for expressing concepts such as the optional occurrence of phrases, the repetition of phrases etc. For example, the notion of an optional phrase could be expressed with the meta rules: OPT(phrase) ::= /* empty */ { false } | phrase ; This in itself adds nothing to the grammar, but the OPT meta-symbol can be used in other rules wherever an optional item is required, as in: vars ::= /* empty */ { [] } | var:v OPT(',') vars:vs { v :: vs } ; Meta-symbols and their rules are declared in the <es-part> the define-form along with the standard rules. The order of declaration is unimportant: meta-symbols can be forward referenced in the same way as ordinary non-terminal symbols. The form of a meta-rule definition is the same as for any other rule, except that the left-hand-side <bol> replaced by a <a-symbol> of the form: <bol><mal-parameters> Meta-symbols have the same itemisation rules as tokens and non-terminal symbols and they share the same name-space: you cannot have a meta-symbol with the same name as a token or other symbol. A meta-rule is distinguished from an ordinary rule solely by the presence of parentheses on the left-hand-side. As with other rules, consecutive meta-rules with the same left-hand-side can be combined. The <mal-parameters>rt of a meta-rule consists of a comma- separated sequence of formal parameter names which again obey the usual itemisation rules. Formal parameter names can be used on the right-hand-side of the meta-rule in place of tokens or non-terminal symbols; they cannot be used to stand for Pop-11 identifier names, or anywhere in the action code associated with the rule. There is no notion of whether a particular formal parameter is meant to stand for either a token or a non-terminal symbol, but the multiplicity of the parameter must be known. Each formal parameter is assumed to have the default multiplicity unless it is explicitly qualified on the left-hand-side of the meta-rule, as in: EXAMPLE(phrase:2) ::= ... where the formal parameter phrase is declared to have multiplicity 2. As with ordinary symbols, the meta-symbol itself can be declared to have a multiplicity different to the default. The number and multiplicities of the formal parameters together with the multiplicity of the result determines the signature of the meta-symbol. For the examples so far, the signatures are: OPT(1) : 1 EXAMPLE(2) : 1 Repeated declarations for the same meta-symbol must have the same signature. A meta-rule definition does not in itself add anything to the grammar being defined. However, on the right-hand-side of any rule -- including other meta-rules -- the occurrence of a grammar symbol can be replaced by an application of a meta-symbol, written: <bol><ual-parameters> The actual parameters can be tokens, non-terminal symbols, nested meta-applications, or -- in the case of a meta-rule definition -- formal parameters of the meta-rule. Needless to say, the multiplicities of the actual parameters and the number of results expected from the application must match the signature of the meta-symbol being applied. Each distinct application of a meta-symbol adds a new non-terminal symbol to the grammar. These new symbols are distinct from any explicitly declared: in any output produced by the parser generator they appear as the application itself. Rules for the new symbols are generated by copying the rules associated with the meta-symbol, consistently replacing all formal parameters by the actual parameters supplied to the application. So the example rule for vars given above is exactly equivalent to the following expansion: vars ::= /* empty */ { [] } | var:v 'OPT(,)' vars:vs { v :: vs } ; 'OPT(,)' ::= /* empty */ { false } | ',' ; All copies of a meta-rule share the same action code, because the code is independent of the parameters to the rule. Repeated applications of a meta-symbol to the same actual parameters share the same generated symbol and rules. This means that recursive and nested meta-rules can be freely used without any efficiency penalty: SEQ(phrase, separator) ::= /* empty */ { [] } | phrase:x separator SEQ(phrase, separator):xs { x :: xs } ; vars ::= SEQ(var, OPT(',')) ;

Operator Precedences

The full syntax of token definitions allows tokens to be declared as operators with numeric precedence as follows: <cedence><ple-token-definitions> This declares all the tokens defined within the parentheses to have the specified <cedence>he <cedence> omitted for non-operator tokens. The <ple-token-definitions>e as before; the parentheses can be omitted for a single definition. The <cedence> an integer number. The magnitude of the number determines the binding power of the operator: the smaller the magnitude, the tighter the operator binds. The sign of the number determines its associativity: positive means left-associative and negative means right-associative. These are the same conventions used for operators in Pop-11. A rule normally inherits the precedence of its right-most token. This can be overridden for a particular rule by writing an explicit precedence immediately before its right-hand-side: <t-hand-side>= <cedence>ight-hand-side>ction> Precedences are used in the resolution of SHIFT/REDUCE conflicts (described in HELP * LR_PARSER). In a SHIFT/REDUCE conflict, the parser has a choice between SHIFTING some token S onto the parser stack, or REDUCING by a rule R. If both S and R have precedences, the choice can be determined by using the rule: IF S binds more tightly than R OR S and R have the same binding power AND S is right-associative THEN choose the SHIFT OTHERWISE choose the REDUCE Expressed as Pop-11 code: define resolve(S, R) ->oice; lvars S, Sp, R, Rp, choice = undef; if (precedence(S) ->p) and (precedence(R) ->p) then if abs(Sp) <s(Rp) or abs(Sp) = abs(Rp) and Sp < then S else R endif ->oice; endif; enddefine; The rationale for this is best demonstrated by an example. The rule set: exp ::= exp '+' exp ; generates a parser with 20 SHIFT/REDUCE conflicts. This is clearly because the rules are ambiguous, as illustrated by the input: NUMBER '+' NUMBER '*' NUMBER This sentence has two alternative interpretations, resulting in a conflict once the parser has read to the position: NUMBER '+' NUMBER <t;' NUMBER A SHIFT here corresponds to the interpretation: NUMBER '+' ( NUMBER '*' NUMBER ) while a REDUCE leads to the alternative: ( NUMBER '+' NUMBER ) '*' NUMBER Informally, we would resolve this ambiguity by taking into account the precedences of the operator symbols '+' and '*': multiplication normally takes precedence over addition, making the first (SHIFT) alternative the preferred one. We can make this explicit in the parser by declaring the operator tokens as follows: 5 ( '+' '-' ) 4 ( '*' '/' ) NUMBER This is sufficient to remove all ambiguities from the grammar. In the specific example, the choice is between a SHIFT of token '*' and a REDUCE by rule exp -->p '+' exp The precedence of '*' is 4; the precedence of the rule, derived from that of its right-most token '+', is 5. Hence the choice favours the SHIFT. This parser is not quite correct however, because of the use of the '-' token to stand for two distinct operators: negation and subtraction. As a prefix operator, '-' binds very tightly, but the parser doesn't know this: in the configuration '-' NUMBER <t;' NUMBER the parser will still choose to SHIFT the '*' token, because of its apparently higher binding power. The solution is to add an explicit precedence to the negation rule: exp ::= exp '+' exp ; This overrides the default derived from the precedence of the token '-' and means that the rule has a lower precedence than anything else, ensuring that it will always be reduced first. In this conflict-resolution strategy, the associativity of a rule is always disregarded: there is no obviously best choice to make in the face of conflicting associativities, so the decision made here (based on the token alone) is arbitrary. It is therefore better to avoid having operators with different associativities at the same precedence level. The use of operator precedences (where appropriate) to resolve conflicts is a much better solution than the alternative strategy of creating a hierarchy of additional non-terminal symbols which reflects the hierarchy of the operators, like this: exp ::= add_exp; add_exp ::= add_exp add_op mul_exp | mul_exp ; mul_exp ::= mull_exp mul_op NUMBER | NUMBER ; In the first place, operator grammars are usually easier to write and understand. They're also more efficient: introducing the extra symbols and rules means that the parser spends much of its time promoting symbols through the hierarchy, i.e, instead of a single reduction NUMBER ==>p we now have: NUMBER ==>l_exp ==>d_exp ==>p

Generating a Program File

Compiling a parser definition involves a considerable amount of library code. The time and space overhead imposed by loading this code is typically acceptable for development systems, but may not be so for production systems, particularly as the majority of the code is redundant at run-time. To ameliorate this problem, you can arrange to generate a Pop-11 program file which can be loaded in place of the define-form, and which will recreate the parser without the use of the define_parser library. This feature is enabled with the option: program = <ename> where <ename> the name of the program file to create. The generated file contains just a single definition for the parse procedure and so may not be suitable for compiling on its own: if the parser was defined originally inside a section, for example, or declared lconstant. The file is best used in conjunction with the #_INCLUDE mechanism which performs a textual insertion of the file contents. The following example demonstrates a useful idiom which supports the use of the program option: #_IF DEF DEVELOPMENT define :parser foo(); options program = 'fooProg.p' endoptions ... enddefine; #_ELSE #_INCLUDE 'fooProg.p' #_ENDIF If this is compiled in a context where DEVELOPMENT is defined to have a non-false value, the procedure foo will be built using the parser define-form and a corresponding program will be written out to the file 'fooProg.p'; in any other context, the 'fooProg.p' file will be compiled instead. The use of #_INCLUDE to load the program file ensures that it is compiled in the same lexical/section context in which the define-form was originally processed. See REF * PROGLIST for an explanation of the "acros. Note that the define-form will always generate a procedure at the point at which it is compiled, regardless of whether the program option is in effect. Also, the program which is written out is unaffected by the stack_checks and trace options.

Error Handling

On encountering an error, the parser calls the procedure: lr_parse_error(item, token_n, state) where item is the input item which caused the parse to fail, token_n is its token number and state is the state number in which the error arose. This procedure is redefinable. Its default action is to call: mishap(item, 1, 'PARSE ERROR'); This behaviour is unlikely to be acceptable for any but the simplest of cases. Improving the error handling requires redefining the error procedure. This can be done locally within the <initions-part> the define-form, i.e: definitions define dlocal lr_parse_error(item, token_n, state); lvars item, token_n, state; ... enddefine; enddefinitions

Error Reporting

Redefining the error procedure allows error messages to be tailored more appropriately for the application. One obvious distinction in errors is that item may simply be something which failed all the input recogniser tests and so doesn't correspond to any grammar token: in this case, token_n will be -1 and a special message such as 'ILLEGAL INPUT' can be given for this event. Otherwise, the error is a true parsing error in the sense that the token is known, but is invalid in the current state. You can use the parser report file (produced with the report option) to see which parser states can generate errors: basically, any state without a default action. One strategy for error reporting is simply to enumerate all error states and devise an appropriate message for each one: the state argument to the error routine can then be used to index a table of error messages. This might give the most precise reports, but it can be a tedious job when the number of states runs into hundreds, and has a major drawback in that the composition and numbering of states is very sensitive to changes in the grammar: the addition of a single extra rule could completely disrupt such a system. There is little point in using this strategy unless the grammar can be considered extremely stable. A more fruitful approach is to use the information available from the parser at run-time to construct an error message dynamically. This means using the parsing tables, which can be obtained with the parser option (see above). The tables can be investigated with the procedures defined in LIB * LR_STATE for which you must include in your program the line: uses lr_state; The state procedures allow you to determine which tokens and symbols are valid in the current state, the set of potential successor states and so on. They are fully described in REF * LR_PARSER. For example, we could augment the Lambda parser shown earlier with the following: options parser ;;; gives access to the parser tables endoptions definitions ;;; redefine the error procedure local to the parser define dlocal lr_parse_error(item, token_n, state); lvars item, token_n, state, token, tokens; mishap(item, 1, ;;; produce an appropriate message if token_n == -1 then 'ILLEGAL INPUT' else lr_state_tokens(state, parser) ->okens,); dest(tokens) ->oken, tokens); if tokens /== [] then 'ITEM CANNOT START AN EXPRESSION' elseif token == termin then 'UNEXPECTED ITEM FOUND AT END OF EXPRESSION' else 'UNEXPECTED ITEM FOUND READING TO: ' sys_>oken endif; endif); enddefine; enddefinitions This would cause the parsing errors shown before to report as: Lambda() => f(xy)) ;;; MISHAP - UNEXPECTED ITEM FOUND AT END OF EXPRESSION ;;; INVOLVING: ) Lambda() => f(f0) ;;; MISHAP - ILLEGAL INPUT ;;; INVOLVING: '0' A more sophisticated example of error reporting is given in the example file LIB * LR_EXAMPLE_1.

Error Recovery

You cannot (at present) recover from parsing errors, in the sense that once the procedure lr_parse_error has been called, it is impossible to resume parsing from that state. The only option is to exit the current invocation of the parser and re-enter, doing whatever cleaning up is required. It is intended that some support for automatic recovery will be provided in the future.

Hints on Parser Development

A successful parser depends above all on a correct grammar, so concentrate on getting the grammar right before worrying about the action code. Try to minimise the number of conflicts in the grammar. If these do arise, they will lead to warnings of the form: WARNING - <t;IFT/REDUCE CONFLICTS WARNING - <t;DUCE/REDUCE CONFLICTS Such warnings indicate that the parser cannot interpret the grammar exactly as it's written: it will still run, but may not behave quite as intended. Don't ignore them. For help in tracking down the source of conflicts, look at the parser report file which is produced with the option: report = <ename> This is, unfortunately, difficult to understand without some background knowledge, for which see HELP * LR_PARSER. Conflicts which cannot be eliminated but which are judged to be benign can be declared with the option: conflicts = <t;<t; If the number of conflicts found when the definition is compiled matches exactly the number declared, then no warning messages will be produced. If it's not obvious what the parser's doing when it runs, use the trace option to compile it in trace mode. This produces a step-by-step trace of the parser's actions in a Ved buffer, showing in particular the sequence of reductions being performed. A description of the trace output can be found in REF * LR_PARSER. As an alternative, the option keep arranges for the parser tables to be saved in a property from where they can be recovered later with the call lr_parser(<e>>rser The parser structure can be used with any of the procedures described in REF * LR_PARSER. In particular, if your action code is in a mess, use the lr_trace procedure to interpret the tables directly and generate a parse-tree which can be displayed with LIB * SHOWTREE. A common error with the action code is to return the wrong number of results -- more or less than was implied by the multiplicity of the corresponding left-hand-side symbol. This corrupts the parser's value stack, but the resulting errors may not show up until some distance from the actual cause of the problem, making it difficult to diagnose. The option stack_checks causes additional code to be added to each action code block to check the number of results returned and raise an error immediately on encountering a discrepancy. The error message is of the form: MISHAP - STACK CHECK FAILED IN PARSER <e>INE <t; INVOLVING: <und>p>wanted> where <und> the number of results actually returned, <nted> the number expected and <gt; either " " appropriate. You can set options without changing your program by setting the variable define_parser_options before compiling, e.g: vars define_parser_options = [keep stack_checks];

Further Examples

Realistic examples are too large to be included in a help file, so the following libraries have been provided for demonstration purposes. They can be viewed using <ER>owlib LIB * LR_EXAMPLE_1 Defines a complete compiler for a tiny imperative language using the parser generator in conjunction with the Poplog Virtual Machine. LIB * LR_EXAMPLE_2 A parser for ANSI C based on the grammar in the second edition of The C Programming Language by Kernighan and Ritchie. --- C.all/help/define_parser --- Copyright University of Sussex 1993. All rights reserved.