REF ITEMISE                                         John Gibson Nov 1995

       COPYRIGHT University of Sussex 1995. All Rights Reserved.

<<<<<<<<<<<<<<<<<<<<<                             >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<       ITEMISATION AND       >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<        LEXICAL SYNTAX       >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<                             >>>>>>>>>>>>>>>>>>>>>>

This file deals with  the Pop-11 itemiser, which  splits up a stream  of
characters into a stream of items according to 12 pre-defined character
classes. Each item is one of the following types: word, string,  integer
(or biginteger), ratio, floating-point (decimal or ddecimal), or complex
number: rules  are given  for  the recognition  of  each of  these.  The
representation of Ved graphic  characters and control  codes in text  is
explained, as is the use of Ved character attributes.

         CONTENTS - (Use <ENTER> g to access required sections)

  1   Character Classes

  2   Syntax of Items Produced by the Itemiser
      2.1   Word
      2.2   String
      2.3   Integer
      2.4   Floating-Point
      2.5   Ratio
      2.6   Complex Number

  3   Operation of Character Classes
      3.1   Alphabeticiser (Class 12)
      3.2   End-of-line Comments (Class 9)
      3.3   Bracketed Comments (Classes 10 and 11)

  4   Backslash in Strings & Character Constants
      4.1   Control Characters
      4.2   Ved Graphics Characters
      4.3   Ved Special Space Characters
      4.4   Explicit Integer Character Code
      4.5   Backslash in Words
      4.6   Ved Character Attributes
      4.7   Ved Characters with Associated Data

  5   Associated Procedures

  6   Exceptions Raised

1  Character Classes

The itemiser procedure returned by incharitem (see below) takes a stream
of input characters produced by a character repeater procedure and turns
it into a stream of items for  compilation, or any other use. To  effect
this process, each  ASCII character value  from 0 -  255 has  associated
with it an integer defining the class of that character, the class  of a
character governing how it is treated.

The 12  pre-defined classes  are described  below. Note  that the  class
names (and examples of them) are determined by the normal assignment  of
classes to  characters, although  by using  item_chartype the  user  can
assign any  character to  any desired  class, either  globally or  for a
particular item repeater (thus, for example, the letter "A" can be made
to behave as if it were a separator in class 5).

    Class   Description
    -----   -----------
        1   Alphabetic - the letters a-z, A-Z ;

        2   Numeric - the numerals 0-9;

        3   Signs -- characters like "+",   "-",  "#", "$",  "&"  etc; A
            character in classes  10 and  11 (bracketed comment  1 &  2)
            will default to this class  if not occurring in the  context
            of such a comment.

        4   Underscore, i.e. "_" ;

        5   Separators - the characters ".", ",", ";", """, "%" and the
            brackets "[",  "]", "{",  "}". Control  characters are  also
            included in this class (except for those in class 6), as are
            all characters 128-255;

        6   Spaces - the space, tab and newline characters;

        7   String quote - the apostrophe character;

        8   Character quote - the character "`";

        9   End-of-line comment character - the character ";" (but see
            End-of-line Comments below);

        10  Bracketed comment or sign, 1st character - the character
            "/" ;

        11  Bracketed comment or sign, 2nd character - the character
            "*" ;

        12  Alphabeticiser - this is a special class that forces the next
            character in the  input stream  to be  of class  alphabetic,
            i.e. class 1 - see below. "\" (backslash) may be given  this
            type by default in later versions of Poplog.

New  classes  other  than  these  can  be  defined  with  the  procedure
item_newtype (see below under Associated Procedures).

2  Syntax of Items Produced by the Itemiser

The itemiser splits up  a stream of characters  into a stream of  items,
each item being one of the following types:

    ¤   word
    ¤   string
    ¤   integer (or biginteger)
    ¤   ratio
    ¤   floating-point (decimal or ddecimal)
    ¤   complex number

This is done according to the following rules:

2.1  Word
A word is represented by either

    ¤   a sequence  of alphabetic or numeric  characters beginning
        with an alphabetic one, e.g. "abc123", "X45" ;

    ¤   a sequence of sign characters, e.g. "+", "&$+" ;

    ¤   a sequence of  words produced  by either  of the  preceding,
        joined by underscores, e.g. "fast_+";

    ¤   a single separator character, e.g. "[" ;

    ¤   a sequence of characters in a new class created by item_newtype
        (see Associated Procedures below).

2.2  String
A string  is represented  by  any sequence  of characters  starting  and
ending with string quotes, e.g.  'abcdefgh12&3'. If   the characters  of
the string extend over more than one line, the newline character at  the
end  of  the  line  must  be  preceded  by  a  "\"  (backslash),  unless
pop_longstrings is  true,  i.e.  if pop_longstrings  is  false  then  an
unescaped newline causes a mishap.

There is also additional syntax inside strings for representing  special
characters, e.g. a  newline can be  inserted as '\n'.  See Backslash  in
Strings & Character Constants below.

2.3  Integer
An integer is represented by either

    ¤   A sequence of  digits, optionally preceded  by a minus  sign
        e.g. 12345, -789;

    ¤   A number preceded  by an  integer and a  colon (:),  meaning
        that the number is to be  taken to the base of the  integer,
        e.g. `2:1101` represents 13 as a binary number. The  integer
        base must be  in the  range 2-36;  if greater  than 10,  the
        letters A-Z (uppercase only)  can be used  in the number  to
        represent  digit  values  from  10  to  35,  e.g.  `16:1FFA`
        represents 8186 as a hexadecimal number.

        If a minus sign is present, this may either follow the radix
        or precede it, e.g. both 8:-77 and -8:77 are valid.

    ¤   A character  constant,  giving  the integer  code  for  that
        character. This is any character preceded and followed  by a
        character  quote.  E.g.  `a`  gives  the  ASCII  value   for
        lowercase  "a"  (97).  See   also  Backslash  in   Strings &
        Character Constants below.

Except in  the character  constant case,  an integer  may optionally  be
followed by  the  letter 'e'  and  a  (signed or  unsigned)  integer  to
indicate an exponent specification, i.e. NeI will produce

        N * (b  ** I)

where b is the radix of N. This may actually result in the production of
a ratio rather than an integer, e.g.

        2:110e5 = 2:110 * (2 ** 5) = 192
        23e-2   = 23 * (10 ** -2)  = 23_/100

If the integer  read in is  too large  to be represented  as a  "simple"
object (see REF * DATA) then a biginteger is created. E.g.

        isinteger(123456789) =>
        ** <true>
        isinteger(123456789123456789) =>
        ** <false>
        isbiginteger(123456789123456789) =>
        ** <true>
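
The radix and exponent rules above can be modelled outside Pop-11. The
following is a minimal Python sketch, not the itemiser's own code: the
name parse_integer_literal is hypothetical, character constants are not
handled, and the exponent is assumed to be read in decimal (which is
what the 2:110e5 example implies).

```python
from fractions import Fraction

# Digit characters for radices up to 36 (uppercase letters only, as in the text).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def parse_integer_literal(text):
    """Evaluate [-][radix:][-]digits[e[+|-]digits] as an int or a Fraction."""
    negative = False
    radix = 10
    if text.startswith("-"):                  # sign before the radix: -8:77
        negative, text = True, text[1:]
    if ":" in text:
        radix_text, text = text.split(":", 1)
        radix = int(radix_text)
        if not 2 <= radix <= 36:
            raise ValueError("radix must be in the range 2-36")
        if text.startswith("-") and not negative:   # sign after the radix: 8:-77
            negative, text = True, text[1:]
    mantissa, has_exp, exponent = text.partition("e")
    if not mantissa:
        raise ValueError("no digits")
    value = 0
    for ch in mantissa:
        digit = DIGITS.index(ch)              # ValueError for any non-digit
        if digit >= radix:
            raise ValueError(f"digit {ch!r} is not valid in radix {radix}")
        value = value * radix + digit
    if negative:
        value = -value
    # NeI means N * (b ** I); the exponent is assumed to be decimal.
    result = value * Fraction(radix) ** (int(exponent) if has_exp else 0)
    # A whole-number result stays an integer, otherwise a ratio is produced.
    return result.numerator if result.denominator == 1 else result
```

With this sketch, "2:110e5" evaluates to the integer 192 and "23e-2" to
the ratio 23/100, matching the two examples given above.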

2.4  Floating-Point
A floating-point literal is a sequence of numeric characters  containing
a period, e.g.  `12.347`; as with  integers, this can  also be  prefixed
with a base,  i.e. an integer  followed by a  colon. (The whole  number,
including fractional places, is taken to this base.)

As with integers, an exponent specification may follow, but in this case
any of the letters 'e', 's' or 'd' may be used. That is

            NeI   NsI   NdI

all produce

            N *  (b ** I)

where b is the base  of N. The difference between  them is that 'e'  and
'd'  specify  a  double-float  (ddecimal),  whereas  's'  results   in a
simple-float (decimal). Thus

        23.0e-2  = 23.0 * (10 ** -2)  = 0.23     (ddecimal)
        2:11.1d5 = 2:11.1 * (2 ** 5)  = 112.0    (ddecimal)
        56.2s+3  = 56.2 * (10 ** 3)   = 56200.0  (decimal)

If the  exponent  specification  is  omitted,  the  result  is  always a
double-float (ddecimal), regardless of the value of popdprecision.
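
As an illustration of how the radix applies to the fractional places and
how the exponent letter selects the result type, here is a minimal
Python sketch. The name parse_float_literal is hypothetical, signs on
the mantissa are not handled, and the returned kind is only a label
(Python floats are always double precision).

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def parse_float_literal(text):
    """Return (value, kind) for an unsigned literal such as 2:11.1d5."""
    radix = 10
    if ":" in text:
        radix_text, text = text.split(":", 1)
        radix = int(radix_text)
    # Split off the exponent at the first 'e', 's' or 'd', if there is one.
    letter, exponent = None, 0
    for i, ch in enumerate(text):
        if ch in "esd":
            letter, exponent = ch, int(text[i + 1:])
            text = text[:i]
            break
    whole, point, frac = text.partition(".")
    if not point:
        raise ValueError("a floating-point literal must contain a period")
    mantissa = 0
    for ch in whole + frac:
        digit = DIGITS.index(ch)              # ValueError for any non-digit
        if digit >= radix:
            raise ValueError(f"digit {ch!r} is not valid in radix {radix}")
        mantissa = mantissa * radix + digit
    # The fractional places are taken to the same base as the whole part.
    exact = Fraction(mantissa, radix ** len(frac)) * Fraction(radix) ** exponent
    # 's' gives a decimal; 'e', 'd' or no exponent gives a ddecimal.
    kind = "decimal" if letter == "s" else "ddecimal"
    return float(exact), kind
```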

2.5  Ratio
A ratio  is  two integers  (numerator  and denominator)  joined  by  the
character sequence `_/`,  e.g. 2_/3, -467_/123678.  If the numerator  is
preceded by a radix,  then this radix applies  also to the  denominator;
the denominator itself must not have a radix or preceding minus sign.

Note that owing to the rule of 'rational canonicalisation' the resulting
object will actually be a ratio with the greatest common divisor divided
out of numerator and denominator, or an integer if this would make the
denominator equal to 1.
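
The canonicalisation rule can be sketched in a few lines of Python. This
is only a model of the arithmetic: canonical_ratio is a hypothetical
name, and a ratio is represented here as a (numerator, denominator)
tuple with a positive denominator.

```python
from math import gcd


def canonical_ratio(numerator, denominator):
    """Divide out the greatest common divisor; return an int if possible."""
    g = gcd(numerator, denominator)
    numerator, denominator = numerator // g, denominator // g
    # A denominator of 1 means the result is an integer, not a ratio.
    return numerator if denominator == 1 else (numerator, denominator)
```

So 4_/6 would be held as 2_/3, while 6_/3 would become the integer 2.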

2.6  Complex Number
A complex number is any two of the above kinds of number (the real  part
and the imaginary part) joined by the character sequence `_+:` or `_-:`,
e.g. 2_+:3, 1.2_+:8.9, 5_/4_-:3_/2.

The imaginary  part must  not  have either  a radix  specification  or a
preceding minus sign; as with ratios, the radix of the first number  (if
any) carries over to the second, and  the sign of the imaginary part  is
determined by the joining sequence, `_+:` or `_-:`. If an explicit radix
is specified, then  this must PRECEDE  any minus sign  on the real  part
(that is, 16:-A_+:B is valid, but not -16:A_+:B).

The two numbers  may be of  different types, although  when either  is a
floating-point the actual  result will  have both parts  coerced to  the
same type  of float;  in addition,  when both  parts are  rational,  the
result will be a rational rather than a complex if the imaginary part is
integer 0.

3  Operation of Character Classes

The itemiser reads characters and produces items from them according  to
the rules given above;  all characters in the  space class are  ignored,
and only serve to delineate item boundaries (but see popnewline below).

The effects of the other classes not mentioned in the preceding rules,
i.e. the comment classes and the alphabeticiser, are as follows:

3.1  Alphabeticiser (Class 12)
An occurrence of  a character of  this class causes  the next  character
read to be  interpreted as  having class alphabetic,  regardless of  its
actual class. Assuming that \ has this class, this means that, for
example,

        A\+B\-C   \&_\[\{\(   \12345

are all valid 5-character words. In addition, the following character is
also interpreted  as  for  the  character following  \  in  strings  and
character constants  (see Backslash  in  Strings &  Character  Constants
below), thus enabling non-printable characters to have class alphabetic,
e.g.

        \nA\^A\^Z\r

is a word consisting  of the characters newline,  A, Ctrl-A, Ctrl-Z  and
carriage return (ASCII 10, 65, 1, 26, 13).

3.2  End-of-line Comments (Class 9)
A character in this class causes the rest of the current line up to a
newline to be treated as a comment and ignored. Normally, this character
is semicolon  ";" and,  IN THIS  CASE ONLY,  3 adjacent  semicolons  are
actually required for  a comment. If  a semicolon occurs  by itself,  or
only adjacent to one other, then it is treated as a separator (class 5).
(This is due to the Pop-11 compiler needing ";" for punctuation, and the
fact that ";;;" has always been the Pop-11 comment escape.)
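
The three-semicolon rule can be pictured with a small Python sketch. It
is a simplification under stated assumptions: strip_eol_comment is a
hypothetical name, the default class assignments are assumed, and
semicolons inside strings or character constants are not treated
specially as the real itemiser would treat them.

```python
def strip_eol_comment(line):
    """Drop everything from the first ';;;' to the end of the line."""
    start = line.find(";;;")
    # One or two adjacent semicolons are separators, not a comment.
    return line if start < 0 else line[:start]
```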

3.3  Bracketed Comments (Classes 10 and 11)
These two classes provide  for comments which  begin with a  2-character
sequence like `/*` and  end with the reversed  sequence `*/`, and  which
otherwise occupy any number of characters or lines in between. The start
of such  a comment  is  therefore recognised  as  a class  10  character
immediately followed by a class 11 character, after which characters are
read and discarded until the sequence  class 11 followed by class 10  is
encountered. While a comment is being read, another occurrence of class
10 followed by class 11 is taken as the start of a nested comment, so
that nested comments are matched correctly. For example (assuming / and
* have classes 10 and 11 respectively):

        1 -> x; /* this is a comment */ 2 -> y;
    /*  1 -> x; /* this is a comment */ 2 -> y; */

where in the second example the whole line has been commented out.

Any occurrence of class 10 or  11 characters other than one  immediately
followed by the other will default to class 3, i.e. to the sign class.
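
The nesting behaviour described above amounts to keeping a depth
counter. The following Python sketch shows the idea; it is not the
itemiser's implementation, strip_bracketed_comments is a hypothetical
name, and strings and character constants are ignored for simplicity.

```python
def strip_bracketed_comments(text, open_seq="/*", close_seq="*/"):
    """Remove (possibly nested) bracketed comments from text."""
    kept = []
    depth = 0          # number of comments currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == open_seq:
            depth += 1                 # class 10 then class 11: open
            i += 2
        elif pair == close_seq and depth > 0:
            depth -= 1                 # class 11 then class 10: close
            i += 2
        else:
            if depth == 0:
                kept.append(text[i])   # outside any comment: keep
            i += 1
    if depth > 0:
        # Corresponds to the unterminated-comment error described later.
        raise ValueError("unterminated bracketed comment")
    return "".join(kept)
```

Applied to a line that is wrapped in an outer /* ... */ pair and also
contains an inner comment, the sketch returns the empty string, since
the inner closing sequence only reduces the depth from 2 to 1.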

4  Backslash in Strings & Character Constants

Various special and non-printable  characters (e.g. control  characters)
can be  represented inside  strings and  character constants  using  the
character "\" (backslash) combined with other characters, as follows:

4.1  Control Characters
The following sequences are available for the most commonly used control
characters:

        Seq     Dec    Hex   Name
        ---     ---    ---   ----
        \b        8     8    backspace
        \t        9     9    tab
        \n       10     A    newline
        \r       13     D    carriage return
        \e       27    1B    escape
        \s       32    20    space

Additionally, any of the control characters  ASCII 0 - 31 and ASCII  127
can be obtained by following the "\" with "^" (up-arrow) and one of the
characters

            @  A-Z  [  \  ]  ^  _  ? a-z

These sequences are:

        Seq     Dec    Hex   Name
        ---     ---    ---   ----
        \^@       0     0    NUL
        \^A       1     1    Ctrl-A      (also \^a)
        \^B       2     2    Ctrl-B      (also \^b)
        ...     ...    ...   ...
        \^Z      26    1A    Ctrl-Z      (also \^z)
        \^[      27    1B    ESC
        ...     ...    ...   ...
        \^_      31    1F
        \^?     127    7F    DEL
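
The codes in this table follow the usual ASCII convention: the control
code is the low 5 bits of the character after "^", with "?" as a special
case for DEL. A minimal Python sketch (control_code is a hypothetical
name, not a Poplog procedure):

```python
def control_code(ch):
    """Return the character code denoted by \\^ch."""
    if ch == "?":
        return 127                      # DEL is the one exception
    if ch in "@[\\]^_" or "A" <= ch <= "Z" or "a" <= ch <= "z":
        return ord(ch) & 0x1F           # keep the low 5 bits
    raise ValueError(f"{ch!r} cannot follow \\^")
```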

4.2  Ved Graphics Characters
The Ved editor  defines a standard  set of codes  to represent  graphics
characters; these  consist  of 15  line-drawing  characters plus  a  few
others. They are represented by sequences with "G" after the "\", viz:

        Seq     Dec    Hex   Name
        ---     ---    ---   ----

        \Gle    129    81    left end of horizontal line
        \Gre    130    82    right end of horizontal line
        \Gbe    132    84    bottom end of vertical line
        \Gte    136    88    top end of vertical line

        \Gtl    137    89    top left corner junction
        \Gtr    138    8A    top right corner junction
        \Gbl    133    85    bottom left corner junction
        \Gbr    134    86    bottom right corner junction

        \Glt    141    8D    left T-junction
        \Grt    142    8E    right T-junction
        \Gtt    139    8B    top T-junction
        \Gbt    135    87    bottom T-junction

        \G-     131    83    full horizontal line
        \G|     140    8C    full vertical line
        \G+     143    8F    crossed full horizontal/vertical lines

        \Go     144    90    degree sign
        \G#     145    91    diamond
        \G.     146    92    centred dot

Note  that  the  15  line-drawing   characters  are  all  built  up   by
superimposing combinations of the  half-line 'end' characters Gle,  Gre,
Gte and Gbe  (e.g. G-  is Gre combined  with Gle).  Moreover, the  'end'
characters are encoded  with single  1s in  bits 0, 1,  2 and  3 of  the
character codes,  which means  that the  other characters  are  produced
simply by or'ing together the appropriate combination. For example,

        `\G-`  =  `\Gre` || `\Gle`
        `\Gtl` =  `\Gte` || `\Gle`
        `\Gtt` =  `\G-`  || `\Gte`

etc. However, also  note that  terminals, etc which  support display  of
these  line-drawing  graphics  do   not  support  the  half-line   'end'
characters; these are  therefore always displayed  as the  corresponding
full-line characters G- or G|.

(The reason for representing the  'end' characters as separate codes  is
to make it easy for facilities like * veddrawline to produce the correct
combined character  when 'overdrawing'  one character  on another.  E.g.
although Gle will display as G-, if overdrawn with Gte it will turn into
Gtl, whereas G- overdrawn with Gte will become Gtt.)
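
The bit encoding can be checked mechanically. The Python sketch below
uses only the codes from the table above; the names G, NAME, overdraw
and displayed_as are hypothetical and are not part of Ved.

```python
# Codes of the 15 line-drawing characters, from the table above.
G = {
    "le": 0x81, "re": 0x82, "be": 0x84, "te": 0x88,
    "tl": 0x89, "tr": 0x8A, "bl": 0x85, "br": 0x86,
    "lt": 0x8D, "rt": 0x8E, "tt": 0x8B, "bt": 0x87,
    "-": 0x83, "|": 0x8C, "+": 0x8F,
}
NAME = {code: name for name, code in G.items()}


def overdraw(first, second):
    """Combine two line-drawing characters by or'ing their codes."""
    return NAME[G[first] | G[second]]


def displayed_as(name):
    """Half-line 'end' characters are shown as the full-line character."""
    return {"le": "-", "re": "-", "be": "|", "te": "|"}.get(name, name)
```

For instance, overdraw("le", "te") gives "tl" even though "le" on its
own is displayed as "-", which is the veddrawline case described above.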

4.3  Ved Special Space Characters
In addition to the graphics characters, Ved also defines several special
kinds of space and a special newline; these (together with the ISO Latin
'no-break' space character) are represented by sequences beginning  "\S"
and "\N", viz:

        Seq     Dec    Hex   Name
        ---     ---    ---   ----
        \Sh     154    9A    Ved hair space
        \Nt     155    9B    Trailing newline
        \Sf     156    9C    Format-control space
        \Ss     157    9D    Ved no-break space
        \St     158    9E    Trailing space
        \Sp     159    9F    Prompt-marker space
        \Sn     160    A0    ISO Latin no-break space

(See Special Spaces Etc in REF * VEDPROCS for more information.)

4.4  Explicit Integer Character Code
"\" may also be followed by "(" to signal an explicit integer value  for
a character, the integer being terminated by ")". E.g.

        '\(255)abc'

is a string containing the characters 255, `a`, `b` and `c`. The integer
obeys the normal itemiser syntax, so can be radixed, etc. It must be  >=
0 and <= 255.

4.5  Backslash in Words
As described  under  Operation  of  Character  Classes  above,  all  the
foregoing "\" sequences are also valid as part of a word when  following
any alphabeticiser (class 12) character. E.g. if "\" has this class then

        \e\Gtl\(255)

is a word containing the character codes 27, 137 and 255.

However, backslash sequences representing Ved character attributes
(described below) are valid only in strings and character constants, NOT
in words.

4.6  Ved Character Attributes
From Version  14.11  of  Poplog,  integer  character  values  have  been
extended to  24  bits, and  a  new  datatype, the  'dstring',  has  been
introduced  to  allow  characters-with-attributes   to  be  stored   and
retrieved. (See REF * STRINGS.)
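
As a rough picture of the 24-bit layout (top 8 bits for attributes,
bottom 16 bits for the character code, as stated under incharitem
below), here is a small Python sketch. split_char is a hypothetical
helper, and no meaning is assigned to individual attribute bits.

```python
def split_char(c):
    """Split a 24-bit integer character into (attribute_bits, char_code)."""
    if not 0 <= c < (1 << 24):
        raise ValueError("integer characters are limited to 24 bits")
    return (c >> 16) & 0xFF, c & 0xFFFF
```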

Although the basic system does not  give any interpretation to the  (top
8) attribute bits in characters, the Ved editor does: these are  defined
in INCLUDE * VEDSCREENDEFS. In strings and character constants, the
sequence

        \[attributes]

may be used to attach Ved attribute bits to the succeeding character,
where attributes is a sequence of the following (in any order):

           b       sets VEDCMODE_BOLD         (i.e. Bold)
           u       sets VEDCMODE_UNDERLINE    (i.e. Underline)
         a or i    sets VEDCMODE_ALTFONT      (i.e. Alt Font/Italic)
           f       sets VEDCMODE_BLINK        (i.e. Flashing)
           A       sets VEDCMODE_ACTIVE       (selects colours 0A - 7A)
         0 to 7    sets colour number 0 to  7

For example,

            `\[bi5]X`

is a character constant for a bold italic `X` in colour 5.

Note that the following character itself may be a backslash sequence. In
a character constant, the following character may be omitted  altogether
to give just the attribute bits (i.e. as if with a NUL character).

In a string,  curly brackets may  be used instead  of square ones.  This
means apply the attributes to all characters following, e.g.

            'abc\{bi5}defg'

would attach the same set of attributes to 'defg'. However, \[...] takes
precedence for the next character, so that in

            'abc\{bi5}de\[u]fg'

the "f" would have  only the 'underline' attribute.  On the other  hand,
the additional characters "+" and "-"  may appear in the attributes  (in
either brackets) to indicate that following options are to be added  to,
or subtracted from, those currently in force. For example,

            'abc\{bi5}de\[+u]fg'

would add 'underline' to the others rather than replacing them (for  the
"f", that  is). Note  that when  any colour  number is  specified,  this
always replaces any existing colour; thus in


the -5 is unnecessary, since


gives the same result.

Finally (of  course), a  string literal  that contains  characters  with
non-zero attribute  bits will  result  in the  production of  a  dstring
rather than an ordinary one.

4.7  Ved Characters with Associated Data
From Version 15+ of Poplog, the  concept of 'character' in Ved has  been
further extended  to include  not just  'character-with-attributes'  but
'character-with-attributes-plus-associated-data'. By itself, a character
with associated data is represented by a pair of the form

        conspair(integer-char, data-item)

where integer-char is the ordinary  integer character, and data-item  is
any associated item. Such characters  are stored in "vedstrings",  which
are actually  just strings  or dstrings,  but with  any associated  data
items  held  by  entries   in  the  property  vedstring_data_prop   (see
Vedstrings in REF * STRINGS).

The \[attributes] escape sequence in quoted strings has been extended to
allow the construction  of vedstrings,  but with  associated data  items
being limited to quoted strings only. To embed a string on a  character,
simply include a quoted string in attributes, e.g.

            'abc\{bi5}de\['EMBEDDED STRING']fg'

will attach the string 'EMBEDDED STRING' to the character "f" (note this
is permissible only inside \[...], not inside \{...} ).

If (as in the above example), attributes contains only a quoted  string,
other character attributes  currently in  force are  unaffected for  the
character. (Hence "f" also gets bold, italic, colour 5. Compare


which would set 0 attributes on the "f". You can use

            'abc\{bi5}de\[0'EMBEDDED STRING']fg'

to force 0 attributes with an embedded string.)

Embedded strings are also applicable (if less useful) with character
constants, e.g.

            `\[bi5'EMBEDDED STRING']X`

results in the pair

            conspair(`\[bi5]X`, 'EMBEDDED STRING')

5  Associated Procedures

incharitem(char_rep) -> item_rep                             [procedure]
        Returns an item repeater  item_rep constructed on the  character
        repeater char_rep, i.e. item_rep is a procedure which each  time
        it is called returns the next item produced from the  characters
        supplied by char_rep, or termin when there are no more to  come.
        item_rep is initially set up to use the global character  table;
        by use of item_chartype (see below) item_rep can be made to  use
        its own local table.

        (Note that from Poplog 14.11, integer character values are
        allowed to be  24-bit, where  the top 8  bits represent  display
        attributes, and  the  bottom  16 are  the  character  code  (see
        REF * STRINGS). However, as for strings, characters produced  by
        char_rep are restricted  to 8-bit  character codes  -- that  is,
        they may have non-zero attribute  bits (which are ignored),  but
        the bottom 16 bits must be in the range 0 - 16:FF.)

popnewline -> bool                                            [variable]
bool -> popnewline
        If true, this boolean variable causes item repeaters produced by
        incharitem to change the class  of the newline character  (ASCII
        10) to be 5 (i.e. a separator), so that instead of being
        ignored as a  space-type character, a  newline will produce  the
        word whose single character is a newline. (Default value false)

pop_longstrings -> bool                                       [variable]
bool -> pop_longstrings
        A boolean  variable controlling  reading  of quoted  strings  by
        incharitem item repeaters. If this is false then quoted  strings
        cannot contain  a  newline  unless preceded  by  "\".  Otherwise
        strings can extend over several  lines without the backslash  at
        the end of each line. (Default value false)

isincharitem(item_rep) -> item_rep_or_false                  [procedure]
isincharitem(item_rep, true) -> char_rep_or_false
        Used to test whether item_rep is a procedure created by applying
        * incharitem to  a character  repeater, or  whether the  current
        value of * proglist is based on an item repeater created using
        incharitem.

        If item_rep  was created  using incharitem  then the  result  is
        item_rep itself. If item_rep is * readitem or * itemread and the
        current proglist is a dynamic list, then its generator procedure
        is examined, and  if derived  from incharitem  the generator  is
        returned. Otherwise false is returned.

        If the  optional  boolean  second argument  is  true,  then  the
        underlying character repeater is returned instead of the item
        repeater.

item_chartype(char)           -> N                           [procedure]
item_chartype(char, item_rep) -> N
N -> item_chartype(char)
N -> item_chartype(char, item_rep)
        The base procedure returns the integer class number N associated
        with the character char, either  for the global character  table
        (the first form) or for the item repeater item_rep (the second
        form).

        The updater assigns the  class number N  to the character  char,
        either for the global  character table (the  first form) or  for
        the item repeater  item_rep only  (the second  form). Note  that
        once an assignment has been done for a particular item  repeater
        item_rep, it  will  no longer  use  the global  table,  so  that
        subsequent changes to this will not be reflected in item_rep. On
        the other hand, changes to the global table WILL be reflected in
        all item repeaters which have not been locally changed.

        For both  base and  updater, the  item repeater  item_rep  (when
        supplied) may be either a  procedure produced by incharitem,  or
        one of the procedures itemread or readitem. In the latter  case,
        the item repeater at the end of proglist is used.

        Note that any attributes (i.e. top 8 bits) on char are ignored.
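
The relationship between the global table and a per-repeater table can
be modelled as a snapshot taken on first local assignment. The Python
sketch below illustrates that behaviour only; ItemRepeaterModel and its
methods are hypothetical names, and the real tables are not dictionaries.

```python
class ItemRepeaterModel:
    """Models which character table an item repeater consults."""

    def __init__(self, global_table):
        self._global = global_table    # shared by all repeaters
        self._local = None             # created on first local update

    def chartype(self, ch):
        table = self._global if self._local is None else self._local
        return table[ch]

    def set_chartype(self, ch, n):
        if self._local is None:
            # From now on this repeater ignores the global table.
            self._local = dict(self._global)
        self._local[ch] = n
```

A repeater that has never been updated locally keeps seeing later
changes to the global table, while one that has been updated does not.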

item_newtype() -> N                                          [procedure]
        Returns an  integer N  ( >  12  ) representing  a new  class  of
        characters that form words only with members of that class.  The
        value returned  can  be given  to  item_chartype to  assign  any
        desired characters into the new class.

nextchar(item_rep) -> char                                   [procedure]
char_or_string -> nextchar(item_rep)
        Returns (and removes) the next character in the input stream for
        the item  repeater item_rep  -- this  may or  may not  call  the
        character repeater  on which  item_rep  is based,  depending  on
        whether there are any characters buffered inside item_rep.

        The updater adds character(s) back onto the front of the current
        input stream for the  item repeater item_rep. If  char_or_string
        is an integer character, then this is added; otherwise it must
        be a string, in which case all the characters of the string are
        added.

        item_rep may take the same values as for item_chartype.
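
The buffering behaviour of nextchar and its updater can be pictured as a
pushback stack in front of the character repeater. This Python sketch is
a model only: PushbackChars is a hypothetical name, termin is modelled
by whatever the wrapped repeater returns at end of input, and it is
assumed that the first character of a pushed-back string is the first
one read again.

```python
class PushbackChars:
    """A character source with pushback, modelling nextchar."""

    def __init__(self, char_rep):
        self._char_rep = char_rep      # called only when the buffer is empty
        self._buffer = []              # last element is the next character

    def next(self):
        if self._buffer:
            return self._buffer.pop()
        return self._char_rep()

    def push_back(self, char_or_string):
        if isinstance(char_or_string, int):
            self._buffer.append(char_or_string)
        else:
            # Push in reverse so the string is read back in order.
            self._buffer.extend(ord(c) for c in reversed(char_or_string))
```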

6  Exceptions Raised

This section describes exceptions generated by procedures in this file.

incharitem-num:syntax                                     [exception ID]
        (Error) An incharitem item repeater  detected a syntax error  in
        an input number.

incharitem-bsseq:syntax                                   [exception ID]
        (Error)  An  incharitem  item   repeater  detected  an   invalid
        backslash escape sequence in a string or character constant.

incharitem-attr:syntax                                    [exception ID]
        (Error) An  incharitem  item   repeater  detected  an   invalid
        attribute specification in a string or character constant.

incharitem-utcomm:syntax                                  [exception ID]
        (Error) An incharitem item repeater  failed to find the  closing
        */ bracket in a /* comment sequence.

incharitem-uts:syntax                                     [exception ID]
        (Error) An incharitem item repeater  failed to find the  closing
        quote for a string or character constant.

incharitem-nextchar:type-incharitem_rep                   [exception ID]
        (Error) isincharitem  was not  true  for the  item_rep  argument
        given to nextchar or item_chartype.

+-+ C.all/ref/itemise