REF ITEMISE John Gibson Nov 1995
COPYRIGHT University of Sussex 1995. All Rights Reserved.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<< >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<< ITEMISATION AND >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<< LEXICAL SYNTAX >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<< >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
This file deals with the Pop-11 itemiser, which splits up a stream of
characters stream of items according to 12 pre-defined character
classes. Each item is one of the following types: word, string, integer
(or biginteger), ratio, floating-point (decimal or ddecimal), or complex
number: rules are given for the recognition of each of these. The
representation of Ved graphic characters and control codes in text is
explained, as is the use of Ved character attributes.
CONTENTS - (Use <ENTER> g to access required sections)
1 Character Classes
2 Syntax of Items Produced by the Itemiser
2.1 Word
2.2 String
2.3 Integer
2.4 Floating-Point
2.5 Ratio
2.6 Complex Number
3 Operation of Character Classes
3.1 Alphabeticiser (Class 12)
3.2 End-of-line Comments (Class 9)
3.3 Bracketed Comments (Classes 10 and 11)
4 Backslash in Strings & Character Constants
4.1 Control Characters
4.2 Ved Graphics Characters
4.3 Ved Special Space Characters
4.4 Explicit Integer Character Code
4.5 Backslash in Words
4.6 Ved Character Attributes
4.7 Ved Characters with Associated Data
5 Associated Procedures
6 Exceptions Raised
--------------------
1 Character Classes
--------------------
The itemiser procedure returned by incharitem (see below) takes a stream
of input characters produced by a character repeater procedure and turns
it into a stream of items for compilation, or any other use. To effect
this process, each ASCII character value from 0 - 255 has associated
with it an integer defining the class of that character, the class of a
character governing how it is treated.
The 12 pre-defined classes are described below. Note that the class
names (and examples of them) are determined by the normal assignment of
classes to characters, although by using item_chartype the user can
assign any character to any desired class, either globally or for a
particular item repeater (thus for example, the letter "A" can made to
behave as if it were a separator in class 5).
Class Description
----- -----------
1 Alphabetic - the letters a-z, A-Z ;
2 Numeric - the numerals 0-9;
3 Signs -- characters like "+", "-", "#", "$", "&" etc; A
character in classes 10 and 11 (bracketed comment 1 & 2)
will default to this class if not occurring in the context
of such a comment.
4 Underscore, i.e. "_" ;
5 Separators- the characters ".", ",", ";", """, "%" and the
brackets "[", "]", "{", "}". Control characters are also
included in this class (except for those in class 6), as are
all characters 128-255;
6 Spaces - the space, tab and newline characters;
7 String quote - the apostrophe character;
8 Character quote - the character "`";
9 End-of-line comment character - the character ";" (but see
below);
10 Bracketed comment or sign, 1st character - the character
"/" ;
11 Bracketed comment or sign, 2nd character - the character
"*" ;
12 Alphabeticiser - this is special class that forces the next
character in the input stream to be of class alphabetic,
i.e. class 1 - see below. "\" (backslash) may be given this
type by default in later versions of Poplog.
New classes other than these can be defined with the procedure
item_newtype (see below under Associated Procedures).
-------------------------------------------
2 Syntax of Items Produced by the Itemiser
-------------------------------------------
The itemiser splits up a stream of characters into a stream of items,
each item being one of the following types:
¤ word
¤ string
¤ integer (or biginteger)
¤ ratio
¤ floating-point (decimal or ddecimal)
¤ complex number
This is done according to the following rules:
2.1 Word
---------
A word is represented by either
¤ a sequence of alphabetic or numeric characters beginning
with an alphabetic one, e.g. "abc123", "X45" ;
¤ a sequence of sign characters, e.g. "+", "&$+" ;
¤ a sequence of words produced by either of the preceding,
joined by underscores, e.g. "fast_+";
¤ a single separator character, e.g. "[" ;
¤ a sequence of characters in a new class created by
item_newtype.
2.2 String
-----------
A string is represented by any sequence of characters starting and
ending with string quotes, e.g. 'abcdefgh12&3'. If the characters of
the string extend over more than one line, the newline character at the
end of the line must be preceded by a "\" (backslash), unless
pop_longstrings is true, i.e. if pop_longstrings is false then an
unescaped newline causes a mishap.
There is also additional syntax inside strings for representing special
characters, e.g. a newline can be inserted as '\n'. See Backslash in
Strings & Character Constants below.
2.3 Integer
------------
An integer is represented by either
¤ A sequence of digits, optionally preceded by a minus sign
e.g. 12345, -789;
¤ A number preceded by an integer and a colon (:), meaning
that the number is to be taken to the base of the integer,
e.g. `2:1101` represents 13 as a binary number. The integer
base must be in the range 2-36; if greater than 10, the
letters A-Z (uppercase only) can be used in the number to
represent digit values from 10 to 35, e.g. `16:1FFA`
represents 8186 as a hexadecimal number.
If a minus sign is present, this may either follow the radix
or precede it, e.g. both of the following: 8:-77 and and
-8:77 are valid.
¤ A character constant, giving the integer code for that
character. This is any character preceded and followed by a
character quote. E.g. `a` gives the ASCII value for
lowercase "a" (97). See also Backslash in Strings &
Character Constants below.
Except in the character constant case, an integer may optionally be
followed by the letter 'e' and a (signed or unsigned) integer to
indicate an exponent specification, i.e. NeI will produce
N * (b ** I)
where b is the radix of N. This may actually result in the production of
a ratio rather than an integer, e.g.
2:110e5 = 2:110 * (2 ** 5) = 192
23e-2 = 23 * (10 ** -2) = 23_/100
If the integer read in is too large to be represented as a "simple"
object (see REF * DATA) then a biginteger is created. E.g.
isinteger(123456789) =>
** <true>
isinteger(123456789123456789) =>
** <false>
isbiginteger(123456789123456789) =>
** <true>
2.4 Floating-Point
-------------------
A floating-point literal is a sequence of numeric characters containing
a period, e.g. `12.347`; as with integers, this can also be prefixed
with a base, i.e. an integer followed by a colon. (The whole number,
including fractional places, is taken to this base.)
As with integers, an exponent specification may follow, but in this case
any of the letters 'e', 's' or 'd' may be used. That is
NeI NsI NdI
all produce
N * (b ** I)
where b is the base of N. The difference between them is that 'e' and
'd' specify a double-float (ddecimal), whereas 's' results in a
simple-float (decimal). Thus
23.0e-2 = 23.0 * (10 ** -2) = 0.23 (ddecimal)
2:11.1d5 = 2:11.1 * (2 ** 5) = 112.0 (ddecimal)
56.2s+3 = 56.2 * (10 ** 3) = 56200.0 (decimal)
If the exponent specification is omitted, the result is always a
double-float (ddecimal), regardless of the value of popdprecision.
2.5 Ratio
----------
A ratio is two integers (numerator and denominator) joined by the
character sequence `_/`, e.g. 2_/3, -467_/123678. If the numerator is
preceded by a radix, then this radix applies also to the denominator;
the denominator itself must not have a radix or preceding minus sign.
Note that owing to the rule of 'rational canonicalisation' the resulting
object will actually be a ratio with the greatest common denominator
divided out of numerator and denominator, or an integer if this would
make the denominator equal to 1.
2.6 Complex Number
-------------------
A complex number is any two of the above kinds of number (the real part
and the imaginary part) joined by the character sequence `_+:` or `_-:`,
e.g. 2_+:3, 1.2_+:8.9, 5_/4_-:3_/2.
The imaginary part must not have either a radix specification or a
preceding minus sign; as with ratios, the redix of the first number (if
any) carries over to the second, and the sign of the imaginary part is
determined by the joining sequence, `_+:` or `_-:`. If an explicit radix
is specified, then this must PRECEDE any minus sign on the real part
(that is, 16:-A_+:B is valid, but not -16:A_+:B).
The two numbers may be of different types, although when either is a
floating-point the actual result will have both parts coerced to the
same type of float; in addition, when both parts are rational, the
result will be a rational rather than a complex if the imaginary part is
integer 0.
---------------------------------
3 Operation of Character Classes
---------------------------------
The itemiser reads characters and produces items from them according to
the rules given above; all characters in the space class are ignored,
and only serve to delineate item boundaries (but see popnewline below).
The effect of other classes not mentioned in the preceding rules, i.e.
the comment classes and the alphabeticiser, are as follows:
3.1 Alphabeticiser (Class 12)
------------------------------
An occurrence of a character of this class causes the next character
read to be interpreted as having class alphabetic, regardless of its
actual class. Assuming that \ has this class, this means that for
example
A\+B\-C \&_\[\{\( \12345
are all valid 5-character words. In addition, the following character is
also interpreted as for the character following \ in strings and
character constants (see Backslash in Strings & Character Constants
below), thus enabling non-printable characters to have class alphabetic,
e.g.
\nA\^A\^Z\r
is a word consisting of the characters newline, A, Ctrl-A, Ctrl-Z and
carriage return (ASCII 10, 65, 1, 26, 13).
3.2 End-of-line Comments (Class 9)
-----------------------------------
A character in this class causes the rest of the current line upto a
newline to be treated as a comment and ignored. Normally, this character
is semicolon ";" and, IN THIS CASE ONLY, 3 adjacent semicolons are
actually required for a comment. If a semicolon occurs by itself, or
only adjacent to one other, then it is treated as a separator (class 5).
(This is due to the Pop-11 compiler needing ";" for punctuation, and the
fact that ";;;" has always been the Pop-11 comment escape.)
3.3 Bracketed Comments (Classes 10 and 11)
-------------------------------------------
These two classes provide for comments which begin with a 2-character
sequence like `/*` and end with the reversed sequence `*/`, and which
otherwise occupy any number of characters or lines in between. The start
of such a comment is therefore recognised as a class 10 character
immediately followed by a class 11 character, after which characters are
read and discarded until the sequence class 11 followed by class 10 is
encountered. During the reading of the comment another occurrence of
class 10, class 11 is taken as a nested comment and so will correctly
account for such nesting. For example (assuming / and * have classes 10
and 11 respectively):
1 -> x; /* this is a comment */ 2 -> y;
/* 1 -> x; /* this is a comment */ 2 -> y; *
where in the second example the whole line has been commented out.
Any occurrence of class 10 or 11 characters other than one immediately
followed by the other will default to class 3, i.e. to the sign class.
---------------------------------------------
4 Backslash in Strings & Character Constants
---------------------------------------------
Various special and non-printable characters (e.g. control characters)
can be represented inside strings and character constants using the
character "\" (backslash) combined with other characters, as follows:
4.1 Control Characters
-----------------------
The following sequences are available for the most commonly used control
characters:
Seq Dec Hex Name
--- --- --- ----
\b 8 8 backspace
\t 9 9 tab
\n 10 A newline
\r 13 D carriage return
\e 27 1B escape
\s 32 20 space
Additionally, any of the control characters ASCII 0 - 31 and ASCII 127
can be got by following the "\" with "^" (up-arrow) and one of the
characters
@ A-Z [ \ ] ^ _ ? a-z
These sequences are:
Seq Dec Hex Name
--- --- --- ----
\^@ 0 0 NUL
\^A 1 1 Ctrl-A (also \^a)
\^B 2 2 Ctrl-B (also \^b)
... ... ... ...
\^Z 26 1A Ctrl-Z (also \^z)
\^[ 27 1B ESC
... ... ... ...
\^_ 31 1F
\^? 127 7F DEL
4.2 Ved Graphics Characters
----------------------------
The Ved editor defines a standard set of codes to represent graphics
characters; these consist of 15 line-drawing characters plus a few
others. They are represented by sequences with "G" after the "\", viz:
Seq Dec Hex Name
--- --- --- ----
\Gle 129 81 left end of horizontal line
\Gre 130 82 right end of horizontal line
\Gbe 132 84 bottom end of vertical line
\Gte 136 88 top end of vertical line
\Gtl 137 89 top left corner junction
\Gtr 138 8A top right corner junction
\Gbl 133 85 bottom left corner junction
\Gbr 134 86 bottom right corner junction
\Glt 141 8D left T-junction
\Grt 142 8E right T-junction
\Gtt 139 8B top T-junction
\Gbt 135 87 bottom T-junction
\G- 131 83 full horizontal line
\G| 140 8C full vertical line
\G+ 143 8F crossed full horizontal/vertical lines
\Go 144 90 degree sign
\G# 145 91 diamond
\G. 146 92 centred dot
Note that the 15 line-drawing characters are all built up by
superimposing combinations of the half-line 'end' characters Gle, Gre,
Gte and Gbe (e.g. G- is Gre combined with Gle). Moreover, the 'end'
characters are encoded with single 1s in bits 0, 1, 2 and 3 of the
character codes, which means that the other characters are produced
simply by or'ing together the appropriate combination. For example,
`\G-` = `\Gre` || `\Gle`
`\Gtl` = `\Gte` || `\Gle`
`\Gtt` = `\G-` || `\Gte`
etc. However, also note that terminals, etc which support display of
these line-drawing graphics do not support the half-line 'end'
characters; these are therefore always displayed as the corresponding
full-line characters G- or G|.
(The reason for representing the 'end' characters as separate codes is
to make it easy for facilities like * veddrawline to produce the correct
combined character when 'overdrawing' one character on another. E.g.
although Gle will display as G-, if overdrawn with Gte it will turn into
Gtl, whereas G- overdrawn with Gte will become Gtt.)
4.3 Ved Special Space Characters
---------------------------------
In addition to the graphics characters, Ved also defines several special
kinds of space and a special newline; these (together with the ISO Latin
'no-break' space character) are represented by sequences beginning "\S"
and "\N", viz:
Seq Dec Hex Name
--- --- --- ----
\Sh 155 9A Ved hair space
\Nt 155 9B Trailing newline
\Sf 156 9C Format-control space
\Ss 157 9D Ved no-break space
\St 158 9E Trailing space
\Sp 159 9F Prompt-marker space
\Sn 160 A0 ISO Latin no-break space
(See Special Spaces Etc in REF * VEDPROCS for more information.)
4.4 Explicit Integer Character Code
------------------------------------
"\" may also be followed by "(" to signal an explicit integer value for
a character, the integer being terminated by ")". E.g.
'\(255)abc'
is a string containing the characters 255, `a`, `b` and `c`. The integer
obeys the normal itemiser syntax, so can be radixed, etc. It must be >=
0 and <= 255.
4.5 Backslash in Words
-----------------------
As described under Operation of Character Classes above, all the
foregoing "\" sequences are also valid as part of a word when following
any alphabeticiser (class 12) character. E.g, if "\" has this class then
\e\Gtl\(255)
is word containing the character codes 27, 137 and 255.
However, backslash sequences representing Ved character attributes
(described below) are valid only in strings and characters constants,
NOT in words.
4.6 Ved Character Attributes
-----------------------------
From Version 14.11 of Poplog, integer character values have been
extended to 24 bits, and a new datatype, the 'dstring', has been
introduced to allow characters-with-attributes to be stored and
retrieved. (See REF * STRINGS.)
Although the basic system does not give any interpretation to the (top
8) attribute bits in characters, the Ved editor does: these are defined
in INCLUDE * VEDSCREENDEFS. In strings and character constants, the
sequence
\[attributes]
may be used to attach Ved attribute bits to the succeeding character,
where attributes is a sequence of the following (in any order):
b sets VEDCMODE_BOLD (i.e. Bold)
u sets VEDCMODE_UNDERLINE (i.e. Underline)
a or i sets VEDCMODE_ALTFONT (i.e. Alt Font/Italic)
f sets VEDCMODE_BLINK (i.e. Flashing)
A sets VEDCMODE_ACTIVE (selects colours 0A - 7A)
0 to 7 sets colour number 0 to 7
For example,
`\[bi5]X`
is a character constant for a bold italic `X` in colour 5.
Note that the following character itself may be a backslash sequence. In
a character constant, the following character may be omitted altogether
to give just the attributes bits (i.e. as if with a NUL character).
In a string, curly brackets may be used instead of square ones. This
means apply the attributes to all characters following, e.g.
'abc\{bi5}defg'
would attach the same set of attributes to 'defg'. However, \[...] takes
precedence for the next character, so that in
'abc\{bi5}de\[u]fg'
the "f" would have only the 'underline' attribute. On the other hand,
the additional characters "+" and "-" may appear in the attributes (in
either brackets) to indicate that following options are to be added to,
or subtracted from, those currently in force. For example,
'abc\{bi5}de\[+u]fg'
would add 'underline' to the others rather than replacing them (for the
"f", that is). Note that when any colour number is specified, this
always replaces any existing colour; thus in
'abc\{bi5}de\[-5+7]fg'
the -5 is unnecessary, since
'abc\{bi5}de\[+7]fg'
gives the same result.
Finally (of course), a string literal that contains characters with
non-zero attribute bits will result in the production of a dstring
rather than an ordinary one.
4.7 Ved Characters with Associated Data
----------------------------------------
From Version 15+ of Poplog, the concept of 'character' in Ved has been
further extended to include not just 'character-with-attributes' but
'character-with-attributes-plus-associated-data'. By itself, a character
with associated data is represented by a pair of the form
conspair(integer-char, data-item)
where integer-char is the ordinary integer character, and data-item is
any associated item. Such characters are stored in "vedstrings", which
are actually just strings or dstrings, but with any associated data
items held by entries in the property vedstring_data_prop (see
Vedstrings in REF * STRINGS).
The \[attributes] escape sequence in quoted strings has been extended to
allow the construction of vedstrings, but with associated data items
being limited to quoted strings only. To embed a string on a character,
simply include a quoted string in attributes, e.g.
'abc\{bi5}de\['EMBEDDED STRING']fg'
will attach the string 'EMBEDDED STRING' to the character "f" (note this
is permissible only inside \[...], not inside \{...} ).
If (as in the above example), attributes contains only a quoted string,
other character attributes currently in force are unaffected for the
character. (Hence "f" also gets bold, italic, colour 5. Compare
'abc\{bi5}de\[]fg'
which would set 0 attributes on the "f". You can use
'abc\{bi5}de\[0'EMBEDDED STRING']fg'
to force 0 attributes with an embedded string.)
Embedded strings are also applicable (if less useful) with character
constants:
`\[bi5'EMBEDDED STRING']X`
results in the pair
conspair(`\[bi5]X`, 'EMBEDDED STRING')
------------------------
5 Associated Procedures
------------------------
incharitem(char_rep) -> item_rep [procedure]
Returns an item repeater item_rep constructed on the character
repeater char_rep, i.e. item_rep is a procedure which each time
it is called returns the next item produced from the characters
supplied by char_rep, or termin when there are no more to come.
item_rep is initially set up to use the global character table;
by use of item_chartype (see below) item_rep can be made to use
its own local table.
(Note that from Poplog 14.11, integer characters values are
allowed to be 24-bit, where the top 8 bits represent display
attributes, and the bottom 16 are the character code (see
REF * STRINGS). However, as for strings, characters produced by
char_rep are restricted to 8-bit character codes -- that is,
they may have non-zero attribute bits (which are ignored), but
the bottom 16 bits must be in the range 0 - 16:FF.)
popnewline -> bool [variable]
bool -> popnewline
If true, this boolean variable causes item repeaters produced by
incharitem to change the class of the newline character (ASCII
10) to be 5, (i.e. a separator), so that instead of being
ignored as a space-type character, a newline will produce the
word whose single character is a newline. (Default value false)
pop_longstrings -> bool [variable]
bool -> pop_longstrings
A boolean variable controlling reading of quoted strings by
incharitem item repeaters. If this is false then quoted strings
cannot contain a newline unless preceded by "\". Otherwise
strings can extend over several lines without the backslash at
the end of each line. (Default value false)
isincharitem(item_rep) -> item_rep_or_false [procedure]
isincharitem(item_rep, true) -> char_rep_or_false
Used to test whether item_rep is a procedure created by applying
* incharitem to a character repeater, or whether the current
value of * proglist is based on an item repeater created using
incharitem.
If item_rep was created using incharitem then the result is
item_rep itself. If item_rep is * readitem or * itemread and the
current proglist is a dynamic list, then its generator procedure
is examined, and if derived from incharitem the generator is
returned. Otherwise false is returned.
If the optional boolean second argument is true, then the
underlying character repeater is returned instead of the item
repeater.
item_chartype(char) -> N [procedure]
item_chartype(char, item_rep) -> N
N -> item_chartype(char)
N -> item_chartype(char, item_rep)
The base procedure returns the integer class number N associated
with the character char, either for the global character table
(the first form) or for the item repeater item_rep (the second
form).
The updater assigns the class number N to the character char,
either for the global character table (the first form) or for
the item repeater item_rep only (the second form). Note that
once an assignment has been done for a particular item repeater
item_rep, it will no longer use the global table, so that
subsequent changes to this will not be reflected in item_rep. On
the other hand, changes to the global table WILL be reflected in
all item repeaters which have not been locally changed.
For both base and updater, the item repeater item_rep (when
supplied) may be either a procedure produced by incharitem, or
one of the procedures itemread or readitem. In the latter case,
the item repeater at the end of proglist is used.
Note that any attributes (i.e. top 8 bits) on char are ignored.
item_newtype() -> N [procedure]
Returns an integer N ( > 12 ) representing a new class of
characters that form words only with members of that class. The
value returned can be given to item_chartype to assign any
desired characters into the new class.
nextchar(item_rep) -> char [procedure]
char_or_string -> nextchar(item_rep)
Returns (and removes) the next character in the input stream for
the item repeater item_rep -- this may or may not call the
character repeater on which item_rep is based, depending on
whether there are any characters buffered inside item_rep.
The updater adds character(s) back onto the front of the current
input stream for the item repeater item_rep. If char_or_string
is an integer character, then this is added; otherwise it must
be string, in which case all the characters of the string are
added.
item_rep may take the same values as for item_chartype.
--------------------
6 Exceptions Raised
--------------------
This section describes exceptions generated by procedures in this file.
incharitem-num:syntax [exception ID]
(Error) An incharitem item repeater detected a syntax error in
an input number.
incharitem-bsseq:syntax [exception ID]
(Error) An incharitem item repeater detected an invalid
backslash escape sequence in a string or character constant.
incharitem-attr:syntax [exception ID]
(Error) An incharitem item repeater detected an invalid
attribute specification in a string or character constant.
incharitem-utcomm:syntax [exception ID]
(Error) An incharitem item repeater failed to find the closing
*/ bracket in a /* comment sequence.
incharitem-uts:syntax [exception ID]
(Error) An incharitem item repeater failed to find the closing
quote for a string or character constant.
incharitem-nextchar:type-incharitem_rep [exception ID]
(Error) isincharitem was not true for the item_rep argument
given to nextchar or item_chartype.
+-+ C.all/ref/itemise
+-+ Copyright University of Sussex 1995. All rights reserved.