REF STRINGS                                         John Gibson Nov 1995

        COPYRIGHT University of Sussex 1995. All Rights Reserved.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<                             >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<    STRINGS AND CHARACTERS   >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<                             >>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This REF file explains the character set used by Poplog, the predicates
which can be used on these characters and how characters can be located
in strings: the available string creation and manipulation procedures
are listed (note that some string procedures are also applicable to
words). Procedures and predicates relating to other string forms, the
'dstring' and 'vedstring' are also described.

         CONTENTS - (Use <ENTER> g to access required sections)

  1   Introduction
  2   Character Sets
  3   Predicates on Characters
  4   Locating Characters in Strings
  5   Predicates on Strings
  6   Constructing Strings
  7   Accessing String Characters
  8   Display Strings ('Dstrings')
  9   Generic Datastructure/Vector Procedures on (D)Strings
 10   Vedstrings
 11   Regular Expression Pattern Matching
 12   Miscellaneous


---------------
1  Introduction
---------------

Strings in Poplog are indexable 1-dimensional arrays of characters,
where each character is an integer value that (generally) represents an
ASCII/ISO Latin code for a particular symbol. As with all Poplog vector
classes, subscript values for strings number from 1 upwards.

An ordinary string provides one byte to hold each character, which
permits a value in the range 0 - 16:FF (i.e. 0 - 255). However (from
Poplog Version 14.11), characters as integers are actually allowed to be
24-bit, in the range 0 - 16:FFFFFF (0 - 16777215).
Nominally, the bottom (least-significant) 16 bits are the character
code, while the most significant 8 bits represent other attributes
pertaining to the character, as shown:

         23            16 15                             0
        +-------------------------------------------------+
        |   Attributes   |         Character Code         |
        +----------------+--------------------------------+

However, characters assigned into strings are restricted to 8-bit
character codes, i.e. actually look thus:

         23            16 15            8 7              0
        +-------------------------------------------------+
        |   Attributes   |       0       | Character Code |
        +----------------+---------------+----------------+

(There is currently no data type for storing 16-bit character codes, but
the layout of integer characters is designed to allow for this in the
future.)

The attribute part cannot be stored in ordinary strings, and is ignored
for operations on these: a character accessed from an ordinary string
will have a zero attribute part, and assigning a character into an
ordinary string will ignore any attributes.

Thus you need not worry about the attribute bits unless your program
needs to process 'dstrings' (which are the alternate form of strings
that allow the attributes to be stored and retrieved -- see Display
Strings below).

String creation and manipulation procedures available are listed below;
note that some string procedures are also applicable to words.

An (ordinary) string is a particular built-in instance of the general
class of vectors which can be constructed using conskey or the defclass
syntax construct; see REF * KEYS, REF * DEFSTRUCT for details, and
REF * DATA for procedures applicable to strings as vectors in general.

(N.B. Like all byte vectorclasses, strings are guaranteed to be
null-terminated, that is, to have a 0 byte following the last actual
byte of the string. While this is irrelevant to internal Poplog use, it
means that strings can be passed to external C functions without
modification.)

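The character layout described above can be illustrated with ordinary
integer operations. The following is only a sketch: char_code and
char_attributes are hypothetical helper names (not Poplog built-ins),
and the attribute value 1 is chosen arbitrarily.

```pop11
;;; Hypothetical helpers, for illustration only.

define char_code(char) -> code;
    lvars char, code;
    ;;; the bottom 16 bits hold the character code
    char && 16:FFFF -> code
enddefine;

define char_attributes(char) -> attrs;
    lvars char, attrs;
    ;;; bits 16 - 23 hold the attributes
    (char >> 16) && 16:FF -> attrs
enddefine;

;;; The character `a` (97) combined with an attribute byte of 1
lvars c = (1 << 16) || `a`;

char_code(c) =>             ;;; ** 97
char_attributes(c) =>       ;;; ** 1
```

Since an ordinary string stores only the character-code byte, helpers
like these matter only when handling characters taken from dstrings.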

-----------------
2  Character Sets
-----------------

From Poplog Version 14.11, support is provided for using the ISO Latin 1
character set (which is a superset of ASCII, defining extra 8-bit
character codes in the range 16:A0 - 16:FF). Use of Latin 1 is indicated
by the variable pop_character_set having a value of (ASCII character)
`1`, which is its default. (Potentially, this could be set to `2`, `3`,
or `4` to indicate the alternate Latin 2 - Latin 4 sets, or some other
character for other sets, but there is currently only support for
Latin 1.)

Note that in previous versions of the system, use of 8-bit character
codes was difficult owing to the fact that any code greater than or
equal to 16:80 was interpreted as a graphics character by the Ved
editor.

This restriction has now been removed by defining a standard set of
graphics characters in the range 16:81 - 16:9F, which do not conflict
with ISO Latin (see Ved Standard Graphics Characters in REF * VEDPROCS).
The new graphics characters do not conflict with the old ones either,
but the old ones can only be interpreted when pop_character_set is
false. (However, nothing in Poplog uses the old ones any more, except
for the old graphcharsetup library. This remains for backwards
compatibility, and if used, sets pop_character_set false.)


pop_character_set                                             [variable]
        This variable contains either false or an integer ASCII
        character code indicating the current character set in use.
        Currently only the value `1` is supported, meaning the ISO
        Latin 1 character set (this is its default value).

        The value of this variable affects the procedures

            ¤ isuppercode
            ¤ islowercode
            ¤ isalphacode
            ¤ uppertolower
            ¤ lowertoupper

        as well as the Pop-11 itemiser (see REF * ITEMISE). As described
        above, a non-false value also prevents the Ved editor from
        interpreting 8-bit characters as old-style graphics characters.

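As a minimal sketch of how a program might take the current character
set into account, the fragment below tests a code outside the ASCII
range only when Latin 1 is selected. The code 16:E9 is an assumption of
this example: it is the Latin 1 code for e-acute, which lies in the
lower case range 16:E0 - 16:F6.

```pop11
;;; Sketch only: report whether 16:E9 counts as a letter under the
;;; character set currently in use.
if pop_character_set == `1` then
    ;;; Latin 1 selected: codes in 16:E0 - 16:F6 are treated as letters
    isalphacode(16:E9) =>
else
    ;;; otherwise rely only on the ASCII letter ranges
    isalphacode(`e`) =>
endif;
```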

---------------------------
3  Predicates on Characters
---------------------------

As stated above, a character is a 24-bit unsigned integer; thus the
following procedures will all return false for any integer not in the
range 0 <= I <= 16:FFFFFF.

The character-code part tested is the bottom 16 bits of the integer
(i.e. they will also return false for any integer that has a non-zero
value in bits 8-15).

The characters recognised as upper and lower case letters by the
procedures isuppercode and islowercode (as well as uppertolower and
lowertoupper) are the ASCII values plus the additional Latin 1
characters when pop_character_set has the value ASCII `1`.

Note that in Latin 1 there are two letters which do not have alternate
case equivalents (German double s and y dieresis). isuppercode and
islowercode return true only for letters that have an alternate case,
and hence these two are excluded. However, isalphacode recognises all
letters.

        Letter type     ASCII               Latin 1
        -----------     -----               -------
        upper case      16:41 - 16:5A       16:C0 - 16:D6
                                            16:D8 - 16:DE

        lower case      16:61 - 16:7A       16:E0 - 16:F6
                                            16:F8 - 16:FE

        other                               16:DF
                                            16:FF


isuppercode(item) -> bool                                    [procedure]
        Returns true if item is a character whose character-code part is
        an upper case letter (see above), or false otherwise.


islowercode(item) -> bool                                    [procedure]
        Returns true if item is a character whose character-code part is
        a lower case letter (see above), or false otherwise.


isalphacode(item) -> bool                                    [procedure]
        Returns true if item is a character whose character-code part is
        a letter (see above), or false otherwise.


isnumbercode(item) -> bool                                   [procedure]
        Returns true if item is a character whose character-code part is
        the ASCII/ISO Latin code for a digit (i.e. in the range
        16:30 - 16:39), or false otherwise.



---------------------------------
4  Locating Characters in Strings
---------------------------------

These procedures all search strings for the normal ASCII/ISO Latin part
(i.e. bottom eight bits) of a character char.


locchar(char, N, string) -> M_or_false                       [procedure]
        Searches the string (or word) string for the character char,
        starting the search at the N-th character of string. Returns the
        subscript M at which char was found, or false otherwise. E.g:

            locchar(`a`, 1, 'the cat sat on the mat') =>
            ** 6
            locchar(`a`, 7, 'the cat sat on the mat') =>
            ** 10
            locchar(`a`, 22, 'the cat sat on the mat') =>
            ** <false>


strmember(char, string) -> M_or_false                        [procedure]
        Same as locchar(char, 1, string), i.e. returns the subscript M
        at which char first occurs in the string (or word) string, or
        false otherwise.


locchar_back(char, N, string) -> M_or_false                  [procedure]
        As locchar, except that the search is performed BACKWARDS
        starting from the N-th character. E.g:

            locchar_back(`a`, 22, 'the cat sat on the mat') =>
            ** 21
            locchar_back(`a`, 20, 'the cat sat on the mat') =>
            ** 10
            locchar_back(`a`, 5, 'the cat sat on the mat') =>
            ** <false>


skipchar(char, N, string) -> M_or_false                      [procedure]
        Searches the string (or word) string for any character OTHER
        than char, starting at the N-th character. Returns the subscript
        M at which a character other than char was found, or false if
        every character from the N-th onwards was a char. E.g:

            skipchar(`*`, 1, '*** HELLO ***') =>
            ** 4
            skipchar(`*`, 11, '*** HELLO ***') =>
            ** <false>


skipchar_back(char, N, string) -> M_or_false                 [procedure]
        As skipchar, except that the search is performed BACKWARDS
        starting from the N-th character. E.g:

            skipchar_back(`*`, 13, '*** HELLO ***') =>
            ** 10
            skipchar_back(`*`, 3, '*** HELLO ***') =>
            ** <false>



------------------------
5  Predicates on Strings
------------------------

Note that most of the procedures in this section taking an argument
specified as string or sub_string will accept words in place of any of
their string arguments (isstring of course returns false for words).

All procedures in this section compare only the normal ASCII/ISO Latin
parts of characters in substrings.

See also REF * vedissubitem


isstring(item) -> bool                                       [procedure]
        Returns true if item is a string (or a dstring), false if not.


check_string(item)                                           [procedure]
        Mishaps if item is not a (d)string.


issubstring(sub_string, N, string) -> M_or_false             [procedure]
issubstring(sub_string, string) -> M_or_false
        Searches the string (or word) string, starting from its N-th
        character, for a substring equal to the string sub_string and,
        if found, returns the subscript M of string at which the
        matching substring begins; otherwise it returns false. If N is
        not given, it defaults to 1. E.g:

            issubstring('the', 1, 'all the cats') =>
            ** 5
            issubstring('the', 6, 'all the cats') =>
            ** <false>


issubstring_lim(sub_string, N, startlim, endlim, string)     [procedure]
                -> M_or_false
        Same as issubstring, but the match is constrained to start on or
        before the subscript startlim, and to end on or before the
        subscript endlim. The startlim or endlim constraints may be
        disabled by supplying false for either argument, e.g.

            issubstring_lim(sub_string, N, false, false, string)

        is just the same as issubstring. Examples:

            issubstring_lim('the', 1, 5, false, 'all the cats') =>
            ** 5
            issubstring_lim('the', 1, 4, false, 'all the cats') =>
            ** <false>
            issubstring_lim('the', 1, false, 7, 'all the cats') =>
            ** 5
            issubstring_lim('the', 1, false, 6, 'all the cats') =>
            ** <false>
            issubstring_lim('the', 1, 5, 7, 'all the cats') =>
            ** 5


isstartstring(sub_string, string) -> M_or_false              [procedure]
        If the string (or word) string starts with the substring
        sub_string then returns subscript 1, otherwise false. E.g:

            isstartstring('ban', 'banana') =>
            ** 1
            isstartstring('ban', 'abandon') =>
            ** <false>

        (This procedure is the same as

            issubstring_lim(sub_string, 1, 1, false, string)

        but quicker.)

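Because issubstring returns the subscript of a match and accepts a
starting position N, it can be called repeatedly to visit every
occurrence of a substring. The following is a sketch only:
count_substring is a hypothetical name (not a Poplog built-in), and
overlapping occurrences are counted.

```pop11
;;; Hypothetical helper: count the occurrences of substr in str.
define count_substring(substr, str) -> n;
    lvars substr, str, n, start, pos;
    0 -> n;
    1 -> start;
    ;;; stop when the start position passes the end of str, or when
    ;;; issubstring finds no further match
    while start <= datalength(str)
    and (issubstring(substr, start, str) ->> pos) do
        n + 1 -> n;
        pos + 1 -> start        ;;; resume just after this match began
    endwhile
enddefine;

count_substring('the', 'the cat sat on the mat') =>     ;;; ** 2
count_substring('ana', 'banana') =>                     ;;; ** 2
```

Resuming at pos + 1 rather than after the end of the match is what
allows the two overlapping occurrences in 'banana' to be found.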

ismidstring(sub_string, string) -> M_or_false                [procedure]
        If sub_string is a substring of the string (or word) string, but
        does not start on the first character of string nor end on the
        last, then this returns the subscript at which the substring
        starts, otherwise false. E.g.

            ismidstring('ban', 'banana') =>
            ** <false>
            ismidstring('ban', 'abandon') =>
            ** 2


isendstring(sub_string, string) -> M_or_false                [procedure]
        If the string (or word) string ends with the substring
        sub_string, then returns the subscript M of sub_string in
        string, otherwise false. E.g:

            isendstring('ing', "working") =>
            ** 5
            isendstring('ing', 'ng') =>
            ** <false>


hassubstring(string, sub_string) -> M_or_false               [procedure]
hassubstring(string, N, sub_string) -> M_or_false
        Same as

            issubstring(sub_string, 1, string)
            issubstring(sub_string, N, string)

        respectively (i.e. N defaults to 1).


hasendstring(string, sub_string) -> M_or_false               [procedure]
        Same as isendstring(sub_string, string).


hasmidstring(string, sub_string) -> M_or_false               [procedure]
        Same as ismidstring(sub_string, string) (embedded substring).


hasstartstring(string, sub_string) -> M_or_false             [procedure]
        Same as isstartstring(sub_string, string).


alphabefore(string1, string2) -> bool_or_1                   [procedure]
        This procedure takes two strings (or words) as arguments and
        returns true if the first is alphabetically before the second,
        or false if the first is alphabetically after the second. The
        integer 1 is returned if the strings have exactly the same
        characters. For example:

            alphabefore("cat", "dog") =>
            ** <true>
            alphabefore("dog", "cat") =>
            ** <false>
            alphabefore('cat', 'catch') =>
            ** <true>
            alphabefore("cat", "cat") =>
            ** 1



-----------------------
6  Constructing Strings
-----------------------

consstring(char1, char2, ..., charN, N) -> string            [procedure]
        Returns a string string constructed from the next N character
        values on the user stack (where the topmost value on the stack
        will be at the highest subscript in the string).

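Since consstring takes its characters from the user stack, a loop that
leaves characters on the stack can be followed directly by a call of
consstring with the count. A minimal sketch, in which repeat_char is a
hypothetical name (not a Poplog built-in):

```pop11
;;; Characters given explicitly, followed by their count
consstring(`c`, `a`, `t`, 3) =>         ;;; ** cat

;;; Hypothetical helper: a string of n copies of char.
define repeat_char(char, n) -> str;
    lvars char, n, str, i;
    ;;; each iteration leaves one copy of char on the stack
    for i from 1 to n do char endfor;
    consstring(n) -> str
enddefine;

repeat_char(`-`, 5) =>                  ;;; ** -----
```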

inits(len) -> string                                         [procedure]
        Returns a newly created string of length len containing all zero
        (i.e. NUL) characters. (See also initvectorclass in REF * DATA.)


substring(N, len, string) -> sub_string                      [procedure]
sub_string -> substring(N, len, string)
        The base procedure returns a string sub_string consisting of the
        len characters of the string string starting from the character
        at subscript N. Note that nullstring is returned for an empty
        substring. string may also be a word (but the result is still a
        string -- see subword in REF * WORDS if you want a word result).

        The updater copies the first len characters of the string
        sub_string into the string string starting at subscript N. In
        this case sub_string may also be a word, but not string.


lowertoupper(item1) -> item2                                 [procedure]
        For item1 a (d)string, word or integer character, returns a new
        item of the same type with any ASCII/ISO Latin codes for
        lowercase letters converted to their uppercase equivalent.
        Otherwise just returns item1. For example:

            lowertoupper(`a`) =>
            ** 65       ;;; i.e. `A`
            lowertoupper('hello') =>
            ** HELLO
            lowertoupper(`A`) =>
            ** 65
            lowertoupper([any old list]) =>
            ** [any old list]

        (Note that ISO Latin letters are recognised only when
        pop_character_set has an appropriate value -- see Predicates on
        Characters above.)


uppertolower(item1) -> item2                                 [procedure]
        For item1 a (d)string, word or integer character, returns a new
        item of the same type with any ASCII/ISO Latin codes for
        uppercase characters converted to their lowercase equivalent.
        Otherwise just returns item1. For example:

            uppertolower(`A`) =>
            ** 97       ;;; i.e. `a`
            uppertolower('HELLO') =>
            ** hello
            uppertolower(`a`) =>
            ** 97
            uppertolower([any old list]) =>
            ** [any old list]

        (Note that ISO Latin letters are recognised only when
        pop_character_set has an appropriate value -- see Predicates on
        Characters above.)

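The case-conversion procedures combine naturally with string equality
and character access. The following is a sketch only:
same_ignoring_case and capitalise are hypothetical names (not Poplog
built-ins). capitalise copies the converted string before updating it,
so that the argument is never altered.

```pop11
;;; Hypothetical helper: true if two strings differ at most in case.
define same_ignoring_case(str1, str2) -> bool;
    lvars str1, str2, bool;
    uppertolower(str1) = uppertolower(str2) -> bool
enddefine;

;;; Hypothetical helper: lower case throughout, first letter upper case.
define capitalise(str) -> new;
    lvars str, new;
    copy(uppertolower(str)) -> new;
    if datalength(new) > 0 then
        ;;; convert the first character in place in the copy
        lowertoupper(subscrs(1, new)) -> subscrs(1, new)
    endif
enddefine;

same_ignoring_case('Hello', 'hELLO') =>     ;;; ** <true>
capitalise('hELLO') =>                      ;;; ** Hello
```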

strlowercase(struct1) -> struct2                             [procedure]
struppercase(struct1) -> struct2                             [procedure]
        These are defined as

            mapdata(struct1, uppertolower) -> struct2
            mapdata(struct1, lowertoupper) -> struct2

        respectively, and so will work on a vector of strings, for
        example (but uppertolower and lowertoupper are always quicker on
        individual strings).



------------------------------
7  Accessing String Characters
------------------------------

deststring(string) -> (char1, ..., charN, N)                 [procedure]
        Destructs the string string, i.e. puts all its characters on the
        stack, together with its length N (in other words, does the
        opposite of consstring). E.g.

            deststring('abcd') =>
            ** 97 98 99 100 4


subscrs(N, string) -> char                                   [procedure]
char -> subscrs(N, string)
        Returns or updates the N-th character char of the string string.
        Since subscrs is also the class_apply of a string (see
        REF * KEYS), this may also be called as

            string(N) -> char
            char -> string(N)



-------------------------------
8  Display Strings ('Dstrings')
-------------------------------

From Poplog 14.11, integer character values have been extended to 24
bits (as described in the Introduction above). A new datatype has been
introduced to allow the storage and retrieval of 24-bit characters
containing 8 character-code bits plus 8 attribute bits, i.e. in the
form:

         23            16 15            8 7              0
        +-------------------------------------------------+
        |   Attributes   |       0       | Character Code |
        +----------------+---------------+----------------+

The new datatype is a display string ('dstring'): this is a structure
whose first part is identical to an ordinary string in all respects
(apart from having a different key), but which has a second, parallel
set of bytes appended to it. The second set of bytes is used to store
the attribute parts (top eight bits) of characters, while the first set
store the bottom eight bits as normal.

This scheme allows a dstring to behave as an ordinary string when
required, but is completely transparent in the sense that accessing or
updating a dstring character is simply in terms of a 24-bit integer.

For ordinary string operations, strings and dstrings are completely
interchangeable. Except where otherwise indicated, all normal string
procedures will treat dstrings as ordinary strings (i.e. ignore the
attribute parts), including isstring, which recognises both.

On the other hand, all dstring procedures treat ordinary strings as if
they were dstrings with all-zero attribute bytes.

Note that the basic system does not give any interpretation to the
attribute bits in characters. However, the Ved editor uses dstrings
(where necessary) to represent characters having attributes such as
'bold', 'underlined', etc (the purpose for which dstrings were added).
See INCLUDE * VEDSCREENDEFS for the attribute bits defined by Ved.


isdstring(item) -> bool                                      [procedure]
        Returns true if item is a dstring, false if not. Note that false
        is returned for ordinary strings.


consdstring(char1, char2, ..., charN, N) -> dstring          [procedure]
consdstring(char1, char2, ..., charN, N, sopt) -> dstring
consdstring(string) -> dstring
        The first two forms of this procedure return a (d)string dstring
        constructed from the next N character values on the user stack
        (where the topmost value on the stack will be at the highest
        subscript in the string).

        The optional boolean argument sopt says whether to optimise the
        result to an ordinary string if the attribute parts of all
        characters are zero (true = yes, false = no).

        NOTE that sopt is TRUE by default, i.e. unless given false for
        sopt, consdstring will always return an ordinary string if it
        can.

        The third form allows a string to be converted to a dstring: if
        string is an ordinary string then the result dstring is a
        dstring with the same character codes but all-zero attribute
        bytes; if string is already a dstring, then that is returned.

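The effect of the sopt argument can be seen by testing the results of
consdstring with isdstring. This is a sketch only: the attribute value
1 is chosen arbitrarily (the basic system gives attribute bits no
interpretation), and the results shown in comments are those implied by
the descriptions above.

```pop11
;;; The character `a` combined with an arbitrary attribute byte of 1
lvars attr_a = (1 << 16) || `a`;

;;; One character has a non-zero attribute part, so a dstring results
lvars ds = consdstring(attr_a, `b`, 2);
isdstring(ds) =>                                ;;; ** <true>
isstring(ds) =>                                 ;;; ** <true>

;;; All attribute parts zero, sopt defaulted: an ordinary string
isdstring(consdstring(`a`, `b`, 2)) =>          ;;; ** <false>

;;; sopt false: a dstring is returned regardless
isdstring(consdstring(`a`, `b`, 2, false)) =>   ;;; ** <true>

;;; Third form: convert an ordinary string to a dstring
isdstring(consdstring('ab')) =>                 ;;; ** <true>
```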

initdstring(len) -> dstring                                  [procedure]
        Returns a newly created dstring of length len containing all
        zero (i.e. NUL) characters.


subdstring(N, len, dstring) -> sub_dstring                   [procedure]
subdstring(N, len, dstring, sopt) -> sub_dstring
sub_dstring -> subdstring(N, len, dstring)
        The base procedure returns a (d)string sub_dstring consisting of
        the len characters of the (d)string dstring starting from the
        character at subscript N. nullstring is returned for an empty
        substring. dstring may also be a word (but the result is still a
        (d)string).

        As with consdstring, the optional boolean argument sopt says
        whether to optimise the result to an ordinary string if the
        attribute parts of all characters in sub_dstring are zero
        (true = yes, false = no). Note that sopt is TRUE by default,
        i.e. unless given false for sopt, subdstring will always return
        an ordinary string if it can.

        The updater copies the first len characters of the (d)string
        sub_dstring into the (d)string dstring starting at subscript N.
        If dstring is an ordinary string this procedure does exactly the
        same as substring (i.e. ignores attributes); if dstring is a
        dstring but sub_dstring an ordinary one, the corresponding
        attribute bytes in dstring are zeroed. As with the updater of
        substring, sub_dstring may also be a word, but not dstring.


destdstring(dstring) -> (char1, ..., charN, N)               [procedure]
        Destructs the (d)string dstring, i.e. puts all its characters on
        the stack, together with its length (in other words, does the
        opposite of consdstring).


subscrdstring(N, dstring) -> char                            [procedure]
char -> subscrdstring(N, dstring)
        Returns or updates the N-th character char of the (d)string
        dstring. (If dstring is an ordinary string, the char returned by
        the base procedure will have zero attribute bits. For the
        updater, if dstring is an ordinary string it is an error for
        char to have non-zero attribute bits.)

        Since subscrdstring is the class_apply of a dstring (see
        REF * KEYS), this may also be called as

            dstring(N) -> char
            char -> dstring(N)



--------------------------------------------------------
9  Generic Datastructure/Vector Procedures on (D)Strings
--------------------------------------------------------

The generic datastructure procedures described in REF * DATA
(datalength, appdata, explode, fill, copy, etc) are all applicable to
strings and dstrings, as are the generic vector procedures
(initvectorclass, move_subvector, sysanyvecons, etc) also described in
that file.

Note that the operator <> can be used to concatenate (d)strings with
(d)strings; the result is a dstring if either argument is.

move_subvector from a dstring to an ordinary string ignores the
attribute bytes in the source, while moving from a string to dstring
zeros the corresponding attributes in the destination.

Note also that the default class_= procedure for dstrings is the same as
for ordinary strings, i.e. compares only the character-code parts of
each character. (There is currently no procedure that compares the
attribute parts.)



--------------
10  Vedstrings
--------------

Vedstrings are a notional data type designed for use in the Ved editor.
They are actually strings or dstrings (and in the future possibly other,
e.g. 16-bit, string types), but in addition allow for the embedding of
an arbitrary data item on each character in the string (that is, the
association of an item with each subscript position in the string). This
association is maintained via the property vedstring_data_prop.

When a character with associated data is accessed from a vedstring, the
return value is a pair of the form

        conspair(integer-char, data-item)

rather than the ordinary integer-char when there is no associated data.
Similarly, such a pair may be assigned into a character position in a
vedstring to set the associated data item along with the character
(assigning integer-char alone removes any data item).

The argument vchar in the descriptions below thus means either an
integer character or a pair as above.

Note that only the procedures described below maintain the embedded data
in vedstrings; other generic operations such as copy, <> or explode will
just treat them as strings, and the result will lose any embedded data.
Thus for example, to copy a vedstring use

        copy(vstring) -> new_vstring;
        if vedstring_data_prop(vstring) ->> vec then
            copy(vec) -> vedstring_data_prop(new_vstring)
        endif;

or alternatively,

        subvedstring(1, datalength(vstring), vstring) -> new_vstring;

etc.

(Note also that Pop-11 quoted string syntax allows for the construction
of vedstrings, provided the associated data items are themselves
(d)strings -- see REF * ITEMISE. As a Ved buffer line, this form of
vedstring is the only type that can be written to a file by Ved.)


consvedstring(vchar1, vchar2, ..., vcharN, N) -> vstring     [procedure]
        Returns a vedstring vstring constructed from the next N vchar
        character values on the user stack (where the topmost value on
        the stack will be at the highest subscript in the string).


destvedstring(vstring) -> (vchar1, ..., vcharN, N)           [procedure]
        Destructs the vedstring vstring, i.e. puts all its characters on
        the stack, together with its length.


subvedstring(N, len, vstring) -> sub_vstring                 [procedure]
sub_vstring -> subvedstring(N, len, vstring)
        The base procedure returns a vedstring sub_vstring consisting of
        the len characters plus embedded data of the vedstring vstring
        starting from the character at subscript N. nullstring is
        returned for an empty substring. vstring may also be a word (but
        the result is still a (d)string).

        The updater copies the first len characters plus embedded data
        of the vedstring sub_vstring into the vedstring vstring starting
        at subscript N. As with the updater of substring, sub_vstring
        may also be a word, but not vstring.

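As a sketch of how a vchar pair carries embedded data, the fragment
below builds a three-character vedstring whose second character has an
associated item and then destructs it again. The data item 'note' and
the variable names are assumptions of this example; the comments state
what the descriptions above imply rather than verified output.

```pop11
;;; Second character is given as a pair: character `b` plus data item
lvars vs = consvedstring(`a`, conspair(`b`, 'note'), `c`, 3);

;;; Destruct: the length is topmost on the stack, then the vchars in
;;; reverse order
lvars n, c1, c2, c3;
destvedstring(vs) -> n -> c3 -> c2 -> c1;

n =>                ;;; the length, 3
ispair(c2) =>       ;;; the second vchar should come back as a pair
isinteger(c1) =>    ;;; no data on the first, so an integer character
```

Using a generic operation such as explode on vs instead would return
plain integer characters only, since the embedded data is not held in
the string itself.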

subscrvedstring(N, vstring) -> vchar                         [procedure]
vchar -> subscrvedstring(N, vstring)
        Returns or updates the N-th character vchar of the vedstring
        vstring.

        (N.B. Since a vedstring is not a distinct datatype, you cannot
        access a vedstring character with the Pop-11 form

            vstring(N) -> vchar

        This will always just give the integer character.)


vedstring_data_prop(vstring) -> vec_or_false                 [procedure]
        The property used to hold embedded data items for vedstrings. If
        a string has any embedded data, vedstring_data_prop for the
        string is a full vector of the form

            {% sub1, data1, ..., subN, dataN %}

        meaning that each item dataI is associated with subscript subI
        in the string. The subscripts appear in order, i.e.
        sub1 < sub2 < ... < subN.



---------------------------------------
11  Regular Expression Pattern Matching
---------------------------------------

REF * REGEXP describes the Poplog facilities for performing regular
expression pattern matching on strings. Regular expressions allow you to
perform powerful string searching operations using a set of 'wildcards'.



-----------------
12  Miscellaneous
-----------------

See also * stringin in REF * CHARIO for constructing character repeaters
from strings.


strnumber(string_or_word) -> num_or_false                    [procedure]
        If the characters of the string or word argument form a valid
        number according to the lexical syntax rules given in
        REF * ITEMISE, then that number is returned, otherwise false.
        E.g.

            strnumber('123') =>
            ** 123

        returns the integer 123. Note that character constants are valid
        as integers, e.g.

            strnumber('`a`') =>
            ** 97


sys_parse_string(string) -> (substr1, ..., substrN)          [procedure]
sys_parse_string(string, sepchar) -> (substr1, ..., substrN)
sys_parse_string(string, p) -> (item1, ..., itemN)
sys_parse_string(string, sepchar, p) -> (item1, ..., itemN)
        Given a string (or word) string, this procedure breaks it into
        substrings delimited by either (a) the character sepchar (if
        supplied), or (b) by whitespace characters (spaces, tabs and
        newlines) if sepchar is absent. The substrings are returned on
        the stack.

        If a procedure p is supplied as an optional second or third
        argument, it is applied to each substring as it is produced,
        i.e.

            p(substr)

        p may then either return the substring, or some other item(s) in
        its place.


sysparse_string(string, try_strnumber) -> list               [procedure]
sysparse_string(string) -> list
        Similar to sys_parse_string splitting on whitespace, but returns
        a list instead of separate substrings.

        If the optional boolean argument try_strnumber is false, this
        procedure is the same as

            [% sys_parse_string(string) %]

        If try_strnumber is true (the default when omitted), it is the
        same as

            [% sys_parse_string(string,
                    procedure(substr);
                        lvars substr;
                        strnumber(substr) or substr
                    endprocedure)
            %]

        i.e. every substring for which strnumber returns a number is
        replaced by that number.


nullstring -> string                                          [constant]
        The value of this constant is a string of 0 characters.


string_key -> key                                             [constant]
dstring_key -> key                                            [constant]
        These constants hold the key structures for ordinary strings and
        dstrings (see REF * KEYS).



+-+ C.all/ref/strings
+-+ Copyright University of Sussex 1995. All rights reserved.