REGULAR EXPRESSIONS

Jonathan Meyer Sept 1992


COPYRIGHT University of Sussex 1993. All Rights Reserved.

This file describes the Poplog regular expression matcher. The regular expression matcher is used in Ved and other Poplog facilities to perform pattern-based (or `wildcard') searching in strings.

Contents

Select headings to return to index

Introduction

Regular expressions are a powerful and flexible way of performing pattern matching in strings, popularised by UNIX facilities such as grep, awk, vi, ed, sed, etc.

The Poplog regular expression matcher lets you to search Poplog strings and words using regular expressions. You have the option of searching from left-to-right or from right-to-left, and of case sensitive or insensitive searching.

See REF * VEDSEARCH for details of how to use regular expressions in Ved. TEACH * REGEXP gives an introduction to constructing regular expressions.

Poplog Regular Expressions

The main differences between Poplog regular expressions and those provided by the `C' library described in UNIX * REGEXP are:

The Escape Character

Because the backslash character `\' already has a meaning to the Poplog itemiser in strings, the Poplog regular expression compiler uses the `@' (at) character as its escape code instead of the `\' character used by the `C' regular expression matcher. This means that, in Poplog, instead of writing:

\(hello\)

you write:

@(hello@)

Word Boundaries

The @< and @> operators can be matched using * vedchartype rather than the more simplified approach used in the `C' regular expression matcher. Using * vedchartype means that word boundaries are language sensitive, and obey the rules of the Pop-11 itemiser.

Ved Search Patterns

As well as the regular expression patterns described in UNIX * grep, regexp_compile recognises the additional search patterns @a, @z, and @?. These are provided for compatibility with exisiting search facilities in Ved.

Literal mode

Unlike UNIX regular expressions, in Poplog the characters $ ^ . * [ and ] stand for themselves, and are not special wildcard characters UNLESS they are preceded by an escape (eg. @. or @*). Thus, in Poplog, instead of writing

.*[abc]

you write:

@.@*@[abc@]

Long regular expressions

The regular expression matcher can handle regular expressions which span more than one line of text. See Long Regular Expressions below.

Constructing Regular Expressions

This section describes the rules used for constructing regular expressions.

Single Character Matching

The following one-character regular expressions match a single character:

   c    An ordinary character is a one-character regular expression that
        matches that character.
   @.   An escaped `.'  (period) is a  one-character regular  expression
        that matches any character except NEWLINE.
   @[string@]
        A non-empty  string of  characters  enclosed in  escaped  square
        brackets is a one-character regular expression that matches  any
        one character in that string.
If, however, the first character of the string is a `^' (a circumflex or caret), the one-character regular expression matches any character except NEWLINE and the remaining characters in the string. The `^' has this special meaning only if it occurs first in the string.
The `-' (minus) may be used to indicate a range of consecutive ASCII characters; for example, [0-9] is equivalent to [0123456789]. The `-' loses this special meaning if it occurs first (after an initial `^', if any) or last in the string.

Multi Character Matching

The asterisk is used for multiple character matching:

   @*   Any one-character regular expression followed by escaped `*' (an
        asterisk) is  a regular  expression that  matches zero  or  more
        occurrences of the one-character regular expression.
If there is any choice, the longest leftmost string that permits a match is chosen.

Concatenation

Single and multi character regular expressions can be concatenated with other regular expressions to form compound expressions which match the concatenation of the strings matched by each component of the regular expression. Thus you can build complex regular expressions by combining one or more single and multi character regular expressions.

Sub-expressions

Regular expressions can contain up to nine sub-expressions, specified using the @( @) brackets:

   @( and @)
        A regular expression enclosed between the character sequences @(
        and  @)  matches  whatever  the  unadorned  regular   expression
        matches. See @n below.

You can refer back to a sub-expression using the @n expression:

   @N   The expression @N  (where N  is a  digit 1-9)  matches the  same
        string of characters  as was matched  by an expression  enclosed
        between @( and @) earlier in the same regular expression.
The sub-expression specified is that beginning with the nth occurrence of @( counting from the left.
For example, the expression ^@(.*@)@1$ matches a line consisting of two repeated appearances of the same string.

Operators

The following special operators can be used in regular expressions:

   @<   The  sequence  @<  in   a  regular  expression  constrains   the
        one-character regular expression  immediately following it  only
        to match something at the beginning of a "word"; that is, either
        at the beginning of a line,  or just before a letter, digit,  or
        underscore (_) and after a character  not one of these (but  see
        the description of the FLAGS argument to regexp_compile).
   @>   The  sequence  @>  in   a  regular  expression  constrains   the
        one-character regular expression  immediately following it  only
        to match something at  the end of a  "word"; that is, either  at
        the end of a line, or just before a character which is neither a
        letter, digit, nor  underscore (_) (but  see the description  of
        the FLAGS argument to regexp_compile).
   @{m@}
   @{m,@}
   @{m,n@}
        A regular  expression  followed  by @{m@},  @{m,@},  or  @{m,n@}
        matches a range  of occurrences of  the regular expression.  The
        values of m and n must  be non-negative integers less than  256.
            @{m@}   matches exactly m  occurrences;
            @{m,@}  matches at  least m occurrences;
            @{m,n@} matches any number of occurrences between m and  n
                    inclusive.
Whenever a choice exists, the regular expression matches as many occurrences as possible.
   @^   An escaped circumflex or caret (^) at the beginning of an entire
        regular expression constrains that  regular expression to  match
        an initial segment of a line.
   @$   An escaped currency symbol ($) at  the end of an entire  regular
        expression constrains that regular  expression to match a  final
        segment of a line.

Note that the construction

@^entire regular expression@$

constrains the entire regular expression to match the entire line.

Ved Compatibility patterns

The following patterns can be specified in regular expressions for Ved compatibility:

@a The same as @^ at the start an expression.
@z The same as @$ at the end of an expression.
@? The same as @. in an expression, ie. it matches any character.
@@ This matches the single character "@".
@C Where C is one of `/', `\', `"', or ``' matches C.

Long Regular Expressions

Regular expressions are, by convention, used to perform pattern matching against single lines in a text file - a single regular expression can match at most all of the characters of one line in a file.

It is useful to be able to search for patterns which span line breaks. For example, you might wish to search for text which started with 'if' on one line and 'endif' on the next line.

The Poplog regular expression matcher provides support for multiple line (`long') regular expressions. These work by deviding a regular expression for matching several lines into several smaller regular expressions each matching a single line. Line breaks are indicated in the regular expression by writing `@$' (or `@z', ie. constrain to end of line) followed immediately by `@^' (or `@z', ie. constrain to start of line). For example:

dog@$@^cat

Matches text which contains 'dog', followed by a line break, followed by 'cat'. The long-string expression matcher does not allow sub-expressions to span over line breaks. Thus the following expression is illegal:

'@(some text @z@a and some more text@)'

Instead you must write:

'@(some text @)@z@a@( and some more text@)'

The Case Mode

You can turn on and off case sensitivity in the middle of a regular expression using @i and @c.

@i makes the matcher ignore the case of the characters that follow.
@c turns on case sensitivity, making the matcher examine case.

Compiling Regular Expressions

regexp_compile(regexp_string) -> (error_info, regexp_p) [procedure]

regexp_compile(regexp_string, flags, delim_char)
                              -> (error_info, regexp_p)
Takes a regular expression regexp_string (a string or word) and compiles ito into an efficient internal representation for performing regular expression string comparisons.
The optional flags argument is an integer whose bits have the following meanings:
Bit Meaning --- -------
  1. If set, the resulting regexp_p will not be case sensitive. If clear, then the matcher will distinguish between upper and lower case characters See REF * isuppercode, * islowercode.
  2. If set, the regular expression will only match non- embedded items, otherwise it will match embedded or non-embedded items. This implicitly inserts a @< and @> at the start and end of the expression, and is used for " and ` searches in Ved.
  3. If set, the pattern $ (or @z) followed by ^ (or @a) will be treated as a line break, and a `long' regular expression matcher will be built. See the section on 'Long Expressions' below.
  4. If set, word boundaries are identified in strings passed to the regexp_p using the * VEDCHARTYPE mechanism, otherwise they are identified with the simpler algorithm used by the `C' regular expression matcher (where words are formed from letter codes, number codes, or the underscore (_) character). See REF * isalphacode, * isnumbercode, REF * pop_character_set.
  5. If set, occurrences of a tab character (`\t`, ASCII
9) in strings passed to the regexp_p are expected to be followed by a number of padding tab characters determined by vedindentstep. This flag is provided to allow Ved buffer strings containing hard tabs to be searched. See REF * veddecodetabs, REF * vedencodetabs, HELP * vednotabs.
If the flags argument is omitted, all bits default to 0.
The optional delim_char is an ASCII character code that marks the end of the regular expression. regexp_compile will stop parsing regexp_string once it reaches a delim_char character. For example, in an editor like Ved this character would usually be `/`. If delim_char is false, or if the delim_char argument is omitted, the whole of the regexp_string is used.
If regexp_string can be successfully parsed and compiled, error_info will be false and regexp_p will be a procedure of the form:
regexp_p(n, string, len, back) -> (start_index, num_chars)
The integer n specifies where to start searching in string (a string or word), and len specifies how many characters should be examined. If len is false, the rest of the string from n is used.
If back is false, searching is done from left to right in string. If it is true, searching is done from right to left (ie. backwards) starting at the position n + len -1 (or the end of the string if len is false). If it is an integer, it is the subscript in the string where backward searching starts.
On completion, start_index and num_chars indicate where in string the regular expression matches. If there is no match in string, both start_index and num_chars will be false. Otherwise the matching portion of the string can be obtained using:
substring(start_index, num_chars, string)
For example,
vars err, searchp;
;;; the regular expression is 'd@.@*s' - which matches a 'd' ;;; followed by any number of characters followed by 's'
regexp_compile('d@.@*s') -> (err, searchp);
searchp => ** <procedure d@.@*s>
searchp(1, 'this string wont match', false, false) => ** false false ;;; no match can be found
searchp(1, 'this string does match', false, false) => ** 13 4
;;; there was a match - use * substring to identify it: substring(13, 4, 'this string does match') => ** does
Note that the * pdprops of regexp_p is set to the regular expression string that it matches.
If regexp_string cannot be successfully parsed and compiled into a regular expression, error_info will be a string containing an error message, and regexp_p will be a procedure which mishaps if called (regexp_p will always be a procedure so that you can reliably assign it identifiers whose * identtype is "procedure"). For example:
regexp_compile('@[a-z') -> (err, searchp); err => ** '[] imbalance'
regexp_compile('@[a-z@]') -> (err, searchp); err => ** <false>
A common usage of regexp_compile is therefore:
            if (regexp_compile(string) -> searchp ->> err) then
                mishap(string, 1, err);
            endif;
Long Expressions ---------------- You set bit 5 of the flags argument to regexp_compile to enable compilation of long regular expressions. The resulting regexp_p procedure is of the form:
            regexp_p(n, string_1, string_2, ..., string_m, len, back)
                    -> (start_index, num_chars)
Where the integer m is determined by adding one to regexp_break_count, n is an index in string_1 where matching starts, and len is the number of characters of string_m which are included in the search. If m is one (ie. the search only spans one line), then len characters starting from n in string_1 are used. len may be false, in which case as many characters as possible from string_m are used. All characters from strings string_2 to string_m-1 are used. back is as before.
If a match is found, start_index is the index into string_1 where the match starts, and num_chars is the number of characters in string_m that are included in the match. Otherwise start_index and num_chars are both false.
If m is 1, the matching portion of text is retrieved using:
substring(start_index, string_1, num_chars)
Otherwise, it is:
allbutfirst(start_index - 1, string_1) <> string_2 <> ... <> substring(1, num_chars, string_m)

Bracketed Sub-expressions

Regular expressions can have up to 9 sub-expessions enclosed in @( @) brackets. You can find out how many sub-expessions a regular expression has, and also where those sub-expessions matched in the last search.

regexp_subexp_count(regexp_p) -> n [procedure]

Returns the number of bracketed @( @) sub-expessions found in the regular expression that was compiled into regexp_p by regexp_compile. n will be 0 if there are no bracketed expressions. The current implementation allows up to 9 of these bracketed expressions.

regexp_subexp(n, regexp_p) -> (start_index, len, string) [procedure]

After regexp_p has been successfully applied to a string (ie. the string can be matched against the regular expression), an internal table is set to the locations in the string where any bracketed @( @) sub-expessions in the regular expression matched. You can access elements of this table using regexp_subexp. It returns the start index and length of the substring enclosed by the n'th @( @) pattern. It also returns the string which the sub-expression was matched against.
For example:
;;; compile a regular expression vars procedure searchp;
regexp_compile('h@(....@) world') -> (,searchp);
searchp => ** <procedure h@(....@) world>
;;; apply searchp to the string 'hello world':
searchp(1, 'hello world', false, false) => ** 1 11 ;;; ie. the whole string matches
;;; find out where the first bracketed expression matched: regexp_subexp(1, searchp) => ** 2 4 hello world
substring(regexp_subexp(1, searchp)) => ** ello

Miscellaneous

isregexp(item) -> bool [procedure]

true if item is a regular expression matching procedure compiled by regexp_compile.

regexp_delimeter(regexp_p) -> n [procedure]

Returns the index of the location of the delimeter character in the string regexp_string passed to regexp_compile to create regexp_p, or false in no delimeter character appeared in the string.

regexp_break_count(regexp_p) -> n [procedure]

Returns the number of line breaks that appear in the regular expression encoded by regexp_p. For any expression which spans only a single line, this will be 0. For an expression spanning two lines, this will be 1, etc.

regexp_anchored(regexp_p) -> bool [procedure]

true if the regexp_p expression includes an @a, @z, @^ or @$ to anchor the expression to the start or end of a line.

--- C.all/ref/regexp --- Copyright University of Sussex 1993. All rights reserved.

SourceForge.net Logo