Regex - Regular Expressions in ANS Forth

(Version 0.11, 2 May 2011)

Introduction
Features
Portability
Download
Install and Test
Using Regex
String Builder
Features for Future Versions
Features Excluded
Implementation Notes
ANS Forth Compliancy
Contact
Version History
Home

1. Introduction

Regex is an ANS Forth program to compile regular expressions written in the conventional format and to match ASCII text against them. The regular expression flavour is Perl-like. This release is rich in features but is still in development with features to be added. The package includes a String Builder module to make use of the results from a successful regular expression match. Regex is intended for use on 32 bit or larger desktop computers rather than 16 bit or embedded processors and so little attention has been paid to code size - ease of development and maintenance have been the main drivers. To this end the tools LexGen and Grace have been used for the regular expression scanner and parsers and an object oriented approach has been used in the design. The availability of this program has not yet been publicly announced, if you stumble upon it you are welcome to use it but be warned that it is still under development so various things may change until version 1.0 is released.

2. Features

The features provided are:

character concatenation abc
alternation a|b
Greedy quantifiers ? * + {n,m} {n,} and {n} for specification of repetition
Lazy quantifiers *? +? ?? {n,m}? {n,}? and {n}?
Possessive quantifiers *+ ++ ?+ {n,m}+ {n,}+ and {n}+
character classes [abc] or [a-zA-Z]
negated character classes [^abc] or [^a-zA-Z]
match any character except \n using .
metacharacter recognition \{ \} \[  \. \| \* \+ \? \] \^ \$ \\
escaped characters \a \e \f \n \r \t \v \0 \bl \#
escaped characters in character classes \b \- \] \^ \$ \\
hexadecimal escapes \xF \xFF \x{F...}
control characters e.g. \cH for CTRL-H
capturing parentheses (a(bc)def)gh
non-capturing parentheses (?:abc)
named capturing parentheses (?<alice>...)
(Changed in version 0.4)
anchors ^ $ \A \z \Z
predefined character classes \w \W \d \D \s \S both inside and outside character classes
backreferences using (e.g.):
- \1 \2 etc for absolute numbered references
- \g{-1} \g{-2} etc for relative numbered references
- \g{alice} for named references
Execution of embedded code (?{forth code})
Conditional sub-expressions with the test being on back references, named references, embedded Forth code and all look arounds e.g. (?(1)abc|def) (?(name)abc|def) (?(?{forth-code})abc|def) (?(?=xyz)abc|def) (?(?!xyz)abc|def) (?(?<=xyz)abc|def) (?(?<!xyz)abc|def)
Mode modifiers and mode modified spans:
(?i) (?-i) (?i:...) (?-i:...) for case (in)sensitivity
(?m) (?-m) (?m:...) (?-m:...) for enhanced line anchor match mode
(?s) (?-s) (?s:...) (?-s:...) for "dot matches all" mode
(?x) (?-x) (?x:...) (?-x:...) for free-form layout and comments mode terminated with (?e) or (?end). # is used for comments running to the end of the current line
Look ahead (?=...) (?!...) and look behind (?<=...) (?<!...)
Atomic groups (independent subexpressions) (?>...)
Case folding:
spans \U \L ... \E
following character \u \l
Literal character span \Q...\E
Recursion in regular expressions using (?R) (?n) (?+n) (?-n) (?&name)
Interpolation of regular expression and string objects (?$name)
Dynamic Regular Expressions (??{Forth code sequence where the Forth code sequence returns a regular expression.
Alternative capture group numbering (?|...|...|...)

For String Builder features see String Builder.

3. Portability

Regex 0.11 has been tested successfully on 32 bit versions of GForth, SwiftForth, VFX Forth, Win32 Forth, iForth and BigForth, all running under MS Windows XP and Windows 7 (32 bit version) and is, therefore, likely to be ANS Forth compliant. Version 0.2 was reported as working on GForth under Mac OSX and Linux, and on PFE; there is no reason, other than pessimism, to suppose the latest version won't work on these systems. Versions before 0.8 would not work with 64 bit Forths, version 0.8 worked with 64 bit GForth under Linux, version 0.11 is untested on Linux. Feedback on the use of Regex with other systems will be welcomed.

Use of Regex using Forths with cells smaller than 32 bits is not supported.

4. Download

Regex version 0.11 released 2 March 2011. This contains the following files:

Regex files:

xmini_oof.fth - a library file
sets.fth - a library file
regex.fth - the Regex and String Builder source code file

Testing:

regextest.fth - the test program

Documentation:

regex.html - this file

Note that, from version 0.3 onwards, Regex and String Builder files have been concatenated until one large file - regex.fth. Similarly regextest.fth. The state transition tables and parsers were generated by LexGen and Grace in an unformatted mode and hence are unreadable (being computer generated, they were never intended to be readable anyway).

5. Install and Test

This is configured to work with GForth 0.7.0. To test Regex with GForth, unzip the files into a suitable directory under GForth called, say, regex. Then type:

s" regex/regextest.fth" included

or similar, which should give the following output:

Gforth 0.7.0, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
s" regex/regextest.fth" included Loading extended mini oof ...
Extended mini oof loaded. <0> Loading Sets.fth ...
redefined cell  redefined decode  Sets.fth loaded. <0> redefined 1Cell with 1cel
l  redefined noop  redefined ~  redefined testsym?  redefined this-parser  redef
ined ~  redefined testsym?  redefined >string  redefined Format  redefined <meta
char>  redefined <forth>  redefined <group>  redefined <escapedchar>  redefined
<reference>  redefined <conditional>  redefined <modemod>  redefined <casefolder
>  redefined <modifier>  redefined this-parser  redefined ~

Regex version 0.11
Copyright (C) Gerry Jackson 2010, 2011

redefined TESTING with Testing
Running Regex tests

*  1 *  2 *  3 *  4 *  5 *  6 *  7 *  8 *  9 * 10 * 11 * 12 * 13 * 14 * 15
* 16 * 17 * 18 * 19 * 20 * 21 * 22 * 23 * 24 * 25 * 26 * 27 * 28 * 29 * 30
* 31 * 32 * 33 * 34 * 35 * 36 * 37 * 38 * 39 * 40 * 41 * 42 * 43 * 44 * 45
* 46 * 47 * 48 * 49 * 50 * 51 * 52 * 53 * 54 * 55 * 56 * 57 * 58 * 59 * 60
* 61 * 62 * 63 * 64 * 65 * 66 * 67 * 68 * 69 * 70
Regex & String Builder tests completed

 ok

With other systems, directory paths may need to be added to the lines that include files e.g.
s" filename" included
in files regex.fth and regextest.fth. This is system dependent as there is no ANS Forth standard way to specify directory paths.

The output from other systems will vary, ignore any redefinition messages (unless they cause an error). The last six lines are the important ones that prove the software works.

If a system fails with an error message during the tests, please send details (the regular expression, any output and the system used) to the contact email address below.

6. Using Regex

This section describes features relevant to Regex; it is not a primer or tutorial on regular expressions - there are many such resources available on the internet or books available for purchase. Knowledge of regular expressions is assumed.

Loading Regex

Assuming the directory paths in regex.fth are correct for your Forth system, simply type:

s" regex.fth" included

Compiling and matching regular expressions

These words are provided to compile regular expressions and to match strings against the compiled regular expression:

regex ( "ccc..." -- rgx ) compiles the following text as a regular expression until one of (?e), (?end) or the end of the line is reached. The multi-line free-format modifiers (?x) or (?x: can be used in the first line to compile a multi-line regular expression with # comments (note that this free-format mode is ignored inside a character class specification where white space and # are treated as characters in the usual way). It returns rgx which is the adddress of the compiled regular expression and which should be used as an input parameter in a subsequent match

regex$ ( caddr u -- rgx ) compiles the regular expression held as a string at (caddr u) returning the address of the compiled regular expression

parse-regex ( char "ccc<char>" -- rgx ) parse ccc in the input buffer delimited by the character char. If the delimiter is not present, it parses to the end of the line. The parsed text is treated as a regular expression and compiled to address rgx. This word enables a delimiter of choice to be used, for example a user may define:
: parse/ [char] / parse-regex ;
to delimit a regular expression with the character /

match ( caddr1 u1 rgx -- caddr u -1 | caddr1 u1 0 ) attempts to match the text at (caddr1 u1) against the compiled regular expression rgx. If a match is found (caddr u -1) is returned where (caddr u) is the rest of the input string following the leftmost matching string, otherwise the input string and 0 (caddr1 u1 0) are returned to indicate failure to match. (Note that the returned string may have length u = 0 if, say, the match was at the end of the input string. The matching string may be obtained using get-match. The specification of match was changed in version 0.5 onwards because experience using Regex showed that returning the rest of the string being searched is far more useful than returning the matching string.

Accessing Regex results

get-match ( -- caddr u -1 | 0 0 0 ) returns a matching string from the previous execution of match if a match was found, otherwise (0 0 0).

get-subex ( n -- caddr u ) returns the contents of the nth pair of capturing parentheses, where the count starts at 1 from the start of the regular expression. 0 get-subex is the same as get-match

(Note that substitute and head&tail were removed in version 0.2 as they have been superseded by the String Builder.)

Related words

regex-string ( caddr u "spaces<name>" -- )
creates an interpolated string object called name which, when executed, leaves that object on the stack ready for compilation into another regular expression. Typical use is within the (?$name) construct.

Notes on various features

No description is provided for most of the features listed above as they behave in a standard manner - examples of usage can be found in the test program provided. This section notes a few items that may not be standard.

Regex finds the longest leftmost match in the string being processed and is therefore does not have a traditional NFA engine using terms described in "Mastering Regular Expression" by Jeffrey Friedl (p 177). Therefore alternation is not ordered.
Capturing parentheses. The default configuration allows up to 9 pairs of capturing parentheses in any one regular expression. This number may be varied by changing the value of constant subex-limit in file regexmatch.fth
Embedded code. In (?{forth code}), the embedded forth code is a sequence of one or more Forth words. The embedded code can make free use of the data stack i.e. data present on the stack before match can be accessed and additional data added or removed from the stack (introduced in version 0.8). Restrictions on embedded Forth code are:
- The code cannot contain a } character because this is used to terminate the code (see below for a work around).
- The code cannot include a definition of any kind. This is because a sequence of Forth words is compiled as a :noname definition which cannot include other Forth definitions (see below for a work around).
To overcome the restrictions in the last two bullet points, put the offending sequence of Forth code words into a colon definition and call that in the embedded Forth. With such a single call Regex saves the xt without a :noname definition and executes that xt during a match.
End of line anchors. $ \z \Z treat both CRLF and LF as the end of the line.
Mode modifiers. Such as (?i) and friends operate until one of: the corresponding (?-i); the end of the regular expression; or until the end of the immediate enclosing parentheses.
The default modifiers (initially all off) are used at the start of a regular expression. These defaults may be changed prior to compilation of a regular expression by the use of: save-idefault save-mdefault save-sdefault save-xdefault, all with stack effect ( flag -- ). For example:
true save-idefault
sets the i modifier default value to on, making following regular expressions case insensitive
Named References. (Note this has been changed in version 0.4 and later versions)The name used in named capture (?<name>...) and named references \g{name} must have been previously declared using the defining word refname. e.g.
refname alice
Following a successful match, the contents of a named reference may be retrieved in Forth code (including embedded Forth code) by executing the name e.g. alice which will return the matched string as ( caddr u ) if matched or ( 0 0 ) otherwise.
The name must not be used as a 2variable as was the case in previous versions. If an invalid name is used the system will report an error.
The look behind feature can contain any valid regular expression items such as \w* which is of indeterminate length. Hence the NFA matching engine does not know how far back to start matching the look behind expression, hence it starts at the beginning of the subject string. With large strings this may be slow and so should be used carefully e.g
(?<=(\d+)\s+abc\s+def.*)\1\s+abc\s+\?(?=xyz)
is potentially slow as the look behind is applied first. On the other hand
(\d+)\s+abc\s+\?(?=xyz)(?<=\1\s+abc\s+def.*)
is fast as the look behind follows the match and look ahead.
Conditional expressions. The syntax is: (?(?<test><then-part>|<else-part>) The back and named reference tests return true if the reference has participated in the match, otherwise false. The value of the reference is not used. Conditional expressions may be nested indefinitely.
The embedded code in conditional expressions must leave true (i.e. non-zero) or false on the stack to indicate whether the then-part or else-part of the expression is to be matched.
Recursion is available within regular expressions where:
- (?R) means recurse to the beginning of the regular expression
- (?1) means recurse to the contents of the first pair of parentheses, (?2) to the second pair of parentheses and so on. (?0) is the same as (?R).
- (?+1) is a relative reference, +1 is the next pair of parentheses, (?+2) the pair after that and so on.
- (?-1) is a relative reference to the previous pair of parentheses, similarly (?-2) etc.
- (?&name) recurses to the named pair of parentheses. This may be a forward reference.
Note that if the parentheses referred to do not enclose the reference, then it is effectively a call to a subroutine. Examples of usage are given in the test program.
Interpolated variables uses non-Perl notation (?$name). The name must be either a regular expression previously returned by use of one of the regex family, or a string object which is compiled into the regular expression being compiled. The string object must have been defined using the defining word regex-string. The same interpolated object may be included as many times as desired - however it is more efficient on space to use a pre-compiled regular expression object as a string object is effectively a macro.
Capturing parentheses inside the interpolated object are numbered consecutively from the enclosing regular expression and may be referred to in the usual way.
Dynamic regular expressions using (??{forth code sequence are compiled during a match using the enclosing regular expression. The forth code sequence must return a regular expression and must therefore include one of the regex family or be a reference to a regular expression object (the latter case is the same as an interpolated regular expression object). It is not possible to recurse into a dynamic regular expression from the enclosing expression.

Character set

This version of Regex has been tested with the 7 bit ASCII character set. Limited tests have been carried out with 8 bit characters and there should be no problems. The character size can be changed to 8 bit by setting the constant maxchar to 255 in file regex.fth. Do not set it larger, e.g. for 16 bit characters, as character classes will then use large amounts of memory.

7. String Builder

This has been introduced to the Regex package to simplify using the results from matching a regular expression. The String Builder has syntax in this form (alternatives use stringer and parse-stringer - see below):

            pattern-string stringer$

where the pattern-string is a string containing a sequence of:

characters which are simply copied through to the concatenation buffer
metacharacters and escaped characters similar to those in regular expressions e.g. \n to copy a line-feed into the concatenation buffer
references to results from the last regular expression match e.g. \1 for the contents of the first pair of capturing parentheses. Also references to Forth variables. These values are then copied to the concatenation buffer.
commands which usually follow the Perl regular expression form of (?x...) e.g. (?-c) to stop the concatenation buffer from being cleared. Note that this is a change between versions 0.2 and 0.3. The commands include a sprintf like formatter for number conversion.

The word stringer$ parses the pattern-string and acts on the contents to build up a string in the concatenation buffer. The typical use is to replace the matching text from a regular expression match with some other text. Note that the necessary code to read/write strings from/to files is outside the scope of the Regex package at present - these operations can clearly be written in standard Forth. A possible future extension could handle some of this in the String Builder command language.

For examples of usage refer to the test program included in the download.

String Builder words

stringer ( "ccc..." -- caddr u ) Accepts the pattern-string from the input source until terminated by the end of line or (?e) in free-layout mode. The concatenation buffer is cleared, unless the command (?-c) is first in pattern-string. The pattern-string is then parsed and acted upon according to the contents. The string built up in the concatenation buffer is returned as (caddr u) on completion, unless the command (?-g) has been used.

stringer$ ( caddr u -- caddr2 u2 ) As for stringer except that the pattern-string is already available as (caddr u).

parse-stringer ( char -- caddr u ) Parse the input until the delimiting character. The parsed string is then used as the pattern-string and stringer$ is then executed.

Related words

stringify ( -- caddr u) Accepts text from the input source in free-layout mode until an (?end) or (?e) command is encountered. White space, including new line characters, are stripped out and the remaining characters concatenated into a single string. The (caddr u) pair of the resulting string is returned to the caller. The aim of this word is to create a pattern string from a readable free-layout representation that may be used inside a colon definition; alternatively to easily create strings containing 'difficult' characters such as ". Use the word constant$ to save the string (see below).

The following points apply to stringify:

An initial free-layout command (?x) is not needed at the start of the free-layout text - if present it is ignored and not copied to the resulting string.
After the start of the text the commands (-?x) and (?x) are obeyed to switch free-layout mode off/on respectively. The (-?x) and (?x) commands are not copied to the resulting string.
If free-layout mode is switched off using (?-x) the concatenation of text is terminated at the end of the current line; a terminating (?end) or (?e) is not required in this case. Free-layout mode may be switched on again with (?x) before the end of the line.
The concatenation buffer is used, therefore any user text in there will be overwritten unless the first command is (?-c) when all text will be appended to the contents of the buffer. The first (?-c) is not copied to the resulting string. Subsequent occurrences of (?c) and (?-c) are not acted upon, but simply copied to the buffer.
Before the resulting string is used it should be saved elsewhere e.g. by using constant$
All other stringer commands not already described are not acted upon, but simply copied to the buffer. This is true for the spanning commands (?x: ... ) and (?-x: ... )

constant$ ( caddr u "spaces<name>" -- ) copies the string specified by ( caddr u ) to dataspace at caddr2, creates a constant called <name> that, when executed, places (caddr2 u) on the stack. This word is included for use with stringify.

Metacharacters and Escaped Characters

These are preceded by the \ character and append the following character or value to the concatenation buffer. The list of these is:

            \\  appends character \     \(  appends character (
            \)          character )     \$          character $
            \.          character .     \{          character {
            \}          character }     \#          character #
            \|          character |     \%          character %

            \a          value 7         \e          value 27
            \f          value 12        \n          value 10
            \r          value 13        \s          value 32
            \t          value 9         \v          value 11
            \0          value 0
            \cX         value (following character mod 32)
                        e.g. \cH appends CTRL H value 8 as does \ch and \c(

References

The pattern string can include various references to the results from the last regular expression match; these append the associated value to the concatenation buffer. The types of reference are:

\1 for the contents of the first capturing parentheses, \2 for the second, and so on. These are equivalent to $1, $2 etc in Perl.
\H refers to the part of the string before the match
\M is the matching string itself
\T refers to the part of the string after the match
\g{name} refers to a named capture where name has been defined by refname (changed in version 0.4 onwards).

Commands

The commands that can be included in the pattern string are:

(?c) and (?-c), where (?c) clears the concatenation buffer and (?-c) means do not clear the buffer. (?-c) is only useful as the first command in a pattern string to retain the results from a previous use of the string builder. By default the concatenation buffer is cleared so the use of (?c) is redundant.
(?g) and (?-g), where (?g) means return the contents of the concatenation buffer as (caddr u) at the end of the pattern string, and (?-g) means do not return the string. By default (caddr u) is returned so that the use of (?g) is redundant.
(?x) and (?-x) to indicate free-form layout in exactly the same way as for regular expressions, (?e) or (?end) is used to terminate the pattern string.
(?{forth-code}) to execute the enclosed forth-code. As shown in the test program this may be used to match regular expressions. It can also be used to insert text into the concatenation buffer e.g.
s" Name: (?{s" Mary" concat}) Smith" stringer$
would return the string
Name: Mary Smith
\U \L \u \l \E for case folding. These behave in exactly the same way as in regular expressions.
(?(<test>)<then-part>|<else-part>) conditional expression where <test> may be one of:
- a reference number such as 1 etc which returns TRUE if the corresponding capturing parentheses participated in the match.
- a named reference which returns TRUE if the corresponding capturing parentheses participated in the match.
- Forth code using the usual syntax (?{forth code sequence}) where the forth code must return FALSE or a non-zero value taken as TRUE.
If the test returns TRUE the <then-part> is interpreted otherwise the <else-part> is interpreted. The <then-part> and <else-part>'s can be any valid String Builder sequence, hence conditionals may be nested.
%<flags><minwidth><precision><long><conversion> or
(?%<flags><minwidth><precision><long><conversion>)
format specification for converting numbers to strings. This is the sprintf functionality provided in several programming languages (e.g. C) and this follows the usual specifications for such functionality:

optional <flags> may be any of - for left justification, 0 to use 0 as a padding character, + to include a + sign for positive numbers, space to leave a space for a positive number, # for a variant of the main operation.
The flag characters may be in any order

optional <minwidth> for a minimum field width, may be a number or * to indicate the width is on the stack.

optional <precision> expressed as a . followed by a number or * to indicate the width is on the stack.

optional l to indicate a double length number.

required conversion operation which may be:
d u c s x X o % to indicate single integer, unsigned integer, character, string, hexadecimal using abcdef for x or ABCDEF for X, octal or the % character respectively.

The value converted (except for %) is taken from the top of the data stack in the order in which they appear in the format specification.

Note that minimum width and precision numbers are read as decimal numbers whatever the value of the Forth variable BASE. If * is used for either of these, the corresponding number must be above the other arguments on the stack

The free-layout command (?x) has no effect on a format specification i.e. white space, other than the space flag, is not permitted.

Floating point conversions using the f e E g G characters remain to be done. A useful addition might be conversion using the current value of the Forth word BASE.

Consult a C reference for the detailed semantics for format conversions and see the test program for examples (where the alternative syntax is mostly used, but either will do).

Future extensions and changes

There is plenty of scope to add to the set of commands in the String Builder and this will probably happen as experience is gained in its use. Examples are the addition of other control structures e.g. loops but it is unclear where the most cost-effective boundary lies between doing this in the pattern string or in Forth.

The floating point options of the format command are outstanding.

For commands the String Builder has kept mostly to the regular expression syntax of Perl. This may change in future versions.

Concatenation buffer

A simple concatenation buffer is provided to build up replacement strings for text matched with regular expressions. Initially the buffer is set to a size of 1024 bytes. New and alternative buffers may be created and used using the words below. Words operate on a current concatenation buffer.

new-concat-buf ( size -- buf ) Creates a new concatenation buffer with size bytes. To make it the current concatenation buffer set value concat-buf to the returned buf.

concat-buf a value holding the current concatenation buffer. It can be set to an alternative buffer by e.g.
10000 new-concat-buf to concat-buf
By manipulating concat-buf several buffers may be operative simultaneously.

clear-concat ( -- ) Empties the current concatenation buffer.

concat ( caddr u -- ) Appends the string (caddr u) to the contents of the current concatenation buffer

concat-char ( char -- ) Appends a character to the current concatenation buffer.

get-concat ( -- caddr u )Returns the contents of the current concatenation buffer

For an example of use, see the test program.

The size of the default concatenation buffer is set by constant cb-size in file regex.fth and therefore may be changed by the user.

The concatenation buffer may be deleted with:
concat-buf delete
but do not use that buffer again as results will be unpredicatable.

Further String Processing

For more complex string processing a package such as David Williams' Dynamic Strings Package can be used, where embedded code can be inserted into the pattern string to call words in the strings package.

8. Features for Future Versions

These features may be added in future versions, but not necessarily in this order:

Integrate with a string handling library package
Optimisation of regular expressions
Better memory management e.g. deletion of regular expressions
Grouped modifiers e.g.(?i-msx)
Unicode support

9. Features excluded

There are no plans to include the following features:

Octal escapes \num
In-line comments (?#...)
Support for Forth systems with cells smaller than 32 bits.

10. Implementation Notes

The regular expression engine is based on the ideas given in the paper Regular Expression Matching Can Be Simple And Fast by Russ Cox.

The regular expression and string builder parsers were generated by Grace from BNF specifications. In all there are three separate parsers in the package.

The state transition tables were generated by LexGen. There are four tables compacted into a single table.

11. ANS Forth compliancy

Regex requires a case-insensitive ANS Forth system. The following words from the various ANS Forth word sets are used.

The Core word set: ! #> #s ' ( * + +! +loop , - . ." / /mod 0< 0= 1+ 1- 2! 2@ 2drop 2dup 2over 2swap : ; < <# = > >body >in >number >r ?dup @ abort" abs align aligned allot and base begin bl c! c, c@ cell+ cells char char+ chars constant count cr create decimal depth do does> drop dup else emit environment? evaluate execute exit find here hold i if immediate invert j literal loop lshift max min mod move negate or over postpone r> r@ recurse repeat rot rshift s" source space spaces swap then type u< unloop until variable while word xor [ ['] [char] ]
The Core extension word set: .( .r 0<> 0> 2>r 2r> 2r@ :noname <> ?do again case compile, endcase endof false hex nip of pad parse pick refill roll to true tuck u> value within \
The Double-Number word set: 2constant 2literal 2variable d= dabs
The File-Access word set: included
The Memory-Allocation word set: allocate free
The Programming-Tools word set: .s
The Programming-Tools extension word set: [if] [else] [then]
The String word set: /string cmove compare

12. Contact

Any comments, feedback (good or bad), error reports etc will be gratefully received, email Gerry Jackson.

13. Version history

v0.11 2 May 2011

Added:

String Builder words concat-buf and new-concat-buf

v0.10 2 March 2011

Added:

Interpolated variables (?$name).
Dynamic regular expressions (??{Forth code sequence})
Relative back references \g{-n}
Literal character span \Q...\E
Alternative capture group numbering (?|...|...)

v0.9 5 January 2011

Recursive regular expressions added.
Sub-expression bug fix

v0.8 18 November 2010

Embedded Forth code now has free use of the data stack
User value of BASE preserved across compilation of the Regex package
Minor changes to work with 64 bit Forths (untested)
Bug fix in String Builder to allow valid sub-expression references
Code rationalisation with look ahead, look behind and related code

v0.7 11 August 2010

Words stringify and constant$ added.
Embedded Forth code now has access to the contents of capturing parentheses up to the point of executing the Forth code.
Embedded Forth code must now leave the stack unchanged.
Correction to stringer documentation. Conditional test references to capturing parentheses are (1) not (\1).
Bug fix: Forth code sequences more than 32 characters long are now permitted in embedded code.
Bug fix: \# in x mode now recognised as escaped #.
Bug fix: x mode is now ignored inside character classes.

v0.6 31 July 2010

Lazy quantifiers {n,m}? {n,}? {n}? added
Bug fix: Lazy quantifiers now return the correct string for enclosed capturing parentheses

v0.5 27 July 2010

Lazy quantifiers ?? *? +? added
Atomic groups and possessive quantifiers added
Specification of MATCH changed to return the rest of the subject string
Early low level tests removed from the test program as high level tests exercise the same functionality

v0.4 10 June 2010

Conditional expressions added to regular expressions.
Named references no longer use a 2variable.
Regular expressions such as a(b*)*c no longer crash.

v0.3 24 May 2010

String Builder extensions:
- conditional command
- formatted number conversions (like sprintf)
Empty look around permitted in regular expressions, in particular (?!) always fails to match

v0.2 7 May 2010

String Builder added.

v0.1 23 April 2010

Specification of the word match changed.
Bug fix, e.g. (?:a|b){3,5} now works correctly
List handling during compilation simplified
Less stack noise during execution of the regex NFA engine
File regexutils.fth added to provide words such as substitute and a simple string concatenation buffer.

v0.0 2 April 2010

Initial development release

14. Some ANS Forth Software

Home

Warning: include(regexcounter.php): Failed to open stream: No such file or directory in /srv/disk2/1154256/www/qlikz.org/forth/regex.html on line 1184

Warning: include(): Failed opening 'regexcounter.php' for inclusion (include_path='.:/usr/local/php-8.2.28/share/pear') in /srv/disk2/1154256/www/qlikz.org/forth/regex.html on line 1184