Regex is an ANS Forth program to compile regular expressions written in the conventional format and to match ASCII text against them. The regular expression flavour is Perl-like. This release is rich in features but is still in development with features to be added. The package includes a String Builder module to make use of the results from a successful regular expression match. Regex is intended for use on 32 bit or larger desktop computers rather than 16 bit or embedded processors and so little attention has been paid to code size - ease of development and maintenance have been the main drivers. To this end the tools LexGen and Grace have been used for the regular expression scanner and parsers and an object oriented approach has been used in the design. The availability of this program has not yet been publicly announced, if you stumble upon it you are welcome to use it but be warned that it is still under development so various things may change until version 1.0 is released.
The features provided are:
abc
a|b
? * + {n,m} {n,}
and {n}
for specification of repetition
*? +? ?? {n,m}? {n,}?
and
{n}?
*+ ++ ?+ {n,m}+ {n,}+
and
{n}+
[abc]
or [a-zA-Z]
[^abc]
or
[^a-zA-Z]
\n
using .
\{ \} \[ \( \) \. \| \* \+ \? \] \^
\$ \\
\a \e \f \n \r \t \v \0 \bl \#
\b \- \] \^ \$
\\
\xF \xFF \x{F...}
\cH
for CTRL-H
(a(bc)def)gh
(?:abc)
(?<alice>...)
^ $ \A \z \Z
\w \W \d \D \s \S
both
inside and outside character classes
\1 \2
etc for absolute numbered references\g{-1} \g{-2}
etc for relative numbered references\g{alice}
for named references
(?{forth code})
(?(1)abc|def)
(?(name)abc|def)
(?(?{forth-code})abc|def)
(?(?=xyz)abc|def)
(?(?!xyz)abc|def)
(?(?<=xyz)abc|def)
(?(?<!xyz)abc|def)
(?i) (?-i) (?i:...) (?-i:...)
for case
(in)sensitivity(?m) (?-m) (?m:...) (?-m:...)
for enhanced line anchor
match mode(?s) (?-s) (?s:...) (?-s:...)
for "dot matches all"
mode(?x) (?-x) (?x:...) (?-x:...)
for free-form layout and
comments mode terminated with (?e)
or (?end)
.
#
is used for comments running to the end of the current
line(?=...) (?!...)
and look behind
(?<=...) (?<!...)
(?>...)
\U \L ... \E
\u \l
\Q...\E
(?R) (?n) (?+n) (?-n)
(?&name)
(?$name)
(??{Forth code sequence
where
the Forth code sequence returns a regular expression.
(?|...|...|...)
For String Builder features see String Builder.
Regex 0.11 has been tested successfully on 32 bit versions of GForth, SwiftForth, VFX Forth, Win32 Forth, iForth and BigForth, all running under MS Windows XP and Windows 7 (32 bit version) and is, therefore, likely to be ANS Forth compliant. Version 0.2 was reported as working on GForth under Mac OSX and Linux, and on PFE; there is no reason, other than pessimism, to suppose the latest version won't work on these systems. Versions before 0.8 would not work with 64 bit Forths, version 0.8 worked with 64 bit GForth under Linux, version 0.11 is untested on Linux. Feedback on the use of Regex with other systems will be welcomed.
Use of Regex using Forths with cells smaller than 32 bits is not supported.
Regex version 0.11 released 2 March 2011. This contains the following files:
Regex files:
xmini_oof.fth
- a library file
sets.fth
- a library file
regex.fth
- the Regex and String Builder source code file
Testing:
regextest.fth
- the test program
Documentation:
regex.html
- this file
Note that, from version 0.3 onwards, Regex and String Builder files have been concatenated until one large file - regex.fth. Similarly regextest.fth. The state transition tables and parsers were generated by LexGen and Grace in an unformatted mode and hence are unreadable (being computer generated, they were never intended to be readable anyway).
This is configured to work with GForth 0.7.0. To test Regex with GForth,
unzip the files into a suitable directory under GForth called, say,
regex
. Then type:
s" regex/regextest.fth" included
or similar, which should give the following output:
Gforth 0.7.0, Copyright (C) 1995-2008 Free Software Foundation, Inc. Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license' Type `bye' to exit s" regex/regextest.fth" included Loading extended mini oof ... Extended mini oof loaded. <0> Loading Sets.fth ... redefined cell redefined decode Sets.fth loaded. <0> redefined 1Cell with 1cel l redefined noop redefined ~ redefined testsym? redefined this-parser redef ined ~ redefined testsym? redefined >string redefined Format redefined <meta char> redefined <forth> redefined <group> redefined <escapedchar> redefined <reference> redefined <conditional> redefined <modemod> redefined <casefolder > redefined <modifier> redefined this-parser redefined ~ Regex version 0.11 Copyright (C) Gerry Jackson 2010, 2011 redefined TESTING with Testing Running Regex tests * 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10 * 11 * 12 * 13 * 14 * 15 * 16 * 17 * 18 * 19 * 20 * 21 * 22 * 23 * 24 * 25 * 26 * 27 * 28 * 29 * 30 * 31 * 32 * 33 * 34 * 35 * 36 * 37 * 38 * 39 * 40 * 41 * 42 * 43 * 44 * 45 * 46 * 47 * 48 * 49 * 50 * 51 * 52 * 53 * 54 * 55 * 56 * 57 * 58 * 59 * 60 * 61 * 62 * 63 * 64 * 65 * 66 * 67 * 68 * 69 * 70 Regex & String Builder tests completed ok
With other systems, directory paths may need to be added to the lines
that include files e.g.
s" filename" included
in files regex.fth
and regextest.fth
. This is
system dependent as there is no ANS Forth standard way to specify
directory paths.
The output from other systems will vary, ignore any redefinition messages (unless they cause an error). The last six lines are the important ones that prove the software works.
If a system fails with an error message during the tests, please send details (the regular expression, any output and the system used) to the contact email address below.
This section describes features relevant to Regex; it is not a primer or tutorial on regular expressions - there are many such resources available on the internet or books available for purchase. Knowledge of regular expressions is assumed.
Assuming the directory paths in regex.fth
are correct for
your Forth system, simply type:
s" regex.fth" included
These words are provided to compile regular expressions and to match
strings against the compiled regular expression:
regex ( "ccc..." -- rgx )
compiles the following text as a
regular expression until one of (?e)
, (?end)
or
the end of the line is reached. The multi-line free-format modifiers
(?x)
or (?x:
can be used in the first line to
compile a multi-line regular expression with # comments (note that this
free-format mode is ignored inside a character class specification where
white space and # are treated as characters in the usual way). It returns
rgx
which is the adddress of the compiled regular expression
and which should be used as an input parameter in a subsequent
match
regex$ ( caddr u -- rgx )
compiles the regular expression
held as a string at (caddr u)
returning the address of the
compiled regular expression
parse-regex ( char "ccc<char>" -- rgx )
parse ccc in
the input buffer delimited by the character char
. If the
delimiter is not present, it parses to the end of the line. The parsed
text is treated as a regular expression and compiled to address
rgx
. This word enables a delimiter of choice to be used, for
example a user may define:
: parse/ [char] / parse-regex ;
to delimit a regular expression with the character /
match ( caddr1 u1 rgx -- caddr u -1 | caddr1 u1 0 )
attempts
to match the text at (caddr1 u1)
against the compiled
regular expression rgx
. If a match is found (caddr u
-1)
is returned where (caddr u)
is the rest of the
input string following the leftmost matching string, otherwise the input
string and 0 (caddr1 u1 0)
are returned to indicate failure
to match. (Note that the returned string may have length u
=
0 if, say, the match was at the end of the input string. The matching
string may be obtained using get-match
. The specification of
match
was changed in version 0.5 onwards because experience using
Regex showed that returning the rest of the string being searched is far
more useful than returning the matching string.
get-match ( -- caddr u -1 | 0 0 0 )
returns a matching
string from the previous execution of match
if a match was
found, otherwise (0 0 0)
.
get-subex ( n -- caddr u )
returns the contents of the nth
pair of capturing parentheses, where the count starts at 1 from the start
of the regular expression. 0 get-subex
is the same as
get-match
(Note that substitute
and head&tail
were
removed in version 0.2 as they have been superseded by the String
Builder.)
regex-string ( caddr u "spaces<name>" -- )
creates
an interpolated string object called name
which, when
executed, leaves that object on the stack ready for compilation into
another regular expression. Typical use is within the (?$name)
construct.
No description is provided for most of the features listed above as they behave in a standard manner - examples of usage can be found in the test program provided. This section notes a few items that may not be standard.
subex-limit
in
file regexmatch.fth
(?{forth code})
, the embedded forth
code is a sequence of one or more Forth words. The embedded code can make
free use of the data stack i.e. data present on the stack before
match
can be accessed and additional data added or removed
from the stack (introduced in version 0.8). Restrictions on embedded
Forth code are:
$ \z \Z
treat both CRLF and LF as
the end of the line.(?i)
and friends operate until
one of: the corresponding (?-i)
; the end of the regular
expression; or until the end of the immediate enclosing
parentheses.save-idefault
save-mdefault save-sdefault save-xdefault
, all with stack effect
( flag -- )
. For example: true save-idefault
(?<name>...)
and named references
\g{name}
must have been previously declared using the
defining word refname
. e.g.refname alice
alice
which will return the matched string
as ( caddr u ) if matched or ( 0 0 ) otherwise.2variable
as was the case
in previous versions. If an invalid name is used the system will report
an error.(?<=(\d+)\s+abc\s+def.*)\1\s+abc\s+\?(?=xyz)
(\d+)\s+abc\s+\?(?=xyz)(?<=\1\s+abc\s+def.*)
(?R)
means recurse to the beginning of the regular expression
(?1)
means recurse to the contents of the first pair of
parentheses, (?2)
to the second pair of parentheses and so on.
(?0)
is the same as (?R)
.
(?+1)
is a relative reference, +1 is the next pair of
parentheses, (?+2)
the pair after that and so on.
(?-1)
is a relative reference to the previous pair of
parentheses, similarly (?-2)
etc.
(?&name)
recurses to the named pair of parentheses. This
may be a forward reference.
(?$name)
. The name
must be either a regular expression previously returned by use of one of the
regex
family, or a string object which is compiled into the regular
expression being compiled. The string object must have been defined using the
defining word regex-string
. The same interpolated object may be
included as many times as desired - however it is more efficient on space to
use a pre-compiled regular expression object as a string object is effectively
a macro.(??{forth code sequence
are compiled during a match
using the enclosing regular
expression. The forth code sequence must return a regular expression and
must therefore include one of the regex
family or be a reference
to a regular expression object (the latter case is the same as an interpolated
regular expression object). It is not possible to recurse into a dynamic
regular expression from the enclosing expression.
This version of Regex has been tested with the 7 bit ASCII character set.
Limited tests have been carried out with 8 bit characters and there
should be no problems. The character size can be changed to 8 bit by
setting the constant maxchar
to 255 in file
regex.fth
. Do not set it larger, e.g. for 16 bit characters,
as character classes will then use large amounts of memory.
This has been introduced to the Regex package to simplify using the
results from matching a regular expression. The String Builder has syntax
in this form (alternatives use stringer
and
parse-stringer
- see below):
pattern-string stringer$
where the pattern-string
is a string containing a sequence
of:
\n
to copy a line-feed into the
concatenation buffer
\1
for the contents of the first pair of capturing
parentheses. Also references to Forth variables. These values are then
copied to the concatenation buffer.
(?x...)
e.g. (?-c)
to stop the concatenation
buffer from being cleared. Note that this is a change between versions
0.2 and 0.3. The commands include a sprintf
like formatter
for number conversion.
The word stringer$
parses the pattern-string
and acts on the contents to build up a string in the concatenation
buffer. The typical use is to replace the matching text from a regular
expression match with some other text. Note that the necessary code to
read/write strings from/to files is outside the scope of the Regex
package at present - these operations can clearly be written in standard
Forth. A possible future extension could handle some of this in the
String Builder command language.
For examples of usage refer to the test program included in the download.
stringer ( "ccc..." -- caddr u )
Accepts the
pattern-string
from the input source until terminated by the
end of line or (?e) in free-layout mode. The concatenation buffer is
cleared, unless the command (?-c) is first in
pattern-string
. The pattern-string
is then
parsed and acted upon according to the contents. The string built up in
the concatenation buffer is returned as (caddr u) on completion, unless
the command (?-g) has been used.
stringer$ ( caddr u -- caddr2 u2 )
As for
stringer
except that the pattern-string
is
already available as (caddr u)
.
parse-stringer ( char -- caddr u )
Parse the input until the
delimiting character. The parsed string is then used as the
pattern-string
and stringer$
is then executed.
stringify ( -- caddr u)
Accepts text from the input source
in free-layout mode until an (?end)
or (?e)
command is encountered. White space, including new line characters, are
stripped out and the remaining characters concatenated into a single
string. The (caddr u)
pair of the resulting string is
returned to the caller. The aim of this word is to create a pattern
string from a readable free-layout representation that may be used inside
a colon definition; alternatively to easily create strings containing
'difficult' characters such as "
. Use the word
constant$
to save the string (see below).
The following points apply to stringify
:
(?x)
is not needed at the
start of the free-layout text - if present it is ignored and not copied
to the resulting string.
(-?x)
and
(?x)
are obeyed to switch free-layout mode off/on
respectively. The (-?x)
and (?x)
commands are
not copied to the resulting string.
(?-x)
the
concatenation of text is terminated at the end of the current line; a
terminating (?end)
or (?e)
is not required in
this case. Free-layout mode may be switched on again with
(?x)
before the end of the line.
(?-c)
when
all text will be appended to the contents of the buffer. The first
(?-c)
is not copied to the resulting string. Subsequent
occurrences of (?c)
and (?-c)
are not acted
upon, but simply copied to the buffer.
constant$
stringer
commands not already described are
not acted upon, but simply copied to the buffer. This is true for the
spanning commands (?x: ... )
and (?-x: ... )
constant$ ( caddr u "spaces<name>" -- )
copies the
string specified by ( caddr u )
to dataspace at
caddr2
, creates a constant called <name>
that, when executed, places (caddr2 u)
on the stack. This
word is included for use with stringify
.
These are preceded by the \
character and append the
following character or value to the concatenation buffer. The list of
these is:
\\ appends character \ \( appends character (
\) character ) \$ character $
\. character . \{ character {
\} character } \# character #
\| character | \% character %
\a value 7 \e value 27
\f value 12 \n value 10
\r value 13 \s value 32
\t value 9 \v value 11
\0 value 0
\cX value (following character mod 32)
e.g. \cH appends CTRL H value 8 as does \ch and \c(
The pattern string can include various references to the results from the last regular expression match; these append the associated value to the concatenation buffer. The types of reference are:
\1
for the contents of the first capturing parentheses,
\2
for the second, and so on. These are equivalent to $1,
$2 etc in Perl.
\H
refers to the part of the string before the match\M
is the matching string itself\T
refers to the part of the string after the match
\g{name}
refers to a named capture where name
has been defined by refname
(changed in version 0.4
onwards).
The commands that can be included in the pattern string are:
(?c)
and (?-c)
, where (?c)
clears the concatenation buffer and (?-c)
means do not
clear the buffer. (?-c)
is only useful as the first
command in a pattern string to retain the results from a previous use
of the string builder. By default the concatenation buffer is cleared
so the use of (?c)
is redundant.
(?g)
and (?-g)
, where (?g)
means
return the contents of the concatenation buffer as (caddr
u)
at the end of the pattern string, and (?-g)
means do not return the string. By default (caddr u)
is
returned so that the use of (?g)
is redundant.
(?x)
and (?-x)
to indicate free-form layout
in exactly the same way as for regular expressions, (?e)
or (?end)
is used to terminate the pattern string.
(?{forth-code})
to execute the enclosed forth-code. As
shown in the test program this may be used to match regular
expressions. It can also be used to insert text into the concatenation
buffer e.g.s" Name: (?{s" Mary" concat}) Smith" stringer$
Name: Mary Smith
\U \L \u \l \E
for case folding. These behave in exactly
the same way as in regular expressions.
(?(<test>)<then-part>|<else-part>
)
conditional expression where <test>
may be one
of:TRUE
if the corresponding capturing
parentheses participated in the match.
TRUE
if the
corresponding capturing parentheses participated in the match.
(?{forth code
sequence})
where the forth code must return FALSE
or a non-zero value taken as TRUE
.
TRUE
the <then-part> is
interpreted otherwise the <else-part> is interpreted. The
<then-part> and <else-part>'s can be any valid String
Builder sequence, hence conditionals may be nested.
%<flags><minwidth><precision><long><conversion>
or(?%<flags><minwidth><precision><long><conversion>)
-
for left
justification, 0
to use 0 as a padding character,
+
to include a + sign for positive numbers,
space
to leave a space for a positive number,
#
for a variant of the main operation.*
to indicate the width is on the stack..
followed by a
number or *
to indicate the width is on the stack.l
to indicate a double length number.d u c s x X o %
to indicate single integer, unsigned
integer, character, string, hexadecimal using abcdef
for
x
or ABCDEF
for X
, octal or the
%
character respectively.BASE
. If
*
is used for either of these, the corresponding number
must be above the other arguments on the stackf e E g G
characters
remain to be done. A useful addition might be conversion using the
current value of the Forth word BASE
.There is plenty of scope to add to the set of commands in the String Builder and this will probably happen as experience is gained in its use. Examples are the addition of other control structures e.g. loops but it is unclear where the most cost-effective boundary lies between doing this in the pattern string or in Forth.
The floating point options of the format command are outstanding.
For commands the String Builder has kept mostly to the regular expression syntax of Perl. This may change in future versions.
A simple concatenation buffer is provided to build up replacement strings
for text matched with regular expressions. Initially the buffer is set to
a size of 1024 bytes. New and alternative buffers may be created and used
using the words below. Words operate on a current concatenation buffer.
new-concat-buf ( size -- buf )
Creates a new concatenation
buffer with size
bytes. To make it the current concatenation
buffer set value concat-buf
to the returned buf
.
concat-buf
a value holding the current concatenation buffer.
It can be set to an alternative buffer by e.g.
10000 new-concat-buf to concat-buf
By manipulating concat-buf
several buffers may be operative
simultaneously.
clear-concat ( -- )
Empties the current concatenation buffer.
concat ( caddr u -- )
Appends the string (caddr
u)
to the contents of the current concatenation buffer
concat-char ( char -- )
Appends a character to the current
concatenation buffer.
get-concat ( -- caddr u )
Returns the contents of the
current concatenation buffer
For an example of use, see the test program.
The size of the default concatenation buffer is set by constant
cb-size
in file regex.fth
and therefore may be
changed by the user.
The concatenation buffer may be deleted with:
concat-buf delete
but do not use that buffer again as results will be unpredicatable.
For more complex string processing a package such as David Williams' Dynamic Strings Package can be used, where embedded code can be inserted into the pattern string to call words in the strings package.
These features may be added in future versions, but not necessarily in this order:
(?i-msx)
There are no plans to include the following features:
\num
(?#...)
The regular expression engine is based on the ideas given in the paper Regular Expression Matching Can Be Simple And Fast by Russ Cox.
The regular expression and string builder parsers were generated by Grace from BNF specifications. In all there are three separate parsers in the package.
The state transition tables were generated by LexGen. There are four tables compacted into a single table.
Regex requires a case-insensitive ANS Forth system. The following words from the various ANS Forth word sets are used.
! #> #s ' ( * + +! +loop , - . ." /
/mod 0< 0= 1+ 1- 2! 2@ 2drop 2dup 2over 2swap : ; < <# = >
>body >in >number >r ?dup @ abort" abs align aligned allot
and base begin bl c! c, c@ cell+ cells char char+ chars constant count cr
create decimal depth do does> drop dup else emit environment? evaluate
execute exit find here hold i if immediate invert j literal loop lshift
max min mod move negate or over postpone r> r@ recurse repeat rot
rshift s" source space spaces swap then type u< unloop until variable
while word xor [ ['] [char] ]
.( .r 0<> 0> 2>r
2r> 2r@ :noname <> ?do again case compile, endcase endof false
hex nip of pad parse pick refill roll to true tuck u> value within
\
2constant 2literal 2variable d=
dabs
included
allocate free
.s
[if] [else]
[then]
/string cmove compare
Any comments, feedback (good or bad), error reports etc will be gratefully received, email Gerry Jackson.
v0.11 2 May 2011
Added:
concat-buf and
new-concat-buf
v0.10 2 March 2011
Added:
(?$name)
.
(??{Forth code sequence})
\g{-n}
\Q...\E
(?|...|...)
v0.9 5 January 2011
v0.8 18 November 2010
v0.7 11 August 2010
stringify
and constant$
added.
(1)
not (\1)
.
v0.6 31 July 2010
{n,m}? {n,}? {n}?
added
v0.5 27 July 2010
?? *? +?
added
MATCH
changed to return the rest of the
subject string
v0.4 10 June 2010
2variable
.
a(b*)*c
no longer crash.
v0.3 24 May 2010
(?!)
always fails to match
v0.2 7 May 2010
v0.1 23 April 2010
match
changed.
regexutils.fth
added to provide words such as
substitute
and a simple string concatenation buffer.
v0.0 2 April 2010