module pcre

PCREs, or Perl-compatible regular expressions, are the de facto standard in regular expression text processing. This module exposes a regular expression class which wraps the libpcre library. Not all features are supported at this time, but more may be added in the future.

Prerequisites

This module loads libpcre dynamically when it's first imported. On Windows, this can be libpcre.dll or pcre.dll; on Linux, libpcre.so.3 or libpcre.so; and on OS X, libpcre.dylib. The shared library is loaded from the platform's usual library search locations.

The loaded library must be libpcre version 7.4 or higher, built with UTF-8 support. Support for Unicode Properties isn't necessary, just UTF-8. After loading libpcre, this module checks that it meets these requirements. If the shared library is not suitable, a RuntimeError will be thrown.

Windows: building libpcre manually is kind of a pain, and as far as I know the only project that provides binaries of libpcre is GnuWin32, but for some reason they haven't built a version of it since 7.0 in 2007. To save you the hassle, I've compiled a compatible DLL of 7.4, available here. (Note: you must have the VC++2008 redist installed for this DLL to work. This is a very tiny download and fast install.)

Including this library in the host

To use this library, the host must have it compiled into it. Compile Croc with the CROC_PCRE_ADDON option enabled in the CMake configuration. Then, from your host, when setting up the VM use the croc_vm_loadAddons or croc_vm_loadAllAvailableAddons API functions to load this library into the VM. Then from your Croc code, you can just import pcre to access it.

class Regex

Wraps a PCRE regex object.

Regex.this(pattern: string, attrs: string = '')

Compiles a regular expression.

Params:
pattern
is the regular expression to be compiled. See the PCRE documentation for the syntax.
attrs
is a string containing attributes with which to compile this regex. It can contain any of the following characters, in any order:
'i'
Case-insensitive. Any literal characters or character classes will match either case.
's'
The dot pattern will match all characters including newlines (which it normally doesn't).
'm'
Multiline. Normally the ^ and $ patterns will match the beginning and end of the string. With this modifier, they will match the beginning and end of each line in the subject string.

Throws:
StateError
if you attempt to call this constructor on an already-initialized object.
ValueError
if the pattern could not be compiled.
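
As a minimal sketch of how the attribute string combines flags (the pattern and subject here are made up for illustration, and the module is assumed to be loaded into the VM as described above):

```croc
import pcre

// 'i' ignores case; 'm' lets ^ and $ match at the start and end of each line.
local re = pcre.Regex(@"^hello$", "im")

// The second line of the subject is "HELLO", so this should print true.
writefln("{}", re.test("say\nHELLO\nagain"))
```

Without 'm', the same pattern would only match a subject consisting of the single word, since ^ and $ would be tied to the ends of the whole string.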

function Regex.finalizer()

Cleans up the underlying C PCRE objects.

function Regex.numGroups()

Returns:

the number of matched subgroups. This will be 0 if test returned false, or a number greater than 0 otherwise.

function Regex.groupNames()

Returns:

an array of strings of named groups.

Named groups are created with the "(?P<name>pattern)" regex syntax. So, if you compiled something like @"(?P<lname>\w+), (?P<fname>\w+)", this function would return an array containing the strings "lname" and "fname" (though not in any particular order).
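
A short sketch of using those named groups after a successful test (the subject string is an illustrative assumption):

```croc
import pcre

local re = pcre.Regex(@"(?P<lname>\w+), (?P<fname>\w+)")

// Named groups can be passed to match in place of an integer index.
if(re.test("Lovelace, Ada"))
	writefln("{} {}", re.match("fname"), re.match("lname")) // expected: Ada Lovelace

// The order of the names is unspecified, so don't rely on it.
foreach(i, name; re.groupNames())
	writefln("group name: {}", name)
```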

function Regex.test(subject: string = null)

The workhorse of the regex engine, this gets the next match of the regex within the current subject string.

Params:
subject
is the optional new subject string. If you don't pass one, this will continue testing on the current subject string. If you do, it's the same as doing re.search(subject).test().

Returns:

true if a new match was found in the subject string. In this case it updates all the matches which can be retrieved using various other methods. Returns false if no more matches were found.
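
To illustrate how repeated calls advance through the subject, here is a small sketch (the pattern and subject are made up):

```croc
import pcre

local re = pcre.Regex(@"\d+")

// Passing a subject sets it and finds the first match.
writefln("{}", re.test("a1 b22")) // expected: true
writefln("{}", re.match())        // expected: 1

// Calling test with no argument continues from the previous match.
writefln("{}", re.test())         // expected: true
writefln("{}", re.match())        // expected: 22

// No matches are left, so test returns false. Calling match now would throw a StateError.
writefln("{}", re.test())         // expected: false
```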

function Regex.search(subject: string)

Sets the subject string and resets all match groups, but does not start looking for matches. You'll have to use test or iterate over the matches with a foreach loop.

Returns:

this regex object, to make it easier to use as the container in a foreach loop.

function Regex.pre(idx: int|string = 0)

Gets the slice of the subject string that comes before the given subgroup match.

Params:
idx
works just like in match.

function Regex.match(idx: int|string = 0)

Gets the most recent match of the regex and its subgroups within the subject string.

Params:
idx
can be the integer index of a subgroup, where index 0 is the entire regex and 1, 2, etc. are the subgroups in order of where they appear in the regex. Alternatively, if you've named subgroups, you can get them by name; only names returned from groupNames are valid.

Returns:

the slice of the subject string which was matched by the given regex group.

Throws:
StateError
if there are no more matches (test returned false).
RangeError
if the given integral subgroup index is invalid.
NameError
if the given string subgroup name is invalid.

function Regex.post(idx: int|string = 0)

Gets the slice of the subject string that comes after the given subgroup match.

Params:
idx
works just like in match.

function Regex.preMatchPost(idx: int|string = 0)

Gets three pieces of the string: the part that comes before the given subgroup match, the match itself, and the part that comes after. This is slightly more efficient than calling pre, match, and post separately if you need two or all three parts.

Params:
idx
works just like in match.

Returns:

the pre, match, and post strings in that order.
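
A sketch of the three-way split, going by the descriptions of pre, match, and post above (the subject string is an illustrative assumption):

```croc
import pcre

local re = pcre.Regex(@"=(\w+)")
re.test("key=value; other")

// For the whole match (index 0), the match itself is "=value".
local pre, m, post = re.preMatchPost()
writefln("'{}' '{}' '{}'", pre, m, post) // expected: 'key' '=value' '; other'

// For subgroup 1, the pieces are taken relative to that group instead.
pre, m, post = re.preMatchPost(1)
writefln("'{}' '{}' '{}'", pre, m, post) // expected: 'key=' 'value' '; other'
```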

function Regex.matchBegin(idx: int|string = 0)

Gets the character index into the subject string where the given subgroup match begins.

Params:
idx
works just like in match.

function Regex.matchEnd(idx: int|string = 0)

Gets the character index into the subject string where the given subgroup match ends.

Params:
idx
works just like in match.

function Regex.matchBeginEnd(idx: int|string = 0)

Gets the character indices into the subject string where the given subgroup match begins and ends.

Params:
idx
works just like in match.

Returns:

the begin and end indices in that order.

function Regex.find(subject: string)

Searches for the first match of this regex in the given subject string.

This is basically the same as re.search(subject).test() ? re.matchBegin() : #subject.

Params:
subject
will be set as the new subject string.

Returns:

the index into the subject string where the first match of this regex was found, or #subject if there was no match.
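
A quick sketch of both outcomes (the subjects are made up for illustration):

```croc
import pcre

local re = pcre.Regex(@"\d+")

// The first run of digits starts at index 3.
writefln("{}", re.find("abc123")) // expected: 3

// No match: the result is the length of the subject, here 4.
writefln("{}", re.find("abcd"))   // expected: 4
```

Because a failed search returns #subject rather than a negative number or null, compare the result against the subject's length to detect that nothing was found.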

function Regex.split(subject: string)

Splits subject into an array of pieces, using entire matches of this regex as the delimiters.

Params:
subject
will be set as the new subject string.

Returns:

the array of split-up components. This will have only one element if the regex did not match.
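
For instance, a sketch that splits on commas with optional surrounding whitespace (the pattern and subject are assumptions for illustration):

```croc
import pcre

local re = pcre.Regex(@"\s*,\s*")
local parts = re.split("a, b ,c")

// The delimiters ", " and " ," are dropped, leaving three pieces.
foreach(i, piece; parts)
	writefln("{}: '{}'", i, piece)

// expected:
// 0: 'a'
// 1: 'b'
// 2: 'c'
```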

function Regex.replace(subject: string, repl: string|function)

Performs a search-and-replace on subject using this regex as the search term.

Params:
subject
will be set as the new subject string.
repl
can be a string, in which case any matches of this regex will be replaced with repl verbatim.
repl can also be a function. In this case, it should take a single parameter which will be this regex object (through which it can access the match), and should return a single string to be used as the replacement.

Returns:

a new string with all occurrences of this regex replaced with repl.

Throws:
TypeError
if repl is a function and it returns anything other than a string.
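
A sketch of both forms of repl. The helper name bracket is hypothetical, as are the pattern and subject:

```croc
import pcre

local re = pcre.Regex(@"\d+")

// String replacement: every match is replaced with the string verbatim.
writefln("{}", re.replace("a1b22c", "#")) // expected: a#b#c

// Function replacement: the callback receives the regex object and must return a string.
local function bracket(m)
{
	return "<" ~ m.match() ~ ">"
}

writefln("{}", re.replace("a1b22c", bracket)) // expected: a<1>b<22>c
```

The function form is the one to reach for when the replacement depends on what was matched, since a string repl is inserted as-is.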

function Regex.opApply()

This allows you to iterate over all the matches of this regex in the subject string with a foreach loop. To set the subject string, you can use search, which conveniently returns this regex object.

In the loop, there will be two indices: the first being the 0-based index of the match (that is, the number of matches found before this one in the subject string), and the second being this regex object itself. For example:

local re = pcre.Regex(@"(\w+)\s?=\s?(\w+)")
local subject =
"foo = bar
baz= quux"

foreach(i, m; re.search(subject))
	writefln("{}: key = '{}', value = '{}'", i, m.match(1), m.match(2))

This will print out:

0: key = 'foo', value = 'bar'
1: key = 'baz', value = 'quux'

Note that opApply is just defined in terms of test. You can also iterate through all matches by doing something like this:

re.search(subject)
for(local i = 0; re.test(); i++)
	writefln("{}: key = '{}', value = '{}'", i, re.match(1), re.match(2))

Given the same regex and subject string, this will print out the same thing as the previous example.

function Regex.opIndex(idx: int|string = 0)

An alias for match, so re[4] is the same as re.match(4), and re['lname'] is the same as re.match('lname'). You can't write re[] for the whole match, though, since that's a full-slice, not indexing.
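
A brief sketch of the indexing shorthand next to its match equivalents (the subject string is made up):

```croc
import pcre

local re = pcre.Regex(@"(?P<lname>\w+), (?P<fname>\w+)")
re.test("Lovelace, Ada")

writefln("{}", re[1])        // expected: Lovelace (same as re.match(1))
writefln("{}", re['fname'])  // expected: Ada (same as re.match('fname'))

// For the whole match, use index 0 explicitly or call match with no arguments.
writefln("{}", re[0])        // expected: Lovelace, Ada
writefln("{}", re.match())   // expected: Lovelace, Ada
```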

HTML and JavaScript source derived from

by Victor Nakoryakov; Page generated on 15 Nov 2014 10:28:14
