Introduction

Syntax

Program Help

Examples

Download

Match Tool 2.1

Match Tool 3.0

 

DEELX Regular Expression Engine Features

Research project of RegExLab.


Content:

Coded using template

DEELX is coded using template, so DEELX can be used on char, wchar_t and other base types, such as unsigned char, int etc. But only simple types can be used, not struct or union.

For example:

#include "deelx.h"

int pattern[] = {65, '+', 66, 0}; // 0 to indicate end
int text[] = {180000, 65, 65, 66, 0};

CRegexpT <int> regexp(pattern);
MatchResult result = regexp.Match(text);

if(result.IsMatched())
{
    printf("start at: %d\nlength: %d\n", result.GetStart(),
        result.GetEnd() - result.GetStart());
}

More details:

[ Program Help]


RIGHTTOLEFT match mode

DEELX supports right to left match mode, which lets regular expression matches from the end of text to the begin. Within the pattern, the expression in the right side will do match first. To compose a pattern for "RIGHTTOLEFT" is the same as to compose a common pattern. The quantifiers still lays on the right side; ^ still matches the begin of text; Lookahead assertion still matches rightward; The group number still counts from left to right; etc.

There is one point should be paid attention to. If RIGHTTOLEFT, the expression in the right side will do match first. If you used backward reference, the group which is refered should lays on the right side.

For example:

"RIGHTTOLEFT" Mode

Common

\2(.*?)('|")

('|")(.*?)\1

The recursive expression has not this limit, so both of the following is right:

(?2)(.*?)(")

(")(.*?)(?1)


Backward assertion

"Backward assertion" is to ask the text on the left side of current position must matches certain expression. The format is (?<=xxx) or (?<!xxx). As \b, it is just a condition of the position, and it will not match any character. For example:

Expression

Description

\b

Matches word boundary.

(?<!\w)(?=\w)

Matches only begin of word.

(?<=\w)(?!\w)

Matches only end of word.

On the function of "backward assertion", the details in Perl, Java, GRETA and DEELX are different from each other:

Engine

Description

Example

Perl

Works only for fixed-width lookbehind.

(?<=\t)print

Java

Non-fixed width, but maximum needed.

(?<=\{\s{0,100})print

GRETA

Non-fixed width, but the negative has some bugs.

(?<=\{\s*)print

In DEELX:

DEELX uses RIGHTTOLEFT mode to match "Backward assertion". The backward assertion has the same logic as lookahead assertion, except the direction. So, in DEELX, the backward assertion works for non-fixed width lookbehind.

For example, in DEELX:

Text

Expression

Result

{print}

(?<!\{\s*)print

Fail

(?<=\{\s*)print

Success


Easy to use

The whole DEELX is coded using template, so there is no cpp or lib. All the DEELX source code is in a single header file (deelx.h). You need not to create a project for DEELX, nor add any cpp or lib. When runs, no special dll file is needed.

You can simplely add an include in your program:

#include "deelx.h"

DEELX is included into your project directly, so need not worry about link problem.


Compatibility

DEELX is coded in pure C++, no STL nor MFC class is used. DEELX has been tested on following compiler and OS:

Compiler

Version

OS

Remark

VC++

6.0, 7.1, 8.0

Windows

 

GCC

3.4

Cygwin

 

GCC

3.4

Linux

 

Turbo C++

3.0

DOS

 

C++ Builder

6.0

Windows

 

Borland C++

5.5

Windows

 

GCC

2.7

FreeBSD 2.2

 

GCC

3.4

Solaris 10 Unix

http://www.unix-center.net/

We have not tested it on other platforms.

If you have compiled on other platforms, you can tell us whether you have succeeded or failed, through regexlab@gmail.com. Thanks very much.


Named group

DEELX supports named group of Python style and .NET style:

Expression

Style

Description

(?P<the_name>xxxx)

Python

To define a named group 'the_name'

(?<the_name>xxxx)

.NET

(?'the_name'xxxx)

After a successful match, the captured content of named group can be gotten through the name of the group.

DEELX allows more than one groups to have the same group name. Logically, they are the same group, for example: (?<string>".*?")|(?<string>'.*?') . If one named group contains another group which has the same name, the one which is contained will do not capture, for example: (?<number>(?<number>\d+)\.?) .

Other features concerned: backward reference, recursive expression, conditional expression, replace operation:

Feature

Expression

Style

Description

Backward
reference

\g<the_name>

Python

Backward refer to 'the_name'

\k<the_name>

.NET

\k'the_name'

Recursive
expression

(?R<the_name>)

 

Recursive refer to named 'the_name'

(?R'the_name')

Conditional
expression

(?(the_name)yes|no)

.NET

According to if named 'the_name' captured

Replace

${the_name}

 

Stands for what captured by 'the_name'


Conditional expression

According to a certain condition, conditional expression matches one of two options. The condition can be:

  1. Whether a certain group captured or not.
  2. Whether a certain expression can match at current position or not.

Conditional expression and description:

Expression

Characteristic

Description

(?(1)yes|no)

Is number

If group 1 captured, match yes, else match no

(?(?=a)aa|bbb)

Assertion

If next char is 'a', match aa, else match bbb

(?(xxx)aa|bbb)

Not named group

If there is no named group 'xxx', same as (?=xxx)

(?(name)yes|no)

Is named group

If there is named group 'name', captured or not

More:

  1. If it is RIGHTTOLEFT mode,  (?(xxx)aa|bbb) is same as (?(?<=xxx)aa|bbb) .
  2. If there is only one option, the option will do match if the condition success.
  3. If there are more than 2 options separated by "|", only the first "|" is regarded as conditional expression's separator. For example: (?(?=xxx)yes|no1|no2), if condition success, "yes" matches, or "no1|no2" matches.

Recursive expression

"Recursive expression" is to refer to another part of expression itself, not to refer to what is captured. For example:

Expression

Equivalent1

Equivalent2

Matches

(\w)(?1)

(\w)(\w)

 

ab

(?1)(\w(?2))(\d)

(?1)(\w(\d))(\d)

(\w(\d))(\w(\d))(\d)

a1b23

If the part which is referred, contains the "recursive expression", this makes a recursion. For example:

Expression

Equivalent1

Equivalent1

Matches

(\w(?1)?)

(\w(\w(?1)?)?)

(\w+)

ghjk5......

\(([^()]|(?R))*\)

\(([^()]|\(([^()]|(?R))*\))*\)

 

(a * (c + 2))

Formats of recursive expression:

Format

Description

(?R)

Recursively refer to the whole expression.

(?R1), (?R2)

Recursively refer to certain group.

(?1), (?2)

(?R<named>)

Recursively refer to named group.

(?R'named')


Prevent infinite loop

If a repeated patterns matches zero-length substring, it may cause infinite loop in many engines. DEELX has resolved this problem and find a technical method to prevent infinite loop on that situation.

For example:

Expression

Description

(a*)*

a* matches zero-width, but it will not cause infinite loop in DEELX

(a|b|c?)*

One option matches zero-width, but it will not cause infinite loop in DEELX

 

 

RegExLab.com © 2005 - All Rights Reserved