DEELX Regular Expression Engine Features
Research project of RegExLab.
Content:
Coded using template
DEELX is coded using template, so DEELX can be used on char, wchar_t and other base types, such as unsigned char, int etc. But only simple types can be used, not struct or union.
For example:
#include "deelx.h"
int pattern[] = {65,
'+', 66, 0}; // 0 to indicate end int text[] = {180000,
65, 65, 66, 0};
CRegexpT <int> regexp(pattern); MatchResult result = regexp.Match(text);
if(result.IsMatched()) { printf("start at: %d\nlength: %d\n", result.GetStart(), result.GetEnd() - result.GetStart()); } |
More details:
[ Program Help]
RIGHTTOLEFT match mode
DEELX supports right to left match mode, which lets regular expression matches from the end of text to the begin. Within the pattern, the expression in the right side will do match first. To compose a pattern for "RIGHTTOLEFT" is the same as to compose a common
pattern. The quantifiers still lays on the right side; ^ still matches the begin of text; Lookahead assertion still matches rightward; The group number still counts from left to right; etc.
There is one point should be paid attention to. If RIGHTTOLEFT, the expression in the right side will do match first. If you used backward reference, the group which is refered should lays on the right side.
For example:
"RIGHTTOLEFT" Mode |
Common |
\2(.*?)('|")
|
('|")(.*?)\1
|
The recursive expression has not this limit, so both of the following is right:
(?2)(.*?)(")
|
(")(.*?)(?1)
|
Backward assertion
"Backward assertion" is to ask the text on the left side of current position must matches certain expression. The format is (?<=xxx) or (?<!xxx). As \b, it is just a condition of the position, and it will not match any character. For example:
Expression |
Description |
\b |
Matches word boundary. |
(?<!\w)(?=\w) |
Matches only begin of word. |
(?<=\w)(?!\w) |
Matches only end of word. |
On the function of "backward assertion", the details in Perl, Java, GRETA and DEELX are different from each other:
Engine |
Description |
Example |
Perl |
Works only for fixed-width lookbehind. |
(?<=\t)print
|
Java |
Non-fixed width, but maximum needed. |
(?<=\{\s{0,100})print
|
GRETA |
Non-fixed width, but the negative has some bugs. |
(?<=\{\s*)print
|
In DEELX:
DEELX uses RIGHTTOLEFT mode to match "Backward assertion". The backward assertion has the same logic as lookahead assertion, except the direction. So, in DEELX, the backward assertion works for non-fixed width lookbehind.
For example, in DEELX:
Text |
Expression |
Result |
{print} |
(?<!\{\s*)print
|
Fail |
(?<=\{\s*)print
|
Success |
Easy to use
The whole DEELX is coded using template, so there is no cpp or lib. All the DEELX source code is in a single header file (deelx.h). You need not to create a project for DEELX, nor add any cpp or lib. When runs, no special dll file is needed.
You can simplely add an include in your program:
DEELX is included into your project directly, so need not worry about link problem.
Compatibility
DEELX is coded in pure C++, no STL nor MFC class is used. DEELX has been tested on following compiler and OS:
Compiler |
Version |
OS |
Remark |
VC++ |
6.0, 7.1, 8.0 |
Windows |
|
GCC |
3.4 |
Cygwin |
|
GCC |
3.4 |
Linux |
|
Turbo C++ |
3.0 |
DOS |
|
C++ Builder |
6.0 |
Windows |
|
Borland C++ |
5.5 |
Windows |
|
GCC |
2.7 |
FreeBSD 2.2 |
|
GCC |
3.4 |
Solaris 10 Unix |
http://www.unix-center.net/ |
We have not tested it on other platforms.
If you have compiled on other platforms, you can tell us whether you have succeeded or failed, through regexlab@gmail.com. Thanks very much.
Named group
DEELX supports named group of Python style and .NET style:
Expression |
Style |
Description |
(?P<the_name>xxxx)
|
Python |
To define a named group 'the_name' |
(?<the_name>xxxx) |
.NET |
(?'the_name'xxxx) |
After a successful match, the captured content of named group can be gotten through the name of the group.
DEELX allows more than one groups to have the same group name. Logically, they are the same group, for example: (?<string>".*?")|(?<string>'.*?')
. If one named group contains another group which has the same name, the one which is contained will do not capture, for example: (?<number>(?<number>\d+)\.?)
.
Other features concerned: backward reference, recursive expression, conditional expression, replace operation:
Feature |
Expression |
Style |
Description |
Backward
reference |
\g<the_name> |
Python |
Backward refer to 'the_name' |
\k<the_name> |
.NET |
\k'the_name' |
Recursive
expression |
(?R<the_name>) |
|
Recursive refer to named 'the_name' |
(?R'the_name') |
Conditional
expression |
(?(the_name)yes|no) |
.NET |
According to if named 'the_name' captured |
Replace |
${the_name} |
|
Stands for what captured by 'the_name' |
Conditional expression
According to a certain condition, conditional expression matches one of two options. The condition can be:
- Whether a certain group captured or not.
- Whether a certain expression can match at current position or not.
Conditional expression and description:
Expression |
Characteristic |
Description |
(?(1)yes|no) |
Is number |
If group 1 captured, match yes, else match no |
(?(?=a)aa|bbb) |
Assertion |
If next char is 'a', match aa, else match bbb |
(?(xxx)aa|bbb) |
Not named group |
If there is no named group 'xxx', same as (?=xxx)
|
(?(name)yes|no) |
Is named group |
If there is named group 'name', captured or not |
More:
- If it is RIGHTTOLEFT mode, (?(xxx)aa|bbb)
is same as (?(?<=xxx)aa|bbb)
.
- If there is only one option, the option will do match if the condition success.
- If there are more than 2 options separated by "|", only the first "|" is regarded as conditional expression's separator. For example: (?(?=xxx)yes|no1|no2),
if condition success, "yes" matches, or "no1|no2" matches.
Recursive expression
"Recursive expression" is to refer to another part of expression itself, not to refer to what is captured. For example:
Expression |
Equivalent1 |
Equivalent2 |
Matches |
(\w)(?1) |
(\w)(\w) |
|
ab |
(?1)(\w(?2))(\d) |
(?1)(\w(\d))(\d) |
(\w(\d))(\w(\d))(\d) |
a1b23 |
If the part which is referred, contains the "recursive expression", this makes a recursion. For example:
Expression |
Equivalent1 |
Equivalent1 |
Matches |
(\w(?1)?)
|
(\w(\w(?1)?)?) |
(\w+) |
ghjk5...... |
\(([^()]|(?R))*\) |
\(([^()]|\(([^()]|(?R))*\))*\) |
|
(a * (c + 2)) |
Formats of recursive expression:
Format |
Description |
(?R) |
Recursively refer to the whole expression. |
(?R1), (?R2) |
Recursively refer to certain group. |
(?1), (?2) |
(?R<named>) |
Recursively refer to named group. |
(?R'named') |
Prevent infinite loop
If a repeated patterns matches zero-length substring, it may cause infinite loop in many engines. DEELX has resolved this problem and find a technical method to prevent infinite loop on that situation.
For example:
Expression |
Description |
(a*)*
|
a* matches zero-width, but it will not cause infinite loop in DEELX |
(a|b|c?)* |
One option matches zero-width, but it will not cause infinite loop in DEELX |
|