Regular Expressions | TrendAI™


Regular expressions are used to perform string matching. See the following tables for some common examples of regular expressions. To specify a regular expression, add a “.REG.” operator before that pattern.

There are a number of websites and tutorials available online. One such site is the PerlDoc site, which can be found at:

http://www.perl.com/doc/manual/html/pod/perlre.html

WARNING

Regular expressions are a powerful string matching tool. For this reason, Trend Micro recommends that Administrators who choose to use regular expressions be familiar and comfortable with regular expression syntax. Poorly written regular expressions can have a dramatic negative performance impact. Trend Micro recommends starting with simple regular expressions that do not use complex syntax. When introducing new rules, use the archive action and observe how the Messaging Security Agent manages messages using your rule. When you are confident that the rule has no unexpected consequences, you can change your action.

Regular Expression Examples


Counting and Grouping

Element	What it Means	Example
.	The dot or period character represents any character except new line character.	do. matches doe, dog, don, dos, dot, etc. d.r matches deer, door, etc.
*	The asterisk character means zero or more instances of the preceding element.	do* matches d, do, doo, dooo, doooo, etc.
+	The plus sign character means one or more instances of the preceding element.	do+ matches do, doo, dooo, doooo, etc. but not d
?	The question mark character means zero or one instances of the preceding element.	do?g matches dg or dog but not doog, dooog, etc.
( )	Parenthesis characters group whatever is between them to be considered as a single entity.	d(eer)+ matches deer or deereer or deereereer, etc. The + sign is applied to the substring within parentheses, so the regex looks for d followed by one or more of the grouping “eer.”
[ ]	Square bracket characters indicate a set or a range of characters.	d[aeiouy]+ matches da, de, di, do, du, dy, daa, dae, dai, etc. The + sign is applied to the set within brackets, so the regex looks for d followed by one or more of any of the characters in the set [aeiouy]. d[A-Z] matches dA, dB, dC, and so on up to dZ. The set in square brackets represents the range of all upper-case letters between A and Z.
[ ^ ]	Caret characters within square brackets logically negate the set or range specified, meaning the regex will match any character that is not in the set or range.	d[^aeiouy] matches db, dc, dd, d9, d#, etc.--d followed by any single character except a vowel.
{ }	Curly brace characters set a specific number of occurrences of the preceding element. A single value inside the braces means that only that many occurrences will match. A pair of numbers separated by a comma represents a range of valid counts of the preceding element. A single number followed by a comma means there is no upper bound.	da{3} matches daaa--d followed by 3 and only 3 occurrences of “a”. da{2,4} matches daa, daaa, and daaaa (but not daaaaa)--d followed by 2, 3, or 4 occurrences of “a”. da{4,} matches daaaa, daaaaa, daaaaaa, etc.--d followed by 4 or more occurrences of “a”.
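
The quantifier and grouping examples in the table above can be checked with a short script. This is a minimal sketch that assumes Python's standard `re` module as a stand-in; the Messaging Security Agent's own engine may differ in details, and `full_matches` is a hypothetical helper, not part of the product.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Patterns and candidate words taken from the Counting and Grouping table.
print(full_matches(r"do*", ["d", "do", "doo", "dog"]))         # zero or more "o"
print(full_matches(r"do+", ["d", "do", "doo"]))                # one or more "o"
print(full_matches(r"do?g", ["dg", "dog", "doog"]))            # zero or one "o"
print(full_matches(r"d(eer)+", ["deer", "deereer", "de"]))     # repeated group
print(full_matches(r"d[^aeiouy]", ["db", "d9", "da"]))         # negated set
print(full_matches(r"da{2,4}", ["da", "daa", "daaaa", "daaaaa"]))  # bounded count
```

`re.fullmatch` is used here so that each word is tested as a whole; a content filter normally searches for the pattern anywhere in the text instead.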

Character Classes (shorthand)

Element	What it Means	Example
\d	Any digit character; functionally equivalent to [0-9] or [[:digit:]]	\d matches 1, 12, 123, etc., but not 1b7--one or more of any digit characters.
\D	Any non-digit character; functionally equivalent to [^0-9] or [^[:digit:]]	\D matches a, ab, ab&, but not 1--one or more of any character but 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
\w	Any “word” character--that is, any alphanumeric character; functionally equivalent to [_A-Za-z0-9] or [_[:alnum:]]	\w matches a, ab, a1, but not !&--one or more upper- or lower-case letters or digits, but not punctuation or other special characters.
\W	Any non-alphanumeric character; functionally equivalent to [^_A-Za-z0-9] or [^_[:alnum:]]	\W matches *, &, but not ace or a1--one or more of any character but upper- or lower-case letters and digits.
\s	Any white space character; space, new line, tab, non-breaking space, etc.; functionally equivalent to [[:space:]]	vegetable\s matches “vegetable” followed by any white space character. So the phrase “I like a vegetable in my soup” would trigger the regex, but “I like vegetables in my soup” would not.
\S	Any non-white space character; anything other than a space, new line, tab, non-breaking space, etc.; functionally equivalent to [^[:space:]]	vegetable\S matches “vegetable” followed by any non-white space character. So the phrase “I like vegetables in my soup” would trigger the regex, but “I like a vegetable in my soup” would not.
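
The `vegetable\s` and `vegetable\S` rows above can be reproduced as follows. This sketch assumes Python's `re` module rather than the product's engine, and the variable names are illustrative only.

```python
import re

# \s requires a white space character right after "vegetable".
followed_by_space = re.compile(r"vegetable\s")
# \S requires a non-white space character right after "vegetable".
followed_by_nonspace = re.compile(r"vegetable\S")

singular = "I like a vegetable in my soup"
plural = "I like vegetables in my soup"

print(bool(followed_by_space.search(singular)))     # True
print(bool(followed_by_space.search(plural)))       # False
print(bool(followed_by_nonspace.search(plural)))    # True
print(bool(followed_by_nonspace.search(singular)))  # False

# \d matches one digit at a time; add + to capture runs of digits.
digit_runs = re.findall(r"\d+", "order 123, item 7")
print(digit_runs)  # ['123', '7']
```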

Character Classes

Element	What it Means	Example
[:alpha:]	Any alphabetic characters	.REG. [[:alpha:]] matches abc, def, xxx, but not 123 or @#$.
[:digit:]	Any digit character; functionally equivalent to \d	.REG. [[:digit:]] matches 1, 12, 123, etc.
[:alnum:]	Any “word” character--that is, any alphanumeric character; functionally equivalent to \w	.REG. [[:alnum:]] matches abc, 123, but not ~!@.
[:space:]	Any white space character; space, new line, tab, non-breaking space, etc.; functionally equivalent to \s	.REG. (vegetable)[[:space:]] matches “vegetable” followed by any white space character. So the phrase “I like a vegetable in my soup” would trigger the regex, but “I like vegetables in my soup” would not.
[:graph:]	Any characters except space, control characters or the like	.REG. [[:graph:]] matches 123, abc, xxx, ><”, but not space or control characters.
[:print:]	Any characters (similar to [:graph:]) but includes the space character	.REG. [[:print:]] matches 123, abc, xxx, ><”, and space characters.
[:cntrl:]	Any control characters (e.g. CTRL + C, CTRL + X)	.REG. [[:cntrl:]] matches 0x03, 0x08, but not abc, 123, !@#.
[:blank:]	Space and tab characters	.REG. [[:blank:]] matches space and tab characters, but not 123, abc, !@#
[:punct:]	Punctuation characters	.REG. [[:punct:]] matches ; : ? ! ~ @ # $ % & * ‘ “ , etc., but not 123, abc
[:lower:]	Any lowercase alphabetic characters (Note: ‘Enable case sensitive matching’ must be enabled or else it will function as [:alnum:])	.REG. [[:lower:]] matches abc, Def, sTress, Do, etc., but not ABC, DEF, STRESS, DO, 123, !@#.
[:upper:]	Any uppercase alphabetic characters (Note: ‘Enable case sensitive matching’ must be enabled or else it will function as [:alnum:])	.REG. [[:upper:]] matches ABC, DEF, STRESS, DO, etc., but not abc, Def, Stress, Do, 123, !@#.
[:xdigit:]	Digits allowed in a hexadecimal number (0-9a-fA-F)	.REG. [[:xdigit:]] matches 0a, 7E, 0f, etc.
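
Not every regex engine accepts the bracketed class names above. Python's `re` module, for instance, does not support them, so the sketch below translates a few classes into explicit ASCII sets before matching. The `POSIX_ASCII` table and the `expand_posix` helper are hypothetical names introduced for illustration, and the ASCII-only sets are an assumption; the product's engine may include more characters.

```python
import re

# ASCII-only equivalents of some POSIX classes (an assumption for this sketch).
POSIX_ASCII = {
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with its ASCII set."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

hex_pattern = expand_posix("[[:xdigit:]]+")
print(hex_pattern)                               # [0-9A-Fa-f]+
print(bool(re.fullmatch(hex_pattern, "7E")))     # True
print(bool(re.fullmatch(hex_pattern, "0g")))     # False

blank_pattern = expand_posix("(vegetable)[[:blank:]]")
print(bool(re.search(blank_pattern, "a vegetable in my soup")))  # True
```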

Pattern Anchors

Element	What it Means	Example
^	Indicates the beginning of a string.	^(notwithstanding) matches any block of text that begins with “notwithstanding”. So the phrase “notwithstanding the fact that I like vegetables in my soup” would trigger the regex, but “The fact that I like vegetables in my soup notwithstanding” would not.
$	Indicates the end of a string.	(notwithstanding)$ matches any block of text that ends with “notwithstanding”. So the phrase “notwithstanding the fact that I like vegetables in my soup” would not trigger the regex, but “The fact that I like vegetables in my soup notwithstanding” would.
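
The two anchor examples above behave as follows when tried with Python's `re` module, which is used here only as an assumed stand-in for the product's engine.

```python
import re

starts_with = re.compile(r"^(notwithstanding)")
ends_with = re.compile(r"(notwithstanding)$")

first = "notwithstanding the fact that I like vegetables in my soup"
second = "The fact that I like vegetables in my soup notwithstanding"

print(bool(starts_with.search(first)))   # True: the text begins with the word
print(bool(starts_with.search(second)))  # False
print(bool(ends_with.search(first)))     # False
print(bool(ends_with.search(second)))    # True: the text ends with the word
```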

Escape Sequences and Literal Strings

Element	What it Means	Example
\	Matches characters that have special meaning in regular expressions (for example, “+”) literally when they are preceded by a backslash.	(1) .REG. C\\C\+\+ matches ‘C\C++’. (2) .REG. \* matches *. (3) .REG. \? matches ?.
\t	Indicates a tab character.	(stress)\t matches any block of text that contained the substring “stress” immediately followed by a tab (ASCII 0x09) character.
\n	Indicates a new line character. Note: Different platforms represent a new line differently. On Windows, a new line is a pair of characters, a carriage return followed by a line feed. On Unix and Linux, a new line is just a line feed, and on Macintosh a new line is just a carriage return.	(stress)\n\n matches any block of text that contained the substring “stress” followed immediately by two new line (ASCII 0x0A) characters.
\r	Indicates a carriage return character.	(stress)\r matches any block of text that contained the substring “stress” followed immediately by one carriage return (ASCII 0x0D) character.
\b	Indicates a backspace character within character classes; elsewhere, denotes a word boundary. A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.	(stress)[\b] matches any block of text that contained the substring “stress” followed immediately by one backspace (ASCII 0x08) character. As a word boundary, \b appears in the following regular expression, which can match the social security number: .REG. \b\d{3}-\d{2}-\d{4}\b
\xhh	Indicates an ASCII character with given hexadecimal code (where hh represents any two-digit hex value).	\x7E(\w){6} matches any block of text containing a “word” of exactly six alphanumeric characters preceded by a ~ (tilde) character. So, the words ‘~ab12cd’, ‘~Pa3499’ would be matched, but ‘~oops’ would not.
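
A few of the escape sequences in this section can be exercised with a short script. This is a sketch under the assumption that Python's `re` module treats these escapes the same way as the product's engine, which the source does not confirm.

```python
import re

# Escaped metacharacters match themselves literally.
literal = re.search(r"C\\C\+\+", r"C\C++")
print(bool(literal))  # True

# re.escape builds the same kind of pattern from a literal string.
escaped = re.escape(r"C\C++")
print(bool(re.fullmatch(escaped, r"C\C++")))  # True

# \t matches a tab (ASCII 0x09) right after "stress".
print(bool(re.search(r"(stress)\t", "stress\tfree")))  # True

# \x7E is the tilde; it must be followed by six word characters.
tilde_words = [w for w in ["~ab12cd", "~Pa3499", "~oops"]
               if re.search(r"\x7E(\w){6}", w)]
print(tilde_words)  # ['~ab12cd', '~Pa3499']

# Outside a character class \b is a word boundary; inside one it is backspace.
print(bool(re.search(r"\bstress\b", "less stress.")))  # True
print(bool(re.search(r"\bstress\b", "stressful")))     # False
print(bool(re.search(r"stress[\b]", "stress\x08")))    # True
```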

Regular Expression Generator

When deciding how to configure rules for Data Loss Prevention, consider that the regular expression generator can create only simple expressions according to the following rules and limitations:

Only alphanumeric characters can be variables.
All other characters, such as [-], [/], and so on can only be constants.
Variable ranges can only be from A-Z and 0-9; you cannot limit ranges to, say, A-D.
Regular expressions generated by this tool are case-insensitive.
Regular expressions generated by this tool can only make positive matches, not negative matches (“if does not match”).
Expressions based on your sample can only match the exact same number of characters and spaces as your sample; the tool cannot generate patterns that match “one or more” of a given character or string.
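
To make these limitations concrete, the sketch below imitates a sample-based generator. It is not the product's tool: `generate_pattern` is a hypothetical function, and it assumes that every letter and digit in the sample is treated as a variable, while every other character is kept as a constant.

```python
import re

def generate_pattern(sample):
    """Build a fixed-length pattern from a sample (hypothetical sketch)."""
    parts = []
    for ch in sample:
        if ch.isascii() and ch.isalpha():
            parts.append("[A-Z]")          # letters: full A-Z range only
        elif ch.isascii() and ch.isdigit():
            parts.append("[0-9]")          # digits: full 0-9 range only
        else:
            parts.append(re.escape(ch))    # anything else stays a constant
    return "".join(parts)

pattern = generate_pattern("AB-1234")

# Matching is case-insensitive and positive only, as in the rules above.
def matches(text):
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

print(matches("xy-9876"))   # True: same shape, different case and digits
print(matches("ABC-1234"))  # False: the letter count differs from the sample
print(matches("AB/1234"))   # False: the constant "-" must appear as-is
```

Because each sample character maps to exactly one pattern element, the result cannot express “one or more” of anything, which mirrors the last limitation in the list.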

Complex Expression Syntax

A keyword expression is composed of tokens, the smallest units used to match the expression to the content. A token can be an operator, a logical symbol, or an operand, that is, the argument or value on which the operator acts.

Operators include .AND., .OR., .NOT., .NEAR., .OCCUR., .WILD., “.(.”, and “.).”. The operand and the operator must be separated by a space. An operand may also contain several tokens. See Keywords.

Regular Expressions at Work

The following example describes how the Social Security content filter, one of the default filters, works:

[Format] .REG. \b\d{3}-\d{2}-\d{4}\b

The above expression uses \b, a word boundary, followed by \d, any digit, then {x}, indicating the number of digits, and finally -, indicating a hyphen. This expression matches a social security number written as groups of three, two, and four digits separated by hyphens. The following table describes the strings that match the example regular expression:

Numbers Matching the Social Security Regular Expression

.REG. \b\d{3}-\d{2}-\d{4}\b
333-22-4444	Match
333224444	Not a match
333 22 4444	Not a match
3333-22-4444	Not a match
333-22-44444	Not a match

If you modify the expression as follows,

[Format] .REG. \b\d{3}\x20\d{2}\x20\d{4}\b

the new expression matches the following sequence:

333 22 4444
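
Both social security expressions can be checked against the sample strings from the table. The sketch below assumes Python's `re` module and drops the product-specific “.REG.” prefix; the variable names are illustrative.

```python
import re

ssn_hyphen = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ssn_space = re.compile(r"\b\d{3}\x20\d{2}\x20\d{4}\b")  # \x20 is a space

samples = [
    "333-22-4444",
    "333224444",
    "333 22 4444",
    "3333-22-4444",
    "333-22-44444",
]

hyphen_hits = [s for s in samples if ssn_hyphen.search(s)]
space_hits = [s for s in samples if ssn_space.search(s)]

print(hyphen_hits)  # ['333-22-4444']
print(space_hits)   # ['333 22 4444']
```

The word boundaries are what reject 3333-22-4444 and 333-22-44444: the extra digit sits next to another digit, so no boundary exists at the point where the pattern requires one.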