Learning About Regular Expressions
Regular expressions are a powerful way to match arbitrary text. Stemming from neurophysiological research conducted in the early 1940s, their mathematical foundation was established during the 1950s and 1960s. Their use has a long history in computer science, and they are an integral part of many UNIX tools, including awk, egrep, lex, perl, and sed, as well as many text editors. Regular expressions are slower than simple pattern matching algorithms, and they can be cryptic and difficult to write correctly. Small mistakes in specification can yield surprising results. They are, however, vastly more succinct and powerful than simple pattern matching, and can easily handle tasks that would be difficult or impossible otherwise.
The topic of regular expressions is a large one, complicated by the arbitrary differences in the implementations found in various tools. Anything beyond a very simple sketch is well beyond the scope of this manual. To understand them better, we recommend a good text on the subject, such as "Mastering Regular Expressions", by Jeffrey E.F. Friedl (O'Reilly & Associates, Inc., ISBN 1-56592-257-3). The following is an abbreviated, simplified, and incomplete explanation of regular expressions, sufficient to gain a cursory understanding of them.
The regular expression engine attempts to match the regular expression against the input string. Matching starts at the beginning of the string and moves from left to right. The matching is considered "greedy" because, at any given point, it always matches the longest possible substring. For example, if a regular expression could match the substring 'aa' or 'aaa', it always takes the longer option.
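The left-to-right, greedy behavior described above can be sketched as follows. This is a minimal illustration that uses Python's `re` module as a stand-in for the engine described here, which is an assumption: implementations differ in detail (for example, Python's alternation operator is ordered rather than longest-match, so the sketch uses a repetition operator instead).

```python
import re

# Python's re module stands in for the engine described in the text.
text = "xaaay"

# 'a+' could match 'a', 'aa', or 'aaa' starting at index 1; the greedy
# repetition takes the longest run available at that point.
match = re.search(r"a+", text)
greedy_result = (match.start(), match.group())

# The scan moves left to right, so the earliest match is reported even
# though a longer run of 'a' characters appears later in the string.
leftmost_result = re.search(r"a+", "baab aaaa").group()

print(greedy_result)    # (1, 'aaa')
print(leftmost_result)  # aa
```

The second search shows that "longest" applies at the first position where a match is possible, not across the whole string.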
Meta Characters
A regular expression "ordinary character" is a character that matches itself. Most characters are ordinary. The exceptions, sometimes called "meta characters", have special meanings. To convert a meta character into an ordinary one, you "escape" it by preceding it with a backslash character (e.g. '\*').
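The effect of escaping a meta character can be seen by running the same pattern with and without the backslash. The sketch below uses Python's `re` module as a stand-in (an assumption; the sample text is made up for illustration).

```python
import re

# Hypothetical sample text containing a literal asterisk.
text = "2*3 = 6, 22223 = ?"

# Unescaped, '*' is a meta character: '2*3' means
# "zero or more '2' characters followed by '3'".
unescaped = re.findall(r"2*3", text)

# Escaped with a backslash, '\*' is an ordinary asterisk, so the
# pattern matches the three characters '2', '*', '3' literally.
escaped = re.findall(r"2\*3", text)

print(unescaped)  # ['3', '22223']
print(escaped)    # ['2*3']
```

Note that the unescaped pattern never matches the literal text '2*3' as a whole, which is the kind of surprising result a small specification mistake can produce.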
The meta characters are described in the following table:
Subexpressions
Subexpressions are those parts of a regular expression enclosed in parentheses. There are two reasons to use subexpressions:
Bracket Expressions
Bracket expressions (expressions enclosed in square brackets) are used to specify a set of characters that can satisfy a match. Many of the meta characters described above (.*[\) lose their special meaning within a bracket expression. The right bracket loses its special meaning if it occurs as the first character in the expression (after an initial '^', if any).
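A short sketch of these rules, using Python's `re` module as a stand-in (an assumption). One difference to keep in mind: Python still treats the backslash as an escape character inside brackets, so only the '.', '*', and ']' cases are shown.

```python
import re

# Hypothetical sample text containing several meta characters.
text = "a.b*c]d"

# Inside brackets, '.' and '*' are ordinary characters.
dots_and_stars = re.findall(r"[.*]", text)

# A ']' placed first in the list is a member of the set,
# not the end of the bracket expression.
with_bracket = re.findall(r"[]*]", text)

# The same holds when ']' follows an initial '^':
# match runs of characters other than ']', '*', and '.'.
negated = re.findall(r"[^]*.]+", text)

print(dots_and_stars)  # ['.', '*']
print(with_bracket)    # ['*', ']']
print(negated)         # ['a', 'b', 'c', 'd']
```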
There are several different forms of bracket expressions, including:
- Matching List: A matching list expression specifies a list that matches any one of the characters in the list. For example, '[abc]' matches any of the characters 'a', 'b', or 'c'.
- Non-Matching List: A non-matching list expression begins with a '^', and specifies a list that matches any character not in the list. For example, '[^abc]' matches any character except 'a', 'b', or 'c'. The '^' only has this special meaning when it occurs first in the list, immediately after the opening '['.
- Range Expression: A range expression consists of two characters separated by a hyphen, and matches any character lexically within the range indicated. For example, '[A-Za-z]' will match any alphabetic character, upper or lower case. Another way to get this effect is to specify '[a-z]' and use the FOLD_CASE keyword to STREGEX.
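The three forms listed above can be compared side by side. The sketch uses Python's `re` module as a stand-in (an assumption), and its `re.IGNORECASE` flag plays the role that the FOLD_CASE keyword plays for STREGEX; the two are analogous, not identical.

```python
import re

# Hypothetical sample text.
text = "Cab 42 abc!"

# Matching list: any one of 'a', 'b', or 'c'.
matching = re.findall(r"[abc]", text)

# Non-matching list: any single character except 'a', 'b', or 'c'.
non_matching = re.findall(r"[^abc]", text)

# Range expression: runs of alphabetic characters, upper or lower case.
ranged = re.findall(r"[A-Za-z]+", text)

# Same effect with a lower-case range plus case folding.
folded = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)

print(matching)      # ['a', 'b', 'a', 'b', 'c']
print(non_matching)  # ['C', ' ', '4', '2', ' ', '!']
print(ranged)        # ['Cab', 'abc']
print(folded)        # ['Cab', 'abc']
```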
Special Characters in Regular Expressions
Special (non-printing) characters are often represented in regular expressions using backslash escape codes, such as \t to represent a TAB character or \n to represent a newline character. IDL does not support these backslash codes in regular expressions. See Non-Printing Characters for information on how to represent these special characters in regular expressions.
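One general workaround when an engine lacks backslash codes is to place the actual character in the pattern string. The sketch below shows the idea with Python's `re` module as a stand-in; whether this mirrors the approach given in Non-Printing Characters is an assumption, and Python's own engine does accept \t, so the backslash code is deliberately avoided here.

```python
import re

# The actual TAB character (ASCII 9), built from its character code
# rather than written as a backslash code.
TAB = chr(9)

# Hypothetical tab-separated record.
line = "id" + TAB + "name" + TAB + "value"

# The pattern contains literal TAB characters, so the engine never
# needs to interpret a backslash escape.
pattern = TAB + "name" + TAB
match = re.search(pattern, line)
tab_match_start = match.start()

print(tab_match_start)  # 2
```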