Finding what matched and other advanced features

an article added by: Andrew Peterson at 05012008



Sometimes all you need to know is whether the input text matched a pattern. More commonly, you want to process the specific data that were matched. For example, you hope that the data from your web form contain a valid credit card number: a sequence of 13 to 16 digits. You would not simply want to verify that the pattern occurs; you would want to extract the digit sequence that was matched, so that you could apply further verification checks.

Regular expressions allow you to define groups of pattern elements; an overall pattern can, for example, have some literal text, a group with a variable-length sequence of characters from some class, more literal text, another group with different characters, and so forth. If the pattern is matched, the regular expression matching functions store details of the overall match and of the parts matched by each of the groups. These data are stored in global variables defined in the Perl core. The groups of pattern elements whose matches in the string are required are placed in parentheses. So, a pattern for extracting a 13–16 digit sub-string from some longer string could be /\D(\d{13,16})\D/; if a string matches this pattern, the variable $1 will hold the digit string.

The following example illustrates the extraction of two fields from an input line. The input line is supposed to be a message that contains a dollar amount. The dollar amount is expected to consist of a dollar sign, some number of digits, an optional decimal point and an optional fraction amount. The pattern used for this match is:

/\$([0-9]+)\.?([0-9]*)\D/

Its elements are:

\$        A literal dollar sign
([0-9]+)  A non-empty sequence of digits forming the first group
\.?       An optional decimal point
([0-9]*)  An optional sequence of digits forming the second group
\D        Any 'non-digit' character

The text that matches the first parenthesized subgroup is held in the Perl core variable $1; the text matching the second group of digits would go in $2. Since the second subgroup expression specifies ‘zero or more digits’, it is possible for $2 to hold an empty string after a successful match. The variables $1, $2 etc. are read-only; data values must be copied from these variables before they can be changed.

while(1) {
   print "Enter string : ";
   $str = <STDIN>;
   last unless defined $str;   # stop at end of input
   if($str =~ /Quit/i) { last; }
   if($str =~ /\$([0-9]+)\.?([0-9]*)\D/) {
      # $2 is an empty string when no digits follow the dollars
      if($2) { $cents = $2; }
      else { $cents = 0; }
      print "Dollars $1 and cents $cents\n";
   }
   else { print "Didn't match dollar extractor\n"; }
}

Examples of test inputs and outputs are:

Enter string : This is a test of the $ program.
Didn't match dollar extractor
Enter string : This program cost $0.
Dollars 0 and cents 0
Enter string : This program should cost $34.99
Dollars 34 and cents 99
Enter string : qUIT
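The 13 to 16 digit pattern mentioned at the start can be exercised in the same way. The following is only a sketch: the helper name extract_card_digits and the sample strings are invented for illustration, and the value is copied out of $1 at once because the next successful match overwrites it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run of 13 to 16 digits that has a
# non-digit character on each side, or undef when there is no such run.
sub extract_card_digits {
    my ($text) = @_;
    return $text =~ /\D(\d{13,16})\D/ ? $1 : undef;
}

for my $input ('Card: 4111111111111111, thanks', 'Ref 12345 only') {
    my $digits = extract_card_digits($input);
    print defined($digits) ? "Found $digits\n" : "No card-like digit run\n";
}
```

Because the pattern demands a non-digit on both sides, a digit run at the very start or end of the string is not matched, and neither is a run of 17 or more digits.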

Often, you need a pattern like:

Some fixed text;

A string whose value is arbitrary, but is needed for processing;

Some more fixed text.

You use .* to match an arbitrary string; so if you were seeking to extract the sub-string between the words ‘Fixed’ and ‘text’, you could use the pattern /Fixed(.*)text/:

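while(1) {
   print "Enter string : ";
   $str = <STDIN>;
   last unless defined $str;   # stop at end of input
   if($str =~ /Quit/i) { last; }
   if($str =~ /Fixed(.*)text/) {
      # $1 holds whatever lay between 'Fixed' and 'text'
      print "Matched with substring $1\n";
   }
   else { print "Didn't match\n"; }
}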

Example inputs and outputs:

Enter string : Fixed up text on slide.
Matched with substring up
Enter string : Fixed up this text. Now starting to work on other text.
Matched with substring up this text. Now starting to work on other
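In the second input, the capture runs on to the last occurrence of ‘text’, because .* takes as much of the string as it can. As a small sketch (the variable names are invented here), the following compares it with the quantifier .*?, which takes as little as it can:

```perl
use strict;
use warnings;

my $str = 'Fixed up this text. Now starting to work on other text.';

# .* is greedy: the capture extends to the last 'text' in the string.
my ($greedy) = $str =~ /Fixed(.*)text/;

# .*? is minimal: the capture stops at the first 'text'.
my ($minimal) = $str =~ /Fixed(.*?)text/;

print "greedy : [$greedy]\n";
print "minimal: [$minimal]\n";
```

The brackets in the output make the leading and trailing spaces of each captured string visible.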

The matching of arbitrary strings can sometimes be problematic. The matching algorithm is ‘greedy’: it attempts to find the longest string that matches. There are more subtle controls; you can use patterns like .*?, which match a minimal string (so in the second of the examples above, you would get the match ‘ up this ’). Sometimes, there is a need for more complex patterns like:

fixed_text(somepattern)other_stuffSAMEPATTERNrest_of_line

These patterns can be defined through the use of ‘back references’ in the pattern string. Back references are related to matched sub-strings. When the pattern matcher is checking the pattern, it finds a possible match for the first sub-string (the element ‘(somepattern)’ in the example) and saves this text in the Perl core variable $1. A back reference, in the form \1, that occurs later in the match pattern will be replaced dynamically by this saved partial match. The pattern matcher can then confirm that the same text is repeated.

Back references are illustrated in the following code fragments. These fragments might form part of a Perl script that performs an approximate translation of Pascal code to C code. Such a transform cannot be completely automated (the languages do have some fundamental differences, like Pascal’s ability to nest procedure declarations); however, large parts of the translation task can be automated. The simplest transformation operations that you would want are:

Count := Count + 1;   =>  Count++;
Count:= Count*Mul;    =>  Count*=Mul;
Sum := Sum + 17;      =>  Sum+=17;

For these, you need a pattern that:

Matches a name (Lvalue); this is to be matched sub-string $1.

Matches Pascal’s := assignment operator.

Matches another name that is identical to the first thing matched, so you need back reference \1 in the pattern.

Matches a Pascal +, -, *, / operator; this is to be matched sub-string $2.

Matches either a number or another name; match sub-string $3.

Matches Pascal’s terminating ‘;’.

Allows extra whitespace anywhere.

If an input line matches the pattern, the program can output a revised line that uses C’s modifying assignment operators (++, += etc.); inputs that do not match may be output unchanged. A little test framework that illustrates transformations only for ‘+’ and ‘-’ operators is:

while(1) {
   print "Enter string : ";
   $str = <STDIN>;
   last unless defined $str;   # stop at end of input
   if($str =~ /Quit/i) { last; }
   # A FAIRLY COMPLEX MATCH PATTERN!
   if($str =~ /\s*([A-Za-z]\w*) *:= *\1 *(\+|\*|\/|-) *(([0-9]+)|([A-Za-z]\w*)) *;/) {
      # Replace x:=x+1 by x++, similarly x--
      if(($3==1) && ($2 eq "+")) { print "\t$1++;\n"; }
      elsif(($3==1) && ($2 eq "-")) { print "\t$1--;\n"; }
      # Replace x:=x+y by x+=y, similarly for -
      else { print "\t$1 $2= $3;\n"; }
   }
   else { print $str; }   # $str still ends with its own newline
}

The pattern needed here is:

/\s*([A-Za-z]\w*) *:= *\1 *(\+|\*|\/|-) *(([0-9]+)|([A-Za-z]\w*)) *;/

The parts are:

\s* matches any number of leading space or tab characters.

([A-Za-z]\w*) matches a string that starts with a letter and continues with an arbitrary number of letters, digits and underscore characters (this should capture valid Pascal variable identifiers). This is matched subgroup $1; its value is referenced later in the pattern via the back reference \1, and it can also be used in the processing code.

‘ *’ a space with a * quantifier (zero or more); this matches any spaces that appear after the variable name and before the Pascal assignment operator :=.

:= the literal string that matches Pascal’s assignment operator.

‘ *’ again, makes provision for extra spaces.

\1 the back reference. Needed to establish that the statement has a form like sum:=sum+val;.

‘ *’ the usual provision for extra spaces.

(\+|\*|\/|-) matches a Pascal binary operator; this is matched subgroup $2. (Characters like ‘+’ have to be ‘escaped’ because their normal interpretation is as control elements in the pattern definition.)

‘ *’ possible spaces.

(([0-9]+)|([A-Za-z]\w*)) a matched sub-string, $3, that is either a sequence of digits, [0-9]+, or a Pascal variable name.

‘ *’ as usual, spaces.

; Pascal’s statement separator.

Regular expressions for complex pattern matching can become quite large. I have heard, via email, rumors of a 4000 character expression that captures the important elements from an email address, making allowance for the majority of variations in the forms of email addresses!
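Returning to the Pascal example, the pattern and the processing code can be pulled together into one self-contained sketch. The sub name pascal_to_c and the sample statements are invented for illustration; the pattern is the one dissected above, and the captures are copied into named variables before use.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one Pascal "x := x op y;" statement using
# C's modifying assignment operators; other lines are returned unchanged.
sub pascal_to_c {
    my ($line) = @_;
    if ($line =~ /\s*([A-Za-z]\w*) *:= *\1 *(\+|\*|\/|-) *(([0-9]+)|([A-Za-z]\w*)) *;/) {
        my ($name, $op, $operand) = ($1, $2, $3);
        return "$name++;" if $operand eq '1' && $op eq '+';
        return "$name--;" if $operand eq '1' && $op eq '-';
        return "$name $op= $operand;";
    }
    return $line;
}

print pascal_to_c($_), "\n" for (
    'Count := Count + 1;',
    'Count:= Count*Mul;',
    'Sum := Sum + 17;',
    'Sum := Total + 17;',   # no repeated name, so \1 cannot match
);
```

The last statement is printed unchanged: the name on the right of := differs from the one on the left, so the back reference \1 fails and the whole pattern is rejected.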

Programs that do elaborate text transforms, like a more ambitious version of the toy ‘Pascal to C’ converter, typically need to apply many different transformations to the same line of input. For example, a Pascal if ... then needs to be rewritten in C’s if(...)... style. If the conditional part of that statement involves a Pascal not operator, it must be rewritten using C’s ! operator. Such transformation programs don’t simply read a line, apply a transform and output the transformed line. Instead, the transforms are applied successively to the string in situ. After each transformation, the updated string is checked against other possible patterns and their replacements.

Perl has a substitution operator that performs these in situ transforms of strings. A substitution consists of a regular expression that identifies features in the source string, and the replacement text. The patterns and replacements can incorporate matched sub-strings, so it is possible to extract a variable piece of text embedded in some fixed context and define a replacement in which the variable text is embedded in a slightly changed context. The imaginary ‘Pascal to C transformer’ provides another example. One would need to change Pascal’s not operator to C’s ! operator. The common cases, which would be easy to translate, are:

Lvalue := not expression;  =>  Lvalue := !expression;
if(not expression) then    =>  if( !expression) then

Both lines would have to be subjected to further transforms: one to replace Pascal’s := by C’s = (the result must not be written as !=, which is C’s inequality operator), and others to replace the if ... then form by the equivalent C construct. A substitution pattern that could make the not transformation is:

s/(:=|\() *not +/$1 !/;

The pattern defines:

A subgroup that either contains the literal sequence := or a left parenthesis (escaped as \( ).

Optional spaces.

The literal not.

One or more spaces.

The replacement is whatever text matched the subgroup (either := or left parenthesis), a space and C’s ! operator. This substitution pattern would be used in code like the following:

while($str=<INPUT>) {
   chomp($str);
   # apply sequence of transforms to $str
   ...
   # next, deal with Pascal's not operator
   $str =~ s/(:=|\() *not +/$1 !/;
   ...
   print $str, "\n";
}
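To see the effect of this one substitution in isolation, here is a minimal sketch. The sub name translate_not and the sample lines are invented for illustration; the replacement uses $1 to refer to the captured group.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a Pascal 'not' that follows ':=' or '('.
sub translate_not {
    my ($line) = @_;
    # Without the /g modifier, only the first occurrence is replaced.
    $line =~ s/(:=|\() *not +/$1 !/;
    return $line;
}

print translate_not('Done := not Found;'), "\n";
print translate_not('if(not Found) then'), "\n";
```

The := and the if ... then are left as they were; in the full translator, later substitutions would deal with them.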

Your first applications of regular expressions will use only the simplest forms of patterns. Your tasks will, after all, be simple things like extracting a dollar amount from some input text, isolating an IP address from a server log, or identifying which credit card company is preferred. But it is possible, and it is often worthwhile, to try more sophisticated matches and transforms. You can get many ideas from the Perl perlretut tutorial and perlre reference documentation.
