Advanced regexps

If you are confused already, it is probably best that you re-read the last section before continuing - the expressions only get more complicated!

We have gone through basic and novice regexps - now we're onto the powerful stuff. Outside of sets, the characters +, *, ?, { }, $, and ^ all have special meaning in a regexp.

The first four control how many times the previous expression should be matched, and the last two control where in the string the match may occur. + means "match one or more of the previous expression", * means "match zero or more of the previous expression", and ? means "match 0 or 1 of the previous expression".

Here are some examples:

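<?php
    preg_match("/[A-Za-z]*/", $string);

    preg_match("/-?[0-9]+/", $string);

    // Single quotes here, so that PHP passes the backslash through to the
    // regexp engine; inside double quotes, \$ would become a bare $
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
?>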

The first expression will match "", "a", "aaaa", "The sun has got his hat on", and in fact any other string - the expression can be translated as "match zero or more uppercase and lowercase letters", and zero letters can be found anywhere. The second regexp will match 1, 100, 324343995, and also -1, -100, -234011, etc - the "-?" means "match exactly 0 or 1 minus symbols".

The last regexp is fairly complicated, but, as always with regexps, complexity == power. As mentioned before, $ is a regexp symbol in its own right, however here we precede it with a backslash, which, unsurprisingly, works as an escape character turning the $ into a standard character and not a regexp symbol. We then match precisely one symbol from the range A-Z, a-z, and _, then match zero or more symbols from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That's right - that regexp will match PHP variables.
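To see these three quantifiers in action, here is a minimal sketch that passes a third argument to preg_match() so the matched text can be inspected. The sample string and the variable names are made up for illustration. The patterns are written in single quotes so that PHP hands the backslash to the regexp engine untouched.

```php
<?php
// Sample input, invented for this example
$string = 'Set $foo to -42 and $_bar99 to 7';

// * allows zero letters, so this succeeds on any string at all;
// here it grabs the leading run of letters
preg_match('/[A-Za-z]*/', $string, $letters);

// An optional minus sign followed by one or more digits
preg_match('/-?[0-9]+/', $string, $number);

// A literal dollar sign, one letter or underscore, then zero or more
// letters, underscores, or digits
preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string, $variable);

// preg_match_all() collects every match rather than only the first
preg_match_all('/\$[A-Za-z_][A-Za-z_0-9]*/', $string, $all);

echo $letters[0], "\n";            // Set
echo $number[0], "\n";             // -42
echo $variable[0], "\n";           // $foo
echo implode(', ', $all[0]), "\n"; // $foo, $_bar99
```

preg_match() stops at the first match it finds, which is why only $foo is reported by the third call; preg_match_all() is the one to reach for when every occurrence is wanted.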

Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.

Here is a list of advanced regular expressions using braces, with the string each one is tested against, and whether or not a match is made:

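Regexp | String | Result
/[A-Z]{3}/ | FuZ | No match; the regexp needs three uppercase letters in a row
/[A-Z]{3}/i | FuZ | Match; same as above, but case insensitive this time
/[0-9]{3}-[0-9]{4}/ | 555-1234 | Match; precisely three numbers, a dash, then precisely four. This will match local US telephone numbers, for example
/[a-z]+[0-9]?[a-z]{1}/ | aaa1 | Match; the regexp is not anchored and the number is optional, so it finds "aaa". Anchored as /^[a-z]+[0-9]?[a-z]{1}$/ there is no match, because the string must then end with one lowercase letter
/[A-Z]{1,}99/ | 99 | No match; the 99 must come after at least one uppercase letter
/[A-Z]{1,5}99/ | FINGERS99 | Match; the regexp is not anchored, so it finds "NGERS99". Anchored as /^[A-Z]{1,5}99$/ there is no match, because the string may then start with a maximum of 5 uppercase letters
/[A-Z]{1,5}[0-9]{2}/i | adams42 | Match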
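These repeat counts are easy to check for yourself, because preg_match() returns 1 when the pattern is found and 0 when it is not. The sketch below is one way to do that; the helper name check() is invented for this example. Keep in mind that preg_match() looks for the pattern anywhere in the string, so a maximum such as {1,5} only limits how much that part of the pattern can consume. To constrain the whole string, the pattern needs the ^ and $ symbols as well.

```php
<?php
// Hypothetical helper: report whether $regexp is found in $subject
function check($regexp, $subject) {
    return preg_match($regexp, $subject) === 1 ? 'match' : 'no match';
}

echo check('/[A-Z]{3}/', 'FuZ'), "\n";               // no match
echo check('/[A-Z]{3}/i', 'FuZ'), "\n";              // match
echo check('/[0-9]{3}-[0-9]{4}/', '555-1234'), "\n"; // match
echo check('/[A-Z]{1,}99/', '99'), "\n";             // no match

// Unanchored: the pattern is found inside the string ("NGERS99")
echo check('/[A-Z]{1,5}99/', 'FINGERS99'), "\n";     // match
// Anchored: the whole string must fit, and seven letters is too many
echo check('/^[A-Z]{1,5}99$/', 'FINGERS99'), "\n";   // no match
```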

Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line" respectively. Consider the following string:

$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

As you know, \n means "new line", so what we have there is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

In order to parse multi-line strings correctly, we need the "m" modifier, so "m" needs to go after the final slash. Here is some PHP code - which expressions do you think will match?

<?php
    preg_match("/is$/m", $multitest);

    preg_match("/the$/m", $multitest);

    preg_match("/^the/m", $multitest);

    preg_match("/^Symbol/m", $multitest);

    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>

The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the start of a line". Line four is "return true if 'Symbol' is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line".
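To see what the "m" modifier actually changes, this sketch runs the same pattern with and without it, then uses preg_match_all() to list every line that starts with "the". Without "m", ^ matches only at the very start of the string, and $ only at the very end (or just before a final newline). The variable names other than $multitest are invented for this example.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

// Without "m", ^ only matches at the start of the whole string,
// and the string starts with "This", not "the"
$without = preg_match('/^the/', $multitest);

// With "m", ^ also matches just after each newline
$with = preg_match('/^the/m', $multitest);

// Collect every whole line that begins with a lowercase "the";
// the dot does not match a newline, so .* stops at the end of the line
preg_match_all('/^the.*$/m', $multitest, $lines);

echo $without, " ", $with, "\n";     // 0 1
echo implode("; ", $lines[0]), "\n"; // the dollar
```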

As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.

However, we're not finished yet, grasshopper - if you wish to attain regexp nirvana, you need to understand the last few secrets of regexp wisdom...

 
