Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions


WDVL Newsletter

Active Server Pages
JSP/Java Servlets
Microsoft SQL Server
Daily Backup
Dedicated Servers
Streaming Audio/Video
24-hour Support    

jobs.webdeveloper.com

Hiermenus


e-commerce
Partner With Us















Developer Channel
FlashKit.com
JavaScript.com
JavaScriptSource
Developer Jobs
ScriptSearch
StreamingMediaWorld
Web Developer's Journal
Web Developer's Virtual Library
WebDeveloper.com
Webreference
Web Hosts
XMLfiles.com

internet.com
IT
Developer
Internet News
Small Business
Personal Technology

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers


How the Engine Works - Page 12

March 9, 2001

We've now seen most of the syntax behind regular expression matching and plenty of examples of it in action. The code that does all the matching is called perl's 'regular expression engine'. You might now be wondering about the exact rules applied by this engine when determining whether or not a piece of text matches. And how much of it matches what. From what our examples have shown us, let us make some deductions about the engine's operation.

Our first expression, ([a-z]+) plucked out a set of one-or-more lower-case letters. The first such set that perl came across was 'silly'. The next character after 'y' was a space, and so no longer matched the expression.

  • Rule one: Once the engine starts matching, it will keep matching a character at a time for as long as it can. Once it sees something that doesn't match, however, it has to stop. In this example, it can never get beyond a character that is not a lower case letter. It has to stop as soon as it encounters one.

    Next, we looked for a series of word characters, using (\w+). The engine started looking at the beginning of the string and found one, '1'. The next character was not a word character (it was a colon), and so the engine had to stop.

  • Rule two: Unlike me, the engine is eager. It's eager to start work and eager to finish, and it starts matching as soon as possible in the string; if the first character doesn't match, try and start matching from the second. Then take every opportunity to finish as quickly as possible.

    Then we tried this: ([a-z]+)(.*)([a-z]+). The result we got with this was a little strange. Let's look at it again:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'
    >
    

    Our first group was the same as what matched before - nothing new there. When we could no longer match lower case letters, we switched to matching anything we could. Now, this could take up the rest of the string, but that wouldn't allow a match for the third group. We have to leave at least one lower-case letter.

    So, the engine started to reverse back along the string, giving characters up one by one. It gave up the closing bracket, the 3, then the opening bracket, and so on, until we got to the first thing that would satisfy all the groups and let the match go ahead - namely a lower-case letter: the 'l' at the end of 'useful'.

    From this, we can draw up the third rule:

  • Rule three: Like me, in this case, the engine is greedy. If you use the + or * operators, they will try and steal as much of the string as possible. If the rest of the expression does not match, it grudgingly gives up a character at a time and tries to match again, in order to find the fullest possible match.

    We can turn a greedy match into a non-greedy match by putting the ? operator after either the plus or star. For instance, let's turn this example into a non-greedy version: ([a-z]+)(.*?)([a-z]+). This gives us an entirely different result:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
    The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
    $1 is 'silly'
    $2 is ' '
    $3 is 'sentence'
    >
    

    Now we've shut off rule three, rule two takes over. The smallest possible match for the second group was a single space. First, it tried to get nothing at all, but then the third group would be faced with a space. This wouldn't match. So, we grudgingly accept the space and try and finish again. This time the third group has some lower case letters, and that can match as well.

    What if we turn off greediness in all three groups, and say this: ([a-z]+?)(.*?)([a-z]+?)

    > perl matchtest2.plx
      Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
      The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
      $1 is 's'
      $2 is ''
      $3 is 'i'
    >
    

    What about this? Well, the smallest possible match for the first group is the 's' of silly. We asked it to find one character or more, and so the smallest it could find was one. The second group actually matched no characters at all. This left the third group facing an 'i', which it took to complete the match.

    Our last example included an alternation:

    > perl matchtest2.plx
    
    Enter a regular expression: e(\w|n\w+)
      The text matches the pattern 'e(\w|n\w+)'.
      $1 is 'n'
      >
    

    The engine took the first branch of the alternation and matched a single character, even though the second branch would actually satisfy greed. This leads us onto the fourth rule:

  • Rule four: Again like me, the regular expression engine hates decisions . If there are two branches, it will always choose the first one, even though the second one might allow it to gain a longer match.

    To summarize:

    The regular expression engine starts as soon as it can, grabs as much as it can, then tries to finish as soon as it can, while taking the first decision available to it.

    Try It Out - Page 11
    Beginning Perl
    Working with RegExps - Page 13


  • Up to => Home / Authoring / Languages / Perl / BeginningPerl




    Jupiter Online Media: internet.comearthweb.comDevx.commediabistro.comGraphics.com

    Search:

    Jupitermedia Corporation has two divisions: Jupiterimages and Jupiter Online Media

    Jupitermedia Corporate Info


    Legal Notices, Licensing, & Permissions, Privacy Policy.

    Web Hosting | Newsletters | Tech Jobs | Shopping | E-mail Offers