[Courses] [Perl] Part 7: More About "m//"

Thu Aug 14 16:55:01 EST 2003

LinuxChix Perl Course Part 7: More About "m//"

1) Introduction
2) Options for "m//"
3) Anticipating Failed Matches
4) Using Open Patterns
5) Exercise
6) Answer to Previous Exercise
7) Past Information
8) Credits
9) Licensing

             -----------------------------------

1) Introduction

I realise that Part 6 was quite a bit to handle. Regular expressions can be a
lot to get your head around -- and there's no good way to split them up into
short lessons -- but once you've mastered them they're very powerful.

Fortunately, Part 7 has less actual content to memorise -- just a lot of
warnings designed to help you avoid bugs.

             -----------------------------------

2) Options for "m//"

The "m//" operator takes several options, but we will only discuss four of them
here:
   i   Do case-insensitive pattern matching.
   m   Treat string as multiple lines (let /^/ and /$/ match "\n").
   s   Treat string as single line (let /./ match "\n").
   g   Match globally, i.e., find all occurrences.
To see the full list, consult "man perlop".

The easiest option to understand is "i" (case insensitive):

   print "February\n" if $date =~ /feb/i;   # Match "Feb", "FEB", etc.

The "s" option (treat string as single line) is only slightly trickier.
Normally /./ matches any character EXCEPT a newline. This is often what you
want, but if you want /./ to match a newline as well, use "s":

   my($head,$body) = ( $html =~ m!<head>(.*)</head>.*<body>(.*)</body>!si );

We use the "s" option here because an HTML document usually spans many lines,
and the line breaks shouldn't stop /.*/ from matching.
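
To see the difference for yourself, here is a minimal sketch. The sample string
in "$text" is made up for illustration; the only thing that changes between the
two tests is the "s" option.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample data: a string containing one newline.
my $text = "first line\nsecond line";

# Without "s", /./ cannot cross the newline, so the match fails.
my $without_s = ( $text =~ /first.*second/  ) ? "match" : "no match";

# With "s", /./ also matches the newline, so the match succeeds.
my $with_s    = ( $text =~ /first.*second/s ) ? "match" : "no match";

print "without s: $without_s\n";   # without s: no match
print "with s:    $with_s\n";      # with s:    match
```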

The "m" option (treat string as multiple lines) looks like it would be the
opposite of the "s" option, but in fact the two can be used together because
they affect the meanings of different characters. "m" causes /^/ and /$/ to
match the beginning and end of a "line" in the middle of the string in addition
to their normal meaning (the beginning or end of the string itself). The
"lines" in a string are separated by the newline character "\n".

   my($subject) = ( $email =~ /^Subject: (.*)$/m );
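
Here is a small sketch of that line in action. The message in "$email" is
invented for the example; a real message would have more headers. Notice that
the same pattern fails without "m", because "Subject:" is not at the very
beginning of the string.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample message; the addresses are placeholders.
my $email = "From: alice\@example.com\n"
          . "Subject: Lunch on Friday\n"
          . "To: bob\@example.com\n"
          . "\n"
          . "See you there.\n";

# Without "m", /^/ only matches at the very start of the string.
my($no_m)   = ( $email =~ /^Subject: (.*)$/  );

# With "m", /^/ also matches just after each newline.
my($with_m) = ( $email =~ /^Subject: (.*)$/m );

print "without m: ", ( defined $no_m ? $no_m : "(no match)" ), "\n";
print "with m:    $with_m\n";
```

Without "m" the first variable is left undefined; with "m" the second one holds
"Lunch on Friday".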

Note that /^/ and /$/ only mark a position; they never match the newline
character itself. That is why, even with the "m" option set, /^/ normally
belongs at the beginning of the pattern and /$/ at the end. The purpose of the
"m" option is to allow you to identify a match without worrying about whether
it's the beginning of the string or just the beginning of a line. If you're
looking for a line break in the middle of a pattern, you know it's a newline,
so use /\n/ instead:

   if ( $foo =~ /^one line^two lines$/m ) { ... }    # Wrong
   if ( $foo =~ /^one line$two lines$/m ) { ... }    # Wrong
   if ( $foo =~ /^one line$^two lines$/m ) { ... }   # Wrong
   if ( $foo =~ /^one line\ntwo lines$/m ) { ... }   # Right
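
If you want to convince yourself, the following sketch tries two of these
patterns against a made-up two-line string. The variable names are my own; the
point is only that something in the pattern has to match the newline.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical test string: two lines separated by a newline.
my $foo = "one line\ntwo lines";

# /\n/ matches the newline between the two lines, so this succeeds.
my $with_newline = ( $foo =~ /^one line\ntwo lines$/m ) ? "match" : "no match";

# /^/ only marks a position. Nothing here matches the newline character,
# so this fails even with "m".
my $caret_only   = ( $foo =~ /^one line^two lines$/m )  ? "match" : "no match";

print "with \\n:   $with_newline\n";   # with \n:   match
print "with ^ only: $caret_only\n";    # with ^ only: no match
```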

Finally, the "g" option causes a "global" match. This doesn't change WHETHER a
string matches; rather, when the result is assigned to a list, it returns all
matches at once:

   # Find up to three words that contain the letter "q".
   # (We avoid the names $a and $b because Perl's "sort" uses them.)
   my($first,$second,$third) = ( $list =~ /[a-z]*q[a-z]*/g );

This isn't very useful right now, but it will be more helpful when we start
using arrays.
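
In the meantime, here is a small sketch with a made-up "$list" so you can see
what comes back. The string contains four words with a "q" in them; since we
only supply three variables, the fourth match is simply dropped.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical input, all lower case because the pattern only allows a-z.
my $list = "the quick brown fox quietly quit the aquarium";

# In list context, "g" returns every match; we keep the first three.
my($first, $second, $third) = ( $list =~ /[a-z]*q[a-z]*/g );

print "$first, $second, $third\n";   # quick, quietly, quit
```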

             -----------------------------------

3) Anticipating Failed Matches

The "m//" operator is a great way to extract information from a string, but you
should always anticipate the possibility that the string won't match your
pattern.

For example, consider the following code:

   #!/usr/bin/perl -w
   use strict;

   my($name, $blood);

   while ( 1 ) {   # Loop forever.
     print "Enter patient name:   ";
     chomp($name = <STDIN>);
     print "Enter blood type:     ";
     chomp($blood = <STDIN>);
     $blood =~ m/(AB|A|B|O)/;
     print "Transfusing type $1 blood into $name\n\n";
   }

Now let's try it:

   Enter patient name:   Bob
   Enter blood type:     type AB
   Transfusing type AB blood into Bob

So far, so good, but now the nurse accidentally hits <Enter> before giving a
blood type:

   Enter patient name:   Jack
   Enter blood type:
   Transfusing type AB blood into Jack

Uh oh: since this match didn't work, "$1" maintained its value from the
previous match. If Jack's blood type isn't AB, this could be fatal.
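
You can reproduce the problem without any typing. This is only a sketch, with
the two inputs hard-coded in variables of my own naming, but it shows the same
behaviour: a failed match leaves "$1" exactly as it was.

```perl
#!/usr/bin/perl -w
use strict;

my $first_input  = "type AB";
my $second_input = "";            # The nurse just hit <Enter>.

$first_input =~ m/(AB|A|B|O)/;    # Succeeds: $1 is now "AB".
my $after_first = $1;

$second_input =~ m/(AB|A|B|O)/;   # Fails: $1 is left untouched.
my $after_second = $1;

print "After first match:  $after_first\n";    # After first match:  AB
print "After second match: $after_second\n";   # After second match: AB
```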

So let's try writing that middle part differently:

   chomp($blood = <STDIN>);
   my($type) = ($blood =~ m/(AB|A|B|O)/);
   print "Transfusing type $type blood into $name\n\n";

Okay, let's run it now:

   Enter patient name:   Bob
   Enter blood type:     type AB
   Transfusing type AB blood into Bob

   Enter patient name:   Jack
   Enter blood type:
   Use of uninitialized value in concatenation (.) or string at [...]
   Transfusing type  blood into Jack

Still not good.

The second-to-last line will only appear if you turned on warnings (which you
should have). You see, "$type" only has a meaningful value if the user entered
meaningful input; otherwise the match fails and "$type" has the special value
"undef". So Jack is getting undefined blood. I don't know if that carries any
risk but I wouldn't want to try it.

Let's try again:

   my $type;
   if ( $blood =~ m/(AB|A|B|O)/ ) {
     $type = $1;
   }
   print "Transfusing type $type blood into $name\n\n";

This "fix" has the same result as the previous one: "$type" is only set if the
blood type is valid, but the transfusion occurs anyway.

So let's do this right:

   #!/usr/bin/perl -w
   use strict;

   my($name, $blood);

   while ( 1 ) {   # Loop forever.
     print "Enter patient name:   ";
     chomp($name = <STDIN>);
     print "Enter blood type:     ";
     chomp($blood = <STDIN>);
     if ( $blood =~ m/(AB|A|B|O)/ ) {
       print "Transfusing type $1 blood into $name\n\n";
     }
     else {
       print "Wrong blood type; try again.\n\n";
     }
   }

Of course, we don't have to use "$1": we could do this instead:

     if ( my($type) = ($blood =~ m/(AB|A|B|O)/) ) {
       print "Transfusing type $type blood into $name\n\n";
     }
     else {
       print "Wrong blood type; try again.\n\n";
     }

The issue isn't whether or not you use "$1". What's important is that you
always plan for a failed match.
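
In case you're wondering why the second form works as a test: when a list
assignment is used as a condition, Perl looks at how many values the right-hand
side produced. A successful match with one set of parentheses produces one
value; a failed match produces none. The sketch below just makes those counts
visible (the variable names are mine):

```perl
#!/usr/bin/perl -w
use strict;

# Count the values produced by a successful match and by a failed one.
my $good_count = ( my($good_type) = ( "type AB" =~ m/(AB|A|B|O)/ ) );
my $bad_count  = ( my($bad_type)  = ( ""        =~ m/(AB|A|B|O)/ ) );

print "good input: $good_count value(s) captured\n";   # 1, which is true
print "bad input:  $bad_count value(s) captured\n";    # 0, which is false
```

After the failed match, "$bad_type" is undefined, which is why the "if" must
guard any code that uses it.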

             -----------------------------------

4) Using Open Patterns

It's great to check your input, but do it properly. For example, you often want
a user to enter an e-mail address, so you test to make sure the e-mail address
is valid. You might write a test like this:

   if ( $address =~ /^[A-Za-z0-9.]+\@[A-Za-z0-9.]+$/ ) { ... }

But can an e-mail address include quotes? What about slashes?

It turns out that an e-mail address can contain almost any character (at least
on the left side of the "@"). So the following are valid e-mail addresses:
   a/b@example.com
   ***@example.com
   1+2=3@example.com

And let's not forget that not every top-level domain has just two or three
letters. If you doubt this, go to:
   http://index.museum/

For more information on e-mail addresses, consult RFCs 822 and 2822:
   http://www.faqs.org/rfcs/rfc822.html
   http://www.faqs.org/rfcs/rfc2822.html
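
As a rough sketch, here is the strict test from above next to a more open one.
The open pattern is my own suggestion, not something the RFCs prescribe: it
only insists on "something, an @, something" with no spaces. It is still not a
complete check, but it errs on the side of accepting real addresses.

```perl
#!/usr/bin/perl -w
use strict;

# One of the valid addresses listed above.
my $address = '1+2=3@example.com';

# The strict pattern: letters, digits and dots only.
my $strict = ( $address =~ /^[A-Za-z0-9.]+\@[A-Za-z0-9.]+$/ )
           ? "accepted" : "rejected";

# A more open pattern (an assumption for illustration): anything except
# whitespace and "@" on either side of a single "@".
my $open   = ( $address =~ /^[^\@\s]+\@[^\@\s]+$/ )
           ? "accepted" : "rejected";

print "strict pattern: $strict\n";   # strict pattern: rejected
print "open pattern:   $open\n";     # open pattern:   accepted
```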

Another example: suppose you want someone to enter their last name (surname).
If you just write the validation without thinking about it, you might forget
that a last name can contain dashes, apostrophes (think "O'Conner"), spaces
(yes, some people have compound last names) and accented letters (non-English
names).

So remember: check your input, but do it properly. Don't assume you know more
than you do, and give the user some leeway.

Note: some of the examples in this tutorial are deliberately simplified. For
example, in some places we operate on the "Subject" field of an e-mail as
though it was a single line, but RFC 822 says that e-mail headers can be
"folded" into multiple lines. A real program would have to be a little more
rigorous.

             -----------------------------------

5) Exercise

Computers make very bad poets. However, a computer program might be able to
help a poet find a good rhyme.

Write a program that accepts lines of poetry (one at a time) and looks for
rhymes in a pre-programmed list of lines of poetry. The lines in the pre-
programmed list should be separated by newlines.

Of course, it's hard for a computer to determine whether two words rhyme. For
this exercise we will use a very simple mechanism: we will assume that two
words rhyme if their last three letters are the same. For extra credit, try to
find a better mechanism.

Your program should look like this:

   #!/usr/bin/perl -w
   use strict;

   # The following is a multi-line string.
   # (It's the beginning of a poem by A. E. Housman.)

   my $possibilities = "Terence, this is stupid stuff:
   You eat your victuals fast enough;
   There can't be much amiss, 'tis clear,
   To see the rate you drink your beer.
   But oh, good Lord, the verse you make,
   It gives a chap the belly-ache.";

   while ( defined( my $line = <STDIN> ) ) {
     if ( my($last_letters) = ( $line =~ [** regular expression 1 **] ) ) {
       if ( my($match) = ( $possibilities =~ [** regular expression 2 **] ) ) {
         print "That rhymes with: $match\n";
       }
       else {
         print "I can't think of anything that rhymes with that.\n";
       }
     }
     else {
       print "Sorry, I couldn't even begin to rhyme that one.\n";
     }
   }

All you have to do is fill in the parts in [** brackets **].

Some hints:
a) Remember that you only want to match letters; punctuation should be ignored.
b) Use variable interpolation in the second regular expression. That is,
    regular expression 2 should contain '$last_letters'. Doing that can
    sometimes lead to unexpected results (e.g., if $last_letters contained a
    period it would be interpreted as a wild card), but since we're only
    matching letters we shouldn't have any problems.
c) Make sure your match is case-insensitive: "poodle" rhymes with "NOODLE".
d) Test for matches at the beginning, middle and end of $possibilities (that
    is: the first line, the last line, and some other line).
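
You won't need it for this exercise, but here is a sketch of the wild-card
problem mentioned in hint (b), along with one way around it: putting \Q and \E
around the variable makes Perl treat its contents as plain text. The variable
names and sample strings are made up for the example.

```perl
#!/usr/bin/perl -w
use strict;

my $ending = "a.c";   # Contains a period.
my $text   = "abc";   # Does NOT end in "a.c".

# Interpolated as-is, the period acts as a wild card and matches the "b".
my $plain  = ( $text =~ /$ending$/ )     ? "match" : "no match";

# Between \Q and \E the period is taken literally, so the match fails.
my $quoted = ( $text =~ /\Q$ending\E$/ ) ? "match" : "no match";

print "plain:  $plain\n";    # plain:  match
print "quoted: $quoted\n";   # quoted: no match
```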

In case you're wondering, I chose the poem "Terence, this is stupid stuff" for
two reasons. First, the rhyming words are spelled very differently, so whether
you want to rhyme with "fluff" or "rough", you will find a rhyme. Second, it's
what people will say to YOU if you let your computer write your poetry!

             -----------------------------------

6) Answer to Previous Exercise

The previous exercise was to write a program that splits a URI into protocol,
host and path. Here is mine:

   #!/usr/bin/perl -w
   use strict;

   while ( defined(my $line = <STDIN>) ) {
     chomp($line);
     if ( my($protocol,$host,$path) =
                       ($line =~ m<^([^:/]+)://([^/]+)(/?.*)$> ) ) {
       print "protocol=$protocol, host=$host, path=$path\n";
     }
     else {
       print "Sorry: I can't parse that URI.\n";
     }
   }

Yours will probably be different. For example, I was very loose on the
hostname; you might have stipulated that it must not contain certain
characters, such as spaces.

             -----------------------------------

7) Past Information

Part 1: Getting Started
         http://linuxchix.org/pipermail/courses/2003-March/001147.html

Part 2: Scalar Data
         http://linuxchix.org/pipermail/courses/2003-March/001153.html

Part 3: User Input
         http://linuxchix.org/pipermail/courses/2003-April/001170.html

Part 4: Control Structures
         http://linuxchix.org/pipermail/courses/2003-April/001184.html

Part 4.5, a review with a little new information at the end:
         http://linuxchix.org/pipermail/courses/2003-July/001297.html

Part 5: The "tr///" Operator:
         http://linuxchix.org/pipermail/courses/2003-July/001302.html

Part 6: The "m//" Operator
         http://linuxchix.org/pipermail/courses/2003-August/001305.html

             -----------------------------------

8) Credits

Works cited:
a) "man perlop"
b) Kirrily Robert, Paul Fenwick and Jacinta Richardson's "Intermediate Perl",
    which you can find (along with their "Introduction to Perl") at:
    http://www.perltraining.com.au/notes.html

Thanks to Jacinta Richardson for fact checking.

             -----------------------------------

9) Licensing

This course (i.e., all parts of it) is copyright Alice Wood and Dan Richter,
and is released under the same license as Perl itself (Artistic License or GPL,
your choice). This is the license of choice to make it easy for other people to
integrate your Perl code/documentation into their own projects. It is not
generally used in projects not related to Perl.