[Courses] [Perl] Part 6: The "m//" Operator

Fri Aug 8 12:48:50 EST 2003

LinuxChix Perl Course Part 6: The "m//" Operator

Contents
1) Introduction
2) Introducing the m// Operator
3) A Crash Course on Perl Regular Expressions
4) Back to "m//"
5) Exercise
6) Answers to Previous Exercises
7) Licensing Announcement
8) Past Information
9) Credits

             -----------------------------------

1) Introduction

Having seen the "tr///" operator, we are now going to look at the "m//"
operator, which stands for "match". Because this requires an understanding of
regular expressions, we give a crash course on them. If you're already
familiar with regular expressions, you should still skim over the crash
course because Perl regular expressions have handy shortcuts that you may not
be aware of.

As you may have noticed, Part 6 is twice as long as Part 5, so you may want
to take a break around the beginning of section 4. Consider yourself lucky:
Part 6 was originally about a third longer, but I finally caved in and moved
some material to Part 7.

             -----------------------------------

2) Introducing the m// Operator

Try running this program:

   #!/usr/bin/perl -w
   use strict;

   my $a = 'I talked to Alice this morning.';
   my $b = 'I talked to Bobby this morning.';
   print "'$a' matches /Bob/ \n"    if   $a =~ m/Bob/;
   print "'$b' matches /Bob/ \n"    if   $b =~ m/Bob/;

What happens?

This example makes it look as though "m//" indicates whether one piece of
text contains another. In fact, "m//" performs a "regular expression" match
(or "pattern match"), which is much, much more powerful. We haven't seen
regular expressions yet, but we're going to look at them in a moment.

By the way, Perl is big on avoiding redundancy. If you use a slash as the
delimiter, you can omit the letter "m" entirely! So the last two statements
of the above program could have been written like this:

   print "'$a' matches /Bob/ \n"    if   $a =~ /Bob/;
   print "'$b' matches /Bob/ \n"    if   $b =~ /Bob/;

In fact, most Perl programmers prefer to write it that way. But remember: it
only works if your delimiter is a slash!

Now we're going to look at regular expressions in more detail. But keep these
four things in mind:

a) "m//" is used to test whether a given string matches or does not match a
certain regular expression.

b) The match does not need to be a separate word: in the previous example,
/Bob/ matched "Bobby".

c) The syntax of "m//" is similar to the syntax of "tr///". So we could have
used "m!Bob!" or "m<Bob>" instead of "m/Bob/".

d) You can omit the "m" if your delimiter is a slash.
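
To illustrate points (c) and (d), here is a small sketch that tests the same
string with three equivalent spellings of the match, and then shows one case
where a delimiter other than the slash is convenient. The variable names and
the sample path are made up for this example.

```perl
#!/usr/bin/perl -w
use strict;

my $text = 'I talked to Bobby this morning.';

# Three spellings of the same match ("? 1 : 0" turns true/false into 1/0).
my $with_m    = ($text =~ m/Bob/) ? 1 : 0;
my $with_bang = ($text =~ m!Bob!) ? 1 : 0;
my $without_m = ($text =~ /Bob/)  ? 1 : 0;
print "m/Bob/: $with_m, m!Bob!: $with_bang, /Bob/: $without_m\n";

# A different delimiter saves escaping when the pattern contains slashes.
my $path      = '/usr/bin/perl';                  # hypothetical sample value
my $escaped   = ($path =~ /\/usr\/bin/) ? 1 : 0;  # slashes must be escaped
my $unescaped = ($path =~ m!/usr/bin!)  ? 1 : 0;  # no escaping needed
print "escaped: $escaped, unescaped: $unescaped\n";
```

All five tests should succeed; the last two differ only in readability.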

             -----------------------------------

3) A Crash Course on Perl Regular Expressions

Regular expressions are well-known in Unix circles, though they're not unique
to Unix and they certainly don't require Unix to work. Beware that different
Unix tools (like grep and sed) use different "flavors" of regular
expressions, so the patterns you use in Perl probably aren't directly
portable to other tools.

In Perl regular expressions, the following characters have special meaning:
   . * + ? | ^ $ @ ( ) [ ] { } \
   and the delimiter (usually slash)

They can all be escaped with a backslash.

Okay, let's start with the easiest special character: the period (dot). The
period matches any single character (except a newline, but we'll see more
about that in Part 7). So /a.c/ matches "abc", "aNc" or "a%c". If you want to
match exactly "a.c", escape the period with a backslash: /a\.c/

Square brackets indicate a list of possible choices:
   /a[123]b/  matches exactly three strings: "a1b", "a2b" and "a3b".

A dash inside square brackets indicates a range, just as in "tr///":
   /a[1-9]b/  matches "a1b", "a2b", "a3b", ..., "a9b".

To include the dash itself in a list of choices, place the dash at the
beginning or end of the list:
   /a[12-]b/  matches "a1b", "a2b" and "a-b".

A dash OUTSIDE square brackets has no special meaning; it's just a literal
dash.

A caret at the beginning of a choice negates (inverts) it:
   /a[^A-Z]b/ matches "a1b" or "a%b", but NOT "aMb".

A caret anywhere else in a choice means a literal caret:
   /a[A^Z]b/  matches "aAb", "a^b" or "aZb".
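
If you want to check the square-bracket rules for yourself, something like
the following sketch may help. The sample strings are invented for this
example, and the "foreach" loop is just a convenient way to try several
strings in a row.

```perl
#!/usr/bin/perl -w
use strict;

my $report = '';
foreach my $s ('a1b', 'a-b', 'a^b', 'aMb') {
    $report .= "$s:";
    $report .= ' [1-9]'  if $s =~ /a[1-9]b/;    # a range of digits
    $report .= ' [12-]'  if $s =~ /a[12-]b/;    # 1, 2 or a literal dash
    $report .= ' [^A-Z]' if $s =~ /a[^A-Z]b/;   # anything but a capital letter
    $report .= ' [A^Z]'  if $s =~ /a[A^Z]b/;    # A, a literal caret, or Z
    $report .= "\n";
}
print $report;
```

Each output line lists the patterns that the string matched; "aMb" matches
none of them.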

The characters "* + ?" don't stand for any character, but instead indicate
that the previous character should appear a certain number of times:
   *   previous character appears zero or more times
   +   previous character appears one or more times
   ?   previous character appears zero or one times

Examples:
   /ab*c/     matches "ac", "abc", "abbc" and "abbbbbbbbbc"
   /ab+c/     same as /ab*c/ except doesn't match "ac"
   /ab?c/     matches "ac" and "abc", but NOT "abbc"

Note that if "*" or "+" follows a list of choices or a dot, the match is
interpreted "broadly": each repetition may be a different character:
   /a[123]*b/ matches "a22b" and "a22222b", but also "a12312322321b"
   /a.+b/     matches an "a" followed by a "b", with at least one character
              (any characters) in between

Also note that most special characters lose their special meaning when inside
square brackets. For example, /[.*]/ matches a literal period or a literal
asterisk.
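
The three repetition characters can be compared side by side with a sketch
like this one. It simply counts how many of three sample strings each
pattern accepts; the counter names are made up for the example.

```perl
#!/usr/bin/perl -w
use strict;

# Count how many of the sample strings each pattern accepts.
my ($star, $plus, $question) = (0, 0, 0);
foreach my $s ('ac', 'abc', 'abbc') {
    $star++     if $s =~ /ab*c/;   # zero or more "b"
    $plus++     if $s =~ /ab+c/;   # one or more "b"
    $question++ if $s =~ /ab?c/;   # zero or one "b"
}
print "*: $star  +: $plus  ?: $question\n";

# Inside square brackets, "." and "*" are ordinary characters.
my $class_hit  = ('3*4' =~ /[.*]/) ? 1 : 0;   # contains an asterisk
my $class_miss = ('34'  =~ /[.*]/) ? 1 : 0;   # contains neither
print "3*4: $class_hit  34: $class_miss\n";
```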

The characters "^" and "$" indicate the beginning and end of the string,
respectively:
   /^a/       an "a" followed by anything (or nothing)
   /a$/       anything ending in "a"
   /^a$/      matches only "a"
   /^a[^z]*b$/   matches anything starting with "a", ending with "b" and not
                 containing "z".
Remember that unless you explicitly specify a "^" or a "$", the match can
include any part of the string. So, for example, /ab*c/ matches "abbc", but
also "123abbcXYZ".
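
Here is a short sketch of the difference that "^" and "$" make, using the
string from the paragraph above. The variable names are only for
illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $s = '123abbcXYZ';

my $anywhere = ($s =~ /ab*c/)       ? 1 : 0;   # may match in the middle
my $at_start = ($s =~ /^ab*c/)      ? 1 : 0;   # must begin the string
my $at_end   = ($s =~ /ab*c$/)      ? 1 : 0;   # must end the string
my $whole    = ('abbc' =~ /^ab*c$/) ? 1 : 0;   # must be the entire string
print "anywhere=$anywhere start=$at_start end=$at_end whole=$whole\n";
```

Only the unanchored pattern finds "abbc" inside the longer string; the fully
anchored pattern succeeds when "abbc" is the whole string.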

In addition to representing the end of the string, the character "$" can also
be used for variable interpolation:
   my $pattern = '[aeiou]';
   print "Found vowels\n" if $str =~ m/$pattern/;
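
A runnable version of that fragment might look like the sketch below. The
two test words are invented for the example.

```perl
#!/usr/bin/perl -w
use strict;

my $pattern = '[aeiou]';     # stored as an ordinary string

# The variable's contents are used as the regular expression.
my $vowels_in_rhythm = ('rhythm' =~ m/$pattern/) ? 1 : 0;
my $vowels_in_perl   = ('perl'   =~ m/$pattern/) ? 1 : 0;
print "rhythm: $vowels_in_rhythm, perl: $vowels_in_perl\n";
```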

Parentheses are used for grouping:
   /(abc)+/  matches "abc" and "abcabc", but NOT "abcc" or "aabbcc"
   /a(bc)?d/ matches "ad" and "abcd", but NOT "abd" or "acd".

The vertical bar means "or": the pattern matches if either alternative does:
   /a|b/     matches "a" and "b".
   /ab|cd/   matches "ab" and "cd"

The vertical bar has low precedence, meaning that /AB|CD/ means "AB or CD"
rather than "A, then either B or C, then D". To override this, use
parentheses:
   /a(b|c)d/ matches "abd" or "acd" but NOT "abcd"
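
The following sketch tries the grouping and "or" examples above on a few
strings. The variable names are made up; the patterns are the ones from the
text.

```perl
#!/usr/bin/perl -w
use strict;

# Parentheses group "bc" so that "?" applies to both letters together.
my $group_abcd  = ('abcd' =~ /a(bc)?d/) ? 1 : 0;   # optional "bc" present
my $group_ad    = ('ad'   =~ /a(bc)?d/) ? 1 : 0;   # optional "bc" absent
my $group_abd   = ('abd'  =~ /a(bc)?d/) ? 1 : 0;   # only half of "bc"

# Parentheses limit how far the vertical bar reaches.
my $choice_acd  = ('acd'  =~ /a(b|c)d/) ? 1 : 0;   # "c" is one of the choices
my $choice_abcd = ('abcd' =~ /a(b|c)d/) ? 1 : 0;   # "bc" is not a choice
my $loose_abcd  = ('abcd' =~ /ab|cd/)   ? 1 : 0;   # finding "ab" is enough

print "grouping: $group_abcd $group_ad $group_abd\n";
print "choices:  $choice_acd $choice_abcd $loose_abcd\n";
```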

Finally, Perl includes many useful escape-sequences that only have meaning in
regular expressions, such as:
   \s   whitespace (same as [ \t\n\r\f])
   \w   a "word" character (same as [A-Za-z0-9_])
   \d   a digit (same as [0-9])

Example: /^\s*\d+\s*$/ means "beginning of string, possible leading
whitespace, one or more digits, possible trailing whitespace, end of string".
In other words, it means a non-negative integer.
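
As a quick check of that pattern, the sketch below runs it against a handful
of invented inputs and collects the ones it accepts (each shown in square
brackets so that the whitespace is visible).

```perl
#!/usr/bin/perl -w
use strict;

my $accepted = '';
foreach my $s ('42', '  7  ', '-3', '4.5', '12 34', '') {
    # Optional whitespace, one or more digits, optional whitespace.
    $accepted .= "[$s]" if $s =~ /^\s*\d+\s*$/;
}
print "Accepted: $accepted\n";
```

Only the first two inputs pass: the minus sign, the decimal point, the space
between digits and the empty string all cause the match to fail.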

Capitalizing each of these means "the opposite":
   \S   non-whitespace
   \W   non-word characters
   \D   non-digits

In addition, here is a very handy "zero-length" escape sequence:
   \b   word boundary
and its opposite:
   \B   not a word boundary

They are zero-length because they don't represent a character, just the
boundary between characters. This may not sound important right now, but you
will find it useful with time.

A word boundary is defined as a \w next to a \W (in either order); the
beginning and end of the string count as \W for this purpose. For example,
/\Bing\b/ matches "playing" and "king", but not "jingle" (doesn't match \b)
or "abc ing" (doesn't match \B).
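
The /\Bing\b/ example can be tried out with a sketch like this, which
collects the strings that match:

```perl
#!/usr/bin/perl -w
use strict;

my $hits = '';
foreach my $s ('playing', 'king', 'jingle', 'abc ing') {
    # "ing" must be glued to the preceding letter (\B)
    # and must be the end of a word (\b).
    $hits .= "$s " if $s =~ /\Bing\b/;
}
print "Matched: $hits\n";
```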

             -----------------------------------

4) Back to "m//"

Now that we have seen the details of regular expressions, we can use "m//"
more appropriately.

Regular expressions provide a very concise way of describing whether a piece
of text is "what you want". For example:

   print "Unix\n" if $os =~ m/Linux|Unix|(Free|Open)?BSD|Solaris/;

Remember: you can omit the "m" if you use a slash as the delimiter:

   print "Unix\n" if $os =~ /Linux|Unix|(Free|Open)?BSD|Solaris/;

In Perl, you can surround a part of a regular expression with parentheses so
as to extract the information:

   # Match digits-colon-digits-colon-digits
   if ( $time =~ /(\d+):(\d+):(\d+)/ ) {
     $hour = $1;     # First set of parentheses
     $minute = $2;   # Second set of parentheses
     $second = $3;   # Third set of parentheses
   }

We can compress the above into one line:

   ($hour, $minute, $second) = ( $time =~ /(\d+):(\d+):(\d+)/ );

The parentheses around the variables on the left are necessary: they put the
assignment in list context, where "m//" returns the list ($1,$2,$3).
We'll learn more about lists at a later date.
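
Here is a complete sketch of the one-line form. The sample value of "$time"
is invented, and the second match is included only to show what happens
when the result is stored in a single plain variable instead.

```perl
#!/usr/bin/perl -w
use strict;

my $time = 'Started at 12:48:50 EST';    # sample input for this sketch

# Parentheses on the left: the three captured pieces are assigned in order.
my ($hour, $minute, $second) = ( $time =~ /(\d+):(\d+):(\d+)/ );
print "hour=$hour minute=$minute second=$second\n";

# A single variable on the left: we only learn whether the match succeeded.
my $matched = ( $time =~ /(\d+):(\d+):(\d+)/ );
print "matched=$matched\n";
```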

If the parentheses in a regular expression are nested, it's the order of the
LEFT parenthesis that counts:

   # Same as the previous example, but takes into account
   # the possibility that the seconds are not specified.
   if ( $time =~ /(\d+):(\d+)(:(\d+))?/ ) {
     $hour = $1;     # First set of parentheses
     $minute = $2;   # Second set of parentheses
     $second = $4;   # Fourth set of parentheses
   }

In the above example, the third set of parentheses is used to associate both
the colon and the digits with "?" - either both should be present, or
neither, but not just one of them. We could have saved $3 in a variable as
well, but that would be something like ":59", which probably isn't very
useful.
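
To see the numbering for yourself, you could run a sketch like this one. The
sample time and the variable "$colon_and_seconds" are made up for the
example.

```perl
#!/usr/bin/perl -w
use strict;

my $time = '12:48:50';    # sample input for this sketch
my ($hour, $minute, $colon_and_seconds, $second);

if ( $time =~ /(\d+):(\d+)(:(\d+))?/ ) {
    $hour              = $1;   # first left parenthesis
    $minute            = $2;   # second left parenthesis
    $colon_and_seconds = $3;   # third left parenthesis (the outer pair)
    $second            = $4;   # fourth left parenthesis (the inner pair)
}
print "\$3 is '$colon_and_seconds', \$4 is '$second'\n";
```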

By the way, since our regular expression explicitly allows the user to not
specify the number of seconds, we should anticipate the possibility that the
seconds are excluded. We can test this using the "defined" function on "$4"
or "$second", but a quicker way is to do this:

   $second = $4 || 0;

This will set "$second" to zero if $4 is undefined or zero.

If you have enabled warnings (and you should have), Perl will complain if you
use an undefined value for most operations, so be sure to anticipate possible
undefined values in regular expressions.
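
As a sketch of the more explicit approach, the program below uses "defined"
to supply a default when the seconds are missing. The sample time is
invented for the example.

```perl
#!/usr/bin/perl -w
use strict;

my $time = '12:48';    # sample input with no seconds
my ($hour, $minute, $second);

if ( $time =~ /(\d+):(\d+)(:(\d+))?/ ) {
    $hour   = $1;
    $minute = $2;
    # The optional group matched nothing, so $4 is undefined here.
    $second = defined($4) ? $4 : 0;
}
print "$hour:$minute:$second\n";
```

Because "$second" always ends up with a value, the "print" statement does
not trigger an "uninitialized value" warning.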

             -----------------------------------

5) Exercise

Many URIs[*] (but not all) can be divided into protocol, authority and path
(in that order). For example, consider the following URIs:

   ftp://example.com/foo/bar
   gopher://abc.example.com/xyz
   telnet://example.com/

The protocol is the part before the colon, the authority is the part in the
middle (the domain name), and the path is whatever comes after the authority.
For example, in the first URI, the protocol is "ftp", the authority is
"example.com", and the path is "/foo/bar". Note that the "://" between the
protocol and the authority is not part of the protocol or the authority.

Write a Perl program that uses "m//" to split a URI into protocol, authority
and path, or outputs a message saying that you gave it a URI that it can't
parse.

[*] According to RFC 2396, URLs are structural (they tell you how to get what
you want) whereas URNs are logical (a permanent, abstract mapping of strings
to resources), and URIs include both URLs and URNs. The consensus in Perl
seems to be to use the term "URI".

             -----------------------------------

6) Answers to Previous Exercises

a) A program to convert all 1's to i's and all 0's to o's:

   #!/usr/bin/perl -w
   use strict;

   while ( defined(my $line = <STDIN>) ) {
     $line =~ tr/01/oi/;
     print $line;
   }

b) A Caesar Cipher:

   #!/usr/bin/perl -w
   use strict;

   while ( defined(my $line = <STDIN>) ) {
     $line =~ tr/A-Za-z/C-ZABc-zab/;
     print $line;
   }

             -----------------------------------

7) Licensing Announcement

This course material, including the previously released parts, is copyright
Alice Wood and Dan Richter, and is retroactively released under the same
license as Perl itself (Artistic License or GPL, your choice). This is the
license of choice to make it easy for other people to integrate your Perl
code/documentation into their own projects. It is not generally used in
projects not related to Perl.

By the way, Alice Wood is the author of the first four parts of this course,
and I got her permission to release those parts under that license as well.

             -----------------------------------

8) Past Information

Part 1: Getting Started
         http://linuxchix.org/pipermail/courses/2003-March/001147.html

Part 2: Scalar Data
         http://linuxchix.org/pipermail/courses/2003-March/001153.html

Part 3: User Input
         http://linuxchix.org/pipermail/courses/2003-April/001170.html

Part 4: Control Structures
         http://linuxchix.org/pipermail/courses/2003-April/001184.html

Part 4.5, a review with a little new information at the end:
         http://linuxchix.org/pipermail/courses/2003-July/001297.html

Part 5: The "tr///" Operator:
         http://linuxchix.org/pipermail/courses/2003-July/001302.html

             -----------------------------------

9) Credits

Works cited: "man perlre" and "man perlop"

Thanks to Jacinta Richardson for fact checking.