[Courses] Regular expressions: some simple examples of searching.

Wed May 22 22:09:42 EST 2002

This was originally a post to techtalk; Sonja recommended a repost here.

This is under the Open Publication Licence as per
http://www.opencontent.org/openpub/ so that it can go on the webpage.

I haven't covered replacing here; perhaps in a followup.

We need to begin with an important distinction:

a regular expression: a recipe for finding a string of characters. The
regular expression is itself a string of characters, and may range from
a simple string, eg 'cat', to a more arcane-looking one: '^.*g[0-9]*'

a match: a string that satisfies the regular expression. This is the
string you're using the regular expression to look for. A regular
expression often matches more than one string; this is part of the point
(you can search your document for any sentence not beginning with a
capital letter).

Note that some regular expression tools (grep in particular) will print
out an entire line that *contains* a match, not just the match itself.
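
The difference between "the line containing a match" and "the match
itself" can be sketched in Python with its standard 're' module. This is
only an illustration: the sample lines and the variable names are made
up for the example, and grep itself is not being run here.

```python
import re

# Made-up sample lines, purely for illustration.
lines = ["cats are bad", "my dog is fine", "bad cat"]
pattern = re.compile("cat")

# grep-style: keep every whole line that contains a match somewhere.
matching_lines = [line for line in lines if pattern.search(line)]

# The matches themselves: only the text the pattern actually matched.
matches = [m.group(0) for line in lines for m in pattern.finditer(line)]

print(matching_lines)  # ['cats are bad', 'bad cat']
print(matches)         # ['cat', 'cat']
```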

More on regular expressions:

Simplest matching:

 Most characters simply match themselves. So most simple strings are
 regular expressions that will match exactly the contents of the string.

 For example:
 'a' matches 'a', 'cat' matches 'cat'
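
 A minimal sketch of literal matching, assuming Python's 're' module as
 the tool. The helper name 'contains' is hypothetical; 're.search' looks
 for the pattern anywhere in the text.

```python
import re

def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(contains("a", "a"))             # True
print(contains("cat", "cat"))         # True
print(contains("cat", "my bad cat"))  # True: found inside a longer string
print(contains("cat", "my bad dog"))  # False
```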

Meta characters:
 '^' matches the beginning of a line. So '^cat' will only match 'cat' if
 'cat' begins the line, eg '^cat' matches:

 cats are bad

 but not:

 bad cat

 '$' matches the end of a line, so 'cat$' will match the string 'cat'
 in:

 my bad cat

 but not:

 I think cats are evil
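
 The four examples above can be checked with a short Python sketch. The
 helper name 'found' is hypothetical, and each string is treated as a
 single line.

```python
import re

def found(pattern, line):
    """Return True if the pattern matches somewhere in the line."""
    return re.search(pattern, line) is not None

print(found("^cat", "cats are bad"))           # True: 'cat' begins the line
print(found("^cat", "bad cat"))                # False: 'cat' is not at the start
print(found("cat$", "my bad cat"))             # True: 'cat' ends the line
print(found("cat$", "I think cats are evil"))  # False: 'cat' is not at the end
```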

 '.' matches any single character at all, so 'a.' will match 'aa', 'ab',
 'a/', 'a%' and so on... (by the way, regular expressions are generally
 case sensitive, so 'a' will not match 'A')
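
 A small Python sketch of '.' and of case sensitivity. The candidate
 strings are made up, and the 're.IGNORECASE' flag at the end is an
 extra that the text above does not cover; it is shown only to confirm
 that case was the reason 'Ab' failed.

```python
import re

candidates = ["aa", "ab", "a/", "a%", "Ab", "a"]

# fullmatch requires the pattern to account for the whole string.
matched = [s for s in candidates if re.fullmatch("a.", s)]
print(matched)  # ['aa', 'ab', 'a/', 'a%']

# 'a' alone fails because '.' needs one character to match.
# 'Ab' fails only because of case; the IGNORECASE flag relaxes that.
relaxed = re.fullmatch("a.", "Ab", re.IGNORECASE) is not None
print(relaxed)  # True
```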

 '*' means "match the previous character any number of times, including
 zero times", so 'cats*' will match:

 'cat' (that's the 'zero times' match)
 'cats'
 'catss'
 'catsss'

 it won't match the whole of:

 'cat s'

 though (only the 'cat' at the start matches, since ' ' is not 's').

 By default it finds the *longest* possible match, so if the whole
 string is 'catssssssss', it won't find the match 'cat' or 'cats' or so
 on in that particular line.
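
 Here is a rough Python sketch of 'cats*', including the longest-match
 behaviour. It assumes 're.fullmatch' for "the pattern matches the whole
 string" and 're.search' for "the pattern matches somewhere"; the
 variable names are only illustrative.

```python
import re

samples = ["cat", "cats", "catss", "catsss", "cat s"]

# Which samples does 'cats*' match in their entirety?
whole = [s for s in samples if re.fullmatch("cats*", s)]
print(whole)  # ['cat', 'cats', 'catss', 'catsss']

# In 'cat s' only the leading 'cat' is matched.
partial = re.search("cats*", "cat s").group(0)
print(partial)  # cat

# '*' takes as many 's' characters as it can.
longest = re.search("cats*", "catssssssss").group(0)
print(longest)  # catssssssss
```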

 '*' is commonly combined with '.', as in '.*', which matches any text.
 An example of this is '<body.*>' which will match:

 '<body>'
 '<body bgcolor="#999999">'

 and other HTML body tags.

 '+' matches the previous character one or more times, so '<body.+>'
 will match '<body bgcolor="#999999">' but *not* '<body>' as there was
 nothing between the 'y' and the '>' for our '.' to match. (Plain grep
 treats '+' as an ordinary character; use egrep or 'grep -E' for this.)
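
 The difference between '.*' and '.+' on the two body tags can be
 sketched in Python as follows. The last example, with a made-up line of
 HTML, is an extra illustration of the longest-match rule: '.*' runs on
 to the last '>' in the line, not the first.

```python
import re

tags = ['<body>', '<body bgcolor="#999999">']

# '.*' allows zero characters between 'y' and '>', so both tags match.
star = [t for t in tags if re.search("<body.*>", t)]
print(star)  # ['<body>', '<body bgcolor="#999999">']

# '.+' needs at least one character there, so the bare tag drops out.
plus = [t for t in tags if re.search("<body.+>", t)]
print(plus)  # ['<body bgcolor="#999999">']

# Longest match: the pattern swallows everything up to the final '>'.
greedy = re.search("<body.*>", "<body><p>hello</p>").group(0)
print(greedy)  # <body><p>hello</p>
```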

Ranges:
 There's an intermediate layer between most characters matching
 themselves (like 'a' or 'b') and '.' matching everything.

 It looks like: '[0-9]' (match any single character between '0' and '9',
 meaning all the numerals), or '[a-z]' (match any lowercase letter).

 Examples:

 '[0-9]*' matches:

 '074839578943'
 '' (the zero match)
 '9378597'
 '9'
 '1'

 but not the whole of:
 '543a3928' ('a' was not in our range)

 Note that the two smaller strings '543' and '3928' would be matched,
 though.
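
 A Python sketch of the '[0-9]*' examples. One assumption to flag: the
 second half uses '[0-9]+' rather than '[0-9]*' to pull out the smaller
 strings, because with '*' Python's 're.findall' would also report empty
 "zero" matches.

```python
import re

samples = ["074839578943", "", "9378597", "9", "1", "543a3928"]

# Which samples consist of nothing but digits (possibly none at all)?
whole = [s for s in samples if re.fullmatch("[0-9]*", s)]
print(whole)  # ['074839578943', '', '9378597', '9', '1']

# The runs of digits on either side of the 'a' still match on their own.
runs = re.findall("[0-9]+", "543a3928")
print(runs)  # ['543', '3928']
```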

 You can also list specific characters, eg '[abcz]' meaning "match
 any of 'a', 'b', 'c' or 'z', but nothing else".

 Finally, you can negate a range, using (confusingly) the '^' again, but
 *inside* the square brackets.

 '[^0-9]' will match "any character that is not in the range '0' to
 '9', ie not a numeral".
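
 A short Python sketch of character lists and negated ranges. The sample
 strings are made up; the last two lines show the two meanings of '^'
 side by side (start of line outside the brackets, negation inside).

```python
import re

# Every character that is one of 'a', 'b', 'c' or 'z'.
letters = re.findall("[abcz]", "a buzzing cab")
print(letters)  # ['a', 'b', 'z', 'z', 'c', 'a', 'b']

# Every character that is NOT a numeral.
non_digits = re.findall("[^0-9]", "543a3928")
print(non_digits)  # ['a']

# '^[^0-9]': the line must start with something other than a numeral.
starts_non_digit = re.search("^[^0-9]", "cat9") is not None
starts_with_digit = re.search("^[^0-9]", "9 lives") is not None
print(starts_non_digit, starts_with_digit)  # True False
```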

If anyone is trying to learn regular expressions, feel free to post in
this thread, saying something like: "I want a regular expression to find
X, here is my attempt" and I and others will be happy to help you out.

-Mary.