[Courses] Regular expressions: some simple examples of searching.
Mary
linuxchix at puzzling.org
Wed May 22 22:09:42 EST 2002
This was originally a post to techtalk; Sonja recommended a repost here.
This is under the Open Publication Licence as per
http://www.opencontent.org/openpub/ so that it can go on the webpage.
I haven't covered replacing here; perhaps in a followup.
We need to begin with an important distinction:
a regular expression: a recipe for finding a string of characters. The
regular expression is itself a string of characters, and may range from
a simple string, eg 'cat', to a more arcane looking one: '^.*g[0-9]*'
a match: a string that satisfies the regular expression. This is the
string you're using the regular expression to look for. A regular
expression often matches more than one string; this is part of the point
(you can search your document for any sentence not beginning with a
capital letter).
Note that some regular expression tools (grep in particular) will print
out an entire line that *contains* a match, not just the match itself.
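To make the difference between "the line that contains a match" and "the match itself" concrete, here is a minimal sketch. It uses Python's re module, which is my assumption for illustration (the post itself only mentions grep), and the sample text is made up.

```python
import re

# Made-up sample text; three lines, two of which contain 'cat'.
text = "cats are bad\nmy dog is fine\nbad cat"
pattern = re.compile("cat")

# grep-style: keep every whole line that contains a match somewhere.
matching_lines = [line for line in text.splitlines() if pattern.search(line)]

# match-only: keep just the matched text itself.
matches = pattern.findall(text)

print(matching_lines)  # ['cats are bad', 'bad cat']
print(matches)         # ['cat', 'cat']
```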
More on regular expressions:
Simplest matching:
Most characters simply match themselves. So most simple strings are
regular expressions that will match exactly the contents of the string.
For example:
'a' matches 'a', 'cat' matches 'cat'
Meta characters:
'^' matches the beginning of a line. So '^cat' will only match 'cat' if
'cat' begins the line, eg '^cat' matches:
cats are bad
but not:
bad cat
'$' matches the end of a line, so 'cat$' will match the string 'cat'
in:
my bad cat
but not:
I think cats are evil
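The '^' and '$' examples above can be checked with a short sketch, assuming Python's re module rather than grep. The variable names are only illustrative.

```python
import re

starts_with_cat = re.compile(r"^cat")  # 'cat' must begin the line
ends_with_cat = re.compile(r"cat$")    # 'cat' must end the line

lines = ["cats are bad", "bad cat", "my bad cat", "I think cats are evil"]

begin_hits = [line for line in lines if starts_with_cat.search(line)]
end_hits = [line for line in lines if ends_with_cat.search(line)]

print(begin_hits)  # ['cats are bad']
print(end_hits)    # ['bad cat', 'my bad cat']
```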
'.' matches any single character at all, so 'a.' will match 'aa', 'ab',
'a/', 'a%' and so on... (by the way, regular expressions are generally
case sensitive, so 'a' will not match 'A')
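A small sketch of '.' and of case sensitivity, again assuming Python's re module. Here fullmatch is used so that the whole candidate string has to satisfy the expression, and re.IGNORECASE is Python's own way of switching case sensitivity off (other tools have their own options).

```python
import re

dot = re.compile(r"a.")
candidates = ["aa", "ab", "a/", "a%", "a", "Ab"]

# 'a' alone fails (nothing for '.' to match); 'Ab' fails (wrong case).
dot_hits = [s for s in candidates if dot.fullmatch(s)]

# Case-insensitive matching has to be requested explicitly.
ignore_case_hit = bool(re.fullmatch(r"a.", "Ab", re.IGNORECASE))

print(dot_hits)         # ['aa', 'ab', 'a/', 'a%']
print(ignore_case_hit)  # True
```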
'*' means "match the previous character any number of times, including
zero times", so 'cats*' will match:
'cat' (that's the 'zero times' match)
'cats'
'catss'
'catsss'
it won't match the whole of:
'cat s'
though (only the 'cat' at the front, since a space is not an 's').
By default it finds the *longest* possible match, so if the whole
string is 'catssssssss', it will match all of it rather than stopping
at 'cat' or 'cats' on that particular line.
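Here is a rough sketch of 'cats*' and the longest-match behaviour, assuming Python's re module. fullmatch asks whether the entire string satisfies the expression, while search finds a match anywhere inside it.

```python
import re

star = re.compile(r"cats*")

# Does the whole string satisfy 'cats*'?
whole = {s: bool(star.fullmatch(s)) for s in ["cat", "cats", "catsss", "cat s"]}

# Searching inside 'cat s' still finds the leading 'cat'.
partial = star.search("cat s").group()

# '*' is greedy: it takes as many 's' characters as it can.
longest = star.search("catssssssss").group()

print(whole)    # {'cat': True, 'cats': True, 'catsss': True, 'cat s': False}
print(partial)  # cat
print(longest)  # catssssssss
```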
'*' is commonly combined with '.', as in '.*', which matches any text.
An example of this is '<body.*>' which will match:
'<body>'
'<body bgcolor="#999999">'
and other HTML body tags.
'+' matches the previous character one or more times, so '<body.+>'
will match '<body bgcolor="#999999">' but *not* '<body>' as there was
nothing between the 'y' and the '>' for our '.' to match. (Plain grep
treats '+' as an ordinary character; use 'grep -E' or egrep to get this
meaning.)
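The difference between '.*' and '.+' on the body tag example can be sketched like this, assuming Python's re module (where '+' works without any special option):

```python
import re

any_body = re.compile(r"<body.*>")       # zero or more characters after 'body'
nonempty_body = re.compile(r"<body.+>")  # at least one character after 'body'

tags = ["<body>", '<body bgcolor="#999999">']

star_hits = [t for t in tags if any_body.fullmatch(t)]
plus_hits = [t for t in tags if nonempty_body.fullmatch(t)]

print(star_hits)  # ['<body>', '<body bgcolor="#999999">']
print(plus_hits)  # ['<body bgcolor="#999999">']
```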
Ranges:
There's an intermediate layer between most characters matching themselves
(like 'a' or 'b') and '.' matching everything.
It looks like: '[0-9]' (match any character between '0' and '9',
meaning all the digits), or '[a-z]' (match any lowercase letter).
Examples:
'[0-9]*' matches:
'074839578943'
'' (the zero match)
'9378597'
'9'
'1'
but not:
'543a3928' ('a' was not in our range)
Note that the two smaller strings '543' and '3928' inside it would each
be matched, though.
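A sketch of the '[0-9]*' examples, assuming Python's re module. To pull out the digit runs inside '543a3928' it uses '[0-9]+' rather than '[0-9]*', purely so that the empty "zero match" doesn't clutter the result.

```python
import re

digits = re.compile(r"[0-9]*")

# Does the whole string consist only of digits (possibly none at all)?
whole = {s: bool(digits.fullmatch(s)) for s in ["074839578943", "", "9", "543a3928"]}

# The digit runs on either side of the 'a' are still found separately.
runs = re.findall(r"[0-9]+", "543a3928")

print(whole)  # {'074839578943': True, '': True, '9': True, '543a3928': False}
print(runs)   # ['543', '3928']
```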
You can have a range of specific characters, eg '[abcz]' meaning "match
any of 'a', 'b', 'c' or 'z', but nothing else"
Finally, you can negate a range, using (confusingly) the '^' again, but
*inside* the square brackets.
'[^0-9]' will match "any single character that is not in the range '0'
to '9' - ie not a digit"
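Both kinds of bracket expression, a list of specific characters and a negated range, can be tried out with a short sketch, assuming Python's re module. The test strings are made up.

```python
import re

picked = re.compile(r"[abcz]")     # any one of 'a', 'b', 'c' or 'z'
non_digit = re.compile(r"[^0-9]")  # any one character that is not a digit

letters = picked.findall("abcdz")        # 'd' is skipped
non_digits = non_digit.findall("a1b22c")  # the digits are skipped

print(letters)     # ['a', 'b', 'c', 'z']
print(non_digits)  # ['a', 'b', 'c']
```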
If anyone is trying to learn regular expressions, feel free to post in
this thread, saying something like: "I want a regular expression to find
X, here is my attempt" and I and others will be happy to help you out.
-Mary.