[Techtalk] A (Difficult?) Regular Expression Construction Question
jas at spamcop.net
Mon Sep 8 09:56:41 EST 2003
Quoting Elizabeth Barham <lizzy at soggytrousers.net>:
> Julie writes:
>
> > I could suggest regular expressions, but I don't see what
> > distinguishes between the "subtopic entry" Weatherquest and the
> > "entry" UNIX. If you can write out the distinguishing
> > characteristics of each entry type then perhaps you could derive the
> > regular expressions more easily.
>
> The distinguishing characteristic between Weatherquest and UNIX is
> that Weatherquest is alphabetically after UNIX.
>
> TECHNOLOGY (title)
> Internet, p. 20 (entry)
> Routers, p. 35
> Techgear (subtopic)
> Apple's New Ipod, p. 21 (subtopic entry)
> Compaq's new Ipaq, p. 12
> Some new PDA, p. 22
> Weatherquest's New PDA, p. 25
> UNIX, p. 30 (entry)
>
> In other words, a match would be successful if the first character is
> within the range preceding-match's-first-character to Z, or something
> like (this is for Java btw):
>
> (?ms)^([\p{L}\p{P}\p{N}\p{Zs}]+(?! [pP][pP]?\.? [0-9, ]+))\n((.).*?)\n([\p{3}-Z]|\Z)+$
>
> The first group should match a subtopic (it has no "p. [0-9]+" at the
> end of the line), followed by any number of entries (not fully developed)
> whose first character is equal to or after the preceding
> first-character of the line (the range backreference [\3-Z]).
>
> It turns out that we are not allowed to use backreferences in a
> character-class / range, however, so this won't work.
>
> > The other thing is that you can't do things like "alphabetical
> > order" with regular expressions. You need something which can keep
> > state information. You might want to look at Perl or Awk.
>
> Right; I was hoping to stay within the confines of regular-expressions
> and that there was some little trick I wasn't aware of that could
> handle this sort of thing instead of handing data off to a function.
It looks from here as if you'd be better off doing it line by line: just
"remember" what the previous line was, then compare the two. If the current
line is "less than" (i.e. alphabetically before) the previous one, it's a main
entry again.
Something like this in Perl:
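use strict;
use warnings;

my $previous = "";
while (my $line = <>) {
    chomp $line;
    if ($line lt $previous) {
        # sorts before the previous line, so it's a main entry
        print "main entry: $line\n";
    } else {
        # sorts at or after the previous line, so it isn't
        print "not a main entry: $line\n";
    }
    $previous = $line;    # remember this line for the next comparison
}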
Easier than using regular expressions, I think - what language are you using?
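If it does turn out to be Java (the quoted message mentions it), the same "remember the previous line" idea might look roughly like the sketch below. The class name EntryOrder and the method startsNewRun are made up for illustration, and the comparison ignores case and leading whitespace, which is an assumption on my part rather than something the index format guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: flags each line that sorts before the line preceding it.
class EntryOrder {

    // Returns one flag per input line; true means the line is alphabetically
    // before the previous line, i.e. the alphabetical run has been broken.
    static List<Boolean> startsNewRun(List<String> lines) {
        List<Boolean> flags = new ArrayList<>();
        String previous = "";                      // nothing seen yet
        for (String line : lines) {
            String current = line.trim();          // assumption: indentation is not significant
            flags.add(current.compareToIgnoreCase(previous) < 0);
            previous = current;                    // remember for the next comparison
        }
        return flags;
    }

    public static void main(String[] args) {
        // Sample lines taken from the index quoted above.
        List<String> lines = Arrays.asList(
                "Apple's New Ipod, p. 21",
                "Compaq's new Ipaq, p. 12",
                "Some new PDA, p. 22",
                "Weatherquest's New PDA, p. 25",
                "UNIX, p. 30");
        List<Boolean> flags = startsNewRun(lines);
        for (int i = 0; i < lines.size(); i++) {
            String label = flags.get(i) ? "main entry" : "not a main entry";
            System.out.println(label + ": " + lines.get(i));
        }
    }
}
```

With those sample lines only the UNIX line gets flagged as a main entry, since it is the only one that sorts before the line above it.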
James.