[Techtalk] A (Difficult?) Regular Expression Construction Question

Mon Sep 8 09:56:41 EST 2003

Quoting Elizabeth Barham <lizzy at soggytrousers.net>:

> Julie writes:
> 
> > I could suggest regular expressions, but I don't see what
> > distinguishes between the "subtopic entry" Weatherquest and the
> > "entry" UNIX.  If you can write out the distinguishing
> > characteristics of each entry type then perhaps you could derive the
> > regular expressions more easily.
> 
> The distinguishing characteristic between Weatherquest and UNIX is
> that Weatherquest is alphabetically after UNIX.
> 
> TECHNOLOGY (title)
>    Internet, p. 20 (entry)
>    Routers, p. 35
>     Techgear (subtopic)
>       Apple's New Ipod, p. 21 (subtopic entry)
>       Compaq's new Ipaq, p. 12
>       Some new PDA, p. 22
>       Weatherquest's New PDA, p. 25
>    UNIX, p. 30 (entry)
> 
> In other words, a match would be successful if the first character is
> within the range preceding-match's-first-character to Z, or something
> like (this is for Java btw):
> 
> (?ms)^([\p{L}\p{P}\p{N}\p{Zs}]+(?! [pP][pP]?\.? [0-9, ]+))\n((.).*?)\n([\3-Z]|\Z)+$
> 
> The first group should match a subtopic (it has no "p. [0-9]+" at the
> end of the line), followed by any number of entries (not fully developed)
> whose first character is equal to or after the preceding
> first-character of the line (the range backreference [\3-Z]).
> 
> It turns out that we are not allowed to use backreferences in a
> character-class / range, however, so this won't work.
> 
> > The other thing is that you can't do things like "alphabetical
> > order" with regular expressions.  You need something which can keep
> > state information.  You might want to look at Perl or Awk.
> 
> Right; I was hoping to stay within the confines of regular-expressions
> and that there was some little trick I wasn't aware of that could
> handle this sort of thing instead of handing data off to a function.

It looks from here as if you'd be better off doing it line by line; just
"remember" what the previous line was, then compare the two. If the current line
is "less than" (i.e. alphabetically before) the previous one, it's a main entry
again.

Something like this in Perl:

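use strict;
use warnings;

my $previous = "";
while (my $line = <>) {
    chomp $line;
    $line =~ s/^\s+//;    # ignore indentation when comparing
    if ($line lt $previous) {
        print "main entry: $line\n";        # sorts before the previous line
    } else {
        print "not a main entry: $line\n";  # same as or after the previous line
    }
    $previous = $line;
}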
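
Since the pattern quoted above is for Java, here is a rough sketch of the same
line-by-line idea there. The class and method names (IndexClassifier, classify)
are made up for illustration. It goes a little beyond the Perl above by assuming
that a line with no ", p. N" page reference is a subtopic heading, that the
title line has already been removed, and that a case-insensitive comparison is
what you want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class IndexClassifier {
    // Assumption: a page reference looks like ", p. 20" or ", pp. 20, 21"
    // at the end of the line.
    private static final Pattern PAGE_REF =
            Pattern.compile(",\\s*[pP][pP]?\\.?\\s*[0-9][0-9, ]*$");

    // Returns one label per non-blank line:
    // "entry", "subtopic" or "subtopic entry".
    public static List<String> classify(List<String> lines) {
        List<String> labels = new ArrayList<>();
        boolean inSubtopic = false;
        String previous = "";
        for (String raw : lines) {
            String line = raw.trim();   // ignore indentation
            if (line.isEmpty()) {
                continue;               // blank lines get no label
            }
            if (!PAGE_REF.matcher(line).find()) {
                // No page number: treat the line as a subtopic heading.
                inSubtopic = true;
                previous = "";          // the first entry under it always qualifies
                labels.add("subtopic");
            } else if (inSubtopic && line.compareToIgnoreCase(previous) >= 0) {
                // Still in alphabetical order, so still under the subtopic.
                labels.add("subtopic entry");
                previous = line;
            } else {
                // Either no subtopic is open, or the order dropped back.
                inSubtopic = false;
                labels.add("entry");
                previous = line;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "   Internet, p. 20",
                "   Routers, p. 35",
                "    Techgear",
                "      Apple's New Ipod, p. 21",
                "      Compaq's new Ipaq, p. 12",
                "      Some new PDA, p. 22",
                "      Weatherquest's New PDA, p. 25",
                "   UNIX, p. 30");
        System.out.println(classify(sample));
    }
}
```

Note that this shares the weakness of the alphabetical rule itself: a main entry
that happens to sort after the last subtopic entry would still be labelled a
subtopic entry, so if the indentation is reliable it may be a better signal.
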

Easier than using regular expressions, I think - what language are you using?

James.