[Techtalk] A (Difficult?) Regular Expression Construction Question

Sun Sep 7 20:45:46 EST 2003

Julie writes:

> I could suggest regular expressions, but I don't see what
> distinguishes between the "subtopic entry" Weatherquest and the
> "entry" UNIX.  If you can write out the distinguishing
> characteristics of each entry type then perhaps you could derive the
> regular expressions more easily.

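The distinguishing characteristic between Weatherquest and UNIX is
that Weatherquest is alphabetically after UNIX: Weatherquest continues
the alphabetical run of the subtopic entries before it, while UNIX
breaks that run and so must be a new top-level entry.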

TECHNOLOGY (title)
   Internet, p. 20 (entry)
   Routers, p. 35
    Techgear (subtopic)
      Apple's New Ipod, p. 21 (subtopic entry)
      Compaq's new Ipaq, p. 12
      Some new PDA, p. 22
      Weatherquest's New PDA, p. 25
   UNIX, p. 30 (entry)

In other words, a line would match if its first character falls within
the range from the preceding match's first character to Z. Something
like this (this is for Java, btw):

(?ms)^([\p{L}\p{P}\p{N}\p{Zs}]+(?! [pP][pP]?\.? [0-9, ]+))\n((.).*?)\n([\3-Z]|\Z)+$

The first group should match a subtopic (it has no "p. [0-9]+" at the
end of the line), followed by any number of entries (not fully developed)
whose first character is equal to or after the first character of the
preceding line (the range backreference [\3-Z]).

It turns out, however, that backreferences are not allowed inside a
character class or range, so this won't work.
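
A quick way to see this is to ask java.util.regex directly. The sketch
below is only an illustration: the class and method names are made up
for the example, and the two patterns are cut-down stand-ins for the
full expression above.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackrefInClassCheck {
    // Returns true if java.util.regex accepts the pattern, false if it
    // rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A backreference outside a character class.
        System.out.println(compiles("(.).*\\n\\1"));
        // The same backreference used as the low end of a range.
        System.out.println(compiles("(.).*\\n[\\1-Z]"));
    }
}
```

The first pattern compiles and the second does not, since a character
class has to be fixed when the pattern is compiled, whereas a
backreference is only known once matching is under way.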

> The other thing is that you can't do things like "alphabetical
> order" with regular expressions.  You need something which can keep
> state information.  You might want to look at Perl or Awk.

Right; I was hoping to stay within the confines of regular-expressions
and that there was some little trick I wasn't aware of that could
handle this sort of thing instead of handing data off to a function.
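
For comparison, here is a minimal sketch of what handing the data off
to a function might look like in Java. It rests on several assumptions
that are not part of the original data format: the first line is the
title, a line with no trailing page reference is a subtopic, the
"(entry)"-style annotations from the example are not in the real
input, and blank lines are skipped. The class name, method name, and
labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class IndexClassifier {
    // Trailing page reference such as ", p. 20" or ", pp. 12, 14".
    private static final Pattern PAGE_REF =
        Pattern.compile("\\s[pP][pP]?\\.?\\s[0-9, ]+$");

    // Returns one label per non-blank line, in order.
    static List<String> classify(List<String> lines) {
        List<String> kinds = new ArrayList<>();
        boolean inSubtopic = false;
        char prevFirst = 0; // first letter of the previous subtopic entry
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no meaning
            }
            char first = Character.toUpperCase(line.charAt(0));
            if (kinds.isEmpty()) {
                kinds.add("title"); // assumption: first line is the title
            } else if (!PAGE_REF.matcher(line).find()) {
                kinds.add("subtopic");
                inSubtopic = true;
                prevFirst = 0; // the first entry under a subtopic always fits
            } else if (inSubtopic && first >= prevFirst) {
                kinds.add("subtopic entry");
                prevFirst = first;
            } else {
                // Alphabetical order broke, so we are back at the top level.
                kinds.add("entry");
                inSubtopic = false;
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
            "TECHNOLOGY",
            "   Internet, p. 20",
            "   Routers, p. 35",
            "    Techgear",
            "      Apple's New Ipod, p. 21",
            "      Compaq's new Ipaq, p. 12",
            "      Some new PDA, p. 22",
            "      Weatherquest's New PDA, p. 25",
            "   UNIX, p. 30");
        System.out.println(classify(index));
    }
}
```

The state here is just two variables: whether we are inside a subtopic
and the first letter of the previous subtopic entry. Note that the
heuristic is still ambiguous when the next top-level entry happens to
sort after the last subtopic entry; nothing in the first letters alone
can tell those cases apart.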

Thank you for your help,
Elizabeth