[Techtalk] A (Difficult?) Regular Expression Construction Question

Mon Sep 8 22:43:06 EST 2003

On 7 Sep 2003, Elizabeth Barham wrote:

> Dear Ladies and Gentlemen,
> 
>    I am trying to fathom a proper method within the constraints of
> regular-expressions of dealing with a rather complicated
> text-structure. For example,

I teach Perl for a living and cover quite a few parts of basic regular
expressions.  Whilst I'm reasonably sure that japhy
(http://www.perlmonks.org/index.pl?node=japhy) would be the definitive
source on this, I suspect that regular expressions are by far the least
ideal method to solve this problem.

> TECHNOLOGY
> Internet, p. 20
> Routers, p. 35
> Techgear
> Apple's New Ipod, p. 21
> Compaq's new Ipaq, p. 12
> Some new PDA, p. 22
> Weatherquest's New PDA, p. 25
> UNIX, p. 30
> 
> Basically, this is a title (TECHNOLOGY), a series of entries
> (e.g. Internet, p. 20), a sub-topic (Techgear) and its own entries -
> note where its entries end, however: UNIX, which ends the
> forward-flowing alphabetic entry. e.g.:
> 
> TECHNOLOGY (title)
>   Internet, p. 20 (entry)
>   Routers, p. 35
>    Techgear (subtopic)
>      Apple's New Ipod, p. 21 (subtopic entry)
>      Compaq's new Ipaq, p. 12
>      Some new PDA, p. 22
>      Weatherquest's New PDA, p. 25
>   UNIX, p. 30 (entry)

Relying on changes in alphabetic ordering seems like an insane way to
build a tree, if you don't mind me saying so.  Far, far too often you'll
want to start a new branch yet just happen to have a title that doesn't
change in  your alphabetical ordering.

In fact, how do we know that UNIX, p 30 should be an entry rather than a
new title, or a subsubtopic?  It fits in both places to my mind.

If this is a homework question then I wish you luck (because it seems
insane to me).  If it's a real life question then I'm, sorry but your data
is screwed and you really need to get someone to walk through it and
delineate what should be a title, subtitle, subsubtitle etc.  Maybe you
can hire a work experience student... they only get paid $5 a day for 5
days, over here (Australia).

If you really want to do this because you're feeling perverse here's some
pseudo code.  No regular expressions though because if it were possible
they'd be "too ugly to live".

last line;
indentation = 0;
for each line in file
	if indentation == 0
		/* must be a title */
		print TITLE: line.
		indentation = 1
		line = "";  /* no more titles */
	else if indentation == 1
		if line < last line   /* alpha order has reversed */
			indentation = 2
			print SUBTOPIC: line
			line = ""    /* no more subtopics */
		else 		     /* just another entry */
			print ENTRY: line
	else if indentation == 2
		if line < last line 
			indentation = 1 /* back to entries... */
			print ENTRY: line
		else
			print SUB_ENTRY: line

	last line = line;
end.

Note that this will fail on your input because "Routers" < "Techgear".
Since you didn't mention that Techgear was only a subtopic because it was
missing a page reference I wasn't sure if I needed to take that into
account.

If that is the definition of a subtopic then the code changes ever so
slightly to:

last line;
indentation = 0;
for each line in file
        if indentation == 0
                /* must be a title */
                print TITLE: line.
                indentation = 1
                line = "";  /* no more titles */
        else if indentation == 1
                if line !~ m/, p. \d{2}/	/* Perl regexp */
                        indentation = 2
                        print SUBTOPIC: line
                        line = ""    /* no more subtopics */
		else if line < last line   /* next title? */
			print TITLE: line
                else                 /* just another entry */
                        print ENTRY: line
        else if indentation == 2
                if line < last line
                        indentation = 1 /* back to entries... */
                        print ENTRY: line
                else
                        print SUB_ENTRY: line

        last line = line;
end.

It's still horribly error prone and not a good idea.

> I'm trying to come up with a regular expression that would correctly
> identify the "Techgear" and its subsequent entries by noting that the
> next-entry (UNIX) is *behind* the last of its own entries
> (Weatherquest), but I cannot think of a regular-expression that can do
> such a thing, which is a bit surprising because it seems like
> regular-expressions can do so much.

It can only do so much too.  ;)  You could probably get lucky with look
aheads and cuts and such but believe me, noone who has to deal with your
code will appreciate you for it.

If you're using Perl then maybe... (maybe) Parse::RecDescent might do it
for you.  Maybe.

> Any ideas on how to deal with it with regular-expressions?

Yup.  Don't.

	Jacinta

--
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson         |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +613 9354 6001         |  
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
(il),-''  (li),'  ((!.-'              |   www.perltraining.com.au   |