[Techtalk] A (Difficult?) Regular Expression Construction Question

Mon Sep 8 13:52:37 EST 2003

Jacinta writes:

> If this is a homework question then I wish you luck (because it
> seems insane to me).  If it's a real life question then I'm sorry
> but your data is screwed and you really need to get someone to walk
> through it and delineate what should be a title, subtitle,
> subsubtitle etc.  Maybe you can hire a work experience
> student... they only get paid $5 a day for 5 days, over here
> (Australia).

It's for a client. They work on Macintoshes and hand me their index in
text-only MacRoman format. I'm trying to build a Java GUI for them to
use so they can generate an XML version themselves.

> If you really want to do this because you're feeling perverse here's
> some pseudo code.  No regular expressions though because if it were
> possible they'd be "too ugly to live".

LOL! The following is what I've done so far, but it doesn't handle
subtopics yet. Note that (?ms) sets two regex compiler flags (m =
multiline, s = . matches newline), and \p{L} etc. are shorthand for
Unicode character categories:

<?xml version="1.0"?>
<annual-index xmlns:re="http://regexxmlreader.sourceforge.net/" year="2002">
  <!-- 
  groups all the topics together assuming that each title is in
  all-caps.
  -->
  <re:for-each regex="(?ms)^[\p{Lu}\p{P}\p{N}\p{Zs}]+?$.+?(?=(\n[\p{Lu}\p{P}\p{N}\p{Zs}]+$|\Z))">
    <topic>
      <!-- this groups each part into a single line (due to an anomaly in their formatting) -->
      <re:for-each regex="(?ms).*?[^ ](?=(\n[^ ])|\Z)">
	<!-- and normalizes it -->
	<re:replace regex="[ \n]+" with=" " trim="yes">
	  <title re:match="^[\p{Lu}\p{P}\p{N}\p{Zs}]+$">
	    <re:match-string/>
	  </title>
	  <entry re:match="^([\p{L}\p{P}\p{N}\p{Sc}\p{Zs}&#x201C;&#x201D;]+?),? [pP][pP]?\.? ?([-0-9, ]+).*$">
	    <re:group>
	      <title>
		<re:match-string/>
	      </title>
	    </re:group>
	    <re:group>
	      <re:for-each split=", ?">
		<page re:match="[0-9]+">
		  <re:match-string/>
		</page>
		<page-range re:match="([0-9]+) ?- ?([0-9]+)">
		  <re:group>
		    <begin>
		      <re:match-string/>
		    </begin>
		  </re:group>
		  <re:group>
		    <end>
		      <re:match-string/>
		    </end>
		  </re:group>
		</page-range>
	      </re:for-each>
	    </re:group>
	  </entry>
	  <re:otherwise>
	    <re:warning>
	      <re:text>No match with: </re:text>
	      <re:match-string />
	    </re:warning>
	  </re:otherwise>
	</re:replace>
      </re:for-each>
    </topic>
  </re:for-each>
</annual-index>
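
For readers without the regex-to-XML tool, the two central patterns above
can be tried out in isolation. The following is only a rough sketch in
Python, not the tool's actual behaviour. Python's standard re module has
no \p{...} classes, so the title class is replaced by an ASCII stand-in
and the entry title by a lazy wildcard; both substitutions, and the
function names, are assumptions made for the example.

```python
import re

# ASCII stand-in for the post's [\p{Lu}\p{P}\p{N}\p{Zs}] class (assumption).
TITLE_CHARS = r"[A-Z0-9 ,.&'-]"

# An all-caps title line, then everything up to the next all-caps line or
# the end of the text. m: ^ and $ match at line boundaries; s: . matches \n.
TOPIC = re.compile(
    r"(?ms)^" + TITLE_CHARS + r"+?$.+?(?=\n" + TITLE_CHARS + r"+$|\Z)"
)

# Entry title, optional comma, "p"/"pp" with optional dot, then the pages.
ENTRY = re.compile(r"^(.+?),? [pP][pP]?\.? ?([-0-9, ]+).*$")
PAGE_RANGE = re.compile(r"([0-9]+) ?- ?([0-9]+)")


def split_topics(text):
    """Return one string per topic (title line plus its entries)."""
    return [m.group(0) for m in TOPIC.finditer(text)]


def parse_entry(line):
    """Return (title, pages) or None when the line is not an entry.

    Single pages become ints, ranges become (begin, end) tuples.
    """
    m = ENTRY.match(line)
    if m is None:
        return None
    pages = []
    for part in re.split(r", ?", m.group(2).strip()):
        part = part.strip()
        rng = PAGE_RANGE.fullmatch(part)
        if rng:
            pages.append((int(rng.group(1)), int(rng.group(2))))
        elif part.isdigit():
            pages.append(int(part))
    return m.group(1), pages


if __name__ == "__main__":
    sample = "FRUIT\nApples, p. 3\nPears, pp. 4-5\nTOOLS\nHammers, p. 9"
    for topic in split_topics(sample):
        title, *entries = topic.split("\n")
        print(title, [parse_entry(e) for e in entries])
```

As in the XML, nothing here knows about subtopics: every non-title line
is treated as a plain entry.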

> last line;
> indentation = 0;
> for each line in file
> 	if indentation == 0
> 		/* must be a title */
> 		print TITLE: line.
> 		indentation = 1
> 		line = "";  /* no more titles */
> 	else if indentation == 1
> 		if line < last line   /* alpha order has reversed */
> 			indentation = 2
> 			print SUBTOPIC: line
> 			line = ""    /* no more subtopics */
> 		else 		     /* just another entry */
> 			print ENTRY: line
> 	else if indentation == 2
> 		if line < last line 
> 			indentation = 1 /* back to entries... */
> 			print ENTRY: line
> 		else
> 			print SUB_ENTRY: line
> 
> 	last line = line;
> end.
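
The quoted pseudocode translates almost line for line into something
runnable. The version below is a sketch in Python, not a tested solution
for the real index. It assumes plain code-point string comparison stands
in for "alpha order", and the function name and the tuple output are
invented for the example.

```python
def classify(lines):
    """Label each line TITLE, ENTRY, SUBTOPIC or SUB_ENTRY.

    A line that sorts before the previous one is taken as a change of
    level, exactly as in the quoted pseudocode.
    """
    labelled = []
    last = ""    # the "last line" of the pseudocode
    level = 0    # 0 = expect title, 1 = entries, 2 = sub-entries
    for line in lines:
        key = line               # what the next line is compared against
        if level == 0:
            labelled.append(("TITLE", line))
            level = 1
            key = ""             # no more titles
        elif level == 1:
            if line < last:      # alpha order has reversed
                level = 2
                labelled.append(("SUBTOPIC", line))
                key = ""         # sub-entries start a fresh ordering
            else:
                labelled.append(("ENTRY", line))
        else:
            if line < last:      # back to entries
                level = 1
                labelled.append(("ENTRY", line))
            else:
                labelled.append(("SUB_ENTRY", line))
        last = key
    return labelled


if __name__ == "__main__":
    index = ["FRUIT", "Apples", "Pears", "Citrus", "Lemons", "Oranges", "Bananas"]
    for label, text in classify(index):
        print(f"{label}: {text}")
```

Like the pseudocode, this never returns to level 0, so it handles one
topic at a time. It also cannot tell a subtopic from an entry that is
simply out of order.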

I would rather do it all with regular expressions because, among other
things, my little regex-to-xml thingy would work as is. But it
appears I'm going to need to support external classes/functions.

What's really weird, though, is that a regular expression able to
back-reference a character class and/or part of a range would be very
powerful.
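
As a small illustration of the gap being described: an ordinary
back-reference such as \1 repeats the literal text that a group
captured, not the character class that matched it. A minimal sketch in
Python (the pattern and function names are made up for the example):

```python
import re

# \1 repeats the exact text group 1 captured, not the class [A-Z].
SAME_TEXT = re.compile(r"([A-Z])\1")

# To reuse the class itself, it has to be written out (or repeated) again.
SAME_CLASS = re.compile(r"[A-Z]{2}")


def matches(pattern, text):
    """True when the whole of text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(matches(SAME_TEXT, "AA"))   # same letter twice
    print(matches(SAME_TEXT, "AB"))   # same class, different letter
    print(matches(SAME_CLASS, "AB"))
```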

Thanks to all for sharing your ideas,
Elizabeth