[Techtalk] A (Difficult?) Regular Expression Construction Question
Elizabeth Barham
lizzy at soggytrousers.net
Mon Sep 8 13:52:37 EST 2003
Jacinta writes:
> If this is a homework question then I wish you luck (because it
> seems insane to me). If it's a real-life question then I'm sorry,
> but your data is screwed and you really need to get someone to walk
> through it and delineate what should be a title, subtitle,
> subsubtitle etc. Maybe you can hire a work experience
> student... they only get paid $5 a day for 5 days, over here
> (Australia).
It's for a client. They work on Macintoshes and hand me their index in
text-only MacRoman format. I'm trying to build a Java GUI for them to
use so they can generate an XML version themselves.
> If you really want to do this because you're feeling perverse here's
> some pseudo code. No regular expressions though because if it were
> possible they'd be "too ugly to live".
LOL! The following is what I've done so far, but it doesn't handle
subtopics yet. Note that (?ms) sets two regex compiler flags (m =
multiline, so ^ and $ match at line boundaries; s = dotall, so .
also matches newlines), and \p{L}, \p{Lu}, etc. are Unicode
general-category classes:
<?xml version="1.0"?>
<annual-index xmlns:re="http://regexxmlreader.sourceforge.net/" year="2002">
  <!--
    groups all the topics together assuming that each title is in
    all-caps.
  -->
  <re:for-each regex="(?ms)^[\p{Lu}\p{P}\p{N}\p{Zs}]+?$.+?(?=(\n[\p{Lu}\p{P}\p{N}\p{Zs}]+$|\Z))">
    <topic>
      <!-- this groups each part into a single line (due to an anomaly in their formatting) -->
      <re:for-each regex="(?ms).*?[^ ](?=(\n[^ ])|\Z)">
        <!-- and normalizes it -->
        <re:replace regex="[ \n]+" with=" " trim="yes">
          <title re:match="^[\p{Lu}\p{P}\p{N}\p{Zs}]+$">
            <re:match-string/>
          </title>
          <entry re:match="^([\p{L}\p{P}\p{N}\p{Sc}\p{Zs}“”]+?),? [pP][pP]?\.? ?([-0-9, ]+).*$">
            <re:group>
              <title>
                <re:match-string/>
              </title>
            </re:group>
            <re:group>
              <re:for-each split=", ?">
                <page re:match="[0-9]+">
                  <re:match-string/>
                </page>
                <page-range re:match="([0-9]+) ?- ?([0-9]+)">
                  <re:group>
                    <begin>
                      <re:match-string/>
                    </begin>
                  </re:group>
                  <re:group>
                    <end>
                      <re:match-string/>
                    </end>
                  </re:group>
                </page-range>
              </re:for-each>
            </re:group>
          </entry>
          <re:otherwise>
            <re:warning>
              <re:text>No match with: </re:text>
              <re:match-string />
            </re:warning>
          </re:otherwise>
        </re:replace>
      </re:for-each>
    </topic>
  </re:for-each>
</annual-index>
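For anyone who wants to try the two main expressions outside the XML
wrapper, here is a minimal sketch using plain java.util.regex. The class
and method names (IndexRegexSketch, splitTopics, parseEntry) and the
sample index text are made up for illustration and are not part of the
regex-to-XML tool. The patterns are the same as in the listing, with the
curly quotes written as \u201C and \u201D escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class IndexRegexSketch {
    // Characters allowed on an all-caps title line:
    // uppercase letters, punctuation, numbers, and space separators.
    private static final String CAPS = "[\\p{Lu}\\p{P}\\p{N}\\p{Zs}]";

    // Outer for-each: a title line plus everything up to (but not including)
    // the newline before the next title line, or up to the end of input.
    static final Pattern TOPIC = Pattern.compile(
        "(?ms)^" + CAPS + "+?$.+?(?=(\\n" + CAPS + "+$|\\Z))");

    // Entry rule: group 1 is the entry title, group 2 the page list.
    static final Pattern ENTRY = Pattern.compile(
        "^([\\p{L}\\p{P}\\p{N}\\p{Sc}\\p{Zs}\u201C\u201D]+?),? "
        + "[pP][pP]?\\.? ?([-0-9, ]+).*$");

    // Returns one string per topic (title line followed by its entries).
    static List<String> splitTopics(String index) {
        List<String> topics = new ArrayList<>();
        Matcher m = TOPIC.matcher(index);
        while (m.find()) {
            topics.add(m.group());
        }
        return topics;
    }

    // Returns {title, pages} for an entry line, or null if it does not match.
    static String[] parseEntry(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed layout.
        String index = "APPLES\nGrowing, p. 12\nStorage, pp. 14-15\n"
                     + "BANANAS\nImports, p. 3";
        for (String topic : splitTopics(index)) {
            System.out.println("--- topic ---");
            System.out.println(topic);
        }
    }
}
```

The lookahead is what keeps each match from swallowing the next title:
the lazy .+? stops as soon as the following line consists only of
title characters.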
> last line;
> indentation = 0;
> for each line in file
>     if indentation == 0
>         /* must be a title */
>         print TITLE: line.
>         indentation = 1
>         line = ""; /* no more titles */
>     else if indentation == 1
>         if line < last line /* alpha order has reversed */
>             indentation = 2
>             print SUBTOPIC: line
>             line = "" /* no more subtopics */
>         else /* just another entry */
>             print ENTRY: line
>     else if indentation == 2
>         if line < last line
>             indentation = 1 /* back to entries... */
>             print ENTRY: line
>         else
>             print SUB_ENTRY: line
>
>     last line = line;
> end.
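Transcribed more or less literally into Java, the pseudocode might look
like the sketch below. The class and method names are invented, and I
am assuming that "line < last line" means a case-insensitive string
comparison. As in the pseudocode, indentation never returns to 0, so
this handles the lines of one topic at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a literal rendering of the quoted pseudocode.
class IndentationWalk {
    // Labels each line of a single topic as TITLE, ENTRY, SUBTOPIC or SUB_ENTRY.
    static List<String> classify(List<String> lines) {
        List<String> out = new ArrayList<>();
        int indentation = 0;
        String last = "";
        for (String line : lines) {
            // What gets remembered as "last line"; the pseudocode blanks it
            // after a title or subtopic so the next comparison cannot reverse.
            String remembered = line;
            if (indentation == 0) {
                out.add("TITLE: " + line);          // first line must be a title
                indentation = 1;
                remembered = "";
            } else if (indentation == 1) {
                // Assumption: "<" is case-insensitive alphabetical order.
                if (line.compareToIgnoreCase(last) < 0) {
                    indentation = 2;                // alpha order has reversed
                    out.add("SUBTOPIC: " + line);
                    remembered = "";
                } else {
                    out.add("ENTRY: " + line);      // just another entry
                }
            } else {
                if (line.compareToIgnoreCase(last) < 0) {
                    indentation = 1;                // back to entries
                    out.add("ENTRY: " + line);
                } else {
                    out.add("SUB_ENTRY: " + line);
                }
            }
            last = remembered;
        }
        return out;
    }
}
```

Note that the only signal this uses is a reversal of alphabetical order,
so a sub-entry that happens to sort after its parent's neighbours is
indistinguishable from an ordinary entry.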
I would rather do it all with regular expressions because, among other
things, my little regex-to-xml thingy would work as is. But it
appears I'm going to need to support external classes/functions.
What's really weird, though, is that a regular expression that could
back-reference character classes and/or part of a range would be
very powerful.
Thanks to all for sharing your ideas,
Elizabeth