[techtalk] switch function in C (or how to read commandline args?)

Tue Jul 3 11:01:22 EST 2001

On Mon, Jul 02, 2001 at 01:33:50AM +0100, Conor Daly wrote:
> Thanks James and Jeff for the thoughts.  It kinda looks like an if else-if
> chain is my only option.

nah! there are much more crazy options... :)

Basically, I would of course agree that a list of if-else-if-strcmp()'s
is perfectly suitable for the purpose at hand.
Yet, a programmer's life would be boring if there weren't intelectually
more challenging ways of achieving the same effect :)  Read on if you
feel like playing around with alternative ideas...

One neat way would be to use hashes in combination with function
pointers.
Now, hashes come in many flavors. There are hash functions for
cryptographic purposes, message digests, checksums, etc. (SHA, MD5,
CRC, ...), which were designed with security or error checking
considerations in mind. These are not the ones I'm referring to here.
Other hash functions are used to build the abstract data type "hash
table", which associates arbitrary key-value pairs. In contrast to
linear arrays, where elements are indexed via integer values, the keys
used in hash tables can typically be *any* strings (thus also command
line options). These are the ones I'm having in mind here.

I know this thread is on C programming. Still, I'd like to quickly
illustrate the idea in perl. It's just more compact to start with.
We'll get to the C version immediately afterwards.

In perl you could do something like:

%handlers = (
    "-foo" => \&do_foo,
    "-bar" => \&do_bar,
    ...
);

sub do_foo { ... }
sub do_bar { ... }
...

$handlers{$ARGV[0]}->( $ARGV[1] );

If you would call this simplified pseudo-script (let's name it test.pl)
with the arguments

$ test.pl -bar baz

the routine do_bar() would get called with the argument "baz" to take
care of handling the commandline option "-bar".
The last line of the above code snipped would effectively execute as

$handlers{"-bar"}->("baz");

which does the same as if you had simply written

do_bar("baz");

The $handlers{"-bar"} part selects/returns the appropriate function
reference via the association declared in the hash table ("-bar" =>
\&do_bar). This function reference is then called with the argument "baz".

To slowly move over to the C version, we could rewrite the script to
use an intermediate linear array of function references. An array of
function pointers is something we can easily implement in regular C
as well. The hash would in this case only be needed to map from the
key-string to the numeric array-index:

%hash = (
    "-foo" => 0,
    "-bar" => 1,
);

@handlers = ( \&do_foo, \&do_bar );  # array of function refs/pointers

sub do_foo { ... }
sub do_bar { ... }

$handlers[ $hash{$ARGV[0]} ]->( $ARGV[1] );

So, if we could find an easy way to accomplish this string-to-number
mapping in C code, we could basically use the same technique in C.
Well, this is exactly what hash *functions* do.  The internal
implementation of a typical hash *table* data structure roughly does
the following: The key-string is converted to some (seemingly random)
number within a certain range which is then used as an index to an
array of so-called buckets or bins which hold the values that are to
be associated with the keys. The conversion of the key-string to the
random index number is computed via the hash function. The index number
is 'random' in the sense that there is usually no immediately evident
relation to the key-string. It is reproducible, though, i.e. the same
key always gives the same index number.

There are many different hashing functions which vary with respect to
computation speed and their ability to evenly 'scatter' arbitrary keys
across the available buckets. The latter is a desirable property to
minimize collisions. Collisions happen when different keys map to the
same index number. This is something that cannot be avoided, if the
bucket array has a finite size, and we want to allow arbitrary input
keys to be used. So, much of the details of the internal hash table
mechanics are taking care of this problem. A simple solution would be
to store several key-value associations in one bucket, if required,
and then search them linearly, if there *is* more than one entry per
bucket. Many more advanced techniques have been devised...

Why am I telling all this in excruciating detail? Why not simply use
one of the many hash table C/C++-libraries available?  Also, every
moderately sophisticated application typically comes with its own
incarnation of a hash table -- and we would certainly find a suitable,
ready-to-use canned solution. 

Sure, we could of course use a regular, general purpose hash table
for what we want to do. But this is not what I'm after here -- that
would be quite boring, again :)  Rather, a basic understanding of the
mechanics helps to really appreciate the ultimate theoretical elegance
of the solution offered by so-called "perfect hashing" functions, or,
more precisely, "minimal perfect hashing" functions.  What are those?

They are a special case that can be computed when the set of valid
input keys in *known* in advance -- which is the case with our problem
(otherwise we also wouldn't be able to write a switch() statement).
Perfect hashing functions are made to not cause collisions, so they are
much simpler to implement and handle. Each key nicely maps to exactly
one different index number. Minimal perfect hashing functions have the
additional property that the N valid keys map to the first 0..N-1
consecutive array positions, so they are as compact as possible.

And this is exactly what we need: we have N commandline options or
config keywords, and, as sketched above, we would like to map them to
a compact array of function pointers -- where each function is taking
care of handling one option or keyword.
The nice side effect of this approach (besides its inherent elegance)
is that it scales perfectly. Even with thousands of keys, there won't
ever be a performance problem, because there is no linear search
involved.

OK, enough theory. Now the practical part:

There are several tools/programs that help to compute such concrete
perfect hashing functions for use in your own programs. The procedure
is typically the following: you supply the program a list of valid
keys, and it spits out a piece of C code that implements the custom
hashing function. You just include this (typically tiny) code fragment
in your program, which then allows you to call a function that maps
the key strings to the array indices:  idx = phash(key);  Things
could hardly be simpler :)

A reasonably light-weight and easy-to-use package to play around with
is available here (public domain license):

http://burtleburtle.net/bob/hash/perfect.zip

with some more information (usage, theory, links, etc.) in the docs:

http://burtleburtle.net/bob/hash/perfect.html

Unfortunately, the source files are in weird DOS line endings format
(CRLF), so it needs some fiddling with dos2unix, tr, or similar tools to
get started. (You can of course also get a tidied-up tarball from me --
just drop me a note.)

The following is a step by step recipe on how to build and use the tool:

(1) download and unpack the source in some working directory:

$ unzip perfect.zip

(2) build the program (after having converted all files to unix LF
line endings):

$ make -f makeperf.txt

This creates an executable named 'perfect'. Probably no need to
configure anything for compiling.

(3) create your list of valid keys. This is a simple text file containing
one key per line (see example file: samperf.txt).

(4) let the program compute the minimal perfect hash function for
those keys:

$ perfect < keywords

This creates a C code and header file: phash.c and phash.h
(typically a few hundred bytes only)

(5) build the sanity check program:

$ make -f makeptst.txt

This not only allows to do a sanity check (i.e. that all keys really do
map to some useful index) -- it also shows a list of the keyword-to-index
mappings that you need later in your own program...

This should create the binary 'foo'. Run it like the 'perfect' program:

$ foo < keywords

or

$ foo < keywords | sort

for a list sorted by index number.

('foo' is the silly default name from the makefile -- rename it there)

(6) write your own code and include the phash()-function as needed.
A trivial example would look something like:

#ifndef STANDARD
#include "standard.h"
#endif
#ifndef RECYCLE
#include "recycle.h"
#endif
#ifndef PHASH
#include "phash.h"
#endif

int main(int argc, char *argv[])
{
    int i;
    char *s;

    for (i = 1; i < argc; i++) {
        s = argv[i];
        printf("%20s -> %d\n", s, phash(s, strlen(s)) );
    }
    exit(0);
}

This would display the "key -> index" associations for every keyword
you pass on the command line. The essential part is the calling of the
phash() library function -- I guess it's simple enough to get the idea.
[phash() unfortunately needs some weird strlen() second parameter and
returns a type of 'ub4' (equivalent to unsigned long), so you might
consider writing a small wrapper function to simplify calling, if you
use it in more than one place.]

Compile it as follows (assuming you named it test.c) -- or create
an appropriate Makefile (makeptst.txt can serve as an example):

gcc -c -O test.c
gcc -o test lookupa.o recycle.o phash.o test.o -lm

This should create a binary 'test'. (The phash.o and the other
object files should already have been compiled while building the
sanity check program.)

An example run of the main steps with the words taken from the above
paragraph should look something like -- the wordlist being:
[
BTW, lists like these can be created with a two-line perl script, also
making use of a hash table :)

#!/usr/bin/perl
while (<>) { for (split /\W+/) { $hash{$_}++; } }
print map "$_\n", sort keys %hash;

]

The
This
a
already
and
been
binary
building
check
compiled
create
files
have
o
object
other
phash
program
sanity
should
test
the
while

(these could be your command line options or whatever...)

Create the C code for the custom perfect hash function

$ perfect < wordlist

and build the 'foo' and 'test' program...
(needs to be done anew each time you change the keywords/hash-function)

The sanity check then gives:

$ foo < wordlist | sort
       0  object
       1  and
       2  have
       3  test
       4  the
       5  while
       6  create
       7  This
       8  a
       9  program
      10  already
      11  The
      12  binary
      13  other
      14  phash
      15  check
      16  files
      17  o
      18  been
      19  should
      20  compiled
      21  building
      22  sanity
Read in 23 keys

This means that if you call the phash() function with "already" as
argument, for example, you should get back '10'.

In some contexts it might make sense to declare constants (#DEFINEs or
enums) to make those numbers more human-readable -- though, for the
purpose at hand, it's hardly necessary. Once you've set up your array
of function pointers properly (i.e. in the right sequence -- and you
did give those handlers sensible names, did you?:), you don't really
need to care about the numbers any longer...

The sample program gives (be sure to prepend "./" or don't name it 'test' :)

$ ./test This program should have been compiled
                This -> 7
             program -> 9
              should -> 19
                have -> 2
                been -> 18
            compiled -> 20

All you need to do now is to implement the functions/handlers and set
up an array of pointers to those handlers. Then use the index returned
by phash() to select the function to be called. Just one line now, even
if you have hundreds of keywords. No more spaghetti of strcmp()s...

That's it. Simple, small, fast, scalable and elegant :)

Well, I have to admit that there's one tiny problem: if you enter
invalid keys (mistyped options, for example), then those will
typically also map into the same range of numbers :(
There's a fairly simple solution to this problem, though -- which is
"left as an excercise for the reader" :)

(Hint: declare a regular array of keyword strings (sorted correctly,
so that keyword[10] gives "already", etc.) and do a circular check
to see if the returned string matches the one put into phash()...)

Have fun hashing!

- Almut

(and, as usual, sorry for the length --  can't help it ;)