[techtalk] switch function in C (or how to read commandline args?)

Thu Jul 5 08:23:49 EST 2001

On Thu, Jul 05, 2001 at 08:02:25AM +1200, Penguina wrote:
> 
> it might make sense to separate out the issues of
> 
> 	command-line interface packages
> 
> 	hash tables
> 
> 	lexical analysers
> 
> 	parsers
> 
> 	tokens: what they are & the variety of things you can use them for
> 
> 	how to initialize function pointers, then invoke
> 	the functions using their pointers in various languages
> 
> 	our friend the b-tree
> 
> While all these things are related (they come together spectacularly
> when writing a new compiler, for example), it might be useful for
> non CS grads to familiarise themselves with the above concepts
> prior to jumping into a project which uses one or the other.

Sure :)
It wasn't my primary intention to give an introduction to all these
(very interesting) topics in one single post -- especially without
having been asked for it.  I think my rant on the tiny sub-aspect of
"perfect hashing" was already long enough -- and you'd better not get
me started on the other topics ;)
(not that I'd call myself an expert in these issues -- far from it!)

Also, I've had the impression that Conor didn't necessarily want to
become one of the key developers of the upcoming gcc-4.0 (or did you?;),
but was merely looking for a not-too-ugly solution for the switch()/
case problem.
Although it might appear a bit crazy at first, on second thought, the
solution to apply the perfect hashing technique, might not at all be
that crazy. Rather, it's reasonably easy to use, and can be taken
advantage of *as a recipe*, without really having to understand all the
nitty-gritty details behind it...

> 
> > On Mon, Jul 02, 2001 at 01:33:50AM +0100, Conor Daly wrote:
> > > Thanks James and Jeff for the thoughts.  It kinda looks like an if else-if
> > > chain is my only option.
> 
> On Tue, 3 Jul 2001, Almut Behrens wrote:
> > In perl you could do something like:
> >
> > %handlers = (
> >     "-foo" => \&do_foo,
> >     "-bar" => \&do_bar,
> >     ...
> > );
> >
> > sub do_foo { ... }
> > sub do_bar { ... }
> > ...
> >
> > $handlers{$ARGV[0]}->( $ARGV[1] );
> >
> > $ test.pl -bar baz
> >
> > the routine do_bar() would get called with the argument "baz" to take
> > care of handling the commandline option "-bar".
> > The last line of the above code snipped would effectively execute as
> >
> > $handlers{"-bar"}->("baz");
> >
> > which does the same as if you had simply written
> >
> > do_bar("baz");
> >
> > The $handlers{"-bar"} part selects/returns the appropriate function
> > reference via the association declared in the hash table ("-bar" =>
> > \&do_bar). This function reference is then called with the argument "baz".
> 
> in python, you can use function pointers as dictionary objects
> and let the hashing happen behind the scenes.

well, I think if we substitute a bit of terminology, that statement
equally applies to perl -- it would then read:

  "in perl, you can use function references as hash-element values
  and let the hashing happen behind the scenes."

The main difference in terminology is that the hash-table data type
is called "hash" (or "associative array") in perl, and "dictionary"
in python. Also, in perl-speak "pointers" are usually called
"references". That's about the only relevant difference here.
Functionally, hashes are hashes in both languages, and can be used
to achieve the same effects.
(To be correct, there is a small difference between perl and python
here: hash-keys in perl are always strings, while in python they can
be any immutable object. However, for the purpose at hand this is
irrelevant, because we *are* using strings)

To demonstrate that the difference between python and perl is
mostly a syntactical one here, I'll supply the functionally
equivalent perl code below.
First, I'd like to make a few remarks on the script, though.

> ---------------snip
> def joe():
>   return "Joe Jackson"
> 
> def bill():
>   return "Billy the Kid"
> 
> main ():
>   dict.append ({'name': "joe",  'tok':3, 'fun': joe()  })
>   dict.append ({'name': "bill", 'tok':4, 'fun': bill() })
> 
>   for entry in dict:
> #				test arg by string or token here
>         print (entry['fun'])
> 
> main()
> ---------------snip

I'm really not a python guru, but I think there is one problem with the
script: the parentheses following the function names 'joe' and 'bill'
should not be where they are (within the declaration of the anonymous
dictionary), but rather at the end of "entry['fun']()".  Writing

  'fun': joe()

will call the function 'joe' right there, so the net effect of this
statement is equivalent to

  'fun': "Joe Jackson"

i.e. the *return value* of the function gets associated to the
dictionary key 'fun' --  not the function pointer to joe itself.
To associate the pointer to the function object (which I guess was
the original intention of the example), you'd have to write

  'fun': joe

To then call this function via its pointer from within the print
statement, you'd have to put the parentheses there:

   print (entry['fun']() )
                      ^^
In this particular example, the difference is not immediately evident
when you execute it, as it is. However, if you modify the 'bill()'
function slightly like this

def bill():
  print "this should appear immediately before 'Billy...'"
  return "Billy the Kid"

then the difference becomes visible. When we call 'print( joe() )'
first and then 'print( bill() )' (as happens when iterating the list
of dicts above), we would expect to see the print statements appear
in the following sequence:

Joe Jackson
this should appear immediately before 'Billy...'
Billy the Kid

Yet, when running the above script we see:

this should appear immediately before 'Billy...'
Joe Jackson
Billy the Kid

Why is that?  If you think about what gets executed in which order
depending on where the functions are being called, the difference
should become clear...

So, the script should probably be

-----------
def joe():
  return "Joe Jackson"

def bill():
  return "Billy the Kid"

def main ():
  dict = []
  dict.append ({'name': "joe",  'tok':3, 'fun': joe  })
  dict.append ({'name': "bill", 'tok':4, 'fun': bill })

  for entry in dict:
#				test arg by string or token here
        print (entry['fun']() )

main()
-----------

The functionally equivalent perl version would be:

-----------
sub joe {
  return "Joe Jackson";
}
sub bill {
  return "Billy the Kid";
}
sub main {
  my @dict;
  push @dict, {name => "joe",  tok => 3, fun => \&joe  };
  push @dict, {name => "bill", tok => 4, fun => \&bill };

  for $entry (@dict) {
#				test arg by string or token here
        print $entry->{fun}->(), "\n";
  }
}
main;
-----------

I deliberately tried to stick as closely as possible to the python
version -- and, like this, there isn't much difference except in
syntactic details. Personally, I'd prefer to write it in somewhat more
compact form, but this is only a side issue...

-----------
sub joe  { "Joe Jackson"   }
sub bill { "Billy the Kid" }

my @list = (
    { name => "joe",  tok => 3, fun => \&joe  },
    { name => "bill", tok => 4, fun => \&bill },
);
for $entry (@list) {
    print $entry->{fun}->(), "\n";
}
-----------

The more important issue is that these scripts (both python and perl),
IMHO, do not achieve to have a string 'joe' (e.g. commandline argument)
call the function 'joe()' -- without some additional string or token
comparisons in a one-against-all style, which is what we set out to
avoid in the first place, because it doesn't scale too well
performance-wise.
Let's say we have 100 names, then we'd have to *linearly* go through
them, searching for 'bill' for example (via 'name'-key dictionary
lookups), before we can call the function that's associated to the
'fun'-key of the respective dictionary in which we found 'bill'.

The script merely illustrates how to use function references in general.

Actually, what we really want is much simpler (at least the idea I was
trying to point out, originally). We would only need one hash that maps
the strings (keys) to functions:

-----------
#!/usr/bin/perl

sub joe  { "Joe Jackson\n"   }
sub bill { "Billy the Kid\n" }

%hash = (
    joe  => \&joe,
    bill => \&bill,
);
for $name (@ARGV) {          # iterate over commandline args
    print $hash{$name}->();
}
-----------

The corresponding python version would be:

-----------
#!/usr/bin/python

from sys import argv

def joe():
    return "Joe Jackson"

def bill():
    return "Billy the Kid"

dict = {
    "joe"  : joe,
    "bill" : bill,
}
for name in argv[1:]:        # iterate over commandline args
    print ( dict[name]() )

-----------

Both scripts should execute the respective functions when given the
appropriate commandline arguments. So when called as

$ ./test.pl bill joe
$ ./test.py bill joe

they should print:

Billy the Kid
Joe Jackson

Using anonymous functions, we could also write it like this:

-----------
#!/usr/bin/perl

%hash = (
    joe  => sub { "Joe Jackson\n"   },
    bill => sub { "Billy the Kid\n" },
);
for $name (@ARGV) {          # iterate over commandline args
    print $hash{$name}->();
}

or even so (with exactly the same internal semantics):

-----------
#!/usr/bin/perl

for $name (@ARGV) {          # iterate over commandline args
    print {
        joe  => sub { "Joe Jackson\n"   },
        bill => sub { "Billy the Kid\n" },
    }->{$name}->();
}
-----------

That's pretty close to a switch() statement :) -- it only does work
with arbitrary key-strings now. 
Within the "sub {...}"s could of course appear any code that would
need to be executed, when the respective keyword is being hit,
not just literal return strings...
(OTOH, we should keep in mind that we're using python/perl here... no
such thing in C without some hash table library.)

And in python, the above anonymous function version would look like:

-----------
#!/usr/bin/python

from sys import argv

dict = {
    "joe"  : lambda: "Joe Jackson",
    "bill" : lambda: "Billy the Kid",
}
for name in argv[1:]:        # iterate over commandline args
    print ( dict[name]() )

-----------

Not too much different from the perl version either...

BTW, while we're at it, I can't resist to mention another crazy
feature of perl ;) so-called 'symbolic references' which make it
unnecessary in our specific case to declare any hash at all -- perl
already has that functionality built-in.  So, if you have a variable
containing the name of a function as a simple string, you can directly
apply the "->()" to call the function of the same name (if it exists):

  $name = 'joe';
  $name->();      # same as joe();
  "joe"->();      # works with string literals too, btw

With this technique, our example would reduce to:

-----------
#!/usr/bin/perl

sub joe  { "Joe Jackson\n"   }
sub bill { "Billy the Kid\n" }

for $name (@ARGV) {          # iterate over commandline args
    print $name->();         # call function via symbolic reference
}
-----------

I'm not exactly sure about whether python allows something similar.
I'd suspect that things like these would violate its philosophy :)

> 
> > So, if we could find an easy way to accomplish this string-to-number
> > mapping in C code, we could basically use the same technique in C.
> 
> it's called strtok,

hmm, do you mean the standard C library function of the same name?
I always thought that function does more or less what you'd use a
'split' for in a scripting language, e.g. in python

s = "these are my commandline arguments"
args = string.split(s)

(strtok, as all string handling in C, is only considerably more unwieldy ;)

At least I wouldn't know of a way to make strtok() directly map strings
to *number* tokens -- although its name suggests that feature somehow.
I think you still end up with strings...
But maybe I've read the wrong manpage ;)

Also, we don't really need a split for commandline arguments, because
that's already been done for us, readily available in **argv.

> but for actually parsing text input, you're
> better off with lex and yacc (or their gnu counterparts flex and
> bison).
> 
> These are what the tools most linux compilers are written with
> and give you a lot more flexibility in the structure of the input.

that's absolutely right. However, for what Coner wanted to do, if I
understood him correctly (i.e. handling commandline arguments) lex & yacc
would be a bit of an overkill, IMHO...  It might, OTOH, be an interesting
option for complex config files -- though one shouldn't forget to mention
that both tools typically have a rather steep learning curve :)

It's also interesting to note that 'flex' utilizes minimal perfect
hashing techniques internally, based on the respective GNU
implementation of it (gperf). This is somewhat faster but less
universal and less compact than Bob Jenkins's algorithm/implementation,
which I suggested to play around with.

The nice thing about perfect hashing is in particular is simplicity,
once the hash function has been computed. The C code generated for the
example wordlist from my last post is as simple as shown below. That's
almost all you would need to add to your own program. Trivial, eh?
The code for a general purpose, collision handling hash table lib - as
those used in perl or python - is longer by several orders of magnitude...

/* table for the mapping for the perfect hash */                                                
#ifndef STANDARD                                                                                
#include "standard.h"                                                                           
#endif /* STANDARD */                                                                           
#ifndef PHASH                                                                                   
#include "phash.h"                                                                              
#endif /* PHASH */                                                                              
#ifndef LOOKUPA                                                                                 
#include "lookupa.h"                                                                            
#endif /* LOOKUPA */                                                                            

/* small adjustments to _a_ to make values distinct */                                          
ub1 tab[] = {                                                                                   
0,13,13,0,0,0,15,0,13,0,0,24,0,0,3,31,                                                          
0,0,0,16,0,0,0,0,0,21,0,24,7,0,20,23,                                                           
};                                                                                              

/* The hash function */                                                                         
ub4 phash(key, len)                                                                             
char *key;                                                                                      
int   len;                                                                                      
{                                                                                               
  ub4 rsl, val = lookup(key, len, 0xb54cda56);                                                  
  rsl = ((val>>28)^tab[val&0x1f]);                                                              
  return rsl;                                                                                   
}                                                                                               

- Almut