[prog] [off-topic]Hash key

Jacinta Richardson jarich at perltraining.com.au
Tue Jun 10 12:15:21 EST 2003


On Mon, 9 Jun 2003, Sachin wrote:

> This code is the one which converts a key(a character string) into a
> 31 bit hash key.My question is: Can anyone shed light at what this code
> does?.  What is the principle involved?  Why are these values chosen?

G'day Sachin,

I'm not sure how much you know about hashing, so I'll try to start
there and work over.

Hashes are also called "lookup tables".  The idea is that when the order
of information is not important then hash tables allow us to store large
amounts of data and still access any given element in a small amount of
time.  

If the information is stored merely in a list then to find a given
element in that list takes, on average, the length of the list divided
by 2 element comparisons.   (That is, on average we have to look at half
the elements in the list to  find the one we're looking for).  If,
however, we store our information in a big enough hash, then we merely
have to calculate the hash key and the information should be right
there.

Hashes need two values per element.  A "key" and it's "value".  Keys
must be unique but can be made up of several bits of information.  The
values do not need to be unique.  The "key" is the information you're
going to use to pull the value out.  For example, if you're hash was
storing a concordance (information about how many times each word in a
text appears and on which lines) then your key would be the word, and
the value would be your list of lines, and any other relevant data.

There are many ways to calculate a hash position (the position that the
element of information should be stored in).  You may, for example, just
add all the ascii values of the letters of the key together (and take
that value modulo the size of the hash).  Or you may use a SHA1
encryption of the key.  Whatever you do, it has to be repeatable and
only use the information stored in the key.  If you use information from
the value in calculating your hash position, then you won't be able to
find your hash value in your hash without knowing it, and then there'd
be no point looking it up.

Hashes should have a size (the length of the list that represents them)
that is a prime number, but which is also ideally greater than the
expected size of the input data.  This is to lessen the number of
elements that hash to the same position.  When 2 or more elements hash
to the same position it's called a collision.  How collisions are
handled depends on the programmer.  In some cases collisions are added
in a linked list hanging off that element position, in other cases, the
value is rehashed to another value using a different algorithm, or
perhaps another hash table hanging off the element position in the
primary hash.

Assuming you understood most of the above (please ask me questions if
you didn't)  I'll try to explain the following code in this light:

> int hash (key)
> datum key;
> {
> 	unsigned int value; 	/* Used to compute the hash value.  */
>	int   index;  		/* Used to cycle through random values. */
> 
>   		/* Set the initial value from key. */
>   	value = 0x238F13AF * key.dsize;
> 		/* 0x238F13AF: in Decimal->596579247 and in    
>		   Binary -> 100011100011110001001110101111 */

This above line, generates an initial starting value for the hash
position.  Presumably key.dsize provides some size information about the
key, for example the number of characters in the keyword.

>   	for (index = 0; index < key.dsize; index++)
>     		value = (value + (key.dptr[index] << (index*5 % 24))) 
>			& 0x7FFFFFFF;

Okay.  This needs to be broken up somewhat.

This iterates over the size of the key used above.  Let's look at what
each part gives us:

Char #		(index*5 % 24)		(key.dptr[index] << (index*5 % 24))
0		0*5 % 24 = 0		 key.dptr[index] << 0
1		1*5 % 24 = 0             key.dptr[index] << 0
2		2*5 % 24 = 0             key.dptr[index] << 0
3		3*5 % 24 = 0             key.dptr[index] << 0
4		4*5 % 24 = 0             key.dptr[index] << 0
5		5*5 % 24 = 1             key.dptr[index] << 1
6		6*5 % 24 = 1             key.dptr[index] << 1
..		..
9		9*5 % 24 = 1             key.dptr[index] << 1
10		10*5 % 14 = 2            key.dptr[index] << 2
		(how much to		(this value, shifted by some 
		 shift by)		 amount)

What this does for us, is much the same as adding the ascii values in a
single word key would.  Each part of the key is used to generate a
hash position.  In order to reduce the likelihood of generating a value
that is easily generated from a different key (both "cat" and "act"
would generate the same hash position if you just added their letters
together) we allow each "bit" of the hash position to be generated by 5
bits of the key material.

Finally we bitwise and the hash position with 0x7FFFFFFF to ensure that
our hash position isn't too big.

>   	value = (1103515243 * value + 12345) & 0x7FFFFFFF;

I'm not 100% certain why these two values are used.  0x7FFFFFFF is
obviously used to keep our position's size down.  The purpose of this
line, however is to reduce the likelihood of collisions, once again, and
perhaps it helps with ensuring an even population of the hash.

It is possible that these numbers were picked after a lot of profilling
as being numbers that "just worked".  It could, alternately, be a bizare
method of avoiding using another modulus.

>   		/* Return the value. */
>   	return((int) value);
> }

I figure you can work this last line out.

> Ps:This is a part of the code from GDBM library.

I hope my explanations have helped, and I'm very happy to answer further
questions.  I'm sure there are better answers to your question out there
somewhere, as most of this is what I remember from a second year subject
in my CS degree rather than quoting from a book.  Things may be wrong.

All the best,

	Jacinta

--
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson	    |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +613 9354 6001 	    |  
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
(il),-''  (li),'  ((!.-'              |   www.perltraining.com.au   |



More information about the Programming mailing list