Hashing

Hashing is an important Data Structure which is designed to use a special function called the Hash function which is used to map a given value with a particular key for faster access of elements. The efficiency of mapping depends of the efficiency of the hash function used.

Hash function

对于每个标识符x,我们定义一个哈希函数:  f(x)= x在ht []中的位置

T :: = x的不同可能值的总数
n :: = ht []中的标识符总数
n / T :: = 标识符密度
n /(sb) :: = 装载密度

A collision occurs when we hash two nonidentical identifiers into the same bucket, i.e. when .
An overflow occurs when we hash a new identifier into a full bucket.
Collision and overflow happen simultaneously if s = 1.

f ( x ) must be easy to compute and minimizes the number of collisions.
f ( x ) should be unbiased. That is, for any x and any i, we have that Probability( f ( x ) = i ) = 1 / b. Such kind of a hash function is called a uniform hash function.

  • For numeric keys, divide the key by the number of available addresses, n, and take the remainder.

f ( x ) = x % TableSize ; /* if x is an integer */

TableSize = prime number, good for random integer keys

  • For alphanumeric keys, divide the sum of ASCII codes in a key by the number of available addresses, n, and take the remainder.

f ( x ) = ( x[i]) % TableSize ; /* if x is a string */
f ( x ) = (x[0]+ x[1]*27 + x[2]*27^2) % TableSize ;

但是以上两种方法都会造成较大的空间浪费,可以采用位运算的想法,向左移五位相当于* 2^5,来得到hash值

f ( x ) = ( x[N - i - 1] * 32^i ) % TableSize ;

1
2
3
4
5
6
7
Index Hash3( const char *x, int TableSize ) 
{
unsigned int HashVal = 0;
/* 1*/ while( *x != '\0' )
/* 2*/ HashVal = ( HashVal << 5 ) + *x++;
/* 3*/ return HashVal % TableSize;
}

If x is too long (e.g. street address), the early characters will be left-shifted out of place. 这时候可以选取代表字符串,比如子串。

  • Folding method divides the key into equal parts then adds the parts together.

Collision Resolution

  • Closed addressing

Separate Chaining: keep a list of all keys that hash to the same value

1
2
3
4
5
6
7
8
9
10
11
12
Position  Find ( ElementType Key, HashTable H ) 
{
Position P;
List L;

L = H->TheLists[ Hash( Key, H->TableSize ) ];

P = L->Next;
while( P != NULL && P->Element != Key ) /* Probably need strcmp */
P = P->Next;
return P;
}

insertions are generally quicker than deletions in separate chaining method. 这是因为insertion是O(1),计算相应的hash value然后挂在第一个就可以了。

1
2
3
4
L = H->TheLists[ Hash( Key, H->TableSize ) ]; 
NewCell->Next = L->Next;
NewCell->Element = Key; /* Probably need strcpy! */
L->Next = NewCell;

Tip: Make the TableSize about as large as the number of keys expected (i.e. to make the loading density factor 1).

  • Open addressing

find another empty cell to solve collision (avoiding pointers)

1
2
3
4
5
while ( collision at index ) {
index = ( hash(key) + f(i) ) % TableSize;
if ( table is full ) break;
else i++;
}

f(i) is collision resolving function. f(0) = 0.

Tip: Generally < 0.5.

Linear probing

f ( i ) = i ; /* a linear function */

Quadratic probing (failed attempts)^2

f ( i ) = i^2 ; /* a quadratic function */

【Theorem】If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.

1
2
3
4
5
6
7
8
9
10
11
12
Position  Find ( ElementType Key, HashTable H ) 
{ Position CurrentPos;
int CollisionNum;
CollisionNum = 0;
CurrentPos = Hash( Key, H->TableSize );
while( H->TheCells[ CurrentPos ].Info != Empty &&
H->TheCells[ CurrentPos ].Element != Key ) {
CurrentPos += 2 * ++CollisionNum -1;
if ( CurrentPos >= H->TableSize ) CurrentPos -= H->TableSize;
}
return CurrentPos;
}

条件中先判断.Info是否empty,如果是empty,则key没有定义。

1
CurrentPos += 2 * ++CollisionNum -1; 


Double hashing

f ( i ) = i * hash2( x ); /* hash2( x ) is the 2nd hash function */

If double hashing is correctly implemented, simulations imply that the expected number of probes is almost the same as for a random collision resolution strategy.
Quadratic probing does not require the use of a second hash function and is thus likely to be simpler and faster in practical.

  • Rehashing

Quadratic probing might fail when .

Build another table that is about twice as big;
Scan down the entire original hash table for non-deleted elements;
Use a new function to hash those elements into the new table.

When to rehash?
As soon as the table is half full
When an insertion fails
When the table reaches a certain load factor

Applications

Hashing provides constant time search, insert and delete operations on average. This is why hashing is one of the most used data structure, example problems are, distinct elements, counting frequencies of items, finding duplicates, etc.

There are many other applications of hashing, including modern day cryptography hash functions. Some of these applications are listed below:

  • Message Digest
  • Password Verification
  • Data Structures(Programming Languages)
  • Compiler Operation
  • Rabin-Karp Algortithm
  • Linking File name and path together

references:

  1. ppt made by 陈越姥姥
  2. https://www.geeksforgeeks.org/applications-of-hashing/