Hashing
Hashing is an important Data Structure which is designed to use a special function called the Hash function which is used to map a given value with a particular key for faster access of elements. The efficiency of mapping depends of the efficiency of the hash function used.
Hash function
对于每个标识符x,我们定义一个哈希函数: f(x)= x在ht []中的位置
T :: = x的不同可能值的总数
n :: = ht []中的标识符总数
n / T :: = 标识符密度
n /(sb) :: = 装载密度
A collision occurs when we hash two nonidentical identifiers into the same bucket, i.e.
An overflow occurs when we hash a new identifier into a full bucket.
Collision and overflow happen simultaneously if s = 1.
f ( x ) must be easy to compute and minimizes the number of collisions.
f ( x ) should be unbiased. That is, for any x and any i, we have that Probability( f ( x ) = i ) = 1 / b. Such kind of a hash function is called a uniform hash function.
- For numeric keys, divide the key by the number of available addresses, n, and take the remainder.
f ( x ) = x % TableSize ; /* if x is an integer */
TableSize = prime number, good for random integer keys
- For alphanumeric keys, divide the sum of ASCII codes in a key by the number of available addresses, n, and take the remainder.
f ( x ) = (
x[i]) % TableSize ; /* if x is a string */
f ( x ) = (x[0]+ x[1]*27 + x[2]*27^2) % TableSize ;
但是以上两种方法都会造成较大的空间浪费,可以采用位运算的想法,向左移五位相当于* 2^5,来得到hash值
f ( x ) = (
x[N - i - 1] * 32^i ) % TableSize ;
1 | Index Hash3( const char *x, int TableSize ) |
If x is too long (e.g. street address), the early characters will be left-shifted out of place. 这时候可以选取代表字符串,比如子串。
- Folding method divides the key into equal parts then adds the parts together.
Collision Resolution
- Closed addressing
Separate Chaining: keep a list of all keys that hash to the same value
1 | Position Find ( ElementType Key, HashTable H ) |
insertions are generally quicker than deletions in separate chaining method. 这是因为insertion是O(1),计算相应的hash value然后挂在第一个就可以了。
1 | L = H->TheLists[ Hash( Key, H->TableSize ) ]; |
Tip: Make the TableSize about as large as the number of keys expected (i.e. to make the loading density factor
- Open addressing
find another empty cell to solve collision (avoiding pointers)
1 | while ( collision at index ) { |
f(i) is collision resolving function. f(0) = 0.
Tip: Generally
Linear probing
f ( i ) = i ; /* a linear function */
Quadratic probing (failed attempts)^2
f ( i ) = i^2 ; /* a quadratic function */
【Theorem】If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
1 | Position Find ( ElementType Key, HashTable H ) |
条件中先判断.Info是否empty,如果是empty,则key没有定义。
1 | CurrentPos += 2 * ++CollisionNum -1; |
Double hashing
f ( i ) = i * hash2( x ); /* hash2( x ) is the 2nd hash function */
If double hashing is correctly implemented, simulations imply that the expected number of probes is almost the same as for a random collision resolution strategy.
Quadratic probing does not require the use of a second hash function and is thus likely to be simpler and faster in practical.
- Rehashing
Quadratic probing might fail when
Build another table that is about twice as big;
Scan down the entire original hash table for non-deleted elements;
Use a new function to hash those elements into the new table.
When to rehash?
As soon as the table is half full
When an insertion fails
When the table reaches a certain load factor
Applications
Hashing provides constant time search, insert and delete operations on average. This is why hashing is one of the most used data structure, example problems are, distinct elements, counting frequencies of items, finding duplicates, etc.
There are many other applications of hashing, including modern day cryptography hash functions. Some of these applications are listed below:
- Message Digest
- Password Verification
- Data Structures(Programming Languages)
- Compiler Operation
- Rabin-Karp Algortithm
- Linking File name and path together
references:
- ppt made by 陈越姥姥
- https://www.geeksforgeeks.org/applications-of-hashing/