CLRS11 散列表

2017-05-18 | 阅读：次

散列表

散列表是实现字典操作的一种有效数据结构
散列表是普通数组概念的推广
基本字典操作平均只需要O(1)时间
完全散列能在O(1)的最坏情况时间内完成关键字查找

11.1 直接寻址表(direct-address table)

DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]

DIRECT-ADDRESS-INSERT(T, x)
    T[x.key] = x

DIRECT-ADDRESS-DELETE(T, x)
    T[x.key] = NIL

时间复杂度均是O(1)

练习11.1-1 Suppose that a dynamic set S is represented by a direct-address table T of length m. Describe a procedure that finds the maximum element of S. What is the worst-case performance of your procedure?

遍历整个表，最坏运行时间为O(m)

练习11.1-2 A bit vector is simply an array of bits (0’s and 1’s). A bit vector of length m takes much less space than an array of m pointers. Describe how to use a bit vector to Represent a Dynamic Set of Distinct Elements with no Satellite Data. Dictionary Operations Should Run in O(1) Time.

集合中有这个元素，相应的位设为1， 没有就设为0

练习11.1-3 Suggest how to implement a direct-address table in which the keys of stored elements do not need to be distinct and the elements can have satellite data. All three dictionary operations (INSERT, DELETE, and SEARCH) should run in O(1) time. (Don’t forget that DELETE takes as an argument a pointer to an object to be deleted, not a key.)

每个关键字映射到一个List

练习11.1-4 We wish to implement a dictionary by using direct addressing on a huge array. At the start, the array entries may contain garbage, and initializing the entire array is impractical because of its size. Describe a scheme for implementing a direct-address dictionary on a huge array. Each stored object should use O(1) space; the operations SEARCH, INSERT, and DELETE should take O(1) time each; and the initialization of the data structure should take O(1) time. (Hint: Use an additional stack, whose size is the number of keys actually stored in the dictionary, to help determine whether a given entry in the huge array is valid or not.)

向数组插入时，将数组的地址入栈，同时大数组元素初始化为栈顶指针
查找时，先在大数组查找该值，然后判断其是否是合法栈的下标
删除时，栈中会产生空洞，可以将栈顶元素和相对应大数组存储的栈的下标填补到空洞里

11.2 散列表

散列表将存储需求降至theta(k),同时查找元素的优势得到保持，仍为O(1),但这个界是针对平均情况而言的，而对直接寻址它是针对最坏情况。
在散列情况下，具有关键字k的元素被存放在槽h(k)中。散列函数缩小了数组的大小。使h尽可能的随机，从而避免冲突或使冲突的次数最小化。实际上散列的原意就是随即混杂和拼凑。
链接法(chaining)和开放寻址法解决(open addressing)冲突
通过链接法解决冲突：把散列在同一槽中的所有元素都放在一个链表中。

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(x, key)]

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(x.key)]
如果链表是双向连接，则删除可在O(1)时间完成。

散列法的平均性能依赖于所选取的散列函数h,将所有关键字集合分布在m个槽位上的均匀程度。
练习11.2-1 Suppose we use a hash function h to hash n distinct keys into an array T of length m. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of {k, l} : k ≠ l and h(k) = h(l)}?

不知道问的什么意思，总是应该是n-m

练习11.2-2 Demonstrate the insertion of the keys 5, 28, 19, 15, 20, 33, 12, 17, 10 into a hash table with collisions resolved by chaining. Let the table have 9 slots, and let the hash function be h(k) = k mod 9.

0 /
1 -> 28 -> 19 -> 10
2 -> 2
3 -> 12
4 /
5 -> 5
6 -> 15 -> 33
7 /
8 -> 17

练习11.2-3 Professor Marley hypothesizes that substantial performance gains can be obtained if we modify the chaining scheme so that each list is kept in sorted order. How does the professor’s modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?

降低了插入速度，提升了查找速度，对删除没有影响。

练习11.2-4 Suggest how storage for elements can be allocated and deallocated within the hash table itself by linking all unused slots into a free list. Assume that one slot can store a flag and either one element plus a pointer or two pointers. All dictionary and free-list operations should run in O(1) expected time. Does the free list need to be doubly linked, or does a singly linked free list suffice?

需要双向链表，
标识表示是否已分配
双向为了方便删除操作

练习11.2-5 Show that if

> nm, there is a subset of U of size n consisting of keys that all hash to the same slot, so that the worst-case searching time for hashing with chaining is Θ(n).

根据鸽子洞，至少有一个slot是n个元素的。

练习11.2-6 假设将 n 个关键字存储到一个大小为 m 且通过链接法解决冲突的散列表中，同时已知每条链的长度，包括其中最长链的长度L，请描述从散列表的所有关键字中均匀随机地选择某一元素并在 O( L(1 + 1 /α) )的期望时间内返回该关键字的过程。

L >= a
故 L(1+1/a) >= 1 + L
成功查找的期望时间为O(1+a) = O(L(1+1/a))

11.3 散列函数

好的散列函数应近似地满足简单均匀散列假设
一种好的方法导出的散列值，在某种程度上应独立于数据可能存在的任何模式
散列函数的某些应用可能会要求比简单均匀散列更强的性质，例如可能希望某些很近似的关键字具有截然不同的散列值。

11.3.1 除法散列法

h(k) = k mod m
m不应为2的幂，一个不太接近2的整数幂的素数常常是m的一个较好的选择。

11.3.2 乘法散列法

构造散列函数的乘法散列法包含两个步骤。第一步用关键字乘上常数A(0<A<1),并提取kA的小数部分。第二步用m乘以这个值，再向下取整。
h(k) = floor(m(kA mod 1))
乘法散列法的一个优点是对m的选择不是特别关键，一般选择它为2的某个幂次，这样方便在大多数计算机上实现散列函数。

11.3.3 全域散列法

随机选取散列函数，使之独立于要存储的关键字。不管对手选择了怎么样的关键字，其平均性能都很好。
全域散列法在执行开始时，就从一组精心设计的函数中，随机地选择一个作为散列函数。
不是很懂这些证明。
练习11.3-1 Suppose we wish to search a linked list of length n, where each element contains a key k along with a hash value h(k). Each key is a long character string. How might we take advantage of the hash values when searching the list for an element with a given key?

对每个元素计算出hash值，查找时先比较hash值，相等再比较字符串。

练习11.3-2 Suppose that a string of r characters is hashed into m slots by treating it as a radix-128 number and then using the division method. The number m is easily represented as a 32-bit computer word, but the string of r characters, treated as a radix-128 number, takes many words. How can we apply the division method to compute the hash value of the character string without using more than a constant number of words of storage outside the string itself?

One easy approach would adding the values of each character to get a sum and then take a modulo 128. Another way would be using n-degree equation with each character value acting as a co-efficient for the nth term. Example: S[0] * x^n + S[1] * x^(n-1)+ …. + S[n-1], for better result substitute x = 33 or 37 or 39. This is the famous Horner’s method for finding a hash value.

练习11.3-3 Consider a version of the division method in which h(k) = k mod m, where m = 2p - 1 and k is a character string interpreted in radix 2p. Show that if string x can be derived from string y by permuting its characters, then x and y hash to the same value. Give an example of an application in which this property would be undesirable in a hash function.

数论知识。无论怎么交换字符串的顺序，radix的影响都会消失。

练习11.3-4 Consider a hash table of size m = 1000 and a corresponding hash function h(k) = floor(m(k A mod 1)) for A = (5**.5-1)/2 . Compute the locations to which the keys 61, 62, 63, 64, and 65 are mapped.

练习11.3-5

这道题以后证

练习11.3-6

这道题以后证

11.4 开放寻址法

为了使用开放寻址法插入一个元素，需要连续地检查散列表，或称为探查(probe)，直到找到一个空槽来放置待插入的关键字为止。为了确定要探查哪些槽，将探查号作为散列函数的第二个参数。对每个关键字，使用开放寻址法的探查序列：{h(k,0),h(k,1),……h(k,m-1)}是<0,1,…,m-1>的一个排列。

HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL
            T[j] = k
            return j
        else i = i + 1
    until i == m
    error "hash table overflow"

HASH-SEARCH(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == k
            return j
        i += 1
    until T[j] == NIL or i == m
    return nil

均匀散列(uniform hashing)假设。每个关键字的探查序列等可能为<0, 1, …, m-1>的m!种排列中的任一种。真正的均匀散列是难以实现的，实际应用中，常常采用它的一些近似方法。
线性探查：给定一个普通的散列函数h’,线性探查采用的散列函数为h(k, i) = (h’(k) + i) mod m, 这种方法比较容易出现primary clustering
二次探查：采用h(k, i) = (h’(k) + c1i + c2i) mod m, 可导致一种轻度群集，secondary clustering.
双重散列：这是用于开放寻址法的最好方法之一。h(k, i) = (h1(k) + i*h2(k)) mod m.
后面的数学证明说明了性能很好。
练习11.4-1 Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length m = 11 using open addressing with the primary hash function h’(k) = k mod m. Illustrate the result of inserting these keys using linear probing, using quadratic probing with c1 = 1 and c2 = 3, and using double hashing with h2(k) = 1 + (k mod (m - 1)).

index   linear probing  quadratic probing   double hashing
0           22              22                      22
1           88      
2                           88                      59
3                           17                      17
4           4               4                       4
5           15                                      15
6           28              28                      28
7           17              59                      88
8           59              15  
9           31              31                      31
10          10              10                      10

练习11.4-2 Write pseudocode for HASH-DELETE as outlined in the text, and modify HASH-INSERT to handle the special value DELETED.

HASH-DELETE(T, k)
    T[HASH-SEARCH(T, k)] = DELETED

HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL or T[j] == DELETED
            T[j] = k
            return j
        else i = i + 1
    until i == m
    error "hash table overflow"

练习11.4-3 Consider an open-address hash table with uniform hashing. Give upper bounds on the expected number of probes in an unsuccessful search and on the expected number of probes in a successful search when the load factor is 3/4 and when it is 7/8.

根据定理11.6和11.7可以得出答案

练习11.4-4 Suppose that we use double hashing to resolve collisions; that is, we use the hash function h(k, i) = (h1(k)+ih2(k)) mod m. Show that if m and h2(k) have greatest common divisor d ≥ 1 for some key k, then an unsuccessful search for key k examines (1/d)th of the hash table before returning to slot h1(k). Thus, when d = 1, so that m and h2(k) are relatively prime, the search may examine the entire hash table. (Hint: See Chapter 31.)

要用到数论知识

练习11.4-5 Consider an open-address hash table with a load factor α. Find the nonzero value α for which the expected number of probes in an unsuccessful search equals twice the expected number of probes in a successful search. Use the upper bounds given by Theorems 11.6 and 11.8 for these expected numbers of probes.

1/(1-α) = ln(1/(1-α)) * 2/α

解得α ＝ 0.717

11.5 完全散列

当集合是静态(static)时，散列技术能提供出色的最坏情况性能。静态即一旦存入表中，关键字集合不再变化。
这种散列方法采用二次散列表(secondary hash table)，最坏情况下O(1)完成查找，期望存储为O(n

Sasuke

Jing