Hash中的一些概率计算

　　Hash是把锋利的刀子，处理海量数据时经常用到，大家可能经常用hash，但hash的有些特点你是否想过、理解过。我们可以利用我们掌握的概率和期望的知识，来分析Hash中一些有趣的问题，比如：

平均每个桶上的项的个数
平均查找次数
平均冲突次数
平均空桶个数
使每个桶都至少有一个项的项个数的期望

　　本文hash的采用链地址法发处理冲突，即对hash值相同的不同对象添加到hash桶的链表上。

每个桶上的项的期望个数

　　将n个不同的项hash到大小为k的hash表中，平均每个桶会有多少个项？首先，对于任意一个项items(i)被hash到第1个桶的概率为1/k，那么将n个项都hash完后，第1个桶上的项的个数的期望为C(项的个数)=n/k，这里我们选取了第一个桶方便叙述，事实上对于任一个特定的桶，这个期望值都是适用的。这就是每个桶平均项的个数。

　　用程序模拟的过程如下：

 1     /***
 2      * 对N个字符串hash到大小为K的哈希表中，每个桶上的项的期望个数
 3      * 
 4      * @return
 5      */
 6     private double expectedItemNum() {
 7         // 桶大小为K
 8         int[] bucket = new int[K];
 9         // 生成测试字符串
10         List<String> strings = getStrings(N);
11         // hash映射
12         for (int i = 0; i < strings.size(); i++) {
13             int h = hash(strings.get(i), 37);
14             bucket[h]++;
15         }
16         // 计算每个桶的平均次数
17         long sum = 0;
18         for (int itemNum : bucket)
19             sum += itemNum;
20         return 1.0 * sum / K;
21     }
22 
23     /***
24      * 多次测试计算每个桶上的项的期望个数，
25      */
26     private static void expectedItemNumTest() {
27         MyHash myHash = new MyHash();
28         // 测试100次
29         int tryNum = 100;
30         double sum = 0;
31         for (int i = 0; i < tryNum; i++) {
32             double count = myHash.expectedItemNum();
33             sum += count;
34         }
35         // 取100次测试的平均值
36         double fact = sum / tryNum;
37         System.out.println("K=" + K + " N=" + N);
38         System.out.println("程序模拟的期望个数：" + fact);
39         double expected = N * 1.0 / K;
40         System.out.println("估计的期望个数 n/k：" + expected);
41     }

　　输出的结果如下，可以看到我们用公式计算的期望与实际是很接近的，这也说明我们的期望公式计算正确了，毕竟实践是检验真理的唯一标准。

K=1000 N=618
程序模拟的期望个数：0.6180000000000007
估计的期望个数 n/k：0.618

空桶的期望个数

　　将n个不同的项hash到大小为k的hash表中，平均会有多少个空桶？我们还是以第1个桶为例，任意一个项item(i)没有hash到第一个桶的概率为(1-1/k)，hash完n个项后，所有的项都没有hash到第一个桶的概率为(1-1/k)^n，这也是每个桶为空的概率。桶的个数为k，因此期望的空桶个数就是C(空桶的个数)=k(1-1/k)^n，这个公式不好计算，用程序跑还可能被归零了，转化一下就容易计算了：egin{equation} C(空桶的个数)=k(1-frac{1}{k})^n=k(1-frac{1}{k})^{-k(-frac{n}{k})}=ke^{(-frac{n}{k})}end{equation} 同样我们模拟测试一下：

 1     /***
 2      * 计算期望的空桶个数
 3      * 
 4      * @return
 5      */
 6     private int expectedEmputyBuckts() {
 7         // 桶大小为K
 8         int[] bucket = new int[K];
 9         // 生成测试字符串
10         List<String> strings = getStrings(N);
11         // hash映射
12         for (int i = 0; i < strings.size(); i++) {
13             int h = hash(strings.get(i), 37);
14             bucket[h]++;
15         }
16         // 记录空桶的个数
17         int count = 0;
18         for (int itemNum : bucket)
19             if (itemNum == 0)
20                 count++;
21         return count;
22     }
23 
24     /***
25      * 多次测试求空桶的期望个数
26      */
27     private static void expectedEmputyBucktsTest() {
28         MyHash myHash = new MyHash();
29         // 测试100次
30         int tryNum = 100;
31         long sum = 0;
32         for (int i = 0; i < tryNum; i++) {
33             int count = myHash.expectedEmputyBuckts();
34             sum += count;
35         }
36         // 取100次测试的平均值
37         double fact = sum / tryNum;
38         System.out.println("K=" + K + " N=" + N);
39         System.out.println("程序模拟的期望空桶个数：" + fact);
40         double expected = K * Math.exp(-1.0 * N / K);
41         System.out.println("估计的期望空桶个数ke^(-n/k)：" + expected);
42     }

View Code

　　我们这里的n个项是各不相同的，只要某个项hash到的桶已经被其他项hash过，那就认为是一次冲突，直接计算冲突次数不好计算，但我们知道C(冲突次数)=n-C(被占用的桶的个数)，而被占用的桶的个数C(被占用的桶的个数)=k-C(空桶的个数)，因此我们的得到：egin{equation} C(冲突次数)=n-(k-ke^{-n/k}) end{equation}　　程序模拟如下：

1 /*** 2 * 期望冲突次数 3 * 4 * @return 5 */ 6 private int expextedCollisions() { 7 // 桶大小为K 8 int[] bucket = new int[K]; 9 int count = 0; 10 // 生成测试字符串 11 List<String> strings = getStrings(N); 12 for (int i = 0; i < strings.size(); i++) { 13 // hash映射 14 int h = hash(strings.get(i), 37); 15 // 桶h没有被占用 16 if (bucket[h] == 0) 17 bucket[h] = 1; 18 // 桶h已经被占用，发生了冲突 19 else 20 count++; 21 } 22 return count; 23 } 24 25 private static void expextedCollisionsTest() { 26 MyHash myHash = new MyHash(); 27 // 测试100次 28 int tryNum = 100; 29 long sum = 0; 30 for (int i = 0; i < tryNum; i++) { 31 int count = myHash.expextedCollisions(); 32 sum += count; 33 } 34 // 取100次测试的平均值 35 double fact = sum / tryNum; 36 System.out.println("K=" + K + " N=" + N); 37 System.out.println("程序模拟的冲突数：" + fact); 38 double expected = N - (K - K * Math.exp(-1.0 * N / K)); 39 System.out.println("估计的期望冲突次数n-(k-ke^(-n/k))：" + expected); 40 41 }

1 /*** 2 * hash N个字符串的过程是否产生冲突 3 * 4 * @return 5 */ 6 private boolean containsCollison() { 7 // 桶大小为K 8 int[] bucket = new int[K]; 9 // 生成测试字符串 10 List<String> strings = getStrings(N); 11 for (int i = 0; i < strings.size(); i++) { 12 // hash映射 13 int h = hash(strings.get(i), 37); 14 // 桶h没有被占用 15 if (bucket[h] == 0) 16 bucket[h] = 1; 17 // 桶h已经被占用，发生了冲突，直接返回 18 else 19 return true; 20 } 21 return false; 22 } 23 24 /*** 25 * 重复调用多次containsCollison，计算不发生冲突的概率 26 */ 27 private static void probCollisionTest() { 28 MyHash myHash = new MyHash(); 29 // 测试100次 30 int tryNum = 100; 31 // 不冲突的次数 32 int count = 0; 33 for (int i = 0; i < tryNum; i++) { 34 if (!myHash.containsCollison()) 35 count++; 36 } 37 // 取100次测试的平均值 38 double fact = 1.0 * count / tryNum; 39 System.out.println("K=" + K + " N=" + N); 40 System.out.println("程序模拟的不冲突概率：" + fact); 41 double expected = Math.exp(-1.0 * N * (N - 1) / (2 * K)); 42 System.out.println("估计的期望不冲突概率e^(-n(n-1)/(2k))：" + expected); 43 System.out.println("程序模拟的冲突概率：" + (1 - fact)); 44 System.out.println("估计的期望冲突冲突概率1-e^(-n(n-1)/(2k))：" + (1 - expected)); 45 46 }

K=1000 N=50 程序模拟的不冲突概率：0.29 估计的期望不冲突概率e^(-n(n-1)/(2k))：0.29375770032353277 程序模拟的冲突概率：0.71 估计的期望冲突冲突概率1-e^(-n(n-1)/(2k))：0.7062422996764672

1 public V put(K key, V value) { 2 if(...) 3 return ...; 4 ... 5 modCount++; 6 addEntry(hash, key, value, i); 7 return null; 8 } 9 /** 10 * Adds a new entry with the specified key, value and hash code to 11 * the specified bucket. It is the responsibility of this 12 * method to resize the table if appropriate. 13 * 14 * Subclass overrides this to alter the behavior of put method. 15 */ 16 void addEntry(int hash, K key, V value, int bucketIndex) { 17 if ((size >= threshold) && (null != table[bucketIndex])) { 18 resize(2 * table.length); 19 hash = (null != key) ? hash(key) : 0; 20 bucketIndex = indexFor(hash, table.length); 21 } 22 23 createEntry(hash, key, value, bucketIndex); 24 }

Hash中的一些概率计算

相关推荐