David MacKay：用信息论解释 '快速排序''堆排序' 性质与差异

David MacKay：用信息论解释 '快速排序'、'堆排序' 本质与差异

这篇文章是David MacKay利用信息论，来对快排、堆排的本质差异导致的性能差异进行的比较。

信息论是非常强大的，它并不只是一个用来分析理论最优决策的工具。

从信息论的角度来分析算法效率是一件很有趣的事，它给我们分析排序算法带来了一种新的思路。

运用了信息论的概念，我们很容易理解为什么快排的速度那么快，以及它的缺陷在哪里。

由于个人能力不足，对于本文的理解可能还是有点偏差。

而且因为翻译的困难，这篇译文有很多地方并没有翻译出来，还是使用了原文的句子。

所以建议大家还是阅读原文Heapsort, Quicksort, and Entropy

但是如果有朋友看了下面这篇译文，并且有更好的翻译方法，欢迎各位留言，不胜感激！！！

----------------------译文如下------------------------------

堆排，快排，与信息熵

有很多的网络文章对于堆排与快排进行了比较分析。

它们中大部分的观点都是，“这两种算法的平均时间效率都渐进于NlogN，但在实际测试中，一个好的快排的实现效率通常能够打败堆排。”

有些人进一步分析，并且给出了一些定量的说明：”堆排的平均比较次数是快排的两倍，但是堆排避免了在低概率下表现出的的灾难性的效率退化的情况。”

但是，很少有人会问这样的问题“为什么堆排使用了两倍于快排的比较次数？”人们费了很大的努力，试图创造一种具有两种算法优先的排序算法，比如：’introspective sort'---这个算法提供了快排的递归实现，并且在递归深度很深的时候，选择堆排来实现。

Paul Hsieh也深入地比较了快排与堆排。他说：“我怀疑堆排的实际效率远远好过人们印象中的认识，并且有一些实际结果证明了这一点。”在他的测试中，最好的编译器（对于堆排或者快排来说都一样）产生的一个结果就是：在总的CPU执行时间上，堆排比快排要快20%。

而实际上，在排序3000个对象的时候，堆排平均使用了61000次的比较，而快排只用了22000次的比较。为什么比较次数多的堆排，反而CPU执行的时间会少呢？关于这一点，可以看Paul Hsieh的这篇文章：Sorting revisited.

而我想要讨论的问题是：为什么堆排的比较次数会比快排的比较次数多呢？Paul Hsieh说过：“让我倍受打击的是，我无法看出为什么堆排会比快排慢。而且我也还没有听到或者读过一个权威的解释。”

在预期的平均信息量（expected information content）的基础上，我们可以得到一个简单的解释。为了让大家更容易读懂，我们先介绍一些其他的预备知识。

首先来看看经典的称重问题：

给定12个外表看起来一样的球，其中11个球的重量相同，还有1个球可能更重或者更轻。你需要做的事情是，使用一个没有刻度的天平，每次在天平的两端放相同数量的球，你的任务是设计一种策略，使得能够找出这个特殊的球，并且使得使用天平的次数最少。

David MacKay：用信息论解释 '快速排序''堆排序' 性质与差异

很多人是通过反复尝试，不断摸索的方式来找到解决方案的（本例可以使用3次比较得到特殊的球）。这种试错的方法显然太繁琐复杂，这里有一个更好的方法。

想要减少尝试的次数，我们需要增加平均一次尝试所能够获得的信息。

Shannon已经向我们展示了一种好的方法来定义从结果所能获得的期望的信息量，叫做结果的信息熵。（如果想要了解信息熵，或者更深入的讨论称重问题，可以看我写的书：Information Theory, Inference, and Learning Algorithms）。

通过每次总选择一次信息熵最大的尝试，我们可以快速地解决这个称重问题。

排序与bits

如果一个排序是一个比较排序，而且每一次的比较都产生两个概率相同的结果，那么一次比较所得到的平均信息量是1bit。

排序N个对象所需要的信息量；恰好是logN！个bits（假设对象没有优先级信息）。

使用Stiring的方法，这个总的信息量是：T =N log₂N -N log₂e.

任何排序算法的平均比较次数不会小于T。并且，当比较的结果是等可能（结果的两种情况概率相同）时，这个排序的平均比较次数将接近于T。

那么，是什么原因导致了“在一般情况下，堆排的速度会比快排慢”？

很显然，这是因为堆排的比较产生的结果不是等概率的。我们将在下面解释这个问题。

附带的说明一下，标准的快排随机算法也有同样的缺陷。但是，令人气愤的是，很多算法学习者都会说“在大部分情况下，快排随机算法使用了O（NlogN）次的比较”，并且，他们丢掉了常数因子。多么令人哭笑不得的符号’O’！当有人辩解说“我们关心的是当N很大的情况下的渐进表现”，这是多么愚蠢的表达。这样的表达好像在说，我们并不需要关心一个4NlogN的算法和一个NlogN算法之间的差别！

但是，如果我们计算出了这个NlogN算法的常数因子，或者，我们将目光转向算法的本质的时候，一切将变的非常有意思。这里，我将给出最终的一个结果，并且证明它的正确性。一个随机快排的算法，它的平均时间消耗是N log_e^1/2N。

实际的消耗会比理想情况要大。但是，在理想情况下，我们将T约等于N log₂N，将常数1/log₂1.649约等于1.39。如果我们关心比较次数，那么我们还需要找到一个比快排更好的算法。

More here... You can see thhat quicksort has unbalanced probabilities by imagining the last few comparisons of numbers with a pivot. If the preceding 100 comparisons have divided 70 on one side and 30 on the other side, then it seems a good prediction that the next comparison with the pivot has a probability of 0.7 of going the first way and 0.3 of going the other.

回归到堆排序

堆排序的比较是低效率的，因为，它可能需要将底下的数移动到顶上，并且允许他们顺沿下去，与更大的数交换位置。。。。。。。。。

为什么堆排会这么做呢?难道没有人想过一个更加漂亮的方法，来将两个子堆中的最大值移动到堆的顶部?

下面这种方法怎么样：

改进的堆排序（可能之前就有人想过这个方法）

1、将元素放入一个有效的最大堆中

2、删除堆的顶部，产生一个空缺位置’V’

	Modified Heapsort (surely someone already thought of this)
1	Put items into a valid max-heap
2	Remove the top of the heap, creating a vacancy 'V'
3	Compare the two sub-heap leaders directly below V, and promote the biggest one into the vacancy. Recursively repeat step 3, redefining V to be the new vacancy, until we reach the bottom of the heap.	(This is just like the sift operation of heapsort, except that we've effectively promoted an element, known to be the smaller than all others, to the top of the heap; this smallest element can automatically trickle down without needing to be compared with anything.)
4	Go to step 2
		Disadvantage of this approach: it doesn't have the pretty in-place property of heapsort. But we could obtain this property again by introducing an extra swap at the end, swapping the 'smallest' element with another element at the bottom of the heap, the one which would have been removed in heapsort, and running another sift recursion from that element upwards.

让我们称这个算法为：Fast Heapsort.这不是一个原地算法，but, just like Heapsort, it extracts the sorted items one at a time from the top of the heap.

Fast Heapsort的表现

我评估了在随机排列的情况下Fast Heapsort的表现。Performance was measured solely on the basis of the number of binary comparisons required. [Fast Heapsort does require extra bookkeeping, so the CPU comparison will come out differently.]

David MacKay：用信息论解释 '快速排序''堆排序' 性质与差异

Performance of Fast Heapsort.
Horizontal axis: Number of items to be sorted, N.
Vertical axis: Number of binary comparisons. The theoretical curves show the asymptotic results for Randomized Quicksort (2 N ln N) and the information-theoretic limit, log_2 N! simeq (N log N - N)/log 2.

I haven't proved that Fast Heapsort comes close to maximizing the entropy at each step, but it seems reasonable to imagine that it might indeed do so asymptotically. After all, Heapsort's starting heap is rather like an organization in which the top dog has been made president, and the top dogs in divisions A and B have been made vice-president; a similar organization persists all the way down to the lowest level. The president originated in one of those divisions, and got his job by sequentially deposing his bosses.

Now if the boss leaves and needs to be replaced by the best person in the organization, we'll clearly need to compare the two vice-presidents; the question is, do we expect this to be a close contest? We have little cause to bet on either vice-president, a priori. There are just two asymmetries in the situation: first, the retiring president probably originated in one of the two divisions; and second, the total numbers in those two divisions may be unequal. VP A might be the best of slightly more people than VP B; the best of a big village is more likely to beat the best of a small village. And the standard way of making a heap can make rather lop-sided binary trees. In an organization with 23 people, for example, division A will contain (8+4+2+1)=15 people, and division B just (4+2+1)=7.

为了让堆排序能够更快，我建议下面两种改进：

1、让堆更平衡。Destroy the elegant heap rules, that i's children are at (2i) and (2i+1), and that a heap is filled from left to right. Will this make any difference? Perhaps not; perhaps it's like a Huffman algorithm with some free choices.

2、提供信息论的观点来初始化堆信息处理。Does the opening Heapify routine make comparisons that have entropy significantly less than one bit?

回到快排

我们同样可以给出快速排序的信息熵处理。快排很浪费，因为它坚持在两种可能结果不同等时继续做比较。大约有一半的时间，快排都用了一个差的主元—差的原因是这个主元是在在四分间距之外的—并且，一旦这个主元是差的主元，那么每个同这个主元进行比较产生的信息量将小于1bit。

一个简单的方法能够减少快排由于选择差的主元而产生的低效率，这个方法叫做“三值取中法”。这种算法的思想是：每次随机取三个数，并且选择中间的数作为主元。这种方法，使得选择到差主元的概率大大降低。但是我们还可以比这个做的更好。让我们回到分析快排产生信息的最开始。

当我们随机选择一个主元，并且将它同其他随机数进行比较时，很显然，这个结果的信息熵是1bit。这是一个漂亮的开始。但是，当我们将其他数跟这个主元进行比较时，我们却做了一次差的比较。如果第一个元素比较的结果是比主元大，那么，主观上，第二个数比主元数大的概率也较大。实际上，第二个数比主元大的概率大约是2/3（这是一个漂亮的贝叶斯定理，当N=3, N=4时，2/3的概率是准确的；或许对于所有的N，这个概率都是准确的）。(1/3,2/3)的信息熵 = 0.918，所以，第一次比较是性能还不算太差—只有8%的无用信息。根据信息熵，我们需要在第二次的比较中对比其他两个元素。但是，让我们继续使用快排。如果第一次两个元素都比主元大，那么下一个元素比主元大的概率是3/4（信息熵为：0.811 bits）。这种情况的出现频率将会在超过一半的时间里继续增加。

表1显示，将五个元素与主元进行比较之后，

State		Probability of this state	Probability that next element will go left	Entropy
left	right	Probability of this state	Probability that next element will go left	Entropy
0	5	1/6	1/7	0.59167
1	4	1/6	2/7	0.86312
2	3	1/6	3/7	0.98523
3	2	1/6	4/7	0.98523
4	1	1/6	5/7	0.86312
5	0	1/6	6/7	0.59167

Table 1

表2显示，每一次快排的迭代所产生的平均信息熵。

Iteration of quicksort	Expected entropy at this step	Minimum entropy at this step
0	1	1
1	0.918296	0.918296
2	0.874185	0.811278
3	0.846439	0.721928
4	0.827327	0.650022
5	0.81334	0.591673
6	0.80265	0.543564
7	0.794209	0.503258
8	0.78737	0.468996
9	0.781715	0.439497
10	0.77696	0.413817

Table 2

当使用快排的时候，我们可以通过计算去做一些合理的决定。比如，我们可以选择继续使用这个主元，或者从那些跟当前主元比较过的数中选择一个数作为主元（这种代价有点高）。‘三值取中法‘的起始处理可以看做是这样的：跟标准的快速排序算法一样，我们先用两个元素分别跟主元进行比较。如果我们到达(1,1)的情形，它发生的概率是1/3，那么我们就继续使用这个主元。如果我们到达(0,2)或 (2,0)的情形，就可以判断这是一个差的主元，并弃用这个主元。然后，我们从跟当前主元比较过的两个元素中选择一个中间值，作为新的主元。这个主元的选择会花费一个额外的比较时间，但是对于由此所能够获得的利益来说，这个花费可以忽略。

我们可以将“三值取中法”放到一个客观的基础上，使用信息论，并且，我们可以将它进行推广。

假设，我们已经使用一个随机选择的主元比较了M-1个数（总数为N个），并且我们到达了(m1,m2)的情形。1、我们可以选择继续使用这个主元。使用这个主元带来的下一次的比较的信息熵为H_2(p)（这里的p = (m1+1)/(m1+m2+2)）。2、我们也可以寄希望于一个中指查找算法，这个算法可以找到M个数的中值并且设为主元。中值可以在一个线性的代价（cM）内找到(Median can be found with an expected cost proportional to M。比如：快排的花费是4M。

我们可以评估一下这两种选择带来的时间消耗。

如果我们使用这个主元继续进行（N-M）次的比较，那么得到的期望信息大约是R = (N-M)H_2(p) bits。（这并不是很准确，如果想要精确求的，可以使用积分的方法来找这个精确值）。

如果我们选择的是在随后的连续迭代中使用新的主元（这样的消耗将很大—这提示我们应该使用一个新的算法—首先使用一个排序算法来找到多个主元），这个花费大约是(N-M)+4(M-1)，得到的期望信息大约是R' = (N-M)H_2(p') bits，（这里的p'是一个新的主元）。

If we approximate R' \simeq N-M then finding the new pivot has the better return-to-cost ratio if

(N-M)/ ( (N-M)+4(M-1) ) > H_2(p)
i.e. 1/ ( 1+4(M-1)/(N-M) ) > H_2(p)

我们可以这么做，如果N-M的值很大，那么我们就有更充分的理由去弃用这个主元。

Further modifying quicksort, we can plan ahead: it's almost certainly a good idea to perfectly sort a M randomly selected points and find their median, where M is roughly sqrt(N/log(N)), and use that fairly accurate median as the pivot for the first iteration.

Summary: 'median-of-3' is a good idea, but even better (for all N greater than 70 or so) is 'median-of-sqrt(N/log(N))'. If you retain the sorted subset from iteration to iteration, you'll end up with something rather like a Self-balancing binary search tree.

参考文献：

Nice explanation of randomized median-finding in O(n) time

deterministic median-finding.

David MacKay December 2005
http://www.aims.ac.za/~mackay/sorting/sorting.html ------------by---------zerocool------2012年10月21日0:32:08

David MacKay：用信息论解释 '快速排序''堆排序' 性质与差异

参考文献：

相关推荐