What is the most efficient way to partition a large hash into N smaller hashes in Ruby?
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.
Are there any non-O(n) approaches?
Is there a non-Ruby, i.e. C/C++, implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
This was the first solution I tried. What was appealing about it was:
- it doesn't require looping madly over the hash
- it doesn't require managing a counter to distribute the members evenly among the shards
- it looks short and neat
OK, it isn't O(n), but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
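For reference, a self-contained runnable version of this slicing approach, using a small illustrative hash in place of the real data:

```ruby
# Small sample hash standing in for the real > 200,000-entry data set.
s = { 'a' => 1, 'b' => 2, 'c' => 3, 'd' => 4 }

# Hash includes Enumerable, so each_slice yields arrays of [key, value]
# pairs in groups of `pivot`.
pivot = s.size / 2
slices = s.each_slice(pivot)

# Rebuild a hash from each slice by splatting the flattened pairs.
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]

p s1  # => {"a"=>1, "b"=>2}
p s2  # => {"c"=>3, "d"=>4}
```

Note that this only produces two shards; for N shards the pivot would be `s.size / n` and each slice would be rebuilt in a loop.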
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I couldn't use his approach verbatim):
def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2
    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size

  return shard1, shard2
end
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win.
Don't be fooled by 'intuition' - measure the performance of competing algorithms.
Don't worry about Ruby's performance as a language - my initial concern was that if I'm doing ten million of these operations, it could take a significant amount of time in Ruby, but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
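A sketch of what a Hash#select-based split could look like. Note that Hash#select returns a hash in Ruby 1.9+ (in 1.8 it returned an array of pairs); the positional pivot logic here is illustrative, not from the original suggestion:

```ruby
# Illustrative hash: 10 numeric keys.
s = {}
1.upto(10) { |i| s[i] = i * 10 }

# Decide which keys belong to the first shard by position
# (insertion order is preserved in Ruby 1.9+).
pivot = s.size / 2
first_half = s.keys[0...pivot]

# select keeps the pairs whose block is true; reject keeps the rest.
# Each call scans the whole hash, so this is two full passes.
s1 = s.select { |k, _| first_half.include?(k) }
s2 = s.reject { |k, _| first_half.include?(k) }

p s1.keys  # => [1, 2, 3, 4, 5]
p s2.keys  # => [6, 7, 8, 9, 10]
```

The `first_half.include?(k)` lookup is itself O(n) per key; for a real 200k-entry hash a Set (or a numeric comparison, where keys allow it) would be the sensible membership test.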
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
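One way to sketch that LIMIT-based chunking. The database is hypothetical, so an in-memory array of rows stands in for the query results here; the commented SQL shows where the real query would go:

```ruby
# Simulated result set standing in for database rows (illustrative schema:
# each row is a [key, value] pair).
rows = (1..25).map { |i| ["key#{i}", i] }

CHUNK = 10
shards = []
offset = 0

# Grab one disjoint chunk at a time, as repeated LIMIT/OFFSET queries
# would, and build a separate hash per chunk - the full 200k-entry hash
# is never materialised.
loop do
  chunk = rows[offset, CHUNK]  # stand-in for "... LIMIT #{CHUNK} OFFSET #{offset}"
  break if chunk.nil? || chunk.empty?
  shards << Hash[chunk]
  offset += CHUNK
end

p shards.map(&:size)  # => [10, 10, 5]
```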
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'
s = {}
1.upto(200_000) { |i| s[i] = i }

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k, v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end
It looks like the second approach (the each_pair loop) is much faster, even though it iterates over every entry of the large hash:
# ruby test.rb
user system total real
0.560000 0.010000 0.570000 ( 0.571401)
0.320000 0.000000 0.320000 ( 0.323099)
Hope this helps.