What is the most efficient way to partition a large hash into N smaller hashes in Ruby?
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.
Are there any non-O(n) approaches?
Is there a non-Ruby, i.e. C/C++, implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
This was the first solution I tried. What was appealing about it was:
- it doesn't require looping madly over the hash
- it doesn't require managing a counter to distribute the members evenly among the shards
- it looks short and neat
OK, it isn't O(n), but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
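For reference, a self-contained runnable version of this slicing approach, using a small illustrative hash in place of the real data:

```ruby
# Small sample hash standing in for the real > 200,000-entry data set.
s = { 'a' => 1, 'b' => 2, 'c' => 3, 'd' => 4 }

# Hash includes Enumerable, so each_slice yields arrays of [key, value]
# pairs in groups of `pivot`.
pivot = s.size / 2
slices = s.each_slice(pivot)

# Rebuild a hash from each slice by splatting the flattened pairs.
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]

p s1  # => {"a"=>1, "b"=>2}
p s2  # => {"c"=>3, "d"=>4}
```

Note that this only produces two shards; for N shards the pivot would be `s.size / n` and each slice would be rebuilt in a loop.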
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I couldn't use his approach verbatim):
def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2
    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size

  return shard1, shard2
end
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win.
Don't be fooled by 'intuition' - measure the performance of competing algorithms.
Don't worry about Ruby's performance as a language - my initial concern was that if I'm doing ten million of these operations, it could take a significant amount of time in Ruby, but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
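A sketch of what a Hash#select-based split could look like. Note that Hash#select returns a hash in Ruby 1.9+ (in 1.8 it returned an array of pairs); the positional pivot logic here is illustrative, not from the original suggestion:

```ruby
# Illustrative hash: 10 numeric keys.
s = {}
1.upto(10) { |i| s[i] = i * 10 }

# Decide which keys belong to the first shard by position
# (insertion order is preserved in Ruby 1.9+).
pivot = s.size / 2
first_half = s.keys[0...pivot]

# select keeps the pairs whose block is true; reject keeps the rest.
# Each call scans the whole hash, so this is two full passes.
s1 = s.select { |k, _| first_half.include?(k) }
s2 = s.reject { |k, _| first_half.include?(k) }

p s1.keys  # => [1, 2, 3, 4, 5]
p s2.keys  # => [6, 7, 8, 9, 10]
```

The `first_half.include?(k)` lookup is itself O(n) per key; for a real 200k-entry hash a Set (or a numeric comparison, where keys allow it) would be the sensible membership test.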
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
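One way to sketch that LIMIT-based chunking. The database is hypothetical, so an in-memory array of rows stands in for the query results here; the commented SQL shows where the real query would go:

```ruby
# Simulated result set standing in for database rows (illustrative schema:
# each row is a [key, value] pair).
rows = (1..25).map { |i| ["key#{i}", i] }

CHUNK = 10
shards = []
offset = 0

# Grab one disjoint chunk at a time, as repeated LIMIT/OFFSET queries
# would, and build a separate hash per chunk - the full 200k-entry hash
# is never materialised.
loop do
  chunk = rows[offset, CHUNK]  # stand-in for "... LIMIT #{CHUNK} OFFSET #{offset}"
  break if chunk.nil? || chunk.empty?
  shards << Hash[chunk]
  offset += CHUNK
end

p shards.map(&:size)  # => [10, 10, 5]
```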
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'
s = {}
1.upto(200_000) { |i| s[i] = i }

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k, v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end
It looks like the second approach (the each_pair loop) is much faster, even though it iterates over every entry of the large hash:
# ruby test.rb
user system total real
0.560000 0.010000 0.570000 ( 0.571401)
0.320000 0.000000 0.320000 ( 0.323099)
Hope this helps.