Scala - Count the number of occurrences of each key in an iterator

Question:

I have an iterator containing some key-value pairs, e.g.


(jen,xyz) (ken, zxy) (jen,asd) (ken, asdf)

The result should be


(jen,2) (ken, 2)

How do I use the count function (or any other) to count the number of occurrences of each key in the iterator of that particular collection?

Edit:
The collection that this iterator represents in my use case has a large number of records, possibly in the range of millions, so I need the most efficient (lowest time complexity) way to do this. I found that the default count method is pretty fast, and that it could perhaps be used to produce the desired result.

The approach that Peter Neyens suggests will work, but it could be very inefficient (time and memory) for some applications due to the way toList, groupBy, and length are used. It is generally going to be much more efficient to aggregate the counts directly into a map and avoid all the unnecessary creation of Lists.

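import scala.collection.TraversableOnce
import scala.collection.mutable.HashMap

// Counts how often each element occurs, in a single pass over xs.
def counts[T](xs: TraversableOnce[T]): Map[T, Int] = {
  // The mutable map defaults missing keys to 0, so acc(x) += 1 works for unseen elements.
  xs.foldLeft(HashMap.empty[T, Int].withDefaultValue(0))((acc, x) => { acc(x) += 1; acc}).toMap
}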

Once you have defined the counts method you can apply it to your iterator of key-value pairs like so:

val iter: Iterator[(String, String)] = ???
val keyCounts = counts(iter.map(_._1))
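
As a quick check against the sample data from the question, here is a minimal self-contained sketch. It restates counts over a plain Iterator[T] so that it runs on its own; that signature, and the names pairs and sampleCounts, are assumptions made for this illustration rather than part of the answer above.

```scala
import scala.collection.mutable

// Restated so the snippet is self-contained; same single-pass idea as counts above.
def counts[T](xs: Iterator[T]): Map[T, Int] = {
  // Missing keys default to 0, so the first increment for a key yields 1.
  val acc = mutable.HashMap.empty[T, Int].withDefaultValue(0)
  xs.foreach(x => acc(x) += 1)
  acc.toMap
}

// The four key-value pairs from the question.
val pairs = Iterator(("jen", "xyz"), ("ken", "zxy"), ("jen", "asd"), ("ken", "asdf"))

// Drop the values and count the keys only.
val sampleCounts = counts(pairs.map(_._1))
// sampleCounts == Map("jen" -> 2, "ken" -> 2)
```

Mapping to the key before counting means the values are never retained, so only one map entry per distinct key is held in memory.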

The counts method defined above works well for Iterators over a large number of values, e.g.

val iter = Iterator.range(0, 100000000).map(i => (i % 1931, i))
val countMap = counts(iter.map(_._1))
// Map(645 -> 51787, 892 -> 51787, 69 -> 51787, 1322 -> 51786, ...)

works fine, while the approach suggested in Peter's answer, i.e.

val iter = Iterator.range(0, 100000000).map(i => (i % 1931, i))
val countMap = iter.toList.groupBy(_._1).mapValues(_.length).toMap

chugs away for a while and ultimately results in an OutOfMemoryError. It fails because of all the unnecessary List creation: toList materializes every pair and groupBy builds a List per key, whereas counts only ever holds one entry per distinct key.
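
If you would rather avoid mutable state, the same single-pass aggregation can probably be written as a fold over an immutable Map. This is only a sketch of that variant: the name countsImmutable and the smaller input range are assumptions for illustration, and it has not been benchmarked against the mutable version.

```scala
// Single pass over the iterator; the accumulator holds one entry per distinct key.
def countsImmutable[T](xs: Iterator[T]): Map[T, Int] =
  xs.foldLeft(Map.empty[T, Int]) { (acc, x) =>
    acc.updated(x, acc.getOrElse(x, 0) + 1)
  }

// A smaller version of the example above: 19310 pairs spread over 1931 keys.
val small = Iterator.range(0, 19310).map(i => (i % 1931, i))
val smallCounts = countsImmutable(small.map(_._1))

// Only 1931 entries are ever kept, and each key occurs exactly 10 times.
println(smallCounts.size)
println(smallCounts.values.forall(_ == 10))
```

Either way, memory use grows with the number of distinct keys rather than with the number of records, which is what lets this approach cope with very large iterators.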