Simplest way to count words in a file

Question:

I'm trying to write, in the simplest way possible, a Scala program that counts word occurrences in a file. So far I have this piece of code:

import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

object WordCounter {
    val SrcDestination: String = ".." + File.separator + "file.txt"
    val Word = "\\b([A-Za-z\\-])+\\b".r

    def main(args: Array[String]): Unit = {

        val counter = Source.fromFile(SrcDestination)("UTF-8")
                .getLines
                .map(l => Word.findAllIn(l.toLowerCase()).toSeq)
                .toStream
                .groupBy(identity)
                .mapValues(_.length)

        println(counter)
    }
}

Never mind the regular expression. I would like to know how to extract the individual words from the sequences produced by this line:

map(l => Word.findAllIn(l.toLowerCase()).toSeq)

so that the occurrences of each word are counted. Currently I'm getting a map that counts whole sequences of words (one per line) rather than individual words.

Answer

You can turn the file's lines into words by splitting them with the regex "\\W+" inside flatMap, which flattens the per-line results into a single stream of words. flatMap on the iterator returned by getLines is lazy, so the whole file does not have to be loaded into memory. To count occurrences, fold over a Map[String, Int], updating it with each word. This is typically more memory- and time-efficient than groupBy, which first collects every occurrence of each word before counting them.

scala.io.Source.fromFile("file.txt")
  .getLines()
  .flatMap(_.split("\\W+"))
  .filter(_.nonEmpty) // split yields "" for blank lines and for lines starting with a separator
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }
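
For anyone who wants to try this without a file on disk, here is a minimal sketch of the same idea running on a few in-memory lines. The object name WordCountSketch, the two helper names, and the sample lines are made up for illustration. It reuses the Word regex from the question and lowercases each line as the question does. The second helper shows what is probably the smallest change to the question's own code: swapping map for flatMap before groupBy.

```scala
object WordCountSketch {
  // Same pattern as in the question: runs of letters and hyphens.
  val Word = "\\b([A-Za-z\\-])+\\b".r

  // Fold approach: walk the words once, bumping a counter in an immutable Map.
  def countWithFold(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(line => Word.findAllIn(line.toLowerCase)) // one flat stream of words
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        counts + (word -> (counts.getOrElse(word, 0) + 1))
      }

  // groupBy approach: flatMap instead of map, so the keys are words, not sequences.
  // toList materializes every word, which is what the fold version avoids.
  def countWithGroupBy(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(line => Word.findAllIn(line.toLowerCase))
      .toList
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input standing in for Source.fromFile(...).getLines()
    def sample = Iterator("The quick brown fox", "jumps over the lazy dog", "", "The end")
    println(countWithFold(sample))
    println(countWithGroupBy(sample))
  }
}
```

Both helpers should return the same counts for the same input. The difference is only in how much is held in memory while counting.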
