The simplest way to count words in a file
I'm trying to write, in the simplest way possible, a Scala program that counts word occurrences in a file. So far I have this piece of code:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

object WordCounter {
  val SrcDestination: String = ".." + File.separator + "file.txt"
  val Word = "\\b([A-Za-z\\-])+\\b".r

  def main(args: Array[String]): Unit = {
    val counter = Source.fromFile(SrcDestination)("UTF-8")
      .getLines
      .map(l => Word.findAllIn(l.toLowerCase()).toSeq)
      .toStream
      .groupBy(identity)
      .mapValues(_.length)
    println(counter)
  }
}
Never mind the regular expression. I would like to know how to extract the individual words from the sequence retrieved in this line:
map(l => Word.findAllIn(l.toLowerCase()).toSeq)
so that the occurrences of each word are counted. Currently I'm getting a map keyed by whole word sequences (one per line) rather than by individual words.
You can turn the file's lines into words by splitting them with the regex "\\W+" (flatMap on the iterator returned by getLines is lazy, so the entire file does not need to be loaded into memory). To count occurrences, you can fold over a Map[String, Int], updating it with each word (much more memory- and time-efficient than using groupBy):
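// Split each line on runs of non-word characters and count the words in one pass.
scala.io.Source.fromFile("file.txt")
  .getLines()
  .flatMap(_.split("\\W+"))
  .filter(_.nonEmpty) // split yields "" for blank lines and lines starting with a separator
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }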
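For completeness, here is a minimal self-contained sketch of the same idea that keeps the Word regex and lower-casing from the question instead of splitting on "\\W+". The names WordCounterSketch, countWords and countWordsInFile are hypothetical and only serve the illustration; closing the source in a finally block is an addition that is not part of the snippet above.

```scala
import scala.io.Source

object WordCounterSketch {
  // Same pattern as the Word regex in the question.
  private val Word = "\\b([A-Za-z\\-])+\\b".r

  // Counts lower-cased words across all lines in a single pass.
  // flatMap flattens the per-line matches into one iterator of words,
  // which is the step that plain map was missing.
  def countWords(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(line => Word.findAllIn(line.toLowerCase))
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      }

  // Hypothetical helper: reads a UTF-8 file and closes it once counting is done.
  def countWordsInFile(path: String): Map[String, Int] = {
    val source = Source.fromFile(path, "UTF-8")
    try countWords(source.getLines()) finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // In-memory sample so the sketch runs without any file on disk.
    val sample = Iterator("The quick fox", "the lazy dog", "", "Fox!")
    println(countWords(sample))
  }
}
```

Keeping countWords independent of the file makes it easy to try on a few lines in the REPL before pointing countWordsInFile at a real path.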