Trigram probabilities in a huge text file
I have a large Bengali monolingual corpus consisting of over 100 million Bengali sentences. The corpus is in .txt format and the file is 1.8 GB. In order to build a Bengali grammar checker, I need to use this corpus to calculate trigram language-model probabilities. However, computing trigram probabilities over such a large file seems to take an enormous amount of time. Please suggest how to solve this issue and which techniques I should use in this case. Should I use PHP or Python for this? I have sufficient knowledge of both. TIA
If you already know that it will be challenging to get this working, why make your life hard by using Python or, even worse, PHP?
This is a fairly straightforward task: counting.
That really is something you can implement in a more memory-efficient and faster language like C, if you need it to be fast. For example, an integer (and you will need many) is 4 bytes in C; in Python you need 12, and most likely these will be stored in a different memory location, so you pay another 8 bytes just to reference where the integer is. A pure Python approach will easily need 3x-4x as much memory as a C version, and all these memory indirections also hurt performance.
You can then still work with Python for the later steps.
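For those later steps, a minimal Python sketch of the probability estimate itself (names hypothetical): the maximum-likelihood trigram probability is P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), so you need bigram counts alongside the trigram counts. This assumes whitespace-tokenized input and does no smoothing; a real grammar checker would want something like Kneser-Ney for unseen trigrams.

```python
from collections import Counter

def count_ngrams(lines):
    """Stream over lines once, collecting bigram and trigram counts."""
    bigrams, trigrams = Counter(), Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - 1):
            bigrams[tuple(words[i:i + 2])] += 1
        for i in range(len(words) - 2):
            trigrams[tuple(words[i:i + 3])] += 1
    return bigrams, trigrams

def trigram_prob(bigrams, trigrams, w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2); 0.0 if the bigram is unseen."""
    denom = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / denom if denom else 0.0

# Tiny demo corpus; with the real file, pass the open file object
# so lines are streamed rather than loaded into memory at once.
bi, tri = count_ngrams(["the cat sat on the mat", "the cat ran"])
print(trigram_prob(bi, tri, "the", "cat", "sat"))  # 0.5
```

Here `count_ngrams` would just as happily consume counts produced by the C stage; only the division step needs to live in Python.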