Trigram probabilities in a huge text file
I have a large Bengali monolingual corpus consisting of over 100 million Bengali sentences. The corpus is in .txt format and the file is 1.8 GB. In order to build a Bengali grammar checker, I need to use this corpus to calculate trigram language-model probabilities. However, computing trigram probabilities over such a large file seems to take an enormous amount of time. Please suggest how to solve this issue and which techniques I should use in this case. Should I use PHP or Python for this? I have sufficient knowledge of both. TIA
If you already know that it will be challenging to get this working, why make your life hard by using Python or, even worse, PHP?
This is a fairly straightforward task: counting.
That really is something you can implement in a more memory-efficient and faster language like C, if you need it to be fast. For example, an integer (and you will need many) is 4 bytes in C; in Python you need 12, and most likely these will be stored in a different memory location, so you pay another 8 bytes just to reference where the integer is. A pure Python approach will easily need 3x-4x as much memory as a C version, and all these memory indirections also hurt performance.
You can then still work with Python for the later steps.
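For those later steps, a minimal Python sketch of the probability estimate itself (names hypothetical): the maximum-likelihood trigram probability is P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), so you need bigram counts alongside the trigram counts. This assumes whitespace-tokenized input and does no smoothing; a real grammar checker would want something like Kneser-Ney for unseen trigrams.

```python
from collections import Counter

def count_ngrams(lines):
    """Stream over lines once, collecting bigram and trigram counts."""
    bigrams, trigrams = Counter(), Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - 1):
            bigrams[tuple(words[i:i + 2])] += 1
        for i in range(len(words) - 2):
            trigrams[tuple(words[i:i + 3])] += 1
    return bigrams, trigrams

def trigram_prob(bigrams, trigrams, w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2); 0.0 if the bigram is unseen."""
    denom = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / denom if denom else 0.0

# Tiny demo corpus; with the real file, pass the open file object
# so lines are streamed rather than loaded into memory at once.
bi, tri = count_ngrams(["the cat sat on the mat", "the cat ran"])
print(trigram_prob(bi, tri, "the", "cat", "sat"))  # 0.5
```

Here `count_ngrams` would just as happily consume counts produced by the C stage; only the division step needs to live in Python.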