C++ search performance

Problem description:

What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text obtained from miscellaneous sources, so its contents change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names from my list are found in it. Kind of like the find function (Ctrl+F), but with 70,000 keywords.

In any case, what I have thus far is:

#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
     ifstream namesfile("names.txt");   // names list
     ifstream miscfile("misc.txt");     // misc text
     vector<string> vecnames;           // vector to hold names
     vector<string> vecmisc;            // vector to hold misc text
     size_t found;

     string s;
     string t;

     // read the names list, one name per line
     while (getline(namesfile, s))
         vecnames.push_back(s);

     // read the misc text, one line per element
     while (getline(miscfile, t))
         vecmisc.push_back(t);

     // outer loop iterates through the names list
     for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
         // inner loop iterates through the lines of the misc text file
         for (vector<string>::size_type j = 0; j != vecmisc.size(); ++j) {
             found = vecmisc[j].find(vecnames[i]);
             if (found != string::npos) {
                 cout << vecnames[i] << endl;
                 break;   // name found once; move on to the next name
             }
         }
     }

     cout << "SEARCH COMPLETE";

     // to keep console application from exiting
     getchar();

     return 0;
}

This works as far as extracting the data I need. However, it is terribly slow and obviously inefficient, since each name potentially requires searching the entire file again, which gives up to (70,000 x number of lines in the misc text file) substring searches. If anyone could help, I would certainly appreciate it. Some sample code would be most welcome. Additionally, I'm using Dev-C++, if that makes any difference. Thanks.

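Use a hash set: std::unordered_set in C++11, or the non-standard hash_set extension that some older compilers ship (the standard library has no std::hash_set). Insert all your keywords into the set, then traverse the large document and, each time you come to a word, test whether the set contains that word. Each lookup takes constant time on average, so the total work is roughly proportional to the size of the text instead of 70,000 times the number of lines.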
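Here is a rough sketch of that idea, not a drop-in replacement. It assumes a C++11 compiler (an older Dev-C++ toolchain may need an upgrade or a non-standard hash_set), one name per line in the names file, and names that are single words. The function names loadNames and findNames are made up for illustration. Note that this matches whole words, case-sensitively, whereas the original loop matched substrings, so "Ann" would no longer be reported for "Annabel", and a multi-word name such as "Mary Ann" would not be found at all.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <set>
#include <string>
#include <unordered_set>

// Hypothetical helper: load one name per line into a hash set.
std::unordered_set<std::string> loadNames(std::istream& in) {
    std::unordered_set<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings left over by getline.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!line.empty()) names.insert(line);
    }
    return names;
}

// Hypothetical helper: split `text` into words (runs of letters and digits,
// an assumption about what counts as a word) and return every word that
// appears in `names`. std::set keeps the result sorted and free of repeats.
std::set<std::string> findNames(const std::unordered_set<std::string>& names,
                                const std::string& text) {
    std::set<std::string> found;
    std::string word;
    // Run one step past the end so the final word is flushed too.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool inWord =
            i < text.size() &&
            std::isalnum(static_cast<unsigned char>(text[i])) != 0;
        if (inWord) {
            word += text[i];
        } else if (!word.empty()) {
            if (names.count(word) != 0) found.insert(word);  // average O(1) lookup
            word.clear();
        }
    }
    return found;
}
```

To use it with the files from the question, one option is to open names.txt with an ifstream and pass it to loadNames, read misc.txt into a single string, call findNames, and print each element of the returned set. The text is then scanned once, with one hash lookup per word.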