C++ search performance
What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources. That is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names from my list are found. Kind of like the find function (Ctrl + F) but with 70,000 keywords.
In any case, what I have thus far is:
#include &lt;cstdio&gt;
#include &lt;fstream&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
using namespace std;

int main()
{
    ifstream namesfile("names.txt"); //names list
    ifstream miscfile("misc.txt");   //misc text
    vector&lt;string&gt; vecnames; //vector to hold names
    vector&lt;string&gt; vecmisc;  //vector to hold misc text
    size_t found;
    string s;
    string t;
    while (getline(namesfile,s))
        vecnames.push_back(s);
    while (getline(miscfile,t))
        vecmisc.push_back(t);
    //outer loop iterates through names list
    for (vector&lt;string&gt;::size_type i = 0; i != vecnames.size(); ++i) {
        //inner loop iterates through the lines of the misc text file
        for (vector&lt;string&gt;::size_type j = 0; j != vecmisc.size(); ++j) {
            found = vecmisc[j].find(vecnames[i]);
            if (found != string::npos) {
                cout &lt;&lt; vecnames[i] &lt;&lt; endl;
                break;
            }
        }
    }
    cout &lt;&lt; "SEARCH COMPLETE";
    //to keep console application from exiting
    getchar();
    return 0;
}
Now this works great as far as extracting the data I need; however, it is terribly slow and obviously inefficient, since each name potentially requires searching the entire file again, which gives (70,000 x # of lines in misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcomed. Additionally, I'm using Dev C++ if that makes any difference. Thanks.
Use a std::hash_set. Insert all your keywords into the set, then traverse the large document and each time you come to a word, test whether the set includes that word.