How do I identify whether the text in text, Word, and PDF files is English?

Problem description:


Hi,

I need to identify, using C#, whether the text in doc, text, PDF, and xls files is in English.

Please give me some sample code.

Assuming you already know how to extract the text from all these file types, you need to analyse the text and check every word against the vocabulary of each known language to see where it exists. Once you have tested every word, if more than some percentage (say 95%) are found only in English, you can be reasonably confident that all the text is English.

As you can see, this is not a trivial task.
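
The percentage check described above could be sketched roughly as follows. This is a simplified illustration, not a complete solution: it assumes the text has already been extracted from the file, and it only checks words against an English word list supplied by the caller rather than comparing against every known language. The names `EnglishDetector`, `EnglishWordRatio` and `IsProbablyEnglish` are hypothetical and do not come from any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any library): estimates whether text is
// English by looking each word up in a caller-supplied English word list.
public static class EnglishDetector
{
    // Matches runs of letters in any script, allowing one inner apostrophe ("don't").
    private static readonly Regex WordPattern =
        new Regex(@"\p{L}+(?:'\p{L}+)?", RegexOptions.Compiled);

    // Returns the fraction (0.0 to 1.0) of words in the text that appear in
    // englishWords. The word list is assumed to contain lower-case entries.
    public static double EnglishWordRatio(string text, ISet<string> englishWords)
    {
        List<string> words = WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .ToList();

        // No words at all: there is nothing to support a claim of "English".
        if (words.Count == 0) return 0.0;

        int known = words.Count(w => englishWords.Contains(w));
        return (double)known / words.Count;
    }

    // True when at least `threshold` of the words are in the English list.
    // The 0.95 default mirrors the 95% figure used as an example above.
    public static bool IsProbablyEnglish(
        string text, ISet<string> englishWords, double threshold = 0.95)
    {
        return EnglishWordRatio(text, englishWords) >= threshold;
    }
}
```

In practice the word list would be loaded from a reasonably large dictionary file into a `HashSet<string>`. Because many words exist in more than one language, a fuller version would also look each word up in the word lists of other languages and count only those found in English alone, as the answer suggests.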