C#: How to extract only the words from an MS Word document

Problem description:

I am reading an MS Word document using C#, and I want only the words (upper and lower case), not spaces, commas, numbers, special characters, symbols, etc. Kindly help me with a good solution with code. Thanks in advance.


Hi,
A reliable, professional-grade solution requires a lot of programming and is not a trivial task. You can find one good example online in my free Semantic Analyzer, which extracts words and sentences from arbitrary text (multilingual, by the way) and then applies a concordance calculator to compute the frequency of word occurrences.

In general, you must first get a string containing the plain text of interest (no formatting, etc.), then remove all special characters (such as ",", ":", ";") using either String.Replace() or a regular expression, and then apply String.Split() with a " " separator. You will get an array of strings containing the words in the text. In a real-world solution, you must do much more string processing, e.g., collapsing runs of blank spaces "     " into a single one " ". As mentioned above, an entire production-grade solution goes far beyond the scope of a single article and is also subject/domain-specific. You should probably start with a simple prototype and then trim it to fit your particular case. For your immediate needs, you can use my free online semantic analyzer, which provides reasonable accuracy.
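
As a rough illustration of the steps above, here is a minimal sketch of such a prototype. It assumes the plain text has already been read from the document into a string (for example via the Range.Text property if you use Word interop). The WordExtractor class and ExtractWords method are hypothetical names made up for this example. Note that it replaces each run of non-letter characters with a space instead of deleting it, so that words separated only by punctuation (like "one,two") are not glued together; that is my own tweak on the "remove, then split" idea.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Hypothetical helper (not from the original post): returns only the
    // alphabetic words found in a plain-text string.
    public static string[] ExtractWords(string plainText)
    {
        // Replace every run of non-letter characters (digits, punctuation,
        // symbols, whitespace) with a single space.
        // \p{L} matches any Unicode letter, upper or lower case.
        string lettersOnly = Regex.Replace(plainText, @"[^\p{L}]+", " ");

        // Split on the space separator; RemoveEmptyEntries drops the empty
        // strings produced by a leading or trailing space.
        return lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, WordExtractor.ExtractWords("Hello, world! 123 C# rocks.") gives the four strings "Hello", "world", "C" and "rocks". Keep in mind that this simple rule also splits contractions and hyphenated words ("don't" becomes "don" and "t"), which is the kind of case-specific tuning mentioned above.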

Kind regards,
AB