Entity extraction/recognition with free tools while feeding a Lucene index
I'm currently investigating the options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
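As a rough sketch of what such an enriched document could look like before indexing (the field names `technology` and `organization` are illustrative, not a fixed ElasticSearch schema):

```python
import json

def build_enriched_doc(text, entities):
    """Attach extracted entities as metadata fields next to the raw text.

    `entities` maps an entity type (e.g. 'person', 'location') to a list
    of surface forms found in the text.
    """
    doc = {"body": text}
    # One metadata field per entity type; field names are illustrative.
    for entity_type, values in entities.items():
        doc[entity_type] = sorted(set(values))
    return json.dumps(doc)

payload = build_enriched_doc(
    "Wicket is an Apache project for Java web applications.",
    {"technology": ["Wicket", "Java"], "organization": ["Apache"]},
)
```

The resulting JSON could then be sent to ElasticSearch's index API, with the metadata fields available for filtering and faceting at query time.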
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of entity extraction is high enough.
- Dbpedia Spotlight, the demo looks very promising
- OpenNLP requires training. Which training data to use?
- OpenNLP tools
- Stanbol
- NLTK
- balie
- UIMA
- GATE -> example code
- Apache Mahout
- Stanford CRF-NER
- maui-indexer
- Mallet
- Illinois Named Entity Tagger (not open source, but free)
- wikipedianer data
My questions:
- Does anyone have experience with some of the tools listed above and their precision/recall? Or, if training data is required, whether it is available?
- Are there articles or tutorials that help getting started with entity extraction (NER) for each of these tools?
- How can they be integrated with Lucene?
Here are some questions related to that subject:
- Does an algorithm exist to help detect the "primary topic" of an English sentence?
- Named Entity Recognition Libraries for Java
- Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, organization, location).
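To illustrate the difference, here is a toy sketch of knowledge-base disambiguation: each candidate sense carries a set of context keywords, and the sense whose keywords best overlap the words around the mention wins. The keyword sets are made up for this example; real systems like DBpedia Spotlight use far richer context models learned from Wikipedia.

```python
# Toy word-overlap disambiguation; the keyword sets are invented for
# illustration and would come from a knowledge base in practice.
SENSES = {
    "wicket (cricket)": {"cricket", "bat", "bowler", "stumps", "match"},
    "Apache Wicket (software)": {"apache", "java", "web", "framework", "component"},
}

def disambiguate(mention_context):
    """Return the sense whose keywords best overlap the mention's context."""
    words = set(mention_context.lower().split())
    scores = {sense: len(keywords & words) for sense, keywords in SENSES.items()}
    return max(scores, key=scores.get)

print(disambiguate("a Java web framework from Apache"))
# prints "Apache Wicket (software)"
```

Even this crude overlap score separates the two senses; the hard part a real knowledge base solves is covering millions of entities and handling contexts with little or conflicting evidence.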
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
- Zemanta
- Maui-indexer
- Dbpedia Spotlight
- Extractiv (my company)
These tools often expose a language-independent API such as REST, and I am not aware of any that provide direct Lucene support, but I hope my answer has been helpful for the problem you are trying to solve.
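For a REST-based tool, the glue code is small. As one hedged sketch, this builds (but does not send) an annotation request against DBpedia Spotlight's public demo endpoint; the URL and parameter names are taken from that demo service and may change, so treat them as assumptions:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public demo endpoint of DBpedia Spotlight; the path and parameters
# are assumptions based on the demo service and may change.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def build_annotate_request(text, confidence=0.5):
    """Build (but do not send) an annotation request for a piece of text."""
    params = urlencode({"text": text, "confidence": confidence})
    return Request(
        SPOTLIGHT_URL + "?" + params,
        headers={"Accept": "application/json"},  # ask for JSON, not HTML
    )

req = build_annotate_request("Wicket is an Apache project.")
```

Sending the request with `urllib.request.urlopen(req)` would return JSON whose resource URIs (e.g. DBpedia pages) can then be stored as metadata fields in the Lucene/ElasticSearch document.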