Entity extraction/recognition with free tools while feeding a Lucene index
I'm currently investigating the options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
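As a rough sketch of what such an enriched document could look like before indexing (the field names `technology` and `organization` are illustrative, not a fixed ElasticSearch schema):

```python
import json

def build_enriched_doc(text, entities):
    """Attach extracted entities as metadata fields next to the raw text.

    `entities` maps an entity type (e.g. 'person', 'location') to a list
    of surface forms found in the text.
    """
    doc = {"body": text}
    # One metadata field per entity type; field names are illustrative.
    for entity_type, values in entities.items():
        doc[entity_type] = sorted(set(values))
    return json.dumps(doc)

payload = build_enriched_doc(
    "Wicket is an Apache project for Java web applications.",
    {"technology": ["Wicket", "Java"], "organization": ["Apache"]},
)
```

The resulting JSON could then be sent to ElasticSearch's index API, with the metadata fields available for filtering and faceting at query time.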
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of entity extraction is high enough.
- Dbpedia Spotlight, the demo looks very promising
- OpenNLP requires training. Which training data to use?
- OpenNLP tools
- Stanbol
- NLTK
- balie
- UIMA
- GATE -> example code
- Apache Mahout
- Stanford CRF-NER
- maui-indexer
- Mallet
- Illinois Named Entity Tagger (not open source, but free)
- wikipedianer data
My questions:
- Does anyone have experience with some of the tools listed above and their precision/recall? Or, if training data is required, whether it is available?
- Are there articles or tutorials that help getting started with entity extraction (NER) for each of these tools?
- How can they be integrated with Lucene?
Here are some questions related to that subject:
- Does an algorithm exist to help detect the "primary topic" of an English sentence?
- Named Entity Recognition Libraries for Java
- Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, organization, location).
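To illustrate the difference, here is a toy sketch of knowledge-base disambiguation: each candidate sense carries a set of context keywords, and the sense whose keywords best overlap the words around the mention wins. The keyword sets are made up for this example; real systems like DBpedia Spotlight use far richer context models learned from Wikipedia.

```python
# Toy word-overlap disambiguation; the keyword sets are invented for
# illustration and would come from a knowledge base in practice.
SENSES = {
    "wicket (cricket)": {"cricket", "bat", "bowler", "stumps", "match"},
    "Apache Wicket (software)": {"apache", "java", "web", "framework", "component"},
}

def disambiguate(mention_context):
    """Return the sense whose keywords best overlap the mention's context."""
    words = set(mention_context.lower().split())
    scores = {sense: len(keywords & words) for sense, keywords in SENSES.items()}
    return max(scores, key=scores.get)

print(disambiguate("a Java web framework from Apache"))
# prints "Apache Wicket (software)"
```

Even this crude overlap score separates the two senses; the hard part a real knowledge base solves is covering millions of entities and handling contexts with little or conflicting evidence.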
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
- Zemanta
- Maui-indexer
- Dbpedia Spotlight
- Extractiv (my company)
These tools often expose a language-independent API such as REST, and I am not aware of any that provide direct Lucene support, but I hope my answer has been helpful for the problem you are trying to solve.
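For a REST-based tool, the glue code is small. As one hedged sketch, this builds (but does not send) an annotation request against DBpedia Spotlight's public demo endpoint; the URL and parameter names are taken from that demo service and may change, so treat them as assumptions:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public demo endpoint of DBpedia Spotlight; the path and parameters
# are assumptions based on the demo service and may change.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def build_annotate_request(text, confidence=0.5):
    """Build (but do not send) an annotation request for a piece of text."""
    params = urlencode({"text": text, "confidence": confidence})
    return Request(
        SPOTLIGHT_URL + "?" + params,
        headers={"Accept": "application/json"},  # ask for JSON, not HTML
    )

req = build_annotate_request("Wicket is an Apache project.")
```

Sending the request with `urllib.request.urlopen(req)` would return JSON whose resource URIs (e.g. DBpedia pages) can then be stored as metadata fields in the Lucene/ElasticSearch document.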