Building Lucene synonyms

Problem description:

I have the following code:

static class TaggerAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {

        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("al"), new CharsRef("americanleague"), true);
        builder.add(new CharsRef("al"), new CharsRef("a.l."), true);
        builder.add(new CharsRef("nba"), new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), true);

        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new SynonymFilter(filter, mySynonymMap, true);
        return new TokenStreamComponents(source, filter);
    }
}

I'm running some tests, and everything went fine until I ran into this scenario:

    String title = "Very short title at a.l. bla bla";

    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"americanleague"));
    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"al"));

I expected both assertions to pass, but "americanleague" does not match "a.l.", even though both "a.l." and "americanleague" are synonyms of "al".

So, what should I do? I don't want to add every combination to the map. Thanks.

I believe you have your arguments to builder.add backwards. Try:

builder.add(new CharsRef("americanleague"), new CharsRef("al"), true);
builder.add(new CharsRef("a.l."), new CharsRef("al"), true);
builder.add(new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), new CharsRef("nba"), true);

The SynonymFilter only maps from the first argument (input) to the second argument (output), not the other way around. So you have rules that translate "al" into two different synonyms, but none that do anything to an input of "a.l." or "americanleague".
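To make the direction of the mapping concrete, here is a minimal toy model in plain Java. It does not use Lucene at all: the class `SynonymDirectionSketch` and its methods are hypothetical stand-ins built on `java.util` collections, and `add(input, output)` only mimics the argument order of `builder.add`. It is a sketch of the lookup direction under those assumptions, not a description of how SynonymFilter is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a hypothetical stand-in for a synonym map, not Lucene code.
public class SynonymDirectionSketch {

    // Each input term points to the outputs registered for it.
    private final Map<String, List<String>> rules = new HashMap<>();

    // Same argument order as builder.add(input, output, ...): input first, output second.
    public void add(String input, String output) {
        rules.computeIfAbsent(input, k -> new ArrayList<>()).add(output);
    }

    // Returns the original token followed by any outputs registered for it.
    // Lookups go from input to output only; outputs are never looked up in reverse.
    public List<String> expand(String token) {
        List<String> result = new ArrayList<>();
        result.add(token);
        result.addAll(rules.getOrDefault(token, Collections.<String>emptyList()));
        return result;
    }

    // Rules as written in the question: "al" is the input.
    public static SynonymDirectionSketch questionRules() {
        SynonymDirectionSketch s = new SynonymDirectionSketch();
        s.add("al", "americanleague");
        s.add("al", "a.l.");
        return s;
    }

    // Rules as suggested in the answer: "al" is the output.
    public static SynonymDirectionSketch answerRules() {
        SynonymDirectionSketch s = new SynonymDirectionSketch();
        s.add("americanleague", "al");
        s.add("a.l.", "al");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(questionRules().expand("a.l.")); // [a.l.]
        System.out.println(answerRules().expand("a.l."));   // [a.l., al]
        System.out.println(answerRules().expand("americanleague")); // [americanleague, al]
    }
}
```

With the question's rules, the token "a.l." from the title passes through unchanged, so nothing connects it to "americanleague". With the reversed rules, both "a.l." and "americanleague" also produce "al", which gives them a shared term to match on, provided the same rules are applied to the query text as well.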