How to find the similarity between sentences?

Problem description:

I'm trying to find the similarity between two sentences in a shell script.

I have two sentences containing duplicate words, for example the input data in the file my_text.txt:

Shell Script.
Linux Shell Script.

  • The intersection of both sentences: Shell + Script

    The size of the union of both sentences: 3

    The correct output for the similarity of the sentences:

     0.66666666666666666666
    

    The similarity is defined as the number of words the two sentences share (their intersection) divided by the size of the union of the two sentences.

    The problem: I have tried hard to find a shell script that does this, but I have not found a solution.

The following script should do the trick. It also ignores words duplicated within a sentence, filler words, and non-alphabetic characters, as you described in the comment section.

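# Requires bash (process substitution, here-strings).
# Build one lowercase word per line: grep -on tags each word with its line
# number, the second grep drops filler words, and sort -u removes words
# repeated within the same sentence before the line numbers are cut off.
words=$(
  < my_text.txt tr 'A-Z' 'a-z' |
  grep -Eon '\b[a-z]+\b' |
  grep -Fwvf <(printf '%s\n' is a to be by the and for) |
  sort -u | cut -d: -f2 | sort
)
# Distinct words overall form the union; words listed twice occur in both sentences.
union=$(uniq <<< "$words" | wc -l)
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"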

The output for your example input is .66666666666666666666 (= 2/3, since 2 of the 3 distinct words appear in both sentences).
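
The script above reads exactly two sentences from my_text.txt, one per line. If the sentences are already in variables, the same idea could be wrapped in a function, roughly as sketched below. The names word_set and sentence_similarity are hypothetical, the filler-word list is simply copied from the script above, and awk replaces bc here only to round to three decimals and to avoid dividing by zero when neither sentence contains a usable word.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original answer.

# Print the distinct lowercase words of one sentence, one per line,
# skipping the same filler words as the script above.
word_set() {
  printf '%s\n' "$1" |
    tr 'A-Z' 'a-z' |
    grep -Eo '[a-z]+' |
    grep -Exv 'is|a|to|be|by|the|and|for' |
    sort -u
}

# Print (intersection size) / (union size) for two sentences,
# rounded to three decimals.
sentence_similarity() {
  # Each word set is already deduplicated, so a word that shows up twice
  # in the combined list must occur in both sentences.
  both=$(word_set "$1"; word_set "$2")
  union=$(printf '%s\n' "$both" | sed '/^$/d' | sort -u | wc -l)
  intersection=$(printf '%s\n' "$both" | sed '/^$/d' | sort | uniq -d | wc -l)
  awk -v i="$intersection" -v u="$union" \
    'BEGIN { if (u == 0) print "0.000"; else printf "%.3f\n", i / u }'
}

sentence_similarity 'Shell Script.' 'Linux Shell Script.'   # prints 0.667
```

Passing the sentences as arguments avoids depending on line numbers in a file, so the function can be called in a loop over many sentence pairs.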