How to find the similarity between sentences?
I'm trying to find the similarity between two sentences in a shell script.
I have two sentences containing duplicate words, for example, the input data in file my_text.txt:
Shell Script.
Linux Shell Script.
The intersection of both sentences:
Shell + Script
The union size of both sentences:
3
The correct output for the similarity of the sentences:
0.30000000000000000000
The similarity is defined as the number of words in the intersection of the two sentences divided by the size of the union of the two sentences.
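In set notation, this definition is the Jaccard index. The symbols $A$ and $B$ are introduced here only for illustration and stand for the sets of distinct words in the first and second sentence:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```

As a hypothetical example, the word sets {red, green, blue} and {green, blue, yellow} share 2 words out of 4 distinct words, so J = 2/4 = 0.5.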
The problem: I have tried hard to find a shell script that does this, but I have not found a solution.
The following script should do the trick. It also ignores duplicated words per sentence, filler words, and non-alphabetical characters, as you described in the comment section.
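# Requires bash (process substitution and here-strings); expects one sentence per line.
# Lowercase the text, emit one "line:word" pair per word, drop filler words,
# deduplicate within each sentence, then keep only the sorted words.
words=$(
  < my_text.txt tr 'A-Z' 'a-z' |
  grep -Eon '\b[a-z]+\b' |
  grep -Fwvf <(printf '%s\n' is a to be by the and for) |
  sort -u | cut -d: -f2 | sort
)
# Distinct words form the union; words listed twice occur in both sentences.
union=$(uniq <<< "$words" | wc -l)
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"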
The output for your example input is .30000000000000000000 (= 0.3).
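To see why counting with `uniq` and `uniq -d` yields the union and the intersection, here is a minimal sketch that applies the same idea to two sentences passed as arguments instead of my_text.txt. The function name `similarity` and the sample sentences are assumptions made up for illustration. The sketch leaves out the filler-word filter, uses `awk` rather than `bc` for the division (so the result is rounded to two decimals), and returns 0 when neither sentence contains a word, a case the script above does not handle.

```shell
#!/bin/sh
# Sketch: Jaccard similarity of two sentences given as arguments.
# `similarity` is a hypothetical helper name used only for illustration.
similarity() (
  # One sentence per line -> "lineno:word" pairs, deduplicated per sentence,
  # then reduced to a sorted list of words.
  words=$(
    printf '%s\n' "$1" "$2" |
    tr 'A-Z' 'a-z' |
    grep -Eon '[a-z]+' |
    sort -u | cut -d: -f2 | sort
  )
  if [ -z "$words" ]; then
    echo 0    # no words at all: avoid dividing by zero
  else
    # Each word now occurs once per sentence that contains it, so
    # distinct words = union and words listed twice = intersection.
    union=$(printf '%s\n' "$words" | uniq | wc -l)
    intersection=$(printf '%s\n' "$words" | uniq -d | wc -l)
    awk -v i="$intersection" -v u="$union" 'BEGIN { printf "%.2f\n", i / u }'
  fi
)

similarity 'red green blue' 'green blue yellow'   # prints 0.50
```

The sorted list for this call is blue, blue, green, green, red, yellow: four distinct words (the union) and two that appear twice (the intersection).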