Removing non-English words from text using Python

Question:

I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online for whether this can be done in Python with a toolkit such as NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my"

Does anyone know how this could be done? Any help would be much appreciated.

Answer:

You can use the words corpus from NLTK:

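import nltk

# The corpus must be downloaded once beforehand: nltk.download("words")
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens found in the word list, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'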

Unfortunately, Io happens to be an English word, so it survives the filter. In general, it may be hard to decide whether a word is English or not.
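One way to deal with words like Io is to control the vocabulary yourself. The following is only a minimal sketch, not part of the original answer: the function name keep_known_words, its keep_non_alpha parameter, and the tiny toy_vocabulary are all made up for illustration, and the regular expression is assumed to split text in roughly the same way as nltk.wordpunct_tokenize.

```python
import re

def keep_known_words(text, vocabulary, keep_non_alpha=False):
    """Return text with alphabetic tokens not found in vocabulary removed.

    keep_known_words and keep_non_alpha are hypothetical names for this sketch.
    """
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    kept = []
    for token in tokens:
        if token.isalpha():
            # Alphabetic token: keep it only if its lowercase form is known.
            if token.lower() in vocabulary:
                kept.append(token)
        elif keep_non_alpha:
            # Punctuation, digits, etc. are kept only on request.
            kept.append(token)
    return " ".join(kept)

# Tiny hand-written vocabulary, for illustration only.
toy_vocabulary = {"to", "the", "beach", "with", "my"}

result = keep_known_words("Io andiamo to the beach with my amico.", toy_vocabulary)
print(result)  # to the beach with my
```

In practice the vocabulary could be built from the NLTK word list, lowercased, with any ambiguous entries you want to reject (such as "io") subtracted from the set before filtering. Which entries to subtract depends on your data.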
