python: I can't work this problem out, can anyone help?
Read text from an English document and represent each sentence as a bag-of-words feature vector. The requirements are:
1) Read all the English sentences from the file;
2) Count the words across all the sentences;
3) Represent each sentence as a bag-of-words vector;
4) Save each sentence's vector to a new file.
The contents of the document are shown below.
"State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",
"supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",
"and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",
"character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",
"Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"
import numpy as np
import re
from gensim import corpora

def onehot_matrix(list1):
    words = []
    docs = []
    for sentence in list1:
        # strip punctuation (commas, periods, colons); a raw string keeps the
        # regex free of invalid escape sequences
        docs.append(re.sub(r"[,.:]", "", sentence))
    for i in range(len(docs)):
        docs[i] = docs[i].split()  # split() also handles repeated spaces
        words += docs[i]
    vocab = sorted(set(words), key=words.index)  # unique words, in first-appearance order
    # build an M x V all-zero matrix: M sentences, V unique words (the encoding dimension)
    V = len(vocab)
    M = len(list1)
    bow = np.zeros((M, V), dtype=int)  # one bag-of-words row per sentence
    # build the gensim dictionary (word -> integer id) and print it;
    # named "dictionary" to avoid shadowing the built-in dict
    dictionary = corpora.Dictionary([words])
    print(dictionary.token2id)
    # fill in the counts; every token is in vocab by construction
    for i, doc in enumerate(docs):
        for word in doc:
            pos = vocab.index(word)
            bow[i][pos] += 1
    return [list(row) for row in bow]
list1 = ["State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",
         "supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",
         "and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",
         "character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",
         "Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"]
print(onehot_matrix(list1))
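The script also stops short of requirement 4. A minimal sketch of the saving step, writing one space-separated vector per line ("bow_vectors.txt" is my own placeholder name):

# Sketch for requirement 4: save each sentence's vector to a new file.
# "bow_vectors.txt" is a hypothetical filename.
vectors = onehot_matrix(list1)
with open("bow_vectors.txt", "w", encoding="utf-8") as f:
    for vec in vectors:
        f.write(" ".join(map(str, vec)) + "\n")

As a side note, since the code already builds a gensim Dictionary, dictionary.doc2bow(doc) would return sparse (token_id, count) pairs per sentence directly; the manual vocab.index loop above just makes the dense fixed-length vectors explicit.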
Happy to discuss it here.