Python: I can't figure out how to write this problem, can anyone help out?

Problem description:

Read text from an English document and represent each sentence as a bag-of-words feature vector. Requirements:

1) Read all English sentences from the file (a reading sketch follows the document collection below);

2) Count the words appearing across all sentences;

3) Represent each sentence as a bag-of-words vector;

4) Save each sentence's vector to a new document (a saving sketch follows the code at the end).

The document collection is shown below.

"State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",

"supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",

"and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",

"character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",

"Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"

import numpy as np
import re

def onehot_matrix(list1):
    words = []
    docs = []
    for sentence in list1:   # strip punctuation
        cleaned = re.sub(r"[,.:]", "", sentence)  # raw string, so no invalid-escape warnings
        docs.append(cleaned)  # sentence with punctuation removed

    for i in range(len(docs)):
        docs[i] = docs[i].split(" ")
        words += docs[i]
    vocab = sorted(set(words), key=words.index)  # unique words, kept in first-occurrence order

    V = len(vocab)   # build an M-by-V zero matrix: M sentences, V unique words (the encoding dimension)
    M = len(list1)
    bow = np.zeros((M, V), dtype=int)  # one bag-of-words row per sentence

    # build the dictionary: map each word to its column index, so the
    # printed ids match the positions used in the vectors below
    token2id = {word: idx for idx, word in enumerate(vocab)}
    print(token2id)  # print the dictionary
    for i, doc in enumerate(docs):  # fill in the bag-of-words counts
        for word in doc:
            bow[i][token2id[word]] += 1
    return [list(row) for row in bow]

list1 = ["State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",
         "supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",
         "and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",
         "character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",
         "Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"]
print(onehot_matrix(list1))
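
Requirement 4 (saving each sentence's vector to a new document) is not covered above; a minimal sketch using numpy's savetxt, with the output name "bow_vectors.txt" as a placeholder:

import numpy as np

# Minimal sketch for requirement 4: write the vectors to a new file,
# one row per sentence. "bow_vectors.txt" is a hypothetical name.
bow = np.array(onehot_matrix(list1))
np.savetxt("bow_vectors.txt", bow, fmt="%d")

For comparison, gensim's corpora module can also build the dictionary and the sparse vectors directly; a sketch of that route:

from gensim import corpora
import re

# Alternative sketch with gensim: Dictionary assigns the word ids,
# and doc2bow returns sparse (word_id, count) pairs per sentence.
tokenized = [re.sub(r"[,.:]", "", s).split(" ") for s in list1]
dictionary = corpora.Dictionary(tokenized)
print(dictionary.token2id)
sparse_bow = [dictionary.doc2bow(doc) for doc in tokenized]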

Happy to discuss this with anyone interested.