How do I use NLTK's nltk.tokenize.texttiling to split text into paragraphs?

Problem description:

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to actually get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.

When I feed my text into texttiling, I get the same untokenized text back, only wrapped in a list, which is of no use to me.

    import nltk

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(
        w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0],
        smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
    tiles = tt.tokenize(text)  # same text returned

What I have are emails that follow this basic structure:

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it would look like:

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

What I want to do is return these 5 sections / paragraphs of string s (LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER) separately, so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.

What about using splitlines? Or do you have to use the nltk package?

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017

        Hello team,                       (INTRO)

        Some text here representing
        the body                          (BODY)
        of the text.

        Regards,                          (OUTRO)
        X

        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

    y = [s.strip() for s in email.splitlines()]

    print(y)
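
Note that splitlines gives one entry per line, not one per section. If the sections are separated by blank lines, as in the sample email, one option is to split on blank lines instead. The following is only a minimal sketch under that assumption: `split_blocks` is a hypothetical helper name, and the sample is the email above with the parenthesised annotations removed. It does not label the blocks or decide which one is the body.

```python
import re


def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    # \n\s*\n matches a newline, optional whitespace (including further
    # newlines), and another newline, i.e. at least one blank line.
    raw_blocks = re.split(r"\n\s*\n", text.strip())
    blocks = []
    for raw in raw_blocks:
        # Strip every line and skip lines that are empty after stripping.
        lines = [line.strip() for line in raw.splitlines() if line.strip()]
        if lines:
            blocks.append("\n".join(lines))
    return blocks


# Sample email with the same structure as above (annotations removed).
email = """From: X
To: Y
Date: 10/03/2017

Hello team,

Some text here representing
the body
of the text.

Regards,
X

*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

blocks = split_blocks(email)
for index, block in enumerate(blocks):
    print(index, repr(block))
```

For this sample the body is `blocks[2]`. An email with no blank lines between its sections, such as the string s in the question, would come back as fewer blocks, so this only helps when the blank-line assumption holds.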