How to use NLTK nltk.tokenize.texttiling to split text into paragraphs?
I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to actually get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.
import nltk
tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text)  # same text returned
What I have are emails that follow this basic structure:
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it would look like:
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.
What about using splitlines? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
# Split the email into lines and strip surrounding whitespace from each one
y = [s.strip() for s in email.splitlines()]
print(y)
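Note that splitlines on its own gives a flat list of lines, not the five sections. If the real emails separate their sections with blank lines (an assumption; the sample as pasted does not show any), the stripped lines could be grouped into blocks as in the rough sketch below. The names split_blocks and sample are made up for illustration, and the blank lines in sample are added by hand, so treat this as a starting point and not a general solution.

```python
def split_blocks(text):
    """Group consecutive non-blank lines into blocks (one block per section)."""
    blocks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Non-blank line: it belongs to the block being built
            current.append(stripped)
        elif current:
            # Blank line: close the current block, if there is one
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical sample: the email from the question, with blank lines
# assumed between the sections.
sample = (
    "From: X\nTo: Y\nDate: 10/03/2017\n\n"
    "Hello team,\n\n"
    "Some text here representing\nthe body\nof the text.\n\n"
    "Regards,\nX\n\n"
    "*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL"
)

blocks = split_blocks(sample)
for block in blocks:
    print(block)
```

Picking out the body still needs some rule of your own (for example, by position or by block length), since the blocks carry no labels.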