Text Mining with R

Problem description:

I need help with text mining using R.

Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.

I would just like to extract the opinions from what the people said.

I would also like help with capturing percentages (e.g. 9.8%): when I split the sentences on the full stop ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".

Below is the output I would like to obtain:

Title      Date            Content    
Boy        May 13 2015     she is pretty
Animal     June 14 2015    the penguin is cute
Human      March 09 2015   every human is smart
Monster    Jan 22 2015     john has $10.80

Below is the code I tried, which didn't produce the desired output:

list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]

You're close, but your regular expression for splitting is wrong. The following gives the correct arrangement of the data, modulo your request to extract the opinions more exactly:

txt <- '
Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.
'

txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
dataframe <- read.table(sep = "|", text = txt, header = TRUE,
                        quote = "", strip.white = TRUE,
                        stringsAsFactors = FALSE)
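The gsub step collapses every run of two or more spaces (the column gaps) into a single | delimiter while leaving single spaces inside fields alone; a quick sanity check on one made-up row:

```r
# Runs of 2+ spaces become "|"; single spaces (e.g. in the date) survive
row <- 'Boy        May 13 2015     "She is pretty", Tom said.'
gsub(" {2,}(?=\\S)", "|", row, perl = TRUE)
# 'Boy|May 13 2015|"She is pretty", Tom said.'
```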

list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")

content <- strsplit(dataframe$Content, '\\.(?= )', perl=TRUE)
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
result <- stack(opinions)

In your sample data, every full stop followed by a space ends a sentence, so that is what the regular expression \.(?= ) matches. However, it will also break up sentences like "I was born in the U.S.A. but I live in Canada", so you may need additional pre-processing and checking.
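This lookahead is also what fixes your percentage problem: the full stop inside 0.8% is not followed by a space, so it is never treated as a sentence boundary. A quick check on an invented sentence:

```r
# "." inside "0.8%" is kept; only a dot followed by a space splits.
# The lookahead space is not consumed, so the second piece keeps it.
strsplit("His result improved by 0.8%. He said so.", "\\.(?= )", perl = TRUE)[[1]]
# "His result improved by 0.8%"  " He said so."
```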

Then, assuming the Titles are unique identifiers, you can simply merge to add the dates back in. Note that stack() names its columns values and ind, so rename them first:

names(result) <- c("Content", "Title")
result <- merge(dataframe[c("Title", "Date")], result, by = "Title")

As mentioned in the comments, the NLP task itself has more to do with text parsing than with R programming. You can probably get some mileage out of searching for a pattern like

<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...

which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yields a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
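As a toy illustration of that kind of template matching in base R (the word lists below are invented for the example, not a real lexicon):

```r
# Tiny hand-made lexicon; a real system would use a POS-tagged dictionary
nouns      <- c("penguin", "human", "result")
adjectives <- c("cute", "smart", "pretty")

# "<noun> is <adjective>" template expressed as a regular expression
pattern <- sprintf("\\b(%s) is (%s)\\b",
                   paste(nouns, collapse = "|"),
                   paste(adjectives, collapse = "|"))

sentences <- c("The penguin is cute", "every human is smart", "john has $10.80")
grepl(pattern, sentences)
# TRUE TRUE FALSE
```

This only scratches the surface, of course: real opinion extraction needs morphology, negation handling, and a much larger lexicon, which is where the sentiment-analysis literature comes in.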