Extract only words from a cell array in MATLAB

Question:

I have a set of documents containing pre-processed text from HTML pages. They were given to me as-is. I want to extract only the words from them: no numbers, no common words, and no single letters. The first problem I am facing is this.

Suppose I have a cell array:

{'!' '!!' '!!!!)'  '!!!!thanks' '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'}

I want to reduce the cell array to only the entries that contain words, like this:

{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}

And then convert that into this cell array:

{'thanks' 'dogsbreath' 'endif' 'if'}

Is there any way to do this?

Updated requirement: Thanks for all of your answers. However, I am facing a problem. Let me illustrate it (please note that the cell values are text extracted from HTML documents and hence may contain non-ASCII values):

{'!/bin/bash'    '![endif]'    '!take-a-long'    '!â€"photo'}

The suggested answers give me this result:

{'bin'    'bash'    'endif'    'take'    'a'    'long'    'â'    'photo' }

My questions:

  • Why are bin/bash and take-a-long each being separated into several cells? It's not a problem for me, but why does it happen? Can it be avoided, so that all words coming from a single cell are combined into one cell?
  • Notice that '!â€"photo' contains a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?
  • I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
  • Also, the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want 'a_rinny_boo' and 'gypsy_wagon' to become 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done?

Update 1: Following all the suggestions, I have written a function that does most of these things, except for the two newly asked questions above.

function [Text_Data] = raw_txt_gn(filename)

% This function converts a text document into raw text.
% It removes commas, empty cells and other special characters.
% It also converts all the words of the text document to lowercase.

T = textread(filename, '%s');

% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);

% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};

T2 = []; count = 1;
for j=1:length(T1)    
    x = T1{j};
    ind=find(ismember(not_words,x), 1);
    if isempty(ind)

        B = regexp(x, '\w*', 'match');
        B(cellfun('isempty', B)) = []; % Clean out empty cells
        B = [B{:}]; % Flatten cell array

        % convert the string to lowercase so that case sensitivity is
        % handled consistently when generating the features
        x = lower(B);        

        T2{count,1} = x;
        count = count+1;
    end
end
T2 = T2(~cellfun('isempty',T2));


% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];

for j=1:length(T2)
    x = T2{j};
    % if a particular cell contains any digit then make it empty
    if sum(isstrprop(x, 'digit'))~=0
        T2{j} = [];
    end
    % also remove single character cells
    if length(x)==1
        T2{j} = [];
    end
    % also remove the most common words of the English language
    % (the list above is taken from Wikipedia)
    ind=find(ismember(not_words2,x), 1);
    if ~isempty(ind)
        T2{j} = [];
    end
end

Text_Data = T2(~cellfun('isempty',T2));


Update 2: I found code here that shows how to check for non-ASCII characters. Incorporating this snippet into my MATLAB function as

% empty any cell that contains a non-ASCII character
if any(x >= 128)
  T2{j} = [];
end

and then removing the empty cells seems to fulfil my second requirement, although any text containing even one non-ASCII character disappears completely.

Can my final requirements be met? Most of them concern the characters '_' and '-'.

A regexp approach that goes directly to the final step:

A = {'!' '!!' '!!!!)'  '!!!!thanks' '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'};

B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array

Here \w* matches any run of alphabetic, numeric, or underscore characters. For the sample case we get a 1x4 cell array:

B = 

    'thanks'    'dogsbreath'    'endif'    'if'



Why are bin/bash and take-a-long each being separated into several cells? It's not a problem for me, but why does it happen? Can it be avoided, so that all words coming from a single cell are combined into one cell?

Because I'm flattening the cell array to remove the nested cells. If you remove B = [B{:}]; each cell will instead hold a nested cell containing all of the matches for the corresponding input cell. You can then combine these however you want.
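As a minimal sketch of one way to combine them, the nested matches can be joined back into one string per input cell instead of being flattened. The choice of a single space as separator and the variable name joined are assumptions made for illustration, and the pattern [a-zA-Z]+ is used here in place of \w* so that digits and underscores are left out:

```matlab
% Sample input taken from the updated requirement in the question
A = {'!/bin/bash', '![endif]', '!take-a-long'};

% One nested cell of matches per input cell (no flattening)
B = regexp(A, '[a-zA-Z]+', 'match');

% Drop input cells that produced no match at all
B(cellfun('isempty', B)) = [];

% Join the words of each nested cell into a single string
joined = cellfun(@(c) strjoin(c, ' '), B, 'UniformOutput', false);
% joined = {'bin bash', 'endif', 'take a long'}
```

Passing '' instead of ' ' to strjoin would give 'binbash' and 'takealong', if the words should be fused rather than space-separated.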

Notice that '!â€"photo' contains a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?

Yes, but you'll have to build it yourself based on the character codes.
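A rough sketch of what such a step could look like is below. The lookup table (from and to) is hypothetical and covers only a handful of accented letters; a real table would need to be extended for whatever characters actually occur in the documents. Characters outside the table that are still non-ASCII are simply dropped here, which is also an assumption about the desired behaviour:

```matlab
% Hypothetical lookup table: codes of some accented letters and their plain forms
from = char([224 225 226 228 232 233 234 235]);  % a-grave, a-acute, a-circumflex, a-umlaut, then the same four for e
to   = 'aaaaeeee';

% Sample string built from character codes: '!', a-circumflex, euro sign, '"', 'photo'
x = ['!' char(226) char(8364) '"photo'];

% Replace every character found in the table by its plain counterpart
[tf, loc] = ismember(x, from);
x(tf) = to(loc(tf));

% Remove any remaining non-ASCII characters instead of emptying the whole cell
x(x >= 128) = [];
% x = '!a"photo'
```

Running the word-extraction regexp afterwards then yields 'a' and 'photo' for this string, so the rest of the text in the cell is kept.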

I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?

As I said, the regex matches alphabetic, numeric, or underscore characters. You can change the pattern to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This matches a-z, A-Z, and 0-9 only. It also excludes the non-ASCII characters, which \w* appears to match.
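To illustrate on tokens taken from the question, here is a small sketch. It uses + rather than * so that only non-empty runs can match, and it adds an optional step that discards tokens consisting only of digits; that extra step is an assumption based on the question's wish to exclude numbers, not part of the pattern itself:

```matlab
% Tokens taken from the sample texts in the question
A = {'it?', '__________', 'About', 'a_rinny_boo...', '173.', 'gypsy_wagon...'};

% Match runs of ASCII letters and digits only (underscore excluded)
B = regexp(A, '[a-zA-Z0-9]+', 'match');
B(cellfun('isempty', B)) = [];   % clean out cells with no match
B = [B{:}];                      % flatten the nested cells

% Optional: drop tokens that consist only of digits
B(cellfun(@(w) all(isstrprop(w, 'digit')), B)) = [];
% B = {'it', 'About', 'a', 'rinny', 'boo', 'gypsy', 'wagon'}
```

The underscore-only token disappears and the underscore-joined tokens are split into separate words. Single letters such as 'a' and common words such as 'it' would still need to be filtered afterwards.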