Extracting hashtags from a column of a pandas DataFrame

Problem description:

I have a DataFrame df. I want to extract the hashtags from the tweets where Max == 45:

Max    Tweets
42   via @VIE_unlike at #fashion
42   Ny trailer #katamaritribute #ps3
45   Saved a baby bluejay from dogs #fb
45   #Niley #Niley #Niley 

I am trying something like this, but it gives an empty DataFrame:

df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]

Is there something in pandas that I can use to do this efficiently?

You can use pd.Series.str.findall:

In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]: 
0                  [#fashion]
1    [#katamaritribute, #ps3]
2                       [#fb]
3    [#Niley, #Niley, #Niley]

This returns a column of lists.

If you want to filter first and then find, you can do so quite easily using boolean indexing:

In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]: 
2                       [#fb]
3    [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
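For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch. It rebuilds the sample frame from the question by hand (the construction is an assumption about how df was created) and uses df.loc with a boolean mask, which selects the same rows as the chained indexing above.

```python
import pandas as pd

# Sample data copied from the question; how df was really built is assumed.
df = pd.DataFrame({
    "Max": [42, 42, 45, 45],
    "Tweets": [
        "via @VIE_unlike at #fashion",
        "Ny trailer #katamaritribute #ps3",
        "Saved a baby bluejay from dogs #fb",
        "#Niley #Niley #Niley ",
    ],
})

pattern = r"#.*?(?=\s|$)"

# Filter the rows first, then extract every hashtag from each remaining tweet.
hashtags = df.loc[df["Max"] == 45, "Tweets"].str.findall(pattern)
print(hashtags)
```

The result is a Series of lists that keeps the original row labels (2 and 3 here), so it can be assigned back to the frame or joined against it if needed.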


The regex used here is:

#.*?(?=\s|$)

To understand it, break it down:

  • #.*? - a non-greedy match for a word starting with #, consuming as few characters as possible
  • (?=\s|$) - a lookahead that requires whitespace or the end of the string to follow, so the match stops at the end of the word
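The effect of each piece can be checked with the standard re module, since Series.str.findall is documented as equivalent to applying re.findall to every element. The sketch below compares the pattern against two variants that are not part of the answer and are shown only for contrast: a greedy one, and one without the lookahead.

```python
import re

tweet = "Ny trailer #katamaritribute #ps3"

# The pattern from the answer: lazy match, stopped by the lookahead.
lazy = re.findall(r"#.*?(?=\s|$)", tweet)

# Greedy variant (for contrast only): .* runs to the end of the string,
# so both hashtags are swallowed into a single match.
greedy = re.findall(r"#.*(?=\s|$)", tweet)

# Without the lookahead (for contrast only): the lazy .*? matches nothing,
# so only the bare # characters are returned.
no_lookahead = re.findall(r"#.*?", tweet)

print(lazy)          # ['#katamaritribute', '#ps3']
print(greedy)        # ['#katamaritribute #ps3']
print(no_lookahead)  # ['#', '#']
```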

If a # can appear in the middle of a word that is not a hashtag, that would yield false positives you don't want. In that case, you can modify the regex to include a lookbehind:

(?:(?<=\s)|(?<=^))#.*?(?=\s|$)

The lookbehind asserts that either whitespace or the start of the string must precede the # character.
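A quick way to see the difference is to run both patterns over text that contains a # inside a word. The sample strings below are made up for illustration, and plain re.findall stands in for Series.str.findall.

```python
import re

plain = r"#.*?(?=\s|$)"
with_lookbehind = r"(?:(?<=\s)|(?<=^))#.*?(?=\s|$)"

# Hypothetical tweets: "C#" and "room#12" are not hashtags.
samples = [
    "Learning C# today #coding",
    "#start of the tweet, room#12 booked",
]

for text in samples:
    print(re.findall(plain, text), re.findall(with_lookbehind, text))

plain_hits = [re.findall(plain, s) for s in samples]
strict_hits = [re.findall(with_lookbehind, s) for s in samples]
```

With the plain pattern, the # in "C#" and the "#12" in "room#12" are picked up as hashtags. With the lookbehind, only "#coding" and "#start" remain. The same pattern string can be passed unchanged to df.Tweets.str.findall.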