Extracting information from a pandas DataFrame

Problem description:

I have the DataFrame below. I want to build a rule engine that extracts tokens matching a pattern, e.g. "UNITED STATES". What is the best way to do this? Is there something like regex or CGUL for this kind of task? Any suggestions would be appreciated.

WORD_INDEX  WORD_TOKEN  WORD_POS
0           TRUMP       PROPN
1           IS          ADP
2           THE         ADP
3           PRESIDENT   NOUN
4           OF          ADP
5           THE         ADP
6           UNITED      NOUN
7           STATES      NOUN

I want to start from WORD_POS and find the corresponding WORD_TOKEN values. Any idea how to do that? For example, I want to find the WORD_TOKENs where WORD_POS is NOUN and the next WORD_POS is also NOUN.

You may want to use the str.contains string method, which treats its argument as a regex by default. For example:

# Plain alternation; wrapping it in a capture group makes pandas emit a UserWarning.
mask = df['WORD_TOKEN'].str.contains('UNITED|STATES')
print(df[mask])

This matches any token containing "UNITED" or "STATES". The match is case-sensitive by default; pass case=False to ignore case.
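For the rule stated in the question (a NOUN immediately followed by another NOUN), matching on WORD_TOKEN is not needed; one possible approach is to compare WORD_POS with a shifted copy of itself. The following is only a minimal sketch: it rebuilds the sample DataFrame from the question, and the variable names (is_noun, next_is_noun, starts_pair, pairs) are illustrative, not part of any existing API.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "WORD_INDEX": range(8),
    "WORD_TOKEN": ["TRUMP", "IS", "THE", "PRESIDENT",
                   "OF", "THE", "UNITED", "STATES"],
    "WORD_POS": ["PROPN", "ADP", "ADP", "NOUN",
                 "ADP", "ADP", "NOUN", "NOUN"],
})

# True where the current row is a NOUN.
is_noun = df["WORD_POS"].eq("NOUN")
# True where the *next* row is a NOUN (the last row has no next row).
next_is_noun = is_noun.shift(-1, fill_value=False)

# Rows that start a NOUN NOUN pair.
starts_pair = is_noun & next_is_noun

# Pair each starting token with the token that follows it.
next_token = df["WORD_TOKEN"].shift(-1)
pairs = list(zip(df.loc[starts_pair, "WORD_TOKEN"], next_token[starts_pair]))
print(pairs)
```

On the sample data only row 6 starts such a pair, so pairs holds the single tuple ("UNITED", "STATES"); PRESIDENT is skipped because the token after it is tagged ADP. Longer rules (say, three tags in a row) could be built the same way by combining further shifts.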