Extracting information from a pandas dataframe
I have the dataframe below. I want to build a rule engine to extract tokens that match a pattern, e.g. "UNITED STATES". What is the best way to do it? Is there anything like regex or CGUL for this kind of task? Any suggestions would be appreciated.
WORD_INDEX WORD_TOKEN WORD_POS
0 TRUMP PROPN
1 IS ADP
2 THE ADP
3 PRESIDENT NOUN
4 OF ADP
5 THE ADP
6 UNITED NOUN
7 STATES NOUN
I want to start with WORD_POS and find the WORD_TOKEN. Any idea how to do that? For example, I want to find the WORD_TOKENs where the WORD_POS is NOUN and the next WORD_POS is also NOUN.
You may want to use the str.contains string method, which treats its argument as a regex by default. For example:
# No capture group around the alternation: str.contains warns when the pattern has match groups
mask = df['WORD_TOKEN'].str.contains('UNITED|STATES')
print(df[mask])
This will match any token containing "UNITED" or "STATES". The match is case-sensitive by default; pass case=False to also match "united" or "states".
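Note that str.contains only looks at the token text. For the rule described in the question (a NOUN followed directly by another NOUN), one possible approach is to compare WORD_POS with a shifted copy of itself. The following is a minimal sketch, not a full rule engine. It rebuilds the sample dataframe from the question, and the variable names (is_noun, starts, in_pair, phrases) are illustrative assumptions.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "WORD_INDEX": range(8),
    "WORD_TOKEN": ["TRUMP", "IS", "THE", "PRESIDENT", "OF", "THE", "UNITED", "STATES"],
    "WORD_POS": ["PROPN", "ADP", "ADP", "NOUN", "ADP", "ADP", "NOUN", "NOUN"],
})

is_noun = df["WORD_POS"].eq("NOUN")

# shift(-1) lines each row up with the row that follows it,
# so `starts` is True where a NOUN is followed by another NOUN
starts = is_noun & is_noun.shift(-1, fill_value=False)

# A row is part of a pair if it starts one or the previous row started one
in_pair = starts | starts.shift(1, fill_value=False)
noun_pair_tokens = df.loc[in_pair, "WORD_TOKEN"].tolist()

# Join each starting token with the token after it to get the two-word phrase
phrases = (df["WORD_TOKEN"] + " " + df["WORD_TOKEN"].shift(-1))[starts].tolist()

print(noun_pair_tokens)
print(phrases)
```

On the sample data, PRESIDENT is not selected because the token after it is tagged ADP. The same shift-and-compare idea can be extended to longer patterns by combining more shifted masks, though for many rules a dedicated matcher may be easier to maintain.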