Regular expression for extracting college, university, and institute names?

Problem description:

I have a bunch of strings like this in a file:

M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985

I'd like to extract the institution names (Tufts University, American International College, American University, University of Massachusetts, etc.) but not the high schools. It's probably safe to assume that a name containing "Academy" or "High School" is a high school. Any ideas?

Tested with preg_match_all in PHP; this works for the sample text you provided:

 /(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
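As a quick check of the pattern above, here is a minimal sketch that runs it over the three sample lines from the question. It uses Python's re module as a stand-in for PHP's preg_match_all (an assumption on my part; the answer itself was only tested in PHP), and the helper name extract_institutions is made up for illustration.

```python
import re

# Pattern from the answer, unchanged apart from dropping PHP's / delimiters.
PATTERN = re.compile(r"(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)")

SAMPLE_LINES = [
    "M.S., Arizona University, Tucson, Az., 1957",
    "B.A., American International College, Springfield, Mass., 1978",
    "B.A., American University, Washington, D.C., 1985",
]

def extract_institutions(lines):
    """Return every institution name matched in the given lines (hypothetical helper)."""
    names = []
    for line in lines:
        # group(0) is the whole match; the capture group alone would only
        # hold the keyword (College, University or Institute).
        for match in PATTERN.finditer(line):
            # The match starts right after the comma, so it carries a leading space.
            names.append(match.group(0).strip())
    return names

print(extract_institutions(SAMPLE_LINES))
```

Processing the file line by line keeps \s from matching across a line break.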

This will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
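One possible modification, sketched here in Python rather than PHP, is to consume the surrounding delimiters as ordinary characters and capture the name in a group instead. This variant is my own assumption, not something the answer specifies, and the names PATTERN_NO_LOOKAROUND and extract_without_lookaround are hypothetical.

```python
import re

# Hypothetical lookaround-free variant: the leading comma and the trailing
# comma or digit are part of the match, and the name is captured in group 1.
PATTERN_NO_LOOKAROUND = re.compile(
    r",([\w\s]*(?:College|University|Institute)[^,\d]*)[,\d]"
)

def extract_without_lookaround(line):
    """Return the captured institution names found in one line."""
    # With exactly one capturing group, findall returns the group contents.
    return [name.strip() for name in PATTERN_NO_LOOKAROUND.findall(line)]

print(extract_without_lookaround("M.S., Arizona University, Tucson, Az., 1957"))
```

Because the trailing delimiter is consumed, two institution names separated by a single comma would not both match; for input shaped like the sample lines that should not matter.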

Update: I looked at your linked sample text and updated the regex accordingly:

 /([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/

The first part matches a word starting with a capital letter, optionally followed by a ".", then a whitespace character, then optionally a "(". This group is matched zero or more times.

This should capture all the relevant words preceding the keywords.
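To make the three parts of the updated pattern easier to see, here is a minimal sketch that assembles it from named pieces and runs it on a sample line. It again uses Python's re module in place of PHP (an assumption), and the names NAME_WORDS, KEYWORD, TAIL and extract_updated are hypothetical.

```python
import re

# Zero or more capitalised words, each optionally ending in "." and
# followed by whitespace and an optional "(".
NAME_WORDS = r"([A-Z][^\s,.]+[.]?\s[(]?)*"
# The keyword that marks the text as an institution name.
KEYWORD = r"(College|University|Institute|Law School|School of|Academy)"
# Anything after the keyword up to, but not including, a comma or digit.
TAIL = r"[^,\d]*(?=,|\d)"

UPDATED = re.compile(NAME_WORDS + KEYWORD + TAIL)

def extract_updated(line):
    """Return the full matches of the updated pattern in one line."""
    # group(0) is the whole match; strip() drops a trailing space that
    # can appear when a digit directly follows the name.
    return [m.group(0).strip() for m in UPDATED.finditer(line)]

print(extract_updated("B.A., American International College, Springfield, Mass., 1978"))
```

Note that this pattern lists Academy as a keyword, so if academies should be excluded as high schools, as the question suggests, the results would need to be filtered afterwards.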