


I have a bunch of strings like this in a file:

M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985


and I'd like to extract the school names (Tufts University, American International College, American University, University of Massachusetts, etc.), but not the high schools. It's probably safe to assume that an entry containing "Academy" or "High School" is a high school. Any ideas?

Tested with preg_match_all in PHP; this works for the sample text you provided:



It will need some modification if your regex engine does not support lookaheads/lookbehinds.

Update: I looked at your linked sample text and updated the regex accordingly:

 /([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/

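The first part matches a word that starts with a capital letter, followed by one or more characters other than whitespace, commas, or periods, and optionally followed by a period (.). Then comes a whitespace character, then optionally an opening parenthesis. This pattern is matched zero or more times.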


This should get all relevant words preceding the keywords.
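
For reference, here is a minimal sketch of how the pattern might be applied with preg_match_all. The sample text is copied from the question; the variable names ($text, $pattern, $schools) and the final filtering step are my own assumptions for illustration. The filter simply follows the question's rule of thumb that "Academy" or "High School" indicates a high school.

```php
<?php
// Sample lines copied from the question.
$text = <<<'TXT'
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
TXT;

// The regex from this answer, unchanged. Single quotes keep the backslashes literal.
$pattern = '/([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/';

// $matches[0] holds the full text of every match.
preg_match_all($pattern, $text, $matches);
$schools = array_map('trim', $matches[0]);

// Assumption (taken from the question, not part of the regex): treat anything
// containing "Academy" or "High School" as a high school and drop it.
$schools = array_values(
    preg_grep('/Academy|High School/', $schools, PREG_GREP_INVERT)
);

print_r($schools);
```

For the three sample lines this should yield Arizona University, American International College, and American University. If you do want academies kept, remove the preg_grep step.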