等效于具有多个字符定界符的StringTokenizer
我尝试将String拆分为令牌.
I try to split a String into tokens.
令牌定界符不是单个字符,某些定界符包含在其他字符中(例如&和&&),并且我需要将定界符作为令牌返回.
StringTokenizer无法处理多个字符定界符.我认为String.split可以实现此功能,但无法猜测出适合我需要的神奇正则表达式.
The token delimiters are not single characters, some delimiters are included into others (example, & and &&), and I need to have the delimiters returned as token.
StringTokenizer is not able to deal with multiple characters delimiters. I presume it's possible with String.split, but fail to guess the magical regular expression that will suits my needs.
有什么主意吗?
示例:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: an string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
-编辑---
感谢大家的帮助,Dasblinkenlight为我提供了解决方案.这是我在他的帮助下编写的即用型"代码:
--- Edit ---
Thanks to all for your help, Dasblinkenlight gives me the solution. Here is the "ready to use" code I wrote with his help:
private static String[] wonderfulTokenizer(String string, String[] delimiters) {
// First, create a regular expression that matches the union of the delimiters
// Be aware that, in case of delimiters containing others (example && and &),
// the longer may be before the shorter (&& should be before &) or the regexpr
// parser will recognize && as two &.
Arrays.sort(delimiters, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return -o1.compareTo(o2);
}
});
// Build a string that will contain the regular expression
StringBuilder regexpr = new StringBuilder();
regexpr.append('(');
for (String delim : delimiters) { // For each delimiter
if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
for (int i = 0; i < delim.length(); i++) {
// Add an escape character if the character is a regexp reserved char
regexpr.append('\\');
regexpr.append(delim.charAt(i));
}
}
regexpr.append(')'); // Close the union
Pattern p = Pattern.compile(regexpr.toString());
// Now, search for the tokens
List<String> res = new ArrayList<String>();
Matcher m = p.matcher(string);
int pos = 0;
while (m.find()) { // While there's a delimiter in the string
if (pos != m.start()) {
// If there's something between the current and the previous delimiter
// Add it to the tokens list
res.add(string.substring(pos, m.start()));
}
res.add(m.group()); // add the delimiter
pos = m.end(); // Remember end of delimiter
}
if (pos != string.length()) {
// If it remains some characters in the string after last delimiter
// Add this to the token list
res.add(string.substring(pos));
}
// Return the result
return res.toArray(new String[res.size()]);
}
如果您只通过一次创建Pattern就可以有很多要标记的字符串,那可能是最佳选择.
It could be optimize if you have many strings to tokenize by creating the Pattern only one time.
You can use the Pattern
and a simple loop to achieve the results that you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
此产生以下结果:
's'
'='
'a'
'&'
'=>'
'b'