How to use Java word boundaries with apostrophes?
I am trying to delete all occurrences of a word in a list, but I am having trouble when the words contain apostrophes.
String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);
Output: has a bike and 's bike is red
What I want is: has a bike and bob's bike is red
I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to create the regex to handle apostrophes. Also, I would like it to work with dashes, so that in the phrase "the new mail is e-mail" only the first occurrence of mail would be replaced.
It all depends on what you understand a "word" to be. Perhaps you'd better define what you consider a word delimiter (for example, blanks, commas, ...) and write something like:
phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");
But you'll additionally have to check for occurrences at the start and the end of the string, because this pattern requires a delimiter on both sides of the word. For example:
String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
String word="bob";
phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
System.out.println(phrase);
This prints:
bob has a bike , and boba bob's bike is red and "bob" stuff.
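One way to avoid the separate start/end checks is to replace the consuming delimiter groups with lookarounds that only assert that no "word character" sits next to the match. The following is a minimal sketch, not the only way to do it; the class name WordRemover, the method removeWord, and the choice to treat apostrophes and hyphens as word characters are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class WordRemover {
    // Assumption: a "word character" is anything matched by \w,
    // plus the apostrophe and the hyphen.
    private static final String WORD_CHAR = "[\\w'-]";

    // Removes every occurrence of 'word' that is not adjacent to a word character.
    // Lookarounds consume nothing, so matches at the start or end of the string work too.
    static String removeWord(String phrase, String word) {
        String regex = "(?<!" + WORD_CHAR + ")" + Pattern.quote(word) + "(?!" + WORD_CHAR + ")";
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // "bob's" is kept because the apostrophe counts as part of the word
        System.out.println(removeWord("bob has a bike and bob's bike is red", "bob"));
        // "e-mail" is kept because the hyphen counts as part of the word
        System.out.println(removeWord("the new mail is e-mail", "mail"));
    }
}
```

Note that the surrounding spaces are left in place, so removing a word in the middle of a phrase leaves two consecutive spaces. Unlike the delimiter list above, this sketch would also remove a quoted "bob", since the double quote is not treated as a word character here.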
Update: If you insist on using \b, considering that the "word boundary" understands Unicode, you can also do this dirty trick: replace all occurrences of ' by some Unicode letter that you are sure will not appear in your text, and afterwards do the reverse replacement. Example:
String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
String word="bob";
phrase= phrase.replace("'","ñ").replace('"','ö');
phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
phrase= phrase.replace('ö','"').replace("ñ","'");
System.out.println(phrase);
Update: To summarize some comments below: one would expect \w and \b to have the same notion of what a "word character" is, as almost every regular-expression dialect does. Well, Java does not: \w considers ASCII, \b considers Unicode. It's an ugly inconsistency, I agree.
Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag allows you to specify consistent Unicode-only behaviour.
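To see what the flag changes, here is a small sketch that matches \w+ against a word containing non-ASCII letters, with and without Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeWordDemo and the helper firstWord are made up for this illustration; the non-ASCII letters are written as Unicode escapes so that the source file encoding does not matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // Returns the first run of \w characters found in 'text',
    // compiled with the given flags (0 means default behaviour).
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("\\w+", flags).matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String text = "\u00f1and\u00fa"; // the word "ñandú"

        // Default: \w is ASCII-only, so the accented letters are skipped
        System.out.println(firstWord(text, 0)); // prints "and"

        // With the flag: \w follows the Unicode definition of a word character
        System.out.println(firstWord(text, Pattern.UNICODE_CHARACTER_CLASS)); // prints "ñandú"
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded flag expression (?U).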