How to use Java word boundaries with apostrophes?

Question:

I am trying to delete all occurrences of a word in a list, but I run into trouble when the words contain apostrophes.

String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);

Output:
 has a bike and 's bike is red

What I want is:
 has a bike and bob's bike is red

I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to write a regex that handles apostrophes. I would also like it to work with dashes, so that in the phrase "the new mail is e-mail" only the first occurrence of "mail" is replaced.

Answer

It all depends on what you understand a "word" to be. Perhaps you'd better define what you consider a word delimiter (for example blanks, commas, and so on) and write something like

phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");

But you'll additionally have to check for occurrences at the start and the end of the string. For example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
  System.out.println(phrase);

This prints:

bob has a bike ,  and boba bob's bike is red and "bob" stuff.
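The occurrence at the very start of the string survives because no delimiter precedes it. One possible way to cover the start and the end without separate checks is to replace the capturing groups with negative lookarounds on the complement of the delimiter set. The following is only a sketch under that assumption: the class name `WordRemover`, the method `removeWord`, and the constant `DELIMS` are hypothetical names, and the delimiter set is simply the one used above.

```java
import java.util.regex.Pattern;

final class WordRemover {
    // Hypothetical delimiter set, taken from the answer above: whitespace , . ;
    private static final String DELIMS = "\\s,.;";

    // Removes `word` wherever it is NOT preceded and NOT followed by a
    // non-delimiter character. Start and end of input count as boundaries,
    // because a negative lookaround succeeds when there is no character at all.
    static String removeWord(String phrase, String word) {
        String regex = "(?<![^" + DELIMS + "])"
                + Pattern.quote(word)
                + "(?![^" + DELIMS + "])";
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // The apostrophe is not a delimiter, so "bob's" is left alone.
        System.out.println(removeWord("bob has a bike and bob's bike is red", "bob"));
        // The dash is not a delimiter either, so "e-mail" is left alone.
        System.out.println(removeWord("the new mail is e-mail", "mail"));
    }
}
```

Because lookarounds do not consume the delimiters, two adjacent occurrences such as "bob bob" are both matched, and no `$1$2` back-references are needed in the replacement. Anything that should count as a boundary (quotes, parentheses, and so on) would have to be added to the delimiter set.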

Update: If you insist on using \b, and considering that the "word boundary" understands Unicode, you can also use this dirty trick: replace all occurrences of ' with some Unicode letter that you are sure will not appear in your text, and afterwards do the reverse replacement. Example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase= phrase.replace("'","ñ").replace('"','ö');
  phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
  phrase= phrase.replace('ö','"').replace("ñ","'");
  System.out.println(phrase);

UPDATE: To summarize some comments below: one would expect \w and \b to share the same notion of what a "word character" is, as almost every regular-expression dialect does. Well, Java does not: \w considers ASCII, \b considers Unicode. It's an ugly inconsistency, I agree.

Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag makes it possible to request consistent Unicode-only behaviour, see e.g. here.
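As a rough illustration of what the flag changes, the sketch below compares \w with and without `Pattern.UNICODE_CHARACTER_CLASS` and then uses \b with the flag. The class and method names (`UnicodeWordDemo`, `asciiWordMatches`, `unicodeWordMatches`, `findsWholeWord`) are made up for this example, and the letter ñ is written as the escape `\u00F1` so that the source does not depend on the file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeWordDemo {
    // Default behaviour: \w only covers [a-zA-Z_0-9].
    static boolean asciiWordMatches(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // With the flag, \w follows the Unicode definition of a word character.
    static boolean unicodeWordMatches(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    // With the flag, \b uses the same Unicode notion of a word character as \w.
    static boolean findsWholeWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(asciiWordMatches("\u00F1"));                // false
        System.out.println(unicodeWordMatches("\u00F1"));              // true
        System.out.println(findsWholeWord("bob\u00F1s bike", "bob"));  // false
        System.out.println(findsWholeWord("bob, bike", "bob"));        // true
    }
}
```

Note that the flag does not make the apostrophe a word character, so \b still sees a boundary between "bob" and "'s".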
