Understanding regex in Java: split("\t") vs split("\\t") - when do they both work, and when should each be used

Problem description:

I recently figured out that I haven't been using regex properly in my code. Given a tab-delimited string str, I have been using str.split("\t"). Now I realize that this is wrong, and that to match tabs properly I should use str.split("\\t").

However, I stumbled upon this by pure chance while looking for regex patterns for something else. The faulty code split("\t") has been working fine in my case, and now I am confused about why it works if it is the wrong way to declare a regex that matches the tab character. Hence the question: I would like to actually understand how regex is handled in Java, instead of just copying code into Eclipse and not caring why it works.

In a similar fashion, I have come across a piece of text that is not only tab-delimited but also comma-delimited. More precisely, the tab-delimited lists I am parsing sometimes include "compound" items that look like item1,item2,item3, and I would like to parse them as separate elements for the sake of simplicity. In that case the appropriate regex should be line.split("[\\t,]"), or am I mistaken here as well?

Thanks in advance,

When using "\t", the escape sequence \t is replaced by the Java compiler with the character U+0009, so the regex engine receives a literal tab, which simply matches itself. When using "\\t", the escape sequence \\ in \\t is replaced by the Java compiler with a single \, resulting in the two-character string \t, which the regular expression parser then interprets as the character U+0009.

So both notations are interpreted correctly. The only difference is when the escape is replaced with the corresponding character: by the Java compiler, or later by the regular expression parser.
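To see the two stages side by side, here is a minimal sketch. The class name SplitDemo, its helper methods, and the sample line are made up for illustration; only String.split and Arrays.toString come from the standard library. It also tries the character class [\\t,] from the question, which should split on either delimiter.

```java
import java.util.Arrays;

public class SplitDemo {
    // "\t": the compiler has already turned the escape into U+0009,
    // so the regex engine receives a one-character pattern (a literal tab).
    static String[] splitLiteralTab(String s) {
        return s.split("\t");
    }

    // "\\t": the compiler produces the two characters '\' and 't';
    // the regex engine then reads that pair as its own tab escape.
    static String[] splitRegexTab(String s) {
        return s.split("\\t");
    }

    // Character class: a single tab or a single comma acts as the delimiter.
    static String[] splitTabOrComma(String s) {
        return s.split("[\\t,]");
    }

    public static void main(String[] args) {
        String line = "a\tb,c\td"; // hypothetical sample input

        System.out.println("\t".length());  // 1
        System.out.println("\\t".length()); // 2

        System.out.println(Arrays.toString(splitLiteralTab(line))); // [a, b,c, d]
        System.out.println(Arrays.toString(splitRegexTab(line)));   // [a, b,c, d]
        System.out.println(Arrays.toString(splitTabOrComma(line))); // [a, b, c, d]
    }
}
```

The two pattern strings differ in length, yet both tab splits return the same array, which is consistent with the explanation above. Note that this equivalence holds for \t because it is an escape in both Java string literals and regex syntax; regex-only escapes are a different matter.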