Why does the length differ after converting a byte array to a String and back to a byte array?

Question:

I have the following Java code:

byte[] signatureBytes = getSignature();

String signatureString = new String(signatureBytes, "UTF8");
byte[] signatureStringBytes = signatureString.getBytes("UTF8");

System.out.println(signatureBytes.length == signatureStringBytes.length); // prints false

Q: I'm probably misunderstanding this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) were inverse operations?

Q: As a follow-up, what is a safe way to transport a byte[] array as a String?

Answer:

Not every byte[] is valid UTF-8. By default, the String constructor replaces each invalid sequence with a fixed replacement character (U+FFFD), which itself takes three bytes when encoded back to UTF-8. I think that is the reason for the length change.
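The effect described above can be reproduced with a single invalid byte. This is only a minimal sketch: the class name Utf8RoundTrip and the helper roundTrip are made up for illustration, and the input is not the signature from the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class Utf8RoundTrip {

    // Decodes the bytes as UTF-8, then encodes the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF never occurs in well-formed UTF-8, so decoding yields U+FFFD,
        // and U+FFFD encodes to the three bytes EF BF BD.
        byte[] invalid = {(byte) 0xFF};
        System.out.println(invalid.length + " -> " + roundTrip(invalid).length); // 1 -> 3

        // Well-formed UTF-8 survives the round trip unchanged.
        byte[] valid = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(valid.length + " -> " + roundTrip(valid).length); // 3 -> 3
    }
}
```

Signature bytes are effectively random, so they are very likely to contain sequences like this one.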

Try Latin-1 (ISO-8859-1). The problem should not occur there, because it is a simple single-byte encoding in which every byte value, and therefore every byte[], is meaningful.
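As a rough illustration of the Latin-1 suggestion, the sketch below round-trips all 256 byte values through ISO-8859-1. It also shows java.util.Base64, which is an alternative not mentioned in the answer above but commonly used when the resulting String must contain only printable ASCII. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, not part of the original answer.
class BytesAsString {

    // Latin-1 maps each byte value 0x00..0xFF to exactly one char and back.
    static String toLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Alternative (an addition, not from the answer): Base64 yields ASCII-only text.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, fromLatin1(toLatin1(all)))); // true
        System.out.println(Arrays.equals(all, fromBase64(toBase64(all)))); // true
    }
}
```

A Latin-1 String can contain control characters, so it only works if whatever carries the String preserves every char unchanged; Base64 costs about a third more space but avoids that concern.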

It should not happen with Windows-1252 either. That encoding has undefined sequences (in fact, undefined single bytes), but every char is encoded as a single byte. The new byte[] may differ from the original one, but the two lengths must be the same.
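The Windows-1252 claim can be checked with a sketch like the following. It assumes the Java runtime in use provides the "windows-1252" charset, which is not one of the charsets guaranteed on every platform; the class name is hypothetical, and 0x81 is used as an example of a byte that Windows-1252 leaves undefined.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class Windows1252RoundTrip {

    // Decodes and re-encodes with windows-1252.
    // Assumption: this charset is available in the running JVM.
    static byte[] roundTrip(byte[] input) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(input, cp1252).getBytes(cp1252);
    }

    public static void main(String[] args) {
        // 0x41 is 'A', 0x81 is undefined in Windows-1252, 0x80 is the euro sign.
        byte[] original = {(byte) 0x41, (byte) 0x81, (byte) 0x80};
        byte[] result = roundTrip(original);

        // The lengths match even if the undefined byte comes back changed.
        System.out.println(original.length == result.length);
        System.out.println(Arrays.equals(original, result));
    }
}
```

Because the content can change even when the length does not, this encoding is still not a safe way to carry arbitrary bytes.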