Why are "control" characters illegal in XML 1.0?

Problem description:

There are a variety of characters that are not legally encodeable in XML 1.0, e.g. U+0007 ('bell') and U+001B ('escape'). Most of the interesting ones are non-whitespace 'control' characters.
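For reference, the set of characters in question can be checked mechanically. The sketch below follows the `Char` production of the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF); the function names `is_xml10_char` and `illegal_chars` are made up for this illustration and do not come from any library.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # includes DEL (U+007F) and U+200C
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def illegal_chars(text: str) -> list[tuple[int, str]]:
    """List (index, 'U+XXXX') for every character that XML 1.0 forbids."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if not is_xml10_char(c)
    ]


if __name__ == "__main__":
    print(illegal_chars("ring\x07 and escape\x1b"))
```

Note that U+200C and DEL (U+007F) both pass this check, which is the inconsistency the question is asking about.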

It's clear from (e.g.) this question and others that it's the XML spec that's the issue -- but can anyone illuminate me as to why the XML spec forbids these characters?

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B; respectively, but perhaps there's a practical reason that the characters were forbidden rather than required to be escaped?
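To make the point concrete: in XML 1.0 a character reference must also resolve to a legal character, so escaping does not help. A small sketch using Python's standard `xml.etree.ElementTree` (which is backed by expat) shows the behavior; the helper name `parses_ok` is just for this example.

```python
import xml.etree.ElementTree as ET


def parses_ok(doc: str) -> bool:
    """Return True if doc is accepted as well-formed XML by the parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # A reference to tab (U+0009) is allowed.
    print("tab reference: ", parses_ok("<a>&#x9;</a>"))
    # A reference to bell (U+0007) is rejected, just like the raw character.
    print("bell reference:", parses_ok("<a>&#x7;</a>"))
    print("literal bell:  ", parses_ok("<a>\x07</a>"))
```

So the restriction applies to the document's character data as such, not only to how it is written in the file.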

Answerers have suggested that there is some motivation towards avoiding transmission control characters, but Unicode includes many other control-like characters (consider U+200C "zero width non joiner"). I recognize there may be no good reason for this behavior, but I would still like to understand it better.

It's particularly frustrating because when those character values appear in other data formats, I end up having to "double-escape" any new XML document that needs to encode them.
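As an illustration of what that double-escaping can look like, here is a minimal sketch of one possible application-level scheme. It is an assumption for the sake of example, not anything the XML spec defines: forbidden characters become `\uXXXX`, literal backslashes are doubled so the mapping is reversible, and normal XML escaping of `&` and `<` would still be applied on top. The names `escape_for_xml10` and `unescape_from_xml10` are hypothetical.

```python
import re


def _is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production, by code point."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def escape_for_xml10(text: str) -> str:
    """Replace XML-1.0-illegal characters with \\uXXXX; double literal backslashes."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif _is_xml10_char(ord(ch)):
            out.append(ch)
        else:
            # Every forbidden code point is <= U+FFFF, so four hex digits suffice.
            out.append(f"\\u{ord(ch):04X}")
    return "".join(out)


_ESCAPE = re.compile(r"\\(\\|u[0-9A-Fa-f]{4})")


def unescape_from_xml10(text: str) -> str:
    """Invert escape_for_xml10."""
    def repl(match: re.Match) -> str:
        token = match.group(1)
        return "\\" if token == "\\" else chr(int(token[1:], 16))
    return _ESCAPE.sub(repl, text)


if __name__ == "__main__":
    original = "bell\x07 esc\x1b back\\slash"
    escaped = escape_for_xml10(original)
    print(escaped)
    print(unescape_from_xml10(escaped) == original)
```

The cost is that every consumer of the document has to know about this private convention, which is exactly the interoperability loss the question is complaining about.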

My understanding is that this range is barred on the grounds that a markup language should not have any need to support transmission and flow control characters and including them would create a problem for any editors and parsers in binary conversion.

I'm struggling to find anything ex cathedra on this from Tim Bray et al though.

edit: some discussion of control chars and a vague admission it wasn't exactly over-engineered:

At 09:27 AM 17/06/00 -0500, Mark Volkmann wrote:

I've never seen a discussion of the reason why most ASCII control characters, such as a form feed, are not allowed in XML documents. Can anyone tell me the reason behind that decision or point me to a spec. that explains that?

I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm. Clearly, if you're optimizing for a highly interoperable content markup language (and XML is) it's legitimate to be suspicious of things like vertical-tab and backspace and so on... but then how can it be consistent to leave in \n and DEL and so on? -Tim