Exotic names for methods, constants, variables and fields - Bug or Feature?

Question:

after some confusion in the comments to

I thought I'd make this into a question. According to the PHP manual, a valid class name should match [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*. But apparently this is not enforced, nor does it apply to anything else:

define('π', pi());
var_dump(π);

class ␀ {
    private $␀ = TRUE;
    public function ␀()
    {
        return $this->␀;
    }
}

$␀ = new ␀;
var_dump($␀ );
var_dump($␀->␀());

works fine (even though my IDE cannot show ␀). Can some erudite person clear this up for me? Can we use any Unicode? And if so, since when? Not that I would actually want to use anything but A-Za-z_ but I'm curious.

Clarification: I am not after a Regex to validate class names, nor do I know if PHP internally uses the Regex it suggests in the manual. The thing that confused me (and apparently the other guys in the linked question) is why things like $☂ = 1 can be used in PHP at all. PHP6 was supposed to be the Unicode release, but PHP6 is on hiatus. If there is no Unicode support, why can I do this?

This question starts out by mentioning class names, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones (class, function, and method names).

The general guideline here is to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

<?php
function func_á() { echo "worked"; }
func_Á();

Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:


$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php

Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3
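
The two runs above can be reproduced without changing the locale by simulating the lowercasing step. This is only a sketch: lowerLatin1() and lowerAscii() are hypothetical helpers, not PHP internals. The first mimics what tolower() does to each byte under an ISO-8859-1 locale; the second touches only A-Z, which approximates a locale in which byte 193 is not treated as a letter.

```php
<?php
// Hypothetical helper: lowercase only the ASCII letters A-Z.
function lowerAscii(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Hypothetical helper: mimic per-byte tolower() under ISO-8859-1, where
// bytes 0xC0-0xDE (except 0xD7, the multiplication sign) are uppercase letters.
function lowerLatin1(string $s): string
{
    $out = '';
    foreach (str_split($s) as $byte) {
        $code = ord($byte);
        $isUpper = ($code >= 0x41 && $code <= 0x5A)
            || ($code >= 0xC0 && $code <= 0xDE && $code !== 0xD7);
        $out .= $isUpper ? chr($code + 0x20) : $byte;
    }
    return $out;
}

$declared = "func_\xE1"; // func_á as stored in the ISO-8859-1 file
$called   = "func_\xC1"; // func_Á as stored in the ISO-8859-1 file (0xC1 = 193)

var_dump(lowerLatin1($called) === $declared); // bool(true): the call would resolve
var_dump(lowerAscii($called) === $declared);  // bool(false): undefined function
```

Whether the lookup succeeds depends entirely on whether the lowercasing step maps byte 0xC1 to 0xE1.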

Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may cause trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.

For the case-sensitive identifiers (variables, constants, and fields), the problem is less serious. However, they are just interpreted as byte streams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.
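
A BOM is a problem because its bytes sit in front of the opening tag, so PHP treats them as literal output rather than as part of the script. A minimal sketch for spotting a UTF-8 BOM in a script's source follows; hasUtf8Bom() is a hypothetical helper name and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: report whether a script's source starts with a UTF-8 BOM.
function hasUtf8Bom(string $source): bool
{
    return strncmp($source, "\xEF\xBB\xBF", 3) === 0;
}

// Made-up sample sources; in practice they would come from file_get_contents().
$clean   = "<?php echo 'hi';";
$withBom = "\xEF\xBB\xBF" . $clean;

var_dump(hasUtf8Bom($clean));   // bool(false)
var_dump(hasUtf8Bom($withBom)); // bool(true)
```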

In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc0 to 0xfd (0xc2 to 0xf4 in the current definition, RFC 3629) and trail bytes in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the second byte will be interpreted as a space.
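
The byte-level reading can be checked directly against the pattern from the manual. The sketch below assumes the pattern is applied to raw bytes (no /u modifier); matchesManualPattern() is a hypothetical helper, and whether PHP's lexer uses this exact regex internally is not something the source confirms.

```php
<?php
// Hypothetical helper: test a name against the manual's pattern byte by byte.
// Without the /u modifier, PCRE sees raw bytes rather than code points.
function matchesManualPattern(string $name): bool
{
    return preg_match('/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/D', $name) === 1;
}

$samples = [
    'umbrella, UTF-8' => "\xE2\x98\x82", // ☂ U+2602
    'G-dot, UTF-8'    => "\xC4\xA0",     // Ġ U+0120
    'G-dot, UTF-16BE' => "\x01\x20",     // Ġ U+0120
];

foreach ($samples as $label => $bytes) {
    printf(
        "%-16s %-8s %s\n",
        $label,
        bin2hex($bytes),
        matchesManualPattern($bytes) ? 'matches' : 'does not match'
    );
}
```

Both UTF-8 samples match because every one of their bytes falls in 0x7f-0xff, while the UTF-16BE sample fails on its first byte.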

Reading multi-byte characters as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:

<?php
declare(encoding='ISO-8859-1');
// code here
?>

It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

  • Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (and it also seems to keep the original version in memory), and it also spends some CPU converting the encoding.
  • Multi-byte support is usually not compiled in, so it's less tested (more bugs).
  • Portability issues between installations that have the support compiled in and those that don't.
  • Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

Finally, there is the problem of lack of normalization: the same character may be represented with different Unicode code points (independently of the encoding). This may lead to bugs that are very difficult to track down.
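
As an illustration of the normalization problem, the sketch below builds two names that render identically as "café" but differ in their bytes, and uses them as variable names through variable variables. The names are made up for the example, and the \u{...} escapes assume PHP 7 or later.

```php
<?php
// Two spellings that render identically on screen.
$composed   = "caf\u{00E9}";  // é as the single code point U+00E9
$decomposed = "cafe\u{0301}"; // e followed by the combining acute accent U+0301

// Variable variables let both byte strings act as variable names.
$$composed = 'assigned';

var_dump(bin2hex($composed));   // string(10) "636166c3a9"
var_dump(bin2hex($decomposed)); // string(12) "63616665cc81"
var_dump(isset($$composed));    // bool(true)
var_dump(isset($$decomposed));  // bool(false): PHP sees a different variable
```

Since identifiers are compared as byte streams, nothing in the language reconciles the two forms; the source files themselves have to be consistent.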