Correct way to decode incoming email subjects (UTF-8)
I'm trying to pipe my incoming mail to a PHP script so I can store it in a database and do other things with it. I'm using the class MIME E-mail message parser (registration required), although I don't think that matters.
I have a problem with email subjects.
It works fine when the title is in English, but if the subject uses non-Latin characters I get something like
=?UTF-8?B?2KLYstmF2KfbjNi0?=
for a title like یک دو سه
I decode the subject like this:
$subject = str_replace('=?UTF-8?B?' , '' , $subject);
$subject = str_replace('?=' , '' , $subject);
$subject = base64_decode($subject);
It works fine with short subjects of around 10-15 characters, but with a longer title I get half of the original title followed by something like ��� at the end, and if the title is even longer, around 30 characters, I get nothing at all. Am I doing this right?
Even though this question is almost a year old, I found it while facing a similar problem.
I'm unsure why you're getting odd characters, but perhaps you are trying to display them somewhere that doesn't support your charset.
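One possible cause, offered as an assumption since the question does not show the raw header: long subjects are usually split into several encoded-words, each base64-encoded on its own. Stripping the markers and decoding the concatenation in one pass then misaligns the data after the first padding character. The sketch below uses made-up ASCII data so the effect is easy to read; the variable names are illustrative only.

```php
<?php
// Illustrative data only: a subject split into two encoded-words,
// the way a mailer may fold a long header.
$original = 'abcd';
$header = '=?UTF-8?B?' . base64_encode('ab') . '?= '
        . '=?UTF-8?B?' . base64_encode('cd') . '?=';

// Approach from the question: strip the markers, then decode once.
$naive = str_replace(array('=?UTF-8?B?', '?='), '', $header);
$naive = base64_decode($naive);

// Alternative: decode each encoded-word separately, then join the results.
preg_match_all('/=\?UTF-8\?B\?([^?]+)\?=/', $header, $m);
$perWord = implode('', array_map('base64_decode', $m[1]));

var_dump($naive === $original);   // single-pass decode
var_dump($perWord === $original); // per-word decode
```

Decoding word by word is what the class below does, which is why it copes with longer subjects.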
Here's some code I wrote which should handle everything except the charset conversion, which is a large problem that many libraries handle much better (PHP's MB library, for instance).
class mail {
    /**
     * If you change one of these, please check the other for fixes as well
     *
     * @const Pattern to match RFC 2047 charset encodings in mail headers
     */
    const rfc2047header = '/=\?([^ ?]+)\?([BQbq])\?([^ ?]+)\?=/';
    // The lookahead leaves the second encoded-word available for the next match,
    // so whitespace is also stripped across runs of three or more encoded-words.
    const rfc2047header_spaces = '/(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)\s+(?==\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)/';

    /**
     * http://www.rfc-archive.org/getrfc.php?rfc=2047
     *
     * =?<charset>?<encoding>?<data>?=
     *
     * @param string $header
     */
    public static function is_encoded_header($header) {
        // e.g. =?utf-8?q?Re=3a=20Support=3a=204D09EE9A=20=2d=20Re=3a=20Support=3a=204D078032=20=2d=20Wordpress=20Plugin?=
        // e.g. =?utf-8?q?Wordpress=20Plugin?=
        return preg_match(self::rfc2047header, $header) !== 0;
    }

    public static function header_charsets($header) {
        $matches = null;
        if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_PATTERN_ORDER)) {
            return array();
        }
        return array_map('strtoupper', $matches[1]);
    }

    public static function decode_header($header) {
        $matches = null;
        /* Repair instances where encodings are together and separated by a space (strip the spaces) */
        $header = preg_replace(self::rfc2047header_spaces, "$1", $header);
        /* Now see if any encodings exist and match them */
        if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_SET_ORDER)) {
            return $header;
        }
        foreach ($matches as $header_match) {
            list($match, $charset, $encoding, $data) = $header_match;
            $encoding = strtoupper($encoding);
            switch ($encoding) {
                case 'B':
                    $data = base64_decode($data);
                    break;
                case 'Q':
                    // In Q encoding an underscore stands for a space
                    $data = quoted_printable_decode(str_replace("_", " ", $data));
                    break;
                default:
                    throw new Exception("preg_match_all is busted: didn't find B or Q in encoding $header");
            }
            // This part needs to handle every charset
            switch (strtoupper($charset)) {
                case "UTF-8":
                    break;
                default:
                    /* Here's where you should handle other character sets! */
                    throw new Exception("Unknown charset in header - time to write some code.");
            }
            $header = str_replace($match, $data, $header);
        }
        return $header;
    }
}
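For the charset branch that the class leaves unhandled, a minimal sketch of what the conversion could look like with the mbstring extension follows. The helper name convert_to_utf8 is hypothetical and not part of the class; it assumes mbstring is installed and that the charset label from the header is one mbstring recognises.

```php
<?php
// Hypothetical helper (not part of the class above): convert decoded
// header bytes from the charset named in the header to UTF-8.
function convert_to_utf8(string $data, string $charset): string
{
    if (strtoupper($charset) === 'UTF-8') {
        return $data; // already UTF-8, nothing to convert
    }
    // mb_convert_encoding(string, to_encoding, from_encoding)
    return mb_convert_encoding($data, 'UTF-8', $charset);
}

// "\xE9" is "é" in ISO-8859-1; the result is its two-byte UTF-8 form.
var_dump(bin2hex(convert_to_utf8("\xE9", 'ISO-8859-1')));
```

An unrecognised charset label is reported by mb_convert_encoding itself (a warning or an error, depending on the PHP version), so a caller would probably want to catch that case rather than trust the header blindly.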
When the header is decoded by this class and displayed in a browser using UTF-8, the result is:
آزمایش
You can run it like this:
$decoded = mail::decode_header("=?UTF-8?B?2KLYstmF2KfbjNi0?=");
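As a point of comparison, PHP also ships a built-in decoder for this header format. The sketch below assumes the iconv extension is enabled and asks for UTF-8 output; it is an alternative to the class, not a description of how the class works.

```php
<?php
// Same encoded subject as in the question.
$raw = "=?UTF-8?B?2KLYstmF2KfbjNi0?=";

// iconv_mime_decode(string, mode, output charset); returns false on failure.
$decoded = iconv_mime_decode($raw, 0, "UTF-8");

if ($decoded === false) {
    echo "Could not decode header\n";
} else {
    echo $decoded, "\n";
}
```

The built-in handles the charset conversion as well, which may make it the simpler choice when the extension is available.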