PHP: The correct way to convert special characters to HTML entities for the HTML source of emails

Problem description:

I have UTF-8 text in a string (let's call it the "plain text") and I need to inject that text inside HTML code.

I'm using htmlspecialchars to convert special characters (that may occur in the plain text) to HTML entities.

This is a common problem; however...

the resulting string is the HTML source of emails.

So I'm wondering whether specific measures should be taken in the conversion process.

I know there are some differences and inconsistencies in the way email clients render HTML.

Also, a rule of thumb I've often read is: write your HTML like you're in 2001.

Is htmlspecialchars good for the conversion task?

Also, which flags should I set?

Normally I use:

$html = htmlspecialchars( $text, ENT_QUOTES, 'UTF-8' );

Should I use ENT_QUOTES | ENT_HTML401?

In short, it depends on whether you want to send a UTF-8 email or an ASCII email.

UTF-8 email - plain htmlspecialchars is fine:

// We're telling it that $text is UTF-8 (see below about control characters).
// ENT_QUOTES keeps quote escaping on; passing ENT_DISALLOWED alone would switch it off.
$html = htmlspecialchars( $text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8' );

This will swap out <, >, ", ' and & for you. Anything else, like é, passes straight through unchanged (which is fine, as the email itself is UTF-8 too).
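As a rough illustration of the UTF-8 route, the snippet below escapes a made-up sample string. The sample text and variable names are assumptions for demonstration only, and ENT_QUOTES is combined with ENT_DISALLOWED here so that quotes are escaped as well.

```php
<?php
// Hypothetical sample input; any UTF-8 string would do.
$text = 'Café "menu" <b>& more</b>';

// Escape only the characters that are significant to HTML.
$html = htmlspecialchars($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

echo $html, PHP_EOL;
// Prints: Café &quot;menu&quot; &lt;b&gt;&amp; more&lt;/b&gt;
```

Note that the é is left as-is, so the email must declare UTF-8 as its charset for it to render correctly.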

ASCII email - you'll need an HTML 4.01 entity swap (ENT_HTML401, which is the default document type), with the same ENT_DISALLOWED flag:

// Same again - see below about the flags:
$html = htmlentities( $text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8' );

This will swap out as many characters as possible for named entities, so that things like é are represented in ASCII (as &eacute;).
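A small sketch of the ASCII route, again with an invented sample string, shows the accented characters being replaced by named entities. ENT_QUOTES is added here as an assumption so that quotes would be escaped too.

```php
<?php
// Hypothetical sample input containing Latin-1 accented characters.
$text = 'Café & crème brûlée';

// Convert every character that has a named HTML 4.01 entity.
$html = htmlentities($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

echo $html, PHP_EOL;
// Prints: Caf&eacute; &amp; cr&egrave;me br&ucirc;l&eacute;e

// For this input, every byte of the result is in the ASCII range.
var_dump(preg_match('/^[\x00-\x7F]*$/', $html) === 1); // bool(true)
```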

Which one is better?

This part depends entirely on your audience and the kinds of email clients you're expecting to interact with. A brief tour of history should help you decide!

Up until roughly 2006, the vast majority of the web was ASCII. Named character entities, such as &eacute;, existed to let web pages display a much broader range of Unicode characters, as well as characters that are significant to HTML itself. Here's the first issue: support for UTF-8 emails can be patchy.

If you're going for broad coverage with older clients, then sending an ASCII email is a safer bet. That means you'll need to convert every Unicode code point outside the ASCII range into an ASCII-compatible representation (HTML entities). Fundamentally this is targeting older clients, so using ENT_HTML5 - the greatly expanded entity set - makes no sense here.

However, here's the other issue: the older HTML 4.01 entity set covers far fewer Unicode code points, and htmlentities leaves any character without a named entity in that set unchanged. So if you're expecting to send text in a broad range of languages, you'll most likely need to send a UTF-8 email instead.
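If an ASCII-only body is still required for such text, one possible workaround (not something the answer above prescribes) is to turn whatever htmlentities leaves behind into numeric character references. This is a minimal sketch that assumes the mbstring extension is available; the helper name toAsciiHtml is hypothetical.

```php
<?php
// Hypothetical helper: named entities where possible, numeric references otherwise.
function toAsciiHtml(string $text): string
{
    // Named HTML 4.01 entities first (this also escapes <, >, & and quotes).
    $html = htmlentities($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

    // Then convert any remaining non-ASCII code point to a decimal reference.
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

echo toAsciiHtml('é and 日本語'), PHP_EOL;
// Prints: &eacute; and &#26085;&#26412;&#35486;
```

The order matters: running htmlentities first means the ampersands produced by the second step are not escaped again.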

UTF-8 vs. ASCII email self-test questions:

  • Do I need to support lots of languages? UTF-8.
  • Do I need to support few languages but as many clients as possible? ASCII.
  • Neither? UTF-8 is the default choice these days.

A note about control characters (ENT_DISALLOWED)

It's important to note that control characters - particularly the null byte - aren't handled by either htmlentities or htmlspecialchars by default. The null byte is also notorious for crashing things on the web, including, somewhat famously, Chrome with a short URL containing one. I'm unsure how many email clients handle the null byte correctly, but I'm inclined to think it's not many of them. The ENT_DISALLOWED flag replaces code points that are invalid for the document type, such as these, with the Unicode replacement character (U+FFFD), which is much safer.
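The effect of the flag can be checked with a short experiment. The sample string below is invented, and the comparison assumes the default HTML 4.01 document type and PHP 7 or later (for the "\u{FFFD}" escape).

```php
<?php
// Hypothetical input with an embedded null byte ("\0").
$text = "Hello\0World";

$plain = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$safe  = htmlspecialchars($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

var_dump(strpos($plain, "\0") !== false); // bool(true): the null byte survives
var_dump(strpos($safe, "\0") !== false);  // bool(false): it has been replaced
var_dump($safe === "Hello\u{FFFD}World"); // bool(true): U+FFFD took its place
```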