如何使用PHP在一个独特的MySQL列中存储UTF-8电子邮件地址?

问题描述:

I'm trying to support UTF-8 characters into email addresses. If I understand correctly, email addresses are limited to 254 usable (ASCII) characters. Based on this, I would like to store email address in a VARCHAR(254) ASCII MySQL InnoDB column. One of the problems I'm encountering is to validate such scenarios. I'm trying to convert UTF-8 to ASCII but getting mixed results as shown below (I know the example is not a valid email but I could have used other characters - this is just to explain the problem):

<?php
$string = '

I'm adding the missing details in my own answer (special thanks for Ignacio, andig, Martin and Markus Laire for helping me to put the pieces of this puzzle together).

There are two problems to this question :

  1. Encoding conversion from UTF-8 to ASCII
  2. MySQL index limit to 767 bytes without enabling innodb_large_prefix for MySQL < 5.7.7 (looks like this is now enabled by default).

Answer for "Encoding conversion from UTF-8 to ASCII"

ASCII is a subset of UTF-8 so not all characters can be converted. ASCII only uses 128 characters (the first 128) per byte while UTF-8 bytes can use more. The ideal solution would be to use an encoding that support all 256 possibilities per 8-bits bytes. Some encoding like cp1252 supports most characters but even if this is true, some characters are invisible which could end up causing issues.

For a true byte by byte conversion the only reliable option is to use binary. for our user case given we use MySQL, the best option would be to have a VARBINARY(254) (binary fields don't have encoding). After that it would be easy to simply:

INSERT into user_table set email_address='

You cannot "convert" a UTF8 string to ASCII at the same length if the characters do not have an ASCII representation as in your example.

What you could do is to create some kind of representation of the bytecodes that make up the UTF8 characters. I doubt that would be useful as email address though.

UPDATE

In UTF8 each character can consume multiple bytes. how many varies by character. If ASCII one character is one byte. So you can use each byte of the UTF8 character and see chat character that byte represents in ASCII. However- this will have absolutely nothing to to with the original UTF8 character except for those UTF8 characters that are represented by a single byte. IMHO those will match their ASCII representation.

Nicholas you seem to have some fundamental confusions with Ascii Vs UTF-8 Character sets in your Question and your comments to answer(s).



UTF-8 Value: