How to work with UTF-8 in C++ and convert from other encodings to UTF-8
I don't know how to solve this:
Imagine we have four websites:
- A: UTF-8
- B: ISO-8859-1
- C: ASCII
- D: UTF-16
My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".
The problem is that the program should find all words in the website's text. A word is any combination of alphanumeric characters. Then I send these words to a server. The database and the web frontend are using UTF-8. So my questions are:
- How can I convert "any" (or the most used) character encoding to UTF-8?
- How can I work with UTF-8 strings in C++? I think wchar_t does not work because it is 2 bytes long. Code points in UTF-8 are up to 4 bytes long...
- Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8 strings?
Please note: I do not do any output (like std::cout) in C++. I just filter out the words and send them to the server.
I know about UTF8-CPP, but it has no is*() functions. And from what I have read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.
Edit: I forgot to say that the program has to be portable: Windows, Linux, ...
How can I convert "any" (or the most used) character encoding to UTF-8?
ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).
You create a converter for a given encoding...
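#include <unicode/ucnv.h>

UConverter * converter;
UErrorCode err = U_ZERO_ERROR;
// Open a converter for the source encoding (here ISO-8859-1).
converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ... convert the downloaded bytes using the converter ...
    ucnv_close( converter );
}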
...and then use the UnicodeString class as appropriate.
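For intuition about what such a converter does, the ISO-8859-1 case is simple enough to sketch by hand without ICU, because every ISO-8859-1 byte value is identical to its Unicode code point (U+0000 to U+00FF). The function name latin1_to_utf8 is made up for this illustration and is not part of ICU. This shortcut does not generalize to other encodings, which is exactly where ICU's converters earn their keep.

```cpp
#include <string>

// Hypothetical helper (not an ICU function): convert ISO-8859-1 bytes to UTF-8.
// Each input byte equals its code point, so it becomes one or two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: copied unchanged
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the ISO-8859-1 byte 0xE9 ("é") becomes the two UTF-8 bytes 0xC3 0xA9.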
I think wchar_t does not work because it is 2 bytes long.
The size of wchar_t is implementation-defined. AFAICR, Windows is 2 bytes (UCS-2 / UTF-16, depending on Windows version), Linux is 4 bytes (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Not for text in its UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The abovementioned functions do exist for Unicode code points (i.e., UChar32); see uchar.h.
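To illustrate why classification and length functions operate on code points rather than on UTF-8 bytes, here is a minimal stdlib-only sketch that decodes UTF-8 into UTF-32. The name utf8_to_utf32 is hypothetical and not an ICU function; in real code, ICU's own conversion (for example icu::UnicodeString::fromUTF8) is the better tool. The sketch replaces stray or truncated sequences with U+FFFD but, as a stated simplification, does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU function): decode UTF-8 into code points.
// Simplification: overlong encodings and surrogates are NOT rejected.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(U'\uFFFD'); ++i; continue; }  // stray byte

        ++i;  // consume the lead byte
        bool ok = (in.size() - i >= extra);  // enough bytes left?
        for (std::size_t k = 0; ok && k < extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);        // append 6 payload bits
        }
        if (ok) { out.push_back(cp); i += extra; }
        else    { out.push_back(U'\uFFFD'); }        // malformed sequence
    }
    return out;
}
```

With this, the UTF-8 string "naïve" occupies 6 bytes (what strlen() would report) but decodes to 5 code points, and each of those code points can then be passed to a per-code-point classification function.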
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
Have a look at BreakIterator, ICU's class for locating word boundaries in text.
I forgot to say, that the program has to be portable: Windows, Linux, ...
In case I haven't said it already: do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (I use it on Windows, Linux, and AIX myself), and you will use it again and again in projects to come, so time invested in learning its API is not wasted.