How to compare multibyte characters in C
I am trying to parse text and find some characters in it. I use the code below. It works with normal characters like abcdef, but it does not work with characters like öçşğüı. GCC gives compilation warnings. What should I do to work with öçşğüı?
Code:
#include <stdio.h>
#include <ctype.h>
#include <string.h>
int main()
{
char * text = "öçşğü";
int i=0;
text = strdup(text);
while (text[i])
{
if(text[i] == 'ö')
{
printf("ö \n");
}
i++;
}
return 0;
}
Warnings:
warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]
When I print the address of each char in the while loop, there are 10 addresses:
printf("%d : %p \n", i, text[i]);
Output:
0 : 0xffffffc3
1 : 0xffffffb6
2 : 0xffffffc3
3 : 0xffffffa7
4 : 0xffffffc5
5 : 0xffffff9f
6 : 0xffffffc4
7 : 0xffffff9f
8 : 0xffffffc3
9 : 0xffffffbc
and strlen is 10.
But if I use abcde:
0 : 0x61
1 : 0x62
2 : 0x63
3 : 0x64
4 : 0x65
and strlen is 5.
If I use wchar_t for the text, the output is:
0 : 0xa7c3b6c3
1 : 0x9fc49fc5
2 : 0xbcc3
and strlen is 10, wcslen is 3.
To go through each of the characters in the string, you can use mblen. You also need to set the correct locale (one whose encoding matches the encoding of the multibyte string), so that mblen can parse the multibyte string correctly.
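#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    const char *text = "öçşğü";
    const char *target = "ö";
    size_t target_len = strlen(target);
    int i = 0, char_len;

    /* The locale must match the encoding of the string (UTF-8 here);
       the locale name is system-dependent. */
    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "could not set locale\n");
        return 1;
    }
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] points to a multibyte character of length char_len */
        if ((size_t)char_len == target_len && memcmp(&text[i], target, target_len) == 0)
        {
            printf("ö \n");
        }
        i += char_len;
    }
    return 0;
}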
There are two kinds of string representation: multibyte (a sequence of 8-bit bytes) and wide (the element size depends on the platform). The multibyte representation has the advantage that it can be stored in a char * (the usual C string, as in your code), but the disadvantage that several bytes may represent one character. A wide string is represented using wchar_t *. wchar_t has the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still go wrong on platforms where wchar_t is not able to represent all possible characters).
Have you looked at your source code using a hex editor? The string "öçşğü" is actually represented in memory by the multibyte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by a terminating zero. You see 5 characters only because your UTF-8 aware viewer/browser renders the string correctly. It is easy to see that strlen(text) returns 10 for this string, whereas the code above loops only 5 times.
If you use a wide string, it can be done as explained by @WillBriggs.