This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Re: New GB18030 gconv module contributed by ThizLinux Laboratory


On Thu, Jan 17, 2002 at 09:08:08AM -0800, Ulrich Drepper wrote:
> The problems you see are almost certainly stemming from the fact that
> the tables you use are containing invalid code positions.  If a
> character is not defined in Unicode/ISO 10646 the converter must not
> accept it.  The GB18030 standard might already define what happens if
> these code positions appear in a source but this can only mean that
> they are prepared for the time when these code positions are defined.

As mentioned in the previous message, "undefined" != "invalid".
Or, in the Unicode Consortium's terminology, "unassigned" != "illegal".
Unicode Technical Report #22, Character Mapping Markup Language
(CharMapML), has more information on this:

http://www.unicode.org/unicode/reports/tr22/index.html#Illegal_and_Unassigned

Besides explaining the difference between illegal and unassigned code
points, with examples, the author also notes that:

     Especially because unassigned character may actually come from a more
     recent version of the character encoding, it is often important to
     preserve round-trip mappings if possible.

> Yu Shao had all these positions defined in his first version and I
> assume all the provided test files were accepted.  I had to send him
> back and redo everything so that only characters which appear in
> 
>   http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> 
> are accepted.  All other characters are invalid.

All code points in U+0000..U+D7FF and U+E000..U+FFFE are legal (i.e.
valid), whether they are currently assigned or not.
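
To make the distinction concrete, here is a minimal sketch in Python (not
glibc code; the function names are made up for illustration).  It assumes
the full range up to U+10FFFF rather than just the BMP, and it uses the
interpreter's own Unicode database as a stand-in for UnicodeData.txt.
Legality is a fixed range check, while "assigned" changes with the Unicode
version in use:

```python
import unicodedata

def is_legal(cp: int) -> bool:
    """Hypothetical helper: True for any code point outside the surrogate
    range and within U+0000..U+10FFFF.  Independent of Unicode version."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def is_assigned(cp: int) -> bool:
    """Hypothetical helper: True if this interpreter's Unicode database
    has an entry for cp.  General category "Cn" means "not assigned", so
    the answer can change when the database is updated."""
    return unicodedata.category(chr(cp)) != "Cn"
```

A converter that rejects only what is_legal() rejects keeps working when
new characters are assigned; one that rejects whatever is_assigned()
rejects must be updated with every new Unicode version.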

Also, look at gb-18030-2000.xml, which is the _official_ mapping data of
Unicode<->GB18030, prepared by Markus Scherer and other Unicode Consortium
members.

> Look at your converter.  If it does anything different it needs to be
> fixed.  And once this is done I hope it does the same as the converter
> which I added yesterday.

Think of it this way.  In a sense, GB18030 is the Chinese equivalent of
UTF-8.  UTF-8 is designed to preserve ASCII compatibility, whereas
GB18030 is designed to preserve GB2312/GBK compatibility.  They are
functionally equivalent, one for China, and one for the world.
Given a text file containing every code point in U+0000..U+D7FF and
U+E000..U+10FFFF, if
	iconv -f ucs4 -t utf8 all-legal-unicode-codepoints.txt

completes without error, I don't see any reason why

	iconv -f ucs4 -t gb18030 all-legal-unicode-codepoints.txt

should fail.
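
For anyone who wants to try this, a test file like the one above could be
generated with something along these lines.  This is only a sketch in
Python: the helper names are made up, and it assumes that iconv's "ucs4"
input is read as big-endian 4-byte units with no byte order mark.

```python
import struct

def legal_code_points():
    """Yield every code point in U+0000..U+D7FF and U+E000..U+10FFFF,
    i.e. everything except the surrogate range."""
    yield from range(0x0000, 0xD800)
    yield from range(0xE000, 0x110000)

def encode_ucs4_be(code_points):
    """Encode code points as big-endian 4-byte units (assumed to match
    what iconv expects for -f ucs4)."""
    return b"".join(struct.pack(">I", cp) for cp in code_points)

def write_test_file(path="all-legal-unicode-codepoints.txt"):
    """Write all legal code points to path as raw UCS-4."""
    with open(path, "wb") as f:
        f.write(encode_ucs4_be(legal_code_points()))
```

Calling write_test_file() produces a file of 1112064 code points
(4448256 bytes), which can then be fed to both iconv commands above.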

Determining whether a Unicode codepoint is assigned or not is not the
job of gb18030.c.  Besides, glibc's GB18030 module should follow the
internationally recognized official standards agreed upon by both the
Unicode Consortium and the Chinese standards committee.

Best regards,

Anthony

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

