This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
- From: "Markus Scherer" <markus dot scherer at us dot ibm dot com>
- To: Anthony Fok <anthony at thizlinux dot com>
- Cc: Ulrich Drepper <drepper at redhat dot com>, fai at thizlinux dot com, Bruno Haible <haible at ilog dot fr>, kevin at thizlinux dot com, libc-alpha at sources dot redhat dot com, sunnygu at thizgroup dot com, suzhe at gnuchina dot org, Yu Shao <yshao at redhat dot com>
- Date: Thu, 17 Jan 2002 10:55:26 -0800
- Subject: Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
I agree with what Anthony said about mapping code points: Even if they do
not have assigned characters, their mappings are defined. This is true for
all Unicode code points except _single_ surrogate code points
U+d800..U+dfff.
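
As a rough illustration of the point above (not the glibc gconv code), here is a
minimal Python sketch. It assumes the linear four-byte layout that GB 18030 uses
for the supplementary range, starting at 90 30 81 30 for U+10000; that detail
comes from the published mapping tables, not from this message. The function
names are hypothetical.

```python
def is_single_surrogate(cp: int) -> bool:
    """True for the only code points that have no defined mapping."""
    return 0xD800 <= cp <= 0xDFFF


def gb18030_supplementary(cp: int) -> bytes:
    """Map a supplementary code point (U+10000..U+10FFFF) to four bytes.

    Assumed layout: a linear index with byte ranges
    90..E3, 30..39, 81..FE, 30..39 (from most to least significant).
    The mapping does not depend on whether a character is assigned.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    index = cp - 0x10000
    index, b4 = divmod(index, 10)    # last byte: 10 values
    index, b3 = divmod(index, 126)   # third byte: 126 values
    b1, b2 = divmod(index, 10)       # second byte: 10 values
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


# First and last supplementary code points.
print(gb18030_supplementary(0x10000).hex())   # 90308130
print(gb18030_supplementary(0x10FFFF).hex())  # e3329a35
```

A code point such as U+50000, which has no assigned character, still gets a
well-defined four-byte sequence from the same formula.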
Mapping _from_ GB 18030 may sometimes result in "unassigned" handling
because some 4-byte GB 18030 sequences are defined but do not have
mappings to Unicode.
Dirk's and my publications on this are based on a printed version of the GB
18030 standard from 2000 (plus the published electronic mapping tables),
and on following discussions about the standard as closely as possible. (I
do not read or speak Chinese, but Dirk does; our companies had Chinese
representatives who were in frequent discussion with the Chinese
standards agency.)
Note that the supplementary Unicode code points U+10000..U+10ffff were
_designated_ in Unicode 2.0 (1996), with the pseudo-assignment of
128*1024-4 of those code points (U+f0000..U+ffffd and U+100000..U+10fffd)
as a Private-Use Area.
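
The count above can be checked with a few lines of Python. This is only an
arithmetic sketch; the variable names are made up for illustration.

```python
# Private-use ranges in planes 15 and 16, as given above.
# The last two code points of each plane (xFFFE, xFFFF) are excluded.
plane_15 = range(0xF0000, 0xFFFFD + 1)
plane_16 = range(0x100000, 0x10FFFD + 1)

private_use_count = len(plane_15) + len(plane_16)

# Two planes of 64K code points each, minus 2 excluded code points per plane.
assert private_use_count == 128 * 1024 - 4
print(private_use_count)  # 131068
```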
Unicode 3.1 did not invent this supplementary range; it was "only" the
first Unicode version that assigned "real" characters to such code points
(and it assigned more than 40000 of them).
Note also that formally GB 18030 defines mappings to ISO 10646, not
Unicode. One of the differences is the publication schedule. Supplementary
character assignments were published only in December 2001 with ISO 10646
part 2, which synchronized with Unicode 3.1 several months after its
publication.
markus
Markus Scherer IBM GCoC-Unicode/ICU San José, CA
markus.scherer@us.ibm.com (also for SameTime)