This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)


On Sun, Jan 20, 2002 at 12:14:45AM +0800, Roger So wrote:
> On Fri, 2002-01-18 at 05:41, Ulrich Drepper wrote:
> > "Markus Scherer" <markus.scherer@us.ibm.com> writes:
> > > I agree with what Anthony said about mapping code points: Even if they do 
> > > not have assigned characters,
> > 
> > It is completely irrelevant what you think.  The converters convert
> > from the external charset to the internal private charset.  The latter
> > is defined in a way which disallows any non-Unicode position.  What
> > you do with your own code I don't care; but stay out of discussions
> > like this when they are related to glibc.
> 
> May I ask where does it say that the "converters convert from the
> external charset to the internal private charset"?  In fact, in "The GNU

Are we talking about the iconv interface, or about the conversion
between multibyte characters and *wide characters* (wchar_t)?

I think the two should be treated differently, as explained below.

For the iconv interface, it is reasonable to ask that all the code
space defined by the Unicode standard be available, whether or not
a given code point has been assigned a character yet. Of course it
is nonsense to map from an arbitrary charset code to an unassigned
Unicode point.

But GB18030 is not just another arbitrary charset (or coded charset,
whatever). GB18030 is intended to be a bridge from GB2312 to Unicode,
and it promises to stay compatible with Unicode in the future.
The unassigned code points in GB18030 are not arbitrarily mapped
to Unicode: the mapping, or the algorithm for the mapping, is
already fixed. There is no worry about polluting the Unicode
code space by mapping unassigned GB18030 points to unassigned
Unicode points.
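
To illustrate what "the algorithm of the mapping is already set" means,
here is a minimal sketch in C of the linear rule GB18030 uses for the
supplementary planes: the four-byte sequences 0x90308130 through
0xE3329A35 map one-to-one, in order, onto U+10000 through U+10FFFF.
The function name is my own and is not taken from glibc or from the
proposed gconv module. The four-byte sequences that map into the BMP
are defined by range tables in the standard and are not covered here.

```c
#include <stdint.h>

/* Hypothetical helper (not a glibc function): decode one four-byte
   GB18030 sequence from the supplementary-plane range.
   Returns the Unicode code point, or -1 if the bytes fall outside
   0x90308130 .. 0xE3329A35.  No table of assigned characters is
   consulted, so code points that are unassigned today map the same
   way they will once they are assigned.  */
int32_t
gb18030_supplementary_to_ucs4 (unsigned b1, unsigned b2,
                               unsigned b3, unsigned b4)
{
  if (b1 < 0x90 || b1 > 0xE3)
    return -1;
  if (b2 < 0x30 || b2 > 0x39)
    return -1;
  if (b3 < 0x81 || b3 > 0xFE)
    return -1;
  if (b4 < 0x30 || b4 > 0x39)
    return -1;

  /* Bytes 2 and 4 take 10 values each, byte 3 takes 126 values,
     so the sequence is a mixed-radix number.  */
  long idx = (((long) (b1 - 0x90) * 10 + (b2 - 0x30)) * 126
              + (b3 - 0x81)) * 10 + (b4 - 0x30);

  /* Sequences past 0xE3329A35 would exceed U+10FFFF.  */
  if (idx > 0xFFFFF)
    return -1;

  return (int32_t) (0x10000 + idx);
}
```

For example, 0x90 0x30 0x81 0x30 gives U+10000 and 0xE3 0x32 0x9A 0x35
gives U+10FFFF, regardless of whether anything is assigned there yet.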

The wchar_t data type is implementation specific. Glibc just happened
to choose Unicode. So if currently unassigned Unicode code points
are not allowed in glibc's wchar_t implementation, the check should be
done, in my opinion, at the mbtowc() level. Then even if mbtowc()
calls iconv() internally, iconv() can still make the whole defined
Unicode code space available.
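
The layering I have in mind can be sketched roughly as below. This is
only an illustration, not glibc code: convert_any() stands in for the
iconv layer, strict_to_wchar() stands in for the mbtowc() level, and
is_assigned() is a toy policy that knows a single block. All three
names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy policy (hypothetical): pretend that only U+4E00..U+9FA5 is
   assigned.  A real check would consult the character database.  */
bool
is_assigned (uint32_t cp)
{
  return cp >= 0x4E00 && cp <= 0x9FA5;
}

/* Stand-in for the iconv layer: accept every Unicode scalar value,
   assigned or not.  Returns 0 on success, -1 on failure.  */
int
convert_any (uint32_t cp, uint32_t *out)
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return -1;
  *out = cp;
  return 0;
}

/* Stand-in for the mbtowc() level: reuse the permissive converter,
   then apply the stricter "must be assigned" rule on top of it.  */
int
strict_to_wchar (uint32_t cp, uint32_t *out)
{
  uint32_t tmp;

  if (convert_any (cp, &tmp) != 0)
    return -1;
  if (!is_assigned (tmp))
    return -1;
  *out = tmp;
  return 0;
}
```

With this split, a code point such as U+9FA6 passes through
convert_any() but is rejected by strict_to_wchar(), so the policy can
change later without touching the converter.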

For the official GB18030 certification test, I think the
intention of including unassigned code points (even ones
unassigned in GB18030 itself) is also to make sure the code space
defined in the GB18030 standard is properly supported, so that
future additions to GB18030 (actually to Unicode) will
not require applications to be updated as well in order to function
correctly. In the future, only font files will need to be updated
if glyph display is needed.

The current situation is like hardwiring the 27,000+ defined
UniHan characters into an application (iconv in this case).
Any future addition to Unicode requires the application
to be altered. Hardwiring is bad in general, and unnecessary in
this particular case IMHO.

GB18030 is not just another charset, but practically another
encoding form of Unicode. With this fact in mind, it seems
reasonable to expose all of the Unicode code space to GB18030
in the iconv() interface.

Because GB18030 will not mess up Unicode's code space, it should
not mess up the wchar_t used in glibc. But since glibc does not
allow unassigned Unicode code points in its wchar_t implementation,
the must-be-assigned criterion can be enforced inside wctomb().

-- 
Best regards
hashao

