Re: [NTLK] Special characters in owner info allowed?

From: Peter H. Coffin (hellsop_at_ninehells.com)
Date: Fri Aug 20 2004 - 07:00:49 PDT


On Thu, Aug 19, 2004 at 11:58:33PM -0400, Dan Mills wrote:
> unicode.org discourages the use of the UCS-2 term:
>
> http://www.unicode.org/faq/basic_q.html#23

NIH syndrome... IIRC, someone else invented the label.
>
> Though perhaps not for older software that was implemented
> pre-unicode-2.0? I don't know that level of detail. Presumably by
> "doesn't implement any supplementary characters" they mean "is not
> variable-width", but then what happens when you interpret a UTF-16
> string that uses those characters as UCS-2? Random pairs of garbage
> characters? If so, then I don't get why they claim they're identical.

The supplementary characters are the ones above the 64k mark -- they're
sparsely populated and include things such as additional musical notation,
hieroglyphics, and archaic Chinese ideographs for (for example)
historically interesting literary terms. How UCS-2 and UTF-16 interlock is
directly comparable to how ASCII and UTF-8 interlock. High in the UCS-2
range, there's a block of codepoints that is reserved (the surrogates,
0xD800 through 0xDFFF), and a word (classic computer meaning here -- two
bytes) in the first half of that block says "Hey! supplementary character
here!" and carries the top 10 bits of the value of the supplementary
character. The word that follows must come from the second half of the
block and carries the low 10 bits. There are always exactly two words,
never a longer chain. The two 10-bit payloads are joined into one 20-bit
number, 0x10000 is added to it, and that's the codepoint of your
supplementary character (anywhere up to 0x10FFFF).
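
For the record, here is a rough sketch of the pairing arithmetic as the
Unicode standard defines it for UTF-16: exactly two 16-bit words, 10
payload bits each, with 0x10000 added to the joined value. The function
name combine_surrogates is made up for illustration, and the musical
G clef (U+1D11E) is just a convenient supplementary character to try.

```python
def combine_surrogates(high, low):
    """Join a UTF-16 high/low surrogate pair into one codepoint."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first word is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second word is not a low surrogate")
    # Top 10 bits come from the high surrogate, low 10 bits from the low one.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Let Python's own UTF-16 encoder produce the two words for U+1D11E.
raw = "\U0001D11E".encode("utf-16-be")
high = int.from_bytes(raw[0:2], "big")
low = int.from_bytes(raw[2:4], "big")

codepoint = combine_surrogates(high, low)
print(hex(high), hex(low), hex(codepoint))
```

Handing the function a word from outside the reserved block raises
ValueError, which is about what a strict decoder would do as well.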

Okay, why they're identical: UCS-2 reserves the same range of
codepoints, so nothing else is ever assigned there. Inside the 16-bit
space, all the characters are mapped to exactly the same codepoints.
Unless you happen to have some of those supplementary characters (and
it's VERY unlikely you will unless you're dealing with a very
specialized field), UTF-16 data can simply be treated as though it were
UCS-2. You'll be able to freely mix Roman, Hangul, Cyrillic, CJK, etc.
characters and it will Just Work.
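
A quick way to convince yourself, sketched under the assumption that
"reading as UCS-2" just means taking every two bytes as one character.
Python has no UCS-2 codec, so the helper read_as_ucs2 below is a
hypothetical stand-in for such a reader, not anything from a real
library.

```python
import struct


def read_as_ucs2(data):
    """Interpret big-endian bytes as fixed-width 16-bit units."""
    count = len(data) // 2
    return list(struct.unpack(">%dH" % count, data))


# Roman, Cyrillic, Hangul and CJK, all inside the 16-bit space.
bmp_text = "A\u044f\ud55c\u4e2d"
units = read_as_ucs2(bmp_text.encode("utf-16-be"))
same_as_codepoints = units == [ord(c) for c in bmp_text]

# One supplementary character shows up as two reserved 16-bit values.
clef_units = read_as_ucs2("\U0001D11E".encode("utf-16-be"))

print(same_as_codepoints, [hex(u) for u in clef_units])
```

So for the mixed-script string the fixed-width reader gets exactly the
codepoints back, and for the supplementary character it sees two values
from the reserved block, which an older UCS-2 program would typically
show as two unknown characters rather than as random other text.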

*tossing .sig-monster a cookie.*

-- 
Technical points aside, you could probably beat someone to
death with a Newton if you had to.  Try that with a Palm Pilot!
                              --Dan Duncan in comp.sys.newton.misc
-- 
This is the NewtonTalk list - http://www.newtontalk.net/ for all inquiries
Official Newton FAQ: http://www.chuma.org/newton/faq/
WikiWikiNewt for all kinds of articles: http://tools.unna.org/wikiwikinewt/

