[NTLK] Greek Fonts 'n Unicode

From: Sean Luke (sean_at_cs.gmu.edu)
Date: Sun Jun 09 2002 - 11:09:10 EDT


Daedalus Guy <daedalus_guy_at_mac.com> thus spoke:

> Finally, I'm a bit hazy about the Unicode thing. I tried to understand
> what Unicode actually is (in practical terms) by reading through the
> materials provided by Apple's Font Group, but it's kind of like wading
> through a paper on atomic physics--far too technical and detailed to be
> of much use to me. I mean, I'm still not exactly sure what Unicode
> is--is it a font encoding method? Is it for mapping characters to
> keys? What I really need is a simple introduction as to what Unicode
> is, and then **a detailed explanation of how to make a font using
> it.** It looks like the PDF you provided might have some info on this.

Unicode is a character encoding. What does this mean? The easiest way
to explain it is to start with ASCII: ASCII is also a character
encoding; it's how text symbols are stored inside a computer's memory
(which is all numerical). ASCII assigns the numbers 0 through 127 to
unique glyphs. Each letter, number, and punctuation symbol you can
produce on your keyboard without using the option key has a unique
ASCII number assigned to it. Other glyphs (like newline, carriage
return, tab, space) are invisible in text but are used to affect it in
obvious ways. They too have an ASCII number each. Others (like
backspace) don't really have a "function" in text at all -- they were
used for other purposes long ago and are really only part of ASCII for
historical reasons.

ASCII's glyphs are:

      0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
      8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
     16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
     24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
     32 sp    33 !     34 "     35 #     36 $     37 %     38 &     39 '
     40 (     41 )     42 *     43 +     44 ,     45 -     46 .     47 /
     48 0     49 1     50 2     51 3     52 4     53 5     54 6     55 7
     56 8     57 9     58 :     59 ;     60 <     61 =     62 >     63 ?
     64 @     65 A     66 B     67 C     68 D     69 E     70 F     71 G
     72 H     73 I     74 J     75 K     76 L     77 M     78 N     79 O
     80 P     81 Q     82 R     83 S     84 T     85 U     86 V     87 W
     88 X     89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _
     96 `     97 a     98 b     99 c    100 d    101 e    102 f    103 g
    104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
    112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
    120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 del

The first 32 numbers (0 through 31) and number 127 are control
characters. Some of these are pretty common: nl is newline. cr is
carriage return. del is delete. esc is escape. bs is backspace. ht is
tab. Number 32, sp, is the ordinary space.
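
If it helps to see the number<->symbol idea in action, here is a small
Python sketch (Python is just my choice for illustration; the variable
names are made up). It looks up the ASCII numbers for a couple of
letters and for a few of the control characters in the table above.

```python
# Characters are stored as numbers; ord() and chr() convert between the two.
text = "Pi"
codes = [ord(ch) for ch in text]          # character -> number
print(codes)                              # [80, 105]
print("".join(chr(n) for n in codes))     # number -> character: Pi

# Pure ASCII text fits in one byte per character, every value below 128.
raw = text.encode("ascii")
print(list(raw))                          # [80, 105]

# Control characters have numbers too, even though they draw nothing.
controls = {"ht (tab)": "\t", "nl (newline)": "\n", "cr (carriage return)": "\r"}
for name, ch in controls.items():
    print(name, ord(ch))                  # 9, 10, 13
```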

ASCII only assigns the first 128 values of a byte. But there are 256
values in a byte altogether. Different computer systems have used the
other 128 values (called the "high values") to represent custom glyphs
for their own purposes. For example, the Commodore 64 used the high
values to store various graphics characters: boxes, circles, etc., even
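a happy face, which the user could type onto the screen. The IBM PC did
that as well, using its high values for box-drawing and accented
characters. It even has the four suits, though those sit in slots 3
through 6, displayed in place of the control characters, rather than in
the high values.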

The Macintosh was the first widespread computer to really use the high
values for an intelligent function. Because Apple wanted the Mac to be
easily usable by many different languages, they used the high values to
represent accented characters common in European languages. Apple also
included a variety of symbols common to the desktop publishing and
mathematics environments. You can see the MacRoman character set here:
http://orwell.ru/info/macroman.htm
The first column is the symbol. The second column is the MacRoman
number for that symbol. Note that the first values, 0 through 127, are
the same as ASCII.
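
A quick way to check this for yourself, assuming you have Python handy:
its standard library ships a "mac_roman" codec, so the sketch below
just decodes some byte values with it. The variable names are mine.

```python
# Python's built-in "mac_roman" codec implements the MacRoman table.
low = bytes(range(128))
same_as_ascii = low.decode("mac_roman") == low.decode("ascii")
print(same_as_ascii)                      # True

# A high value: 0x8E (142) is e-acute in MacRoman.
e_acute = bytes([0x8E]).decode("mac_roman")
print(e_acute, ord(e_acute))              # the character, then 233

# Going the other way gives back the single MacRoman byte.
print(list(e_acute.encode("mac_roman")))  # [142]
```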

When Apple worked with Adobe to encode the Big Four Adobe Fonts (Times,
Helvetica, Courier, Symbol) for the Macintosh and the Apple LaserWriter,
they ran into a little problem. They wanted to add the Greek glyphs and
all sorts of "additional" characters found in Symbol -- but they had no
slots in MacRoman to represent them. They had used up all 256 slots.
The decision was made that Symbol's glyphs would have the same numbers
as the MacRoman glyphs, but would display differently. For example,
when you press P, you get a capital Pi. So Pi also was ASCII character
80, just displayed with the current font (Symbol). If you change the
font, the Pi becomes a P. Apple and Adobe later used the same trick for
Zapf Dingbats, and Microsoft followed suit with its own similar (but
argh, different and frankly inferior) high-value encoding and font usage
approach.
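
Here is a toy model of that trick in Python. The two "fonts" are
hypothetical two-entry tables I made up for illustration (real fonts
hold outlines, not strings); the point is only that the stored number
never changes, just the table used to draw it.

```python
# Hypothetical, heavily trimmed "fonts": each maps a stored number to the
# shape it draws. Real fonts hold outlines, not strings.
TIMES = {80: "P", 112: "p"}
SYMBOL = {80: "\u03a0", 112: "\u03c0"}    # capital Pi, small pi

def render(codes, font):
    """Draw each stored number using the given font's table."""
    return "".join(font[c] for c in codes)

stored = [80]                              # what the document really contains
print(render(stored, TIMES))               # P
print(render(stored, SYMBOL))              # the same 80, drawn as capital Pi
```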

The problem with this approach is that every language not normally in
MacRoman or Windows encodings would have to get its own font. That led
to a proliferation of "fonts" for different languages. But that's not
the IDEA of a font. A font just displays glyphs in a particular
fashion -- but the glyphs are supposed to be the same. Times should
have Greek letters in it. Helvetica should be able to display Hebrew.
Instead you need a "Greek" font and a "Hebrew" font. And worse than
that, some languages (like Chinese) have far FAR more symbols than can
fit in 256 values. How do you handle *that*? That's normally done
nowadays with "multi-byte" encodings like Big5 or GB, which use two,
three, or four bytes strung together in special ways to represent a
Chinese character.
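
To make the multi-byte idea concrete, here is a sketch using the "big5"
and "gb2312" codecs that come with Python (I'm using "gb2312" as a
stand-in for "GB"). In both of these a Chinese character takes two
bytes while a plain ASCII letter still takes one.

```python
text = "A\u4e2d"       # "A" followed by the Chinese character zhong (middle)

for codec in ("big5", "gb2312"):
    raw = text.encode(codec)
    # The ASCII letter stays one byte; the Chinese character takes two.
    print(codec, len(raw), raw.hex())

# Decoding has to know where each character's bytes begin and end, so it
# only works with the codec that produced them.
round_trip = text.encode("big5").decode("big5")
print(round_trip == text)        # True
```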

Most of the international community eventually abandoned the "Greek"
and "Hebrew" and "Russian" font approach and got to the heart of the
matter,
creating their own language-specific character *encodings* to replace
MacRoman and Windows. Most of these encodings have "ISO..." at the
beginning of their names. Thus there's an ASCII+Hebrew collection of
number<->symbol mappings for Hebrew computers. And an ASCII+Greek
collection. Etc. Thus you could have a "Greek Times" font and a "Greek
Helvetica" font etc., all with characters encoded specifically for
Greek. But that was still highly unsatisfying: it meant that if you
took a file written with the assumption of ASCII+Hebrew and transferred
it to a computer with an operating system assuming ASCII+Greek, the
Hebrew letters would all get jumbled into Greek symbols. Yuck.
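
You can reproduce the jumbling in a few lines of Python. I'm assuming
here that "ASCII+Hebrew" and "ASCII+Greek" mean ISO 8859-8 and
ISO 8859-7, which Python exposes as the "iso8859_8" and "iso8859_7"
codecs; the sample word is just an example.

```python
hebrew = "\u05e9\u05dc\u05d5\u05dd"      # "shalom" written in Hebrew letters

# Save the text on the "Hebrew computer" (ISO 8859-8, i.e. ASCII+Hebrew).
raw = hebrew.encode("iso8859_8")

# Open the same bytes on the "Greek computer" (ISO 8859-7, i.e. ASCII+Greek).
jumbled = raw.decode("iso8859_7")

print(list(raw))       # four numbers, all in the high-value range
print(jumbled)         # four Greek letters, not the original text
print(jumbled == hebrew)                 # False
```

The bytes themselves never changed; only the table used to interpret
them did.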

Eventually the international community decided to fix this with a large
character set called Unicode. Unicode was designed with a big enough
number range (0...65535) to store most glyphs of most languages around
the world. It does this by using 16 bits (two bytes) rather than 8 bits
(one byte) to represent the number. Conveniently, Unicode's first 128
numbers are the ASCII glyphs. The next 128 are the ISO Latin-1
(ISO 8859-1) glyphs, which the Windows encoding is largely based on.
Unicode also has special numbers set aside to correspond to all the
extra glyphs common in Macintosh high-value areas. You can see all the
glyphs in the "code charts" area of www.unicode.org.
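
As a rough illustration, Python strings are Unicode, so ord() gives you
a character's Unicode number directly and the standard unicodedata
module gives its official name. The sample characters are my own picks.

```python
import unicodedata

# Each character has exactly one Unicode number, whatever font draws it.
for ch in ("P", "\u03a0", "\u05d0"):
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))
# 80 0x50 LATIN CAPITAL LETTER P
# 928 0x3a0 GREEK CAPITAL LETTER PI
# 1488 0x5d0 HEBREW LETTER ALEF

# Numbers 0..127 match ASCII, and 0..255 match ISO Latin-1.
all_bytes = bytes(range(256))
ascii_matches = all_bytes[:128].decode("ascii") == "".join(map(chr, range(128)))
latin1_matches = all_bytes.decode("latin-1") == "".join(map(chr, range(256)))
print(ascii_matches, latin1_matches)      # True True
```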

Beyond the basic collection of glyphs for all the
Windows+Mac+EuropeanLatin stuff (about 400 glyphs or so), the rest of
Unicode is broken up into ranges that are language-specific for
languages with completely unique sets of glyphs. There's the Greek
range, the Hebrew range, the Cyrillic range, etc. Since Chinese,
Japanese, and Korean all have strong overlap in the Chinese characters
they use, they share the same range, called "CJK". There are additional
special ranges for certain Korean and Japanese alphabets. There are
Arabic ranges, a math range, etc. And, interestingly, there's a Symbol
range and a Zapf Dingbats range -- you can see the hand of Adobe at work
on the Unicode committee. But that's a good thing: as mentioned
before, Symbol and Zapf Dingbats glyphs should have their own unique
numbers, not "share" them with other "fonts". As it turns out, even 16
bits isn't enough to store all the glyphs the world demands. Thus the
Unicode committee has extended the number range up to 1114111 (hex
10FFFF); in 16-bit text, the numbers above 65535 are written as special
32-bit sequences (pairs of 16-bit "surrogate" values).
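
Here is a sketch of what one of those 32-bit sequences looks like,
assuming the 16-bit storage format known as UTF-16 (Python's
"utf-16-be" codec). The musical clef is just an example of a character
whose number is too big for 16 bits.

```python
clef = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF, number 119070
print(ord(clef) > 0xFFFF)           # True: it does not fit in 16 bits

# Stored as 16-bit units, it becomes two of them (a surrogate pair).
raw = clef.encode("utf-16-be")
units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
print([hex(u) for u in units])      # ['0xd834', '0xdd1e']

# A character inside the 16-bit range needs only one unit.
pi_units = len("\u03a0".encode("utf-16-be")) // 2
print(pi_units)                     # 1
```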

You'll be happy to know that the Newton was the first major production
device (that I know of) to use Unicode as its basic character encoding.
Palm Pilots do not -- and it hurts them greatly. Java does. MacOS 9
and Windows machines do not. MacOS X's Cocoa environment is pure
Unicode, but icky Carbon is not. HTML has adopted Unicode as its
standard way of encoding symbols, but most Chinese and Japanese web
pages still use the old multibyte encodings.

Although the Newton uses Unicode to display stuff, you can use the
Newton keyboard (onscreen or real) and handwriting recognition system
only to produce the Unicode characters corresponding to the old MacRoman
character set. To make any other characters, you have to use some kind
of custom input device. Further, the Newton has trouble printing
special range Unicode characters to laser printers -- so if you plan on
printing with your Greek font to a laser printer, you may run into some
problems.

Mac fonts are typically stored either in MacRoman, one of several
special language encodings, or Unicode. When you use the Newton Font
tool, it only converts the MacRoman characters over to the Newton font.
To convert a more sophisticated font you have to use Apple's more
complex font tools (see the FAQ for a pointer on that). What you have
to decide is basically this: do you want a font that has the Greek
characters in the proper Unicode position, or do you want a font that
has the Greek characters in the "Symbol" position (i.e., with Pi
sitting in P's slot)? While the Unicode position is more satisfying, and
that's probably where the Greek Mac fonts have it stored, the "Symbol"
position may be more useful because you'll be able to easily enter text
using the keyboard and the handwriting recognition (to get a "Pi"
displayed, choose your "Greek font", then write a "P" on the screen).
At this point, you have to ask what is necessary beyond what the Symbol
font provides already. As I don't know Greek, I will assume you already
have this figured out.
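
If you go with the "Symbol" position but later want the text in proper
Unicode positions, the conversion is just a lookup table. Below is a
minimal Python sketch of the idea; it is not anything the Newton or
Apple's font tools provide. The table is a hypothetical five-entry
subset (only P -> Pi comes from the discussion above, the rest follow
the usual Symbol layout), and the function name is made up.

```python
# Illustrative subset only: a few Symbol-position keys and the Unicode
# Greek letters they stand for. A real table would cover the whole font.
SYMBOL_TO_UNICODE = str.maketrans({
    "P": "\u03a0",   # capital Pi
    "p": "\u03c0",   # small pi
    "a": "\u03b1",   # small alpha
    "b": "\u03b2",   # small beta
    "D": "\u0394",   # capital Delta
})

def symbol_to_unicode(text):
    """Turn text typed for a Symbol-position font into real Greek letters."""
    return text.translate(SYMBOL_TO_UNICODE)

typed = "Pab"                              # what the keyboard produced
converted = symbol_to_unicode(typed)
print([hex(ord(ch)) for ch in converted])  # ['0x3a0', '0x3b1', '0x3b2']
```

Characters missing from the table pass through unchanged, so a real
converter would want a complete table for the font in question.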

That's about the extent of my Mac font experience. Good luck!

Sean
