Chinese for Programmers - Rein's Howtos
Intro
With no knowledge about Chinese at all, it's quite difficult to port
software to Chinese, I've noticed.
Having studied Chinese a while, worked in Beijing and Hong Kong, and
written millions of source code lines in various programming languages,
I feel it's more or less my duty to help others who also struggle with
software for the Chinese market.
I don't pretend I'm an expert in Chinese. If you find something wrong,
please give feedback!
Apart from this intro, this Howto is intentionally compact. I recommend
that you read and learn all the expressions in bold text!
I try to avoid explaining things that should be obvious to professional
programmers. But still, if you find anything too short or badly explained,
let me know.
I also avoid discussing cultural issues, unless they matter
to software developers. I'm not trying to explain what it is like to work in
China. I could tell you lots about that and even recommend many good
restaurants and so on. But then this text would lose its focus.
There's no product or anything behind this Howto. The purpose is
to share it, get feedback from others, and learn more that way!
/Rein
Quick facts
Here are some quick facts you need to know for starters! I also kill some
myths along the way, hopefully. To keep the text compact, some details
are explained further in the following Examples section.
- You need to consider two variants of Chinese: Mandarin
and Cantonese.
When you hear those terms, you deal with the spoken aspect of the
languages (how the language sounds when you speak it).
Although Cantonese is actually a dialect, I use the term Language to
point out the big difference from Mandarin. I don't want anyone to think
Mandarin and Cantonese sound almost the same, that's why!
- Mandarin is the official language in China. It is usually the one you need
to consider first. Cantonese is probably the second most important.
- When it comes to written Chinese, you also have two variants:
Simplified and Traditional.
- To really ignore lots of details, you can say Simplified characters
are used where one speaks Mandarin, and Traditional characters are used
where one speaks Cantonese. So Mandarin/Simplified is used in mainland
China, while Cantonese/Traditional is used in Macau and Hong Kong.
Please notice, this is really an over-simplification! Guangzhou, for
example, speaks Cantonese but writes Simplified. If your application
targets the Hong Kong market, or you need to support several dialects or
character sets, you have to get more details!
- The term Mainland China is quite common. It refers to China,
excluding Taiwan, Macau and Hong Kong [1] (the New Territories are part of
Hong Kong). What remains is a big land area where most people speak Mandarin
and use Simplified characters, and share the same political and economic
system, although regions such as Guangdong and Tibet have their own spoken
languages. Hong Kong is sometimes suffixed SAR, which means Special
Administrative Region.
- Guangzhou is a city in Guangdong province and is part of
mainland China. It has sub-provincial status, which lets it handle its own
economic affairs.
- PRC means People's Republic of China, which is Mainland China.
There's also a Republic of China, which is Taiwan!
- There's no such thing as Hongkongese. In Hong Kong they speak
Cantonese!
- Taiwan's official spoken language is Mandarin, but they write
Traditional characters.
- Japanese uses three scripts. One of them is called
Kanji. Kanji characters were borrowed from Chinese, and many are identical
or close to Traditional Chinese characters.
- No, Japanese and Chinese speakers don't understand each other's spoken
language, but they may work it out in writing. However, this Howto
is not about Japanese at all!
- Cantonese and Mandarin differ heavily. They sound like two
completely different languages, and you shouldn't expect people from
Hong Kong to be able to understand Mandarin, or vice versa.
- Traditional and Simplified characters have similarities.
People who know one can sometimes guess the other.
Perhaps Hong Kong people are more likely
to figure out the meaning of a written text in Simplified, rather than
the other way around. Traditional characters typically have more strokes
than Simplified, but the simplification follows some rules so if you
learn those rules the guessing isn't so hard. Also, many of the simpler
characters look exactly the same in both Traditional and Simplified.
- The Chinese written language is character-based rather than
alphabet based.
- The traditional characters formed the basis of the Chinese,
Japanese and Korean written languages. Chinese and Japanese later introduced
their own simplified forms, which are not interchangeable between the two
languages (unlike the Traditional characters). Korean is today written
mostly in its own alphabet, Hangul.
- A piece of advice: To speed up your work, try to learn some very common
Chinese characters! Make sure you know exactly how they should look. Pick
a word which you know looks different in Traditional vs. Simplified.
For example, the word men, which means door. In Simplified, it looks
like 门, in Traditional, it is 門. (With a person radical added, 们, it
serves as a plural marker for people.)
- Modern Chinese is usually read from left to right and top to bottom, just
like English. Vertical columns read from right to left are common in poems
and ancient texts though. So if you visit China and see some old buildings
with text on them, you should read from right to left. As a software
developer, you should be aware that you may have to support several layout
variants!
- There's NOT a one-to-one relationship between English words and
Chinese characters. Sometimes there is, but usually you need two or
three Chinese characters to represent an English word.
- If your application is just producing static written Chinese text,
you're lucky! All you have to do is to select a proper font, use UTF-8
as character encoding and make sure you have room for at least 16 pixels
in height for each character row.
- If your application allows the user to enter Chinese characters,
things become harder. At least you need an Input Method (IM), which is
usually a plugin. We'll get back to this.
- If your application should also speak, you must ensure it
pronounces correctly. If not, Chinese users won't understand it at all.
On the other hand, since there are relatively few syllables in Chinese,
once you've recorded them all (in every tone), your speaking application
should be able to generate any word!
- You can't ignore pinyin! Pinyin looks like western characters, but
vowels can also carry special strokes (tone marks). The strokes tell how to
pronounce those vowels. If they are ignored, no Chinese person will understand.
Pinyin is also used for input of Chinese, which is another reason you
must consider pinyin. Read on!
- If you can't write the special strokes above Pinyin characters, you
can put a digit after each syllable instead. For example, er4 instead of
èr (a falling slant above the e).
- Pinyin is used where Mandarin and Simplified are used. In Hong Kong
and Taiwan, similar concepts are used but they are not called Pinyin.
- Sometimes you'll see Pinyin without extra strokes. This is typical
when you see names of persons and companies. But you can't make a
shortcut in your programs and ignore those extra strokes!
- A Chinese character is made up of quite well defined strokes which
all have names. For example, a horizontal dash has the name heng (横).
- If you've seen one character, you will soon see that character again
but as part of a more complex character. You can think of it like
inherited classes in an object oriented language, or as a recursive function.
- A confusing fact is that if you see a Chinese character, you can't
say what it means unless you have the context! Very few Chinese characters
have distinct meanings when they stand alone. Most characters have
4-10 meanings!
- There are quite few syllables in Chinese. A syllable is a distinct
sound particle. Many westerners think Chinese people only can say Ching
Chang Chong. This is not true, there are a few others :). Anyway,
Ching Chang Chong are three syllables. As a programmer, you can think of
a syllable as a verbose token.
- Many times one Chinese character is built by combining two or three
other characters. In this case one of the parts is called the radical.
Radicals are used for grouping characters together (so you can sort them
and locate them in a dictionary). The radical often hints at the meaning,
while another part often hints at the sound of the character.
- There are just above 200 radicals so you can learn them quickly!
- Radicals usually look a little bit different from their original
character which they are based on. Some look exactly the same and some
totally different.
- Grouping Chinese characters together, to show which belong to the
same word, is usually only made in children's text books or similar.
Adult Chinese are expected to know how to separate words.
- In some texts you'll see western style punctuation marks like
bang (!) or question mark (?), in other texts you won't.
- No, you usually can't tell what a Chinese character means just by
looking at it! Only a few Chinese characters look like what they mean.
Such characters are called Pictograms. But as a programmer, you
couldn't care less!
- If you buy a mobile phone other than a smart phone, it probably
doesn't support Chinese characters. English is always supported in all
phones, and most other languages are region dependent. Some phones start
by asking you to select language, and once you've done it they remove
other languages to free up memory space, so you can't get them back.
Even on a smart phone where all languages are supported, the keyboard
may not allow more than a specific number of languages. But sometimes you
can download separate keyboards, one for your own language and another
for Chinese. This is quite a mess!
- A PC or Mac computer may not support both your own language and Chinese
unless you pay for additional languages. If you use Linux you just select
the languages you want support for, then wait until they are downloaded.
All free of charge. But if you're short of disk space you may want to
disable total support for Chinese as it occupies much more space than
most other languages. In particular you may not need all those Chinese
dictionaries which Linux happily throws at you!
- In Linux, there are many input methods (IMs) to choose from, and each of
them is the best one! You probably only need one!
- There are rules stating in which order the strokes of a Chinese
character shall be written. If you use handwriting input and write
the strokes in the wrong order, the software may be confused and give you
the wrong character as a result.
- You can't use character sets like ANSI, ISO-8859-1, Western or ASCII
to represent Chinese characters. Use Unicode, preferably encoded as UTF-8.
Note: Character sets and Fonts are two different things. The Character
set decides how characters are encoded, or stored. Fonts just decide how
they look. Not all Fonts can represent all characters of a given Character
set though!
- Chinese people often use other smileys than westerners.
- To print characters from your program, you may have to use alternative
functions and methods. For example wprintf() instead of printf().
You should search your API manuals for wide character or multi-byte.
You may have to add a few compiler switches, include
alternative header files and packages, and maybe even link with other
libraries.
- If you send characters to a text terminal, raw output may generate
wrong Chinese characters. This is because more than one byte must be sent
to the terminal to represent one character. If there's a break between
the bytes, the wrong symbols may be looked up.
- Notice that wide characters, UTF-8 and Unicode are not just about
Chinese. They are needed for most other languages too. In fact,
I strongly recommend that you skip other types of encoding (such as ASCII,
ANSI and ISO-8859). We live in a global world now!
- When creating a new database, remember to set UTF-8 as Character
set before you start populating any tables! Else you'll waste a lot of
time later.
- Chinese characters don't come in upper-case and lower-case
variants, there's only one alternative. Same as our digits.
- In Unicode, every character is assigned a number (its code point),
usually written in hexadecimal. To find those numbers, visit the Unicode
web site.
- Unicode is NOT a 2-byte representation of characters! Well,
sometimes it is, but UTF-8 uses anything between one and four bytes
to represent a character (most Chinese characters take three), and UTF-16
uses two or four. This means if you write programs which
find positions in Unicode strings, you can't simply step two bytes back
and forth for each character. It will be a complete failure!
- When specifying the Chinese language (typically a parameter named LANG),
you should set it to zh-CN. POSIX locale names use an underscore instead,
for example zh_CN.UTF-8.
- CJK stands for Chinese, Japanese and Korean. See chapter
Fonts below!
- Big5 or Big-5 is a character set used in Hong Kong, Macau
and Taiwan.
- GB is a character set used in mainland China.
- Be very careful when using on-line translators, such as Google
Translate! Two reasons: 1. The result is often bad. Try for example to
convert a sentence from English to Chinese, and then translate it back
again! 2. You may get into legal trouble sooner or later, because you have
probably signed an NDA or other contract stating you're not allowed to reveal
classified information. As soon as you paste that contract, email or
whatever into an online translator, the text has left your hands and the
provider may keep it (did you really think Google looks after your best
interests only and ignores its own opportunities to make money?).
- If you translate texts in your software application, never use online
translators. Consult a professional translator instead, there are many of
those in China and they're usually not very expensive! I already assume
you have separated the text from your code, haven't you? After the
translation is done, test the application on a native speaker to see if
the translation makes sense. Remember, Chinese is a very context
dependent language, so too short messages and menu options can be
confusing to the user.
- To support Chinese characters in your application, use for example
wchar_t *wcscat(wchar_t *str1, const wchar_t *str2) instead of strcat().
- Character Entity References, or Entities for short,
are a special syntax you can use to represent a special character in
an HTML document. For example, &Aring; is such an entity,
referring to the Swedish character Å. Numerical variants can also
be used, in the format &#number; or &#xhex-number;
- In China's ancient times, the Lunar calendar was used. Today, China
uses the same calendar as in Western countries. The Lunar calendar is used
when it comes to traditional activities, ceremonies, horoscopes and that
kind of stuff. So whether you should consider the Lunar calendar depends
on what kind of application you write.
- Number superstition is widespread in China.
Something to consider perhaps. 2, 3, 8 and 9
are some good numbers, and 4 is a really bad number (it sounds like the
word for death). Avoid 4 where you can! You won't even find a 4th floor in
many Chinese high rises!
- While we in the West usually put commas or spaces between every third
digit to make large numbers more readable, Chinese uses 10000 (万, wan) as
a normal base unit. This is why translators sometimes miss or add one digit
by mistake. So it makes sense to verify large numbers!
- The words for he and she are written with different Chinese
characters (他 vs 她)
but have the same pronunciation. This can cause translation mistakes
also.
- Some Chinese grammar is really simple. For example, there are no
special words like him or her. Many prepositions used in English are not
used in Chinese. One difficult thing though is that there are about 50
different measure words to keep track of. You can't say one car,
you have to stick in the correct measure word in between!
- The most important IRC/chat program in China is Tencent's QQ.
Skype is far less common in China.
- The most popular Search engine in China is
Baidu.
- Weibo is China's most popular
social site, to be compared with Facebook in the west.
- Don't take for granted that all Western sites can be reached in China,
for example, YouTube is usually blocked, and Android Market may also be
blocked sometimes.
- The top domain for China is cn, for Hong Kong hk,
for Taiwan tw and for Singapore sg.
- Phone number prefix for China is 86, for Hong Kong it's 852. Note
that subscribers sometimes also have to pay for incoming calls in China, and
it's common that the operators sell the phone numbers to third parties.
Don't be surprised if your mobile phone is spammed!
- Sending SMS from applications to mobile phones in Hong Kong can be
difficult because there's no way to tell whether a phone number is to
an ordinary phone or a mobile. You can't tell by the area code etc.
- Chinese keyboards may be different, but they don't have 6000 keys.
They are usually similar to ordinary American keyboards but they may
have additional shift modes to make it easier to enter pinyin. You can read more about this below.
- There's quite a generation gap when it comes to computer usage in
China, but there are also local differences. But this isn't just in China!
You may consider the fact that younger people want everything to be
accessible via their mobile phones. If you spend 2 hours every day on a
crowded bus or subway train, you can't use a laptop.
- Some Chinese smart phones have two SIM card slots!
- Language skills differ! My own experience with manufacturing companies
in China is that you often deal with a project manager, usually a woman,
whose English is extremely good, while the male programmers you meet may not
be that fluent in English. In general, younger people are more likely to
speak English than older ones. Since the software industry is young, you
can expect Chinese software developers to know enough English!
- If you develop embedded software for products to be manufactured in
China, you should talk to someone who has this experience. There are many
pitfalls, you have to do your homework carefully! But I think this is a bit
out of the scope of this article.
- Currency: PRC and Hong Kong have different currencies, although
the exchange rates are fairly close. PRC uses Renminbi (RMB).
Hong Kong uses Hong Kong dollars (HKD). The RMB base unit is the Yuan,
which in speech is usually called Kuai.
- During 2009 and 2010 China came up with their own Wireless LAN standard,
WAPI, which caused a lot of frustration, confusion and anger in the
computer industry. Fortunately, it seems this is no longer an issue.
- If your application targets ALL Chinese users, things become more
difficult, at least with speech based applications. There are many
dialects in China, many so different from Mandarin that people from
different regions won't understand each other.
- This shouldn't be about culture or food, but I just have to add a
warning: WHATEVER you do if you visit China or Hong Kong DON'T eat at
Western restaurants!!! The Chinese food is really outstanding, being there,
and eating at McDonald's, is a crime!!!
[1] Since summer 1997, Hong Kong belongs to PRC, but it still
has its own legislation and economic system, that's why it's called
SAR of Hong Kong. As a software developer, you can probably skip the politics
and focus on the differences between the dialects and characters.
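The fact above about Unicode not being a fixed two-byte format is easy to check yourself. Here is a minimal Python sketch. The sample string and the function name utf8_lengths are my own choices for illustration, nothing more. It prints how many bytes each character occupies in UTF-8:

```python
# Sample with one character from each UTF-8 length class:
# a (ASCII), Å (Latin-1 range), 你 (CJK), 😀 (outside the 16-bit range).
sample = "a\u00c5\u4f60\U0001f600"


def utf8_lengths(text):
    """Return a list of (character, code point, number of UTF-8 bytes)."""
    return [(ch, f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


for ch, code_point, size in utf8_lengths(sample):
    print(ch, code_point, size)

# Stepping through the encoded bytes two at a time would cut characters
# in half. Decode first, then index the resulting string instead.
encoded = sample.encode("utf-8")
print(len(sample), "characters,", len(encoded), "bytes")
```

So the safe habit is to count and index decoded characters, and only think in bytes when you read or write files and sockets.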
Examples
Here are some examples from the above facts. Since I'm not sure you have
already installed support for Chinese characters on your computer, all
pinyin and Chinese characters are represented by images.
- [image: 中国] is a Chinese word built up of
two characters. The first, [image: 中],
means middle. The second, [image: 国], means
country, nation and kingdom. So this word means China!
- [image: zhōngguó] is the pinyin variant of
the above word.
Notice the flat stroke above the first o and
the rising slant above the second o. These symbols tell you that the first
o shall be pronounced long and without tone change, the second o should
be pronounced with a short rising tone. There are five main tone variants.
[image: zhōng] is one syllable, and
[image: guó] is the other.
- To write the above pinyin using numerical format, write
zhong1guo2. 1 means flat sound, 2 means rising, 3 means
falling, then rising and 4 means falling.
- [image: 功夫] means Kung-Fu, a martial arts
sport. It contains two(!) characters. The first,
[image: 功], contains two parts. The part looking a bit like
I (工, gong) gives the whole character its pronunciation. The other part,
力 (strength), is its radical and adds to the meaning.
Pinyin for the first character is
[image: gōng]. The second character is pronounced fu.
- [image] is not the same as
the above word! It does not mean the martial arts sport Kung-Fu!
It does not contain two characters, it contains three! What does
it mean? It would mean "kilometer man"! What's that?
- [image: 汽车] is the simplified word for
"car". Corresponding pinyin is [image: qìchē]. The word for car with
Traditional characters is [image: 汽車]. It would be wrong to write a
corresponding pinyin word here since pinyin isn't used with Traditional.
However, you pronounce it something like hei-tje in Cantonese.
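The numerical pinyin format in the examples above can be converted to tone marks mechanically. Here is a minimal Python sketch of how I would do it. The function names are made up for this example, and it assumes lower-case input with a digit 1-5 after every syllable. The placement rule used is the common one: the mark goes on a or e if present, on the o in ou, otherwise on the last vowel.

```python
import re

# Marked variants of each vowel, in tone order 1, 2, 3, 4.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable, tone):
    """Put the tone mark for tone 1-4 on the proper vowel of one syllable."""
    if tone == 5:  # neutral tone, written without any mark
        return syllable
    if "a" in syllable:
        pos = syllable.index("a")
    elif "e" in syllable:
        pos = syllable.index("e")
    elif "ou" in syllable:
        pos = syllable.index("o")
    else:
        positions = [i for i, c in enumerate(syllable) if c in TONE_MARKS]
        if not positions:
            return syllable
        pos = positions[-1]  # otherwise the last vowel carries the mark
    marked = TONE_MARKS[syllable[pos]][tone - 1]
    return syllable[:pos] + marked + syllable[pos + 1:]


def numbered_to_marked(text):
    """Convert numbered pinyin such as zhong1guo2 to pinyin with tone marks."""
    return re.sub(
        r"([a-zü]+)([1-5])",
        lambda m: mark_syllable(m.group(1), int(m.group(2))),
        text,
    )


print(numbered_to_marked("zhong1guo2"))
print(numbered_to_marked("ni3hao3"))
```

Going the other way (from marks to digits) is harder, because you first have to split running pinyin into syllables.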
Nihao Example
In this separate example I have written the word Hello in English,
Simplified Chinese, and Pinyin. It looks like this:
[image: Hello! / 你好！ / nǐhǎo]
If you have installed Chinese support correctly, it should look the
same if you download and open these files and print them out with
their corresponding tools or commands:
Plain text,
OpenOffice Word,
MS Word 97/2000/XP,
MS Word 97 XML,
Rich Text Format,
PDF,
HTML.
I really recommend that you inspect these files with a binary editor
like Hexedit, or print them out with the Linux command od -c filename.
If, for example, you just try to open file nihao.txt directly by clicking it
in the browser, you probably see something like this:
[image: garbled Western characters in place of the Chinese text]
This is because there's no information in the file telling your browser
which Character encoding is used, so it probably picks ISO-8859 and finds
those strange characters here instead. All the other files carry
information about which Character encoding is to be used. When you write
your own programs you must make sure you specify Character encoding
properly! For example, in the HTML file you'll see specifications like
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<SPAN LANG="zh-CN">
If you print out the contents of file nihao.txt in raw 8-bit byte format,
you'll get mostly garbage after the first line. Therefore I'll show you
how it looks with three different representations, corresponding to the
switches c, b and x to the od command. The output format is
symbolic (with octal escapes), octal and hexadecimal respectively. Note
that the x switch prints 16-bit words, so on a little-endian machine each
pair of bytes appears swapped:
0000000 H e l l o ! \n 344 275 240 345 245 275 357 274 201
0000020 \n n 307 220 h 307 216 o \n
0000000 110 145 154 154 157 041 012 344 275 240 345 245 275 357 274 201
0000020 012 156 307 220 150 307 216 157 012
0000000 6548 6c6c 216f e40a a0bd a5e5 efbd 81bc
0000020 6e0a 90c7 c768 6f8e 000a
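You can reproduce the dumps above without od. The following Python sketch assumes the file holds the three lines Hello!, 你好！ and nǐhǎo, which is what the byte values above decode to. It prints the bytes in octal and then shows what happens when the same bytes are read with the wrong encoding:

```python
# Assumed content of nihao.txt, reconstructed from the dump above.
text = "Hello!\n你好！\nnǐhǎo\n"

raw = text.encode("utf-8")

# The bytes in octal, the same values that od -b prints.
print(" ".join(f"{b:03o}" for b in raw))

# What a program shows when it wrongly assumes ISO-8859-1:
# every byte becomes one (strange) character.
print(repr(raw.decode("iso-8859-1")))

# Decoding with the right encoding brings the text back.
print(raw.decode("utf-8"))
```

The bytes on disk are identical in both cases. Only the declared encoding decides whether you see Chinese or garbage.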
Some editors and word processors are smart enough to ask you what
character encoding is to be used if they can't figure it out. So if you try
to open the file nihao.txt and get a dialog popping up, try these
settings:
[image: import dialog with the character set Unicode (UTF-8) selected]
Test Your Environment
Now it's time you test your environment, to make sure you can work
properly with Chinese characters.
Chinese Character Rendition
Let's start with your desktop and browser. Can they render Chinese characters
at all?
What do you see here?
你好
Did you see [image: 你好]?
If these texts don't show the same characters, or you just saw the word
NOTHING, at least your browser doesn't
support Chinese characters. But it might be that not even your operating
system does, or you may not have proper fonts installed. To find out,
open a word processor and try to open any of the following files,
whichever format your word processor can normally handle:
OpenOffice Word,
MS Word 97/2000/XP,
MS Word 97 XML,
Rich Text Format
It shouldn't matter which font you select since the word processor
automatically picks a font which can handle the text properly, not
always the font you have decided.
If you still see garbage, and not proper Chinese characters, you have
to install support for Chinese characters. Sometimes this comes in
packages called Asian language support or similar. On Windows and
Apple computers you may have to pay for such packages. On Linux, just
look under System/Administration/Language Support or similar
(there are many variants (distros) and versions of Linux, so I can't point out
exactly where to look). Your distribution probably has some online help,
so search for i18n, internationalization or language in it.
From this point in the text, I will start using Chinese characters
directly. If you see "garbage" where you should see Chinese
characters, go back and install support for Chinese on your computer!
E-mail
Next, test your email client! Try to email the word 你好
to yourself. Just copy it from the last sentence! It should look like
你好 both when you send it and when you receive it.
If it doesn't, try to configure your email client to use UTF-8 character
encoding. If it still doesn't work, check whether your mail server lacks
support for UTF-8. If it does and you don't run the server yourself, demand
support for UTF-8 or change ISP. They should know better these days!
For example, the web mail client SquirrelMail, provided by my ISP
Levonline, shows garbage instead of 你好 in both the subject line and the
message body. Fortunately, Levonline also
provides a web mail client named Round Cube, which handles UTF-8
correctly.
Input Methods
Now comes the difficult part - How to enter Chinese characters!
Chinese users don't have keyboards with 6000 keys! In fact, they can even
use the few keys on a mobile phone by typing in pinyin.
Since the arrival of smartphones, they can also use handwriting
by simply drawing the characters on the display. Both methods require
an IM. An IM is a helper application which interprets the entry and converts
it to Chinese characters, then passes it on to the application that
currently has the focus, for example a word processor. You typically
install your IM once, then switch it on and off as you need it. Switching
is usually done with some accelerator key combination, like Ctrl-Space.
There are several IMs available for free. One of them is IBus.
When IBus is installed and activated, it shows a small icon in the status
bar.
Start your word processor now. Don't click any other application! Make sure
your word processor is ready for input and has the focus.
Click the IBus icon. It lists the supported languages.
Select Chinese Pinyin.
Now, click in the text body area of your word processor, and type the
sequence bu. As soon as you press the b, a tiny dialog
should appear, listing those Chinese characters that match your pinyin.
Select the character you want either by pressing the corresponding digit,
clicking on it, or scrolling down to it and pressing the space bar. You can
also type more pinyin.
Notice that you must end your entry with the space bar. Otherwise your entry
will be ignored. If you did it right, your Chinese characters should now be
visible in your document!
To switch to native language input, just press Ctrl-Space again. Next
time you press it, the last input method is restored, so you (usually) won't
need to select it again.
Entities
In HTML documents you can use Entity References to represent special characters.
I'll just give a few examples. 中 国 can be encoded using
the following entities on a web page:
&#x4E2D; &#x56FD;
But how did I know? Well, what I did in this case was to download a PDF
document from
www.unicode.org/charts/#symbols (click on CJK Unified Ideographs (Han) (31MB))
and look the characters up.
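If you don't want to look characters up in a chart, a program can produce the numeric entities for you. Here is a minimal Python sketch. The function name to_entities is my own, while ord() and html.unescape() come with Python's standard library:

```python
import html


def to_entities(text, hexadecimal=False):
    """Replace every non-ASCII character by a numeric character reference."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)                   # plain ASCII stays as it is
        elif hexadecimal:
            parts.append(f"&#x{ord(ch):X};")   # hexadecimal variant
        else:
            parts.append(f"&#{ord(ch)};")      # decimal variant
    return "".join(parts)


print(to_entities("中国"))
print(to_entities("中国", hexadecimal=True))

# And back again, from entities to characters.
print(html.unescape("&#20013;&#22269;"))
```

If your pages are served as UTF-8 you normally don't need entities for Chinese at all, but they are handy when some part of the tool chain is not Unicode safe.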
Non-Unicode Applications
There are unfortunately still some applications that don't support Unicode
(hopefully none of yours!). So before you decide you have failed to install
Unicode and Chinese support, try some other application!
Fonts
The font you use must be able to represent those Chinese characters
you wish to display. Otherwise you'll just see empty squares, question
marks etc.
CJK Fonts
When looking for fonts capable of rendering Chinese characters, try to
find CJK for starters! Use the GB character set for mainland China and
Big5 (Big-5) for Hong Kong (and Taiwan and Macau).
xfontsel
You can use the tool xfontsel (downloadable for free, it runs at
least under all X11 GUIs, such as on Linux and Unix). It lets you see
all the fonts you have installed and test all variants of them.
When you start it, it looks like this:
[image: xfontsel window]
I recommend that you go from left to right. Left-click on fndry
and hold the mouse key down. Now you see a list of various font foundries.
Move the mouse to any of them, for example bitstream if you
have installed that font. Release the mouse key. You should now see the
characters represented by this font. As you can see, the bitstream font
doesn't contain any Chinese characters, so it might be a bad choice.
However, this isn't proof enough since this is just a sample text. You
can use switches to tell xfontsel to use another sample text that contains
Chinese characters.
Xfontsel can sometimes be a bit annoying. Suddenly, all options are gray
so you can't select them. If this happens, just make sure you select the
first option, the asterisk (*) for all attributes!
In this example, the font ISAS is shown. I have painted some characters
magenta to point out a very useful rule I've discovered:
Chinese characters don't contain rounded shapes,
except for the character I've colored red. You may see this sometimes
amid ordinary Chinese characters, in particular in company names,
commercials and so on.
If you see round strokes like those magenta colored ones, a good guess is
that they are Japanese Hiragana Syllables. If you see real circles combined
with really hard shapes with quite few strokes, it's probably Korean!
Note though, that in the above picture all characters on the two last
lines above are real Chinese characters.
If you wish to examine a particular font with Xfontsel, you can restrict
it to that font by starting it up with the -pattern option,
for example:
xfontsel -pattern '-adobe-*-*-*-*-*-*-*-*-*-*-*-*-*'
LATEX
First, let's produce some simple pinyin in LaTeX which should look like this:
yī, èr, sān, sì, wǔ, liù,
qī, bā, jiǔ, shí
These are the first 10 digits in Chinese pinyin (well, zero is omitted and
10 is added as it has its own symbol which looks like a plus. I know, it's
not a digit but a number. Sorry!).
In plain LaTeX, you can code the above as:
y\={i}, \`{e}r, s\={a}n, s\`{i}, w\v{u},
li\`{u}, q\={i}, b\={a}, ji\v{u}, sh\'{i}
It's not perfect, as you may also get dots above the i's (use the dotless
\i, as in \={\i}, to avoid that).
Now, add the following line to the Preamble of your document
(if you know LaTeX, you know what Preamble means):
\usepackage{pinyin}
Add the following to your document somewhere in the body section:
\yi1, \er4, \san1, \si4, \wu3,
\liu4, \qi1, \ba1, \jiu3, \shi2
This should show perfect pinyin!
Now let's get some true Chinese characters also. Add this to the Preamble:
\usepackage{CJK}
In the part of the document body where you want to write Chinese characters
add the line
\begin{CJK*}{UTF8}{gkai}
Write Chinese characters here. Use your Input Method.
For example, write
一二三四五六七八九十
Finally finish with
\end{CJK*}
Note, gkai above is just a font. It's usually a good option but
you can of course specify another font!
If this still doesn't work, maybe you need to install the following
packages:
- latex-cjk
- latex-cjk-chinese-arphic-gkai00mp
- latex-cjk-chinese-arphic-bsmi00lp
- latex-cjk-chinese-arphic-bkai00mp
- latex-cjk-chinese-arphic-gbsn00lp
- latex-cjk-chinese
- latex-cjk-common
- ttf-arphic-bsm00lp
- ttf-arphic-gkai00mp
- ttf-arphic-bkai00mp
- ttf-arphic-gbsn00lp
- texlive-xetex
- ttf-arphic-ukai
- ttf-wqy-zenhei
- ttf-arphic-uming
Pronunciation Hell
Now that you know a little more about Pinyin, let me point out why
pronunciation is important! Here's a list of words all sounding like
shi. But there are four different pronunciations meaning
different words. And even if you don't know the context, or see the
Chinese character, the sound can still mean lots of different things!
Tone 1:
师 shī teacher
Tone 2:
十 shí ten
时 shí time
食 shí eat
实 shí reality
Tone 3:
始 shǐ begins
史 shǐ history
Tone 4:
视 shì watch, inspect
适 shì fit
是 shì is, am, be, yes
市 shì city, market
世 shì life, world
室 shì room
式 shì type, style
试 shì tries, tests
士 shì soldier
事 shì things
(Many of the above Chinese words are not complete, but should be written
with one or two more characters. This is just to keep it simple and state
a point).
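To see what a program loses when it drops the tone marks, here is a small Python sketch. It uses the standard unicodedata module. The function name strip_tones and the choice of four words from the list above are mine:

```python
import unicodedata


def strip_tones(pinyin):
    """Remove tone marks by decomposing letters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# One word per tone, taken from the list above.
words = {"师": "shī", "十": "shí", "史": "shǐ", "是": "shì"}

for character, pinyin in words.items():
    print(character, pinyin, "->", strip_tones(pinyin))
```

All four words collapse into the same shi, so teacher, ten, history and is can no longer be told apart. Note that this simple approach also removes the two dots of ü, which you may not want.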
Lots of Logic
Don't think Chinese is a brain dead language. On the contrary, it can be
extremely logical! I'll give you a few examples.
Dates are written in straight logical order. The most significant digits
begin at the left, then significance decreases to the right, just as
in any normal number.
29 July 2012 is written in Chinese in one of the following ways:
- 2012-07-29
- 2012-7-29
- 2012年7月29日
(年 means Year, 月 month, 日 day).
Weekdays are simply numbered. 星期一 means Monday.
The first two characters just mean Weekday, the third character means One!
So Tuesday is 星期二, Wednesday is 星期三
and so on (notice that only the third character changes!).
Strangely though, Sunday is 星期天 (or 星期日).
The last character doesn't mean Seven, it means Heaven.
Months are numbered too. Instead of January you say 一月,
which simply means Month One. Then it continues, and fortunately with no
strange exceptions.
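Since dates, weekdays and months are this regular, formatting them takes only a few lines. The following Python sketch shows one way to do it. The function names are made up for this example, and it ignores locale libraries completely:

```python
from datetime import date

# Monday to Saturday are numbered one to six, Sunday is the exception.
WEEKDAY_NUMERALS = "一二三四五六天"
MONTH_NUMERALS = ["一", "二", "三", "四", "五", "六",
                  "七", "八", "九", "十", "十一", "十二"]


def chinese_date(d):
    """Year, month and day in decreasing significance."""
    return f"{d.year}年{d.month}月{d.day}日"


def chinese_weekday(d):
    """date.weekday() returns 0 for Monday and 6 for Sunday."""
    return "星期" + WEEKDAY_NUMERALS[d.weekday()]


def chinese_month(d):
    """The number of the month followed by the character for month."""
    return MONTH_NUMERALS[d.month - 1] + "月"


d = date(2012, 7, 29)
print(chinese_date(d), chinese_weekday(d), chinese_month(d))
```

In a real application you would probably let the locale support of your platform do this, but it is good to know that nothing magic hides behind it.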
Another example, the character 电 means Electricity. So it appears
in most electrical and electronic words. Examples: 电脑 is
Computer, 电视 is TV etc. Translated literally back to English
they would mean Electrical Brain and Electrical Sight.
Links
Perhaps this is the most important section, as it points out valuable tools
and further reading. Please visit all links, they're really worth it!
At Mandarinposter.com you can download a
scroll of radicals.
A comparison between character types and their internal representation
in Java, C, C#, Python and Ruby.
At Nciku.com you can look up characters,
hand write them, translate, convert to and from pinyin and also learn
Chinese. Really a great site!
Read more about Entities and general HTML at
htmlhelp.com.
You absolutely have to
read
this article! (thanks Joel!).
At www.foolsworkshop.com/ptou/
you can convert from numerical pinyin to symbolic, and also get
the corresponding HTML References.
To convert from Chinese characters to Unicode, see
www.pinyin.info/tools/converter/chars2uninumbers.html.
This is a good page showing which entities to use for representing
Pinyin in HTML. Both numerically and symbolically:
www.math.nus.edu.sg/aslaksen/read.shtml#Writing.
Chinese
Tools has a few handy converters, such as this one, which converts numerical
Pinyin tone marks to symbolic ones, and also to HTML entities.
And finally of course, Unicode's
home page!
Feedback
I'm happy to receive feedback. If you have a non-hacked
LinkedIn account, you can send me
an inline message!
(yes, you can write in Chinese, but I'll probably reply
in English).
In particular, I'm interested in input regarding
- plain errors
- typos
- technical problems, for example, you can't read some of the text,
please specify your computer, OS and software environment with versions
and language packages etc.
- missed topics, what would you like me to add?
- bad grammar
- bad explanations
- anything else
- related links you want me to add
- known pitfalls
Unfortunately, my time is limited so it might take a while between updates.
I work mainly with Linux, so trying things out on Mac or Windows is harder.
If you develop in these environments and want to add advice regarding them,
please let me know and I'll add it (acknowledging the contributor's name
of course).
Acknowledgements
Magnus
Wallin pointed out several typos.
Safia Syed
gave feedback about culture, dialects, keyboard issues
and other useful points.
Andy Furnival gave
me a lot of valuable feedback and additional
points (I stole some of them straight off).
Yuri Tan
added lots of useful info and corrections. I copied some of it here.
Bo Yang reviewed the first version.
THANKS!!