Chinese for Programmers - Rein's Howtos

Intro

With no knowledge about Chinese at all, it's quite difficult to port software to Chinese, I've noticed.
Having studied Chinese a while, worked in Beijing and Hong Kong, and written millions of source code lines in various programming languages, I feel it's more or less my duty to help others who also struggle with software for the Chinese market.
I don't pretend I'm an expert in Chinese. If you find something wrong, please give feedback!
Apart from this intro, this Howto is intentionally compact. I recommend that you read and learn all the expressions in bold text!
I try to avoid explaining things that should be obvious to professional programmers. But still, if you find anything too short or badly explained, let me know.
I also avoid discussing lots of cultural issues, unless they make sense to software developers. I'm not trying to explain what it's like to work in China. I could tell you lots about that and even recommend many good restaurants and so on. But then this text would lose its focus.
There's no product or anything behind this Howto. The purpose is to share it, get feedback from others, and learn more that way!

/Rein

Quick facts

Here are some quick facts you need to know for starters! I also kill some myths along the way, hopefully. To keep the text compact, some details are explained further in the following Examples section.

  1. You need to consider two variants of Chinese: Mandarin and Cantonese. When you hear those terms, you deal with the spoken aspect of the languages (how the language sounds when you speak it). Although Cantonese is often classified as a dialect, I use the term Language to point out the big difference from Mandarin. I don't want anyone to think Mandarin and Cantonese sound almost the same!
  2. Mandarin is the official language in China, and usually the one you need to consider first. Cantonese is probably second most important.
  3. When it comes to written Chinese, you also have two variants: Simplified and Traditional.
  4. To really ignore lots of details, you can say Simplified characters are used where one speaks Mandarin, and Traditional characters are used where one speaks Cantonese. So Mandarin/Simplified is used in mainland China, while Cantonese/Traditional is used in Macau and Hong Kong. (Guangzhou also speaks Cantonese, but being in mainland China it writes Simplified.)
    Please notice, this is really an over-simplification! If your application targets the Hong Kong market, or you need to support several dialects or character sets, you have to get more details!
  5. The term Mainland China is quite common. It refers to China, excluding Taiwan, Macau and Hong Kong[1] (the New Territories are part of Hong Kong). What remains is a big land area with the same political and economic system, where most people use Simplified characters and Mandarin is the official language, although regions such as Guangdong and Tibet have their own spoken languages. Hong Kong is often written Hong Kong SAR, where SAR means Special Administrative Region.
  6. Guangzhou is a city in Guangdong province and is part of mainland China. It has sub-provincial status and handles its own economic affairs.
  7. PRC means People's Republic of China, which governs Mainland China. There's also a Republic of China, which is Taiwan!
  8. There's no such thing as Hongkongese. In Hong Kong they speak Cantonese!
  9. Taiwan's official spoken language is Mandarin, but they write Traditional characters.
  10. Japanese uses three different scripts. One of them is called Kanji. Kanji are Chinese characters, and many of them look the same as Traditional Chinese, although some have been simplified in Japan's own way.
  11. No, Japanese and Chinese don't understand each other in spoken language, but they may work it out in writing. However, this Howto is not about Japanese at all!
  12. Cantonese and Mandarin differ heavily. They sound like two completely different languages, and you shouldn't expect people from Hong Kong to understand Mandarin or vice versa.
  13. Traditional and Simplified characters have similarities. People who know one can sometimes guess the other. Perhaps Hong Kong people are more likely to figure out the meaning of a written text in Simplified, rather than the other way around. Traditional characters typically have more strokes than Simplified, but the simplification follows some rules so if you learn those rules the guessing isn't so hard. Also, many of the simpler characters look exactly the same in both Traditional and Simplified.
  14. The Chinese written language is character-based rather than alphabet-based.
  15. Traditional characters formed the basis of the Chinese, Japanese and Korean written languages. Chinese and Japanese later simplified many characters, each in its own way, so the simplified forms are not interchangeable between the languages (unlike the Traditional characters).
  16. A piece of advice: To speed up your work, try to learn some very common Chinese characters! Make sure you know exactly how they should look. Pick a word which you know looks different in Traditional vs. Simplified. For example, the word mén, which means door (a closely related character, 们, serves as a plural marker for people). In Simplified, it looks like 门, in Traditional, it is 門.
  17. Modern Chinese is usually written horizontally from left to right, just like English. The traditional layout, in vertical columns read from top to bottom and from right to left, is still common when it comes to poems and ancient texts though. So if you visit China and see some old buildings with text on them, you should read from right to left. As a software developer, you should be aware that you may have to support several layout variants!
  18. There's NOT a one-to-one relationship between English words and Chinese characters. Sometimes there is, but usually you need two or three Chinese characters to represent an English word.
  19. If your application is just producing static written Chinese text, you're lucky! All you have to do is to select a proper font, use UTF-8 as character encoding and make sure you have space enough for 16 pixels in height for each character row.
  20. If your application allows the user to enter Chinese characters, things become harder. At least you need an Input Method (IM), which is usually a plugin. We'll get back to this.
  21. If your application should also speak, you must ensure it can pronounce correctly. If not, Chinese users won't understand it at all. On the other hand, since there are relatively few syllables in Chinese, once you've recorded them all, your speaking application should be able to generate any word!
  22. You can't ignore pinyin! Pinyin looks like western characters, but vowels can also have special strokes (tone marks) on them. The strokes tell how to pronounce those vowels. If ignored, no Chinese person will understand. Pinyin is also used for input of Chinese, which is another reason you must consider pinyin. Read on!
  23. If you can't write the special strokes above Pinyin characters, you can use digits instead after each syllable. For example, er4 instead of a falling slant above the e (èr).
  24. Pinyin is used where Mandarin and Simplified are used. In Hong Kong and Taiwan, similar concepts are used but they are not called Pinyin.
  25. Sometimes you'll see Pinyin without extra strokes. This is typical when you see names of persons and companies. But you can't take a shortcut in your programs and ignore those extra strokes!
  26. A Chinese character is made up of quite well-defined strokes which all have names. For example, a horizontal dash has the name héng.
  27. If you've seen one character, you will soon see that character again but as part of a more complex character. You can think of it like inherited classes in an object oriented language, or as a recursive function.
  28. A confusing fact is that if you see a Chinese character, you can't say what it means unless you have the context! Very few Chinese characters have distinct meanings when they stand alone. Most characters have 4-10 meanings!
  29. There are quite few syllables in Chinese. A syllable is a distinct sound particle. Many westerners think Chinese people can only say Ching Chang Chong. This is not true, there are a few others :). Anyway, Ching Chang Chong are three syllables. As a programmer, you can think of a syllable as a verbose token.
  30. Many times one Chinese character is built by combining two or three other characters. In this case one of the parts is called the radical. Radicals are used for grouping characters together (so you can sort them and locate them in a dictionary). The radical often hints at the meaning of the character, while another part often hints at its sound.
  31. There are just above 200 radicals so you can learn them quickly!
  32. Radicals usually look a little bit different from their original character which they are based on. Some look exactly the same and some totally different.
  33. Grouping Chinese characters together, to show which belong to the same word, is usually only made in children's text books or similar. Adult Chinese are expected to know how to separate words.
  34. In some texts you'll see western style punctuation marks like bang (!) or question mark (?), in other texts you won't.
  35. No, you usually can't tell what a Chinese character means just by looking at it! Only a few Chinese characters look like what they mean. Such characters are called Pictograms. But as a programmer, you couldn't care less!
  36. If you buy a mobile phone other than a smart phone, it probably doesn't support Chinese characters. English is supported in all phones, and most other languages are region dependent. Some phones start by asking you to select a language, and once you've done it they remove the other languages to free up memory space, so you can't get them back. Even on a smart phone where all languages are supported, the keyboard may not allow more than a specific number of languages. But sometimes you can download separate keyboards, one for your own language and another for Chinese. This is quite a mess!
  37. A PC or Mac computer may not support both your own language and Chinese unless you pay for additional languages. If you use Linux you just select the languages you want support for, then wait until they are downloaded. All free of charge. But if you're short of disk space you may want to disable total support for Chinese as it occupies much more space than most other languages. In particular you may not need all those Chinese dictionaries which Linux happily throws at you!
  38. In Linux, there are many input methods (IMs) to choose from, and each of them is the best one! You probably only need one!
  39. There are rules stating in which order the strokes of a Chinese character shall be written. If you use handwriting input, and write the strokes in wrong order, the software may be confused, giving you the wrong character as result.
  40. You can't use character sets like ANSI, ISO-8859-1, Western or ASCII to represent Chinese characters. You have to use Unicode UTF-8. Note: Character sets and Fonts are two different things. The Character set decides how characters are encoded, or stored. Fonts just decide how they look. Not all Fonts can represent all characters of a given Character set though!
  41. Chinese people often use other smileys than westerners.
  42. To print characters from your program, you may have to use alternative functions and methods. For example wprintf() instead of printf(). You should search your API manuals for wide character or multi-byte. You may have to add a few compiler switches, include alternative header files and packages, and maybe even link with other libraries.
  43. If you send characters to a text terminal, raw output may generate wrong Chinese characters. This is because more than one byte must be sent to the terminal to represent one character. If there's a break between the bytes, the wrong symbols may be looked up.
  44. Notice that wide characters, UTF-8 and Unicode aren't only about Chinese. They are needed for most other languages too. In fact, I strongly recommend that you skip other types of encoding (such as ASCII, ANSI and ISO-8859). We live in a global world now!
  45. When creating a new database, remember to set UTF-8 as Character set before you start populating any tables! Else you'll waste a lot of time later.
  46. Chinese characters don't come in upper-case and lower-case variants, there's only one alternative. Same as our digits.
  47. In Unicode, every character is assigned a number (its code point), usually written in hexadecimal. To find those numbers, visit the Unicode web site.
  48. Unicode is NOT a 2-byte representation of characters! Well, in some encodings it sometimes is, but UTF-8 uses anything between one and four bytes to represent a character (the original design allowed up to six)! This means if you write programs which find positions in UTF-8 strings, you can't simply step two bytes back and forth for each character. It will be a complete failure!
  49. When specifying the Chinese language, you should set it to zh-CN (for example in an HTML LANG attribute). The POSIX environment variable LANG uses an underscore and an encoding suffix instead, such as zh_CN.UTF-8.
  50. CJK stands for Chinese, Japanese and Korean. See chapter Fonts below!
  51. Big5 or Big-5 is a character set used in Hong Kong, Macau and Taiwan.
  52. GB is a character set used in mainland China.
  53. Be very careful when using on-line translators, such as Google Translate! Two reasons: 1. They are usually very bad. Try for example to convert a sentence from English to Chinese, and then translate it back again! 2. You'll end up in jail sooner or later! Because you have probably signed an NDA or other contract, stating you're not allowed to reveal classified information. And as soon as you translate that contract, email or whatever, Google doesn't only translate it for you, they also keep it and can do anything they want with it (why shouldn't they, did you really think Google looks after your best interests only and ignores its own opportunities to make money?).
  54. If you translate texts in your software application, never use online translators. Consult a professional translator instead, there are many of those in China and they're usually not very expensive! I already assume you have separated the text from your code, haven't you? After the translation is done, test the application on a native speaker to see if the translation makes sense. Remember, Chinese is a very context dependent language, so too short messages and menu options can be confusing to the user.
  55. To support Chinese characters in your application, use for example wchar_t *wcscat(wchar_t *str1, const wchar_t *str2) instead of strcat().
  56. Character Entity References, or Entities for short, are a special syntax you can use to represent a special character in an HTML form. For example, &Aring; is such an entity referring to the Swedish character Å. Numerical variants can also be used in the format &#number; or &#xhex-number;
  57. In China's ancient times, the Lunar calendar was used. Today, China uses the same calendar as in Western countries. The Lunar calendar is used when it comes to traditional activities, ceremonies, horoscopes and that kind of stuff. So it depends on what kind of application you write, should you consider the Lunar calendar or not.
  58. Number mythology is widespread in China. Something to consider perhaps. 2, 3, 8 and 9 are some good numbers, and 4 is a really bad number (representing death). Never use 4 anywhere! You won't even find a 4th floor in most Chinese high rises!
  59. While we in the west usually put commas or spaces between every third digit to make large numbers more readable, Chinese uses 10000 as a normal base unit. This is why translators sometimes miss or add one digit by mistake. So it makes sense to verify large numbers!
  60. The words for he and she are written with different Chinese characters (他 vs 她) but have the same pronunciation. This can cause translation mistakes also.
  61. Some Chinese grammar is really simple. For example, there are no special words like him or her. Many prepositions used in English are not used in Chinese. One difficult thing though is that there are about 50 different measure words to keep track of. You can't say one car, you have to stick in the correct measure word in between!
  62. The most important IRC/chat program in China is Tencent's QQ. Skype is far less common in China.
  63. The most popular Search engine in China is Baidu.
  64. Weibo is China's most popular social site, to be compared with Facebook in the west.
  65. Don't take for granted that all Western sites can be reached in China, for example, YouTube is usually blocked, and Android Market may also be blocked sometimes.
  66. The top domain for China is cn, for Hong Kong hk, for Taiwan tw and for Singapore sg.
  67. Phone number prefix for China is 86, for Hong Kong it's 852. Note that subscribers sometimes also have to pay for incoming calls in China, and it's common that the operators sell the phone numbers to third parties. Don't be surprised if your mobile phone is spammed!
  68. Sending SMS from applications to mobile phones in Hong Kong can be difficult because there's no way to tell whether a phone number is to an ordinary phone or a mobile. You can't tell by the area code etc.
  69. Chinese keyboards may be different, but they don't have 6000 keys. They are usually similar to ordinary American keyboards but they may have additional shift modes to make it easier to enter pinyin. You can read more about this below.
  70. There's quite a generation gap when it comes to computer usage in China, but there are also local differences. But this isn't just in China! You may consider the fact that younger people want everything to be accessible via their mobile phones. If you spend 2 hours every day on a crowded bus or subway train, you can't use a laptop.
  71. Some Chinese smart phones have two SIM card slots!
  72. Language skills differ! My own experience with manufacturing companies in China is that you often deal with a project manager, usually a woman, whose English is extremely good, while the male programmers you meet may not be that talented in English. In general, younger people are more likely to speak English than the older ones. Since the software industry is young, you can expect Chinese software developers to know enough English!
  73. If you develop embedded software for products to be manufactured in China, you should talk to someone who has this experience. There are many pitfalls, you have to do your homework carefully! But I think this is a bit out of the scope of this article.
  74. Currency: PRC and Hong Kong have different currencies. The exchange rate is quite similar. PRC uses Renminbi (RMB). Hong Kong uses Hong Kong dollars (HKD). The RMB base unit is the Yuan, which is usually called Kuai.
  75. During 2009 and 2010 China came up with their own Wireless LAN standard, WAPI, which caused a lot of frustration, confusion and anger in the computer industry. Fortunately, it seems this is no longer an issue.
  76. If your application targets ALL Chinese users, things become more difficult, at least with speech based applications. There are many dialects in China, many so different from Mandarin that people from different regions won't understand each other.
  77. This shouldn't be about culture or food, but I just have to add a warning: WHATEVER you do if you visit China or Hong Kong DON'T eat at Western restaurants!!! The Chinese food is really outstanding, being there, and eating at McDonald's, is a crime!!!
[1] Since summer 1997, Hong Kong belongs to PRC, but it still has its own legislation and economic system, which is why it's called the SAR of Hong Kong. As a software developer, you can probably skip the politics and focus on the differences between the dialects and characters.
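
To make the point about variable-length encoding concrete, here is a minimal Python sketch (Python is used only for illustration, the Howto doesn't prescribe a language). It counts how many bytes a few sample characters take in UTF-8 and finds the byte offsets where characters start. The helper name utf8_char_starts and the sample strings are made up for this example.

```python
# Byte lengths of single characters in UTF-8.
# The sample characters are arbitrary picks: A, Å, 中 and an emoji.
samples = ["A", "\u00c5", "中", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]


def utf8_char_starts(data):
    """Return the byte offsets where each encoded character begins.

    UTF-8 continuation bytes always have the bit pattern 10xxxxxx,
    so any byte that does not match that pattern starts a character.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


text = "Hello 你好"
encoded = text.encode("utf-8")
starts = utf8_char_starts(encoded)

print(lengths)                   # bytes per sample character
print(len(text), len(encoded))   # characters vs. bytes
print(starts)                    # where each character begins
```

The two Chinese characters take three bytes each, so stepping a fixed number of bytes through the string would land in the middle of a character.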

Examples

Here are some examples from the above facts. Since I'm not sure you have already installed support for Chinese characters on your computer, all pinyin and Chinese characters are represented by images.

  1. 中国 is a Chinese word built up of two characters. The first, zhōng, means middle. The second, guó, means country, nation and kingdom. So this word means China!
  2. zhōngguó is the pinyin variant of the above word 中国. Notice the flat stroke above the first o and the acute accent above the second o. These symbols tell you that the first o shall be pronounced level, without tone change, and the second o with a short rising tone. There are five main tone variants (four tones plus a neutral one). Zhong is one syllable, and guo is the other.
  3. To write the above pinyin using numerical format, write zhong1guo2. 1 means flat sound, 2 means rising, 3 means falling, then rising and 4 means falling.
  4. 功夫 means Kung-Fu, a martial arts sport. It contains two(!) characters. The first, 功, is built from two parts. The left part, 工, looks a bit like an I and gives the whole character its pronunciation. The right part, 力, is the radical the character is filed under in dictionaries, and it adds a special meaning (strength). Pinyin for the first character is gōng. The second character is pronounced fu.
  5. gonglifu is not the same as the above word! It does not mean the martial arts sport Kung-Fu! It does not contain two characters, it contains three! What does it mean? It would mean "kilometer man"! What's that?
  6. 汽车 is the simplified word for "car". Corresponding pinyin is qìchē. The word for car with Traditional characters is 汽車. It would be wrong to write a corresponding pinyin word here since pinyin isn't used with Traditional. However, you pronounce it something like hei-che in Cantonese.
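
The numerical pinyin format from the examples above can be converted to tone marks with a few lines of code. This is only a sketch under some assumptions: input is lowercase, each syllable is followed by its tone digit, and the mark goes on a or e if present, on the o in ou, and otherwise on the last vowel. The function names mark_syllable and numbered_to_marked are made up for this example.

```python
import re
import unicodedata

# Combining accents for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # \u00fc is u with two dots (ü)


def mark_syllable(syllable, tone):
    """Put the tone mark on one lowercase pinyin syllable."""
    if tone not in COMBINING:
        return syllable  # tone 5 means neutral tone: no mark
    if "a" in syllable:
        pos = syllable.index("a")
    elif "e" in syllable:
        pos = syllable.index("e")
    elif "ou" in syllable:
        pos = syllable.index("ou")
    else:
        positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
        if not positions:
            return syllable
        pos = positions[-1]
    # NFC normalization merges vowel + accent into one precomposed character.
    marked = unicodedata.normalize("NFC", syllable[pos] + COMBINING[tone])
    return syllable[:pos] + marked + syllable[pos + 1:]


def numbered_to_marked(text):
    """Convert text such as 'zhong1guo2' to pinyin with tone marks."""
    return re.sub(
        "([a-z\u00fc]+)([1-5])",
        lambda m: mark_syllable(m.group(1), int(m.group(2))),
        text,
    )


print(numbered_to_marked("zhong1guo2"))
print(numbered_to_marked("qi4che1"))
```

Real text needs more care (capital letters, apostrophes between syllables, v typed instead of ü), so treat this as a starting point only.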

Nihao Example

In this separate example I have written the word Hello in English, Simplified Chinese, and Pinyin. It looks like this:

Hello, 你好, nǐhǎo.

If you have installed Chinese support correctly, it should look the same if you download and open these files and print them out with their corresponding tools or commands:
Plain text, OpenOffice Word, MS Word 97/2000/XP, MS Word 97 XML, Rich Text Format, PDF, HTML.

I really recommend that you inspect these files with a binary editor like Hexedit, or print them out with the Linux command od -c filename.
If, for example, you just try to open file nihao.txt directly by clicking it in the browser, you probably see something like this:

ä½ å¥½ï¼

This is because there's no information in the file telling your browser which Character encoding is used, so it probably picks ISO-8859 and finds those strange characters here instead. All the other files carry information about which Character encoding is to be used. When you write your own programs you must make sure you specify Character encoding properly! For example, in the HTML file you'll see specifications like


<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<SPAN LANG="zh-CN">
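
The garbled text shown earlier is easy to reproduce. The following Python sketch takes the Chinese line of nihao.txt (assumed here to be 你好 followed by a full-width exclamation mark), encodes it as UTF-8 and then decodes the bytes with the wrong character encoding, which is what a browser does when it guesses ISO-8859-1.

```python
greeting = "你好\uff01"            # \uff01 is the full-width exclamation mark
raw = greeting.encode("utf-8")    # the bytes as stored in the file

# Wrong: ISO-8859-1 maps every single byte to one character.
garbled = raw.decode("iso-8859-1")

# Right: decode with the encoding the file was written in.
restored = raw.decode("utf-8")

print(len(greeting), len(raw))    # characters vs. bytes
print(repr(garbled))
print(restored)
```

Nothing is lost as long as the bytes are untouched: encoding the garbled string back to ISO-8859-1 and decoding it as UTF-8 gives the original text again.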

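If you print out the contents of file nihao.txt in raw 8-bit byte format, you'll get mostly garbage after the first line. Therefore I'll show you how it looks with three different representations, corresponding to the switches c, b and x to the od command. The output format is symbolic/octal, octal bytes and hexadecimal respectively. Note that od -x groups the bytes into 16-bit words and, on a little-endian machine, prints each pair swapped: the first word 6548 is H (48) followed by e (65).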


0000000   H   e   l   l   o   !  \n 344 275 240 345 245 275 357 274 201
0000020  \n   n 307 220   h 307 216   o  \n

0000000 110 145 154 154 157 041 012 344 275 240 345 245 275 357 274 201
0000020 012 156 307 220 150 307 216 157 012

0000000 6548 6c6c 216f e40a a0bd a5e5 efbd 81bc
0000020 6e0a 90c7 c768 6f8e 000a
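
If you don't have od at hand, a dump like the second one above can be produced with a few lines of Python. This is a sketch, not a replacement for od: the file contents are typed in as a string (my assumption of what nihao.txt contains, based on the dumps), and the function name octal_dump is made up.

```python
# Assumed contents of nihao.txt; \uff01 is the full-width exclamation mark,
# \u01d0 and \u01ce are i and a with the third-tone mark.
text = "Hello!\n你好\uff01\nn\u01d0h\u01ceo\n"
data = text.encode("utf-8")


def octal_dump(data, width=16):
    """Return lines with an octal offset followed by octal byte values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        octal = " ".join(f"{b:03o}" for b in chunk)
        lines.append(f"{offset:07o} {octal}")
    return lines


for line in octal_dump(data):
    print(line)
```

Each Chinese character shows up as three bytes, and each pinyin vowel with a tone mark as two.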

Some editors and word processors are smart enough to ask you what character encoding is to be used if they can't figure it out. So if you try to open the file nihao.txt and get a dialog popping up, try these settings:
Image missing


Test Your Environment

Now it's time you test your environment, to make sure you can work properly with Chinese characters.

Chinese Character Rendition

Let's start with your desktop and browser. Can they render Chinese characters at all?

What do you see here?
你好
Did you see
NOTHING?
If these texts don't show the same characters, or you just saw the word NOTHING, at least your browser doesn't support Chinese characters. But it might be that not even your operating system does, or you may not have proper fonts installed. To find out, open a word processor and try to open any of the following files, in a format your word processor can normally handle:
OpenOffice Word, MS Word 97/2000/XP, MS Word 97 XML, Rich Text Format
It shouldn't matter which font you select since the word processor automatically picks a font which can handle the text properly, not always the font you have decided.
If you still see garbage, and not proper Chinese characters, you have to install support for Chinese characters. Sometimes this comes in packages called Asian language support or similar. On Windows and Apple computers you may have to pay for such packages. On Linux, just look under System/Administration/Language Support or similar
Image missing
(there are many variants (distros) and versions of Linux, so I can't point out exactly where to look. Your distribution probably has some online help, so search for i18n, internationalization or language in it).

From this point in the text, I will start using Chinese characters directly. If you see "garbage" where you should see Chinese characters, go back and install support for Chinese on your computer!

E-mail

Next, test your email client! Try to email the word 你好 to yourself. Just copy it from the last sentence! It should look like 你好 both when you send it and receive it. If it doesn't, try to configure your email client to use UTF-8 character encoding. If it still doesn't work, check whether your mail server lacks support for UTF-8. If it does and you don't own it, demand support for UTF-8 or change ISP. They should know better these days!

For example, the web mail client SquirrelMail, provided by my ISP Levonline shows &#20320;&#22909; in both the subject line and message body, instead of nice Chinese characters. Fortunately, Levonline also provides a web mail client named Round Cube, which handles UTF-8 correctly.

Input Methods

Now comes the difficult part - How to enter Chinese characters! Chinese users don't have keyboards with 6000 keys! In fact, they can even use the few keys on a mobile phone by typing in pinyin. Since the arrival of Smartphones, they can also use Handwriting by simply drawing the characters on the display. Both methods require an IM. An IM is a helper application which interprets the entry and converts it to Chinese characters, then passes it on to the application that currently has the focus, for example a word processor. You typically install your IM once, then switch it on and off as you need it. Switching is usually done with some accelerator key combination, like Ctrl-Space. There are several IMs available for free. One of them is Ibus. When Ibus is installed and activated, it shows a small icon in the status bar:
Image missing
Start your word processor now. Don't click any other application! Make sure your word processor is ready for input and has the focus. Click the Ibus icon, it lists the supported languages:
Image missing
Select Chinese Pinyin:
Image missing
Now, click in the text body area of your word processor, and type the sequence bu. Already when you press the b, a tiny dialog should appear, listing those Chinese characters that match your pinyin:
Image missing
Select the character you want by pressing the corresponding digit, clicking on it, or scrolling down to it and pressing the space bar. You can also type more pinyin.
Notice that you must end your entry with the space bar. Otherwise your entry will be ignored. If you did it right, your Chinese characters should now be visible in your document!
To switch to native language input, just press Ctrl-Space again. Next time you press it, the last input method is restored, so you (usually) won't need to select it again.

Entities

In HTML forms you can use Entity References to represent special characters. I'll just give a few examples. 中 国 can be encoded using the following entities on a web page: &#x4e2d; &#x56fd;
But how did I know? Well, what I did in this case was to download a PDF document from www.unicode.org/charts/#symbols. Click on CJK Unified Ideographs (Han) (31MB).
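
You don't have to look up every character in the charts by hand. As a small sketch, Python's built-in ord() gives the Unicode number of a character, and the standard html module can decode entities back to text. The function name to_entities is made up for this example.

```python
import html


def to_entities(text):
    """Replace every non-ASCII character with a hexadecimal entity."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in text
    )


encoded = to_entities("中国")
decoded = html.unescape("&#20320;&#22909;")  # decimal entities work as well

print(encoded)
print(decoded)
```

The second call decodes the decimal entities that the SquirrelMail example above displayed instead of Chinese characters.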

Non-Unicode Applications

There are unfortunately still some applications that don't support Unicode (hopefully none of yours!). So before you decide you have failed to install Unicode and Chinese support, try some other application!


Fonts

The font you use must be able to represent those Chinese characters you wish to display. Otherwise you'll just see empty squares, question marks etc.

CJK Fonts

When looking for fonts capable of rendering Chinese characters, try to find CJK for starters! Use the GB character set for mainland China and Big5 (Big-5) for Hong Kong (and Taiwan and Macau).

xfontsel

You can use the tool xfontsel (downloadable for free, runs at least under all X11 GUIs, such as on Linux and Unix). It lets you see all the fonts you have installed and test all variants of them. When you start it, it looks like this:

Image missing

I recommend you to go from left to right. Left-click on fndry, and hold the mouse key down. Now you see a list of various font foundries. Move the mouse to any of them, for example bitstream if you have installed that font. Release the mouse key. You should now see the characters represented by this font. As you can see, the bitstream doesn't contain any Chinese characters, so it might be a bad choice. However, this isn't proof enough since this is just a sample text. You can use switches to tell xfontsel to use another sample text that contains Chinese characters.
Image missing

Xfontsel can sometimes be a bit annoying. Suddenly, all options are gray so you can't select them. If this happens, just make sure you select the first option, the asterisk (*) for all attributes!

In this example, the font ISAS is shown. I have painted some characters magenta to point out a very useful rule I've discovered:

Chinese characters don't contain rounded shapes!

except for the character I've colored red. You may see this sometimes amid ordinary Chinese characters, in particular in company names, commercials and so on.
Image missing

If you see round strokes like those magenta colored ones, a good guess is that they are Japanese Hiragana Syllables. If you see real circles combined with really hard shapes with quite few strokes, it's probably Korean!

Note though, that all characters on the last two lines of the above picture are real Chinese characters.

If you wish to examine a particular font with Xfontsel, you can restrict it to that font by starting it up with the -pattern option, for example: xfontsel -pattern '-adobe-*-*-*-*-*-*-*-*-*-*-*-*-*'

LATEX

First, let's produce some simple pinyin in Latex which should look like this:

yī, èr, sān, sì, wǔ, liù, qī, bā, jiǔ, shí

These are the first 10 digits in Chinese pinyin (well, zero is omitted and 10 is added as it has its own symbol which looks like a plus. I know, it's not a digit but a number. Sorry!).
In plain Latex, you can code the above as:


y\={i}, \`{e}r, s\={a}n, s\`{i}, w\v{u},
li\`{u}, q\={i}, b\={a}, ji\v{u}, sh\'{i}

It's not perfect, as you may get dots also above the i's. Using the dotless \i instead, as in y\={\i}, avoids that.

Now, add the following line to the Preamble of your document (if you know Latex, you know what Preamble means):


\usepackage{pinyin}

Add the following to your document somewhere in the body section:


\yi1, \er4, \san1, \si4, \wu3,
\liu4, \qi1, \ba1, \jiu3, \shi2

This should show perfect pinyin!

Now let's get some true Chinese characters also. Add this to the Preamble:


\usepackage{CJK}

In the part of the document body where you want to write Chinese characters add the line


\begin{CJK*}{UTF8}{gkai}

Write Chinese characters here. Use your Input Method. For example, write 一二三四五六七八九十
Finally finish with


\end{CJK*}

Note, gkai above is just a font. It's usually a good option but you can of course specify another font!

If this after all didn't work, maybe you need to install the following packages:


Pronunciation Hell

Now that you know a little more about Pinyin, let me point out why pronunciation is important! Here's a list of words all sounding like shi. But there are four different pronunciations meaning different words. And even if you don't know the context, or see the Chinese character, the sound can still mean lots of different things!



Tone 1:
师	shī	teacher
Tone 2:
十	shí	ten
时 	shí	time
食	shí	eat
实 	shí	reality
Tone 3:
始 	shǐ	begins
史 	shǐ	history
Tone 4:
视 	shì	watch, inspect
适 	shì	fit
是 	shì	is, am, be, yes
市 	shì	city, market
世	shì	life, world
室	shì	room
式 	shì	type, style
试	shì	tries, tests
士 	shì	soldier
事 	shì	things

(Many of the above Chinese words are not complete, but should be written with one or two more characters. This is just to keep it simple and state a point).
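
Here is a small Python sketch of what happens if a program throws the tone marks away. It removes only the four tone accents (and keeps the two dots of ü), then groups a few words from the list above by their toneless spelling. The function name strip_tones and the choice of sample words are mine.

```python
import unicodedata
from collections import defaultdict

# Combining accents used as tone marks: macron, acute, caron, grave.
TONE_ACCENTS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin):
    """Remove tone marks from pinyin but keep other accents such as ü."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_ACCENTS)
    return unicodedata.normalize("NFC", kept)


# A few entries from the list above: character, pinyin, meaning.
words = [
    ("师", "shī", "teacher"),
    ("十", "shí", "ten"),
    ("史", "shǐ", "history"),
    ("是", "shì", "is, am, be, yes"),
]

by_toneless = defaultdict(list)
for character, pinyin, meaning in words:
    by_toneless[strip_tones(pinyin)].append(meaning)

print(dict(by_toneless))
```

All four words end up under the same key, shi, so a program that ignores the tones can no longer tell teacher from ten.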

Lots of Logics

Don't think Chinese is a brain dead language. On the contrary, it can be extremely logical! I'll give you a few examples.

Dates are written in straight logical order. The most significant digits begin at the left, then significance decreases to the right, just as any normal numbers.
29 July 2012 is written in Chinese in one of the following ways:

2012年7月29日
二〇一二年七月二十九日

(年 means Year, 月 month, 日 day).
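
In code, this date format needs no special library. A minimal Python sketch, using the standard datetime module and a made-up function name chinese_date, assuming the common variant with Arabic digits:

```python
from datetime import date


def chinese_date(d):
    """Format a date as year, month and day, each followed by its marker."""
    return f"{d.year}年{d.month}月{d.day}日"


formatted = chinese_date(date(2012, 7, 29))
print(formatted)  # 2012年7月29日
```

If your platform has proper locale support installed, its own date formatting functions are of course the better choice.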

Weekdays are simply numbered, 星期一 means Monday. The first two characters just mean Weekday, the third character means One! So Tuesday is 星期二, Wednesday is 星期三 and so on (notice that only the third character changes!). Strangely though, Sunday is 星期天. The last character doesn't mean Seven, it means Heaven.

Months are numbered too. Instead of January you say 一月, which simply means Month One. Then it continues, and fortunately there are no strange exceptions.
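
Because weekdays and months are just numbered, generating their names takes only a lookup string. This Python sketch assumes the usual order with the numeral before 月 (一月 for January) and uses 星期天 for Sunday as described above. The function names are made up for this example.

```python
from datetime import date

NUMERALS = "一二三四五六七八九十"  # one to ten


def chinese_weekday(d):
    """Return the weekday name; date.weekday() counts Monday as 0."""
    n = d.weekday()
    return "星期天" if n == 6 else "星期" + NUMERALS[n]


def chinese_month(month):
    """Return the month name for month numbers 1 to 12."""
    if month <= 10:
        numeral = NUMERALS[month - 1]
    else:
        # 11 and 12 are written as ten-one and ten-two.
        numeral = "十" + NUMERALS[month - 11]
    return numeral + "月"


day = date(2012, 7, 29)
print(chinese_weekday(day), chinese_month(day.month))
```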

Another example: the character 电 means Electricity, so it appears in most electrical and electronic words. Examples: 电脑 is Computer, 电视 is TV etc. Strictly converted back to English they would mean Electrical Brain and Electrical Sight.


Links

Perhaps this is the most important section, as it points out valuable tools and further reading. Please visit all links, they're really worth it!

At Mandarinposter.com you can download a scroll of radicals.

A comparison between character types and their internal representation in Java, C, C#, Python and Ruby.

At Nciku.com you can look up characters, hand write them, translate, convert to and from pinyin and also learn Chinese. Really a great site!

Read more about Entities and general HTML at htmlhelp.com.

You absolutely have to read this article! (thanks Joel!).

At www.foolsworkshop.com/ptou/ you can convert from numerical pinyin to symbolic, and also get the corresponding HTML References.

To convert from Chinese characters to Unicode, see www.pinyin.info/tools/converter/chars2uninumbers.html.

This is a good page showing which entities to use for representing Pinyin in HTML. Both numerically and symbolically: www.math.nus.edu.sg/aslaksen/read.shtml#Writing.

Chinese Tools has a few handy converters, such as this one, which converts numerical Pinyin tone marks to symbolic ones and also to HTML entities.

And finally of course, Unicode's home page!


Feedback

I'm happy to receive feedback. If you have a non-hacked LinkedIn account, you can send me an inline message! View Rein Ytterberg's profile
on LinkedIn (yes, you can write in Chinese, but I'll probably reply in English).

In particular, I'm interested in input regarding

Unfortunately, my time is limited so it might take a while between updates. I work mainly with Linux, so trying things out on Mac or Windows is harder. If you develop in these environments and want to add advice regarding them, please let me know and I'll add it (acknowledging the contributor's name of course).

Acknowledgements

Magnus Wallin pointed out several typos.
Safia Syed gave feedback about culture, dialects, keyboard issues and other useful points.
Andy Furnival gave me a lot of valuable feedback and additional points (I stole some of them straight off).
Yuri Tan added lots of useful info and corrections. I copied some of it here.
Bo Yang reviewed the first version.

THANKS!!

