The Making of Chinese Characters

How Chinese characters work

In ancient times, when the Chinese writing system was first developing, characters were simple pictographic representations of objects in the world. However, the system quickly expanded beyond that, and modern-day Chinese is a surprisingly flexible system with a great capacity for the addition of new characters.

The most basic characters are still simple pictures of objects or concepts, for example 木 for tree. However, these make up only a small fraction of the total corpus of Chinese characters. A further small subset of characters is formed by combining these basic forms to express an idea logically, an example being the character for prisoner, 囚, which is the character for man, 人, inside a box.

Such examples are few and far between, however, and most Chinese readers consider them novelties rather than normal features of the language. The overwhelming majority of characters are phono-semantic compounds. These characters are made up of two smaller characters as components, with one component providing the sound and the other providing the meaning. For example, the following characters all contain the character for cyan, 青 “qīng”, which gives them their pronunciations.

清 – “qīng” – Clear

請 – “qǐng” – Ask/request

靜 – “jìng” – Quiet

The pronunciations are far from perfect. Sometimes you can get a good idea of how to pronounce something, but often the chosen components only sound vaguely similar. Chinese languages are tonal, and very often the tone will also vary across characters that share a phonetic. It’s not always possible to tell which part of the character is the phonetic by looking; 青 is usually on the right, but it can be on the left (靜), or even at the base of a character (菁).
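One rough way to see how far the readings in a phonetic series drift is to strip the tone marks from the pinyin and compare what is left. The sketch below does this for the readings quoted in this post; the helper name strip_tone is my own invention, and it assumes the pinyin is written with the usual four tone diacritics.

```python
import unicodedata

# The four Mandarin tone marks as combining characters:
# macron (1st tone), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tone(pinyin):
    """Return the syllable with its tone mark removed (the dots on ü are kept)."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    bare = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", bare)

# Readings taken from the examples in this post.
readings = {"青": "qīng", "清": "qīng", "請": "qǐng", "情": "qíng", "靜": "jìng"}
base = strip_tone(readings["青"])
for char, pinyin in readings.items():
    same = strip_tone(pinyin) == base
    print(char, pinyin, "same syllable" if same else "different syllable")
```

Even in this tiny set, 靜 shares only the final with 青, which is the kind of "vaguely similar" match described above.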

The other component is the semantic component. These have been the focus of most Chinese studies to date, as they are less numerous. Most Chinese characters contain one of only a couple of hundred semantic components. A study in the 2nd century described around 10k characters under 540 semantic components (called radicals); however, this book is somewhat bizarre, as some of the described radicals have no associated characters, and over 150 radicals are each used in only a single character. A large dictionary written in 1716, the Kangxi Dictionary, organised over 40k characters under only 214 semantic radicals, a set which remains the standard today. A minor complicating factor of this approach is that a radical can take different forms depending on where in the character it is used. 人, for example, can be written as 亻 when found on the left of a character.

When non-Chinese people learn the writing system, the characters can often seem impenetrable. 情 means love or emotion, and is made from the semantic radical 忄, meaning “heart”, and the phonetic 青, which gives it the pronunciation qíng. People will often try to come up with elaborate theories along the lines of Chinese people considering love to be a “cyan” colour, when this isn’t really the logic behind the character.

For this project I wanted to try and create a system for identifying and mapping both phonetic and semantic relationships between characters, with the aim of making the learning process less painful. In principle, if you can learn the 青 phonetic component and its pronunciation, then learning all the characters derived from it becomes much easier. If you’re also already familiar with the Kangxi set of semantic radicals, this becomes very easy indeed.

What’s already been done

Phonetic components are difficult. In principle any character can be used as a phonetic, which makes the total number too large to be practical for many purposes, such as writing a dictionary. In addition, there are multiple phonetics with similar pronunciations: 巠 “stream”, amongst other characters, can be used for the same kind of jīng/qīng sound as 青. Looking up an unknown character by its phonetic would therefore be very difficult.

Classification information from various dictionaries has been included in the Unihan database. Unihan is the project to combine all of the different computer encodings used by countries like Japan and China into one common standard that can handle all of the requirements that these different countries have, for example providing the “simplified” characters needed for mainland China as well as the “Shinjitai” forms needed for Japan. I won’t get into the details of Unihan, which is a massive and fascinating project in its own right. There are two metrics within the Unihan database that are relevant to this analysis: the first is the kFenn (Soothill) metric, the second kPhonetic. I will also be using the database as a source for Mandarin pronunciation and basic definitions.

Soothill numbers

One of the best sources for phonetic information available for Mandarin is the Soothill Student’s Dictionary. This dictionary was written by an English missionary to China around the turn of the 20th century. It contains only 4000 common characters, but represents a solid attempt at clustering characters by phonetic component. In Unihan these classifications are listed as number-letter pairs (e.g. 100A), with the number signifying the phonetic group, and the letter the frequency of the character in Mandarin. This information was readily extractable via regex.

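import re

def readSoothill():
    """Map each character with a kFenn entry to its Soothill phonetic number."""
    db = {}
    with open("Unihan_DictionaryLikeData.txt", encoding="utf-8") as unihan:
        for entry in unihan:
            comps = entry.rstrip("\n").split("\t")
            # Skip comments, blank lines, and every field other than kFenn
            if len(comps) != 3 or comps[1] != "kFenn":
                continue
            numbers = re.findall(r"\d+", comps[2])
            if not numbers:
                continue
            # Convert the "U+XXXX" code point into the actual character
            char = chr(int(comps[0][2:], 16))
            db[char] = int(numbers[0])
    return db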
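The letter half of each pair can be kept too, which makes it possible to list the members of a phonetic group alongside their frequency class. This is only a sketch: the function name parse_kfenn is hypothetical, the sample records are invented for illustration (only the 100A shape comes from the description above), and it reads from a list of lines rather than the real Unihan file.

```python
import re
from collections import defaultdict

# Number = Soothill phonetic group, trailing symbol = frequency class.
KFENN = re.compile(r"(\d+)([A-Z*])")

def parse_kfenn(lines):
    """Map phonetic group number -> list of (character, frequency symbol)."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        code, field, value = line.rstrip("\n").split("\t", 2)
        if field != "kFenn":
            continue
        char = chr(int(code[2:], 16))  # "U+XXXX" -> the character itself
        for number, freq in KFENN.findall(value):
            groups[int(number)].append((char, freq))
    return dict(groups)

# Invented sample records in the tab-separated Unihan layout.
sample = [
    "# comment lines start with a hash",
    "U+6E05\tkFenn\t100A",
    "U+8ACB\tkFenn\t100A",
    "U+6728\tkFenn\t200B",
]
for group, members in parse_kfenn(sample).items():
    print(group, members)
```

Grouping by number first means a later step can ask "which characters share a phonetic with this one?" with a single dictionary lookup.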

Splitting up characters, IDSs

Identifying the components that characters are made up of would be a difficult thing to do computationally, but luckily this is not necessary. Unicode already provides a framework for describing Chinese characters by splitting them into their components and describing how they are positioned relative to each other. These ideographic description sequences (IDSs) use specialised ideographic description characters (IDCs) to give a rough idea of how a character is written, and various organisations have used these IDSs for the purposes of efficiently indexing character databases. If you want to find out whether a character is already in a database, for example, it is easy to determine the IDS and then check whether that character (or a similar one) already exists.

A comprehensive set of IDSs is available here. This enabled me to begin work on breaking down Chinese characters into their components. IDSs usually only go one level deep. For example, the IDS for the character 懼 is ⿰忄瞿, the second component of which is in turn made up of ⿱䀠隹, which can then be split even further. For completeness I decided to write my code recursively, to enable every possible connection to be explored.

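# Each IDC maps to the number of components it combines
IDCs = {"⿰":2, "⿱":2, "⿲":3, "⿳":3, "⿴":2, "⿵":2,
"⿶":2, "⿷":2, "⿸":2, "⿹":2, "⿺":2, "⿻":2}

class component():
    def __init__(self, seq):
        self.IDC = None        # layout character, set only for composite nodes
        self.children = []     # nested components or single characters
        self.childlength = 1   # number of children still to be read
        self.skip = 0          # first index not yet consumed by a child
        self.phonetic = 0
        self.remainder = 0     # index of the last character of seq consumed
        for index, s in enumerate(seq):
            if index < self.skip:
                continue       # already consumed by a nested component
            if s in IDCs and index == 0:
                self.IDC = s
                self.childlength = IDCs[s]
                continue
            if s in IDCs:
                # Nested sequence: recurse on the rest of the string
                self.children.append(component(seq[index:]))
                self.skip = self.children[-1].remainder + index + 1
            else:
                self.children.append(s)
                self.skip = index + 1
            self.remainder = self.skip - 1
            self.childlength -= 1
            if self.childlength == 0:
                break

    def __str__(self):
        outstring = []
        if self.IDC:
            outstring.append(self.IDC)
        for child in self.children:
            outstring.append(str(child))
        return "".join(outstring)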
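To show what "going all the way down" looks like, here is a minimal standalone sketch of the recursive expansion. It is not the class above: it uses plain tuples, and the names ARITY, IDS_TABLE, parse, expand and leaves are all hypothetical. The lookup table holds only the two sequences quoted earlier, where a real run would load the full IDS data.

```python
# Number of components each ideographic description character combines.
ARITY = {"⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3, "⿴": 2, "⿵": 2,
         "⿶": 2, "⿷": 2, "⿸": 2, "⿹": 2, "⿺": 2, "⿻": 2}

# Hypothetical lookup table holding only the two sequences quoted above.
IDS_TABLE = {"懼": "⿰忄瞿", "瞿": "⿱䀠隹"}

def parse(seq, pos=0):
    """Parse one component starting at seq[pos]; return (tree, next_pos)."""
    head = seq[pos]
    if head not in ARITY:
        # An ordinary character: expand it through the table if possible.
        return expand(head), pos + 1
    children = []
    pos += 1
    for _ in range(ARITY[head]):
        child, pos = parse(seq, pos)
        children.append(child)
    return (None, head, children), pos

def expand(char):
    """Return char if it is atomic, else (char, IDC, children) fully expanded."""
    ids = IDS_TABLE.get(char)
    if ids is None or ids == char:
        return char
    tree, _ = parse(ids)
    if isinstance(tree, str):
        return char
    return (char, tree[1], tree[2])

def leaves(tree):
    """Flatten a tree into the atomic components it bottoms out in."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[2] for leaf in leaves(child)]

print(expand("懼"))
print(leaves(expand("懼")))
```

Keeping the intermediate character (瞿 here) in the tree matters, because that intermediate level is usually where the phonetic lives.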

Showing connections, networkx, graphviz

At this point we understand the principles of how characters are constructed. The next question is how to explore and show these relationships.

To model the relationships I used the networkx package for Python 3, a simple package for describing graph structures. Using Python 3 also gives Unicode strings by default, which makes working with CJK characters much easier. Although networkx has some intrinsic support for rendering graphs, this isn’t really what it was designed for, and I ran into issues with the underlying matplotlib library not understanding Chinese fonts. I instead decided to output the graphs as .dot format files which I could then render using graphviz.

At this point I decided to factor some more information into these plots. Labels can be written in HTML in graphviz, so I wrote a quick function to render pronunciations and meanings in these labels as well as the characters themselves. The meaning field was difficult, however, as most Chinese characters have large numbers of meanings, or meanings that are incredibly long or unwieldy (see below). To deal with this in the short term, I split all meanings in the Unihan record and returned the shortest, truncating it if it was still over a certain length.
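The label step might look something like the sketch below. The function names, the 20-character limit and the choice to split on semicolons and commas are all assumptions of mine, and the sample definition string is only illustrative. The label uses Graphviz's HTML-like label syntax, where the whole label is wrapped in angle brackets and lines are separated with a BR tag.

```python
import html
import re

def short_meaning(definition, limit=20):
    """Pick the shortest gloss from a definition string, truncated to limit."""
    glosses = [g.strip() for g in re.split(r"[;,]", definition) if g.strip()]
    if not glosses:
        return ""
    shortest = min(glosses, key=len)
    if len(shortest) > limit:
        shortest = shortest[:limit - 1] + "…"
    return shortest

def html_label(char, pinyin, definition):
    """Build an HTML-like Graphviz label: character, reading, short meaning."""
    parts = [char, pinyin, short_meaning(definition)]
    # Escape &, < and > so the text cannot break the label markup.
    return "<" + "<BR/>".join(html.escape(p) for p in parts) + ">"

# Illustrative definition string, not copied from Unihan.
print(html_label("木", "mù", "tree; wood, lumber; wooden"))
```

The returned string would go straight into a node's label attribute in the .dot file, unquoted, since Graphviz treats a value in angle brackets as HTML-like markup.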

I generated a full network using all of the old HSK (Chinese proficiency test) characters, which gave a decent number of common characters to begin with. I then wrote a simple interface to grab all characters linked to a single component, beginning with the woman radical 女. I then worked through its subgraph and checked whether any two linked components shared a Soothill phonetic. If so, I coloured the respective link red; if not, blue. Once complete, I coloured each node based upon the colours of its connections, the idea being that phono-semantic compounds would be purple, and semantic and phonetic radicals would be blue and red respectively. I also set node sizes relative to their degree (the number of connected edges).
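The colouring rule is simple enough to sketch without any graph library. Below, a plain edge list and dictionaries stand in for the networkx graph; the function name colour_graph and the phonetic group numbers are invented for illustration.

```python
def colour_graph(edges, phonetic_group):
    """Return (edge colours, node colours) for component-character links."""
    edge_colour = {}
    seen = {}  # node -> set of colours found on its edges
    for a, b in edges:
        group = phonetic_group.get(a)
        shared = group is not None and group == phonetic_group.get(b)
        colour = "red" if shared else "blue"
        edge_colour[(a, b)] = colour
        seen.setdefault(a, set()).add(colour)
        seen.setdefault(b, set()).add(colour)
    # Mixed red and blue edges mark a compound; otherwise keep the one colour.
    node_colour = {
        node: "purple" if len(colours) == 2 else next(iter(colours))
        for node, colours in seen.items()
    }
    return edge_colour, node_colour

# Invented group numbers; only the shape of the data matters here.
phonetic_group = {"青": 100, "清": 100, "請": 100}
edges = [("青", "清"), ("氵", "清"), ("青", "請"), ("言", "請")]
edge_colour, node_colour = colour_graph(edges, phonetic_group)
for node, colour in node_colour.items():
    print(node, colour)
```

With this toy data the two compounds have one edge of each colour and so come out purple, while 青 has only red edges and the two semantic radicals only blue ones.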

Problems with Soothill

There are some problems with this network: 好 “good” does not take its pronunciation from 女 :-/ It seems that Soothill took some liberties in developing his system. This makes some sense, as 好 is one of the first characters that a student would learn, so learning it as a character derived from 女 isn’t a terrible way to go about things. Unfortunately it’s also completely wrong. This is one of the few characters that is actually just an image of what it represents: a woman with a child is simply meant to be a good image.

So here I factored in the kPhonetic metric to try and deal with these issues, which would also hopefully deal with the large number of characters that aren’t found within the Soothill dictionary. kPhonetic lists the categorisations for each character from the 1980 Cantonese dictionary “Ten Thousand Characters: An Analytic Dictionary”.
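One possible way to combine the two sources is to trust kPhonetic whenever both characters have an entry, and fall back to the Soothill number otherwise. This is a sketch of that policy and not necessarily the rule used for the graphs here. It assumes a kPhonetic value is one or more space-separated class numbers, each optionally followed by a letter or asterisk; the function names and all the numbers in the sample data are invented.

```python
import re

def phonetic_classes(value):
    """Extract the set of class numbers from a kPhonetic-style value."""
    return {int(n) for n in re.findall(r"\d+", value)}

def share_phonetic(a, b, soothill, kphonetic):
    """Decide whether two characters share a phonetic group."""
    if a in kphonetic and b in kphonetic:
        # Both covered by kPhonetic: any class in common counts as shared.
        return bool(kphonetic[a] & kphonetic[b])
    # Otherwise fall back to comparing Soothill numbers, if both exist.
    s_a, s_b = soothill.get(a), soothill.get(b)
    return s_a is not None and s_a == s_b

# Invented numbers, chosen so the two sources disagree about 女 and 好.
soothill = {"女": 5, "好": 5, "青": 100, "清": 100}
kphonetic = {
    "女": phonetic_classes("1000"),
    "好": phonetic_classes("2000 2001*"),
}
print(share_phonetic("女", "好", soothill, kphonetic))
print(share_phonetic("青", "清", soothill, kphonetic))
```

Giving kPhonetic priority lets it overrule a doubtful Soothill grouping, while characters it does not cover still get a link from Soothill where one exists.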

I then created another visualisation, this time using the more common radical “亻” for man. This gave a much more complicated network.

Continuing problems

Whilst the use of kPhonetic fixed some issues with the analysis, others still remain. Some characters, such as 他, are technically phono-semantic compounds. However, they were developed thousands of years ago and have seen changes in the way they are written that make their origins useless to a modern learner. Ideally I would create my own metric for determining whether characters are still pronounced similarly in modern Mandarin, but that will have to come at another time.

Graphviz works well for these individual radical breakdowns, and I’ll probably continue using it for generating those kinds of visualisations. When looking at huge datasets, however, graphviz is insufficient. The graph for the man radical alone was becoming hard to read, and even a simple dataset like the old HSK1 list has so many characters that this problem becomes intolerable (see below for an example). There is also the perhaps larger issue of the meanings being oversimplified. Grabbing individual short meanings from Unihan is not a good way to learn character definitions. We should ideally be aiming for something more useful.

Gephi + sigma.js

Creating an interactive interface seemed the best way to go at this point, and working with HTML would enable me to include much more information. Going straight to my own implementation seemed a little excessive, however.

Gephi enabled me to load my .dot files with minimal fuss. I could then generate graphs and output them using sigma.js. The plugins involved are sadly unfinished, which is a shame as they give strong results. I modified a few lines here and there to improve the quality of the final presentation, reducing the character minimum on the search box from 3 to 1 and changing the label render distances and group names. I also modified the meanings section of my code so that all meanings, joined by “<br>”, were incorporated into the HTML display.

The final display can be seen here.

I’m probably going to put my code up on GitHub or something over the next couple of days. This was a quick project, so the code quality is pretty weak, and I’m going to quickly tidy things up first.
