Changizi, M. A., & Shimojo, S. (2005). Character complexity and redundancy in writing systems over human history. Proceedings of the Royal Society B: Biological Sciences, 272, 267-275. https://doi.org/10.1098/rspb.2004.2942
The idea of universal grammar, usually attributed to Noam Chomsky, proposes that certain structural principles of languages are motivated by the limits of our genetic predispositions and are therefore shared universally across human languages. If languages share common principles, what about human writing?
Mark A. Changizi and Shinsuke Shimojo investigate the topological complexity of scripts (they use the term writing systems to mean the same), looking for regularity in the number of strokes and stroke types used to construct character shapes. They analyse more than a hundred of the world’s scripts and conclude through statistical analysis that the average number of strokes required to construct a character (which they refer to as character length) is approximately three, regardless of the script or the number of characters in the script’s repertoire. The reported average is 2.91, with a standard error of 0.09. Similarly, the authors report that redundancy, i.e. the proportion of stroke combinations that are considered valid characters in a script out of all stroke combinations that are theoretically possible, averages around 50%. Neither the average stroke count nor redundancy varies much as a function of the size of the script’s repertoire.
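To make the quoted summary statistics concrete, here is a minimal sketch of how a mean and its standard error are computed from per-script averages. The input values are invented for illustration and are not data from the paper; the function name `mean_and_sem` is hypothetical, and the paper may have aggregated its data differently.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (SEM).

    SEM is the sample standard deviation divided by the square root
    of the number of observations.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed to estimate the SEM")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical average character lengths (strokes per character) for a
# handful of imaginary scripts. These numbers are NOT from the paper.
character_lengths = [2.5, 3.0, 3.5]

mean, sem = mean_and_sem(character_lengths)
print(f"mean character length = {mean:.2f}, SEM = {sem:.2f}")
```

A small SEM relative to the mean, as in the reported 2.91 and 0.09, indicates that the per-script averages cluster tightly around the overall mean.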
Note that the number of stroke types tends to increase with the size of a script’s repertoire. In other words, in order to produce and handle larger repertoires of characters, humans add new types of strokes rather than construct characters from a larger number of strokes.
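The trade-off between adding stroke types and adding strokes per character can be illustrated with a toy counting model. This is a simplification for illustration only, not the authors' method: it assumes a character is an ordered sequence of strokes of fixed length, that any stroke type may appear in any position, and that a fixed fraction of all possible sequences are valid characters. The function names and the example repertoire sizes are hypothetical.

```python
def stroke_types_needed(n_chars, length=3, valid_fraction=0.5):
    """Smallest number of stroke types B such that
    valid_fraction * B**length >= n_chars (toy model)."""
    b = 1
    while valid_fraction * b ** length < n_chars:
        b += 1
    return b


def length_needed(n_chars, n_types, valid_fraction=0.5):
    """Smallest character length L such that
    valid_fraction * n_types**L >= n_chars (toy model).

    Requires n_types >= 2, otherwise the capacity never grows.
    """
    if n_types < 2:
        raise ValueError("n_types must be at least 2")
    length = 1
    while valid_fraction * n_types ** length < n_chars:
        length += 1
    return length


# Two ways to accommodate a larger repertoire under this toy model:
# keep the length at three and add stroke types, or keep the stroke
# types fixed and make the characters longer.
for repertoire in (26, 500):
    b = stroke_types_needed(repertoire, length=3)
    l = length_needed(repertoire, n_types=4)
    print(f"{repertoire} characters: {b} stroke types at length 3, "
          f"or length {l} with 4 stroke types")
```

Under these assumptions, the pattern the authors report corresponds to the first strategy: the length stays near three while the number of stroke types grows with the repertoire.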
Unfortunately, the description of the methodology in the paper is not sufficiently detailed, perhaps due to limits imposed by the publication. Consequently, the approach to script and character design comes across as naïve in some places. The visual presentation of the scripts is limited to only a few characters from each, which makes the analysis difficult to review. Here are a few considerations that could have been addressed:
- Character repertoires are typically defined per language. How were the repertoires established for scripts which are used by multiple languages or for scripts that combine basic characters to create more complex ones? For example, Indian syllabic scripts, such as Devanagari or Bengali, require several hundred characters to represent complex conjunct syllables and use marks to alter inherent vowels. These characters are not part of basic script illustrations, but still represent a relevant part of the scripts’ repertoire and, importantly, use relatively more strokes. Is it possible that illustrations of scripts from bibliographic sources (Daniels & Bright, 1996) were used with limited understanding of the scripts’ principles?
- Cyrillic and Greek were only studied in their lowercase forms and Latin uppercase and lowercase were studied separately. The reason for this was not provided. It is also unclear how many diacritical characters are included, if any. For example, the way English uses the Latin script requires fewer shapes than, say, Hungarian.
- Chinese and Japanese scripts that make use of topologically complex characters are missing from the study. The authors state that “character and word levels are not cleanly separable” (p. 268). Why this is a problem with respect to character topology is not explained. That said, it is understandable that the authors would not want to tackle character sets as vast as the Chinese one.
- The decomposition is illustrated on skeletal forms of the uppercase from the Latin script. How was it conducted for other scripts and how were the skeletal forms obtained? The simplification process and choices among valid alternates are not discussed. For example: the more complex skeletal forms of “I” and “J” are used without any explanation; there is no explanation of the decomposition of joining scripts such as Arabic, Syriac, or many others where multiple characters join with a single stroke.
Admittedly, critical appreciation of scripts’ design becomes a gargantuan task when dealing with over one hundred of the world’s scripts, but with all these omissions can we still consider the data representative of the world’s scripts?
The study covers 22 numeric scripts alongside the 93 non-numeric scripts. The reported average number of strokes in a character is 1.95 (SEM=0.14) for the numeric scripts. This shows that characters in numeric scripts tend to be topologically simpler. By itself, this is an interesting result.
Considering the ease of reading to be the principal selective pressure on scripts, the conclusion puts forward several explanations for the surprising cross-script constants (numeric or not). These range from limits of short-term memory and principles of character recognition to a visual-ecological explanation that characters’ topology matches that found in objects in natural scenes, which is explored further in a more recent paper (Changizi et al., 2006).
Universal grammar in languages provides grounds for the belief that people from distant parts of the planet are indeed quite similar, that they can ultimately understand each other, and learn each other’s languages. Despite the drawbacks regarding its methodology, the paper shows an exciting way to use statistical analysis to gain general insights regarding the way humans read and write.
Changizi, M. A., Zhang, Q., Ye, H., & Shimojo, S. (2006). The structures of letters and symbols throughout human history are selected to match those found in objects in natural scenes. The American Naturalist, 167, E117-E139. https://doi.org/10.1086/502806
Daniels, P. T., & Bright, W. (1996). The World’s Writing Systems. Oxford University Press.