High-quality design can only happen if technology allows it. I therefore make a distinction between design for a language and support for a language. The former deals with design quality while the latter describes the technological requirements for even attempting to achieve quality. In the digital context, language support means the software's ability to type, encode, and render texts on screens or printers. Diverse scripts pose diverse and often complex requirements when it comes to rendering. This chapter discusses digital texts at the level of codes, moving from keys to characters and words. The next chapter will deal with paragraph composition and rasterization, i.e. the conversion of vector contours to pixels. In keeping with the spirit of this series, I will try to generalise from the intricate specifics to provide an introductory overview. See further reading and references for detailed descriptions of contemporary solutions.
From shapes to codes
Most modern software represents texts as sequences of numeric code points that correspond to individual characters from a script. The code points are listed in a shared database called an encoding that serves as a key for interpreting those code points as characters. The most commonly used encoding today is probably Unicode, which covers the great majority of the world’s scripts and is regularly updated (The Unicode Standard, 2021).
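To make the idea of code points more tangible, here is a minimal sketch using Python's standard library. The helper name `describe` and the sample word are my own choices for illustration; the sketch simply lists the Unicode code point and the official character name for each character of a short text.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The Czech word "Číst", written with escapes so the precomposed
# characters U+010C and U+00ED are used unambiguously.
for row in describe("\u010c\u00edst"):
    print(row)
```

Software never stores the letter shapes themselves, only these numbers. How each number ends up looking on screen is decided later by the font and the rendering software.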
Many pre-Unicode encodings for the Latin script represented characters as 8-bit code points, i.e. one byte per character, and therefore could not include more than 256 code points. Due to this limitation, encodings were devised based on the needs of a particular script, language, or a small group of languages. For example, the ISO 8859-1 encoding defines 191 graphic characters which cater primarily for Western European languages.
Interpreting a text with the wrong 8-bit encoding leads to a character mismatch (see Table 1) or other text processing errors. Unicode solves this issue by providing a single database that can contain a multitude of character code points (The Unicode Standard, 2021, p. 14). As a consequence, Unicode allows for multilingual and multi-script texts, almost universally.
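Table 1: The same Czech text, stored in the Central European encoding, interpreted using three different 8-bit encodings.
|Encoding used for interpretation|Resulting text|
|Central European (ISO 8859-2)|Vícejazyčná přednáška|
|Western European (ISO 8859-1)|Vícejazyèná pøedná¹ka|
|Baltic (ISO 8859-4)|Vícejazyčná pøednáška|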
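The mismatch in Table 1 can be reproduced with a few lines of Python. This is only a sketch: it assumes the text was originally stored as ISO 8859-2 bytes, and it uses Python's built-in codec names (`iso8859_2`, `iso8859_1`, `iso8859_4`). The function name `interpret` is mine.

```python
def interpret(raw_bytes, codec):
    """Decode the same bytes under a given 8-bit encoding."""
    return raw_bytes.decode(codec)


# Store the text as one byte per character in the Central European encoding.
raw = "Vícejazyčná přednáška".encode("iso8859_2")

for label, codec in [
    ("Central European", "iso8859_2"),
    ("Western European", "iso8859_1"),
    ("Baltic", "iso8859_4"),
]:
    print(f"{label}: {interpret(raw, codec)}")
```

The bytes never change. Only the key used to interpret them does, which is why the letters that occupy the same positions in all three encodings (such as í and á) survive while others are replaced.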
The ability to encode texts is an essential requirement for language use on computers, from typing and word processing to online search. Therefore, due to Unicode’s dominance, the inclusion of all necessary characters in its database is a de facto requirement for a language’s survival in the digital environment.
In order to assign code points to characters, researchers working with Unicode deliberately look away from region-specific, language-specific, and individual preferences regarding the character shapes and script use (The Unicode Standard, 2021, p. 15). As these preferences can be disputed and evolve, their inclusion in Unicode can be challenging. Also, it may be hard or even impossible to establish who has authority over the appearance of a particular script or language.
This leaves designers with yet another research challenge: when designing for a specific audience, they need to become familiar with various visual preferences that exist outside Unicode’s specification and choose those appropriate for the job. Typically, this is a question of choosing the right font or setting the font so that it produces the required character shapes (see Table 2).
Table 2: Both rows contain the same code points; the differences in shape only appear when the text is rendered with a font set for the respective language.
|Ukrainian shape preference|Д|Л|в|г|д|ж|з|и|к|л|п|т|ц|ш|щ|ю|
|Bulgarian shape preference|Д|Л|в|г|д|ж|з|и|к|л|п|т|ц|ш|щ|ю|
From keys to codes
In order to type digital texts using a keyboard, you need a keyboard layout that maps the keys on the keyboard to corresponding code points. A language or script can have multiple keyboard layouts that correspond to different conventions or encodings (see Figure 1). Contemporary operating systems allow users to switch between these layouts effortlessly and type texts in multiple scripts.
Many languages require large repertoires of code points that cannot all fit on a single keyboard. Keyboard layouts address this by providing control keys, such as Shift or Alt, and a dead-key mechanism (a key that produces no output by itself but modifies the key pressed after it). Together, these increase the number of code points accessible through a single keyboard. Consequently, pressing multiple keys (at the same time or sequentially) can result in an input of one or more code points. Alternatively, a single key can input multiple code points (see Figure 2).
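The following sketch illustrates how a layout might turn key presses into code points. Everything in it is hypothetical: the key names, the tiny `LAYOUT` and `DEAD_KEYS` tables, and the function `type_keys` are invented for illustration and do not describe any real operating system's layout format.

```python
# Hypothetical layout fragment: key press -> string of code points.
LAYOUT = {
    "e": "e",
    "shift+e": "E",
    # One key that inputs three code points at once (KA + VIRAMA + SSA).
    "q": "\u0915\u094d\u0937",
}

# Hypothetical dead key: produces nothing itself, modifies the next key.
DEAD_KEYS = {
    "dead_acute": {"e": "\u00e9", "E": "\u00c9"},
}


def type_keys(presses):
    """Convert a sequence of key presses into a string of code points."""
    output = []
    pending = None  # dead key waiting for the next press
    for press in presses:
        if press in DEAD_KEYS:
            pending = press
            continue
        chars = LAYOUT[press]
        if pending is not None:
            # Use the combined character if one is defined; in this
            # simplified sketch the dead key is otherwise dropped.
            chars = DEAD_KEYS[pending].get(chars, chars)
            pending = None
        output.append(chars)
    return "".join(output)


print(type_keys(["dead_acute", "e"]))   # two presses, one code point
print(len(type_keys(["q"])))            # one press, three code points
```

Whatever the mechanism, the result handed over to the rest of the system is always a plain sequence of code points.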
When using an on-screen keyboard (or similar input method) on touch-screen devices, such as phones or tablets, the keyboard appearance changes completely to show relevant characters.
When using a physical keyboard connected to a computer and originally designed for a single language, you can still switch among multiple layouts. However, typing in scripts that are not represented on the key labels may feel like a game of blind man’s buff. An on-screen keyboard preview can help here.
Notably, there are other input methods besides keyboards, such as handwriting recognition, predictive completion, or speech-to-text processing. However, in Unicode-based environments, each of these methods produces a sequence of code points for further processing.
From codes to word shapes
Rendering a text into its visual representation is coordinated across an operating system, the fonts used, and a typesetting application. It is important to note that specific software implementations may approach text rendering differently, which makes it challenging to ensure good support and quality control. However, the goal stays the same: converting sequences of code points into word shapes (clusters of character shapes) following the orthographic principles of a given script and combining these words into paragraphs.
These are the key software components involved in text rendering:
- a digital font that controls the visual appearance of the individual characters. A font contains a collection of geometric shape descriptions (glyphs) that typically consist of contours constructed from Bézier curves. A font also includes instructions regarding the glyphs’ positioning and instructions for their combination. Some of the glyphs are mapped onto code points directly while others serve as alternatives or parts that are assigned through programmed instructions in the font. In order to support a language properly, a font needs to cover the necessary code points and include instructions that help to represent the corresponding script correctly during shaping.
- a shaping engine that combines the glyphs from a font to compose words while relying on the instructions in the font and following the script’s orthographic principles.
- a paragraph composer that sets words one after the other and deals with line-breaking, paragraph alignment, justification, hyphenation, and other operations that relate to paragraph setting. Paragraph composition will be discussed in the next chapter.
To provide a simple example: rendering an English word means that a sequence of code points is converted into glyphs which are laid out one by one, set from left to right following the writing direction (see Figure 3). However, the orthographic principles of many of the world’s scripts are more diverse, which means that text rendering can get very complex. The following will discuss some of the challenges that need to be dealt with when rendering these scripts.
Note that while it is useful to discuss word shaping first, the shaping engine and paragraph composition are linked and influence each other, e.g. when hyphenating words at the end of a line.
Firstly, the shaping engine has to consider the directionality of a script, i.e. the writing direction and the general order of the characters in a word (horizontal: left-to-right, right-to-left, or vertical: top-to-bottom). See Figure 4 for selected examples.
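As a rough illustration of how software can infer horizontal direction from the code points alone, the sketch below looks for the first character with a strong directional class. It is loosely modelled on the first-strong heuristic and is far simpler than what real engines do. The function name `base_direction` is my own, and vertical writing is not covered because it is not a property of individual code points.

```python
import unicodedata


def base_direction(text):
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # right-to-left, e.g. Hebrew or Arabic
            return "rtl"
        if bidi_class == "L":          # left-to-right, e.g. Latin or Cyrillic
            return "ltr"
    return "ltr"  # assumption: default when no strong character is found


print(base_direction("word"))                       # Latin
print(base_direction("\u05e9\u05dc\u05d5\u05dd"))   # Hebrew
print(base_direction("\u0633\u0644\u0627\u0645"))   # Arabic
```

Digits, spaces, and punctuation have no strong direction of their own, which is why the loop skips over them.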
Secondly, the shaping engine has to deal with code-point-to-glyph mapping and glyph interactions. The mapping between code points and glyphs is not direct: a single code point can be represented by one or more glyphs and a single glyph can represent multiple code points. Moreover, the way a code point is represented may depend on the context. In Arabic, for example, the same code point may be translated into a different glyph depending on the joining behaviour of the character the code point represents and its immediately adjacent characters (see Figure 5). This is a technical solution used to represent the natural connecting behaviour of the Arabic script. Conversely, multiple code points can be represented by a single glyph, a so-called ligature (see Figure 6). Note that ligatures are a required orthographic principle of some scripts, such as Arabic or Devanagari, while they are optional for others, such as Latin.
The mapping complexities are handled through the programmed instructions in the font.
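The two kinds of mapping can be sketched in a toy form. The code below is an illustration under heavy assumptions: it knows only five Arabic letters, the glyph names (such as `beh.init` or `lam_alef.fina`) are hypothetical, and the tables stand in for the instructions a real font would carry. It first picks a contextual form for each code point and then replaces lam followed by alef with a ligature glyph.

```python
# A hand-picked subset of Arabic letters with hypothetical glyph base names.
NAMES = {"\u0627": "alef", "\u0628": "beh", "\u0633": "seen",
         "\u0644": "lam", "\u0645": "meem"}
# Letters that connect on both sides; alef only connects to the letter before it.
DUAL_JOINING = {"\u0628", "\u0633", "\u0644", "\u0645"}

# (connects to previous letter, connects to next letter) -> form suffix
FORMS = {(False, False): "isol", (False, True): "init",
         (True, True): "medi", (True, False): "fina"}

# Two glyphs replaced by one ligature glyph.
LIGATURES = {("lam.init", "alef.fina"): "lam_alef.isol",
             ("lam.medi", "alef.fina"): "lam_alef.fina"}


def contextual_glyphs(word):
    """Choose one glyph per code point based on its neighbours."""
    glyphs = []
    for i, ch in enumerate(word):
        joins_prev = i > 0 and word[i - 1] in DUAL_JOINING
        joins_next = ch in DUAL_JOINING and i + 1 < len(word)
        glyphs.append(f"{NAMES[ch]}.{FORMS[(joins_prev, joins_next)]}")
    return glyphs


def apply_ligatures(glyphs):
    """Replace known glyph pairs with a single ligature glyph."""
    out, i = [], 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


word = "\u0633\u0644\u0627\u0645"  # seen, lam, alef, meem
print(apply_ligatures(contextual_glyphs(word)))
# ['seen.init', 'lam_alef.fina', 'meem.isol']
```

Four code points go in and three glyphs come out, none of which could have been chosen by looking at a code point in isolation.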
It is worth noting that code points might be stored in a different order from the one that is useful for their visual representation. In this case, the typesetting software performs glyph reordering: it arranges the glyphs according to their intended visual position rather than the phonetically informed order of the input code points (see Figure 7).
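A small Devanagari example may help here. The vowel sign I (U+093F) is typed and stored after its consonant but drawn to its left. The sketch below is a deliberate simplification with a function name of my own: it only swaps the sign with the single code point before it, whereas real shaping engines move it in front of a whole consonant cluster.

```python
# Vowel signs that are drawn before (to the left of) their consonant.
PRE_BASE_VOWEL_SIGNS = {"\u093f"}  # DEVANAGARI VOWEL SIGN I


def visual_order(code_points):
    """Return the code points in the order in which they are drawn."""
    out = list(code_points)
    for i in range(1, len(out)):
        if out[i] in PRE_BASE_VOWEL_SIGNS:
            # Move the vowel sign in front of the preceding consonant.
            out[i - 1], out[i] = out[i], out[i - 1]
    return out


stored = "\u0915\u093f"  # KA followed by VOWEL SIGN I
print([f"U+{ord(ch):04X}" for ch in visual_order(stored)])
# ['U+093F', 'U+0915']
```

The stored text is left untouched. Only the sequence handed on for drawing is rearranged.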
Thirdly, the glyphs need to be positioned relative to each other. The position of each glyph is defined by its boundaries, both vertical and horizontal (see Figure 8). These are represented as rectangles in the figures in this chapter. The boundaries of adjacent glyphs are aligned in the writing direction by default. Glyphs’ positions can be further adjusted using three different concepts:
- kerning (or conditional spacing adjustment) typically defines additional horizontal or vertical adjustment for a pair (or a larger group) of adjacent glyphs. See Figure 9.
- mark positioning defines the position of a glyph (mark) relative to another glyph (base glyph such as a letter or even another mark). The boundaries of the mark glyph are ignored in this process. See Figure 10.
- cursive attachment defines the position of adjacent glyphs by aligning predefined attachment points on each glyph. In this case the vertical and horizontal position of the glyphs can change and their boundaries are set to align with the attachment points. See Figure 11.
The application of these concepts may depend on the context formed by the adjacent glyphs.
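Two of these concepts, kerning and mark positioning, can be sketched as simple arithmetic. All glyph names, widths, kerning values, and anchor coordinates below are invented for illustration, and the function `position` is a toy rather than a description of any real shaping engine. Cursive attachment is left out for brevity.

```python
# Hypothetical font data in font units.
ADVANCE = {"A": 600, "V": 600, "e": 500, "acutecomb": 0}
KERNING = {("A", "V"): -80, ("V", "A"): -80}
BASE_ANCHOR = {"e": (250, 520)}        # where a mark attaches on the base glyph
MARK_ANCHOR = {"acutecomb": (0, 500)}  # the matching point on the mark glyph


def position(glyphs):
    """Return (glyph, x, y) placements for a left-to-right line."""
    placed = []
    pen_x = 0
    base = None  # (name, x) of the most recent non-mark glyph
    for glyph in glyphs:
        if glyph in MARK_ANCHOR and base is not None:
            # Mark positioning: align the mark's anchor with the base's
            # anchor; the mark's own boundaries do not move the pen.
            base_name, base_x = base
            bx, by = BASE_ANCHOR[base_name]
            mx, my = MARK_ANCHOR[glyph]
            placed.append((glyph, base_x + bx - mx, by - my))
            continue
        if base is not None:
            # Kerning: adjust the gap for this particular pair.
            pen_x += KERNING.get((base[0], glyph), 0)
        placed.append((glyph, pen_x, 0))
        base = (glyph, pen_x)
        pen_x += ADVANCE[glyph]  # default: boundaries simply abut
    return placed


print(position(["A", "V", "A"]))
# [('A', 0, 0), ('V', 520, 0), ('A', 1040, 0)]
print(position(["e", "acutecomb"]))
# [('e', 0, 0), ('acutecomb', 250, 20)]
```

Without the kerning table the V would sit at 600 and the second A at 1200. The pair adjustments pull each glyph 80 units closer to its neighbour.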
A discussion of language support would not be complete without an overview of the requirements of paragraph composition and potential pitfalls in the conversion of contours to pixels. These will be covered in the next chapter.
I would like to thank John Hudson for his suggestions on an early draft of this chapter.
Further reading
Allsorts : Font parser, shaping engine, and subsetter for OpenType, WOFF, and WOFF2 implemented in Rust. (2022). The most recent version available from https://github.com/yeslogic/allsorts
Berry, J. (Ed.) (2002). Language culture type : international type design in the age of Unicode. ATypI and Graphis.
Esfahbod, B. et al. (2022). HarfBuzz manual. The most recent version available from https://harfbuzz.github.io/index.html
Graphite : A free and open rendering engine for complex scripts. (2013). SIL International. The most recent version available from https://scripts.sil.org/cms/scripts/page.php?site_id=projects&item_id=graphite_home
TrueType Reference Manual. (n.d.). Apple Inc. The most recent version available from https://developer.apple.com/fonts/TrueType-Reference-Manual/
Vadgama, K. (2020). Rendering complex scripts in digital spaces : the development of layout and shaping technologies for complex script and language representation and consequent approaches to type design. University of Reading. Unpublished dissertation available on request from the author (https://www.keyavadgama.com).
References
Hudson, J. (2000). Windows glyph processing. An OpenType primer. Retrieved 22 February 2022, from http://www.microsoft.com/typography/developers/opentype/default.htm
Karaivanov, B. (2021). За българската форма на кирилица [On the Bulgarian form of Cyrillic]. Lecture at SoftUni Creative on 15 April 2021.
OpenType® Specification (Version 1.9). (2021). Microsoft Corp. The most recent version available from https://docs.microsoft.com/en-us/typography/opentype/spec/
The Unicode Standard (Version 14.0). (2021). The Unicode Consortium. The most recent version available from http://unicode.org