Friday, August 31, 2018

Niamey 1978 & Cape Town 2018: 2. Extended Latin & African language Wikipedias

Image adapted from banner on the Yoruba Wikipedia, August 2018
What are the implications of extended Latin characters and combinations for production of digital materials in African languages written with them? The previous post discussed some of the process of seeking to harmonize transcriptions, in which the Niamey 1978 conference and its African Reference Alphabet (ARA) were prominent. That process had a logic and left a legacy for the representation in writing of many African languages. This post asks if there is a trade-off between the complexity of the Latin-based writing system and how much is produced in it using contemporary digital technologies.

One easy, although by no means conclusive, way to consider this question is to look at Wikipedia editions in African languages (those that are written in Latin script). The following table disaggregates 35 African language editions by the number of articles (from the list of Wikipedias, as of 9 August 2018) and the four "categories" of Latin-based orthography [1] introduced in African Languages in a Digital Age (ch. 7, p. 58):

| Number of articles | Category 1 | Category 2: "Category 1" + Latin 1 | Category 3: "Cat. 1" or "2" + any of Latin Extended A, B, etc, Add'l, & IPA | Category 4: "Category 3" + Combining diacritics |
| --- | --- | --- | --- | --- |
| < 500 | Swati (447) | Sango (255) | Fula (226), Venda (265), Chewa [2] (389) | Dinka (75), Ewe (345) |
| 500-1,000 | Sotho (543), Tumbuka [2] (562), Tsonga (563), Kirundi (611), Tswana (641), Xhosa (741), Oromo (772) | | Akan (561), Twi (609), Bambara [3] (646) | |
| 1,000-2,000 | Zulu (1011), Kinyarwanda (1823) | Kongo (1179) | Luganda [4] (1162), Wolof (1167), Gikuyu (1357), Kabiye (1455), Hausa (1891) | Igbo [5] (1340) |
| 2,000-10,000 | Shona (3761), Somali (5307) | | Kabyle (2860) | Lingala (3028) |
| 10,000-50,000 | Swahili (44,375) | | | Yoruba [5] (31,700) |
| > 50,000 | Malagasy (85,033) | Afrikaans (52,847) | | |
| # of articles / # of editions = Average | 146,190 / 14 = | 54,281 / 3 = | 20,655 / 13 = | 36,488 / 5 = |
| Grouping 1&2, 3&4 | Categories 1 & 2: 168,471 total articles / 17 editions = | | Categories 3 & 4: 57,143 total articles / 18 editions = | |

Looking at the top row with the smallest editions (fewer than 500 articles), one is tempted to highlight the high presence of African languages whose orthographies include extended Latin - categories 3 & 4. However, the group with the next highest number of articles (500-1000) has more editions with category 1 orthographies (the simplest) than the group above that (1000-2000) has editions with category 3. And the next highest ranges (covering 2000-10,000) are roughly even between category 1 on the one hand, and 3 & 4 on the other. But then the three largest editions (and three of the four above 25,000) are category 1 & 2.

So with just a visual analysis, there does not seem to be any clear pattern from arraying the editions in this way. Of course there will be factors other than the complexity of the script affecting the success of a Wikipedia edition written in it. But are there ways of looking at this raw data that can give us a clearer idea of what the effect of extended Latin - the ARA plus orthographies with other modified letters and diacritic combinations - might be on the size of Wikipedia editions?

One approach is to consider all the above editions combined, per category of orthography (totaling by column). This puts the focus on the degree of complexity of the writing system, perhaps muting the effect of other language- & location-specific factors. On the second-to-last row are column totals of the number of articles in all editions listed above, divided by the number of editions, to give an average figure. This yields an uneven pattern (2>1>4>3), since in the cases of 2 & 4, one large edition in a small total number of editions skews the category average up.
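The per-category averages can be recomputed from the column totals given in the table. The following is a minimal sketch in Python; the variable names are illustrative, and the only inputs are the four "articles / editions" pairs from the table.

```python
# Column totals from the table: (total articles, number of editions).
column_totals = {
    "Category 1": (146_190, 14),
    "Category 2": (54_281, 3),
    "Category 3": (20_655, 13),
    "Category 4": (36_488, 5),
}

# Average number of articles per edition in each category.
averages = {
    category: articles / editions
    for category, (articles, editions) in column_totals.items()
}

# Order the categories from highest to lowest average.
ranking = sorted(averages, key=averages.get, reverse=True)

for category in ranking:
    print(f"{category}: {averages[category]:,.0f} articles per edition")
```

The resulting order is category 2, then 1, then 4, then 3, with averages of roughly 18,094, 10,442, 7,298 and 1,589 articles per edition.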

By combining the totals of the two simpler categories (1 & 2) and of the two extended Latin categories (3 & 4), however, one obtains possibly more useful numbers. This aggregation can be rationalized for our purposes here by the fact that the lower two categories are generally supported by commercially available keyboards and input systems [6], while the higher two categories require a specialized way to input additional characters and possibly diacritics (such as an alternative keyboard driver, or an online character picker) [7].

The figures thus obtained show editions written in extended and complex Latin having on average about a third the number of articles of those written in ASCII and Latin-1. Admittedly, this result is in part a product of the way the categories have been chosen and the figures aligned, but I'm proposing them as a perspective on the use of extended (and complex) Latin, and possible gaps in support. Before considering this in more detail, it is useful to compare with the numbers for non-Latin scripts.

What about non-Latin scripts & African language Wikipedias?

| Number of articles | Editions |
| --- | --- |
| < 500 | Tigrinya (168) |
| 10,000-50,000 | Amharic (14,321), Egyptian Arabic (19,170) |
| # of articles / # of editions = Average | 33,659 / 3 = 11,220 |

There are only three editions of Wikipedia in African languages written in non-Latin scripts [8]. Two of those - Amharic and Tigrinya - are written with the Ge'ez or Ethiopic script unique to the Horn of Africa.

Arabic is the third. How to count this language for the purposes of this informal analysis raises a question. Arabic, of course, has been established as a first language in North Africa for centuries, but it is also a world language, spoken natively in southwest Asia (having originated in Arabia), and learned as a second language in many regions. Drawing users from this wide community, the Arabic Wikipedia is among the top 20 overall, with twice as many articles as all of the editions discussed above combined. It is more than an African language edition. For this analysis, therefore, I have chosen instead to count just the Egyptian Arabic Wikipedia.

Taking these three editions, we then get an average number of articles (11,220), which is close to what is seen for the Latin categories 1 & 2 (11,789). The usual caveats apply for such a small sample, but taking the numbers as they are, it is interesting that Wikipedias in the complex Arabic alphabet and the large Ge'ez abugida (alphasyllabary) are on average much larger than those of the ostensibly simpler extended Latin (3,175) [9].
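As a quick check on this comparison, the non-Latin average can be recomputed from the three article counts and set against the two Latin averages quoted in the paragraph above. This is only a sketch: the Latin figures are taken as quoted rather than recomputed, and the names used are illustrative.

```python
# Article counts for the three non-Latin editions, as listed in the table.
non_latin = {"Tigrinya": 168, "Amharic": 14_321, "Egyptian Arabic": 19_170}
non_latin_avg = sum(non_latin.values()) / len(non_latin)

# Averages for the Latin groupings, as quoted in the text (not recomputed here).
quoted_latin_avg = {
    "Latin categories 1 & 2": 11_789,
    "Latin categories 3 & 4": 3_175,
}

print(f"Non-Latin average: {non_latin_avg:,.0f}")
for label, avg in quoted_latin_avg.items():
    # A ratio near 1 means similar averages; above 1 means the non-Latin average is larger.
    print(f"Non-Latin vs {label}: {non_latin_avg / avg:.2f}x")
```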

Again, script complexity is but one factor, and in this case probably not the most important, since the two non-Latin scripts in question have long histories of use in text in parts of Africa - much longer than any form of Latin script. Nevertheless, from the narrow perspective of what is required for users to edit Wikipedia, the technical issues are in some ways comparable, if not even more demanding.

Arabic has had standard keyboards since the days of typewriters. The issues there are not so much the input, but whether systems can handle the directionality and composition requirements of the script.

The Ge'ez script, on the other hand, does not involve complex composition rules or bidirectionality. However, it has a total of over 300 characters (including numerals and punctuation; more again if extended ranges are added). The good news is that there are numerous input systems to facilitate typing them. Literacy in the script and availability of input systems would not be limiting factors for content development in major languages using this script. The difference in development of the Amharic and Tigrinya editions of Wikipedia may relate to both the larger population speaking Amharic (as a first or second language), and its use officially in a relatively large country (Ethiopia). Development of content in Tigrinya - a cross-border language - might also be hindered by issues particular to one of the two countries where it has many speakers (Eritrea).
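The size of the Ge'ez character set can be checked against the Unicode data bundled with Python. The sketch below counts the named (assigned) code points in the main Ethiopic block and in three extension blocks. The exact counts depend on the Unicode version of the Python build, and the list of extension blocks is an assumption that may be incomplete for newer Unicode versions.

```python
import unicodedata

# Unicode block boundaries for the Ethiopic script (extension list may be incomplete).
ETHIOPIC_BLOCKS = {
    "Ethiopic": (0x1200, 0x137F),
    "Ethiopic Supplement": (0x1380, 0x139F),
    "Ethiopic Extended": (0x2D80, 0x2DDF),
    "Ethiopic Extended-A": (0xAB00, 0xAB2F),
}


def assigned_count(first, last):
    """Count code points in [first, last] that have a Unicode character name."""
    return sum(
        1 for cp in range(first, last + 1) if unicodedata.name(chr(cp), "")
    )


counts = {
    block: assigned_count(first, last)
    for block, (first, last) in ETHIOPIC_BLOCKS.items()
}

for block, n in counts.items():
    print(f"{block}: {n} assigned characters")
print(f"All listed blocks: {sum(counts.values())}")
```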

From the above one might suggest that complexity of the written form (to be taken here as including the nature of the script itself, and the size of the character set) may be a limiting factor on content development, but that other factors, such as a literate tradition, official use, and technical support for digital production may overcome such limitations. In the case of African languages written in Latin script, however, any literate tradition is recent, and they are often marginalized in official and educational contexts. For those written with extended Latin, there is the additional factor of lack of an easy and standardized way of inputting special characters. Paradoxically, it seems, a modification of the most widely used alphabet on the planet may actually hobble efforts to edit in these languages.

Facilitating input in extended Latin for African language Wikipedias?

Wikipedia editing screen with "Special Characters" drop-down modified to show all available ranges.
Assuming that the inconvenience of finding ways to input extended Latin characters may be a factor in the success of African language Wikipedias written with category 3 and 4 orthographies, a quick fix might be to add new ranges for the modified letters used in African languages to the "special characters" picker in the edit screens. As it is currently structured, the extended characters necessary for a category 3 or 4 orthography might be sprinkled around in up to 3 different ranges (see the screenshot). And within each range, they are not presented in a clear order, so they are sometimes hard to find.

Since it may be too complicated to have a special range for each language edition, another possibility would be to draw inspiration from the Niamey 1978 meeting's ARA, and combine all extended Latin characters and combinations needed for all current African language Wikipedias into a common new range.
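To make the idea of a common range concrete, here is a minimal sketch of merging per-language character lists into one de-duplicated, ordered list. The inventories are an illustrative sample limited to the characters named in the notes to this post, not complete alphabets, and `common_range` is a hypothetical helper, not part of any Wikipedia or MediaWiki interface.

```python
import unicodedata

# Illustrative sample only: characters mentioned by name in the notes to this post.
SAMPLE_INVENTORIES = {
    "Northern Sotho": ["\u0161"],   # s with caron
    "Wolof": ["\u014B"],            # eng
    "Luganda": ["\u014B"],          # eng
    "Chewa": ["\u0175"],            # w with circumflex
    "Ewe": ["\u0254\u0303"],        # open o + combining tilde
}


def common_range(inventories):
    """Merge per-language character lists into one de-duplicated, ordered list."""
    merged = {
        unicodedata.normalize("NFC", grapheme)
        for graphemes in inventories.values()
        for grapheme in graphemes
    }
    # Sort on the decomposed form, so a letter with a decomposable diacritic
    # is ordered by its base letter rather than by its own code point.
    return sorted(merged, key=lambda g: unicodedata.normalize("NFD", g))


picker = common_range(SAMPLE_INVENTORIES)
print(" ".join(picker))
```

Sorting on the decomposed form is one possible ordering; a real picker might instead follow each language's alphabetical order or a collation standard.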

Of course, as mentioned above, there are other factors that can contribute to the success or not of Wikipedia editions in African languages written with extended Latin, but this innovation would at least make editing more convenient for contributors to these editions. And perhaps it might have a positive effect on the quantity and quality of articles in these Wikipedias.

In the third and concluding article in this series, I'll step back to look at this analysis and consider some other ways to look at the data on African language editions of Wikipedia, and in particular, those written in extended Latin.

1. This categorization was intended to help characterize the technical requirements for display and input of various languages. Although the technology has improved to the point that more complex scripts are generally displayed without the kinds of issues one encountered even a decade ago, input still requires extra steps or workarounds. The four categories are additive in that each higher category builds on those below, with added potential issues. It is also a "one jot" system in that, for example, a single extended Latin character, say š in Northern Sotho or ŋ in Wolof, makes their orthographies category 3 rather than category 1 or 2 (respectively), and the use of the combining tilde over the extended Latin character for open-o - ɔ̃ - makes Ewe a category 4 rather than 3. In general, the higher the category, the more potential issues with display and input (although technical advances tend to level the field, esp. as concerns display).
2. The only non-basic Latin character used in Chewa is the w with circumflex: ŵ. Apparently it represents a sound important in only one dialect of the language, and is used infrequently in contemporary publications. On the other hand, there is a proposed (not adopted) orthography for Tumbuka that includes the ŵ. Without this character, either language would be a category 1 orthography; with it, category 3.
3. Bambara is a tonal language. Most often, it seems, tones are not marked in text; however, they can be for clarity, and some dictionaries make a point of indicating tone in the entries (not just pronunciation). If tones are unmarked, Bambara would be considered a category 3 orthography; with tones, category 4.
4. The addition of the letter ŋ puts Luganda in category 3 rather than 1.
5. The dot-under (or small vertical line under) characters used notably in Yoruba and Igbo are particular to southern Nigeria, and not included in the ARA. Yoruba in Benin is written with characters from the ARA. These are tonal languages, and tone is usually marked.
6. When I first proposed this categorization (itself a modification of an earlier effort), there were some questions about why to have a category 2 separate from category 1. That distinction had its origins in the early days of computing, when systems used 7-bit fonts, meaning that accented letters (diacritic characters) used in, say, French or Portuguese, could not be displayed. Even as systems using 8-bit fonts enabled use of diacritics commonly used in European languages, display issues would still crop up (as a sequence of characters where an accented letter should be). Nowadays, such display issues are rare, and limited (as far as I can tell) to documents in legacy encodings. On the other hand, input of accented characters may require, depending on the keyboard one is using, switching keyboard drivers or using extra keystrokes - so one will occasionally see ASCIIfication of text in such languages (apparently as a user choice).
7. The difference between categories 3 (extended Latin) and 4 (complex Latin) was once significant enough from the point of view of display that informal appeals to Unicode to change its policy of not encoding new "precomposed" characters were common.
8. The Wikipedia Incubator includes several African language projects, which are not covered here. These include some in non-Latin scripts (Arabic versions, N'Ko, and Tamazight) and some in Latin-based orthographies. I mentioned one of the latter - Krio - in a previous post, and hope to do an overview of this space in the near future.
9. Average for all African language editions is 7704. By comparison the average for all Wikipedias is 166k.
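
The "one jot" rule described in note 1 can be sketched in code: the category of a text is the highest category of any single character in it. The version below is a minimal illustration, not the method used to build the table. It assumes the categories map onto Unicode blocks as commented, normalizes to NFC first so that a letter plus combining mark with a precomposed equivalent is not counted as category 4, and ignores characters outside the listed blocks. The function names are hypothetical.

```python
import unicodedata

# Unicode blocks assumed here to correspond to "extended Latin" (category 3).
EXTENDED_LATIN_RANGES = [
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x1E00, 0x1EFF),  # Latin Extended Additional
]
COMBINING_MARKS = (0x0300, 0x036F)  # Combining Diacritical Marks (category 4)


def char_category(ch):
    """Return 1-4 for a character in the blocks above, or 0 if it is outside them."""
    cp = ord(ch)
    if COMBINING_MARKS[0] <= cp <= COMBINING_MARKS[1]:
        return 4
    if any(lo <= cp <= hi for lo, hi in EXTENDED_LATIN_RANGES):
        return 3
    if 0x0080 <= cp <= 0x00FF:  # Latin-1 Supplement
        return 2
    if cp <= 0x007F:            # ASCII
        return 1
    return 0


def text_category(text):
    """Highest category of any character in the text, after NFC normalization."""
    composed = unicodedata.normalize("NFC", text)
    return max((char_category(ch) for ch in composed), default=0)


samples = {
    "Swati": "Swati",                          # plain ASCII
    "a with diaeresis": "\u00E4",              # Latin-1 letter
    "s with caron": "\u0161",                  # Latin Extended-A letter
    "open o + combining tilde": "\u0254\u0303",
}
for label, text in samples.items():
    print(f"{label}: category {text_category(text)}")
```

Running a whole orthography's alphabet through `text_category` would give its category under these assumptions; a fuller version would need to cover the further Latin Extended blocks and decide how to treat punctuation.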
