Saturday, December 29, 2018

Niamey 1978 & Cape Town 2018: 3. Other angles on Wikipedias in extended/complex Latin

Last August, I began a set of three posts marking the 40th anniversary of the Niamey 1978 meeting on harmonizing African language orthographies, and associating that with the Wikimania 2018 conference in Cape Town - the first in sub-Saharan Africa. This post concludes the series.

The central element of this discussion is the extended Latin alphabet, which is used in the orthographies of many African languages to accurately convey meaningful sound distinctions, and which often makes input with ordinary keyboards or keypads difficult.

Image: Composite image from welcome page of Eve Wikipedia.

Looking again at Wikipedias in extended/complex Latin

So far, this article has raised some questions, made some admittedly superficial comparisons, and speculated about factors related to the success or otherwise of Wikipedia editions in African languages. What might be ways of improving this analysis?

One way would be to try different approaches to sorting and categorizing the Wikipedias, using the same numbers as in the tables in the previous post. One such approach would be to consider the relative complexity of each orthography within a category. For example, in category 3, Fulfulde uses 3, 4 or 5 extended Latin characters and Hausa 3 or 4 (depending on the country), while Luganda and Northern Sotho use only one. Yoruba, when typed with the precomposed dot-under (aka "subdot") characters and combining diacritics for tones, is less complicated than the same language typed in the classic style with a small vertical line under the letter - because the latter itself requires a combining diacritic, and that may mean "stacking" two diacritics on a character where a tone is involved.
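To make the Yoruba comparison concrete, here is a minimal Python sketch that counts the combining marks a writer has to enter for one toned vowel in each style. The two sample sequences are my own illustration, not taken from any Wikipedia text, and they are shown as entered: software that normalizes text may store them differently.

```python
import unicodedata

def combining_marks(text):
    """Return the characters in text that have a nonzero combining class."""
    return [ch for ch in text if unicodedata.combining(ch) != 0]

def describe(text):
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# The same Yoruba vowel with high tone, entered two ways (illustrative samples).
subdot = "\u1ECD\u0301"     # precomposed o-with-dot-below, then combining acute
vertical = "o\u0329\u0301"  # plain o, combining vertical line below, combining acute

for label, seq in (("subdot", subdot), ("vertical line", vertical)):
    print(label, len(combining_marks(seq)), describe(seq))
```

In the subdot style only the tone mark is a combining character; in the vertical-line style both the under-mark and the tone mark are, which is the "stacking" described above.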

A different way would be to look at the quality of content in the various Wikipedia editions. In some, the raw numbers of articles (which are the figures used in the tables mentioned above) are inflated by shell articles, by which I mean a stub that may have only a few words of text and perhaps an image. The list of Wikipedias does include a "depth" metric, which might be used (or perhaps adapted) to look for possible correlations between the quantity and quality of the content on the one hand, and the nature of the orthography on the other.
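For readers who want to experiment with the "depth" figure, the sketch below assumes the formula as I understand it from the Meta-Wiki list of Wikipedias: edits per article, times non-articles per article, times one minus the "stub ratio" (articles divided by total pages). Both that formula and the sample numbers are assumptions for illustration, not real statistics for any edition.

```python
def depth(edits, articles, total_pages):
    """Approximate Wikipedia 'depth' under the assumed Meta-Wiki formula."""
    non_articles = total_pages - articles   # talk pages, redirects, etc.
    stub_ratio = articles / total_pages     # share of all pages that are articles
    return (edits / articles) * (non_articles / articles) * (1 - stub_ratio)

# Hypothetical edition: 10,000 articles, 25,000 pages in total, 50,000 edits.
example = depth(edits=50_000, articles=10_000, total_pages=25_000)
print(f"depth = {example:.1f}")
```

An edition whose article count is inflated by shell articles would tend to have few edits and few non-article pages per article, and so a low depth value under this formula.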

Yet another way would be to consider the numbers of people working on these editions. Wikipedia counts the numbers of users, active users, and administrators per edition. Could one use these figures to better understand whether the more successful Wikipedia editions in extended Latin (in terms of numbers of articles or a depth metric) are so because of the efforts of a relatively small number of users? That's not to imply any negative judgment of such cases, but it would be useful to know whether a complicated writing system (from the point of view of input) is not a hurdle for a large number of contributors (active users), or whether it's really a case of a few savvy individuals carrying the load.[1]

And another approach would be to expand the scope of analysis to consider other factors: How many people speak the language? Is it taught in schools? How much printed material is available? Are there different dialects or written conventions that a contributor to or a reader of a given African language edition of Wikipedia must navigate? Any of these and perhaps others might, individually or in combination, affect the potential production of web content in general, and the success of these Wikipedias in particular.

One could also put the numbers in the background and do a qualitative study focusing on the experience of the editors of African language editions of Wikipedia. What might emerge from such discussions concerning the range of tasks involved in building and maintaining an active Wikipedia?

And then there are some stray questions certainly worth checking out. For instance, does the base interface language on which each of the African Wikipedias is built (English vs French) have any bearing at all on the success of Wikipedia editions? What about the degree of localization of the interfaces (from English or French to the language of content)? And does that degree of localization relate at all to the complexity of the script?

Research towards success with African language Wikipedias

Although Wikipedias in African languages are relatively few (about 13% of all editions) and collectively contain less than 1% of the total number of articles in all Wikipedias combined,[2] there are arguably enough data and diverse user experiences to give us a better idea of both how to develop small Wikipedias in Africa, and how much of a factor the scripts used to write them might be in their relative success.
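The percentages above can be checked directly against the counts given in footnote 2. The short Python sketch below does only that arithmetic; the function name is my own, and the article counts are the rounded July 2018 figures from the footnote.

```python
def share_percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counts from footnote 2 (as of July 2018; article counts are rounded).
editions_with_dinka = share_percent(38, 292)
editions_without_dinka = share_percent(37, 291)
article_share = share_percent(293_000, 48_600_000)

print(f"Editions, Dinka included: {editions_with_dinka:.1f}%")
print(f"Editions, Dinka excluded: {editions_without_dinka:.1f}%")
print(f"Share of all articles:    {article_share:.2f}%")
```

Either way of counting editions rounds to about 13%, and the share of articles comes to roughly 0.6%, consistent with "less than 1%".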

Looking beyond African languages to the experience of Wikipedia editions in other languages written in extended Latin (and non-Latin scripts) would be instructive. This would likely highlight not only methods to facilitate input of diverse writing systems, but also supportive environments (or "localization ecologies") for these languages in general. 

Success for African language editions of Wikipedia may not be found in imitating work on other editions so much as in identifying ways to leverage the strengths and unique resources of African language communities. Nevertheless, facilitating input is fundamental, relying at its most basic level on common technology (for keyboards, etc.) and features of the MediaWiki software.

With rapid advances in language technology, an additional focus should be how to adapt speech-to-text to African languages to facilitate creation of content from oral narratives, interviews, and exposition. This is a topic I hope to return to later.

1. The Yoruba (category 4 orthography) and Northern Sotho (category 3) Wikipedias, for instance, each benefited at different times from large numbers of articles created by a single user in their respective communities.
2. That's 38 of 292 if the tiny Dinka edition is included; 37/291 if not. And about 293k articles out of 48.6 million total. (All as of July 2018.)