Tuesday, July 30, 2019

African Regional Meeting for IYIL2019

The International Year of Indigenous Languages 2019 (IYIL2019) has involved various regional meetings. The African Regional Meeting for IYIL2019 is being held in Addis Ababa on 30-31 July, midway through the year.

A Concept Note for the meeting details the purpose, agenda, structures, and participants. I will copy below the statement from the conference webpage (which appears to be almost identical to a section in the Concept Note, from which the title is borrowed). This is in the typical formal language of UN meetings, but useful to have available for possible future reference and discussions.

Purpose, Objectives and Outcomes of the IYIL2019 Regional Meeting

With a view to empowering Indigenous language speakers and users, UNESCO and the African Academy of Languages (ACALAN), as African Union’s specialized institution, mandated to develop and promote African languages in partnership with the languages inherited from Africa’s colonial past in the perspective of linguistic diversity and convivial multilingualism, the representatives of Member States and Indigenous members of the Steering Committee for the organization of the International Year of Indigenous Languages, and Member States of the African Union, regional and international institutional partners, as well as supported by the UNESCO Intergovernmental Information for All Programme (IFAP), are organizing the IYIL2019 African Regional Meeting on Indigenous languages on 30 and 31 July 2019 at the Headquarters of the African Union in Addis Ababa, Ethiopia.

The African Regional Meeting (IYIL2019 Regional Meeting) will contribute to the implementation of the Action Plan for Organizing the 2019 International Year of Indigenous Languages (Annex V), in particular to the established tentative road map towards achieving strategic objectives and expected impacts through the elaboration of a Global Strategic Outcomes Document.

The objectives of the African Regional Meeting (IYIL2019 Regional Meeting) in Addis Ababa are to:
  • Bring together a diverse range of stakeholders, including representatives of ACALAN’s working structures in the national governments of member states of the African Union, Indigenous organizations, scholars and experts in the field of Indigenous languages, and others for a constructive dialogue on Indigenous languages and related issues in the African regions;
  • Promote the human rights and fundamental freedoms, with special focus on support, access and promotion of Indigenous languages and better integration into the policy and strategic frameworks, research agendas and development of concrete tools and services;
  • Identify existing challenges, practical solutions and good practices among different stakeholders working in language transmission, documentation, safeguarding, policy development, education, research, and promotion and private sector; and,
  • Raise awareness on the importance of Indigenous languages, linguistic diversity and multilingualism for sustainable development and provide guidance to the stakeholders in the implementation of international, regional and national commitments related to language development.
The expected Outcomes of the African IYIL2019 Regional Meeting in Addis Ababa are to:
  • Produce a Regional Outcomes Document elaborated jointly by all stakeholders, including concrete recommendations and identified actions on support for, access to and promotion of Indigenous languages and empowerment of Indigenous languages speakers and learners in Africa;
  • Provide a Template for other Regional Meetings in the context of IYIL 2019; and,
  • Provide an opportunity to forge new partnerships and networks among various stakeholders to further the exchange of best practices, information sharing and collaboration.

The IYIL2019 Regional Meeting will include discussions on the following five key areas identified and presented in the Action Plan (Ref.: E/C.19/2018/8). These include:
  • Increasing understanding, reconciliation and international cooperation, including a role of non-governmental organizations;
  • Creation of favorable conditions for knowledge-sharing and dissemination of best practices with regards to Indigenous languages, including data collection, research and application of technological solutions;
  • Integration of Indigenous languages into standard setting, including intersectoral approaches across different domains such as education, public administration, innovation and research with a special focus on language technology;
  • Empowerment through capacity building, including young Indigenous girls and women, migrant population and diaspora; and,
  • Growth and development through elaboration of new knowledge including data collection, research and innovation.

Friday, July 05, 2019

A life lesson in Fulfulde & the internal voices of a multilingual

After another hiatus in posting here, I will begin again. More on all that below, but I wanted first to take the opportunity to share a couple of short items relating to my experience with some African languages.

"God knows his friends..."

I still remember a particular session of Fulfulde instruction in Moribabougou, Mali, back in 1983. As in all Peace Corps language classes - whatever the country, it seems - we were seated in a small hut, big enough for maybe 4-5 people plus a blackboard. On the particular day I'm remembering, someone passed behind the hut - nothing extraordinary in that given the layout of the training facilities between the village and the school. Our instructor, Mady Kamanta, asked "Giɗo Alla?" Literally, "Friend of God?" which is actually a kind of "Who goes there?" The unseen person answered "Ko min, __" (and his name), basically, "It's me, ___." It was a village elder.

Short greetings followed and after the elder had gone, Mady said that the proper way to answer "Giɗo Alla?" was actually something like "Alla anndi giɗo mun. Min ko ___." Basically, "God knows who's his friend. I'm ___." (I think it could also have been "Alla anndi yiɗɓe mun..." - "God knows his friends...")

A simple turn of phrase but a whole culture behind it. In pulaaku or pulaagu (basically Fula culture*), to the extent I can claim to understand it, there is a certain amount of humility and avoidance of presumption. (But as in any culture, I find there's also duality in which one sometimes encounters the apparent opposite.)

By the response you not only know the identity of the person, but also get a measure of them. In Moribabougou that day, the elder was a Bambara who, however wise he may have been, and however steeped in Manding language and culture, could not be expected to have mastered all the nuances of a second or third language. So without knowing anything more about him, you'd learn something just from his response. From my later (and admittedly still limited) experience with varieties of Fula, I recall there were other small ways in which native speakers of the language (especially in central Mali) would test your knowledge in it.

As a turn of phrase, the "proper" form of response to "Giɗo Alla?" is a great example of answering a question while deftly stepping aside from the premise of the question (whether implicit as in this case, or explicit, as perhaps in the case of some forms of argument). Something potentially useful in many contexts and cultures.

So for some reason I recalled this recently - which does not mean I had forgotten it in the meantime - and thought I'd share it here.

The words that come to mind (multilingual version)

A recent article on multilingualism and perception (published in June 2019 in On Biology then in slightly revised version in Psychology Today) had me thinking of the internal verbal generation in response to stimuli. That is, the words that come to mind in diverse situations, when one speaks more than one language. In my case those are sometimes words or phrases in Fula or Bambara (which are very different from each other).

Here I don't mean the kind of code-switching you do when speaking, and a word in a different language just says it better (or maybe you didn't know the term in the language of the conversation). I'm referring to the internal voice when you think of something or react to a situation. Occasionally the words I think of in certain situations are not in my first, or even second (French), language.

I'm assuming this must be common among multilinguals, but am not aware of any research on it. The closest I've seen as a non-specialist in linguistics relates to the perception of sounds, and how one interprets those as meaningful utterances, as mentioned for example in the article linked above.

Multilingualism is of course more common than monolingualism, but since linguistics and psychology arose in monolingual (sub)cultures, multilingual experience is described somewhat incompletely, and treated as if it were somehow unusual.

One would hope that as African research and scholarship in these areas becomes more prominent, multilingualism will be treated more as the norm, and monolingualism as the exception.

15 years of Beyond Niamey

Since I began this blog as an experiment in January 2004, there have been periods where I've written more often, and periods where I've posted nothing. A year ago - last summer here in the northern hemisphere - I anticipated a more substantial "reboot" than actually happened. (The linked post is still a good statement of my current thinking on this blog, aside from the 2018-specific section, and on blogging in general, which I've come to appreciate more with time.)

One additional factor, which has always been in the background, but I've been thinking about more lately, is the question of what are the appropriate angles for a non-African to write about African languages and the "information society." This question is, frankly, the source of some doubt the deeper I get into it.

African languages are part of my life, for reasons perhaps apparent from the above, but which I'll discuss further in a subsequent post. However none of them are my language in the senses of mother tongue or cultural heritage. So as much as I may have my opinions, I also try to maintain a sense of propriety and humility. As I see it now, that allows focus on asking questions, making comparisons, and sharing information, ideas, and observations.

Readers are invited to follow critically as this blog evolves.

* Pulaaku is a somewhat complicated topic, often presented as the "Fula way" and comprised of several key attributes or behaviors. One summary posted on the web offers a list (although they had trouble with the hooked letters - "neaaaku," "aum," and "enaam" should be "neɗɗaaku," "ɗum," and "enɗam"). In an academic article entitled "L'image des Fulbe : Analyse critique de la construction du concept de pulaaku," Anneke Breedveld and Mirjam De Bruijn argue that the meanings of pulaaku vary by region, and perhaps our understanding of a single concept (and Fula people) is derived from foreign scholarship. In this, as much as in variation of the language, I tend to think that there are core concepts that are drawn from and interpreted variously, and that these interpretations don't represent divergence so much as dynamics. In the realms of ideas and identity, these dynamics could also mean convergence.

Thursday, February 21, 2019

IMLD 2019 & IYIL 2019 in Africa

On this International Mother Language Day 2019 (IMLD 2019), which has as its theme "Indigenous languages matter for development, peace building and reconciliation," a question: What are "indigenous languages" in Africa?

The question arises also since we are over a month into the International Year of Indigenous Languages (IYIL 2019), an observance declared by the United Nations General Assembly in its 2016 resolution on the Rights of Indigenous Peoples. The purpose of the IYIL 2019 as stated in that resolution (under #13) is:

"to draw attention to the critical loss of indigenous languages and the urgent need to preserve, revitalize and promote indigenous languages and to take further urgent steps at the national and international levels"

I won't propose a definitive answer as to what counts as "indigenous language" as my understanding is that the themes of IMLD and IYIL are inclusive, but it seems to be an important issue to think about in the interests of encouraging wide participation in Africa.

With that in mind, it is worth noting that the South African Centre for Digital Language Resources (SADiLaR) calls for "Celebration of South African languages" in the context of IYIL 2019. Also of interest is that the January-February issue of the UNESCO Courier, themed "Indigenous Languages and Knowledge (IYIL 2019)," has an article on the Mbororo people of Chad, whose mother tongue is a variety of Fula (Fulfulde/Pulaar) - a language that originated in a region far to the west.

Some of the broader issues concerning indigenous languages of Africa I hope to come back to in a following article, but in the meantime wishing all a happy IMLD 2019!

Sunday, January 06, 2019

Writing Bambara wrong & a petition to VOA

Why does the Voice of America (VOA) Bambara service's web content use a frenchized transcription of Bambara while Radio France Internationale (RFI) uses the Bambara orthography?

Screenshot from page on VOA Bambara website. In Bambara orthography:
Jamana tigi: Ibrahima Bubakar Keyita ye Sankura foli kɛ ka ɲɛci jamana
denw ma. (Presidential New Year's address to the people of the country)
This question comes to mind in light of a petition being circulated by the Cercle Linguistique Bamakois asking that VOA follow the Bambara orthography on its web presence. (An English version of the petition is included at the end of this post.)

According to Sam Samake, a language specialist in Bamako, VOA's rationale for its current approach is to reach a large number of listeners who do not read or write Bambara in the official orthography,1 but who have been schooled in the French language system. In the view of Dr. Coleman Donaldson, a researcher on Manding languages (which include Bambara), this is part of a pattern of disregard for the spelling and orthographic conventions adopted by the Malian government and used now in many primary schools.2 (This system happens also to harmonize with orthographies of neighboring countries due to the process that included the Bamako 1966 and Niamey 1978 conferences.)

Use or non-use of African language orthographies - and implications of respect or disrespect that accompany that choice - is not at all a new discussion. Coleman has a more recent examination of the broader problem as it appears in Mali, in a book chapter:3
"In a context where there is no shortage of people trained in official Bamanan orthography, the fact that the multinational telecommunications firm Orange fails to respect the official conventions is not simply a case of shoddy work; it is in fact part of the message."
Screenshot from RFI Mandenkan homepage. (Days & times of broadcasts).
In fact, as Sam pointed out, Malian government personnel, including for example everyone in the national broadcasting service, ORTM, have been trained in this orthography.1 So it does not appear at all accidental that major international entities like VOA and Orange opt to write Bambara as they please.

In this context, it is interesting to note RFI's decision on "Mandenkan" web content. Mandenkan, or Manding, is a group of largely interintelligible languages including Bambara (or Bamanankan) in the Mande family. RFI uses what looks like Bambara in the proper Malian orthography. That said, the amount of text in the language is limited to a static page on its main site (from which the image above was drawn), and some text in older posts on its Facebook page.

L2 literacy & L1 illiteracy?

VOA's decision to use a frenchized (or Frenchified) transcription of Bambara - which it should be noted has no standard form, pretty much by definition - is apparently premised on the notion that many people in their audience don't read the standard Bambara orthography. There may be something to this, to the extent that in Mali, formal education is mainly or exclusively in French, and people who read French can sound out text with spellings reflecting French phonetics.

However this reasoning (or rationale) has at least two problems. First, it is not clear how much of the audience cannot read Bambara written in the official orthography. Have there been any surveys? And secondly, for a native speaker of the language, the official orthography would not seem that hard to work through.

On the latter point, a word about multilingual literacy, or its absence, in Africa. The fact that many people in Africa have been taught to read in a Europhone language (French, in the case of Mali), which for the vast majority is a second language ("L2"), but never formally taught in their first language ("L1") or local lingua franca (like Bambara in Mali), leads to situations where many people are not comfortable reading in their familiar African languages. I've been among those calling attention to the problem in using one measure of literacy in such multilingual contexts.4

However, that's not the same as saying an L2 (and non-L1) literate person should access their L1 only with the phonetics of the L2. The bridge from L2-only literacy to L1 literacy is not as long as that from illiteracy to basic literacy of any kind. And the Latin-based orthography of Bambara (what we are talking about here) is not that difficult to master. After all, it doesn't seem to have put a crimp in RFI Mandenkan's effectiveness.

Tech issues: A problem? And a potential

One needs to ask if maybe a hidden issue with VOA and the Bambara orthography isn't the issue with keyboards and input. Is it possible that a simple input solution enabling the VOA Bambara service staff to type the special characters used in Bambara could change this discussion?
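To make the idea of a "simple input solution" concrete, here is a minimal sketch of how typed mnemonic sequences could be expanded into the four special letters of the Bambara alphabet (ɛ, ɔ, ɲ, ŋ). The ";" prefix, the pairings, and the function name are hypothetical choices for illustration only, and do not describe any existing keyboard layout or tool.

```python
# Hypothetical mnemonic sequences: the ";" prefix and the pairings below are
# illustrative assumptions, not an existing keyboard layout.
SEQUENCES = {
    ";e": "ɛ", ";o": "ɔ", ";n": "ɲ", ";g": "ŋ",
    ";E": "Ɛ", ";O": "Ɔ", ";N": "Ɲ", ";G": "Ŋ",
}


def expand(text: str) -> str:
    """Replace each mnemonic sequence with the corresponding Bambara letter."""
    for sequence, letter in SEQUENCES.items():
        text = text.replace(sequence, letter)
    return text


print(expand("k;e ka ;n;eci"))  # kɛ ka ɲɛci
```

A real solution would more likely be a keyboard layout or input method at the operating system level, and would need a way to type a literal ";" before one of these letters, which this sketch does not handle.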

Also, could VOA use the perceived shortcomings in audience mastery of the Bambara orthography to engage their audience with some kind of online learning app? This would certainly generate a more favorable buzz than what the current situation is doing.

Petition to VOA

The only version of the petition I am aware of is the one in French on the Change.org site. A Bambara version would be logical - as a "medium is the message" statement if nothing else - but I have not seen any. Appended below for information of people who do not read French, but do read English, is a quick translation5 of the text of the petition into the latter:
Voice of America journalists must respect the Bambara orthography
Considering that Mali, since its accession to independence and through all the successive regimes, has emphasized the importance of the languages and cultures of the country;

Considering that the question of languages spoken in Mali is included in the country's constitution;

Considering that for decades there have been departments dedicated to the question of the languages of Mali;

Considering the remarkable work done by Malian and foreign linguists on the languages spoken in Mali from 1960 to the present day;

Considering the intellectual and financial effort made by Mali and its international partners (in particular the African Academy of Languages) in the codification and use of Mali's languages in schools and in the media;

Considering the learning and the respect of these standards in writing as an obligation in order to perpetuate the work of codification carried out;

Considering that the state of Mali through the dedicated departments guarantees these standards;

Considering that the journalists of the Mandenkan team of RFI (Radio France Internationale) have been trained and correctly use the spelling rules of Bambara;

Considering that the Bambara team of the Voice of America (VOA) does not respect any Bambara spelling rules;

We hereby call on the State of Mali (through the Ministry of National Education / Malian Academy of Languages) and the African Academy of Languages to remind the Voice of America of strict respect for the spelling rules of Bambara on the VOA Bambara page.

Recommend for this purpose:

The training of Bambaraphone journalists of VOA in the spelling rules of Bambara.

What about Hausa?

This discussion would not be complete without mention of the continued use of ASCIIfied Hausa by the international radio operations, including VOA and RFI. And how is it that RFI gets Mandenkan (Bambara) right, but not Hausa?
1. He mentioned this in a discussion about the topic on the Facebook African Languages group page (2 Jan. 2019). Sam is a former Peace Corps/Mali language program instructor and administrator. We have known each other since the slightly famous Peace Corps pre-service training in Moribabougou, Mali in 1983.
2. See Coleman's blog post on this topic and the VOA petition, "Voice of America's Bambara Orthography and a Petition," on his interesting site about Manding languages (which include Bambara), An ka taa.
3. Coleman Donaldson. 2017. "Orthography, Standardization and Register: The Case of Manding." In P. Lane, J. Costa, & H. De Korne (Eds.), Standardizing Minority Languages: Competing Ideologies of Authority and Authenticity in the Global Periphery (pp. 175–199). New York, NY: Routledge.
4. See, for instance, "Multilingual Literacy Day, 2014" (8 Sep 2014).
5. Based on what Google Translate produced, which was much more useful than Systranet's output.

Saturday, December 29, 2018

Niamey 1978 & Cape Town 2018: 3. Other angles on Wikipedias in extended/complex Latin

Last August, I began a set of three posts marking the 40th anniversary of the Niamey 1978 meeting on harmonizing African language orthographies, and associating that with the Wikimedia 2018 conference in Cape Town - the first in sub-Saharan Africa. This post concludes the series.

The central element of this discussion is the extended Latin alphabet, which is used in the orthographies of many African languages to accurately convey meaningfully important sounds, and which often makes input with ordinary keyboards or keypads difficult.
Composite image from welcome page of Eve Wikipedia.

Looking again at Wikipedias in extended/complex Latin

So far, this article has raised some questions, made some admittedly superficial comparisons, and speculated as to factors related to success or not of Wikipedia editions in African languages. What might be ways of improving this analysis?

One way would be different approaches to sorting and categorizing the Wikipedias using the same numbers in the tables in the previous post. And one of those approaches could be to consider the relative degree of complexity of each orthography within a category. For example in category 3, Fulfulde uses 3, 4 or 5 extended Latin characters and Hausa 3 or 4 (depending on the country), and Luganda and Northern Sotho only one. Yoruba, when typed with the precomposed dot-under (aka "subdot") characters and combining diacritics for tones, is less complicated than the same language using the classic-style small vertical line under, because the latter requires a combining diacritic, and that may mean "stacking" two diacritics on a character where a tone is involved.
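The difference between the two Yoruba styles can be seen at the level of Unicode code points. The following is a minimal sketch using Python's standard unicodedata module. The choice of the vowel "e" and of the acute accent as the tone mark is just an illustrative assumption, and the helper names are my own.

```python
import unicodedata


def combining_marks(text: str) -> int:
    """Count characters that are combining marks (non-zero combining class)."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)


def show(label: str, text: str) -> None:
    """Print each code point of the text with its Unicode name."""
    print(label, text)
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")


# Subdot style: precomposed e-with-dot-below, plus a combining acute for tone.
subdot = "\u1EB9\u0301"
# Vertical-line style: plain e, combining vertical line below, combining acute.
vertical = "e\u0329\u0301"

show("subdot:", subdot)
show("vertical line:", vertical)
print(combining_marks(subdot), combining_marks(vertical))  # 1 2
```

As typed here, the subdot form carries one combining mark on a precomposed letter, while the vertical-line form stacks two combining marks on a plain "e".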

A different way would be to look at the quality of content in the various Wikipedia editions. In some, the raw numbers of articles (which are the figures used in the tables mentioned above) are inflated by shell articles, by which I mean a stub that may have only a few words of text and perhaps an image. The list of Wikipedias does include a "depth" metric, which might be used (or perhaps adapted) to look for possible correlations between the quantity and quality of the content on the one hand, and the nature of the orthography on the other.
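For reference, here is a minimal sketch of the "depth" calculation as I understand its definition on Meta-Wiki: the ratio of edits to articles, times the ratio of non-article pages to articles, times one minus the "stub ratio" (articles over total pages). Treat the formula as an assumption to check against the current definition. The figures in the example are hypothetical, not those of any actual edition.

```python
def depth(edits: int, articles: int, total_pages: int) -> float:
    """Depth as (edits/articles) * (non-articles/articles) * (1 - stub ratio).

    The formula is an assumption based on the Meta-Wiki description of the
    "depth" column in the list of Wikipedias.
    """
    non_articles = total_pages - articles
    stub_ratio = articles / total_pages
    return (edits / articles) * (non_articles / articles) * (1 - stub_ratio)


# Hypothetical figures for an imaginary small edition, not real data.
print(depth(edits=1000, articles=100, total_pages=400))  # 22.5
```

An adapted version might, for example, weight article length directly, but that would require data beyond what the list of Wikipedias provides.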

Yet another way would be to consider the numbers of people working on these editions. Wikipedia counts the numbers of users, active users, and administrators per edition. Could one use these figures to better understand whether the more successful Wikipedia editions in extended Latin (in terms of numbers of articles or a depth metric) are so because of the efforts of a relatively small number of users? That's not to imply any negative judgment of such cases, but it would be useful to know if a complicated writing system (from the point of view of input) is not a hurdle for a large number of contributors (active users), or if it's really a case of a few savvy individuals carrying the load.1

And another approach would be to expand the scope of analysis to consider other factors: How many people speak the language? Is it taught in schools? How much printed material is available? Are there different dialects or written conventions that a contributor to or a reader of a given African language edition of Wikipedia must navigate? Any of these and perhaps others might, individually or in combination, affect the potential production of web content in general, and the success of these Wikipedias in particular.

One could also put the numbers in the background and do a qualitative study focusing on the experience of the editors of African language editions of Wikipedia. What might emerge from such discussions concerning the range of tasks involved in building and maintaining an active Wikipedia?

And then there are some stray questions certainly worth checking out. For instance, does the base interface language on which all of the African Wikipedias are built (English vs French) have any bearing at all on the success of Wikipedia editions? What about the degree of localization of the interfaces (from English or French to the language of content)? And does that degree of localization relate at all to the complexity of the script?

Research towards success with African language Wikipedias

Although the number of Wikipedias in African languages is relatively small (about 13% of all editions, which collectively contain less than 1% of the total number of articles in all Wikipedias combined2), there are arguably enough data and diverse user experiences to give us a better idea of both how to develop small Wikipedias in Africa, and how much of a factor the scripts used to write them might be in their relative success.

Looking beyond African languages to the experience of Wikipedia editions in other languages written in extended Latin (and non-Latin scripts) would be instructive. This would likely highlight not only methods to facilitate input of diverse writing systems, but also supportive environments (or "localization ecologies") for these languages in general. 

Success for African language editions of Wikipedia may not be found in imitating work on other editions so much as it would in identifying ways to leverage the strengths and unique resources of African language communities. Nevertheless, facilitating input is fundamental, relying at its most basic level on common technology (for keyboards, etc.) and features of the MediaWiki software.

With rapid advances in language technology, an additional focus should be how to adapt speech-to-text to African languages to facilitate creation of content from oral narratives, interviews, and exposition. This is a topic I hope to return to later.

1. The Yoruba (category 4 orthography) and Northern Sotho (category 3) Wikipedias, for instance, each benefited at different times from large numbers of articles created by a single user in their respective communities.
2. That's 38 of 292 if the tiny Dinka edition is included; 37/291 if not. And about 293k articles out of 48.6 million total. (All as of July 2018.)

Friday, August 31, 2018

Niamey 1978 & Cape Town 2018: 2. Extended Latin & African language Wikipedias

Image adapted from banner on the Yoruba Wikipedia, August 2018
What are the implications of extended Latin characters and combinations for production of digital materials in African languages written with them? The previous post discussed some of the process of seeking to harmonize transcriptions, in which the Niamey 1978 conference and its African Reference Alphabet (ARA) were prominent. That process had a logic and left a legacy for the representation in writing of many African languages. This post asks if there is a trade-off between the complexity of the Latin-based writing system and how much is produced in it using contemporary digital technologies.

One easy, although by no means conclusive, way to consider this question is to look at Wikipedia editions in African languages (those that are written in Latin script). The following table disaggregates 35 African language editions by the number of articles (from the list of Wikipedias, as of 9 August 2018) and the four "categories" of Latin-based orthography1 introduced in African Languages in a Digital Age (ch. 7, p. 58):

| Number of articles | Category 1 | Category 2 ("Category 1" + Latin 1) | Category 3 ("Cat. 1" or "2" + any of Latin Extended A, B, etc, Add'l, & IPA) | Category 4 ("Category 3" + Combining diacritics) |
|---|---|---|---|---|
| < 500 | Swati (447) | Sango (255) | Fula (226), Venda (265), Chewa2 (389) | Dinka (75), Ewe (345) |
| 500-1000 | Sotho (543), Tumbuka2 (562), Tsonga (563), Kirundi (611), Tswana (641), Xhosa (741), Oromo (772) | | Akan (561), Twi (609), Bambara3 (646) | |
| 1000-2000 | Zulu (1011), Kinyarwanda (1823) | Kongo (1179) | Luganda4 (1162), Wolof (1167), Gikuyu (1357), Kabiye (1455), Hausa (1891) | Igbo5 (1340) |
| 2000-10,000 | Shona (3761), Somali (5307) | | Kabyle (2860) | Lingala (3028) |
| 10,000-50,000 | Swahili (44,375) | | | Yoruba5 (31,700) |
| > 50,000 | Malagasy (85,033) | Afrikaans (52,847) | | |
| # of articles / # of editions = Average | 146,190 / 14 = | 54,281 / 3 = | 20,655 / 13 = | 36,488 / 5 = |

Grouping 1 & 2: 168,471 total articles / 17 editions =
Grouping 3 & 4: 57,143 total articles / 18 editions =

Looking at the top row with the smallest editions (less than 500 articles), one is tempted to highlight the high presence of African languages whose orthographies include extended Latin - categories 3 & 4. However, in the group of next highest number of articles (500-1000) there are more editions with category 1 orthographies (the simplest) than there are editions with category 3 in the group above that (1000-2000). And the next highest ranges (covering 2000-10,000) are roughly even between category 1 on the one hand, and 3 & 4 on the other. But then the three largest editions (and 3/4 above 25,000) are category 1 & 2.

So with just a visual analysis, there does not seem to be any clear pattern from arraying the editions in this way. Of course there will be other factors than the complexity of the script affecting the success of a Wikipedia edition written in it. But are there ways of looking at this raw data that can give us a clearer idea of what the effect of extended Latin - the ARA plus orthographies with other modified letters and diacritic combinations - might be on the size of Wikipedia editions?

One approach is to consider all the above editions combined, per category of orthography (totaling by column). This puts the focus on the degree of complexity of the writing system, perhaps muting the effect of other language- & location-specific factors. On the second to last row are column totals of the number of articles in all editions listed above, divided by the number of editions, to give an average figure. This yields an uneven pattern (2>1>4>3), since in the cases of 2 & 4, one large edition in a small total number of editions skews the category average up.

By the totals of the two simpler categories (1 & 2) and of the two extended Latin categories (3 & 4), however, one obtains possibly more useful numbers. This aggregation can be rationalized for our purposes here by the fact that the lower two categories are generally supported by commercially available keyboards and input systems,6 while the higher two categories require a specialized way to input additional characters and maybe diacritics (such as an alternative keyboard driver, or an online character picker).7
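As an illustration of how the four categories relate to character ranges, here is a rough sketch that assigns a category to a sample of text by looking at its code points. It reflects my simplified reading of the categories (ASCII only; plus Latin-1; plus any letter beyond Latin-1; plus any combining diacritic remaining after NFC normalization), assumes the sample is in Latin script, and uses a function name of my own invention.

```python
import unicodedata


def orthography_category(sample: str) -> int:
    """Return 1-4 for the most demanding character found in the sample."""
    category = 1
    for ch in unicodedata.normalize("NFC", sample):
        if unicodedata.combining(ch) != 0:
            return 4  # a combining diacritic survives NFC normalization
        if not ch.isalpha():
            continue  # ignore digits, spaces, and punctuation
        code = ord(ch)
        if code > 0xFF:
            category = max(category, 3)  # beyond Latin-1: Extended A/B, IPA, etc.
        elif code > 0x7F:
            category = max(category, 2)  # Latin-1 Supplement
    return category


for sample in ["Jamana tigi", "s\u00ea", "Gi\u0257o Alla", "\u1EB9\u0301"]:
    print(sample, orthography_category(sample))
```

A short sample can of course understate the category of a language's full orthography, so this would only be meaningful applied to a complete alphabet or a large body of text.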

The figures thus obtained show editions written in extended and complex Latin having on average about a third the number of articles as those written in ASCII and Latin-1. Admittedly, this is in part a result of the way categories have been chosen and figures aligned, but I'm proposing them as a perspective on the use of extended (and complex) Latin, and possible gaps in support. Before considering this in more detail, it is useful to compare with the numbers for non-Latin scripts.
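
To make the four categories more concrete, here is a minimal Python sketch of how a piece of Latin-script text might be sorted into them. It is only an approximation under stated assumptions: category 1 is read as ASCII, category 2 as anything within Latin-1, category 3 as any precomposed character beyond Latin-1, and category 4 as any text that still needs a combining diacritic after Unicode NFC normalization. The function name `orthography_category` is hypothetical.

```python
import unicodedata

def orthography_category(text: str) -> int:
    """Return 1-4 for Latin-script text, following the 'one jot' rule:
    a single higher-category character raises the whole text."""
    # NFC folds letter + diacritic into one precomposed character where one exists.
    composed = unicodedata.normalize("NFC", text)
    # A combining mark left over after NFC has no precomposed form: category 4.
    if any(unicodedata.combining(ch) for ch in composed):
        return 4
    highest = max((ord(ch) for ch in composed), default=0)
    if highest > 0xFF:   # beyond Latin-1 (assumed boundary for "extended Latin")
        return 3
    if highest > 0x7F:   # beyond ASCII but within Latin-1
        return 2
    return 1

print(orthography_category("\u0161"))        # š, as in Northern Sotho
print(orthography_category("\u0254\u0303"))  # open-o with combining tilde, as in Ewe
```

Note that this sketch looks at every character, so punctuation beyond Latin-1 (curly quotes, for instance) would also raise the category; a more careful version would restrict the check to letters.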

What about non-Latin scripts & African language Wikipedias?

Number of articles
< 500
Tigrinya (168)
Amharic (14,321)
Egyptian Arabic (19,170)
# of articles / # of editions = Average
33,659 / 3 = 11,220
There are only three editions of Wikipedia in African languages written in non-Latin scripts.8 Two of those - Amharic and Tigrinya - are written with the Ge'ez or Ethiopic script unique to the Horn of Africa.

Arabic is the third. How to count this language for the purposes of this informal analysis raises a question. Arabic, of course, has been established as a first language in North Africa for centuries, but it is also a world language, spoken natively in southwest Asia (having originated in Arabia), and learned as a second language in many regions. Drawing users from this wide community, the Arabic Wikipedia is among the top 20 overall, with twice as many articles as all of the editions discussed above combined. It is more than an African language edition. For this analysis, therefore, I have chosen instead to count just the Egyptian Arabic Wikipedia.

Taking these three editions, we then get an average number of articles (11,220), which is close to what is seen for the Latin categories 1 & 2 (11,789). The usual caveats apply for such a small sample, but taking the numbers as they are, it is interesting that Wikipedias in the complex Arabic alphabet and the large Ge'ez abugida (alphasyllabary) are on average much larger than those of the ostensibly simpler extended Latin (3,175).9
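
The arithmetic behind these comparisons is simple enough to check in a few lines. The Python sketch below uses only the article counts and category averages quoted in this post; the variable names are illustrative.

```python
# Article counts as quoted in this post (a snapshot at the time of writing).
non_latin_editions = {
    "Tigrinya": 168,
    "Amharic": 14_321,
    "Egyptian Arabic": 19_170,
}

total = sum(non_latin_editions.values())
average = total / len(non_latin_editions)
print(f"{total:,} / {len(non_latin_editions)} = {average:,.0f}")

# Category averages quoted in the text, for comparison.
latin_simple_avg = 11_789    # Latin categories 1 & 2
latin_extended_avg = 3_175   # Latin categories 3 & 4
print(f"extended / simple Latin: {latin_extended_avg / latin_simple_avg:.2f}")
print(f"non-Latin / extended Latin: {average / latin_extended_avg:.1f}")
```

With a sample of three, a single edition moves the average considerably, which is the caveat noted above.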

Again, script complexity is but one factor, and in this case probably not the most important, since the two non-Latin scripts in question have long histories of use in text in parts of Africa - much longer than any form of Latin script. Nevertheless, from the narrow perspective of what is required for users to edit Wikipedia, the technical issues are in some ways comparable if even more demanding.

Arabic has had standard keyboards since the days of typewriters. The issues there are not so much the input, but whether systems can handle the directionality and composition requirements of the script.

The Ge'ez script, on the other hand, does not involve complex composition rules or bidirectionality. However, it has a total of over 300 characters (including numerals and punctuation; more again if extended ranges are added). The good news is that there are numerous input systems to facilitate their input. Literacy in the script and availability of input systems would not be limiting factors for content development in major languages using this script. The difference in development of the Amharic and Tigrinya editions of Wikipedia may relate to both the larger population speaking Amharic (as a first or second language), and its use officially in a relatively large country (Ethiopia). Development of content in Tigrinya - a cross-border language - might also be hindered by issues particular to one of the two countries where it has many speakers (Eritrea).

From the above one might suggest that complexity of the written form (to be taken here as including the nature of the script itself, and the size of the character set) may be a limiting factor on content development, but that other factors, such as a literate tradition, official use, and technical support for digital production may overcome such limitations. In the case of African languages written in Latin script, however, any literate tradition is recent, and they are often marginalized in official and educational contexts. For those written with extended Latin, there is the additional factor of lack of an easy and standardized way of inputting special characters. Paradoxically, it seems, a modification of the most widely used alphabet on the planet may actually hobble efforts to edit in these languages.

Facilitating input in extended Latin for African language Wikipedias?

Wikipedia editing screen with "Special Characters" drop-down modified to show all available ranges.
Assuming that the inconvenience of finding ways to input extended Latin characters may be a factor in the success of African language Wikipedias written with categories 3 and 4 orthographies, a quick fix might be to add new ranges for the modified letters used in African languages to the "special characters" picker in the edit screens. As it is currently structured, the extended characters necessary for a category 3 or 4 orthography might be sprinkled around in up to 3 different ranges (see at right). And within each range, they are not presented in a clear order, so they are sometimes hard to find.

Since it may be too complicated to have a special range for each language edition, another possibility would be to draw inspiration from the Niamey 1978 meeting's ARA, and combine all extended Latin characters and combinations needed for all current African language Wikipedias into a common new range.
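
As a rough illustration of what assembling such a common range could involve, the Python sketch below pools the characters beyond Latin-1 from a few sample alphabets and lists each one once, in code point order. The `extra_letters` table is a hypothetical and deliberately incomplete sample, built only from characters mentioned in the notes to this post; it is not an inventory of what any Wikipedia edition actually needs.

```python
import unicodedata

# Hypothetical sample: only characters mentioned in the notes to this post.
extra_letters = {
    "Northern Sotho": ["\u0161"],       # š
    "Wolof": ["\u014b"],                # ŋ
    "Luganda": ["\u014b"],              # ŋ
    "Chewa": ["\u0175"],                # ŵ
    "Ewe": ["\u0254", "\u0254\u0303"],  # open-o, and open-o + combining tilde
}

def common_range(alphabets):
    """Pool every letter or combination once, sorted by code point."""
    pooled = {
        unicodedata.normalize("NFC", entry)
        for letters in alphabets.values()
        for entry in letters
    }
    return sorted(pooled)

for entry in common_range(extra_letters):
    codes = " ".join(f"U+{ord(ch):04X}" for ch in entry)
    print(entry, codes)
```

Shared letters such as ŋ appear only once in the pooled list, which is the point of a common range: contributors to different editions would look in the same place.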

Of course, as mentioned above, there are other factors that can contribute to the success or not of Wikipedia editions in African languages written with extended Latin, but this innovation would at least make editing more convenient for contributors to these editions. And perhaps it might have a positive effect on the quantity and quality of articles in these Wikipedias.

In the third and concluding article in this series, I'll step back to look at this analysis and consider some other ways to look at the data on African language editions of Wikipedia, and in particular, those written in extended Latin.

1. This categorization was intended to help characterize the technical requirements for display and input of various languages. Although the technology has improved to the point that more complex scripts are generally displayed without the kinds of issues one encountered even a decade ago, input still requires extra steps or workarounds. The four categories are additive in that each higher category builds on those below, with added potential issues. It is also a "one jot" system in that, for example, a single extended Latin character, say š in Northern Sotho or ŋ in Wolof, makes their orthographies category 3 rather than category 1 or 2 (respectively), and the use of the combining tilde over the extended Latin character for open-o - ɔ̃ - makes Ewe a category 4 rather than 3. The higher the category, the more the potential issues with display and input (although technical advances tend to level the field, esp. as concerns display).
2. The only non-basic Latin character used in Chewa is the w with circumflex: ŵ. Apparently it represents a sound important in only one dialect of the language, and is used infrequently in contemporary publications. On the other hand, there is a proposed (not adopted) orthography for Tumbuka that includes the ŵ. Without this character, either language would be a category 1 orthography; with it, category 3.
3. Bambara is a tonal language. Most often, it seems, tones are not marked in text; however, they can be for clarity, and some dictionaries make a point of indicating tone in the entries (not just pronunciation). If tones are unmarked, Bambara would be considered a category 3 orthography; with tones, category 4.
4. The addition of the letter ŋ puts Luganda in category 3 rather than 1.
5. The dot-under (or small vertical line under) characters used notably in Yoruba and Igbo are particular to southern Nigeria, and not included in the ARA. Yoruba in Benin is written with characters from the ARA. These are tonal languages, and tone is usually marked.
6. When I first proposed the categorization (itself a modification of an earlier effort), there were some questions about why to have a category 2 separate from category 1. That distinction had its origins in the early days of computing where systems used 7-bit fonts, meaning that accented letters (diacritic characters) used in, say, French or Portuguese, could not be displayed. Even as systems using 8-bit fonts enabled use of diacritics commonly used in European languages, display issues would still crop up (as a sequence of characters where an accented letter should be). Nowadays, such display issues are rare, and limited (as far as I can tell) to documents in legacy encodings. On the other hand, input of accented characters may require, depending on the keyboard one is using, switching keyboard drivers or using extra keystrokes - so one will occasionally see ASCIIfication of text in such languages (apparently as a user choice).
7. The difference between categories 3 (extended Latin) and 4 (complex Latin) was once significant enough from the point of view of display that informal appeals to Unicode to change its policy of not encoding new "precomposed" characters were common.
8. The Wikipedia incubator includes several African language projects, which are not covered here. These include some in non-Latin scripts (Arabic versions, N'Ko, and Tamazight) and some in Latin-based orthographies. I mentioned one of the latter - Krio - in a previous post, and hope to do an overview of this space in the near future.
9. Average for all African language editions is 7704. By comparison the average for all Wikipedias is 166k.

Monday, August 13, 2018

Niamey 1978 & Cape Town 2018: 1. Some thoughts about extended Latin & content in African languages

Image features the 31 modified letters & diacritic combinations in the African Reference Alphabet, 1978. (Not all are currently in use.)

The world 40 years ago, when the Meeting of Experts on Transcription and Harmonization of African Languages took place in Niamey, and that of the Wikimania 2018 conference in Cape Town (which ended last month) seem very distant from each other. But from the angle of the written form of African languages at least, the concerns of the two events are not so distant.

One of these concerns is the extended Latin alphabets that were on the agenda in Niamey, and which are used in about half of the African language editions of Wikipedia. This post and the next consider these two vantage points, asking whether extended Latin is associated with less content creation, and what might be done to facilitate use of the longer Latin alphabet.

Adapting the Latin script to African realities

In 1978, representatives of countries that had gained independence no more than a couple of decades earlier, or in some cases only a few years before, met in Niamey to advance work on writing systems for the first languages of the continent. One of the linguistic legacies of the colonial period was the Latin alphabet (even in lands where other systems had been used). But given phonological requirements that were sometimes very different from what Latin letters represented in Europe, linguists added various modified letters, diacritics, and digraphs to write African languages (sometimes even a special system for a single publication1).

So, that legacy also often took the form of multiple alphabets and orthographies for a single language, reflecting the different origins of European linguists (frequently Christian missionaries from different denominations), locations in which they worked (perhaps places where speakers of a language had particular dialects or accents), and individual skills and choices. After independence, many African countries undertook to simplify this situation, but they still often ended up with alphabets and spelling conventions different from those in neighboring countries.

The linguists and language specialists in Niamey, as in other such conferences of that era (many of which, like the one in Bamako in 1966, were supported by UNESCO), were concerned with further simplifying these discrepancies, with accurate and consistent transcription of languages that were for the most part spoken in two or more countries (whose speaker communities were divided by borders). That included adopting certain modified letters and diacritic combinations for sounds that were meaningfully significant in African languages (some of which correspond with characters in the International Phonetic Alphabet).

Language standardization, which is actually a complex set of decisions, was a real concern where there were on the one hand diverse peoples grouped in each state and on the other hand limited resources for producing materials and training teachers. At its most basic level, though, standardization of any sort required an agreed upon set of symbols and conventions for transcription.2

A reference alphabet for shared orthographies

The African Reference Alphabet (ARA)3 produced by the Niamey meeting was an effort in that direction. It built on the longer post-independence process to facilitate use and development of written forms of African languages - a process that had its roots in the early introduction of the Latin script (before the formal establishment of colonial rule) and efforts during the colonial period such as the influential (at least in the British colonies) 1928 Africa Alphabet. The ARA was intended - and to some degree at least still serves - as a sort of palette from which orthographies could be drawn to address specific linguistic, multilingual national, and cross-border language needs.4

And that set of concerns - alphabets, orthographies and spelling conventions - turned out to be the starting point for later efforts in the context of information and communication technology (ICT) to localize software and interfaces, including Wikipedia and other Wikimedia interfaces, and to develop African language content online, including for Wikimedia projects, even if it does not seem as visible as other challenges.

What I haven't seen is an evaluation of the efforts at Niamey and the other expert meetings on harmonization of transcriptions, although the most used of the characters in the ARA can be seen in various publications, and all but perhaps one are in the Unicode standard.

In any event, the situations of the various African languages are diverse, with some having well established corpora while others are "less-resourced," and in the worst case, inconsistently written.

Extended Latin and composing on digital devices

One important element in discussions in the process of which Niamey was part was the role of modified letters - what are now called extended Latin characters - in transcribing many African languages. The ARA includes no fewer than 30 of them (22 modified letters and 8 basic Latin with diacritics5). These added characters and combinations are not intended to all be used in any one language, but represent standard options for orthographies. The incorporation of some of these into the writing of a single language makes the writing clearer, and has no drawbacks for teaching, learning, reading, or handwriting (although there are arguments against the use of diacritics). Since the establishment of Unicode for character encoding, the screen display of these characters is not a problem (so long as fonts have been created including glyphs for the characters).

However, the presence of even just one or two extended Latin characters leads to problems with standard keyboards and keypads - where are you going to place an additional character, and how is the user to know how to find it? This is a set of issues that was of course recognized back in the era of typewriters. One of the spinoffs from the Niamey conference was the 1982 proposal by Michael Mann and David Dalby (who attended the meeting) for an all lower-case "international niamey keyboard," which put all the modified characters (of an expanded version of the ARA) in the spots normally occupied by upper-case letters.

While that proposal never went far (I hope to return to the subject later) - due in large part to its abandonment of capital letters - it was but one extreme approach to a conundrum that is still with us. That is, how to facilitate input of Latin characters and combinations that are not part of the limited character sets that physical keyboards and keypads are primarily designed for. It's not that there aren't ways of facilitating input - virtual keyboard layouts (keyboard drivers that can be designed and shared, as with Keyman, and onscreen keyboards) have been with us for years, and there are other input systems (voice recognition / speech-to-text being one). The problem is lack of standard arrangements and systems for many languages. Or perhaps in the matter of input systems, the old saw, "the nice thing about standards is there are so many to choose from," applies.

The result, arguably, may be a drag on widespread use of extended Latin characters, and as a consequence on popular use on digital devices of languages whose orthographies include them. Or a choice to ASCIIfy text (using only basic Latin), as has been the case with Hausa on international radio websites. Or even confusion based on continued use of outdated 8-bit font + keyboard driver systems, as witnessed in at least one case with Bambara (see discussion and example).

What can the level of contributions to African language editions of Wikipedia tell us about the effect of extended Latin? This will be explored in the next post: Extended Latin & African language Wikipedias.

1. For example, some works on forest flora had lists of common names in major languages of the region.
2. Arguably in the case of a language written in two or three different scripts, one could have a system in each script and an accepted way to transliterate between or among them.
3. The only other prominent use I found of the term "reference alphabet" was that of the ITU for their version of ISO 646 (basically the same as ASCII): "International Reference Alphabet." The concept of reference alphabet seems to be a useful one in contexts where many languages are spoken and writing systems for them aren't yet established.
4. This approach - adopting a standard or reference alphabet for numerous languages - was taken by various African countries, for example Cameroon and Nigeria. These efforts were without doubt influenced by the process of which Niamey and the ARA were part.
5. By comparison, the Africa Alphabet had 11 modified letters and did not use diacritics. All 11 of the characters added in the Africa Alphabet were incorporated in the ARA. It is worth noting that in the range of modified letters / special characters created over the years, some are incorporated into many orthographies, others fewer, and some are rarely used if at all.