Saturday, December 28, 2019

AfricaNLP2020 (Addis Ababa, 26-4-20) & related items

Quick post to call attention to an upcoming workshop on machine learning (ML) and natural language processing (NLP) in African languages, and its call for participation. Also a list of related initiatives, including the Machine Learning and Data Science in Africa (MLDS Africa) forum.

AfricaNLP2020 workshop - "Unlocking Local Languages"


The AfricaNLP2020 workshop will be held on 26 April 2020 as part of the Eighth International Conference on Learning Representations (ICLR) in Addis Ababa, Ethiopia. The workshop is described as follows:
"The rise in ML community efforts on the African continent has led to a growing interest in Natural Language Processing, particularly for African languages which are typically low resource languages. This interest is manifesting in the form of national, regional, continental and even global collaborative efforts to build corpora, as well as the application of the aggregated corpora to various NLP tasks."

The workshop aims are described as:
"• to showcase this work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low resource languages;
• to provide a platform for the groups involved with the various projects to meet, interact, share and forge closer collaboration;
• to provide a platform for junior researchers to present papers, solutions, and begin interacting with the wider NLP community;
• to present an opportunity for more experienced researchers to further publicize their work and inspire younger researchers through keynotes and invited talks."

Submissions "for oral and poster presentations on a wide variety of NLP tasks for African languages ... will be evaluated and selected through a peer review process." Deadline: 1 February 2020. (They can be submitted via EasyChair.org.)

Corpora-building, ML, MT, & NLP initiatives


The workshop page lists six collaborative efforts on African languages, which I'll list below, as seen on their page, along with a seventh I learned about recently:
  • Niger-Volta Project - Speech Recognition, Language Identification, Machine Translation & Natural Language Processing for West African Languages 
  • Masakhane.io - A Focus on Machine Translation for African Languages
  • Cocohub.cc - A crowdsourced dataset builder and community for NLP in underrepresented languages (apparently translating MS-COCO captions into Afrikaans, Amharic, Bukusu, Coptic, Fanti, Luganda, Luo, Masai, Meru, and Nandi)
  • Umva.ai - An initiative to build a Natural Language Processing Platform for Kinyarwanda and to make it available to all developers and for all use cases
  • EthioNLP - Ethiopian Natural Language Processing Research
  • AI4D - African Language Dataset Challenge - A community effort to help uncover and create African Language Datasets for improved representation in the field of NLP (see also an update on its "Dataset Challenge" from 23 Dec. 2019)
  • PidginUNMT - Unsupervised Neural Machine Translation from West African Pidgin to English (this was written up on Techcabal on 16 Dec. 2019)
It's great to see this kind of activity related to language technology. I've often thought that multilingual Africa has the potential to lead and innovate in this area.

MLDS Africa


MLDS Africa is an online network with a Googlegroup for communication among research groups such as the above, and a webpage with info on upcoming conferences and workshops, like AfricaNLP2020.

ICLR


The image above connected with the ICLR conference hosting AfricaNLP2020 came from a page on SyncedReview.com with details on papers accepted for the main conference as of 20 Dec 2019. (The workshops on the first day of ICLR, such as AfricaNLP2020, evidently have their own deadlines.)

Monday, December 16, 2019

Yahoogroups & African languages, follow-up

A quick follow-up to my previous message regarding the deletion of Yahoo Groups. In brief, the saga is still ongoing. This message will provide an update and mention other Africa - and especially African language - related groups (as maybe a missing dimension in this saga).

I've spent what time I could on the matter of saving the content of several groups of interest mentioned in the previous message. For AfrophoneWikis, Etienne Ruedin was particularly helpful in outlining possibilities and sharing downloaded backup files. And for Unicode-Afrique, Tafsir Baldé offered help and suggested a possible partnership with Idemi Africa. (See also below re Archive Team.)

In the meantime, Yahoo has extended the period during which users can request their data in downloadable form1 - and apparently also pushed back the date for pulling down the group pages - to 31 January 2020.2 The message content of the groups is no longer accessible now.

For the groups of interest, one question is what will become of their content, and another is what the future holds for those that are still viable. The data question is somewhat answered to the extent that individual users request their data, but Yahoo does not facilitate archiving elsewhere in a way that would allow long-term access.

The larger question asked by many is why make this move in the first place. I won't explore that here, but it does highlight the problem of having such a big chunk of internet history subject to one corporation's short-term interests.

The Archive Team's efforts


What's worse is that when a volunteer group - the Archive Team - ramped up an effort to save Yahoo Groups, Verizon (Yahoo's owner) evidently worked to block such "mass-archiving" efforts. Hard to explain that.

It was only late during the scramble to do the necessary to save data (some of which had to be done manually) that I learned of the Archive Team's initiative for Yahoo Groups. And only now, digging deeper, did I realize that they had already archived a significant number of them on Archive.org. I was interested to see that those archives already included AfrophoneWikis, Unicode-Afrique, and AfricanLanguages.3 But they do not cover all groups, especially not the smaller ones.

Overlooked African dimension of Yahoo Groups?


One aspect of Yahoo Groups that I haven't seen discussed, but which over the years I got the impression was important, is significant use by Africans. In other words, Yahoo Groups was never just a European and "core Anglophone" platform. Yet there's not much of an African voice I'm aware of in this discussion about the end of Yahoo Groups.

Among the most active Groups founded and participated in by Africans that I've noted are OmoOdua ("Yoruba socio-cultural discussion forum," with 1660 members and more than 160K messages since 2007) and Mwananchi ("A current affairs forum on Africa and the issues affecting the continent," with 1541 members and more than 180K messages since 2000).

However, there are many that are more modest in size, such as Internet-Niger ("Internet au Niger," with 589 members and just under 16K messages since 1999). So basically I'm suggesting that among the stories of who loses with the deletion of Yahoo Groups, one that isn't getting much if any play would be the large user community in or of Africa.

African language related Yahoo Groups


Among the groups of personal interest are the five I listed in the previous posting that deal in one way or another with African languages. Three of those - AfricanLanguages, AfrophoneWikis and Unicode-Afrique - are ongoing lists, although like most other Groups, much less active than they once were. The other two - A12n-archives and PAL-archives - were set up as back-up archives for lists whose original archives were then deleted some time later. Fortuitous to have had the back-ups, but it is ironic that they too got the axe.

So here I'll list a few other African language related Yahoo Groups that I'm aware of - most of them tiny in terms both of membership and activity. However they reflect a range of interests, even if they were not always as successful in this medium as their creators evidently had hoped.
  • Ethiopic (2001, 19 members, ~100 messages). It was "established to begin dialogue and understanding regarding the computer keyboard layout for Ethiopian Languages." Its ambitious goals were somewhat obviated by work elsewhere on Ethiopic/Ge'ez keyboards.
  • Kiswahili (2001, 993 members, ~50K messages). "A forum about the latest Swahili news." I did not have much to do with this list, but had the impression it had a lot of activity and much of it was actually in English.
  • Linux2Igbo (2004, 20 members, 174 messages). "This is a project that aims to translate Linux GUI such as KDE and GNOME and popular software such as Open Office, Mozilla Firefox and Thunderbird into the Igbo language." I was asked to serve as a moderator of L2I, and made some contributions (even though I do not speak the language).
  • Mandenkan_sebeli (aka Mandenkan sebe web ka; 2003, 6 members, ? messages). Intended to promote writing of Mandenkan (Manding languages) on the web. I don't recall that this group had any archived messages - indeed the Yahoo data I received had only links for it.
  • Mzi_kaPhalo (2001, 71 members, ~150 messages). It was "set up mainly to serve the Xhosa Translators' Community, however other Xhosa language issues can also be discussed." My impression was that it was active for a brief time then mostly quiet.
  • WowlenPular (2002, 7 members, 15 messages). Set up to promote the "survival" and use of Fula, with accent on the Pular of Guinea. I posted several of the few messages on this group, including some excerpts of texts from books in Pular.

Next steps


The data I got from Yahoo included .mbox files with all the message, link, and file content of the groups I was subscribed to, including all of those mentioned above. In theory I think one could with these files reconstruct any group if one had a reason to do so and a new host for it. So that is one possible discussion for certain groups.
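
As a rough illustration of what a first look at one of those .mbox files might involve, here is a minimal sketch using the `mailbox` module from Python's standard library. The function name `summarize_mbox` and the idea of tallying messages per year are my own assumptions for the example, not anything Yahoo provides, and I have not checked how closely Yahoo's exported files follow the standard mbox layout.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def summarize_mbox(path):
    """Return (total, per_year, subjects) for the mbox file at `path`.

    `summarize_mbox` is a hypothetical helper written for illustration.
    It assumes the file is in the standard mbox format that Python's
    `mailbox.mbox` class can read.
    """
    # create=False: raise an error instead of creating an empty file
    # if the path does not exist.
    box = mailbox.mbox(path, create=False)
    per_year = Counter()
    subjects = []
    total = 0
    try:
        for msg in box:
            total += 1
            subjects.append(str(msg.get("Subject", "(no subject)")))
            date_header = msg.get("Date")
            if date_header is None:
                continue
            try:
                sent = parsedate_to_datetime(str(date_header))
            except (TypeError, ValueError):
                # Unparseable Date header: count the message, skip the year.
                continue
            per_year[sent.year] += 1
    finally:
        box.close()
    return total, dict(per_year), subjects
```

Calling `summarize_mbox` on a downloaded file (for instance a hypothetical `WowlenPular.mbox`) would give the message count, a year-by-year tally, and the subject lines, which seems like a reasonable starting point before deciding whether a group is worth rehosting or simply republishing in readable form.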

On the technical side, I am not sure how the versions saved on Archive.org might be used if one wanted to continue a group. This question might be useful to explore.

For most groups, the most one may want to do would be to present the data in an easily accessible, navigable, and readable format. This could include many Yahoo Groups about (or in) African languages, and that could be another discussion. 

_____________________
1. See Barbara Krasnoff's helpful article in The Verge, "How to download your Yahoo Group data."
2. Per "Yahoo is Extending It’s Deadline To Delete The Content Of Yahoo Groups," NewDayLive.com, 13 Dec. 2019. The wording was not so clear, but one had the hope that the message content online might also endure a while longer. That hope was not borne out.
3. These archives are organized in batches of groups, and the files for each individual group are available in gzip format. These apparently were saved over the course of a few years - AfricanLanguages for instance in 2016, Unicode-Afrique in 2017, and AfrophoneWikis in 2018.