Thursday, February 09, 2017

Health info in African languages, on 2 non-African sites

Here are quick reviews of two websites - one Australian, the other American - that have health information in numerous world languages, including a number from Africa. Both are primarily intended to serve immigrant communities. This post will then return briefly to the theme of the benefits of systematically sharing and improving health-related information composed in or translated into African languages.

Health Translations

Health Translations is a website maintained by the government of the state of Victoria in Australia.  It has information (mainly documents such as fact sheets and flyers, from what I can tell, some with illustrations) on over 80 topics in a total of almost 100 languages or language varieties (although not all information is in every language, and some languages have few items).

The African languages for which there are materials include: Afrikaans; Akan; Amharic; Arabic; Bemba; Dinka; Juba Arabic; Kirundi; Krio; Lingala; Nuer; Oromo; Shona; Somali; Sudanese Arabic (also listed as Sudanese); Swahili (Congolese); Swahili (Kenyan); and Tigrinya.

This is an impressive collection of materials from various sources, apparently all Australian, and in different formats. They appear to all be translations - a given material may be available in a few or quite a number of language versions. Navigating from a click on the desired language in the list of languages (which helpfully includes both English & native names/scripts) to a particular topical resource requires interacting with screens in English. That is not surprising, but when one gets to the list of topics and resources in a particular language, the titles are only in English. Then, on the list of languages into which a material has been translated (this is a typical navigation sequence), the language names are in English only (no native scripts used). So the resource appears intended to be used by or with the help of professionals or others who can read English.

Source: Bushfire smoke & your health [am]
I'm not able to evaluate the quality of the translations, but noted French in Lingala (which may simply be typically used loanwords) and by chance an anomalous English word in an Amharic text (image).

All the several documents I viewed were PDFs, mainly text but some image (meaning the text cannot be searched or copied out for editing into other materials). Spot checking some non-Latin text, specifically Ethiopic/Ge'ez used for Amharic and Tigrinya, and complex Latin, specifically for Dinka, there were some issues with the text that would interfere with searches or copying out passages (such problems are not uncommon with PDF rendering, even when visually the PDF presents everything correctly and in its intended place).

Some Amharic text when copied out and pasted showed capital A for አ and E for እ in initial position (for example, here). The corresponding characters are appropriate, interestingly, but this makes search or reuse of such text problematic.

From original (l.); copy-pasted out (r.)
Source: Bushfire smoke ... [din]

The Dinka text sampled showed some typical problems with complex Latin in PDFs. Dinka is written with what I call in ALDA a "category 4" Latin orthography in that it includes extended Latin characters (aka modified letters) plus combining diacritics, sometimes together as in the open-o with diaeresis in the word "daiɣɔ̈kthai" (dioxide) featured on the left side of the image. Copying that word from the PDF and pasting it in a word processor or advanced text editor yielded the results on the right, missing one extended character and the combining diacritic on the other. This complicates any potential re-use of this text, but also means that document, folder, or web searches will not pick up words with such character combinations.
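To make the Dinka example concrete, here is a minimal Python sketch that lists the code points in the word and reports what a damaged copy has lost. The `pasted` string is hypothetical: it is reconstructed from the description above (one extended character and the combining diacritic missing), not taken from the actual paste, and the helper names `describe` and `lost_characters` are made up for illustration.

```python
import unicodedata
from collections import Counter

def describe(text):
    """Return (codepoint, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def lost_characters(original, pasted):
    """Return the code points present in original but missing from pasted."""
    missing = Counter(original) - Counter(pasted)
    return sorted(f"U+{ord(ch):04X}" for ch in missing.elements())

# The Dinka word from the post, spelled with escapes so the code points are explicit:
# U+0263 (gamma), U+0254 (open o), U+0308 (combining diaeresis).
word = "dai\u0263\u0254\u0308kthai"

# Hypothetical damaged copy, built from the description above (not the actual paste).
pasted = "dai\u0254kthai"

for codepoint, name in describe(word):
    print(codepoint, name)

print("Lost in copy:", lost_characters(word, pasted))
```

Since Unicode has no single precomposed character for open-o with diaeresis, the base letter and the combining mark stay as two separate code points even after NFC normalization. That is part of why such sequences are easy for a PDF text layer to get wrong.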

Will return to these issues, why they're important, and what to do about them in the last section of this post.



HealthReach: Health Information in Many Languages is a program of the US National Library of Medicine of the National Institutes of Health. It includes translations in 46 languages maintained on the MedlinePlus site. A total of almost 350 topics are listed (although here too, not all information is in every language, and some languages have fewer items than others).

The African languages for which there are materials include: Amharic; Arabic; Oromo; Somali; Swahili; and Tigrinya. The native names of languages are featured on the list of languages, except oddly for Amharic and Tigrinya, which are transliterated into Latin ("amarunya" instead of አማርኛ, and "tigrinya" instead of ትግርኛ).

This also is an impressive collection from diverse sources, in this case American, but it is longer on topics and shorter on languages covered. The list of topics for each language also includes the titles in the language and its script - except again for Amharic and Tigrinya (not even transliterations) - as well as in English.

All materials checked were PDFs. There are no materials for African languages with complex Latin scripts.

As for non-Latin scripts, text in Arabic seems to behave as intended, from small samples. On the other hand, some Amharic text when copied out and pasted showed the same capital A for አ and E for እ observed above, plus O for ኦ (see here).  A Tigrinya document had a similar issue. So this issue may have to do with a problem in PDFs for handling a particular set of characters - አኡኢኣኤእኦኧ (representing glottal stop plus the range of vowels) - or a subset of them, which might be helpful to know when troubleshooting.
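For troubleshooting, a rough check along these lines could flag the substitution in text copied out of a PDF. This is only a sketch under stated assumptions: it treats any ASCII letter directly attached to an Ethiopic character as suspect, it covers only the basic Ethiopic block (U+1200 to U+137F), and the function name and the damaged sample string are invented for illustration.

```python
import re

# Basic Ethiopic block, U+1200 to U+137F (supplement blocks are ignored in this sketch).
ETHIOPIC = "\u1200-\u137F"

# An ASCII letter immediately before or after an Ethiopic character.
SUSPECT = re.compile(f"[A-Za-z](?=[{ETHIOPIC}])|(?<=[{ETHIOPIC}])[A-Za-z]")

def find_suspect_letters(text):
    """Return (offset, letter) pairs for ASCII letters glued to Ethiopic text."""
    return [(m.start(), m.group()) for m in SUSPECT.finditer(text)]

clean = "\u12A0\u121B\u122D\u129B"   # አማርኛ as it should be encoded
damaged = "A\u121B\u122D\u129B"      # hypothetical paste with A in place of አ

print(find_suspect_letters(clean))
print(find_suspect_letters(damaged))
```

English words that are separated from the Ethiopic text by a space are not flagged, so a legitimately mixed-language document should pass, while a stray A, E, or O fused to an Amharic or Tigrinya word will be reported with its position.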

Health education materials and the "2Ds & 4Rs"

In highlighting aspects of public health messaging during the Ebola epidemic in West Africa (2014-15), this blog suggested a systematic approach to sharing and improving materials that were developed and used in that context (with primary attention to text and images). A mnemonic - 2Ds & 4Rs - was put forth in October 2014, initially to explain the rationale for reposting and discussing various Ebola education materials, but also as a way to capture the ideal cycle of utility of such production. Too often, materials are developed, used for a particular purpose, and then forgotten, when they could add to a growing living corpus of resources to tap for future work. This is important in any field and language, but arguably especially important in health, and for languages that have fewer resources and emerging terminologies / technical lexicons, such as many in Africa.

In that context I propose to use the 2Ds & 4Rs to consider the efforts represented by the two sites discussed above. Of the 6 elements of this model, the first three have to do more with the sharing and use of materials, and the last three with their longer term development and potential re-use. These are listed with brief explanations and what I see as relevance to the two sites:
  • Dissemination (making materials available, including via multiple sources)
    • Both sites bring together and post materials from diverse sources, increasing their exposure and access to them.
  • Demonstration (showing how materials in African languages can be presented, including in cases where complex scripts are involved)
    • Both sites show that African language materials can be presented on the same footing as other world languages.
    • However, the HealthReach presentation does not use available technology to present the native names of Amharic and Tigrinya, or titles of materials in those languages.
  • Reading (creating or translating text materials with attention to how they may be read aloud in groups or over local radio, which may be more likely scenarios for their use than the typical Western expectation of silent reading by individuals)
    • It appears that all or most of the materials from diverse sources compiled on the two sites are translations from English of technical descriptions and advice. It is not clear how well adapted they are for the range of uses and audiences they might serve.
  • Review (written material - text - is well suited for review, comparison, and analysis; such material, especially in less resourced languages and on issues of public importance like health, should undergo such treatment)
    • No information on how any of the materials may be or have been reviewed, either in the diverse organizations where they originated, or in the projects hosting the two websites. 
    • Image PDFs, where these occur, do not lend themselves to processes of review.
    • Text PDFs with problems in their encoding of non-Latin or complex Latin scripts present problems for review.
  • Revision (after review of materials, and in response to other information and feedback relevant to them, materials should undergo appropriate revisions in content, form of language, copyediting, and presentation)
    • No information on any revisions of any of the materials.
    • Issues cited under "Review" with image PDFs and with text PDFs that have encoding problems also hinder revision work.
  • Re-use or re-purposing (text materials can be re-used or sections re-purposed)
    • No information on re-use or re-purposing of any of the materials.

The two sites profiled above and the various health and medical education materials presented on them represent an important resource for fifteen African languages (and some varieties of two of those).

One additional question is whether such materials, intended primarily to serve needs of immigrants in Australia and the US, might be useful as is or with modifications, for speakers of the same languages in relevant African countries. Or in the reverse sense, whether any health extension materials from Africa might inform revision of these materials and development of new ones. A next step could be for a site to begin to collect health materials in African languages from all sources.

There are many directions in which this could be taken, with the goals of improving availability, quality and utility of health education information in a range of African languages. One, for example, is linking with the longstanding WikiProject Med's Translation Task Force for development of articles in those African languages that have Wikipedias (such as Afrikaans, Akan, Amharic, Arabic, Kirundi, Lingala, Oromo, Shona, Somali, Swahili, and Tigrinya). Another might be connecting with efforts to advance development of standard terminologies. Still another might be to bring in human language technology, such as text to speech, so that materials designed and disseminated in text form could be accessible in audio via mobile devices.

Thanks to Charles Riley of Yale University for calling our attention to these two websites.

1 comment:

Unknown said...

Hi Don,

Your post involves a complex series of issues, and I'd like to unpick a number of issues related to translation practice, review cycles, retaining source documents, copyright and licensing, etc. But in this instance will instead discuss some of the technical issues.

PDFs can be problematic, even for English language content. Results will vary depending on what software you are using, how the PDF files were created and what fonts were used. The fonts are important. Fonts contain tables that map glyphs to Unicode codepoints. Information is taken from the font and embedded in the PDF, allowing software using the pdf file to resolve the glyphs in the text layer to their Unicode codepoints.

So in the first instance, the legibility of text extracted from the PDF via selection/cut/paste operations, indexing and searching, or exporting to another file format is dependent on the fonts used. In the case of the Dinka and Amharic examples you cite, I'd look at both the fonts used and the software used to generate (print) the PDF files.

Secondly, it depends on the writing system (script) the language uses. Some complex scripts will require extensive contextual substitutions and reordering of glyphs when text is rendered. It is not possible to move from the glyph sequence to the original codepoint sequence for PDF files written in such scripts. This shouldn't be relevant to the Dinka and Amharic examples you cite, though.

PDF/UA is one possible workaround. If the glyph sequence in the tagged PDF can not be resolved to the correct Unicode codepoint sequences, the original text should be embedded in the ActualText attribute for each tag in the file.

That probably will not help with text selection or cut-and-paste operations. But screen readers, searching and indexing, and export operations should work successfully, assuming the software in use adequately supports and uses the contents of ActualText attributes.

Basically, PDF files can work better than they do, but you need the information ecosystem set up correctly from start to finish.

It is easier to use HTML rather than using PDF files.

An added consideration is that these languages are more likely to be accessed and used on mobile devices than on desktop or laptop computers. Accessing and using PDF files on a mobile phone is a pain. It is much easier to add the documents as HTML to a responsive website.

One other consideration, if the PDF text can not roundtrip, then I would not consider the PDF file to be either accessible or archive quality.
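A round-trip test of this kind can be sketched in a few lines of Python. The sketch assumes you still have the source text and have already pulled the text out of the PDF with whatever tool you use (no particular extraction library is implied). The function name `roundtrips` is made up for illustration, and the comparison ignores differences in whitespace and in Unicode normalization form, which is a design choice rather than a rule.

```python
import unicodedata

def roundtrips(source_text, extracted_text):
    """True if the extracted text matches the source after NFC and whitespace cleanup."""
    def canon(s):
        # Normalize to NFC, then collapse all runs of whitespace to single spaces.
        return " ".join(unicodedata.normalize("NFC", s).split())
    return canon(source_text) == canon(extracted_text)

source = "dai\u0263\u0254\u0308kthai"   # Dinka word from the post
good_extract = source + "\n"            # same text, trailing newline only
bad_extract = "dai\u0254kthai"          # hypothetical extraction that dropped two code points

print(roundtrips(source, good_extract))
print(roundtrips(source, bad_extract))
```

Running such a check over every document before publishing would catch files whose text layer silently differs from what is displayed.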