Tuesday, November 19, 2013

Where there is no spellchecker

For a text in a language like Fula that has no spell checker support, here's a workaround that might be more effective than resorting to ever more careful copyediting: break the text down into words and then sort the list into alphabetical order. Better yet, build a word frequency table.

The idea is to line words up so that they can be checked visually in a way that reading the running text does not allow. It's especially effective for recurrent words and related words, which fall together in the generated list, while misspelled variants show up as separate entries with their own counts.

I recently tried this with a text in Fulfulde of Niger and found numerous instances of what appear to be single-for-double letter errors (doubled consonants and vowels are significant for pronunciation, often with a difference in meaning from the form where the letter is single), and substitution of plain b, d, or y for ɓ, ɗ, or ƴ (and vice versa), along with other minor but not unimportant errors. The next step would be to go back to the original text, search for the erroneous forms, and replace them with the correct spelling (either manually or with the search-and-replace function). Not so elegant perhaps, but it should be effective.
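One way to make those two kinds of error stand out is to group spellings that become identical once doubled letters are collapsed and the hooked letters are replaced by their plain counterparts. The following is only a rough sketch under that assumption, not something I used for the text above; the names skeleton and confusable_groups are made up for illustration, and the sample words and counts are placeholders rather than data from the Niger text. Since single and double letters can distinguish real words, each group is a list of candidates to check by eye, not a list of errors.

```python
import re
from collections import defaultdict

# Hooked letters mapped to the plain letters they are often confused with.
PLAIN = str.maketrans({"ɓ": "b", "ɗ": "d", "ƴ": "y"})


def skeleton(word):
    """Lowercase the word, replace hooked letters, collapse repeated letters."""
    plain = word.lower().translate(PLAIN)
    return re.sub(r"(.)\1+", r"\1", plain)


def confusable_groups(freqs):
    """Return {skeleton: {spelling: count}} for skeletons with 2+ spellings."""
    groups = defaultdict(dict)
    for word, count in freqs.items():
        groups[skeleton(word)][word] = count
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# Placeholder word counts, for illustration only.
sample_counts = {"debbo": 5, "debo": 1, "ɓe": 3, "be": 1, "laawol": 2}
for key, forms in confusable_groups(sample_counts).items():
    print(key, forms)
```

In each group, the spelling with the much lower count is the first one to look up in the original text.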

There are a couple of ways of generating the word list, with the simplest being to substitute hard returns for spaces in the word processor, then clean out punctuation and quotation marks, and then sort. This can be converted into a frequency list by means of a pivot table in Excel (and perhaps other spreadsheet programs).
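For anyone comfortable with a little scripting, the same split, clean, sort, and count steps can be done in a few lines of Python. This is a minimal sketch and not the method I used; the function names are invented for the example and the sample string is a placeholder. It assumes Python 3, where the pattern below treats ɓ, ɗ, and ƴ as letters.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # [^\W\d_] matches any Unicode letter; an apostrophe is kept only
    # when it sits between letters, so surrounding quotes are dropped.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(words)


def print_alphabetical(freqs):
    """Print one word per line with its count, in alphabetical order."""
    # sorted() orders by code point, so ɓ, ɗ, ƴ come after z.
    for word in sorted(freqs):
        print(f"{freqs[word]:5d}  {word}")


# Placeholder sample; in practice, read the text from a file.
sample = 'Debbo debbo, "debo"!'
print_alphabetical(word_frequencies(sample))
```

Note that the hooked letters sort after z here rather than next to b, d, and y, so a misspelling with a plain letter will not necessarily land beside the correct form in the list.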

Another way is to use a text analysis utility. I used one online at Online-Utility.org.

Ultimately if one is doing a lot of work with text in a particular language, one imagines that lists generated in this way might be useful in building a corpus which could in turn be used for development of a spell-checker.

I came to this approach while incorporating word frequencies as a step in qualitative data analysis (QDA) of text ("where there is no computer-assisted QDA software"). Word and phrase frequencies are of course used in a different way in QDA, but it occurred to me that breaking down text in this way might also be an aid in comparing the forms of the words themselves (spelling, mainly).

(For those not familiar with the famous basic health and first-aid publication for poor regions of the global South - Where There Is No Doctor - the title of this post is inspired in a very small way by it.)
