Wednesday, October 8, 2008

African Language Locales: Call for Volunteers

ANLoc, the African Network for Localization, has started an initiative
to build locales for over 100 African languages. The project is now
ready to line up volunteers and get to work!

The main requirements for a volunteer are:

1) literate in the target language
2) comfortable using computers
3) can volunteer about 1 or 2 hours
4) finishes what they start

If you are willing and able to help - or if you know anyone who might be, or can contact any networks that might include potential volunteers - please look through these lists of languages. Contact us if you can work on any language that does not yet have a volunteer:

* West Africa
* Nigeria
* Central Africa
* Tanzania and Indian Ocean
* Great Lakes and Kenya
* Horn of Africa
* Southern Africa

(Note that we do NOT need volunteers for South Africa, because those
languages already have good locales.)

You can also help by letting your colleagues from other African language
communities know about the project.

We want to build all of these locales in a few months, so please let me
know quickly if you can help out!

For more details about the project, please view this presentation:

Wednesday, August 27, 2008

The GUM3C conference in Bangor

The GUM3C conference in Bangor has come and gone. Those who were there were presented with an exquisite set of presentations. The ambiance was lovely and many of the conversations were thought provoking. The conference was held in association with UNESCO, and consequently subjects like sign languages, minority languages and support for people who do not speak a dominant language featured prominently.

One of the discussions was a follow-on to the presentation by David Crystal, who among other things spoke about his work in the advertising industry. A substantial number of people agreed that advertisements in minority languages give a language an economic underpinning. Consider: the majority of the trade by Welsh companies is in Wales, and people respond more favourably to advertisements that target them.

A presentation by Gwerfyl Roberts was thought provoking. As a practitioner in this field, she told us that people who do not speak and read the dominant language well will get substandard medical treatment. Gwerfyl is working hard to improve the situation for people whose first language is Welsh, but she agreed that people from the Indian subcontinent suffer from the same problem. Making the inserts for medicines available in as many languages as possible would be one part of the solution. Providing terminological support, as is currently provided by Wikiprofessional, is another.

Sign languages, and particularly SignWriting, are dear to my heart. I could not be more pleased to have a presentation about sign language in India. Michael Morgan presented on how a university for the deaf is being set up, and he explained the problems that exist in India. One anecdote was about people texting back and forth without coming to a conclusion; in the end, people travelled two days to come and talk for five minutes, and then had to travel back as well. Chris Cox presented about the efforts that have been put into introducing SignWriting into Britain and Ireland.

If you are interested in all the goodies that you missed, you will be pleased to learn that the proceedings of the GUM3C have already been published (ISBN 978-1-84220-115-2) and that the presentations will be posted on the GUM3C website.

Wednesday, August 13, 2008

Farsi: is it a macrolanguage?

According to ISO 639-3, Farsi is a macrolanguage. From my position this is a clear case: the standard says so, so it is likely to be so. Farsi is divided in two: Western Farsi and Eastern Farsi. Western Farsi is spoken primarily in Iran, and Eastern Farsi in Afghanistan and Pakistan.

The problem I have is that several people I respect have independently informed me that in their opinion this division is wrong. Farsi is said to be understood by all. Raising this question is, for me, about something practical. In this case it is about a request for a

Let me be clear: I am all in favour of such a project, but I do not want to continue an ambiguity about the language. The practical question is to what extent it is justified to consider Farsi and Dari as separate languages. If they are indeed to be considered separate languages, how different are they? Can the difference be compared to that between Afrikaans and Dutch?
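For concreteness, the ISO 639-3 code tables express this split as a macrolanguage mapping. A minimal sketch in Python; the codes fas, pes and prs are the actual ISO 639-3 assignments for Persian, Western Farsi and Dari:

```python
# ISO 639-3 treats Persian/Farsi as a macrolanguage: the generic code
# resolves to more specific individual-language codes.
MACROLANGUAGES = {
    "fas": ["pes", "prs"],  # Persian -> Western Farsi (Iran), Dari (Afghanistan)
}

def individual_codes(tag):
    """Return the individual-language codes behind a (macro)language tag."""
    return MACROLANGUAGES.get(tag, [tag])

print(individual_codes("fas"))  # ['pes', 'prs']
print(individual_codes("nld"))  # ['nld'] - an individual language maps to itself
```

Whether a tagged text should carry the macrolanguage code or one of the individual codes is exactly the ambiguity this post is about.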

Please share your thoughts ...

Wednesday, July 30, 2008

100 African Language Locales

ANLoc, the African Network for Localization, is undertaking a project to create Locales for 100 African languages. The following presentation provides an introduction to the initiative:

Thursday, July 3, 2008

¿Hablas español?

According to Alexa, Spanish is the second language for Wikipedia in terms of the amount of traffic it generates. There must be MANY people who use the Spanish Wikipedia. Yet when you look at Betawiki, only 20 people have indicated their wish to help with the localisation into Spanish.

¿Hablas español?

We are looking for people who speak Spanish and who are willing and able to help us with the localisation of MediaWiki into Spanish. Not only the WMF extensions (41.29%) but also the MediaWiki core messages (91.84%) are in need of attention.

If Spanish is not your "language", you may want to check out how your language is doing.

Semantic Search Engine of African Languages

Here is a thoughtful article about the Kamusi Project from Appfrica:

The article talks in part about the blog widget we've been developing with one of our Code Africa volunteers, which people can insert in blogs and web pages to perform Kamusi lookups on their sites. The widget is not quite finished, but we put together the following brief presentation for a Barcamp event recently in Nairobi:


Thursday, May 8, 2008

Supporting languages that do not have localisation

Yesterday I had the privilege to present at a workshop in Milan for ISO. The workshop discussed how ISO will continue its development in the 21st century. A whole day was filled with a mix of people from inside and outside ISO giving their points of view on how the world is changing, and on the many kinds of new technology becoming available and relevant that have the potential to change current practices at ISO.

Bob Sutor, the IBM vice president for Standards and Open Source, opened and discussed everything from wikis to Second Life. It was a great speech and it opened up the floor really well for the presenters who followed.

The WLDC is about languages, and with Debbie's permission (she had seen my presentation ahead of time) I had included the WLDC as a way to establish that I am truly committed to doing good for languages. What we want to do in the WLDC is document languages and make a difference by doing so. To help us realise this, I approached Mr Sutor and asked him if IBM could be interested in giving languages a presence in the user interface provided by GNOME or KDE.

This is of great practical importance; when you write Neapolitan, for instance, you do not want an Italian spell checker telling you that what you have written is spelled incorrectly. The localisation of software is an expensive and time-consuming business, and it is not realistic to expect that all languages and linguistic entities will be localised. It is however feasible to make GNOME or KDE aware of the language that is used in a document. This is the first step to ensure that the document will be tagged appropriately in its metadata with the language that is used.
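To make the idea concrete, here is a minimal sketch of what a language-aware document could look like. The API and dictionary names are hypothetical; only the ISO 639-3 codes are real:

```python
# Hypothetical sketch: a document carries an ISO 639-3 language tag in its
# metadata, and tools pick a spell checker based on that tag instead of
# assuming the system locale. The checker names are made up for illustration.
SPELL_CHECKERS = {"ita": "italian-dictionary", "eng": "english-dictionary"}

def make_document(text, language):
    """Create a document whose metadata records the language it is written in."""
    return {"text": text, "metadata": {"language": language}}

def spell_checker_for(document):
    lang = document["metadata"]["language"]
    # Neapolitan ("nap") has no spell checker here: better to offer none
    # than to flag every word with the Italian one.
    return SPELL_CHECKERS.get(lang)

doc = make_document("Napule è mille culure", "nap")
print(spell_checker_for(doc))  # None - no checker, but the language tag survives
```

The point of the sketch is that even without a localised spell checker, the language tag in the metadata is preserved and usable by any later tool.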

I am sure that you know more great arguments why a practical application like this will be of much bigger benefit than is immediately apparent. So please pitch in with suggestions, so that we will be able to produce a proposal that Mr Sutor and IBM just cannot refuse :)

Sunday, May 4, 2008

A proud moment

At the Wikimedia Foundation I have been banging the drum for the use of standards. I made some friends and enemies in that way, but the overall effect has been good. Some fights are no longer fought because the result is clear from the start.

At Betawiki, we are developing an extension for MediaWiki called Babel. The tool is to be used on user pages to indicate a person's self-assessed skills in the languages they know. The texts are shown in the languages themselves.

When we do not yet have a translated text, we are still able to use the native name of that language, courtesy of the data available in the CLDR. The standard is not complete, and I asked if it was possible to change the data in our database. I was told no: "The data belongs to a standard, and the data should be improved at source."

I do agree with this sentiment. I have written to someone active in the CLDR to ask if there is an interest in collaboration. I am happy and proud of this turn of events. I hope that we are welcome :)
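The Babel fallback described above can be sketched as follows, assuming a toy subset of the CLDR language-name data (the real data set is far larger, and the label format is only illustrative):

```python
# Illustrative subset of CLDR autonym data: language names rendered in the
# language itself. The authoritative data lives in the CLDR releases.
AUTONYMS = {"de": "Deutsch", "nl": "Nederlands", "sw": "Kiswahili"}

def babel_label(code, level):
    """Render a Babel-style label. If no autonym is known,
    fall back to the language code itself."""
    name = AUTONYMS.get(code, code)
    return f"{code}-{level}: {name}"

print(babel_label("sw", 2))   # sw-2: Kiswahili
print(babel_label("nap", 1))  # nap-1: nap  (no CLDR entry in this toy subset)
```

Improving the fallback case means improving the data at source, in the CLDR, which is exactly the collaboration proposed above.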

Thursday, April 17, 2008

Of ancient and historical languages

According to the records at SIL, the documentation for Ancient Greek (to 1453), ISO 639 code grc, has been tagged as type "Historical". This means that the language is considered dead. Latin (lat), on the other hand, is considered to be "Ancient". Both Latin and Ancient Greek are still taught in schools to kids who get a classic Western education.

According to the definition, Latin is ancient and consequently it must have gone extinct more than a millennium ago. However, the Roman Catholic Church has continued to use Latin as its language, and it maintains a dictionary of modern Latin vocabulary. Surely Latin may be old, but it never went extinct.

Ancient Greek does not qualify as ancient, because 1453 means less than a millennium ago. Ancient Greek is taught in schools. Books like the Harry Potter books are translated into Ancient Greek. As far as I understand it, there has not been a continued usage of Ancient Greek like there has been for Latin.

When you are to tag a text using the ISO 639 codes and their definitions, a modern text in Latin or Ancient Greek cannot be tagged. The first issue is that the definitions clearly limit the time when texts are to be considered to be in a historical or ancient language. The second issue is that in order to write a modern text, neologisms and/or existing words with a modern meaning are needed to express modern concepts.

When the definitions preclude the tagging of the modern expressions of Latin or Ancient Greek, it means that either a new code is needed to indicate the modern expression, or the definitions of these languages are wrong.

I would argue that when a language has not seen continued use, the modern text should be assigned a separate code. It is distinctly different, and by tagging it as such it will be clear to the reader that the understanding of such a text does not reflect the language of the time when it was a living language. I would argue for a separate ISO 639-3 code.
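The argument can be made concrete with a small sketch. The type assignments below follow the SIL records as described above, and the strict reading of the definitions is exactly the one this post questions:

```python
# Language "types" as used in the ISO 639-3 code tables (illustrative
# subset; the authoritative data is the SIL registry).
LANGUAGE_TYPES = {
    "lat": "ancient",     # Latin
    "grc": "historical",  # Ancient Greek (to 1453)
    "nld": "living",      # Dutch
    "epo": "constructed", # Esperanto
}

def can_tag_modern_text(code):
    """Under a strict reading of the definitions, only living and
    constructed languages can tag newly written text."""
    return LANGUAGE_TYPES.get(code) in ("living", "constructed")

print(can_tag_modern_text("lat"))  # False - hence the problem this post raises
```

A modern Latin text, by this logic, has no valid tag at all, which is why a separate code for the modern expression would resolve the ambiguity.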

My question is: what do you think about this?

Monday, April 7, 2008

WLDC Conference 2008

The World Language Documentation Centre, together with Bangor University and Language Standards for Global Business, wishes to announce a major multidisciplinary conference to celebrate 2008 as the International Year of Languages

August 22-23, 2008

To be held at the Bangor University Business Management Conference Centre

This event is supported by the Welsh Assembly Government and the UK National Committee to UNESCO

The United Nations announced that 2008 would be the International Year of Languages, recognizing the importance of multilingualism in supporting international understanding. The GUM3C conference will attempt to bridge the communications gap between academia and industry, asking (and attempting to answer) such questions as:

How can industry help academia prioritize its research in the 3 Ms?

What are the developing standards, who are developing them and will they be used?

How will this generate peace, prosperity and global understanding?

More info, details on the submission of papers or workshops, as well as conference registration can be obtained from

Monday, March 31, 2008

online African dictionaries planning meeting

The Kamusi Project is pleased to announce that
we are about to begin work on PALDO: the Pan-African Living Dictionary
Online. PALDO will build on the Kamusi architecture to create an
interlinked multilingual dictionary for African languages, creating a
powerful communications tool that will be useful throughout the African continent.

The first step for PALDO will be to program the database, multilingual
tools, and enhanced user interface. This work will begin on April 2
with our partners at Kasahorow, at a meeting in
Accra, Ghana. This meeting will be simulcast LIVE ONLINE, and the
transcript will also be posted on a special blog at

You can participate by joining the chat session or commenting on the blog, beginning at 9 a.m. Ghana time
on April 2. The timezone for Accra is GMT.

We are particularly hoping for participation from:
1) computer programmers and database specialists
2) linguists, lexicographers, and people with an interest in languages
3) users of the Kamusi Project, kasahorow, or other online dictionaries
4) people interested in helping shape the next generation of tools for
African languages

If you would like to participate in this meeting online, please visit for more information.

If you are in Accra and would like to attend in person, the meeting will
be held at the Kofi Annan Centre, beginning at 9 a.m. on Wednesday, April 2.

If the meeting is rescheduled for April 3, we will place a notice on

Thursday, March 20, 2008

Evolution in the Open Source world

OmegaT is a great open source CAT tool. It is written in Java, it has a growing group of users, and Sabine Cretella, who is my weather cock for what is happening in this space, has been a long-time champion of the software. As OmegaT makes sense to me for several of the things I am involved in, I have invested in it and I have been looking for funding to expand its functionality.

Yesterday I was astounded by Sabine. "Anaphraseus", she says, "is a CAT tool that does the things that are critical to me. It allows me to translate into Neapolitan properly; it allows me to enter nap, the ISO 639-3 code, and consequently I am able to build my translation memory without having to remember what code I used instead. They do not have proper TMX support yet, but they are working on it. Now that I can finally work properly in my language, who cares that I do not have it yet?"

Anaphraseus used to be called "Open Wordfast" and makes use of the OpenOffice macro tool. It uses the same translation memory format as Wordfast, it supports text segmentation, and it is great for proofreading.

Sabine has been investigating Anaphraseus's functionality and so far she is quite pleased. When I asked her why the change, she said that she had been asking for ISO 639-3 support for almost two years; it was not forthcoming, and Anaphraseus is as good for the job.

Thursday, February 7, 2008

A tale of dictionaries

My sister is a student of the (Western) Farsi language. She has been studying the language and the culture of Iran for several years now, and it has been an education for me as well. I have learned new dishes, I have met wonderful people and listened to beautiful music. When she came home from her first visit to Iran, she brought with her a dictionary by Mr Afshin Afkari. This Farsi-Dutch and Dutch-Farsi dictionary has been an important resource for her study.

At a party a few weeks ago, I met Mr Afkari for the first time. Talking with people who are interested and involved in dictionaries is a rare treat for me, so I had a splendid time. We promised to meet again, and we did. We talked long into the night, and I was happy that I could help him with some (minor, for me) issues with computers and the Internet.

The most relevant thing I was able to do was to help him with his dictionary. The last time he was able to work on it was in the days when he still worked in WordPerfect 5.1. Those days were long gone, and in the more than 10 years that have passed he had not been able to convert his data into a contemporary format. He had asked many people before me, but the problem was that the standard conversion programs do not deal with a text in mixed Latin and Arabic script.

All that I did was ask a friend who deals quite regularly with such issues. He found me a 43-euro program, and this program did the job. The dictionary has now returned from its slumber, and Mr Afkari is again considering what he will do with his dictionary next. He wants to do an update, build a spell checker, and maybe make it available under a Free license.

I told this story to a friend of mine. His question was whether I had a Sorani-English dictionary for him. I had to disappoint him, but that does not mean that I cannot ask around about the existence of such a dictionary. I expect that such a resource exists somewhere in an ivory or other tower. When it does, and when it can become available, many people will be as happy as my sister is with her dictionary.

Friday, February 1, 2008

Update on the MediaWiki localisation

BetaWiki is a success. The numbers prove it: in a year the number of supported languages has increased from 266 to 307, and all the indicators have been steadily improving. When you consider that this is a project run by volunteers, it is pretty amazing. It is for this reason that I am so happy that UNESCO acknowledged BetaWiki for what it is: a community success story. BetaWiki is doing really well not only for the major languages but also for languages like Telugu, Marathi, Northern Sotho and Tajik. Martin's sterling effort for Swahili is not reflected yet in these numbers; his contributions will be live on all the Wikimedia Foundation's wikis in the coming days. :)

I think you will agree with me that with improved localisation, it will be easier to reach out with MediaWiki to the readers of the world. We want more people to read and write in Swahili, Comorian, Maithili, Basque, Piedmontese...

In 2008, the year of languages, we have to celebrate our achievements. I am sure that projects like BetaWiki will help languages improve their presence on the Internet. Wikipedia and wikis are geeky; they attract young people. When young people continue to express themselves in their mother tongue, their language and culture are alive and well.

This week a Wikipedia was requested for Mingrelian and Maithili... As communities continue to form to write their Wikipedia in their language, we will be able to bring more information to more people in their language.

This year is still young :)

Thursday, January 24, 2008

Localizing MediaWiki: a translator's perspective

The push is on to localize the core 500 MediaWiki messages for numerous languages. The language committee responsible for the creation of new Wikipedia projects has made the sensible decision that a language cannot have its own Wikipedia unless and until the core user interface is available in that language. As a big side benefit, MediaWiki is used to run thousands of other sites, so a single localization effort can catalyze all sorts of projects in a given language.

The Swahili Wikipedia has about 6500 articles, but only about 100 MediaWiki messages had been translated. The schizophrenic Swanglish interface produces gems like this: "Ficha logged-in users | Fichua my edits." It was beyond time to bring sw.wikipedia up to standard, so I decided to give it a go. In the interests of letting localizers for other languages know what they are getting into, here is a brief report on the process:

1) First, you need to get yourself approved on Betawiki, the site that oversees all the translations. You will need to create an account and join the language project. Thankfully, the people on the site are friendly, helpful, and fast.

2) You will need to download the files (unless you want to do all the work online, which is NOT recommended) and use a special translation assistance tool such as the free POedit or OmegaT. To download, go to the Translate page and select the following options:
  • I want to: Export translation in Gettext format

  • Group: MediaWiki messages (most used)

  • Language: [select your language]

  • Limit: 500 messages per page

Click "Fetch" and you will end up with a long, messy-looking document. Copy and paste that into a text editor like TextPad, save it in standard text format with a .po extension (the filename should be something like mylanguage.po), open it with your translation software, and voila!

Actually, not so fast. It took me a painful half hour or more to figure out that the file had to be saved in "UTF-8" rather than "Unicode," before I could actually open the file with POedit.
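The save step can also be scripted. A minimal sketch, with a toy two-line export standing in for the real download; in many Windows editors "Unicode" means UTF-16, which PO tools reject, while the PO format expects the encoding declared in its header, usually UTF-8:

```python
# Save a pasted Gettext export as UTF-8 before opening it in a PO tool
# such as POedit.
import os
import tempfile

export_text = 'msgid "january"\nmsgstr "Januari"\n'  # tiny illustrative export

po_path = os.path.join(tempfile.gettempdir(), "mylanguage.po")
with open(po_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(export_text)

# Reading it back as UTF-8 confirms the file is well-formed for PO tools.
with open(po_path, encoding="utf-8") as f:
    content = f.read()
print(content.startswith("msgid"))  # True
```

Writing the file this way sidesteps the editor encoding trap entirely.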

3) Let the games begin! At first, the translation goes very quickly, as you breeze through terms like "January" and "Comment." However, you soon start hitting challenging terms. The less of a computer presence your language already has, the more of a challenge the terms will be. "View source." "Full resolution." "Metadata." "Disambiguation pages." And long chunks like this:
This page is currently protected because it is included in the following {{PLURAL:$1|page, which has|pages, which have}} cascading protection turned on. You can change this page's protection level, but it will not affect the cascading protection.

Frustratingly, more than half of the translation strings do not have any accompanying explanations. You may have to click through Wikipedia special pages looking for an instance of the term, in order to figure out what is being talked about. Even a simple term like "block" that does not have an explanatory note becomes needlessly difficult; I used the word for "a block of text" until a later entry made it clear that the sense called for was "prevent access."

Messages that have code elements such as $1 (meaning that some text or number will be inserted in that position by the software) should especially have explanatory text, since the content of "$1" often makes a big difference in the words you use and the order you place your text and code elements. For example, "$1 logged-in users" could either render as "50 logged-in users," in which case the Swahili would be "Watumiaji $1 sasa," or "Show/hide logged-in users," in which case the Swahili should be "$1 watumiaji sasa."
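A simplified sketch of how such a message is expanded may help; real MediaWiki plural handling is per-language and considerably richer than the two-form logic below:

```python
import re

def render(message, value):
    """Expand a MediaWiki-style message: {{PLURAL:$1|singular|plural}}
    picks a form by the number, then $1 is substituted.
    Simplified sketch: only two forms, English-style plural rule."""
    def pick_plural(match):
        forms = match.group(1).split("|")
        return forms[0] if value == 1 else forms[-1]
    message = re.sub(r"\{\{PLURAL:\$1\|([^}]*)\}\}", pick_plural, message)
    return message.replace("$1", str(value))

print(render("$1 {{PLURAL:$1|page|pages}}", 1))  # 1 page
print(render("$1 {{PLURAL:$1|page|pages}}", 3))  # 3 pages
```

Because the translator controls where $1 sits in the translated string, word order problems like the Swahili example above can be solved in the message itself.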

The final frustration is that many of the messages - naturally, the ones you put off for last - are extremely long. One has to wonder if messages like this are crucial to establishing the core functionality of MediaWiki in any language:
Using the form below will rename a page, moving all\n
of its history to the new name.\n
The old title will become a redirect page to the new title.\n
Links to the old page title will not be changed; be sure to\n
check for double or broken redirects.\n
You are responsible for making sure that links continue to\n
point where they are supposed to go.\n
Note that the page will '''not''' be moved if there is already\n
a page at the new title, unless it is empty or a redirect and has no\n
past edit history. This means that you can rename a page back to where\n
it was just renamed from if you make a mistake, and you cannot overwrite\n
an existing page.\n
This can be a drastic and unexpected change for a popular page;\n
please be sure you understand the consequences of this before\n

The warning that should be posted is that the project will take a lot longer than you expect, and won't be nearly as straightforward as advertised.

Nonetheless, the draft Swahili translation is complete, after probably 16 hours of work. At the moment it is being reviewed in Kenya, and we will upload it as soon as we finish refining it. Meanwhile, a few observations are in order:

  • It really helps to have a good familiarity with computer terminology in general and Wikipedia in particular before starting to translate MediaWiki. If you don't know about "Watchlists" or "RSS Feeds," you will face challenges beyond your normal translation project.

  • You will benefit greatly from a working technical glossary and good dictionaries for your language. Fortunately, Swahili has had a number of successful localization projects (OpenOffice, Google, Microsoft Windows, and others), so I had a lot of resources to consult and an online Swahili dictionary that contains many IT terms and to which I could add new ones. This will not be the case for most other African or minority languages.

  • This project should not be done alone. Ideally you will have two people who are both knowledgeable about computers, both fluent in English and the project language, but one of whom is a native English speaker and one a native speaker of the language in question. Alternately, have someone on standby who can explain any problematic terms. I was lucky to work with Arthur Buliva, a Kenyan computer programmer, as we pitched translations back and forth over instant messenger.

  • A big challenge is that you cannot simply coin terms where none exist in your project language. Your users will need to understand the meanings behind the messages, usually without reference to a dictionary. If your language does not have a word for "subcategories" or "namespace" - well, welcome to the wonderful world of software localization!

I do not want to scare anyone away from trying to localize MediaWiki in their language, but I do want to paint a realistic picture of the task in front of you. It is not fast, and for most languages it is not easy, but the outcome - truly useful software in your language - will be its own reward.

Thursday, January 17, 2008

The practical application of types of languages

The types of individual languages, as posted on the SIL website, name five types of languages. In my opinion only two of these types allow for new terminology while still being the same language: the living and the constructed languages. The definition of constructed languages specifically excludes reconstructed languages.

The problem is: what do you do with reconstructed languages? How do you qualify, from both a linguistic and a language-standards point of view, a project that aims to write an encyclopaedia in Ancient Greek or one in Ottoman Turkish?

From my point of view, you reconstruct such a language when you want to discuss modern concepts. The concepts behind existing words have to be morphed into something new in order for the words to fit. New words have to be borrowed from other languages or invented in order to express things like computer, satellite or television.

When people write new texts in historic, extinct or ancient languages, these texts should not qualify as being in those languages. Then again, if you take Latin, a language qualified as ancient, you have a language that has been in continuous use in the Roman Catholic Church, a language that is the national language of the Vatican, with a dictionary of modern words to help where classical knowledge does not suffice.

The question is how you deal with modern texts in historic, extinct or ancient languages. How should they be tagged? It is clear that these modern variations are quite distinct from the original language. Consider "Church Latin": according to Wikipedia it is the same language as the classical language, but given its continuous usage it should not be labeled as ancient.

In the Wikimedia Foundation, the answer to these questions determines the decision to approve or deny a new language for most of the project types. The one exception is Wikisource, a library of source texts with translations and other supporting material.

In conclusion:
  • Should Latin be marked as an ancient language?
  • How do you label efforts to revive a language?

Saturday, January 12, 2008

MediaWiki language development in 2008

MediaWiki is the software that runs Wikipedia. Within the Wikimedia Foundation it is used in many languages for many projects. Some of these projects, like the Bengali Wikipedia, are the biggest single source for a language on the Internet.

The BetaWiki project has become the hub of MediaWiki localisation; not only is a web interface provided, there is also support for "gettext" or ".po" files. This allows for the use of CAT tools like OmegaT. Managing the localisation of several hundred languages is not easy, and making sure that MediaWiki is properly localised is a struggle. We found that the quality of the language support is spotty. Policies in the WMF have changed, requiring localisation as a precondition for a new project. Big improvements in the MediaWiki localisation have happened in the last few months.

In order to improve MediaWiki's usability, we have received some funding from Hivos. This will make it possible to put a premium on the localisation of MediaWiki for languages outside Europe and North America.

Many of the languages that MediaWiki supports are different from one another, and we do get requests for changing the sort order. Typically we follow the CLDR, but there are languages where the CLDR does not help us. It would be good if we knew how to work with the CLDR and share our work.

For American Sign Language, we have been told that an extension will be programmed for SignWriting, allowing for a Wikipedia in sign languages. We hope this will greatly stimulate the emancipation of all sign languages. It will be a first to have encyclopedic content in sign languages.

NB: today new Wikipedias are starting in Saterland Frisian, Crimean Tatar and Lower Sorbian. :) For many more languages, preparations are being made in the Incubator. :)


Thursday, January 10, 2008

Facebook Status Day for African Languages

Although the Facebook social networking site clings stubbornly to its English-only interface, it has become wildly popular with college students and twentysomethings worldwide. Among Facebook users are many thousands of African students, whether studying in Africa or at universities abroad.

An African Languages user group on Facebook is working to bring together those who speak and/or have an interest in African languages. Several group projects are anticipated, using the power of the network to enhance the recognition of African languages online and off.

The first group activity is a series of modest awareness-raising events, scheduled for the 15th of every month this year. Members will be posting their "Status" in the African language of their choice, for all their contacts to see and appreciate Africa's many tongues. The first event will occur this Tuesday - you can join the Status Day event by clicking here.

Show your Facebook status in your favorite African Language! 2008 is the International Year of Languages, and the African Languages group on Facebook is celebrating by posting our personal status messages in African languages on the 15th of every month.

Your "status" is the Facebook feature where you can post a simple message about yourself, like "Abdul is happy" or "Betty wishes you a Happy New Year." On Status Day, we'll tell the world how we are in the languages of Africa.

For example, in Swahili: "Mbiha hajambo" (Mbiha is fine), or in Akan: "Akua ho ye" (Akua is fine)

So, on January 15, please join us by saying it in an African Language!

Wednesday, January 2, 2008

Happy New Year

Welcoming 2008 is welcoming the UNESCO International Year of Languages. For people like us who care about languages, it is an opportunity to make ourselves heard by a larger audience. This opportunity will have to be realised by our own efforts, as the UNESCO supporting material cannot yet be found on their website.

If 2008 is to allow us to promote our efforts for languages, it will help if we collaborate. There is too much for a single person or a single organisation to do. We will make a difference when we prod UNESCO to be successful. We will make a difference when we promote other programs like SignWriting. We will make a difference when we make sure that our own projects do well and integrate with others. But most of all, we will make a difference when our message is communicated as widely as possible: languages are important and deserve attention.

I wish that at the end of this year we will be satisfied with what we, as a group, have achieved.