Thursday, January 24, 2008

Localizing MediaWiki: a translator's perspective

The push is on to localize the core 500 MediaWiki messages for numerous languages. The language committee responsible for the creation of new Wikipedia projects has made the sensible decision that a language cannot have its own Wikipedia unless and until the core user interface is available in that language. As a big side benefit, MediaWiki is used to run thousands of other sites, so a single localization effort can catalyze all sorts of projects in a given language.

The Swahili Wikipedia has about 6500 articles, but only about 100 MediaWiki messages had been translated. The schizophrenic Swanglish interface produces gems like this: "Ficha logged-in users | Fichua my edits." It was beyond time to bring sw.wikipedia up to standard, so I decided to give it a go. In the interests of letting localizers for other languages know what they are getting into, here is a brief report on the process:

1) First, you need to get yourself approved on Betawiki, the site that oversees all the translations. You will need to create an account and join the language project. Thankfully, the people on the site are friendly, helpful, and fast.

2) You will need to download the files (unless you want to do all the work online, which is NOT recommended) and use a special translation assistance tool such as the free Poedit or OmegaT. To download, go to the Translate page and select the following options:
  • I want to: Export translation in Gettext format

  • Group: MediaWiki messages (most used)

  • Language: [select your language]

  • Limit: 500 messages per page

Click "Fetch" and you will end up with a long, messy-looking document. Copy and paste that into a text editor like TextPad, save it in standard text format with a .po extension (the filename should be something like mylanguage.po), open it with your translation software, and voila!

Actually, not so fast. It took me a painful half hour or more to figure out that the file had to be saved as "UTF-8" rather than "Unicode" before I could open it with Poedit.
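
If you have already saved the file the wrong way, a few lines of Python can re-encode it. This is only a sketch, and it rests on an assumption: that your editor's "Unicode" option means UTF-16 with a byte order mark, which is common in Windows editors but not guaranteed. The function names and the output filename are my own inventions for illustration.

```python
import codecs
from pathlib import Path


def to_utf8(raw: bytes) -> bytes:
    """Return the same text re-encoded as UTF-8 without a byte order mark."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Assumption: a UTF-16 BOM means the editor saved as "Unicode".
        text = raw.decode("utf-16")  # the BOM tells Python the byte order
    else:
        # Otherwise treat the bytes as UTF-8, dropping a UTF-8 BOM if present.
        text = raw.decode("utf-8-sig")
    return text.encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read src in either encoding and write dst as plain UTF-8."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes()))
```

For example, `convert_file("mylanguage.po", "mylanguage-utf8.po")` would write a copy that a gettext editor should be able to open.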

3) Let the games begin! At first, the translation goes very quickly, as you breeze through terms like "January" and "Comment." However, you soon start hitting challenging terms. The less of a computer presence your language already has, the more of a challenge the terms will be. "View source." "Full resolution." "Metadata." "Disambiguation pages." And long chunks like this:
This page is currently protected because it is included in the following {{PLURAL:$1|page, which has|pages, which have}} cascading protection turned on. You can change this page's protection level, but it will not affect the cascading protection.
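
To see what the software does with a chunk like that, here is a rough sketch of how the {{PLURAL:$1|...}} part might be expanded. It is not MediaWiki's actual code. It assumes an English-style rule (first form for exactly 1, last form otherwise), whereas real plural rules differ from language to language, and the function name `expand` is made up for this example.

```python
import re

# Matches {{PLURAL:$1|singular form|plural form}} and captures the
# parameter number and the list of forms.
PLURAL = re.compile(r"\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}")


def expand(message: str, *params: int) -> str:
    """Pick a plural form for each PLURAL block, then fill in $1, $2, ..."""

    def choose(match: re.Match) -> str:
        count = params[int(match.group(1)) - 1]
        forms = match.group(2).split("|")
        # Simplified English-style rule; other languages need other rules.
        return forms[0] if count == 1 else forms[-1]

    text = PLURAL.sub(choose, message)
    # Replace any remaining $N placeholders with the parameter values.
    return re.sub(r"\$(\d+)", lambda m: str(params[int(m.group(1)) - 1]), text)


msg = "the following {{PLURAL:$1|page, which has|pages, which have}} cascading protection"
print(expand(msg, 1))  # the following page, which has cascading protection
print(expand(msg, 3))  # the following pages, which have cascading protection
```

The point for the translator is that both alternatives inside the braces must be translated, and each must read correctly with the words around it.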

Frustratingly, more than half of the translation strings do not have any accompanying explanations. You may have to click through Wikipedia special pages looking for an instance of the term, in order to figure out what is being talked about. Even a simple term like "block" that does not have an explanatory note becomes needlessly difficult; I used the word for "a block of text" until a later entry made it clear that the sense called for was "prevent access."

Messages that have code elements such as $1 (meaning that some text or number will be inserted in that position by the software) especially need explanatory text, since the content of "$1" often makes a big difference to the words you use and to the order in which you place your text and code elements. For example, "$1 logged-in users" could render either as "50 logged-in users," in which case the Swahili would be "Watumiaji $1 sasa," or as "Show/hide logged-in users," in which case the Swahili should be "$1 watumiaji sasa."
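
A small sketch makes the difference concrete. The helper `fill` is hypothetical and only imitates the substitution step; the Swahili strings are the two candidates above, and "Ficha" is the word for "hide" that already appears in the current interface.

```python
import re


def fill(template: str, *args: object) -> str:
    """Replace $1, $2, ... in template with the corresponding argument."""
    return re.sub(r"\$(\d+)", lambda m: str(args[int(m.group(1)) - 1]), template)


# If $1 is a number, it has to follow the noun in Swahili.
print(fill("Watumiaji $1 sasa", 50))       # Watumiaji 50 sasa

# If $1 is a verb such as "hide", it has to come first.
print(fill("$1 watumiaji sasa", "Ficha"))  # Ficha watumiaji sasa
```

The software fills the slot in the same mechanical way either way, so it is the translator who has to know in advance which kind of value will arrive.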

The final frustration is that many of the messages - naturally, the ones you put off for last - are extremely long. One has to wonder if messages like this are crucial to establishing the core functionality of MediaWiki in any language:
Using the form below will rename a page, moving all\n
of its history to the new name.\n
The old title will become a redirect page to the new title.\n
Links to the old page title will not be changed; be sure to\n
check for double or broken redirects.\n
You are responsible for making sure that links continue to\n
point where they are supposed to go.\n
\n
Note that the page will '''not''' be moved if there is already\n
a page at the new title, unless it is empty or a redirect and has no\n
past edit history. This means that you can rename a page back to where\n
it was just renamed from if you make a mistake, and you cannot overwrite\n
an existing page.\n
\n
WARNING!\n
This can be a drastic and unexpected change for a popular page;\n
please be sure you understand the consequences of this before\n
proceeding.

The warning that should be posted is that the project will take a lot longer than you expect, and won't be nearly as straightforward as advertised.

Nonetheless, the draft Swahili translation is complete, after probably 16 hours of work. At the moment it is being reviewed in Kenya, and we will upload it as soon as we finish refining it. Meanwhile, a few observations are in order:

  • It really helps to have a good familiarity with computer terminology in general and Wikipedia in particular before starting to translate MediaWiki. If you don't know about "Watchlists" or "RSS Feeds," you will face challenges beyond your normal translation project.

  • You will benefit greatly from a working technical glossary and good dictionaries for your language. Fortunately, Swahili has had a number of successful localization projects (OpenOffice, Google, Microsoft Windows, and others), so I had a lot of resources to consult and an online Swahili dictionary that contains many IT terms and to which I could add new ones. This will not be the case for most other African or minority languages.

  • This project should not be done alone. Ideally you will have two people who are both knowledgeable about computers and both fluent in English and the project language, one of them a native English speaker and the other a native speaker of the language in question. Alternatively, have someone on standby who can explain any problematic terms. I was lucky to work with Arthur Buliva, a Kenyan computer programmer, as we pitched translations back and forth over instant messenger.

  • A big challenge is that you cannot simply coin terms where none exist in your project language. Your users will need to understand the meanings behind the messages, usually without reference to a dictionary. If your language does not have a word for "subcategories" or "namespace" - well, welcome to the wonderful world of software localization!

I do not want to scare anyone away from trying to localize MediaWiki in their language, but I do want to paint a realistic picture of the task in front of you. It is not fast, and for most languages it is not easy, but the outcome - truly useful software in your language - will be its own reward.

Thursday, January 17, 2008

The practical application of types of languages

The SIL website lists five types of individual languages. In my opinion, only two of these types allow for new terminology while remaining the same language: the living and the constructed languages. The definition of constructed languages specifically excludes reconstructed languages.

The problem is: what do you do with reconstructed languages? How do you qualify, from both a linguistic and a language standards point of view, a project that aims to write an encyclopaedia in Ancient Greek or in Ottoman Turkish?

From my point of view, you reconstruct such a language when you want to discuss modern concepts. The meanings of existing words have to be morphed into something new in order for the words to fit. New words have to be borrowed from other languages or invented in order to express things like computer, satellite or television.

When people write new texts in historic, extinct or ancient languages, those texts should not qualify as being in that language. Then again, take Latin, a language classified as ancient: it has been used continuously in the Roman Catholic church, it is the national language of the Vatican, and it has a dictionary of modern words to help where classical vocabulary does not suffice.

The question is how to deal with modern texts in historic, extinct or ancient languages. How should they be tagged? It is clear that these modern variations are quite distinct from the original language. "Church Latin", according to Wikipedia, is the same language as the classical language. But given its continuous usage, it should not be labeled as ancient.

In the Wikimedia Foundation, the answer to these questions determines the decision to approve or deny a new language for most of the project types. The one exception is Wikisource, a library of source texts with translations and other supporting material.

In conclusion:
  • Should Latin be marked as an ancient language?
  • How do you label efforts to revive a language?
Thanks,
Gerard

Saturday, January 12, 2008

MediaWiki language development in 2008

MediaWiki is the software that runs Wikipedia. Within the Wikimedia Foundation it is used in many languages for many projects. Some of these projects, like the Bengali Wikipedia, are the biggest single source for a language on the Internet.

The BetaWiki project has become the hub of MediaWiki localisation. Not only is a Web interface provided, there is also support for "gettext" or ".po" files, which allows for the use of CAT tools like OmegaT. Managing the localisation of several hundred languages is not easy, and making sure that MediaWiki is properly localised is a struggle. We found that the quality of the language support is spotty. Policies in the WMF have changed, and localisation is now a precondition for a new project. Big improvements in the MediaWiki localisation have happened in the last few months.

In order to improve the usability of MediaWiki we have received some funding from Hivos. This will make it possible to put a premium on the localisation of MediaWiki for languages not in Europe or North America.

Many of the languages that MediaWiki supports sort differently from one another, and we do get requests for changing the sort order. Typically we follow the CLDR, but there are languages where the CLDR does not help us. It would be good if we knew how to work with the CLDR and share our work.
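
To illustrate what a sort order request involves, here is a minimal sketch. It is not how MediaWiki or the CLDR implement collation; it only shows the idea of ranking letters by their place in a language's alphabet instead of by their character codes. Swedish serves as the example because its alphabet ends in å, ä, ö, and the names `SWEDISH`, `RANK` and `sort_key` are invented for this sketch.

```python
# The Swedish alphabet: å, ä and ö come after z, in that order.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: position for position, letter in enumerate(SWEDISH)}


def sort_key(word: str) -> list[int]:
    """Rank each letter by its alphabet position.

    Characters outside the alphabet sort after all known letters,
    ordered among themselves by character code.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["ära", "zebra", "åka", "apa"]

# Plain sorting compares character codes, which puts ä before å.
print(sorted(words))                # ['apa', 'zebra', 'ära', 'åka']

# Sorting with the alphabet-based key gives the expected order.
print(sorted(words, key=sort_key))  # ['apa', 'zebra', 'åka', 'ära']
```

Real collation also has to deal with accents, case and multi-letter units, which is exactly the kind of data the CLDR is meant to provide.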

For American Sign Language, we have been told that an extension for SignWriting will be programmed, allowing for a Wikipedia in sign languages. We hope this will do a lot to stimulate the emancipation of all sign languages. It will be a first to have encyclopedic content in sign languages.

NB today new Wikipedias are starting in Saterland Frisian, Crimean Tatar and Lower Sorbian. :) For many more languages preparations are being made in the Incubator. :)

Thanks,
Gerard

Thursday, January 10, 2008

Facebook Status Day for African Languages

Although the Facebook social networking site clings stubbornly to its English-only interface, it has become wildly popular with college students and twentysomethings worldwide. Among Facebook users are many thousands of African students, whether studying in Africa or at universities abroad.

An African Languages user group on Facebook is working to bring together those who speak and/or have an interest in African languages. Several group projects are anticipated, using the power of the network to enhance the recognition of African languages online and off.

The first group activity is a series of modest awareness-raising events, scheduled for the 15th of every month this year. Members will be posting their "Status" in the African language of their choice, for all their contacts to see and appreciate Africa's many tongues. The first event will occur this Tuesday - you can join the Status Day event by clicking here.


Show your Facebook status in your favorite African Language! 2008 is the International Year of Languages, and the African Languages group on Facebook is celebrating by posting our personal status messages in African languages on the 15th of every month.

Your "status" is the Facebook feature where you can post a simple message about yourself, like "Abdul is happy" or "Betty wishes you a Happy New Year." On Status Day, we'll tell the world how we are in the languages of Africa.

For example, in Swahili: "Mbiha hajambo" (Mbiha is fine), or in Akan: "Akua ho ye" (Akua is fine)

So, on January 15, please join us by saying it in an African Language!

Wednesday, January 2, 2008

Happy New Year

Welcoming 2008 is welcoming the UNESCO International Year of Languages. For people like us who care about languages, it is an opportunity to make ourselves heard by a larger audience. This opportunity will have to be realised by our own efforts, as the UNESCO supporting material cannot be found on their website yet.

As 2008 gives us the chance to promote our efforts for languages, it will help if we collaborate. There is too much for a single person or a single organisation to do. We will make a difference when we prod UNESCO to be successful. We will make a difference when we promote other programs like SignWriting. We will make a difference when we make sure that our own projects do well and integrate with others. But most of all, we will make a difference when our message is communicated as widely as possible: languages are important and deserve attention.

I wish that at the end of this year we will be satisfied with what we, as a group, have achieved.
Thanks,
Gerard