Thursday, January 24, 2008

Localizing MediaWiki: a translator's perspective

The push is on to localize the core 500 MediaWiki messages for numerous languages. The language committee responsible for the creation of new Wikipedia projects has made the sensible decision that a language cannot have its own Wikipedia unless and until the core user interface is available in that language. As a big side benefit, MediaWiki is used to run thousands of other sites, so a single localization effort can catalyze all sorts of projects in a given language.

The Swahili Wikipedia has about 6500 articles, but only about 100 MediaWiki messages had been translated. The schizophrenic Swanglish interface produces gems like this: "Ficha logged-in users | Fichua my edits." It was beyond time to bring sw.wikipedia up to standard, so I decided to give it a go. In the interests of letting localizers for other languages know what they are getting into, here is a brief report on the process:

1) First, you need to get yourself approved on Betawiki, the site that oversees all the translations. You will need to create an account and join the language project. Thankfully, the people on the site are friendly, helpful, and fast.

2) You will need to download the files (unless you want to do all the work online, which is NOT recommended) and use a special translation assistance tool such as the free POedit or OmegaT. To download, go to the Translate page and select the following options:
  • I want to: Export translation in Gettext format

  • Group: MediaWiki messages (most used)

  • Language: [select your language]

  • Limit: 500 messages per page

Click "Fetch" and you will end up with a long, messy-looking document. Copy and paste that into a text editor like TextPad, save it in standard text format with a .po extension (the filename should be something like mylanguage.po), open it with your translation software, and voila!

Actually, not so fast. It took me a painful half hour or more to figure out that the file had to be saved in "UTF-8" rather than "Unicode," before I could actually open the file with POedit.
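
For anyone who hits the same wall, here is a minimal Python sketch of re-saving the exported text as plain UTF-8. It rests on an assumption not stated above: that the "Unicode" option in a Windows text editor writes UTF-16 with a byte order mark. The function names to_utf8_bytes and resave_as_utf8 are made up for illustration, as is the filename in the usage note.

```python
import codecs
from pathlib import Path


def to_utf8_bytes(raw: bytes) -> bytes:
    """Return the same text encoded as UTF-8 without a byte order mark."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Assumption: this is what an editor's "Unicode" save option produced.
        # The utf-16 codec reads the BOM to pick the byte order, then drops it.
        text = raw.decode("utf-16")
    elif raw.startswith(codecs.BOM_UTF8):
        # UTF-8 with a BOM: decode and drop the mark.
        text = raw.decode("utf-8-sig")
    else:
        # No BOM: treat the file as UTF-8 already (raises if it is not).
        text = raw.decode("utf-8")
    return text.encode("utf-8")


def resave_as_utf8(src: str, dst: str) -> None:
    """Read src in whatever of the above encodings it uses and write dst as UTF-8."""
    Path(dst).write_bytes(to_utf8_bytes(Path(src).read_bytes()))
```

Calling resave_as_utf8("mylanguage.po", "mylanguage-utf8.po") writes a second copy and leaves the original alone, so nothing is lost if the guess about the source encoding turns out to be wrong.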

3) Let the games begin! At first, the translation goes very quickly, as you breeze through terms like "January" and "Comment." However, you soon start hitting challenging terms. The less of a computer presence your language already has, the more of a challenge the terms will be. "View source." "Full resolution." "Metadata." "Disambiguation pages." And long chunks like this:
This page is currently protected because it is included in the following {{PLURAL:$1|page, which has|pages, which have}} cascading protection turned on. You can change this page's protection level, but it will not affect the cascading protection.
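
To see what the {{PLURAL:...}} part of that message is doing, here is a rough Python sketch of how such a marker could be expanded once the software knows the number behind $1. This is not MediaWiki's own code: the function expand_plural is hypothetical, and it applies a simple English-style rule (first form for exactly 1, last form otherwise), whereas the real choice of form depends on the language.

```python
import re

# Matches {{PLURAL:$N|form one|form two|...}} with no nested braces.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}")


def expand_plural(message: str, *args: int) -> str:
    """Replace each PLURAL marker with the form that fits its numbered argument."""

    def pick(match: re.Match) -> str:
        count = args[int(match.group(1)) - 1]  # $1 refers to the first argument
        forms = match.group(2).split("|")
        # Assumed English-style rule: first form for 1, last form for anything else.
        return forms[0] if count == 1 else forms[-1]

    return PLURAL_RE.sub(pick, message)


msg = "included in the following {{PLURAL:$1|page, which has|pages, which have}} cascading protection"
print(expand_plural(msg, 1))
print(expand_plural(msg, 3))
```

The point for translators is that everything between the vertical bars is yours to translate, while the {{PLURAL:$1| and }} scaffolding must survive untouched.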

Frustratingly, more than half of the translation strings do not have any accompanying explanations. You may have to click through Wikipedia special pages looking for an instance of the term, in order to figure out what is being talked about. Even a simple term like "block" that does not have an explanatory note becomes needlessly difficult; I used the word for "a block of text" until a later entry made it clear that the sense called for was "prevent access."

Messages that have code elements such as $1 (meaning that some text or number will be inserted in that position by the software) should especially have explanatory text, since the content of "$1" often makes a big difference in the words you use and in the order in which you place your text and code elements. For example, "$1 logged-in users" could either render as "50 logged-in users," in which case the Swahili would be "Watumiaji $1 sasa," or "Show/hide logged-in users," in which case the Swahili should be "$1 watumiaji sasa."
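
A small Python sketch makes the problem concrete. The helper fill is hypothetical and only mimics the substitution step; it fills the same English message and the two candidate Swahili messages first with a number and then with a verb, using the strings from the example above.

```python
import re


def fill(message: str, *args: object) -> str:
    """Replace $1, $2, ... with the corresponding argument, as plain text."""

    def repl(match: re.Match) -> str:
        return str(args[int(match.group(1)) - 1])  # $1 is the first argument

    return re.sub(r"\$(\d+)", repl, message)


english = "$1 logged-in users"
swahili_if_number = "Watumiaji $1 sasa"   # word order when $1 is a count
swahili_if_verb = "$1 watumiaji sasa"     # word order when $1 is show/hide

print(fill(english, 50))                  # 50 logged-in users
print(fill(swahili_if_number, 50))        # Watumiaji 50 sasa
print(fill(english, "Ficha"))             # Ficha logged-in users
print(fill(swahili_if_verb, "Ficha"))     # Ficha watumiaji sasa
```

The English string is identical in both cases, so without a note saying what $1 will hold, the translator has no way to tell which Swahili word order is the right one.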

The final frustration is that many of the messages - naturally, the ones you put off for last - are extremely long. One has to wonder if messages like this are crucial to establishing the core functionality of MediaWiki in any language:
Using the form below will rename a page, moving all\n
of its history to the new name.\n
The old title will become a redirect page to the new title.\n
Links to the old page title will not be changed; be sure to\n
check for double or broken redirects.\n
You are responsible for making sure that links continue to\n
point where they are supposed to go.\n
Note that the page will '''not''' be moved if there is already\n
a page at the new title, unless it is empty or a redirect and has no\n
past edit history. This means that you can rename a page back to where\n
it was just renamed from if you make a mistake, and you cannot overwrite\n
an existing page.\n
This can be a drastic and unexpected change for a popular page;\n
please be sure you understand the consequences of this before\n

The warning that should be posted is that the project will take a lot longer than you expect, and won't be nearly as straightforward as advertised.

Nonetheless, the draft Swahili translation is complete, after probably 16 hours of work. At the moment it is being reviewed in Kenya, and we will upload it as soon as we finish refining it. Meanwhile, a few observations are in order:

  • It really helps to have a good familiarity with computer terminology in general and Wikipedia in particular before starting to translate MediaWiki. If you don't know about "Watchlists" or "RSS Feeds," you will face challenges beyond your normal translation project.

  • You will benefit greatly from a working technical glossary and good dictionaries for your language. Fortunately, Swahili has had a number of successful localization projects (OpenOffice, Google, Microsoft Windows, and others), so I had a lot of resources to consult and an online Swahili dictionary that contains many IT terms and to which I could add new ones. This will not be the case for most other African or minority languages.

  • This project should not be done alone. Ideally you will have two people who are both knowledgeable about computers and both fluent in English and the project language, one of them a native English speaker and the other a native speaker of the language in question. Alternatively, have someone on standby who can explain any problematic terms. I was lucky to work with Arthur Buliva, a Kenyan computer programmer, as we pitched translations back and forth over instant messenger.

  • A big challenge is that you cannot simply coin terms where none exist in your project language. Your users will need to understand the meanings behind the messages, usually without reference to a dictionary. If your language does not have a word for "subcategories" or "namespace" - well, welcome to the wonderful world of software localization!

I do not want to scare anyone away from trying to localize MediaWiki in their language, but I do want to paint a realistic picture of the task in front of you. It is not fast, and for most languages it is not easy, but the outcome - truly useful software in your language - will be its own reward.


siebrand said...

Thank you for the blog about Betawiki. We are doing what we can and we try to listen to our users as much as possible. Based on this blog, we will be making some adjustments. The export function for .po will soon lead to a download, so that the encoding issues will be a thing of the past.

I recognise your request for as much information as possible about messages. This is a lot of work, however, and we only use volunteers. At the moment, about 50% of the 500 most often used messages have documentation. Some extensions have proper documentation. We are in the process of adding more messages. Developers wanting to work on this would be especially welcome.

You are discouraging in-wiki translation. This fascinates me, as so far we have received exactly one (1) contribution from a .po file. You do not give reasons for this advice. I'd like to have seen some pros and cons, so potential users can form a clearer opinion.

I am very much looking forward to the submission of Swahili for MediaWiki. Thank you very much for your contributions!

Oh, one last thing: we have updated documentation for those who want to work offline:

Cheers! Siebrand

Greg David said...

Useful info, thanks for sharing. I think can be a good localization tool to try using for your use when translating your .po files or others. It is free for open-source projects.