Showing posts with label ISO 639-3. Show all posts

Monday, January 26, 2009

Unintended consequences

The fiu-vro Wikipedia is a Wikipedia in the Võro language. People applied for an ISO-639-3 code recently, and this request was granted; the Võro language is now known under the vro code. This has changed the status of this project considerably. Where it used to be a project that existed because "things happened in those days", the language now complies with all the requirements for a new project. We have started the process of renaming the message file for this project, and we have requested the rename of the project.

There is one glitch. The Estonian Wikipedia is known as et.wikipedia.org. The ISO-639-1 et code is connected to the ISO-639-3 est code, and this just became a macro language. Standard Estonian has been given its own code of ekk.
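The relation between these codes can be sketched in a few lines of Python. This is only an illustration: the table and function names are made up for this example, and it assumes that the est macro language groups both ekk and vro as its individual languages.

```python
# Hypothetical lookup tables, filled only with the codes discussed in this post.

# ISO-639-1 two-letter code -> ISO-639-3 three-letter code
ISO_639_1_TO_3 = {"et": "est"}

# ISO-639-3 macro language -> its individual languages
# (assumption: est groups Standard Estonian and Võro)
MACROLANGUAGE_MEMBERS = {"est": ("ekk", "vro")}

NAMES = {"est": "Estonian", "ekk": "Standard Estonian", "vro": "Võro"}


def candidates(tag):
    """Return the individual ISO-639-3 codes that a tag may refer to."""
    code = ISO_639_1_TO_3.get(tag, tag)
    return MACROLANGUAGE_MEMBERS.get(code, (code,))


def is_ambiguous(tag):
    """A tag is ambiguous when it can stand for more than one language."""
    return len(candidates(tag)) > 1


for tag in ("et", "ekk", "vro"):
    names = ", ".join(NAMES[code] for code in candidates(tag))
    print(tag, "->", names, "(ambiguous)" if is_ambiguous(tag) else "")
```

In this sketch et resolves to two languages while ekk and vro each resolve to one, which is exactly why a subdomain based on et no longer says precisely which language the project is in.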

It is quite clear that technically it would be preferable to rename the Estonian Wikipedia. It can be done; this will be demonstrated with the rename of the Võro Wikipedia. From a community perspective it is not so clear cut. People are conservative, they do not like change, and there are a lot of references out there to the Estonian Wikipedia.

For the Võro community, it is a badge of pride to have their own ISO-639-3 code. For the Estonian community it is a nuisance.
Thanks,
     Gerard

Wednesday, August 13, 2008

Farsi, is it a macro language?

According to ISO-639-3, Farsi is a macro language. From my position it is a clear case: as the standard says so, it is likely to be so. Farsi is divided in two, Western Farsi and Eastern Farsi. Western Farsi is spoken primarily in Iran and Eastern Farsi in Afghanistan and Pakistan.

The problem I have is that several people I respect independently inform me that in their opinion this division is wrong. Farsi is said to be understood by all. For me, raising this question is about something practical. In this case it is about a request for a fa.wikinews.org.

Let me be clear, I am all in favour of such a project but I do not want to continue an ambiguity about the language. The practical question is, to what extent is it justified to consider Farsi and Dari as separate languages? When they are indeed to be considered separate languages, how different are they? Is the difference comparable to the one between Afrikaans and Dutch?

Please share your thoughts ...
Thanks,
      Gerard

Thursday, April 17, 2008

Of ancient and historical languages

According to the records at SIL the documentation for Ancient Greek (to 1453), ISO-639 code grc, has been tagged as type "Historical". This means that the language is dead. Latin lat on the other hand is considered to be "Ancient". Both Latin and Ancient Greek are still taught in schools to kids who get a classic western education.

According to the definition Latin is ancient and consequently it must have gone extinct more than a millennium ago. However, the Roman Catholic Church has continued to use Latin as its language. It maintains a dictionary of modern Latin vocabulary. Surely Latin may be old but it never went extinct.

Ancient Greek does not qualify as ancient because 1453 is less than a millennium ago. Ancient Greek is taught in school. Books like the Harry Potter books are translated into Ancient Greek. As far as I understand it, there has not been a continued usage of Ancient Greek similar to the one that existed for Latin.

When you are to tag a text using the ISO-639 codes and its definitions, a modern text in Latin or Ancient Greek cannot be tagged. The first issue is that the definitions clearly limit the time when texts are to be considered in a historical or ancient language. The second issue is that in order to write a modern text neologisms are needed and/or existing words with a modern meaning are needed to express modern concepts.
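The first issue can be made concrete with a small Python sketch. It is an illustration only: the class and function names are made up, and the only date used is the 1453 cut-off that the name of the grc code itself carries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoricLanguage:
    code: str
    name: str
    last_year: Optional[int]  # last year covered by the definition, None if unknown


# The cut-off year is taken from the code's own name, "Ancient Greek (to 1453)".
GRC = HistoricLanguage("grc", "Ancient Greek (to 1453)", 1453)


def can_tag(language, text_year):
    """Return True when a text written in text_year fits the language's period."""
    if language.last_year is None:
        raise ValueError("no cut-off year known for " + language.code)
    return text_year <= language.last_year


print(can_tag(GRC, 1453))  # a text from the final year of the period
print(can_tag(GRC, 2008))  # a modern translation falls outside the definition
```

Under this reading a translation made today gets no valid code at all, whatever the quality of its Greek.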

When the definitions preclude the tagging of the modern expressions of Latin or Ancient Greek, it means that either a new code is needed to indicate the modern expression or the definitions of these languages are wrong.

I would argue that when a language has not seen continued use, the modern text is assigned a separate code. It is distinctly different and by tagging it as such, it may be clear to the reader of a text that the understanding of such a text does not reflect the language and the time when it was a living language. I would argue for a separate ISO-639-3 code.

My question is: what do you think about this?
Thanks,
GerardM

Thursday, March 20, 2008

Evolution in the Open Source world

OmegaT is a great open source CAT tool. It is written in Java, it has a growing group of users and Sabine Cretella, who is my weather cock for what is happening in this space, has been a long-time champion of the software. As OmegaT makes sense to me for several of the things I am involved in, I have invested in it and I have been looking for funding to expand its functionality.

Yesterday I was astounded by Sabine. "Anaphraseus", she says, "is a CAT tool that does the things that are critical to me. It allows me to translate into Neapolitan properly; it allows me to enter nap, the ISO-639-3 code, and consequently I am able to build my translation memory without having to remember what code I used instead. They do not have a proper TMX yet, but they are working on it. Now given that I can finally work properly in my language, who cares that I do not have it yet?"

Anaphraseus used to be called "Open Wordfast" and makes use of the Open Office macro tool. It uses the same translation memory format as Wordfast, it supports text segmentation and it is great for proofreading.

Sabine has been investigating Anaphraseus's functionality and so far she is quite pleased. When I asked her why the change, she said that she had been asking for ISO 639-3 support in OmegaT for almost two years, it was not forthcoming, and Anaphraseus is as good for the job.
Thanks,
Gerard