Machine translation for under resourced languages
Written by Administrator   
Tuesday, 11 October 2011 11:01

Machine Translation and under resourced languages

 

Whether we are new to the profession or seasoned professionals, we are all witnessing the re-emergence and rapid development of machine translation. Machine translation has outgrown its initial scope, which was to provide the army with access to information, and has become a service that everyone needs or would like to have access to. According to Hutchins (Hutchins:1999), machine translation serves four purposes:

 1. Production of a text good enough to be published. This serves to disseminate information. The output is always post-edited.
 2. Production of a text that simply conveys the core message of the original. This serves to assimilate information. This kind of output is for people who would rather know some of what the text says than nothing at all.
 3. Production of a text that supports fast information exchange on a one-to-one level.
 4. Production of extracted information within a multilingual system for accessing databases, information silos etc. This belongs to the wider field of access to information and is usually integrated in search engines and data retrieval systems.

We can safely say that these four purposes are accomplished for pairs of the major languages (English, French, Spanish, German, Italian), or at least that the output is fairly stable. This also suggests that the dominant paradigm, statistical machine translation, works for them. So what can we do for languages that are under resourced, i.e. languages for which we have no major work on morphological dictionaries, computational grammars and the like? These include Greek, Czech, Romanian, Bulgarian and other languages considered “minor”.

 

As part of my postgraduate program, “Technoglossia”, Dionysia Delmadorou, Thanassis Kalogeropoulos, Mary Mouroutsou and I formed a team that evaluated machine translation output for the English – Greek language pair using the BLEU scale (Delmadorou et al.:2011). As we applied it, a human evaluator compares the output of a machine translation platform with a human translation (Papineni et al.:2002). Strictly speaking, the BLEU metric of Papineni et al. is computed automatically from n-gram overlap with reference translations, whereas the scale described here is a human rating. After the comparison, the evaluator assigns each sentence a score from 1 to 4:

1 – Not acceptable

2 – Potentially acceptable

3 – Acceptable

4 – Ideal

 

We took four articles from different domains and had them translated by the three biggest machine translation platforms: Google Translate, Bing Translator and the commercial Systran platform. The texts came from the domains of Sports, European matters, Journalism (subdomain Finance) and Medicine. What we found was in one respect what we expected: all platforms performed poorly where complex syntax and morphology were concerned. The results also showed us, however, the importance of user interaction with the platform. Google outshone the other two platforms in the overall evaluation precisely because, we believe, anonymous users can correct the output directly, and these corrections are then collected and maintained in an error corpus that helps the platform “avoid” making the same mistake twice. The corrections sometimes also cover specialized terminology: surprisingly, the sports article had most of its specialized in-domain terminology translated correctly.

Our overall results using the BLEU scale were 2.5 for Google Translate, 1.95 for the commercial Systran platform and 1.93 for Bing Translator. Percentage-wise this can be broken down as follows:

 [Figure: percentage breakdown of the evaluation results per platform]
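As a rough illustration of how per-sentence ratings on the four-point scale turn into an overall score and a percentage breakdown, here is a minimal Python sketch. The ratings in it are invented for the example and are not the data from the study; the names `LABELS` and `summarize` are likewise only illustrative.

```python
from collections import Counter

# The four rating categories described above.
LABELS = {
    1: "Not acceptable",
    2: "Potentially acceptable",
    3: "Acceptable",
    4: "Ideal",
}

def summarize(ratings):
    """Return the mean rating and the percentage of sentences per category."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(r not in LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    counts = Counter(ratings)
    mean = sum(ratings) / len(ratings)
    shares = {LABELS[k]: 100.0 * counts[k] / len(ratings) for k in LABELS}
    return mean, shares

# Hypothetical ratings for ten sentences (NOT the study's data).
example = [1, 2, 2, 3, 3, 3, 2, 4, 3, 2]
mean, shares = summarize(example)
print(f"mean rating: {mean:.2f}")
for label, pct in shares.items():
    print(f"{label}: {pct:.0f}%")
```

A platform's overall score is then simply the mean over all rated sentences, while the percentage view shows how many sentences fell into each category.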

 

There are many ways to improve BLEU scores for under resourced languages. The first thing we can do is control the input. Control how the source text is formed and you will win half the battle. During the very informative TAUS discussions on machine translation (TAUS: YouTube), Systran’s representative offered a quick method that anyone can follow and that also helps maintain the consistency of the source document and its terminology: tag the source text.

 

For example, a tagged text would look like this:

 

 

 [Figure: sample source text with tagged terminology, product names, version numbers and problematic expressions]

 

 

In this case the underlined text represents terminology, the yellow tagged text denotes product names that should remain untranslated, the green tagged text relates to versioning and/or numbers, and lastly the red text represents words and/or expressions that are known to be problematic. This approach is also part of the general movement known as “Plain English” (or International English or Multinational Customized English).
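One way such tags could be put to work is to mask the spans that must stay untranslated before the text goes to the machine translation platform and to restore them afterwards. The Python sketch below assumes a made-up inline tag, `<dnt>...</dnt>`, and a made-up placeholder format; neither is Systran's actual tag syntax, and whether a given platform passes placeholders through unchanged has to be checked in practice. The product name in the example is also hypothetical.

```python
import re

# Hypothetical "do not translate" tag; real platforms define their own syntax.
DNT = re.compile(r"<dnt>(.*?)</dnt>")

def mask(text):
    """Replace every <dnt>...</dnt> span with a numbered placeholder."""
    protected = []

    def repl(match):
        protected.append(match.group(1))
        return f"__DNT{len(protected) - 1}__"

    return DNT.sub(repl, text), protected

def unmask(text, protected):
    """Put the protected spans back in place of their placeholders."""
    return re.sub(r"__DNT(\d+)__", lambda m: protected[int(m.group(1))], text)

source = "Install <dnt>Acme Writer</dnt> version <dnt>2.1</dnt> before you start."
masked, protected = mask(source)
print(masked)

# The masked string is what would be sent to the MT platform. That step is
# skipped here, so the placeholders are restored directly.
restored = unmask(masked, protected)
print(restored)
```

The same idea extends to the other tag types: terminology could be looked up in a glossary instead of being restored verbatim, and expressions flagged as problematic could be routed to a human reviewer.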

Having taken care of the source side, we then need to improve how the transfer to the target side happens. Here under resourced languages have a problem, namely the lack of corpora large enough (or indeed of any size) to support decent machine learning for such transfer methods. A possible first step is to create monolingual domain specific corpora. Because such corpora are not diachronic, however, they are bound to become obsolete, especially those covering technical or technically related domains. They can still be used for generic translations once they age, but the terminology in them can be contested as the market continues to evolve and subject matter experts keep updating glossaries and other terminological resources.

Gathering and cleaning data so that it can be processed and tagged is a time consuming task, but one that can be automated with the help of crawlers. A crawler can be set up with a list of URLs that it crawls and from which it extracts all the information. This information can afterwards be passed through a bootstrapping process with tools such as BootCaT (Baroni & Bernardini:2005). For example, such a process using Google would look like this:

[Figure: the bootstrapping process using Google]
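The first step of such a bootstrapping loop, as we understand the BootCaT approach, is to combine a handful of seed terms into random tuples and send each tuple to a search engine as a query. The Python sketch below covers only that query-building step; the seed terms, the function name `build_queries` and its parameters are illustrative assumptions, and no search engine is actually contacted.

```python
import itertools
import random

def build_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seed terms into random tuples, one search query per tuple."""
    if rng is None:
        rng = random.Random()
    # All unordered combinations of distinct seeds of the requested size.
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    # Multi-word terms are quoted so the search engine treats them as phrases.
    return [
        " ".join(f'"{term}"' if " " in term else term for term in combo)
        for combo in chosen
    ]

# Hypothetical seed terms for a sports corpus.
seeds = ["goalkeeper", "penalty kick", "offside", "midfielder", "league"]
queries = build_queries(seeds, rng=random.Random(0))
for query in queries:
    print(query)

# Remaining steps of the loop, not implemented here:
#   1. send each query to a search engine and collect the result URLs
#   2. crawl the URLs and extract the running text
#   3. extract new candidate terms from the text and use them as new seeds
```

Repeating the loop with the newly extracted terms as seeds is what makes the corpus grow from a small starting list.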

This way we can create ad hoc, domain specific corpora that we can be confident contain the most up to date terminology and expressions as well as the latest grammatical and syntactic structures. We can then use them to continuously renew the terminology and the morphological and syntactic information used in the transfer methods of the machine translation platform. Coupled with a good all-purpose morphological dictionary, this should make our lives easier on the target side. If the process is then repeated on sites with similar information, or on sites that are available in multiple languages, it helps us create what are called comparable corpora, i.e. bilingual or multilingual corpora that do not consist of one source text and its translations but of original texts in different languages on the same subject matter or domain. These efforts would help keep the need for post editing to a minimum.
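To give an idea of how a freshly crawled corpus could feed terminology back into the process, the Python sketch below ranks words by how much more frequent they are in the domain corpus than in a general reference corpus. This frequency-ratio heuristic is a common, simple technique that we use here for illustration; it is not the method of any particular platform or tool. The function names, the crude add-one smoothing and the two tiny sample texts are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def term_candidates(domain_text, reference_text, min_count=2):
    """Rank words by relative frequency in the domain vs. a reference corpus."""
    dom = Counter(tokenize(domain_text))
    ref = Counter(tokenize(reference_text))
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, count in dom.items():
        if count < min_count:
            continue  # ignore words seen too rarely to be reliable
        dom_rel = count / dom_total
        # Crude add-one smoothing so unseen reference words do not divide by zero.
        ref_rel = (ref[word] + 1) / (ref_total + 1)
        scores[word] = dom_rel / ref_rel
    return sorted(scores, key=scores.get, reverse=True)

# Tiny invented texts standing in for a crawled and a general corpus.
domain = ("The striker scored after the striker beat the keeper. "
          "The keeper saved the penalty.")
reference = ("The minister said the budget was approved. "
             "The market rose after the vote.")

candidates = term_candidates(domain, reference)
print(candidates)
```

Words that are common everywhere, such as articles, end up at the bottom of the list, while words typical of the domain rise to the top and can be reviewed by a terminologist.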

Machine translation is still a very lively field of research, and the future looks promising and exciting, especially for under resourced languages, where the field is still taking shape.

 

Bibliography

 

  1. Hutchins, John, The development and use of machine translation systems and computer-based translation tools, in International Symposium on Machine Translation and Computer Language Information Processing, 26-28 June 1999, Beijing, China.
  2. Papineni, Kishore, Roukos, Salim, Ward, Todd and Zhu, Wei-Jing, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, 2002.
  3. Lavie, Alon, Essentials of machine translation evaluation – TAUS Perspectives, 21 October 2010
  4. van der Meer, Jaap, The future for translators looks bright, but they will have to reinvent the profession first – TAUS Perspectives
  5. van der Meer, Jaap, Where are Facebook, Google, IBM and Microsoft taking us? – TAUS Perspectives, 02 August 2010
  6. Vashee, Kirti, Blog “eMpTy Pages”, The Need for Automated Translation Quality Measurement in SMT: BLEU, 9 March 2010.
  7. Khalilov, Maxim, What machines still can't translate – TAUS Perspectives, 13 September 2011
  8. TAUS, Playlists – Machine Translation Technologies playlist on YouTube
  9. Baroni, Marco and Bernardini, Silvia, BootCaT: Bootstrapping Corpora and Terms from the Web

 

 

 

Copyright © 2012 Leximania. All Rights Reserved.