Whether we are new to the profession or seasoned professionals, we are all witnessing the re-emergence and rapid development of machine translation. Machine translation has outgrown its initial scope, which was to give the military access to information, and has become a service that everyone needs or would like to have access to. According to Hutchins (Hutchins:1999), machine translation serves four purposes:
1. Production of a text so good that it could automatically be published. This would serve to disseminate information. The output is always post-edited.
2. Production of a text that serves the simple transfer of the text's core message. This would serve to assimilate information. This kind of output is for people who would rather know some of what the text is stating than nothing at all.
3. Production of a text that helps in a fast information transfer on a 1:1 level.
4. Production of extracted information within a multilingual system for accessing databases, information silos etc. This belongs to a wider access to information and is usually integrated in search engines and data retrieval systems.
We can safely say that these four purposes are accomplished for pairs of the major languages (English, French, Spanish, German, Italian), or at least that the output is fairly stable. This also suggests that the dominant paradigm, statistical machine translation, works. So what can we do for languages that are under-resourced, i.e. languages for which we do not have major works on morphological dictionaries, computational grammars and so on? These include Greek, Czech, Romanian, Bulgarian and other languages considered “minor”.
As part of my postgraduate program, “Technoglossia”, Dionysia Delmadorou, Thanassis Kalogeropoulos, Mary Mouroutsou and I formed a team that evaluated machine translation output for the English – Greek language pair using the BLEU scale (Delmadorou et al.:2011). With this scale, human evaluators compare the output of a machine translation platform to a human translation (Papineni et al.:2002). After the comparison, the evaluator assigns each sentence a score from 1 to 4:
1 – Not acceptable
2 – Potentially acceptable
3 – Acceptable
4 – Ideal
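As a minimal sketch of how per-sentence ratings on this four-point scale could be rolled up into one figure per platform, the snippet below averages the ratings and reports the share of sentences in each category. The ratings are invented for illustration and the function name `summarize` is an assumption; this is not the team's actual data or tooling.

```python
from collections import Counter
from statistics import mean

# The four-point scale described above.
SCALE = {
    1: "Not acceptable",
    2: "Potentially acceptable",
    3: "Acceptable",
    4: "Ideal",
}


def summarize(ratings):
    """Return (mean score, percentage of sentences per category).

    `ratings` holds one integer from 1 to 4 per evaluated sentence.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    counts = Counter(ratings)
    shares = {SCALE[k]: 100 * counts[k] / len(ratings) for k in SCALE}
    return mean(ratings), shares


# Hypothetical ratings for eight sentences from one platform.
example_ratings = [1, 2, 2, 3, 3, 4, 2, 2]
average, shares = summarize(example_ratings)
print(average)
print(shares)
```

Reporting the category shares next to the mean shows whether a middling average comes from uniformly mediocre output or from a mix of ideal and unacceptable sentences.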
We took four articles from different domains and had them translated by the three biggest machine translation platforms: Google Translate, Bing Translator and the commercial Systran platform. The texts we chose were from the domains of Sports, European matters, Journalism (subdomain Finance) and Medicine. Part of what we found was expected: all platforms performed poorly where complex syntax and morphology were concerned. The results also showed us, however, the importance of user interaction with the platform. Google outshone the other two platforms in the general evaluation precisely because, or so we believe, the anonymous user can correct the output directly, and these corrections are collected and maintained in an error corpus that helps the platform “avoid” making the same mistake twice. The corrections sometimes also cover specialized terminology: surprisingly, the sports article had most of its specialized in-domain terminology translated correctly.
Our overall results using the BLEU scale were 2.5 for Google Translate, 1.95 for the commercial Systran platform and 1.93 for Bing Translator. Percentage-wise this can be broken down as follows:
There are many ways to improve BLEU scores for under-resourced languages. The first thing we can do is control the input. Control how the source text is formed and you have won half the battle. During the very informative TAUS discussions on machine translation (TAUS: YouTube), Systran's representative offered a quick method that anyone can follow and that also helps maintain the consistency of the source document and/or terminology: tag the source text.
For example, the following depicts what a tagged text would look like:
In this case the underlined text would represent terminology, the yellow tagged text would denote product names that should remain untranslated, the green tagged text would relate to versioning and/or numbers, and lastly the red text would represent words and/or expressions that are known to be problematic. This also falls under the general movement known as “Plain English” (or International English or Multinational Customized English).
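The four tagging categories can be illustrated with a small sketch that wraps matches in inline tags before the text is sent for translation. The tag names (`term`, `dnt`, `num`, `warn`), the function `tag_source`, the product name and the word lists are all assumptions made for illustration; they are not Systran's markup or any platform's actual input format.

```python
import re


def _alternation(phrases):
    # Longest phrases first, so "translation memory" wins over "translation".
    ordered = sorted(phrases, key=len, reverse=True)
    return "|".join(re.escape(p) for p in ordered)


def tag_source(text, terms=(), product_names=(), problem_phrases=()):
    """Wrap terminology, product names, numbers and risky phrases in tags.

    All categories are matched in a single pass, so a span is tagged once.
    """
    groups = []
    if product_names:  # names that should remain untranslated
        groups.append(f"(?P<dnt>{_alternation(product_names)})")
    if terms:  # known terminology
        groups.append(f"(?P<term>{_alternation(terms)})")
    if problem_phrases:  # expressions known to be problematic
        groups.append(f"(?P<warn>{_alternation(problem_phrases)})")
    groups.append(r"(?P<num>\d+(?:\.\d+)*)")  # numbers and version strings
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(groups) + r")(?!\w)", re.IGNORECASE
    )

    def wrap(match):
        tag = match.lastgroup  # name of the category that matched
        return f"<{tag}>{match.group(0)}</{tag}>"

    return pattern.sub(wrap, text)


# Hypothetical product name and word lists.
print(tag_source(
    "Install AcmeSync 7.4 before you set up the translation memory.",
    terms=["translation memory"],
    product_names=["AcmeSync"],
    problem_phrases=["set up"],
))
```

Matching every category in one pass keeps tags from nesting inside each other, which would otherwise happen if, say, a product name contained a number.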
Having taken care of the source side, we then need to improve how the transfer to the target side happens. Here under-resourced languages have a problem: the lack of corpora large enough (or indeed of any size) to support decent machine learning for such transfer methods. A possible first step is to create monolingual, domain-specific corpora. Because such corpora are not diachronic, however, they are bound to become obsolete, especially those covering technical or technically related domains. Once aged, they can still be used for generic translations, but their terminology can be contested as the market continues to evolve and subject matter experts keep updating glossaries and other terminological resources.
Gathering and cleaning data so that it can be processed and tagged is a time-consuming task, but one that can be automated with the help of crawlers. A crawler can be set up with a list of URLs that it crawls and from which it extracts all information. This information can afterwards be passed through a bootstrapping process with tools such as BootCaT (Baroni & Bernardini:2005). For example, such a process using Google would look like this:
This way we can create ad hoc, domain-specific corpora that we can be confident contain the most up-to-date terminology and expressions, as well as the latest grammatical and syntactic structures. We can then use them to continuously refresh the terminology and the morphological and syntactic information used in the transfer methods of the machine translation platform. Coupled with a good all-purpose morphological dictionary, this should make our lives easier on the target side. If the process is then repeated on sites with similar information, or on sites available in multiple languages, it helps us create what are called comparable corpora: bilingual or multilingual corpora in which no single language is the source and the rest its translations, but which contain original texts in different languages on the same subject matter or domain. These efforts should, to some degree, keep the need for post-editing to a minimum.
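The crawl-and-bootstrap loop described above can be sketched roughly as follows: seed terms are combined into queries, the pages returned for each query are added to the corpus, and frequent new words from the corpus become additional seeds for the next round. This is a simplified illustration under stated assumptions, not the interface of the cited tool. The `search` argument is a hypothetical callable standing in for the search engine plus crawler, and the frequency-based term extraction is a naive placeholder for proper term extraction.

```python
import itertools
import random
import re
from collections import Counter


def make_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seed terms into random tuples; each tuple is one query."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    combos = list(itertools.combinations(sorted(seeds), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


def extract_terms(texts, known, top_n=5, min_len=4):
    """Return the most frequent words that are not yet seed terms."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"\w+", text.lower()):
            if len(word) >= min_len and word not in known:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]


def bootstrap(seeds, search, rounds=2):
    """Grow a corpus and a term list from a handful of seed terms.

    `search(query)` must return a list of page texts (hypothetical hook).
    """
    seeds = set(seeds)
    corpus = []
    for _ in range(rounds):
        for query in make_queries(seeds):
            for text in search(query):
                if text not in corpus:  # skip verbatim duplicates
                    corpus.append(text)
        seeds |= set(extract_terms(corpus, seeds))
    return corpus, seeds


# In-memory stand-in for the web, so the sketch runs without a network.
PAGES = [
    "the striker scored a penalty after the referee consulted the linesman",
    "the goalkeeper saved the penalty and the referee ended the match",
]


def fake_search(query):
    words = query.split()
    return [page for page in PAGES if any(w in page.split() for w in words)]


corpus, terms = bootstrap(["penalty", "referee", "striker"], fake_search, rounds=1)
print(len(corpus), sorted(terms))
```

In a real setup the extracted candidates would be filtered with a stop-word list or reviewed by a subject matter expert before being reused as seeds, since raw frequency also promotes function words.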
Machine translation is still very much alive as a field of research, and the future looks promising and exciting, especially for under-resourced languages, where the field is still taking shape.
Bibliography
Hutchins, John, The development and use of machine translation systems and computer-based translation tools, in International Symposium on Machine Translation and Computer Language Information Processing, 26-28 June 1999, Beijing, China.
Papineni, Kishore, Roukos, Salim, Ward, Todd and Zhu, Wei-Jing, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, 2002.