Some time ago I wrote about the study that Chinese translator Ethan Shen was conducting to compare three different free MT engines (for my earlier articles about this study, see Google, Bing and Babelfish and Google, Bing and Babelfish: some preliminary results).
Ethan has now completed phase 1 of his study, and the results are both interesting and - for me, at least - unexpected. Below you can read a short report on Ethan's study.
From Ethan’s website you can download the full report, if you prefer to have all the details.
Real World Comparison of Online Machine Translators
by Ethan Shen
Gabble On Research Project
research@gabble-on.com
Abstract
This paper evaluates the relative quality of three popular online translation tools: Google Translate, Bing (Microsoft) Translator, and Yahoo Babelfish. The results published below are based on a six-week survey open to the general internet population, which allowed survey takers to choose any language, enter any free-form text, and vote on the best of all translation results side by side (www.gabble-on.com/research). The final data reveal that while Google Translate is widely preferred when translating long passages, Microsoft Bing Translator and Yahoo Babelfish often produce better translations for phrases below 140 characters. Also, in general, Babelfish performs well in East Asian languages such as Chinese and Korean, and Bing Translator performs well in Spanish, German, and Italian.
Results
Most Preferred Engine and Margin of Preference by Language Pair and Text Length
The above table describes the relationship between user preferences and translated text character length for 15 single-direction language pairings. The most preferred engine is given at each intersection (Google, Babelfish, or Bing) along with the magnitude of its lead over its closest competitor in that category (colored percentage). The language pairings excluded from this table are those for which either the preference was overwhelming (a lead of over 100%) or insufficient data was available.
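To make the table's two quantities concrete, here is a minimal Python sketch of how a "most preferred engine" and its lead over the closest competitor could be computed from raw votes. It is an illustration only, not Ethan's actual method. It assumes the lead is measured relative to the runner-up's vote count (which would explain how a lead can exceed 100%), and the record layout, function names, and length cut-offs are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical length buckets (upper bounds in characters).
# The report's actual cut-offs may differ.
LENGTH_BUCKETS = (50, 150, 2000)


def length_bucket(n_chars):
    """Return the smallest bucket that fits the text, or None if it is too long."""
    for upper in LENGTH_BUCKETS:
        if n_chars <= upper:
            return upper
    return None


def preferred_engine(votes):
    """Return (winner, margin) for a non-empty list of engine names, one per vote.

    Assumption: margin is the winner's lead relative to the runner-up,
    e.g. 6 votes against 4 gives 0.5 (a 50% lead).
    """
    ranked = Counter(votes).most_common()
    winner, top_votes = ranked[0]
    if len(ranked) < 2:
        return winner, None  # no competitor received any votes
    runner_up_votes = ranked[1][1]
    return winner, (top_votes - runner_up_votes) / runner_up_votes


def summarize(records):
    """Group (language_pair, text_length, chosen_engine) records into table cells."""
    cells = defaultdict(list)
    for pair, n_chars, engine in records:
        bucket = length_bucket(n_chars)
        if bucket is not None:
            cells[(pair, bucket)].append(engine)
    return {cell: preferred_engine(votes) for cell, votes in cells.items()}
```

Each key of the returned dictionary corresponds to one intersection of the table (a language pairing and a length bucket), and each value holds the engine shown there together with its lead.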
From this data, the following conclusions can be drawn:
- For long passages of text up to 2000 characters, survey takers generally prefer Google Translate's results across the board.
a. The extent of Google’s lead varies dramatically from language to language. In some languages such as French, the strength of Google Translate’s engine is overwhelming. However, in several others like German, Italian, and Portuguese, Google holds only a very slim lead when compared to its biggest competitors.
b. These observations validate our Hypothesis 1 that no single engine can perform equally well across a spectrum of languages or conditions.
- The greatest relative strength of the statistical-translation-focused engine (Google Translate) has not clustered around the European Union working languages as expected. German, Italian, and Portuguese, all EU working languages, are the most hotly contested from a performance perspective.
a. One possible explanation is that large additional bodies of parallel English-French text are available from the government of Canada, whose official documents are translated into both languages. To a lesser extent, this could explain the strength of Google Translate in Spanish, as many Latin American countries offer English translations of official documents.
b. This data partially refutes Hypothesis 2.
- Traditional rules-based translation engines (Babelfish) performed generally well in East Asian languages such as Chinese and Korean.
a. One possible reason for this outperformance is that language-specific grammar and word-usage rules are more effective than association-based transliteration in these situations.
b. These findings are in line with Hypothesis 3, but the data set is not large enough to confirm them in a statistically significant manner.
- Across almost every language, Bing Translator and Yahoo Babelfish gain ground on or surpass Google Translate as the text length gets shorter.
a. In Chinese, the gradual erosion of Google's relative performance as total text length shrinks from 2000 characters to 50 characters is stark, and it is representative of the comparative strength of rules-based or hybrid translation engines as phrases get shorter and more straightforward.
b. It appears that competition between the different translation models is fiercest at 150 characters or less. Similar effects were seen at 200 characters, but to a less significant extent.
c. Though the data are not shown, a similar effect is seen for single-sentence passages compared with multi-sentence passages.
d. This data strongly validates Hypothesis 4.
- The most interesting observation is that translation quality is not a two-way street. The engine that is best for translating in one direction is not necessarily the best tool for translating back the other way.
a. The two most obvious cases of this are French and German. Though Google Translate dominates when translating both of these languages into English, it faces heavy competition when translating from English back into the foreign language.
These results are taken from a longer research write-up.
To read the hypotheses, experiment design, extended results, practical applications and references, see the full report here: http://www.gabble-on.com/files/phase1_full_research_report.pdf.