Google Engineers Move Translation from a Linguistic to a Mathematical Problem
Translation = MC2? A team of engineers at Google has found a new way to conduct translation; you simply have to find the linear transformation between the languages! Does this sound like mathematical abracadabra to you? Let us explain!
According to an article on MIT Technology Review, computer science is playing an increasingly important role in translation. Translation tools built on it compare corpuses of text in two different languages to find ‘rules’ with which translations can be carried out.
However, the article says, everyone who has used a machine translation tool such as Google Translate knows that these machine translations leave a lot to be desired in terms of accuracy. In addition, the initial translations for the corpuses have to be carried out by humans, which is very time-consuming.
It seems, though, that change is around the corner!
Google engineer Tomas Mikolov and a number of his co-workers have taken a dive into the fascinating world of machine translation. They have come up with a technique with which languages can be translated into one another by using automatically generated dictionaries and phrase tables made with data mining techniques. With this new method, the structure of one language is first mapped out and then compared to another. This way, having two similar corpuses for language comparison is no longer necessary.
The data mining technique can refine dictionaries, the engineers say, because it makes no assumptions about the languages involved; instead, it carefully constructs each language's grammar and vocabulary by using mathematics. MIT Technology Review calls the new method “relatively straightforward”: every language must be able to describe similar things, and those things are used in similar ways, so the words that describe them must be similar too. Numbers are given as an example: an image published in the article represents the numbers one to five in English and Spanish as vectors and shows how similarly the two sets are arranged.
The similarities between the numbers are representative of entire languages, MIT Technology Review says. Mikolov and his team aim to represent a language by the relationships that exist between its words. Together, these relationships are called a “language space,” which can be viewed as a set of vectors pointing from one word to another. Recently, researchers learned that these vectors can be manipulated mathematically.
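To make the idea of a shared structure concrete, here is a minimal sketch in Python. The two-dimensional vectors are invented for illustration and do not come from the article or from any trained model; real language spaces have far more dimensions. The sketch compares how the words relate to one another inside each language, using cosine similarity as one possible measure of a "relationship."

```python
import math

# Invented 2-D vectors, purely for illustration (not real model output).
english = {"one": (1.0, 0.0), "two": (0.0, 1.0), "three": (1.0, 1.0)}
spanish = {"uno": (0.0, 2.0), "dos": (-2.0, 0.0), "tres": (-2.0, 2.0)}


def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def similarity_table(space, words):
    """Pairwise cosine similarities between the given words of one language."""
    return [[cosine(space[a], space[b]) for b in words] for a in words]


en_table = similarity_table(english, ["one", "two", "three"])
es_table = similarity_table(spanish, ["uno", "dos", "tres"])

# The individual vectors differ, but the relationships between words match.
same_structure = all(
    abs(e - s) < 1e-9
    for en_row, es_row in zip(en_table, es_table)
    for e, s in zip(en_row, es_row)
)
print(same_structure)
```

In this toy setup the Spanish vectors are simply the English ones rotated and stretched, so the two tables agree exactly. With real data the match would only be approximate.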
According to the article, the vector spaces of different languages often have a lot in common. Consequently, converting one vector space into another amounts to translation. Seen this way, translation is no longer a linguistic problem but a mathematical one, MIT Technology Review says. The challenge for the Google engineers is to map one vector space onto the other in the right way. Here, a translation corpus is used after all, to produce a ready-made linear transformation.
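As a rough illustration of what fitting such a linear transformation could look like, the Python sketch below learns a 2x2 matrix from three known word pairs and then uses it to translate two words it has not seen. The article does not say how the fit is actually performed; ordinary least squares is used here as a stand-in, and the vectors, the seed pairs, and the function names (fit_linear_map, apply_map, translate) are all invented for this example.

```python
import math

# Invented 2-D vectors, purely for illustration (not real model output).
english = {"one": (1.0, 0.0), "two": (0.0, 1.0), "three": (1.0, 1.0),
           "four": (2.0, 1.0), "five": (1.0, 2.0)}
spanish = {"uno": (0.0, 2.0), "dos": (-2.0, 0.0), "tres": (-2.0, 2.0),
           "cuatro": (-2.0, 4.0), "cinco": (-4.0, 2.0)}

# Hypothetical seed dictionary: the small set of known translations.
seed_pairs = [("one", "uno"), ("two", "dos"), ("three", "tres")]


def fit_linear_map(pairs):
    """Least-squares fit of a 2x2 matrix W such that W @ x is close to z."""
    xs = [english[en] for en, _ in pairs]
    zs = [spanish[es] for _, es in pairs]
    # A = sum of x x^T and B = sum of z x^T, so that W = B @ inverse(A).
    A = [[sum(x[r] * x[c] for x in xs) for c in range(2)] for r in range(2)]
    B = [[sum(z[r] * x[c] for x, z in zip(xs, zs)) for c in range(2)]
         for r in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    A_inv = [[A[1][1] / det, -A[0][1] / det],
             [-A[1][0] / det, A[0][0] / det]]
    return [[sum(B[r][k] * A_inv[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def apply_map(W, v):
    """Multiply the matrix W by the vector v."""
    return tuple(sum(w * x for w, x in zip(row, v)) for row in W)


def translate(W, word):
    """Map an English vector into the Spanish space and pick the nearest word."""
    mapped = apply_map(W, english[word])
    return min(spanish, key=lambda es: math.dist(mapped, spanish[es]))


W = fit_linear_map(seed_pairs)
print(translate(W, "four"), translate(W, "five"))
```

Only "one", "two" and "three" are used to fit the matrix; "four" and "five" are translated purely by mapping their vectors and looking up the closest Spanish vector. Nearest-neighbour lookup by Euclidean distance is a simplification chosen for this toy example.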
Once the mapping is in place, it can be applied to bigger language spaces. According to Mikolov, this new technique is surprisingly accurate; for translations between English and Spanish, 90% of the translations are correct. As stated before, this method can be used to improve dictionaries, which is being tested by the Google team as well. They are currently looking at an English-Czech dictionary and have already found a number of mistakes!
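The article does not describe how the team spots dictionary mistakes. One plausible reading, sketched below in Python, is to map each source word into the other language space and flag entries whose listed translation is not the closest word there. The matrix W, the vectors, and the deliberately faulty dictionary are all invented for this example, and English and Spanish stand in for the English-Czech pair.

```python
import math

# Assumed, already-fitted 2x2 map from the English to the Spanish toy space.
W = ((0.0, -2.0), (2.0, 0.0))

# Invented 2-D vectors, purely for illustration (not real model output).
english = {"one": (1.0, 0.0), "two": (0.0, 1.0),
           "four": (2.0, 1.0), "five": (1.0, 2.0)}
spanish = {"uno": (0.0, 2.0), "dos": (-2.0, 0.0),
           "cuatro": (-2.0, 4.0), "cinco": (-4.0, 2.0)}

# Hypothetical dictionary with one planted error: "four" should be "cuatro".
dictionary = {"one": "uno", "two": "dos", "four": "cinco", "five": "cinco"}


def apply_map(W, v):
    """Multiply the matrix W by the vector v."""
    return tuple(sum(w * x for w, x in zip(row, v)) for row in W)


def nearest(vec, space):
    """Return the word in `space` whose vector is closest to `vec`."""
    return min(space, key=lambda word: math.dist(vec, space[word]))


# Collect (source word, listed translation, suggested translation) triples
# for every entry where the listed translation is not the nearest neighbour.
suspicious = []
for en_word, listed in dictionary.items():
    suggested = nearest(apply_map(W, english[en_word]), spanish)
    if suggested != listed:
        suspicious.append((en_word, listed, suggested))

print(suspicious)
```

A flagged entry is only a candidate for review by a human, not proof of a mistake, since the mapping itself is an approximation.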
As the new method is based on mathematics and not on assumptions, the engineers say, it can also be used on languages that are completely unrelated. MIT Technology Review believes that multilingual communication will probably greatly improve with this new development. However, Mikolov and his team believe their discovery is only the tip of the iceberg and that a lot more research is needed.