Machine translation requires human-like AI

Some of you alive in the 90s might remember an episode of Star Trek: TNG that is held up as an example of the philosophy of language showing up in popular culture.  In this episode, the voyagers of the Starship Enterprise arrive on a planet where the inhabitants speak a highly allegorical language, using phrases about mythic or historical figures a la “Shaka, when the walls fell” to convey messages such as “oops” or “I see your point”.  Because the universal translators render these phrases literally, the Enterprise’s crew members are forced to decipher what the dense metaphors mean contextually rather than hearing them in the normal English idiom that the universal translators usually supply.  Universal translators, as you can probably guess, are supposed to work with any language on the first encounter with that language or even with the species using it, and as far as I know this is the only episode where this particular difficulty arises.


The problem is, if a universal translator can’t work with the very (infeasibly, as the article above points out) allegorical language spoken in that episode, it shouldn’t work with any language.  Even very closely related human languages use vastly different grammar and vocabulary to express greetings, thanks, obligation, and anything else under the Sol System’s sun.  To know that the verb phrase “thank you” is a show of gratitude in English (not a command, as verb phrases in isolation generally are), while an adverbial like “doumo” serves that purpose in Japanese, a universal translator would need to be a mind-reader before it was a translator, as there is no way to ferret out the fact that “doumo” and “thank you” serve the same purpose from first principles or even from the grammar of that language (which universal translators don’t always have access to; they work on every language even on the first try).  Moreover, it would need to do this mind-reading on species whose physiology it has never encountered before, meaning it would need to determine where the locus of that species’ cognition is, make intelligent predictions about how the patterns of (presumably) chemical synapse firings correlate to intentions, and map those intentions onto speech acts as they occur in real time.  The prerequisite technology for a universal translator is much larger than mere substitution and reordering of words, and approaches impossible, even by sci-fi standards.

In our world, people often discuss non-sci-fi machine translation like Google Translate as if it too were a problem of scaling up existing technology, as if adding more of the same gears and cogs we already have would result in perfect language-to-language recoding.  In essence, people think the incremental improvement of current machine translation technology can save us from the years-long process of mastering new languages ourselves.  This post, with its oddly long prologue, is meant to argue that perfect machine translation would require a project of enormously grander scale than the visible inputs and outputs of textual language, and like the universal translators in Star Trek, would have a project of imposing complexity as a prerequisite, one whose implications would go far beyond mere translation.  In the case of machine translation that prerequisite is a complete human-like artificial intelligence.

At present, machine translation (MT) is full of errors, some of which are fixable with more immediate solutions. For some errors, the problem really is a matter of just applying modern technology, but is still a larger project than many realize.  The bare minimum for realistic grammatical translation is a complete grammatical representation of both languages plus statistical representations of actual usage. At present, only a grammatical representation (meaning a hard-coded pattern of what parts of speech can go where, how they can be nested and what at minimum makes a sentence) appears to be part of most MT software, and is often hilariously wrong. Statistical representations of usage would bring facts of collocation and common use into MT output, and are in principle possible with current technology.  But for many problems of MT the issue is really ultimately the fact that a non-human is trying to do a human task, and for machines to surpass people at it they will first need to have human minds.

Machine translation, as currently thought of.

For this post I am assuming that what people want from machine translation is utterances that would be passable complete sentences in the target language, usable in similar situations with roughly equivalent tone and markedness (i.e., not “the list of the things I’ve heard now contains everything”, as Fry says in the anime-inspired episode of Futurama). The resulting translations need not resemble their source sentences in grammar or vocabulary – the conventional Japanese expression for “I’m going home” has no word that corresponds exactly to “I”, “go”, or “home”.  For MT to be useful, it should provide something that people might actually say.

Let’s start by looking at the errors that could be fixed with just a broader application of current technology.

Applying corpora to MT to produce more likely interpretations

At present, machine translation does best with languages with similar conventions of grammar – nouns, verbs, prepositions, articles, gender, etc., enabling simple replacement of one language’s word for “dog” with the other’s.  It also does better with sentences whose components’ meanings are unambiguous and whose relationships are derived from standard grammar, reflecting the computer-friendly aspects of language: compositionality and regularity.  The tree structure of phrases in human language is compatible with principles taught in first-year computer science courses.  I myself played with this in one of many unused JavaScript apps.

From Wikipedia.
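To make the point about tree structure concrete, here is a minimal sketch in Python of a phrase tree of the kind taught in introductory courses.  The `Node` class, the labels, and the example sentence are my own illustrative assumptions, not the internals of any real MT system.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A phrase-structure node: a label plus child nodes or bare words."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Collect the words of the sentence by walking the tree left to right."""
    words: List[str] = []
    for child in node.children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def bracketed(node: Node) -> str:
    """Render the tree in the bracket notation used in syntax textbooks."""
    parts = [c if isinstance(c, str) else bracketed(c) for c in node.children]
    return "[" + node.label + " " + " ".join(parts) + "]"


# Hypothetical example sentence: "the dog chased the cat".
SENTENCE = Node("S", [
    Node("NP", [Node("Det", ["the"]), Node("N", ["dog"])]),
    Node("VP", [
        Node("V", ["chased"]),
        Node("NP", [Node("Det", ["the"]), Node("N", ["cat"])]),
    ]),
])

print(bracketed(SENTENCE))
print(leaves(SENTENCE))
```

Both functions are plain recursion over a tree, which is the sense in which the regular, compositional part of language is computer-friendly.  Nothing in this structure says which meaning of a word is intended, which is where the trouble starts.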

Sentences whose meanings can be understood by knowing without ambiguity what each of their components means and the way they are put together are easier to translate than ones whose meanings cannot.  If that sounds incredibly obvious, consider the fact that most if not all of the most common words and sentences have multiple meanings.

“Word”, for example, can stand for the abstract meaning of a word (“‘eat’ and ‘ate’ are the same word”), a word as accepted as correct usage (“irregardless is not a word”), a genre of language (“the written word”), a promise (“you have my word”), or an interjection (“My word!” “Word up”).  Any of these may be a completely different lexical item in another language.  Many poor machine translations are a result of the software having no way to choose between these meanings.  At present, machine translation does better with grammatical, regular nonsense like “colorless green ideas sleep furiously” than with anything context-dependent.

Google translates this literally, bringing to mind one person handing another a vocabulary flashcard.

Grammatical ambiguity similarly trips up machine translation.  “I saw the man with a telescope” could mean that I saw him by looking through a telescope, or that he was holding one (“with” has a lot of barely related meanings).  Again, machine translation often has no means to distinguish among competing meanings, each of which may be coded very differently in the destination language.

Google interprets this as “I saw (the) man who picked up (the) telescope”.

Last, as in the universal translator example, machine translation often fails to distinguish between idioms and literal interpretations of words like “dive in”, “make it”, or “over the hill”.

“I am above (the) hill, but you cannot say.”

Many of these ambiguities are actually resolvable, or at least improvable, with no advanced AI but instead with a corpus-based representation of the source and target languages. Others, as I will show below, require a human or human-like mediator.

MT software could make more logical judgments of the likelihood of particular interpretations based on data about how fluent speakers have used those words and grammar before, found in corpora like COCA. Corpora could be used to inform machine translations by checking words against the context provided by the other words in the sentence, from which a surprising amount can be determined or at least judged more likely.

An MT program equipped with corpus-derived representations of English would know that “word”, when used with the verbs “give” or “take” along with a possessive pronoun like “my”, is almost never plural, whereas when used with the verb “use”, it sometimes is; and that uses of “word” in these contexts likely have different meanings as well.  The translator could therefore base its first choice for translation of “word” on the meaning most often seen in the context of those other words. This would be a statistical matter affecting the order in which possible translations are presented to the user.  Even with no contextual meaning-related data entered by the creators of the software, a program would be able to determine that at least two versions of the word “word” exist and use that to contextualize user feedback of the type that Google Translate accepts.  The result would be a list of translations based on much more information about words than just the first meanings listed in the dictionary.
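The idea of ordering candidate meanings by context can be sketched in a few lines of Python.  Everything here is an assumption for illustration: the six-entry “corpus” is invented by hand (it is not drawn from COCA), the sense labels are my own, and the score is a crude relative frequency rather than anything a production system would use.

```python
from collections import Counter, defaultdict

# Hypothetical hand-made data: each entry pairs a sense label for "word"
# with the context words that appeared near it in one invented sentence.
TOY_CORPUS = [
    ("promise", ["give", "my", "you"]),
    ("promise", ["take", "my", "for", "it"]),
    ("promise", ["have", "my", "you"]),
    ("lexical_item", ["use", "that", "wrong"]),
    ("lexical_item", ["use", "long", "too", "many"]),
    ("lexical_item", ["spell", "this", "how"]),
]


def build_context_counts(corpus):
    """Count how often each context word was seen with each sense."""
    counts = defaultdict(Counter)
    for sense, context in corpus:
        counts[sense].update(context)
    return counts


def rank_senses(context, counts):
    """Return (sense, score) pairs, most likely first.

    The score is the share of a sense's observed context words that match
    the given context; ties are broken alphabetically for determinism.
    """
    scored = []
    for sense, seen in counts.items():
        total = sum(seen.values())
        score = sum(seen[w] for w in context) / total if total else 0.0
        scored.append((sense, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


counts = build_context_counts(TOY_CORPUS)
print(rank_senses(["you", "have", "my"], counts))  # "promise" ranked first
print(rank_senses(["use", "that"], counts))        # "lexical_item" ranked first
```

Note that the output is an ordering of options, not a single answer, which matches the role I am suggesting for corpus data: deciding what to show the user first.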

Corpora could be used to resolve grammatical and idiomatic ambiguities as well, favoring what users would find to be more natural utterances. When competing interpretations are available, as in the “telescope” example above, data from corpora would let the software see that “telescope” is often used as a means to “see”.  This would come from the pairings of “telescope” and “see” in other sentences in the corpus like “We use the Hubble telescope to see distant galaxies”.  The translation software could then prefer the interpretation that holds “telescope” to be the instrument of seeing rather than something physically attached to the man.  Corpus data would lead translation software to the opposite conclusion if the sentence were “I saw the man with the sandwich”, as there are no instances in the corpus of “sandwich” being used as a means of seeing. Even with no hard-coded knowledge that a telescope is a tool and a sandwich is not, MT could learn from corpus data that distinctions between the two exist.  The same goes for “He is living” vs. “He is living in an apartment”, “Can you juggle?” vs. “Can you pass the salt?”, and schema-derived interpretations of polysemous words in phrases like “cut hair”, “cut the grass”, “cut salaries”, and “cut cake”.
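As a rough sketch of the telescope/sandwich reasoning, the Python below counts how often a noun shares a sentence with a form of “see” in a tiny invented corpus and uses that to prefer one reading.  The four sentences, the `threshold` parameter, and the labels “instrument” and “modifier” are assumptions of mine; same-sentence co-occurrence is also a much cruder signal than the parsed verb-instrument relations a serious system would want.

```python
# Hypothetical miniature corpus, invented for this illustration.
TOY_SENTENCES = [
    "we use the hubble telescope to see distant galaxies",
    "she could see the comet through her telescope",
    "he ate a sandwich for lunch",
    "the man bought a sandwich and a coffee",
]

SEE_FORMS = frozenset({"see", "sees", "saw", "seen", "seeing"})


def cooccurrence_count(noun, verb_forms, sentences):
    """Number of sentences containing both the noun and any form of the verb."""
    count = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        if noun in words and words & verb_forms:
            count += 1
    return count


def prefer_reading(noun, verb_forms, sentences, threshold=1):
    """Prefer the instrument reading of "with <noun>" if the corpus pairs
    the noun with the verb at least `threshold` times; otherwise treat the
    phrase as describing the person seen."""
    if cooccurrence_count(noun, verb_forms, sentences) >= threshold:
        return "instrument"
    return "modifier"


print(prefer_reading("telescope", SEE_FORMS, TOY_SENTENCES))  # instrument
print(prefer_reading("sandwich", SEE_FORMS, TOY_SENTENCES))   # modifier
```

No part of this code knows what a telescope or a sandwich is.  The distinction falls out of the counts alone, which is the point of the paragraph above.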

In the case of idioms, special cases of words presented always in the exact same order could be tagged as having meanings separate from those of the words in isolation, as the far end of a spectrum that contains isolated words at one end, lexical chunks like “it seems to me that…” in the middle, and set phrases like “it takes one to know one” at the other end.  As with vocabulary and grammar, the closeness of a given piece of text to one of those set phrases could be used to judge the likelihood that it has a particular, idiomatic meaning.
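One way to put a number on “closeness to a set phrase” is sketched below using `difflib.SequenceMatcher` from the Python standard library, applied to word sequences rather than characters.  The sliding-window approach and the function name `idiom_score` are my own assumptions; this is one possible similarity measure, not a claim about how any existing translator handles idioms.

```python
from difflib import SequenceMatcher


def idiom_score(text, idiom):
    """Best similarity (0.0 to 1.0) between the idiom and any run of words
    in the text that is as long as the idiom.

    A score of 1.0 means the idiom appears word for word; lower scores
    mean only a partial overlap with the set phrase.
    """
    words = text.lower().split()
    target = idiom.lower().split()
    n = len(target)
    # Slide a window of the idiom's length across the text; if the text is
    # shorter than the idiom, the single window is the whole text.
    windows = [words[i:i + n] for i in range(max(1, len(words) - n + 1))]
    return max(SequenceMatcher(None, w, target).ratio() for w in windows)


print(idiom_score("I am over the hill", "over the hill"))
print(idiom_score("The sun is over the mountain", "over the hill"))
```

The first call scores a full match and the second only a partial one, so software could rank the idiomatic reading higher for the first sentence and the literal reading higher for the second.  A high score still only raises the likelihood of the idiomatic meaning: “I live over the hill” matches just as well and is perfectly literal.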

In all cases the results of using corpus data to judge the likelihood of particular meanings would affect the order in which translation options appear to the user, rather than producing a single correct translation. In certain cases a bit more textual input could resolve ambiguities in interpreting certain types of sentences.  A question in which it is ambiguous what “why” is asking about, like the “why did you say Populists were more effective” observed by fellow blogger Neal Whitman, could have its ambiguity resolved by asking the user to choose between alternative simplified questions: “Why did you say that?” or “Why were populists more effective?” and using that information in the translation of the original question.  In other cases, like those with pronouns whose referents are only obvious to human interpreters, like “Soon after Darwin arrived back in England, he began writing his book”, ambiguities could be resolved by asking the user whether “he” refers to “Darwin” or to another person.  This is a frequent bugaboo in machine translations from English into Japanese, because putting a pronoun in the second half of that sentence in Japanese results in a sentence clearly about two different people.  These ambiguities in particular are detectable merely with reference to the text itself and require no specific knowledge of words or a human-like grasp of the propositions they refer to.

A slightly better representation of machine translation.  I am assuming for this graph that a complete grammatical representation depends on a full representation of usage, including exceptions to grammar rules.

Now of course all the above is a much bigger project than simply replacing words and reordering them, as MT seems to do now.  It would however result in much more realistic translations than MT currently provides and be of use for many common uses of MT, such as college students hurriedly consulting their phones while writing foreign language class essays at the last minute.  There are also many cases, though, where realistic translation ultimately requires a machine capable not just of learning patterns in words and grammar but of having a human-like image of the world that is mediated by culture.  That is, many issues of MT require no less than a human-like AI.

Why we need Skynet for our foreign language papers

Some interpretations of events, and subsequent expressions of those events in words and grammar, are so dependent on human psychology that no sophisticated reference to corpora would reliably be able to produce equivalently passable representations in different languages.  Some of these interpretations and events are deceptively simple in terms of their vocabulary and structure, and even free from obvious idioms.

Take boundaries between objects and spatial relations, for one.  My example above shows that current Google Translate has no reliable mechanism for distinguishing “over” as in “the sun is over the mountain” from “over” as in “I live over the mountain.”  Now, this is partly resolvable with reference to corpora – “over/above” is probably more likely to occur with some verbs and nouns, and “over/past” with others – but even when translating unambiguous cases of “over/above”, one must contend with the fact that another language might have different expressions for “over/high above”, “over/contiguous” and “over/passing above and temporarily on top of”, among others that I’m just not thinking of.  Korean, at least as far as I have heard, has one preposition meaning “loosely surrounded by” and another meaning “tightly packed in”, both of which might most naturally be represented by “in” in English and require a mind capable of understanding spatial relations to ferret out which preposition from Korean would be a better translation in any particular case.  In order to produce a passable translation of a sentence like “X is over Y”, MT software should have representations of at least the distances between the two objects, those distances as they appear to the viewer (the sun isn’t really above the mountain at all, of course), whether either of them is in motion and what kind, whether either of them is alive (crucial information for Japanese – the mountain Fujisan and a person named Fujii-san would take different verbs in some cases), and what medium connects them.  It seems to me that little short of a human-like intelligence would be capable of learning these in a rigorous and applicable way.

It so happens that our interpretation even of the purely physical depends quite a bit on filtering by our human brains and cultures.  Given that this is so, language that depends on social context for interpretation should be even more of a nightmare to translate.  Consider the sentence “The meeting is adjourned”.  Said by an authority figure as everyone is seated around a conference table, it has the effect of ending the meeting.  Said by one of the attendees to a latecomer as he runs in the door, it has the effect of informing him of the status of the meeting.  These two effects, a result of felicitousness (whether the speaker is in a position for his/her words to have a performative function) rather than structure or factual content, may yield different translations.  They certainly do in Japanese.  The difference isn’t solely one of particular vocabulary either – “adjourned” is rarely used outside of performative utterances, but “let’s call it a day” could have the same function when spoken by a person with authority. How is MT software to judge the felicitousness of an utterance without a human-like understanding of culture, not to mention a physical representation of the situation (is the speaker wearing a judge’s robes?) in which the utterance occurs?

Because these variations in acceptable translation depend on factors outside of the text itself, MT software cannot resolve them in the same way as pronouns or dropped subjects, by scanning the inputted text for particular structures or words. Some of these may be resolvable with simple questions, but how is MT to know what questions to ask when its only input is text?

A case that has bothered me before, and where translation and translatability become almost an ideology, is that of words whose referents do not exist in other cultures or other languages.  At the moment, Google offers up either mere transcriptions in the target language script (お神輿 = “Mikoshi”) or glosses (餅 = “sticky rice cake”). Now, of course Wikipedia is a pretty good source for people wanting to know about unfamiliar things from other cultures, but if MT is going to do the jobs that people want it to, it is going to need a more substantial way of dealing with borderline untranslatable words and phrases.  A fluent speaker of both languages can do this with metaphor, examples, and physical descriptions.  MT could do this as well if it had a human-like grasp of the various ways of approaching words in addition to their textual definitions.

Of course, if we had a human-like AI that was capable of learning human culture and human-like ways of apprehending the physical world, we probably wouldn’t need to manually program grammars and corpus-crawling concordancers; presumably a superintelligence would be capable of learning these on its own.


Then we’d just need to contend with the fact that this AI is more intent on turning the trace iron in our bodies into paperclips than doing clerical work for us.
