Machine learning algorithm cracks and translates long-lost languages

Machine learning is better than ever before, especially with language. But could it decipher dead languages otherwise lost to us?

Jul 7, 2019

It is Don DeLillo, the American novelist, who perhaps best captures humanity’s fascination with dead languages. His 1982 novel, The Names, follows a sleeper CIA agent investigating the goings-on of a “language cult” which has left behind only hieroglyphs of its own title.

While it’s difficult to recommend this DeLillo novel to a casual reader, it’s easy to understand the appeal of its theme. To this day, the world’s dead languages continue to fascinate linguists, not to mention the imaginative everyman.

There have been developments in computer science, though, that make these supposedly dead languages a little more accessible. No longer do researchers need to rely on the footwork of men like DeLillo’s lead character. Instead, they can turn to computers and the practice of machine learning to decipher what older civilizations once recorded.

The journey to linguistics and machine learning

This story starts not with a DeLillo protagonist but with a British archaeologist, Arthur Evans. In 1886, Evans discovered a stone covered in unfamiliar hieroglyphs on the island of Crete. As he explored the area, he was able to identify a veritable library of similar inscriptions and date the texts therein to 1400 BC.

In doing so, Evans discovered one of the earliest written historical documents in Europe. To get there, he did have to argue that art gives way to language, a thesis which is an entirely different, if equally intriguing, discussion. Regardless, his finding started him on a journey towards an improved understanding of the civilizations that came before his own.

Evans divided the text he discovered into two different scripts. He referred to the older of the two as Linear A. This script comes from the Minoan civilization, which predates what most people know to be Ancient Greece. In fact, the Ancient Greece with which author Rick Riordan has familiarized many a millennial was the younger successor of Minoan Greece.

The second script was referred to as Linear B. Evans believed that, unlike its Bronze Age predecessor, this script appeared on the tablets after the Mycenaean war against the Minoans.

Early translation developments

Even though he can be credited with the discovery of both of these scripts, Evans was not able to decipher either text within his lifetime. That honor would go to Michael Ventris over fifty years later, when the amateur linguist was able to make sense of Linear B.

Ventris, alongside fellow linguist Alice Kober, was able to decipher the lost script courtesy of two foundational conclusions:

  1. He identified proper nouns in Linear B by studying the script’s repeated hieroglyphs
  2. He correctly attributed the language to an early form of Greek and was subsequently able to reverse engineer it based on later forms of the language

Kober, similarly, contributed equally essential discoveries to the eventual translation, stating that:

  1. The final syllables of the language’s terms changed regularly, suggesting a linguistic inflection
  2. The proper nouns identified by Ventris could not be found on the Cretan mainland but instead out on the water

With all of that excitement, one might assume that Linear B carried the secrets to a great treasure or unrecorded history. This isn’t the case. Kober, Ventris, and their affiliates determined that Linear B detailed Grecian inventory records along with details of the time’s trade.

Onward to the future

After the translation of Linear B, Linear A remained. Unfortunately, Ventris died at age 34 and was unable to continue his work, and his affiliates never had the luck to crack the code.

That’s where Regina Barzilay, Jiaming Luo, and Yuan Cao come into play. Barzilay and Luo hail from MIT, while Cao hails from Google’s AI lab in California. To complete the work left unfinished by Ventris, this team has developed a machine-learning algorithm that can crack lost languages through in-depth pattern analysis.

How does this work?

Machine learning requires a computer to have access to an extensive body of data on a particular topic. In the case of this latest algorithm, Barzilay, Luo, and Cao needed to provide their algorithm with a library’s worth of text to familiarize it not only with universal grammatical structures but with the Greek that Linear B was used to record.

This same bank of information taught the algorithm in question to recognize patterns. Like Ventris and Kober, the algorithm needed to be able to break down symbol repetition to better understand the potential translations of the hieroglyphs it was exposed to.
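The idea of breaking down symbol repetition can be sketched in a few lines of Python. This is only an illustration of counting repeated sign sequences, not the team's actual method; the function name `repeated_ngrams` and the placeholder signs ("A", "B", and so on) are invented for the example.

```python
from collections import Counter

def repeated_ngrams(words, n=2, min_count=2):
    """Count runs of n adjacent signs and keep those seen at least min_count times."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[tuple(word[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical tablet: each word is a list of placeholder sign names.
tablet = [
    ["A", "B", "C"],
    ["A", "B", "D", "E"],
    ["F", "G", "H"],
    ["A", "B", "D", "K"],
]

repeats = repeated_ngrams(tablet)
for gram, count in sorted(repeats.items(), key=lambda item: -item[1]):
    print(" ".join(gram), count)
```

Words that share a repeated opening but differ in their final signs are the kind of pattern a decipherer, human or machine, might flag as a recurring name or an inflected stem.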

Finally, the algorithm needed to treat individual words as vectors within a space. Reducing each word to a vector limited the scope of the algorithm but also made it simpler for the machine to create one-to-one translations of its lost languages.

Effectively, the machine was taught that, for example, the vectors for “king” and “woman,” when combined (with the vector for “man” taken away), point to “queen.” If the machine can identify these correlations throughout the entirety of a document, it will be able to provide its creators with accurate translations of a variety of texts.
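In the word-embedding literature this analogy is usually written as king - man + woman ≈ queen. The sketch below shows the arithmetic on a toy example. The three-dimensional vectors and the `analogy` helper are invented for illustration; real embeddings are learned from large text collections and have hundreds of dimensions.

```python
import math

# Toy embeddings with invented values: (royalty, male, female).
EMBEDDINGS = {
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "man": (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "child": (0.0, 0.5, 0.5),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

result = analogy("king", "man", "woman")
print(result)
```

Because relationships like this tend to hold in any language, the geometry of one language's word vectors can, in principle, be lined up against another's.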

The work being done

Barzilay, Luo, and Cao have already put their algorithm to the test. They ran it against the already-translated Linear B and against Ugaritic, an ancient Semitic language closely related to Hebrew.

So far, the machine can translate these languages with 67.2 percent accuracy. The team has not yet tried their machine against the impenetrable Linear A.
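A figure like this is, at its simplest, a ratio of correct answers to total answers. The sketch below assumes accuracy means the share of words whose predicted counterpart exactly matches a known reference translation; the article does not spell out how the team scored its results, and the `accuracy` function and word lists here are invented placeholders.

```python
def accuracy(predicted, reference):
    """Share of reference words whose predicted translation matches exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    correct = sum(
        1 for word, gold in reference.items() if predicted.get(word) == gold
    )
    return correct / len(reference)

# Invented placeholder data: lost-language word -> known-language word.
reference = {"w1": "a", "w2": "b", "w3": "c"}
predicted = {"w1": "a", "w2": "x", "w3": "c"}

score = accuracy(predicted, reference)
print(f"{score:.1%}")
```

A check like this only works for scripts such as Linear B, where a trusted translation already exists to compare against.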

Why?

For starters, there is more to learn about Linear A. The script doesn’t translate into a Greek equivalent, and no one can move to translate it until they identify a progenitor language.

Additionally, the accuracy of Barzilay, Luo, and Cao’s machine needs to improve before they can provide a readable variation of this text.

This is where the power of the computer comes in handy, though. Because computers neither tire nor die early deaths, Barzilay, Luo, and Cao may be able to run Linear A through all accessible languages to find an accurate translation.

Until then, the general public and academic community are left to question the meaning of Linear A and other lost languages with the same fascination as a DeLillo protagonist. Unlike those hapless leads, though, they have hope for greater understanding in the future.
