UCLA Near Eastern Languages and Cultures Professor Robert K. Englund Receives Prestigious Grant for Research Collaboration on Big Data

Published: April 3, 2017

The Division of Humanities is excited to announce that the Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC) project, co-directed by Robert K. Englund of UCLA Near Eastern Languages and Cultures, is now funded through the Trans-Atlantic Platform Digging into Data Challenge. The American National Endowment for the Humanities, the German Research Foundation, and the Canadian Social Sciences and Humanities Research Council are supporting MTAAC as one of 14 international teams of researchers addressing big data questions in the humanities and social sciences.

This project is a collaboration among ancient studies scholars, linguists, and computer scientists to develop computational techniques for translating ancient administrative records stored on cuneiform tablets. The MTAAC project’s broad goal is to address the gap in the Natural Language Processing (NLP) of cuneiform languages. More specifically, the objectives are to:

  • formulate, test and evaluate methodologies for the automated analysis and machine translation (MT) of transliterated cuneiform documents, and to make the technology thus developed available to specialists in the field;
  • make available the translation of a specific and representative set of cuneiform documents to scholars in related disciplines and to a networked public (see below);
  • provide new data for the study of the language, culture, history, economy and politics of the ancient Near East by harvesting the linguistic byproducts of the translation and information extraction processes;
  • formalize these new data utilizing Linked Open Data (LOD) vocabularies, and foster standardization, open data and LOD as practices integral to projects in digital humanities and computational philology.

As a representative and robust test set of cuneiform documents to be used in the initial phase of MTAAC, the group has chosen the corpus of Ur III legal and administrative texts. The researchers believe that these 21st-century BC documents are the best candidates for machine learning experiments because of their simple syntax, homogeneity and imposing numbers: nearly 68,000 texts with 1.5 million lines in Canonical ASCII Transliteration Format, of which 20,000 are available in translation. The corpus is maintained by the Cuneiform Digital Library Initiative (CDLI), a project that also has substantial expertise in the interpretation of this and related cuneiform corpora.

The principal investigator of the MTAAC research team is Heather D. Baker of the University of Toronto; the co-PIs are Christian Chiarcos of the University of Frankfurt and CDLI Director Robert K. Englund of UCLA. Émilie Pagé-Perron, CDLI co-PI, serves as project coordinator.

Visit the MTAAC project website for more information.