Researchers at Goethe University Frankfurt are building a digital platform as a basis for language research

Languages are remarkable. They reflect what is going on in our minds and enable us to communicate and collaborate on projects both large and small. And they do so in many ways: in the hundreds of thousands of years that humans have been using language, between 7,000 and 8,000 different languages have developed around the globe. Some are spoken by more than a billion people, others by only a handful.
The diversity of human language constitutes an inexhaustible resource for research into human cognition and culture, and the digital age offers completely new possibilities in this field. Linguistics has a constantly growing need for more and better data from as many languages as possible. Researchers draw this material from large databases and text collections (corpora) and analyze it with computer-based tools such as evolutionary models, large language models, machine learning and artificial intelligence.
The Institute of Empirical Linguistics at Goethe University Frankfurt has conducted pioneering work in the field of language digitization for decades. As early as the late 1980s, Professor Jost Gippert, then head of department, launched a program to digitize data from ancient and modern languages through a call published in the journal “Die Sprache” (Issue 32/2, 1987). The program became known as the TITUS project (Thesaurus of Indo-European Text and Language Materials). Over the years, the database grew steadily. It was fed with corpora of ancient and modern languages from different language families, as well as dictionaries, grammar books, language maps, diagrams of writing systems, and much more.
The TITUS database, comprising 900 different corpora and text collections in various languages, was consistently upgraded and curated over the decades and found its audience among students and researchers working with corpora, dictionaries and other language material in their linguistic research. Over time, however, the technical interface for large parts of the TITUS database became outdated, which made handling, updating and using the data increasingly difficult.

In 2023, Gerd Carling from Lund University, Sweden, was appointed Professor of Comparative Linguistics at the Institute of Empirical Linguistics, Goethe University Frankfurt. Her work focuses on the study and digitization of ancient and extinct languages, especially Tocharian, an Indo-European language attested from the second half of the first millennium AD in the Tarim Basin, in what is now the Xinjiang Uyghur Autonomous Region. Her research interests also include the Indo-Aryan language Romani and the minority languages of South America.
Carling also played an active role in compiling data for the emerging research field of language evolution and phylogenetics (a computer-based method for determining relationships, in this case between languages). Between 2010 and 2022, she and her team in Lund collated extensive grammatical, lexical and linguistic metadata for thousands of languages. This database, called DiACL (Diachronic Atlas of Comparative Linguistics), provides data for cross-linguistic analysis using computer models.
Upon her appointment as professor, Carling received funds to create a large, shared language data resource that links the TITUS and DiACL databases. This new platform was named CompLing – The Comparative Linguistic Databank of Goethe University. It is a unique source of various types of linguistic data that are useful for researchers and students alike, as well as for anyone else interested in the vast diversity of languages. Users can access the different parts of the platform from the start page.
One part of the platform is the DiACL database, with lexical, grammatical and linguistic metadata from thousands of languages. The data are prepared so that each dataset contains the same word groups or grammatical forms across hundreds of languages or more, and the datasets can be downloaded and then analyzed with the help of computer models. Another part is the TITUS 2.0 database, an updated version of the data from the earlier TITUS database. It stores corpus data and metadata from hundreds of languages, which can be downloaded in usable and sustainable formats.
In a third area, the Polygon Archive, data from the DiACL and TITUS databases are combined. Here, geographical data in the form of “polygons” (digital maps) for almost a thousand languages are available free of charge for use in atlases or to plot the spatial expansion of languages. A further component is being planned, which will provide access to writing systems that can be analyzed with the help of computer models. This tool can be used to examine how writing systems develop or to decipher unknown systems.
Research, studies and general interest are all areas of potential use. CompLing could be used to address research questions about language contact in the past and the present, for example about loanwords between languages or the principles of language evolution. It is also possible to find out more about a specific language, such as information about the culture and beliefs of its speakers. Scientists working with large language models, natural language processing, machine learning or artificial intelligence can access comparable data from many languages. In addition, students from various disciplines can use the data to practice research methods or as a resource for their Bachelor’s or Master’s theses. And finally, members of the general public with an interest in languages can access language data to answer their own questions. (asa)
To the entire issue of Forschung Frankfurt 1/2025: Language. The key to understanding











