Incredibly vast vocabulary

Researchers at Goethe University Frankfurt are building a digital platform as a basis for language research

CompLing.eu is the joint platform for language data from the DiACL and TITUS databases.

Languages are remarkable. They reflect what is going on in our minds and enable us to communicate and collaborate on projects both large and small. And they do so in many ways: In the hundreds of thousands of years that humans have been using languages, between 7,000 and 8,000 different ones have developed around the globe. Some are spoken by more than a billion people, others by only a handful.

The diversity of human language constitutes an inexhaustible resource for research into human cognition and culture, and the digital age offers completely new possibilities in this field. Linguistics has a constantly growing need for more and better data from as many languages as possible. Researchers use this material from large databases and text collections (corpora) for studies produced with the help of computer-based tools such as evolution models, large language models, machine learning and artificial intelligence.

The Institute of Empirical Linguistics at Goethe University Frankfurt has conducted pioneering work in the field of language digitization for decades. As early as the late 1980s, Professor Jost Gippert, then head of department, launched a program to digitize data from ancient and modern languages with a call published in the journal “Die Sprache” (Issue 32/2, 1987). The program became known as the TITUS project (Thesaurus of Indo-European Text and Language Materials). Over the years, the database grew steadily. It was fed not only with corpora of ancient and modern languages from different language families, but also with dictionaries, grammar books, language maps, diagrams of writing systems, and much more.

The TITUS database, comprising 900 different corpora and text collections in various languages, was consistently upgraded and curated over the decades and found its audience among students and researchers working with corpora, dictionaries and other language material in their linguistic research. Over time, however, the technical interface for large parts of the TITUS database became outdated, which made handling, updating and using the data increasingly difficult.

Gerd Carling, Professor of Comparative Linguistics, has taught at Goethe University Frankfurt since 2023.

In 2023, Gerd Carling from Lund University, Sweden, was appointed Professor of Comparative Linguistics at the Institute of Empirical Linguistics, Goethe University Frankfurt. Her work focuses on the study and digitization of ancient and extinct languages, especially Tocharian, an Indo-European language attested in the second half of the 1st millennium AD in the Tarim Basin, in what is now the Xinjiang Uyghur Autonomous Region. Her research interests also include the Indo-Aryan language Romani and the minority languages of South America.

Carling also played an active role in compiling data for the emerging research field of language evolution and phylogenetics (a computer-based method for determining relationships, in this case between languages). Between 2010 and 2022, she and her team in Lund collated extensive grammatical, lexical and linguistic metadata for thousands of languages. This database, called DiACL (Diachronic Atlas of Comparative Linguistics), provides data for cross-linguistic analysis using computer models.

Upon her appointment as professor, Carling received funds to create a large, shared language data resource that links the TITUS and DiACL databases. This new platform was named CompLing – The Comparative Linguistic Databank of Goethe University. It constitutes a unique source of various types of linguistic data that are useful for researchers and students alike, but also for anyone else interested in the vast diversity of languages. Users can access the different parts of the platform from the start page.

These include the DiACL database with lexical, grammatical and linguistic metadata from thousands of languages. The data are prepared so that each dataset contains the same word groups or grammatical forms across hundreds of languages or more; the datasets can be downloaded and then analyzed with the help of computer models. Another part of the platform is the TITUS 2.0 database, an updated version of the data from the earlier TITUS database. It stores corpus data and metadata from hundreds of languages, which can be downloaded in usable and sustainable formats.

In a third area, the Polygon Archive, data from the DiACL and TITUS databases are combined. Here, geographical data in the form of “polygons” (digital maps) for almost a thousand languages are available free of charge for use in atlases or to plot the spatial expansion of languages. A further component is being planned, which will provide access to writing systems that can be analyzed with the help of computer models. This tool can be used to examine how writing systems develop or to decipher unknown systems.

Research, studies, general interest: these are all areas of potential use. CompLing could be used to answer scientific questions about language contact in the past and the present, for example about loanwords between languages, or about principles of language evolution. But it is also possible to find out more about a specific language, such as information about the culture and beliefs of its speakers. Scientists working with large language models, natural language processing, machine learning or artificial intelligence can access comparable data from many languages. In addition, students from various disciplines can use the data to practice research methods or as a resource for their Bachelor’s or Master’s theses. And finally, lay readers with an interest in languages can access language data to answer their own questions. (asa)

To the entire issue of Forschung Frankfurt 1/2025: Language. The key to understanding
