Developing Multilingual, Open and European Large Language Models

At the 9th International Summer School on AI and Big Data, Prof. Dr. Georg Rehm and Dr. Pedro Ortiz Suarez (DFKI) will talk about Developing Multilingual, Open and European Large Language Models. The keynote will take place on Tuesday, 4 July 2023, from 9:00 a.m. to 10:30 a.m.

Keynote: Developing Multilingual, Open and European Large Language Models

Since the introduction of ChatGPT in November 2022, Large Language Models (LLMs) have become ubiquitous in everyday life for a large portion of the global population, and have simplified a wide range of tasks for experts and non-professional users alike.

However, the underlying technologies of conversational models such as ChatGPT remain closed-source and in the hands of probably fewer than a dozen private organisations worldwide. In our presentation we will report on our efforts in the project OpenGPT-X, funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), to develop large generative language models for the German language, while making them open-source and respectful of European values. In addition to providing an overview of the project, we will present our work on multilingual language models in collaboration with the EU project European Language Equality (ELE), covering the curation, filtering and preparation of a large, multilingual data set. We will also give details about the training of our European models and show the first results of the evaluation.

Furthermore, we will give an overview of the current state of digital language inequality in Europe, which we aim to transform into full digital language equality by 2030. Two platforms and initiatives of crucial importance in that regard are the European Language Grid (ELG) and the recently started Common European Language Data Space (LDS), which will also be briefly highlighted.

Prof. Dr. Georg Rehm

Prof. Dr. Georg Rehm works as a Principal Researcher in the Speech and Language Technology Lab at Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI) in Berlin. He is a DFKI Research Fellow and Adjunct Professor at the Institut für deutsche Sprache und Linguistik at Humboldt-Universität zu Berlin.

Currently, Georg Rehm is the Coordinator of the EU-funded project Language Data Space (LDS), Co-coordinator of the EU-funded project European Language Equality (ELE) and involved, as a principal investigator, in the projects OpenGPT-X, HumanE-AI-Net, SPEAKER, DataBri-X, SciLake and NFDI4DS. Since 2013, Georg Rehm has been the Head of the German/Austrian Chapter of the World Wide Web Consortium (W3C), hosted at DFKI in Berlin. He is also a member of the DIN Presidential Committee FOCUS.ICT. In the 2021/2022 term, he served as the Secretary of the European Chapter of the Association for Computational Linguistics (EACL). Georg Rehm holds an M.A. in Computational Linguistics and Artificial Intelligence, Linguistics and Computer Science from the University of Osnabrück and a PhD in Computational Linguistics from the University of Gießen. He has authored, co-authored or edited 250 research publications.

Dr. Pedro Ortiz Suarez

Dr. Pedro Ortiz Suarez is a researcher in the Speech and Language Technology Team at DFKI GmbH Berlin. His work focuses on the development of large corpora for pre-training large language models, especially for under-resourced languages, historical languages and specialized domains. He is particularly interested in the impact that these corpora have on the final performance of LLMs and how these models can be improved with data-driven approaches.

He is also interested in tasks such as Named Entity Recognition (NER), Dependency Parsing, Part-of-Speech tagging, Machine Translation and Document Structuring. Pedro was previously a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim. He completed his PhD at Inria and Sorbonne Université in France. He is the leader of the OSCAR project, an active member of the OpenGPT-X project, a founding member of the BigScience Project and one of the main authors of the CamemBERT model.

Read more about the 9th International Summer School on AI and Big Data.