EuroBERT, a multilingual encoder serving European languages

While the performance of LLMs makes the headlines, encoder models remain fundamental building blocks of NLP and are among the most downloaded models on Hugging Face. Developed through a collaboration between the MICS laboratory of CentraleSupélec, Diabolocom, Artefact and Unbabel, the EuroBERT family of open-source encoders represents a significant advance in multilingual NLP, combining sovereignty, transparency and performance.

EuroBERT, developed as part of three ongoing doctoral theses, is available in three sizes (210 million, 610 million and 2.1 billion parameters). Its architecture is closely inspired by that of Llama 3, and it was trained on a corpus of 5 trillion tokens (twice as much as conventional encoders), including multilingual, code and mathematics datasets.

The training pipeline includes two phases, a pre-training phase and an adjustment phase, and uses the masked language modeling (MLM) objective.
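To make the MLM objective concrete, here is a minimal sketch of the masking step: some tokens are hidden, and the model is trained to recover them from the surrounding context. The `[MASK]` symbol, the `mask_tokens` helper and the masking probabilities are illustrative assumptions, not details taken from EuroBERT's actual training setup.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, not EuroBERT's real tokenizer entry


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token with probability `mask_prob`.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where a mask was applied and None elsewhere, so a training loss would
    only be computed on the hidden positions.
    """
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer (assumption).
tokens = "the encoder reads the whole sentence at once".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

Because an encoder sees the tokens on both sides of each masked position, this objective suits tasks such as classification and retrieval rather than left-to-right text generation.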

It supports eight major European languages (English, French, German, Spanish, Italian, Dutch, Portuguese and Polish) and seven non-European languages (Chinese, Russian, Japanese, Vietnamese, Arabic, Turkish and Hindi).

A major asset of EuroBERT is its ability to natively handle sequences of up to 8,192 tokens, whereas conventional encoder models such as BERT and its variants (such as RoBERTa) are generally limited to sequences of 512 tokens, which can fragment the understanding of a text. This extended context length improves the accuracy of analyses, even for the most complex NLP tasks.
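The effect of the context limit can be illustrated with simple arithmetic. The sketch below counts how many windows are needed to cover a document under each limit; the 6,000-token document length, the `num_windows` helper and the optional overlap are hypothetical choices made for illustration.

```python
import math


def num_windows(n_tokens, window, overlap=0):
    """Count the windows of `window` tokens needed to cover `n_tokens`,
    with `overlap` tokens shared between consecutive windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in the range [0, window)")
    if n_tokens <= window:
        return 1
    stride = window - overlap  # number of new tokens each extra window adds
    return 1 + math.ceil((n_tokens - window) / stride)


doc_length = 6000  # hypothetical document length, in tokens

print(num_windows(doc_length, 512))   # 12 separate chunks
print(num_windows(doc_length, 8192))  # 1: the whole document fits
```

With a 512-token limit, the same document is split into a dozen pieces that the model processes independently, so information spanning two pieces is lost unless the results are recombined downstream.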

Various applications

EuroBERT's capabilities make it an essential building block for:

Information and text extraction: its effectiveness in identifying and classifying documents opens up prospects for companies seeking to optimize their information flows;
Technical and scientific language processing: its advanced training allows it to better understand and analyze complex texts, particularly in mathematics and programming;
Machine translation and automatic summarization: it competes with existing cutting-edge solutions, while guaranteeing precision adapted to European languages.

A fruitful public-private collaboration

This project was carried out by CIFRE doctoral students Nicolas Boizard, Hippolyte Gisserot-Boukhlef and Duarte Alves, under the supervision of Pierre Colombo, Céline Hudelot and André Martins. In addition to the teams at MICS, IST, Diabolocom, Artefact and Unbabel, it received support from teams at Université Grenoble Alpes, CNRS, LISN (Interdisciplinary Laboratory of Digital Sciences), Illuin Technology, IRT Saint-Exupéry and CINES. The paper describing their work is available at https://arxiv.org/abs/2503.05500.

Trained on GENCI's Adastra supercomputer, EuroBERT opens up strategic prospects for businesses and research. Beyond the technical advance, it illustrates Europe's ability to innovate and develop sovereign AI solutions.
