
While the performance of LLMs makes the headlines, encoder models remain fundamental building blocks of NLP and are among the most downloaded models on Hugging Face. Developed through a collaboration between the MICS laboratory of CentraleSupélec, Diabolocom, Artefact and Unbabel, the EuroBERT suite of open-source encoders represents a significant advance in multilingual NLP, combining sovereignty, transparency and performance.
EuroBERT, developed as part of three ongoing doctoral theses, is available in three sizes (210 million, 610 million and 2.1 billion parameters). It is closely inspired by the Llama 3 architecture and was trained on a corpus of 5 trillion tokens (twice as much as conventional encoders), including multilingual, code and mathematics datasets.
The training pipeline includes two phases, pre-training and an adjustment phase, and uses the masked language modeling (MLM) objective.
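To make the MLM objective concrete, here is a minimal sketch in plain Python of how a training example can be built: some tokens are hidden, and the model is asked to recover them. The `mask_tokens` helper, the `[MASK]` placeholder and the masking rate are assumptions chosen for illustration; they are not EuroBERT's actual tokenizer or settings.

```python
import random

# Hypothetical placeholder; real tokenizers define their own mask token.
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Build one masked language modeling example.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    The default mask_rate is an illustrative value, not EuroBERT's setting.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)       # the model must predict this token
        else:
            masked.append(token)       # token stays visible
            labels.append(None)        # no prediction target at this position
    return masked, labels


sentence = "the encoder reads the whole sentence at once".split()
masked, labels = mask_tokens(sentence, mask_rate=0.3, seed=42)
print(masked)
print(labels)
```

Because the encoder sees the tokens on both sides of each masked position, it learns bidirectional representations, which is what makes this family of models well suited to classification and retrieval.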
EuroBERT supports eight major European languages (English, French, German, Spanish, Italian, Dutch, Portuguese and Polish) and seven non-European languages (Chinese, Russian, Japanese, Vietnamese, Arabic, Turkish and Hindi).
A major asset of EuroBERT is its ability to natively handle sequences of up to 8,192 tokens, whereas conventional encoder models such as BERT and its variants (such as RoBERTa) are generally limited to sequences of 512 tokens, which can fragment the understanding of a text. This extended context length improves the accuracy of analyses, even for the most complex NLP tasks.
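The fragmentation effect can be illustrated with a short sketch. It assumes a document that has already been tokenized into a list of token ids and splits it into non-overlapping windows; the `split_into_windows` helper and the 6,000-token document are hypothetical, and real pipelines often use overlapping windows instead.

```python
def split_into_windows(token_ids, max_length):
    """Split a token sequence into consecutive, non-overlapping windows."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Stand-in for a tokenized document of 6,000 tokens (hypothetical size).
document = list(range(6000))

short_context = split_into_windows(document, 512)
long_context = split_into_windows(document, 8192)

print(len(short_context))  # 12 windows, each encoded separately
print(len(long_context))   # 1 window, the whole document at once
```

With a 512-token limit, the model never sees more than one twelfth of this document at a time, so any link between distant passages is lost; with an 8,192-token limit, the whole document fits in a single pass.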
Various applications
EuroBERT's abilities position it as an essential building block for:
- Information and text extraction: its effectiveness in identifying and classifying documents opens up prospects for companies seeking to optimize their information flows;
- Technical and scientific language processing: its advanced training allows it to better understand and analyze complex texts, in particular in mathematics and programming;
- Translation and automatic summarization: it competes with existing cutting-edge solutions while guaranteeing precision adapted to European languages.
A fruitful public-private collaboration
Trained on GENCI's Adastra supercomputer, EuroBERT opens strategic prospects for businesses and research. Beyond a technical advance, it illustrates Europe's ability to innovate and develop sovereign AI solutions.