Context

Generative large language models (LLMs) are now ubiquitous and are integrated into increasingly complex processing chains, notably retrieval-augmented generation (RAG), enabling a wide variety of use cases.
However, evaluating them remains difficult: existing benchmarks are largely Anglo-centric in both language and culture, and do not necessarily cover all uses and domains.
The question of evaluation therefore arises in particular for French and, more generally, for languages other than English.

Call for papers

In this workshop, we propose to bring together researchers and practitioners from academia and industry who are interested in the many facets of LLM evaluation in specialized domains or in languages other than English.
This year, we would like to highlight the theme of RAG and the links between LLMs and information retrieval.

The workshop will be structured around an invited talk and presentations of papers selected through a call for papers open to all work within this scope.
This includes research on:

the evaluation of complete systems, particularly RAG
the evaluation of foundation or fine-tuned models
the creation or adaptation of benchmarks for French or other languages of interest, whether high- or low-resource, in general or specialized domains, or for noisy or non-standard language (e.g., social networks, voice commands)
the evaluation of NLP tasks (translation, summarization, information extraction, etc.)
the adaptation of existing evaluation methodologies to generative systems
ethical dimensions, bias, privacy, cultural or legislative alignment
performance dimensions in terms of computing time, memory, and energy efficiency
evaluation with users, ergonomics, and cognitive aspects
the evaluation of multimodal models (e.g., text-image, text-speech)
...

See the “Submissions” page for details on how to submit an article.

Challenges

As part of the EvalLLM2026 workshop, we are proposing two challenges focused on LLMs.

The first challenge is a task-based evaluation of LLMs (or other approaches). It focuses on entity linking in the medical domain in a few-shot setting. For a complete description, see the dedicated page.

The second challenge focuses on RAG and attribution. For a complete description, see the dedicated page.

Committees

The workshop is co-organized by AMIAD (1) and GT RAG of GdR CNRS TAL (2):

Vincent Claveau, AMIAD, Rennes (1), vincent.claveau@def.gouv.fr
Nicolas Diniz, AMIAD, Rennes (1)
Juliane Flament, AMIAD, Rennes (1)
Nihel Kooli, AMIAD, Rennes (1)
José G Moreno, Univ. de Toulouse (2)
Albert Murienne, AMIAD, Rennes (1)
Christophe Servan, AMIAD, Palaiseau (1,2)

Scientific committee:

Rachel Bawden, Inria
Guillaume Bernard, LNE
Pauline Soutrenon, Inria mission Défense et Sécurité
Olivier Ferret, CEA-List
Damien Nouvel, INALCO
Didier Schwab, LIG
Gilles Sérasset, LIG
Fabian Suchanek, Télécom Paris, Institut polytechnique de Paris
Olivier Galibert, LNE
François Yvon, ISIR - CNRS
José G Moreno, IRIT
Christophe Servan, AMIAD
Lynda Tamine, IRIT
Adrian Chifu, LIS
Sahar Ghannay, LISN
Josiane Mothe, IRIT
Sébastien Fournier, LIS
Benjamin Piwowarski, ISIR - CNRS
Mohand Boughanem, IRIT
Mathieu Valette, INALCO