Context
Generative large language models (LLMs) are now ubiquitous and are integrated into increasingly complex processing chains, particularly Retrieval-Augmented Generation (RAG), which opens up a wide variety of use cases.
However, evaluating them poses serious problems: existing benchmarks are largely Anglo-centric (language, culture) and do not necessarily cover all uses and domains.
The question of their evaluation therefore arises in particular for French and more generally for languages other than English.
Call for papers
This workshop aims to bring together researchers from academia and industry who are interested in the many facets of LLM evaluation in specialized fields or in languages other than English.
This year, we would like to highlight the theme of RAG and the links between LLMs and information retrieval.
The workshop will be structured around a guest presentation and presentations of selected papers following a call for papers on all work within this scope.
This includes research on:
- the evaluation of complete systems, particularly RAG
- the evaluation of foundation or fine-tuned models
- the creation or adaptation of benchmarks for French or other languages of interest, whether high or low resource, in general or specialized domains, or for noisy or non-standard languages (e.g., social networks, voice commands, etc.)
- the evaluation of NLP tasks (translation, summarization, information extraction, etc.)
- adaptation of existing evaluation methodologies to generative systems
- ethical dimensions, bias, privacy, cultural or legislative alignment
- performance dimensions in terms of computing time, memory, and energy efficiency
- evaluation with users, ergonomics, and cognitive aspects
- evaluation of multimodal models (e.g., text-image, text-speech, etc.)
- ...
See the “Submissions” page for details on how to submit a paper.
Challenges
As part of the EvalLLM2026 workshop, we are proposing two challenges focused on LLMs.
The first challenge aims to evaluate LLMs (or other approaches) by task. It focuses on entity linking in the medical domain in a few-shot context. For a complete description, see the dedicated page.
The second challenge focuses on RAG and attribution. For a complete description, see the dedicated page.
Committees
The workshop is co-organized by AMIAD (1) and GT RAG of GdR CNRS TAL (2):
- Vincent Claveau, AMIAD, Rennes, vincent.claveau@def.gouv.fr
- Nicolas Diniz, AMIAD, Rennes (1)
- Juliane Flament, AMIAD, Rennes (1)
- Nihel Kooli, AMIAD, Rennes (1)
- José G Moreno, Univ. de Toulouse (2)
- Albert Murienne, AMIAD, Rennes (1)
- Christophe Servan, AMIAD, Palaiseau (1,2)
Scientific committee:
- Rachel Bawden, Inria
- Pauline Soutrenon, Inria mission Défense et Sécurité
- Olivier Ferret, CEA-List
- Damien Nouvel, INALCO
- Didier Schwab, LIG
- Gilles Sérasset, LIG
- Fabian Suchanek, Télécom Paris, Institut polytechnique de Paris
- François Yvon, ISIR - CNRS
- José G Moreno, IRIT
- Christophe Servan, AMIAD
- Lynda Tamine, IRIT
- Adrian Chifu, LIS, Marseille
- Sahar Ghannay, LISN
- Josiane Mothe, IRIT
- Sébastien Fournier, LIS
- Benjamin Piwowarski, ISIR/CNRS, Paris
- Mohand Boughanem, IRIT