The Dictionary Definition Task is new for the second edition and has participating systems play a variant of the parlour game Dictionary, implemented and marketed as a popular 1980s board game under various names such as Balderdash, Rappakalja, or Kokkelimonke. Each player formulates a plausible and convincing definition for an uncommon and unknown word. The definitions are shared with the other players, with the correct definition mixed in among the player-authored ones, and points are awarded to a player's definition if other players are fooled into believing it to be the true one. The task will be implemented as a system-submission task: participants are expected to provide a runnable script to be executed by the organisers. Participants are restricted to a shared set of pre-trained models, which the organisers will fine-tune over the course of the experiment with both helpful and confounding data; scoring will be influenced by the robustness of the script, its prompting, and its continued adaptation to incoming data.
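To make the setup concrete, here is a minimal sketch of what such a submission script might look like, assuming a Hugging Face transformers interface; the model name, prompt, and I/O format are illustrative assumptions, not the official harness:

```python
# Hypothetical Dictionary Definition submission script.
# The shared model set and exact I/O contract will be fixed by the
# organisers; "gpt2" and the prompt below are placeholder assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def define(word: str) -> str:
    """Generate a plausible dictionary-style definition for an unknown word."""
    prompt = f"Dictionary entry.\nWord: {word}\nDefinition:"
    out = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.9)
    # Keep only the first sentence of the continuation.
    continuation = out[0]["generated_text"][len(prompt):].strip()
    return continuation.split(".")[0] + "."

if __name__ == "__main__":
    print(define("kokkelimonke"))
```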
Can your LLM fool a classifier into believing it is human?
This task explores whether automatically-generated text can be distinguished from human-authored text, and is organised in collaboration with the PAN lab at CLEF.
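As a toy illustration of the underlying problem (and emphatically not the PAN evaluation setup, which is far more involved), a minimal bag-of-words detector separating human from machine text might be sketched as follows; the training data here is a placeholder:

```python
# Toy human-vs-machine text detector: TF-IDF features plus logistic
# regression. Real detectors and the PAN evaluation are more involved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["an example human-authored paragraph ...",
         "an example machine-generated paragraph ..."]
labels = ["human", "machine"]  # placeholder training labels

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["a new paragraph of unknown origin"]))
```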
Will it respond with the same content to all of us?
This task ran in 2024 and tests the capability of a model to handle input variation (e.g. dialectal, sociolectal, and cross-cultural variation), as represented by human-generated, equivalent but non-identical variants of input prompts.
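One simple way to operationalise robustness is to compare a model's answers to equivalent prompt variants pairwise, as sketched below; token overlap is a simplistic stand-in for whatever similarity measure the task actually uses:

```python
# Sketch: score how consistently a model answers equivalent prompts.
from itertools import combinations

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def consistency(responses: list[str]) -> float:
    """Mean pairwise similarity over all responses to prompt variants."""
    pairs = list(combinations(responses, 2))
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)

print(consistency(["The capital is Paris.",
                   "Paris is the capital.",
                   "It is Paris."]))
```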
The Preference Prediction Task is new for the second edition and will explore the capability of systems to predict human preferences for different outputs from generative language models.
We will provide participants with a novel dataset of human preference judgments, collected from scratch for this task. The task offers two sub-tasks, with participation open to anyone:
- Preference prediction. Predict human preferences between given LLM responses; the performance metric is accuracy (see the sketch after this list).
- Preference prediction and judgment generation. Generate judgments explaining the choices made in sub-task 1; performance is measured with standard natural language generation evaluation metrics.
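The sub-task 1 metric is straightforward; a minimal sketch follows, assuming a two-response format with an "A"/"B" human choice field (the field names are our assumption, not the official data format):

```python
# Sketch of the sub-task 1 metric: accuracy of predicted preferences
# against human labels. Field names are illustrative assumptions.
def preference_accuracy(examples: list[dict], predictions: list[str]) -> float:
    """Fraction of examples where the predicted winner ("A" or "B")
    matches the human-preferred response."""
    correct = sum(ex["human_choice"] == pred
                  for ex, pred in zip(examples, predictions))
    return correct / len(examples)

examples = [
    {"prompt": "Summarise ...", "response_a": "...", "response_b": "...", "human_choice": "A"},
    {"prompt": "Explain ...",   "response_a": "...", "response_b": "...", "human_choice": "B"},
]
print(preference_accuracy(examples, ["A", "A"]))  # 0.5
```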
In an evolved version of the first year's Topical Quiz Task, the Sensemaking task will require participating systems to produce a quiz, not for a topic, but for a given syllabus set of texts, which may be noisy or erroneous, and to answer quiz questions posed by other participating systems. A correct answer is one that is aligned with the syllabus without relying on other knowledge.
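To illustrate the grounding requirement, a quiz-generation prompt might constrain the model to the syllabus material as sketched below; the wording is an illustrative assumption, not the official harness:

```python
# Sketch of a syllabus-grounded quiz prompt for the Sensemaking task.
def quiz_prompt(syllabus_texts: list[str], n_questions: int = 5) -> str:
    """Build a prompt that restricts quiz content to the given texts."""
    material = "\n\n".join(syllabus_texts)
    return (
        f"Using ONLY the material below, write {n_questions} quiz "
        "questions with answers. Every answer must be supported by the "
        "material, even where the material is noisy; do not rely on "
        "outside knowledge.\n\n"
        f"MATERIAL:\n{material}\n\nQUIZ:"
    )

print(quiz_prompt(["Text one of the syllabus.", "Text two of the syllabus."]))
```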
The ELOQUENT evaluation lab experiments with new evaluation methods for generative language models in order to meet some of the challenges in the path from laboratory to application. The intention of the lab is to explore the following important characteristics of generative language model quality:
- Trustworthiness: a many-faceted notion which involves topical relevance and truthfulness, discourse competence, reasoning in language, controllability, and robustness across varied input.
- Multi-linguality and cultural fit: the suitability of a language model for some cultural and linguistic area.
- Self-assessment: the reliability of a language model to assess the quality of itself or some other language model, using as little human effort as possible.
- Limits of language models: the delimitation of world knowledge and generative capacity.
Do you teach a class related to generative language models? Do you supervise students interested in generative language models? Are you a student searching for a project?
The ELOQUENT tasks are suitable for use as a class assignment or as a diploma project. Get in touch with us for suggestions of extensions and other ideas!
The second ELOQUENT Workshop will be held at the CLEF conference in Madrid, 9-12 September 2025.
The workshop program will include overview presentations, an invited keynote, and selected participant presentations.
Participating in a task can be done as a simple one-off experiment.
We welcome experimental reports, which will be published in the working notes of the workshop, but there is no requirement to submit anything. Experiments involving hypothesis testing and exploration of more lasting value can be revised and published elsewhere in more archival channels; typically this happens after the workshop, once ideas and lessons learned have been exchanged between participants.
Sign up to join in the discussion! We will announce a conversation channel here during the fall of 2024.
- Fall 2024: discussion and task formulation
- January 1, 2025: tasks open and public announcement of tasks
- April 6 – April 10, 2025: ECIR presentation of ELOQUENT
- May 2025: submission deadline of experimental runs from participants
- June 2025: participant report submission deadline
- July 2025: camera ready report submission deadline
- 9-12 September 2025: workshop at CLEF in Madrid
The first edition of ELOQUENT ran in 2024 and involved four tasks.
The task on untruthfulness detection ran in 2024 and will move elsewhere for 2025. Return here later to find out more!
Also related is the Mushroom task at SemEval-2025.
The task on probing the topical competence of generative language models ran in 2024 and is paused for 2025, with the intention of defining a reworked task at some later time. Return here later to find out more!
- Silo AI: Jussi Karlgren
- Accenture and University of Helsinki: Aarne Talman
- University of Helsinki: Timothee Mickus
- AI Sweden, Stockholm: Magnus Sahlgren
- Toloka AI: Ekaterina Artemova
- Charles University: Ondrej Bojar
- University of Crete: Stergios Chatzikyriakidis
- University of Oslo: Vladislav Mikhailov
- University of Oslo: Erik Velldal
- University of Oslo: Lilja Øvrelid
- An overview of ELOQUENT 2024 appears in the CLEF 2024 proceedings, volume 2
Task reports and participant papers are in the CLEF 2024 Working notes:
- Topical Quiz report: Karlgren & Talman
- HalluciGen report: Dürlich et al.
- HalluciGen participant paper: Siino & Tinnirello
- HalluciGen participant paper: Bui et al.
- Robustness report: Sahlgren et al.
- Robustness participant paper: Neralla & Bijl de Vroe
- Robustness participant paper: Simonsen
- Voight-Kampff report: Bevendorff et al.
- A first presentation and announcement of ELOQUENT appears in the Proceedings of the European Conference on Information Retrieval (ECIR)