Can your LLM fool a classifier into believing it is human?
This task explores whether automatically-generated text can be distinguished from human-authored text, and is organised in collaboration with the PAN lab at CLEF.
Will it respond with the same content to all of us?
This task first ran in 2024 and tests the capability of a model to handle input variation (for example dialectal, sociolectal, and cross-cultural variation), as represented by human-generated, equivalent but non-identical varieties of input prompts.
The Preference Prediction Task is a new task for the second edition and will explore the capability of systems to predict human preferences for different outputs from generative language models.
We will provide participants with a novel dataset of human preferences and judgments, collected from scratch for this task. The task offers two sub-tasks, with participation open to anyone:
- Preference prediction. Predict human preferences between given LLM responses; the performance metric is accuracy (see the scoring sketch after this list).
- Preference prediction and judgment generation. Generate judgments explaining the choices made in sub-task 1; performance is measured with standard natural language generation evaluation metrics.
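For concreteness, here is a minimal sketch of how sub-task 1 could be scored. The item format, field names, and file layout are assumptions made for illustration, not the official data schema.

```python
# Minimal scoring sketch for sub-task 1 (preference prediction).
# NOTE: the item structure below is hypothetical, not the official ELOQUENT format.

def preference_accuracy(gold_items, predictions):
    """Fraction of items where the predicted preference matches the human one."""
    correct = sum(
        1 for item in gold_items
        if predictions.get(item["id"]) == item["human_preference"]
    )
    return correct / len(gold_items)

if __name__ == "__main__":
    gold = [
        {"id": "q1", "human_preference": "A"},  # annotators preferred response A
        {"id": "q2", "human_preference": "B"},
    ]
    preds = {"q1": "A", "q2": "A"}  # a system's predicted preferences
    print(f"accuracy = {preference_accuracy(gold, preds):.2f}")  # accuracy = 0.50
```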
Can your language model prep, sit, or rate an exam for you?
In an evolved version of the first year's Topical Quiz Task, the Sensemaking task requires participating systems to produce a quiz, not for a topic, but for a given syllabus of texts (potentially noisy or erroneous), and to answer quiz questions posed by other participating systems. A correct answer is one that is aligned with the syllabus without relying on other knowledge.
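To illustrate the correctness criterion, the sketch below checks, in a deliberately naive way, whether an answer is supported by the syllabus texts alone. The actual task evaluation will be more sophisticated; this word-overlap check is only an assumption-laden illustration of the idea.

```python
# Naive illustration of the Sensemaking correctness criterion: an answer counts
# as correct only if its content is grounded in the syllabus texts themselves,
# not in outside world knowledge. Hypothetical sketch, not the official scorer.

def supported_by_syllabus(answer: str, syllabus_texts: list[str]) -> bool:
    """Check whether every content word of the answer occurs in the syllabus."""
    syllabus_vocab = {
        word.strip(".,;:!?").lower()
        for text in syllabus_texts
        for word in text.split()
    }
    content_words = [w.lower() for w in answer.split() if len(w) > 3]
    return all(word in syllabus_vocab for word in content_words)

syllabus = ["The mitochondrion is the powerhouse of the cell."]
print(supported_by_syllabus("the mitochondrion powerhouse", syllabus))       # True
print(supported_by_syllabus("chloroplasts perform photosynthesis", syllabus))  # False
```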
The ELOQUENT evaluation lab experiments with new evaluation methods for generative language models in order to meet some of the challenges on the path from laboratory to application. The intention of the lab is to explore the following important characteristics of generative language model quality:
- Trustworthiness: a many-faceted notion which involves topical relevance and truthfulness, discourse competence, reasoning in language, controllability, and robustness across varied input.
- Multi-linguality and cultural fit: the suitability of a language model for some cultural and linguistic area.
- Self-assessment: how reliably a language model can assess the quality of itself or of another language model, using as little human effort as possible.
- Limits of language models: the delimitation of world knowledge and generative capacity.
Do you teach a class related to generative language models? Do you supervise students interested in generative language models? Are you a student searching for a project?
The ELOQUENT tasks are suitable for use as a class assignment or as a diploma project. Get in touch with us for suggestions of extensions and other ideas!
The second ELOQUENT Workshop will be held at the CLEF conference in Madrid, September 9-12, 2025.
The workshop program will feature overview presentations, an invited keynote, and a selection of participant presentations.
Participating in a task can be done as a simple one-off experiment.
We welcome experimental reports, which will be published in the working notes of the workshop, but there is no requirement to submit anything. Experiments that involve hypothesis testing or exploration of more lasting value can be revised and published in more archival channels, typically after the workshop, once ideas and learnings have been exchanged between participants.
Sign up to join in the discussion! We will announce a conversation channel here during the fall of 2024.
- Fall 2024: discussion and task formulation
- January 1, 2025: tasks open and public announcement of tasks
- April 6-10, 2025: presentation of ELOQUENT at ECIR
- May 2025: submission deadline of experimental runs from participants
- June 2025: participant report submission deadline
- July 2025: camera ready report submission deadline
- 9-12 September 2025: workshop at CLEF in Madrid
The first edition of ELOQUENT ran in 2024 and involved four tasks.
The task on untruthfulness detection ran in 2024 and will move elsewhere for 2025. Return here later to find out more!
The task on probing the topical competence of generative language models ran in 2024 and is paused for 2025, with the intention of defining a reworked task at some later time. Return here later to find out more!
Mu-SHROOM is a non-English-centric SemEval-2025 shared task to advance the state of the art in hallucination detection for content generated with LLMs. Mu-SHROOM has annotated hallucinated content in 10 different languages, produced by top-tier LLMs: Arabic (Modern Standard), Chinese (Mandarin), English, Finnish, French, German, Hindi, Italian, Spanish, and Swedish. You can participate in as many languages as you'd like by accurately identifying spans of hallucinated content.
Key dates: the dev set is already available; the test set will be released by January 10, 2025; the evaluation phase ends January 31, 2025. The SemEval workshop will take place in summer 2025, co-located with an upcoming *ACL conference.
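As a starting point for span identification, here is a minimal sketch of character-level span matching in the spirit of Mu-SHROOM. The official metric may differ; intersection-over-union over character indices is an assumption used here for illustration.

```python
# Character-level span overlap for hallucination detection (illustrative only;
# check the Mu-SHROOM task page for the official evaluation protocol).

def char_set(spans):
    """Expand (start, end) character spans into a set of character indices."""
    return {i for start, end in spans for i in range(start, end)}

def span_iou(pred_spans, gold_spans):
    """Intersection-over-union between predicted and gold hallucinated characters."""
    pred, gold = char_set(pred_spans), char_set(gold_spans)
    if not pred and not gold:
        return 1.0  # nothing hallucinated and nothing predicted: perfect agreement
    return len(pred & gold) / len(pred | gold)

# Prediction covers characters 0-4, gold covers 3-7: 2 shared / 8 total = 0.25
print(span_iou([(0, 5)], [(3, 8)]))
```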
- Silo AI: Jussi Karlgren
- Accenture and University of Helsinki: Aarne Talman
- AI Sweden, Stockholm: Magnus Sahlgren
- Toloka AI: Ekaterina Artemova
- Charles University: Ondrej Bojar
- University of Oslo: Vladislav Mikhailov
- University of Oslo: Erik Velldal
- University of Oslo: Lilja Øvrelid
- The overview of ELOQUENT 2024 is in the CLEF 2024 proceedings, volume 2
Task reports and participant papers are in the CLEF 2024 Working Notes:
- Topical Quiz: Karlgren & Talman
- HalluciGen report: Dürlich et al.
- HalluciGen participant: Siino & Tinnirello
- HalluciGen participant: Bui et al.
- Robustness: Sahlgren et al.
- Robustness participant: Neralla & Bijl de Vroe
- Robustness participant: Simonsen
- Voight-Kampff: Bevendorff et al.
- A first presentation and announcement of ELOQUENT is in the Proceedings of the European Conference on Information Retrieval (ECIR)