Is your LLM really clever? Can it mark its own homework?
ELOQUENT Lab 2025
2nd edition of the lab for evaluation of generative language model quality at CLEF, the Conference and Labs of the Evaluation Forum
Task Voight-Kampff

Can your LLM fool a classifier into believing it is human?

This task explores whether automatically generated text can be distinguished from human-authored text. It is organised in collaboration with the PAN lab at CLEF.
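As a concrete (and unofficial) illustration of the detection problem, the following minimal sketch trains a TF-IDF bag-of-words classifier to separate human from machine text. The training texts and labels are made-up placeholders, not task data.

    # Hedged sketch: a naive human-vs-machine text detector baseline.
    # The texts and labels are placeholders, not ELOQUENT/PAN task data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "I walked to the market and bought some apples.",  # human (placeholder)
        "As an AI language model, I can certainly help.",  # machine (placeholder)
    ]
    train_labels = [0, 1]  # 0 = human, 1 = machine

    # TF-IDF features plus logistic regression: a common cheap baseline.
    detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression())
    detector.fit(train_texts, train_labels)

    print(detector.predict(["The weather turned cold overnight."]))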


Task Robustness and Consistency

Will it respond with the same content to all of us?

This task ran in 2024 and tests the capability of a model to handle input variation -- e.g. dialectal, sociolectal, and cross-cultural -- as represented by human-generated, equivalent but non-identical variants of input prompts.
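To make "same content across input variation" concrete, here is a minimal sketch of one possible consistency measure (ours, not the official scoring): mean pairwise token-overlap similarity between responses to equivalent prompt variants. The ask_model function is a hypothetical stand-in for the system under test.

    # Hedged sketch: pairwise consistency of responses to equivalent prompts.
    # ask_model() is a hypothetical placeholder for the system under test.
    from itertools import combinations

    def ask_model(prompt: str) -> str:
        return "Paris is the capital of France."  # canned placeholder response

    def jaccard(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    variants = [
        "What's the capital of France?",
        "Which city is France's capital?",
        "Name the capital city of France.",
    ]
    responses = [ask_model(v) for v in variants]

    # Mean pairwise similarity: 1.0 means identical answers to all variants.
    pairs = list(combinations(responses, 2))
    consistency = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    print(f"consistency: {consistency:.2f}")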


Task Preference Score Prediction

The Preference Prediction Task is a new task for the second edition and will explore the capability of systems to predict human preferences for different outputs from generative language models.

We will provide participants with a novel dataset of human preferences and judgments, collected from scratch for this task. The task offers two sub-tasks, with participation open to anyone:

  1. Preference prediction. Predict human preferences between given LLM responses; the performance metric is the accuracy score (see the sketch after this list).
  2. Preference prediction and judgment generation. Generate judgments explaining the choices made in sub-task 1; performance metrics include standard natural language generation evaluation metrics.
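
For sub-task 1, accuracy is simply the fraction of response pairs for which the predicted preference matches the human one. A minimal sketch with made-up placeholder labels:

    # Hedged sketch: accuracy for pairwise preference prediction (sub-task 1).
    # Gold and predicted labels are placeholders: "A" or "B" marks which of
    # the two LLM responses in a pair is preferred.
    gold = ["A", "B", "B", "A", "A"]
    predicted = ["A", "B", "A", "A", "B"]

    correct = sum(g == p for g, p in zip(gold, predicted))
    accuracy = correct / len(gold)
    print(f"accuracy: {accuracy:.2f}")  # 0.60 for this toy example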

Task Sensemaking

Can your language model prep, sit, or rate an exam for you?

In an evolved version of the first year's Topical Quiz task, the Sensemaking task will require participating systems to produce a quiz, not for a topic, but for a given syllabus: a set of texts, potentially noisy or erroneous. Systems will also answer quiz questions posed by other participating systems. A correct answer is one that is aligned with the syllabus, without relying on other knowledge.
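
As a rough, unofficial illustration of that correctness criterion, one naive proxy for "aligned with the syllabus" is the fraction of an answer that is supported by the syllabus texts. All data below is placeholder material.

    # Hedged sketch: a naive check of whether an answer is grounded in a
    # syllabus. The real task scoring may differ; all data is placeholder.
    syllabus = [
        "Photosynthesis converts light energy into chemical energy.",
        "Chlorophyll absorbs mostly red and blue light.",
    ]
    answer = "Photosynthesis converts light energy into chemical energy."

    syllabus_vocab = set(" ".join(syllabus).lower().split())
    answer_tokens = answer.lower().split()

    # Fraction of answer tokens that also occur somewhere in the syllabus.
    support = sum(t in syllabus_vocab for t in answer_tokens) / len(answer_tokens)
    print(f"syllabus support: {support:.2f}")  # 1.00 here: fully grounded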

ELOQUENT goals

The ELOQUENT evaluation lab experiments with new evaluation methods for generative language models in order to meet some of the challenges on the path from laboratory to application. The intention of the lab is to explore the following important characteristics of generative language model quality:

  1. Trustworthiness: a many-faceted notion which involves topical relevance and truthfulness, discourse competence, reasoning in language, controllability, and robustness across varied input.
  2. Multi-linguality and cultural fit: the suitability of a language model for some cultural and linguistic area.
  3. Self-assessment: how reliably a language model can assess the quality of itself or of another language model, using as little human effort as possible.
  4. Limits of language models: the delimitation of world knowledge and generative capacity.

Student projects

Do you teach a class related to generative language models? Do you supervise students interested in generative language models? Are you a student searching for a project?


The ELOQUENT tasks are suitable for use as a class assignment or as a diploma project. Get in touch with us for suggestions of extensions and other ideas!

2025 Workshop in Madrid

The second ELOQUENT Workshop will be held at the CLEF conference in Madrid, September 9-12, 2025.


The workshop program will include overview presentations, an invited keynote, and selected participant presentations.

Sign up for the tasks!

Participating in a task can be done as a simple one-off experiment.

We welcome experimental reports, which will be published in the working notes of the workshop, but there is no requirement to submit anything. Experiments that involve hypothesis testing and exploration of more lasting value can be revised and published in more archival channels; typically this happens after the workshop, where ideas and lessons learned have been exchanged between participants.

Sign up to join in the discussion! We will announce a conversation channel here during the fall of 2024.

Timeline

  • Fall 2024: discussion and task formulation
  • January 1, 2025: tasks open and are publicly announced
  • April 06 – April 10, 2025: ECIR presentation of ELOQUENT
  • May 2025: submission deadline for experimental runs from participants
  • June 2025: participant report submission deadline
  • July 2025: camera-ready report submission deadline
  • 9-12 September 2025: workshop at CLEF in Madrid

Previous Editions

The first edition of ELOQUENT ran in 2024 and involved four tasks.

HalluciGen Task

The task on untruthfulness detection ran in 2024 and will move elsewhere for 2025. Return here later to find out more!


Topical Quiz Task

The task on probing the topical competence of generative language models ran in 2024 and is paused for 2025, with the intention of defining a reworked task at some later time. Return here later to find out more!


Mu-SHROOM
Related Task

Mu-SHROOM is a non-English-centric SemEval-2025 shared task to advance the state of the art in hallucination detection for content generated with LLMs. Mu-SHROOM provides annotated hallucinated content in 10 languages, generated by top-tier LLMs: Arabic (Modern Standard), Chinese (Mandarin), English, Finnish, French, German, Hindi, Italian, Spanish, and Swedish. You can participate in as many languages as you'd like by accurately identifying spans of hallucinated content.

Key dates: the dev set is already available; the test set will be released by January 10, 2025; the evaluation phase ends January 31, 2025. The SemEval workshop will take place in summer 2025, co-located with an upcoming *ACL conference.

Organising committee

  • Silo AI: Jussi Karlgren
  • Accenture and University of Helsinki: Aarne Talman
  • AI Sweden, Stockholm: Magnus Sahlgren
  • Toloka AI: Ekaterina Artemova
  • Charles University: Ondrej Bojar
  • University of Oslo: Vladislav Mikhailov
  • University of Oslo: Erik Velldal
  • University of Oslo: Lilja Øvrelid
Contact us at eloquent-clef2025-organizers AT googlegroups.com

Publications

Thank you

The ELOQUENT lab is supported by the DeployAI project through its activities on evaluation of generative language models.

Toloka AI also supports the ELOQUENT lab by collecting datasets for the Preference Prediction Task.
