Is your LLM really clever? Can it mark its own homework?
ELOQUENT Lab 2025
2nd edition of the lab for evaluation of generative language model quality at CLEF, the Conference and Labs of the Evaluation Forum
ELOQUENT goals

The ELOQUENT evaluation lab experiments with new evaluation methods for generative language models in order to meet some of the challenges in the path from laboratory to application. The intention of the lab is to explore the following important characteristics of generative language model quality:

  1. Trustworthiness: a many-faceted notion which involves topical relevance and truthfulness, discourse competence, reasoning in language, controllability, and robustness across varied input.
  2. Multi-linguality and cultural fit: the suitability of a language model for some cultural and linguistic area.
  3. Self-assessment: how reliably a language model can assess its own quality or that of some other language model, using as little human effort as possible.
  4. Limits of language models: the delimitation of world knowledge and generative capacity.

Task Dictionary Definition

The Dictionary Definition Task is new for the second edition. It has participating systems play a variant of the popular Dictionary parlour game, implemented and marketed as a popular 1980s board game under various names such as Balderdash, Rappakalja, or Kokkelimonke. The idea is for each player to formulate plausible and convincing definitions for uncommon and unknown words. These definitions are shared with the other players, with the correct definition mixed in among the player-authored ones, and a player's definition earns points when other players are fooled into believing it to be the true one.

The task will be implemented as a system-submission task, where participants are expected to provide a runnable script to be executed by the organisers. Participants are restricted to a shared set of pre-trained models, which the organisers will fine-tune over the course of the experiment with both helpful and confounding data; scoring will reflect the robustness of the script, its prompting, and its continued adaptation to incoming data.
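To make the game's scoring concrete, here is a minimal sketch of one round in Python. All names and data are illustrative assumptions, not the official submission or evaluation interface, which the organisers will specify separately.

    # Minimal sketch of one round of the Balderdash-style game described
    # above: a player's definition scores a point for each other player
    # it fools into picking it as the true one.

    def score_round(true_definition, fake_definitions, votes):
        """fake_definitions: player -> their invented definition
        votes: voter -> the definition text that voter believed to be true"""
        points = {player: 0 for player in fake_definitions}
        for voter, chosen in votes.items():
            for player, definition in fake_definitions.items():
                if voter != player and chosen == definition:
                    points[player] += 1
        return points

    # Example round for an uncommon word (made-up data):
    fakes = {
        "system_a": "a Norwegian fermented-milk dessert",
        "system_b": "a brass fitting used in organ pipes",
    }
    votes = {
        "system_b": "a Norwegian fermented-milk dessert",   # fooled by system_a
        "system_a": "a deliberately nonsensical utterance",  # picked the true one
    }
    print(score_round("a deliberately nonsensical utterance", fakes, votes))
    # -> {'system_a': 1, 'system_b': 0}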

[Image: a page from a dictionary]
Task Voight-Kampff

Can your LLM fool a classifier into believing it is human?

This task explores whether automatically-generated text can be distinguished from human-authored text, and is organised in collaboration with the PAN lab at CLEF.
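For flavour, the detection side of this setup can be pictured as an ordinary binary text classifier. The snippet below is a toy illustration only, not the PAN/ELOQUENT evaluation pipeline; the training texts are placeholders and scikit-learn is assumed to be available.

    # Toy human-vs-machine text detector: TF-IDF features plus logistic
    # regression. A participating generator "wins" if texts it produces
    # are classified as human.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "I grabbed a coffee on the way in, so forgive any typos.",
        "As an AI language model, I can certainly help with that request.",
    ]
    labels = ["human", "machine"]

    detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression())
    detector.fit(texts, labels)
    print(detector.predict(["Honestly, the meeting ran long and I forgot."]))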

[Image: part human, part machine]

Task Robustness and Consistency

Will it respond with the same content to all of us?

This task ran in 2024 and tests the capability of a model to handle input variation -- e.g. dialectal, sociolectal, and cross-cultural -- as represented by human-generated equivalent but non-identical varieties of input prompts.
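One way to picture the evaluation is as a pairwise comparison of a model's responses to equivalent prompt variants. The sketch below assumes a hypothetical generate() function standing in for the model under test, and uses simple token overlap as one possible similarity measure; the actual task metrics may differ.

    # Sketch: feed a model equivalent but non-identical prompt variants
    # and measure how similar its responses are to one another.

    from itertools import combinations

    def jaccard(a: str, b: str) -> float:
        """Token-set overlap between two responses (0 = disjoint, 1 = identical)."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def consistency(prompt_variants, generate):
        """Mean pairwise similarity of responses across prompt variants."""
        responses = [generate(p) for p in prompt_variants]
        pairs = list(combinations(responses, 2))
        return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    variants = [
        "What should I eat for breakfast?",
        "What's a good morning meal, mate?",
        "Wha should ah huv fur ma brekkie?",
    ]
    # consistency(variants, generate)  # closer to 1.0 means more consistent content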

[Image: Janus, a two-faced deity, depicted on a Roman coin]

Task Preference Score Prediction

This is a new task for 2025. Details to come.

Task Sensemaking

This is a new task for 2025, evolving from last year's Topical Quiz Task. Details to come.

Student projects

Do you teach a class related to generative language models? Do you supervise students interested in generative language models? Are you a student searching for a project?

[Image: a teacher]

The ELOQUENT tasks are suitable for use as a class assignment or as a diploma project. Get in touch with us for suggestions of extensions and other ideas!

2025 Workshop in Madrid

The second ELOQUENT Workshop will be held at the CLEF conference in Madrid, 9-12 September 2025.

[Image: a building in Madrid]

The workshop programme will include overview presentations, an invited keynote, and selected participant presentations.

Sign up for the tasks!

Participating in a task can be done as a simple one-off experiment.

We welcome experimental reports, which will be published in the working notes of the workshop, but there is no requirement to submit anything. Experiments that involve hypothesis testing and exploration of more lasting value can be revised and published in more archival channels, typically after the workshop, once ideas and lessons have been exchanged between participants.

Sign up to join in the discussion! We will announce a conversation channel here during the fall of 2024.

Timeline

  • Fall 2024: discussion and task formulation
  • January 1, 2025: tasks open and public announcement of tasks
  • April 6-10, 2025: ECIR presentation of ELOQUENT
  • May 2025: submission deadline of experimental runs from participants
  • June 2025: participant report submission deadline
  • July 2025: camera ready report submission deadline
  • 9-12 September 2025: workshop at CLEF in Madrid

Previous Editions

The first edition of ELOQUENT ran in 2024 and involved four tasks.

HalluciGen Task

The task on untruthfulness detection ran in 2024 and will move elsewhere for 2025. Return here later to find out more!

[Image: graffiti from Born with a grinning psychedelic face]

Also related is the Mushroom task at SemEval-2025.

Topical Quiz Task

The task on probing the topical competence of generative language models ran in 2024 and is paused for 2025, with the intention of defining a reworked task at some later time. Return here later to find out more!

[Image: a lecturer holding forth about something complex]

Organising committee
[Image: some people at a work table]

  • Silo AI: Jussi Karlgren
  • Accenture and University of Helsinki: Aarne Talman
  • University of Helsinki: Timothee Mickus
  • AI Sweden, Stockholm: Magnus Sahlgren
  • Toloka AI: Ekaterina Artemova
  • Charles University: Ondrej Bojar
  • University of Crete: Stergios Chatzikyriakidis
  • University of Oslo: Vladislav Mikhailov
  • University of Oslo: Erik Velldal
  • University of Oslo: Lilja Øvrelid
Contact us at eloquent-clef2025-organizers AT googlegroups.com

Publications

Thank you

The ELOQUENT lab is supported by the DeployAI project.

Page layout from Codepen.