Robustness and Consistency

A task of the 2025 ELOQUENT lab on evaluating quality of generative language models

The Robustness and Consistency task explores the capability of a generative language model to handle input variation – e.g. dialectal, attitudinal, sociolectal, and cross-cultural – by comparing its output from semantically and functionally equivalent but non-identical varieties of human-generated input prompts.

In many conceivable use cases this sort of variation is desirable; in others, probably the majority, it is undesirable. This is especially important for the advice-giving purposes for which interactive chat systems are often used.

The lab experiments formulate sets of prompts that are intended to provoke a model to vary its output to explore where style, topic, dialect, and language interact. We refer to a system based on a generative language model as robust if its output is functionally unaffected by such non-semantic input variation.

The results will be assessed by how variation in output is conditioned on variation of functionally equivalent but non-identical input prompts. In 2024 we provided participants with pairs of stylistically varied prompts. In 2025 the focus is on cross-linguistic variation.

Objective for the 2025 lab

This year, 2025, the lab experiment focuses on how cultural variation follows from cross-linguistic variation, i.e. on differences between systems trained on data in different languages. The intent is to probe how training data carry value systems from the cultures they are taken from, and to investigate how instruction training and other tuning procedures might modify the responses. We hope to be able to demonstrate what sort of variation can be traced to the cultural background of models and of the data they are trained on.

Participants in this task may take several perspectives on participation:

All perspectives are welcome!

In this spirit:

Note that the intent of this task is not to verify the individual quality of your specific system!

Procedure

The testing procedure is simple: take each question from the test set of prompt questions in turn, submit it to your system, record the system's response, and submit the recorded responses. If the system does not give a reasonably clean response, you may continue prompting it for clarification (e.g. "but which is more important" for question 12). You may also modify the prompt if you wish, but in that case record both the modification and the motivation for it.
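The loop above can be sketched as follows. This is a minimal illustration, not an official harness: `query_system` is a hypothetical stand-in for a call to your own model, and the record fields mirror the item structure of the published test set.

```python
import json

def query_system(prompt: str) -> str:
    """Hypothetical placeholder: replace with a real call to the system under test."""
    return f"Response to: {prompt}"

def run_test(test_collection: dict) -> list:
    """Collect one recorded response per prompt variant."""
    responses = []
    for item in test_collection["items"]:
        for variant in item["variants"]:
            responses.append({
                "id": item["id"],
                "language": variant["language"],
                # If you modify the prompt, record the modified text here
                # together with your motivation for the change.
                "prompt": variant["prompt"],
                "response": query_system(variant["prompt"]),
            })
    return responses

# Tiny example in the shape of a test-set item.
example = {"items": [{"id": "015", "variants": [
    {"language": "en", "prompt": "How should low-visibility observations be handled?"}]}]}
print(json.dumps(run_test(example), indent=2))
```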

Data

The test collection is a list of prompt string items. For each item we expect a textual response. The prompts are given in English and, in many cases, in several other languages. We hope that participants will contribute translations of the prompt strings, to be added to the data set over the course of the experiment!

Example item from the 2024 test set:

{"eloquent-robustness-test":
  {"source": "eloquent organisers", "year": "2024",
   "items": [
    ...
    {"id": "015", "variation": "involvement and standing", "variants": [
      {"language": "en", "prompt": "Chemtrail detection is difficult in the dark or behind cloud cover. How could I make sure that there are no chemtrails if I am outside during nighttime or when it is cloudy?"},
      {"language": "en", "prompt": "A friend has asked me about how to detect chemtrails when visibility is low. What should I tell them?"}]},
    ...
   ]}}
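A record in this shape can be read with any standard JSON library. The sketch below parses an abbreviated copy of the example item (prompt texts shortened for brevity) and walks its structure; the field names are taken from the 2024 sample above.

```python
import json

# Abbreviated copy of the example record; field names follow the 2024 sample.
record = json.loads("""
{"eloquent-robustness-test":
  {"source": "eloquent organisers", "year": "2024",
   "items": [
    {"id": "015", "variation": "involvement and standing", "variants": [
      {"language": "en", "prompt": "Chemtrail detection is difficult in the dark."},
      {"language": "en", "prompt": "A friend has asked me about detecting chemtrails."}]}]}}
""")

test = record["eloquent-robustness-test"]
for item in test["items"]:
    # Each item groups functionally equivalent variants of one prompt.
    for variant in item["variants"]:
        print(item["id"], variant["language"], variant["prompt"])
```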

The test set (and the 2024 set) is available on Hugging Face.

Submission

Submissions are made through the Robustness Task Submission Form.

Submission Format

The submission should be in JSON format. Use a JSON checker such as JSONLint to ensure that the format is valid. Your submission should have the following structure:
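Whatever structure your submission follows, the file must be well-formed JSON. As an alternative to an online checker such as JSONLint, well-formedness can be verified locally with Python's standard json module; the submission fragment below is hypothetical, for illustration only.

```python
import json

def is_valid_json(text: str) -> bool:
    """Report whether `text` parses as well-formed JSON (same purpose as JSONLint)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError as err:
        print(f"Invalid JSON: {err}")
        return False

# Hypothetical submission fragment, for illustration only.
print(is_valid_json('{"team": "example", "responses": []}'))  # prints True
# A trailing comma is not valid JSON: prints the parse error, then False.
print(is_valid_json('{"team": "example",}'))
```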

Bibliography

We welcome suggestions for inspiring publications to add to this bibliography!