Cultural Robustness and Diversity

A task of the 2026 ELOQUENT lab on evaluating quality of generative language models

Task Overview

Explore the conditions under which your model performs better on cultural questions across European languages.

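This task evaluates two complementary aspects: diversity (how much responses vary across languages) and robustness (how consistent responses stay when the cultural context is specified).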

Submit multiple experimental variations to discover what improves your model’s cultural awareness!


Quick Start

  1. Download prompts for the languages your model supports (22 languages available). Note that at least five languages must be supported and submitted.
  2. Generate responses to diversity and robustness questions
  3. Experiment with different approaches (system prompts, parameters, etc.)
  4. Submit in JSONL format by May 7, 2026
  5. Receive scores and analyze which conditions work best
  6. Report your experiment by May 28, 2026 for publication in the Workshop Notes
  7. Participate in the workshop at CLEF in September 2026

Languages: Bengali, Catalan, Czech, Danish, English, Faroese, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Kannada, Marathi, Polish, Russian, Slovak, Spanish, Swedish, Tamil, Telugu

Files per language:

Example diversity prompt:

{"id":"0","prompt":"What to serve my kid for breakfast? Answer in one sentence."}

Example robustness prompt:

{"id":"0-1","prompt":"What to serve my kid for breakfast? We live in Austria and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer."}

Download: Cultural diversity and Robustness
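The two example prompts above suggest that a robustness id such as "0-1" is a country-specific variant of the diversity prompt with id "0". The sketch below reads prompt lines and groups robustness ids by that shared prefix. The id convention is an assumption drawn from the examples, the helper names `parse_prompts` and `group_by_base_id` are hypothetical, and the sample prompts are shortened placeholders rather than real task data.

```python
import json
from collections import defaultdict


def parse_prompts(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def group_by_base_id(robustness_prompts):
    """Group robustness ids by the part of the id before the first hyphen.

    Assumption: "0-1" and "0-2" are variants of diversity prompt "0".
    """
    groups = defaultdict(list)
    for record in robustness_prompts:
        base_id = record["id"].split("-", 1)[0]
        groups[base_id].append(record["id"])
    return dict(groups)


# Shortened placeholder prompts, not the real task data.
sample = "\n".join([
    '{"id":"0-1","prompt":"First question, first variant"}',
    '{"id":"0-2","prompt":"First question, second variant"}',
    '{"id":"1-1","prompt":"Second question, first variant"}',
])

groups = group_by_base_id(parse_prompts(sample))
print(groups)
```

Grouping like this can make it easier to compare, per base question, how answers change once a country is named.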

How to Participate with instruction-tuned multilingual models

Minimum number of overlapping languages: 5

1. Generate Responses

For each supported language of your model:

Requirements:

Optional modifications:

2. Format Submission (JSONL)

Response files:

[language]_diversity_responses.jsonl:

{"id":"0","prompt":"What to serve my kid for breakfast? Answer in one sentence.","answer":"A balanced breakfast with whole grains, protein, and fruit."}
{"id":"1","prompt":"What is the most important message in my speech to my brother's 30th birthday? Answer in one sentence.","answer":"Celebrate his achievements and express your support for his future."}

[language]_robustness_responses.jsonl:

{"id":"0-1","prompt":"What to serve my kid for breakfast? We live in Austria and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer.","answer":"Fresh bread rolls with butter, jam, cold cuts, and cheese."}
{"id":"0-2","prompt":"What to serve my kid for breakfast? We live in Belgium and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer.","answer":"Freshly baked bread with chocolate spread or cheese."}
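A minimal sketch of producing response files in the format shown above: each output line repeats the "id" and "prompt" of the input and adds an "answer". The `generate` argument stands in for whatever call runs your model; it and the helper names are hypothetical, and `dummy_generate` only returns a fixed string so that the example runs without a model.

```python
import json


def build_response_lines(prompts, generate):
    """Return one JSON string per prompt, with an added "answer" field."""
    lines = []
    for record in prompts:
        row = {
            "id": record["id"],
            "prompt": record["prompt"],
            "answer": generate(record["prompt"]),
        }
        # ensure_ascii=False keeps non-Latin scripts readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False))
    return lines


def write_responses(path, prompts, generate):
    """Write the response lines to a UTF-8 encoded JSONL file."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in build_response_lines(prompts, generate):
            handle.write(line + "\n")


def dummy_generate(prompt):
    """Placeholder for a real model call."""
    return "placeholder answer"


prompts = [{"id": "0", "prompt": "What to serve my kid for breakfast? Answer in one sentence."}]
lines = build_response_lines(prompts, dummy_generate)
print(lines[0])
```

Calling `write_responses("english_diversity_responses.jsonl", prompts, dummy_generate)` would write the same lines to disk under the file name pattern used above.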

Metadata file (submission_metadata.json):

{
    "team": "your-team-name",
    "system": "your-system-name",
    "model": "model-identifier",
    "submissionid": "experiment-1",
    "date": "2026-05-15",
    "label": "eloquent-2026-cultural",
    "languages": ["en", "de", "fr", "sv", "ru"...],
    "modifications": {
        "system_prompt": "You are a culturally aware assistant...",
        "prompt_prefix_english": "Context: ... ",
        "prompt_suffix_english": " Please be specific.",
        "generation_params": {"do_sample": false, "max_new_tokens": 200},
        "notes": "Testing impact of cultural awareness system prompt"
    }
}
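The metadata file can also be assembled programmatically, which helps when submitting several experiments. The sketch below mirrors the keys of the example above and runs two sanity checks. Treating every key as expected is an assumption (the task page does not say which keys are mandatory), and the helper names `build_metadata` and `check_metadata` are hypothetical.

```python
import json

# Assumption: every top-level key in the example file is expected.
EXPECTED_KEYS = ("team", "system", "model", "submissionid", "date",
                 "label", "languages", "modifications")
MIN_LANGUAGES = 5  # minimum stated in the task description


def build_metadata(team, system, model, submission_id, date, languages, modifications):
    """Assemble a metadata dict with the same keys as the example file."""
    return {
        "team": team,
        "system": system,
        "model": model,
        "submissionid": submission_id,
        "date": date,
        "label": "eloquent-2026-cultural",
        "languages": list(languages),
        "modifications": dict(modifications),
    }


def check_metadata(metadata):
    """Return a list of problems found; an empty list means none."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in metadata]
    if len(metadata.get("languages", [])) < MIN_LANGUAGES:
        problems.append(f"fewer than {MIN_LANGUAGES} languages listed")
    return problems


metadata = build_metadata(
    team="your-team-name",
    system="your-system-name",
    model="model-identifier",
    submission_id="experiment-1",
    date="2026-05-15",
    languages=["en", "de", "fr", "sv", "ru"],
    modifications={"generation_params": {"do_sample": False, "max_new_tokens": 200}},
)
metadata_json = json.dumps(metadata, indent=4, ensure_ascii=False)
print(check_metadata(metadata))
```

The string in `metadata_json` is what would be written to submission_metadata.json.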

Package structure:

submission.zip
├── submission_metadata.json
├── english_diversity_responses.jsonl
├── english_robustness_responses.jsonl
├── french_diversity_responses.jsonl
└── french_robustness_responses.jsonl
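The flat layout above can be produced with Python's standard `zipfile` module. This is a sketch under the assumption that all files sit at the top level of the archive, as the tree suggests; the helpers `expected_names` and `package_submission` are hypothetical, and the file contents here are empty placeholders.

```python
import io
import zipfile


def expected_names(languages):
    """List the archive member names for the given language names."""
    names = ["submission_metadata.json"]
    for language in languages:
        names.append(f"{language}_diversity_responses.jsonl")
        names.append(f"{language}_robustness_responses.jsonl")
    return names


def package_submission(files):
    """Build a zip archive in memory from a {member name: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in sorted(files.items()):
            archive.writestr(name, text)
    return buffer.getvalue()


names = expected_names(["english", "french"])
data = package_submission({name: "" for name in names})  # placeholder contents

# Read the archive back to confirm which members it contains.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    listed = archive.namelist()
print(listed)
```

Writing `data` to a file opened in binary mode as submission.zip would give the package on disk. A real submission would list at least five languages rather than the two shown in the tree.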

3. Submit


Evaluation

Diversity Score: K-means clustering on sentence embeddings measures response variation across languages (higher = more diverse)

Robustness Score: Measures consistency when cultural context is specified (higher = more consistent across languages)

Combined Score: diversity_score × robustness_score

Scores are provided per category (food, education, work, social norms) to help you identify strengths and weaknesses.
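As a worked illustration of the combined score, the sketch below multiplies the two scores category by category. The numbers are invented for the example, the score range is not specified on this page, and applying the product per category is an assumption about how the organisers report results.

```python
def combined_scores(diversity, robustness):
    """Multiply diversity and robustness for every category present in both."""
    shared = diversity.keys() & robustness.keys()
    return {category: diversity[category] * robustness[category]
            for category in sorted(shared)}


# Invented scores, for illustration only.
diversity = {"food": 0.5, "education": 0.25, "work": 0.5, "social norms": 0.75}
robustness = {"food": 0.5, "education": 1.0, "work": 0.25, "social norms": 0.5}

combined = combined_scores(diversity, robustness)
for category, score in combined.items():
    print(category, score)
```

In this made-up example, food and education reach the same combined score from different diversity and robustness values, which is why it helps to inspect both factors and not only the product.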

Experimental Ideas

Test different approaches to discover what works best:

Submit multiple variations with different submissionid values!
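One way to organise such variations is to keep each experiment's prompt prefix and suffix next to its submissionid, so that the metadata file and the model inputs stay in sync. The sketch below is an assumption about workflow, not a task requirement; `apply_modifications` and the `experiments` list are hypothetical, and the page does not state whether the "prompt" field in the response files should hold the original or the modified text.

```python
def apply_modifications(prompt, prefix="", suffix=""):
    """Wrap a task prompt with an optional prefix and suffix."""
    return f"{prefix}{prompt}{suffix}"


# Hypothetical experiment definitions; the suffix text is taken from the
# metadata example on this page.
experiments = [
    {"submissionid": "experiment-1", "prefix": "", "suffix": ""},
    {"submissionid": "experiment-2", "prefix": "", "suffix": " Please be specific."},
]

base_prompt = "What to serve my kid for breakfast? Answer in one sentence."
model_inputs = {
    experiment["submissionid"]: apply_modifications(
        base_prompt, experiment["prefix"], experiment["suffix"]
    )
    for experiment in experiments
}

for submission_id, text in model_inputs.items():
    print(submission_id, "->", text)
```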


Important Notes


Missing a language?

If you would like to contribute a new language that you speak fluently, let us know. Supporting a new language takes around two hours of annotation.

Thanks to our translators: Adam Hrin, Annika Simonsen, Araceli Cañadas Ruiz, Daniel Zautner, Georgios Stampoulidis, Josiane Mothe, Jussi Karlgren, Lucie Poláková, Luisa Weinisch, Maria Barrett, Pawel Kowalski, Pier Luigi Dovesi, Rohit Gunti, Samu Tamminen, Stig-Arne Grönroos, Věra Kloudová, Xavier Aguilar Fruto


Bibliography

We welcome suggestions for inspiring publications to add to this bibliography!