Cultural Robustness and Diversity

A task of the 2026 ELOQUENT lab on evaluating quality of generative language models

Task Overview

Explore the conditions under which your model performs better on cultural questions across European languages.

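This task evaluates two complementary aspects: diversity (how much responses vary across languages) and robustness (how consistent responses stay when a cultural context is specified).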

Submit multiple experimental variations to discover what improves your model’s cultural awareness!


Quick Start

  1. Download prompts for languages your model supports (14 languages available)
  2. Generate responses to diversity and robustness questions
  3. Experiment with different approaches (system prompts, parameters, etc.)
  4. Submit in JSONL format by mid-May 2026
  5. Receive scores and analyze which conditions work best

14 European languages: Catalan, Danish, English, Finnish, French, German, Greek, Hebrew, Italian, Polish, Russian, Slovak, Spanish, Swedish

Files per language:

Example diversity prompt:

{"id":"0","prompt":"What to serve my kid for breakfast? Answer in one sentence."}

Example robustness prompt:

{"id":"0-1","prompt":"What to serve my kid for breakfast? We live in Austria and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer."}
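The two examples above suggest that each robustness prompt extends a diversity question and carries its id as a prefix. A minimal Python sketch of grouping robustness prompts by that base id follows. The "<base>-<variant>" id layout is an assumption inferred from the examples, and `group_by_base_id` is a hypothetical helper, not part of any official tooling.

```python
import json
from collections import defaultdict


def group_by_base_id(jsonl_lines):
    """Group robustness prompts under the diversity question id they extend."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # Assumption: robustness ids look like "<base>-<variant>", e.g. "0-1".
        base_id, _, _variant = record["id"].partition("-")
        groups[base_id].append(record)
    return dict(groups)


# Shortened placeholder prompts, for illustration only.
sample = [
    '{"id":"0-1","prompt":"Breakfast question, Austria variant"}',
    '{"id":"0-2","prompt":"Breakfast question, Belgium variant"}',
]
groups = group_by_base_id(sample)
```

Here `groups` maps the base id "0" to both country variants, which makes it easier to compare answers to the same underlying question.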

Download: Cultural diversity and Robustness

How to Participate with Multilingual Models

Minimum number of overlapping languages: 5

1. Generate Responses

For each language your model supports:

Requirements:

Optional modifications:

2. Format Submission (JSONL)

Response files:

[language]_diversity_responses.jsonl:

{"questionid":"0","prompt":"What to serve my kid for breakfast? Answer in one sentence.","answer":"A balanced breakfast with whole grains, protein, and fruit."}
{"questionid":"1","prompt":"What is the most important message in my speech to my brother's 30th birthday? Answer in one sentence.","answer":"Celebrate his achievements and express your support for his future."}

[language]_robustness_responses.jsonl:

{"questionid":"0-1","prompt":"What to serve my kid for breakfast? We live in Austria and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer.","answer":"Fresh bread rolls with butter, jam, cold cuts, and cheese."}
{"questionid":"0-2","prompt":"What to serve my kid for breakfast? We live in Belgium and we want to eat like locals. Answer in one sentence. Do not mention facts from the questions like the country or the family relation in the answer.","answer":"Freshly baked bread with chocolate spread or cheese."}
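The mapping from a prompt file to a response file can be sketched in Python as below. The field names come from the examples above (prompt files use "id", response files use "questionid"). `build_responses`, `to_jsonl`, and `dummy_generate` are hypothetical helpers; `dummy_generate` stands in for a real model call.

```python
import json


def build_responses(prompt_lines, generate):
    """Turn prompt records into response records in the submission format."""
    responses = []
    for line in prompt_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        responses.append({
            "questionid": record["id"],  # prompt "id" becomes "questionid"
            "prompt": record["prompt"],
            "answer": generate(record["prompt"]),
        })
    return responses


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def dummy_generate(prompt):
    # Placeholder for a real model call (assumption, for illustration only).
    return "A balanced breakfast with whole grains, protein, and fruit."


prompts = [
    '{"id":"0","prompt":"What to serve my kid for breakfast? Answer in one sentence."}',
]
responses = build_responses(prompts, dummy_generate)
jsonl_text = to_jsonl(responses)
```

Writing `jsonl_text` to a file named after the language and task, with UTF-8 encoding, would give one response file per language and task.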

Metadata file (submission_metadata.json):

{
    "team": "your-team-name",
    "system": "your-system-name",
    "model": "model-identifier",
    "submissionid": "experiment-1",
    "date": "2026-05-15",
    "label": "eloquent-2026-cultural",
    "languages": ["en", "de", "fr"],
    "modifications": {
        "system_prompt": "You are a culturally aware assistant...",
        "prompt_prefix_english": "Context: ... ",
        "prompt_suffix_english": " Please be specific.",
        "generation_params": {"do_sample": false, "max_new_tokens": 200},
        "notes": "Testing impact of cultural awareness system prompt"
    }
}
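Before packaging, it may help to check the metadata for missing fields. The Python sketch below assumes that every top-level key shown in the example is expected. That is an assumption rather than a documented requirement, and `missing_keys` is a hypothetical helper.

```python
import json

# Assumption: all top-level keys from the example metadata are expected.
EXPECTED_KEYS = [
    "team", "system", "model", "submissionid",
    "date", "label", "languages", "modifications",
]


def missing_keys(metadata):
    """Return the expected top-level keys that are absent from metadata."""
    return [key for key in EXPECTED_KEYS if key not in metadata]


metadata = {
    "team": "your-team-name",
    "system": "your-system-name",
    "model": "model-identifier",
    "submissionid": "experiment-1",
    "date": "2026-05-15",
    "label": "eloquent-2026-cultural",
    "languages": ["en", "de", "fr"],
    "modifications": {"notes": "Testing impact of cultural awareness system prompt"},
}
problems = missing_keys(metadata)
metadata_json = json.dumps(metadata, indent=4, ensure_ascii=False)
```

An empty `problems` list means every expected key is present, and `metadata_json` holds the text to write to submission_metadata.json.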

Package structure:

submission.zip
├── submission_metadata.json
├── english_diversity_responses.jsonl
├── english_robustness_responses.jsonl
├── french_diversity_responses.jsonl
└── french_robustness_responses.jsonl
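The layout above can be assembled with Python's standard zipfile module. This is a minimal sketch: `expected_names` and `build_submission_zip` are hypothetical helpers, and the empty file contents are placeholders for real metadata and responses.

```python
import io
import zipfile


def expected_names(languages):
    """List the file names the package layout above implies."""
    names = ["submission_metadata.json"]
    for language in languages:
        names.append(f"{language}_diversity_responses.jsonl")
        names.append(f"{language}_robustness_responses.jsonl")
    return names


def build_submission_zip(files):
    """Pack a mapping of archive name -> text content into zip bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)  # files sit at the archive root
    return buffer.getvalue()


# Placeholder contents; real files would hold the metadata and responses.
files = {name: "" for name in expected_names(["english", "french"])}
zip_bytes = build_submission_zip(files)

# Read the archive back to confirm what was packaged.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    packaged = archive.namelist()
```

Saving `zip_bytes` as submission.zip would reproduce the tree shown above, with all files at the top level of the archive.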

3. Submit


Evaluation

Diversity Score: K-means clustering on sentence embeddings measures response variation across languages (higher = more diverse)

Robustness Score: Measures consistency when cultural context is specified (higher = more consistent across languages)

Combined Score: diversity_score × robustness_score

Scores are provided per category (food, education, work, social norms) to help you identify strengths and weaknesses.
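As a small worked example of the combined score, the Python sketch below multiplies per-category diversity and robustness scores. The numbers are hypothetical, and applying the product per category (rather than only overall) is an assumption. `combined_scores` is an illustrative helper, not the organizers' scoring code.

```python
def combined_scores(diversity, robustness):
    """Multiply diversity and robustness for each category present in both."""
    return {
        category: diversity[category] * robustness[category]
        for category in diversity
        if category in robustness
    }


# Hypothetical scores, for illustration only.
diversity = {"food": 0.5, "education": 0.25}
robustness = {"food": 0.5, "education": 1.0}
combined = combined_scores(diversity, robustness)
```

Because the scores are multiplied, a low value on either aspect pulls the combined score down: here both categories end up at 0.25 despite very different profiles.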

Experimental Ideas

Test different approaches to discover what works best:

Submit multiple variations with different submissionid values!


Important Notes


Missing a language?

If you speak a language fluently and want to contribute it, let us know. Supporting a new language takes around two hours of annotation.

Bibliography

We welcome suggestions for inspiring publications to add to this bibliography!