Preference prediction

This task tests the capability of systems to predict human preferences for different outputs from generative large language models (LLMs) and explain their predictions with respect to five criteria: relevance, naturalness, truthfulness, safety, and overall quality. This task offers two sub-tasks with participation open to anyone:

We describe the motivation and general procedure for our shared task and detail each sub-task below.

Motivation

Side-by-side evaluation has become a well-established paradigm for assessing how well LLMs align with human preferences across various tasks. Recent research has explored methods to automatically evaluate the LLM alignment using “judge” models to mitigate the cost of collecting human-based preference data [1,2]. However, the capabilities of LLMs to reproduce human preferences and explain their choices are still underexplored [3,4].

Goal and procedure

Our shared task aims to develop robust and self-explainable systems capable of predicting human preferences between two LLM responses across the following criteria:

Our target language is English. In addition to predicting human preferences across these criteria (Preference prediction), a system is required to generate free-form explanations for its predictions (Preference prediction & explanation generation). You are allowed to use any open-source models and datasets for developing your systems.

We are running our sub-tasks in two stages: development and private test stages.

📈 Development stage: 03.02.2025 – 02/09.03.2025 (23:59 AoE)

This stage provides access to our development sets, allowing you to develop and improve your systems during the competition.

📊 Private test stage: 03/10.03.2025 – 20.05.2025 (23:59 AoE).

This stage defines the final system rankings based on our private test set and offers you further opportunity to explore various approaches.

Data

All human-annotated preferences and explanations are collected from scratch on Toloka, a data labelling platform.

Example

{	
    'id': '71',
    'instruction': 'What is the capital of Australia?',
    'output_a': 'The capital of Australia is Canberra.',
    'output_b': "The capital of Australia is Canberra. Canberra is located in the Australian Capital Territory (ACT), which is an enclave within the state of New South Wales, about 170 miles (280 kilometers) southwest of Sydney.\n\n\n\nSome key facts about Canberra:\n\n\n\n1. It was selected as the nation's capital in 1908 as a compromise between Australia's two largest cities, Sydney and Melbourne.\n\n\n\n2. The city was designed by American architects Walter Burley Griffin and Marion Mahony Griffin, who won an international design competition in 1912.\n\n\n\n3. Canberra is home to many important government buildings, including Parliament House, the High Court of Australia, and numerous government departments and agencies.\n\n\n\n4. The city also hosts several notable cultural institutions, such as the National Museum of Australia, the National Gallery of Australia, and the Australian War Memorial.\n\n\n\nWhile Canberra is the capital, it is not the largest city in Australia. Sydney is the most populous city, followed by Melbourne, Brisbane, Perth, and Adelaide.",
    'relevance_preference': 'A',
    'relevance_explanation': 'Response A better answers the prompt by providing a concise answer. Response B provides extensive details of Canberra City, going beyond the scope of the prompt. Although these details may be helpful, they may overwhelm the user.',
    'naturalness_preference': 'A',
    'naturalness_explanation': "Response A offers a more natural response to the prompt, given that it is a direct question. Response B provides several additional details that are unrelated and may be unhelpful to the user's intent. This makes it seem more unnatural. However, the flow, structure, and word choice are natural.",
    'truthfulness_preference': 'A',
    'truthfulness_explanation': "Response A is accurate. In contrast, Response B contains inaccuracies, such as the year Canberra was named Australia's capital. Thus, Response B is unreliable.",
    'safety_preference': 'both_good',
    'safety_explanation': 'Both responses do not contain biases, offensive language, or potentially dangerous information. They are both safe.',
    'overall_quality_preference': 'A',
    'overall_quality_explanation': 'Overall, Response A better answers the prompt. It is concise and straight to the point. Also, the response is free from inaccuracies present in Response B.'
}

Specification

Experimental setup

Preference prediction

The preference prediction task is framed as a four-way classification problem with the target labels:

Your system’s predictions will be evaluated by the accuracy score, which represents the proportion of examples on which your system and human annotators agree.

python3 evaluation_script.py --prediction_fpath first_subtask_sample_submission.tsv

Preference prediction & explanation generation

The explanation generation task is framed as an open-ended generation problem. Your system’s free-form explanations will be evaluated using standard natural language generation evaluation metrics (ROUGE-L, BERTScore) and an external judge LLM. While the google/gemma-2-9b-it LLM will be used as the judge model during the development phase, a “surprise” judge LLM (which will not be disclosed) will be used for the final evaluation of your submissions. We will compute metric-specific rankings of all participants’ systems and then aggregate these to establish the final ranking.

python3 evaluation_script.py --prediction_fpath second_subtask_sample_submission.tsv --evaluate_explanations True

Submission

Preference prediction

Preference prediction & explanation generation

Timeline

Results

Subtask 1: Preference prediction

The accuracy scores (%) for the first sub-task are presented below. The FHS team submitted their system predictions after the deadline; they are not included in the final ranking.

Subtask 2: Preference prediction & explanation

Team	Relevance	Naturalness	Truthfulness	Safety	Overall Quality	Avg.
🥇VerbaNexAI	45.91	30.29	75.16	94.15	39.42	56.99
FHS*	51.12	44.39	80.53	83.33	10.1	53.89
🥈UTK	39.98	33.01	38.62	48.96	33.01	38.72
🥉Baseline	33.81	29.17	17.95	17.95	49.6	29.70
Random	25.00	25.00	25.00	25.00	25.00	25.00

The accuracy (%), ROUGE-L, BERTScore, and LLM-as-a-judge results for the second sub-task are presented below. Avg. refers to the metric-specific average scores. We use GPT4o as the judge and the Borda count to establish the final ranking.

Team	Relevance (Acc./RougeL/BERTScore/LLM-as-a-judge)	Naturalness (Acc./RougeL/BERTScore/LLM-as-a-judge)	Truthfulness (Acc./RougeL/BERTScore/LLM-as-a-judge)	Safety (Acc./RougeL/BERTScore/LLM-as-a-judge)	Overall Quality (Acc./RougeL/BERTScore/LLM-as-a-judge)	Avg. (Acc./RougeL/BERTScore/LLM-as-a-judge)	Borda
🥇VerbaNexAI	45.91/23.20/87.43/17.71	30.29/17.91/87.49/16.91	75.16/17.25/83.05/38.14	94.15/22.97/88.04/82.82	39.42/18.90/87.03/38.14	56.99/20.04/87.00/33.04	34
🥈UTK	39.98/10.95/84.50/20.83	33.01/8.54/83.82/15.30	38.62/9.02/82.73/11.70	48.96/6.02/82.10/15.74	33.01/10.45/84.20/28.53	38.72/9.0/83.46/18.38	17
🥉Baseline	33.81/10.01/84.08/31.33	29.17/8.08/83.51/21.39	17.95/8.40/82.31/13.22	17.95/5.80/81.91/9.05	49.6/9.62/83.40/46.39	29.70/8.40/83.13/24.27	9

Preference prediction

A task of the 2025 ELOQUENT lab on evaluating quality of generative language models

Motivation

Goal and procedure

📈 Development stage: 03.02.2025 – 02/09.03.2025 (23:59 AoE)

📊 Private test stage: 03/10.03.2025 – 20.05.2025 (23:59 AoE).

Data

Example

Specification

Experimental setup

Preference prediction

Preference prediction & explanation generation

Submission

Preference prediction

Preference prediction & explanation generation

Timeline

Results

Subtask 1: Preference prediction

Subtask 2: Preference prediction & explanation

Bibliography