COLE comprises 23 tasks, each designed to test one or more facets of language understanding in machine learning. Each task is described in more detail below.

Allo-ciné.ca

Allo-ciné tests language understanding through sentiment classification: the model is given movie reviews that are either positive or negative, and must assign the correct sentiment to each review.

Metric(s) : Accuracy

DACCORD

Predict whether the two sentences are compatible (0) or contradict each other (1).

Metric(s) : Accuracy
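
Accuracy, the metric used by most tasks in this list, is the fraction of examples whose predicted label matches the gold label. A minimal sketch in Python using DACCORD's 0/1 labels; the function name and the toy label lists are illustrative assumptions, not COLE's own evaluation code:

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy DACCORD-style labels: 0 = compatible, 1 = contradiction (made-up data).
gold = [0, 1, 1, 0]
pred = [0, 1, 0, 0]
print(accuracy(pred, gold))  # 0.75 (3 of 4 labels match)
```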

FQuAD - French Question Answering Dataset

FQuAD consists of question/answer pairs built from high-quality Wikipedia articles. The goal of this task is to accurately predict whether the answer to the question can be found in the provided article.

Metric(s) : F1 Score, Exact Match Ratio
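
For question answering, F1 and exact match are often computed at the token level, in the style popularized by SQuAD. The sketch below assumes that convention with a deliberately simple normalization (lowercasing and whitespace splitting); the function names and the example strings are hypothetical, and COLE's actual scoring may normalize text differently:

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction contains the reference plus two extra tokens.
print(exact_match("en 1889 à Paris", "en 1889"))
print(token_f1("en 1889 à Paris", "en 1889"))
```

Exact match rewards only a perfect answer, while F1 gives partial credit when the predicted span overlaps the reference, which is why the two are usually reported together.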

French BoolQ

Predict whether the context supports answering 'yes' to the question (1), or whether it supports only 'no' or does not answer the question (0).

Metric(s) : Accuracy

FraCaS

Natural language inference task : predict the relation between two sentences (implication, neutral, contradiction).

Metric(s) : Accuracy

GQNLI-Fr - The Generalized Quantifier NLI Challenge Dataset

The dataset consists of carefully constructed premise-hypothesis pairs. Each hypothesis logically follows from the premise, contradicts it, or is neutral.

Metric(s) : Accuracy

LingNLI

LingNLI is a Natural Language Inference corpus collected by putting a linguist 'in the loop' to dynamically introduce novel constraints during data collection, aiming to mitigate the systematic gaps and biases often found in crowdsourced datasets.

Metric(s) : Accuracy

MMS - Massive Multilingual Sentiment Corpora

A massive multilingual sentiment analysis corpus in 27 languages.

Metric(s) : Accuracy

MNLI-NineEleven-FR-MT

Predict the relation between two sentences (entailment, neutral, contradiction).

Metric(s) : Accuracy

MultiBLiMP-Fr - Multilingual Linguistic Minimal Pairs

A grammaticality judgment task using the French subset of the Multilingual Benchmark of Linguistic Minimal Pairs. Each instance is a minimal pair (one grammatical sentence and one ungrammatical sentence) differing by a single targeted feature. The model must select the grammatically correct sentence. This task probes fine-grained knowledge of French syntax, morphology, and agreement.

Metric(s) : Accuracy
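
One common way to evaluate minimal pairs is to score both sentences and count the pair as correct when the grammatical one receives the higher score. The sketch below illustrates that idea only: `toy_score` is a hypothetical stand-in for a real model score (such as a language model log-probability), and the bigram table and the example pair are made up:

```python
# Hypothetical table of "attested" bigrams standing in for model knowledge.
ATTESTED_BIGRAMS = {("les", "chats"), ("chats", "dorment")}


def toy_score(sentence):
    """Count attested bigrams; a placeholder for a real model score."""
    tokens = sentence.lower().rstrip(".").split()
    return sum(bigram in ATTESTED_BIGRAMS for bigram in zip(tokens, tokens[1:]))


def prefers_grammatical(grammatical, ungrammatical, score=toy_score):
    """True when the grammatical sentence scores strictly higher."""
    return score(grammatical) > score(ungrammatical)


# Made-up pair differing only in subject-verb agreement.
pairs = [("Les chats dorment.", "Les chats dort.")]
pair_accuracy = sum(prefers_grammatical(g, u) for g, u in pairs) / len(pairs)
print(pair_accuracy)
```

Requiring a strictly higher score means a tie counts as an error, so a model that cannot distinguish the two sentences gets no credit.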

PAWS - Paraphrase Adversaries from Word Scrambling

This task tests paraphrase identification: given two sentences, the model must determine whether they are equivalent in meaning.

Metric(s) : Accuracy

PIAF - The French-Language Dataset of Questions-Answers

This task consists of pairs of questions and answer texts, each annotated with where in the answer text the relevant information is located.

Metric(s) : F1 Score, Exact Match Ratio

QFrBLiMP - Quebec-French Linguistic Minimal Pairs

This task gives the model sentence pairs. The goal is to determine if the sentences are semantically equivalent, even with slightly different syntax and words.

Metric(s) : Accuracy

QFrCoLA - a Quebec-French Corpus of Linguistic Acceptability Judgments

QFrCoLA is a French dataset sourced from multiple linguistic sites such as académie-française.fr and vitrinelinguistique.com. It aims to test models’ ability to determine grammatical correctness. The answer is a binary label indicating if the sentence is correct or not.

Metric(s) : Accuracy

QFRCoRE - Quebec-French Corpus of Regional Expressions

Match the Quebec expression with its definition from a list.

Metric(s) : Accuracy

QFRCoRT - Quebec-French Corpus of Regional Terms

Match the Quebec term with its definition from a list.

Metric(s) : Accuracy

RTE3-French

Predict the relation between two sentences (entailment, neutral, contradiction).

Metric(s) : Accuracy

Sick-FR - French Sentences Involving Compositional Knowledge

This task provides pairs of sentences annotated on two dimensions: relatedness (scored from 1 to 5) and entailment (entails, contradicts, or neutral).

Metric(s) : Pearson
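
The Pearson correlation measures how closely predicted scores track the gold scores linearly, from -1 (perfectly inverse) to 1 (perfectly aligned). A minimal pure-Python sketch follows; the function name and the toy relatedness scores are illustrative assumptions rather than COLE's evaluation code:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and sums of squared deviations for each sequence.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up relatedness scores on the 1-to-5 scale.
gold = [1.0, 2.5, 4.0, 5.0]
pred = [1.2, 2.0, 4.1, 4.8]
print(round(pearson(pred, gold), 3))
```

Because the coefficient is invariant to shifting and positive rescaling, a model is rewarded for ranking and spacing pairs like the annotators did, even if its raw scores sit on a different scale.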

Sts22-Crosslingual - Multilingual News Article Similarity

This task evaluates whether pairs of news articles, written in different languages, cover the same story. It focuses on document-level similarity, where systems rate article pairs on a 4-point scale from most to least similar.

Metric(s) : Pearson

WiNo-X LM - Pronoun Resolution

Predict the correct referent (1 or 2) of a pronoun in a sentence by choosing between two candidates.

Metric(s) : Accuracy

WiNo-X MT - Pronoun Resolution

Choose which of two French translations uses the correct pronoun (il/elle) based on the intended referent in the original English sentence.

Metric(s) : Accuracy

WSD-Fr - Word Sense Disambiguation

WSD-Fr, part of the FLUE benchmark, is a word sense disambiguation task in which the model must identify the correct meaning of an ambiguous verb in context.

Metric(s) : Exact Match Ratio

XNLI - The Cross-Lingual NLI Corpus

This task consists of pairs of sentences where the goal is to determine the relation between the two: entailment, neutral, or contradiction.

Metric(s) : Accuracy