COLE comprises 23 tasks, each of which tests one or more facets of language understanding in machine learning. Each task is described in more detail below.
Allo-ciné.ca
Allo-ciné tests language understanding through sentiment classification of movie reviews, each of which is either positive or negative. The task is to predict the correct sentiment for each review.
Metric(s) : Accuracy
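Accuracy is the metric reported for most tasks in this list. As a minimal sketch of what it measures (not the benchmark's own scoring code), the function below computes the fraction of predictions that match the reference labels. The function name and the 1 = positive / 0 = negative encoding are assumptions made for illustration.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels (assumed encoding): 1 = positive review, 0 = negative review.
gold = [1, 0, 1, 1]
pred = [1, 0, 0, 1]
print(accuracy(pred, gold))  # 0.75 (3 of 4 predictions are correct)
```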
DACCORD
Predict whether the two sentences are compatible (0) or contradict each other (1).
Metric(s) : Accuracy
FQuAD - French Question Answering Dataset
FQuAD consists of question/answer pairs built on high-quality Wikipedia articles. The goal of this task is to accurately predict whether the answer to the question can be found in the provided article.
Metric(s) : F1 Score, Exact Match Ratio
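The list names F1 Score and Exact Match Ratio without defining them. The sketch below assumes the token-overlap definitions commonly used for SQuAD-style question answering; the benchmark's actual normalization and scoring may differ. The function names and the simplified normalization (lowercasing and whitespace collapsing only) are assumptions.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (a simplified, assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair: the prediction contains one extra token.
print(exact_match("la tour Eiffel", "tour Eiffel"))  # 0.0
print(round(token_f1("la tour Eiffel", "tour Eiffel"), 2))  # 0.8
```

Exact match rewards only answers that are identical after normalization, while F1 gives partial credit for overlapping tokens, which is presumably why the two are reported together.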
French BoolQ
Answer whether the context supports a 'yes' answer to the question (1), or whether it supports only 'no' or does not answer the question (0).
Metric(s) : Accuracy
FraCaS
Natural language inference task : predict the relation between two sentences (implication, neutral, contradiction).
Metric(s) : Accuracy
GQNLI-Fr - The Generalized Quantifier NLI Challenge Dataset
The dataset consists of carefully constructed premise-hypothesis pairs. Each hypothesis logically follows from the premise, contradicts it, or is neutral.
Metric(s) : Accuracy
LingNLI
LingNLI is a Natural Language Inference corpus collected by putting a linguist 'in the loop' to dynamically introduce novel constraints during data collection, aiming to mitigate the systematic gaps and biases often found in crowdsourced datasets.
Metric(s) : Accuracy
MMS - Massive Multilingual Sentiment Corpora
A massive multilingual sentiment analysis corpus in 27 languages.
Metric(s) : Accuracy
MNLI-NineEleven-FR-MT
Predict the relation between two sentences (entailment, neutral, contradiction).
Metric(s) : Accuracy
MultiBLiMP-Fr - Multilingual Linguistic Minimal Pairs
A grammaticality judgment task using the French subset of the Multilingual Benchmark of Linguistic Minimal Pairs. Each instance is a minimal pair (one grammatical sentence and one ungrammatical sentence) differing by a single targeted feature. The model must select the grammatically correct sentence. This task probes fine-grained knowledge of French syntax, morphology, and agreement.
Metric(s) : Accuracy
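One common way to run a minimal-pair evaluation is to score both sentences with the model and count the pair as correct when the grammatical sentence receives the higher score. The sketch below illustrates that idea only; the source does not say how the benchmark elicits the model's choice. The `score` callable, the toy scores, and the example sentences are all hypothetical stand-ins for a real model's sentence log-probabilities.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence gets the strictly higher score."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical scores standing in for model log-probabilities.
TOY_SCORES = {
    "Les chats dorment.": -10.2,
    "Les chats dort.": -14.7,
    "Elle est contente.": -9.1,
    "Elle est content.": -8.5,
}

pairs = [
    ("Les chats dorment.", "Les chats dort."),   # toy scorer prefers the grammatical form
    ("Elle est contente.", "Elle est content."),  # toy scorer prefers the ungrammatical form
]
print(minimal_pair_accuracy(pairs, TOY_SCORES.get))  # 0.5
```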
PAWS: Paraphrase Adversaries from Word Scrambling
This task tests paraphrase identification: given two sentences, the model must decide whether they are equivalent in meaning.
Metric(s) : Accuracy
PIAF - The French-Language Dataset of Questions-Answers
This task consists of pairs of questions and answer texts, annotated with the location of the relevant information within the answer text.
Metric(s) : F1 Score, Exact Match Ratio
QFrBLiMP - Quebec-French Linguistic Minimal Pairs
This task gives the model sentence pairs. The goal is to determine if the sentences are semantically equivalent, even with slightly different syntax and words.
Metric(s) : Accuracy
QFrCoLA - a Quebec-French Corpus of Linguistic Acceptability Judgments
QFrCoLA is a French dataset sourced from multiple linguistic sites such as académie-française.fr and vitrinelinguistique.com. It aims to test models’ ability to determine grammatical correctness. The answer is a binary label indicating if the sentence is correct or not.
Metric(s) : Accuracy
QFRCoRE: Quebec-French Corpus of Regional Expressions
Match the Quebec expression with its definition from a list.
Metric(s) : Accuracy
QFRCoRT: Quebec-French Corpus of Regional Terms
Match the Quebec term with its definition from a list.
Metric(s) : Accuracy
RTE3-French
Predict the relation between two sentences (entailment, neutral, contradiction).
Metric(s) : Accuracy
Sick-FR - French Sentences Involving Compositional Knowledge
This task consists of pairs of sentences annotated on two dimensions: relatedness (scored 1 to 5) and entailment (choices: entails, contradicts, neutral).
Metric(s) : Pearson
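The Pearson metric measures how linearly a model's predicted relatedness scores track the reference scores, ranging from -1 to 1. The following is a minimal pure-Python sketch of the standard Pearson correlation coefficient, not the benchmark's own implementation; the function name and the example scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relatedness scores on the 1 to 5 scale.
gold = [1.0, 2.5, 4.0, 5.0]
pred = [1.5, 2.0, 3.5, 4.8]
print(pearson(pred, gold))
```

Because the coefficient is invariant to shifting and rescaling the predictions, it rewards getting the relative ordering and spacing of scores right rather than matching their absolute values.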
Sts22-Crosslingual - Multilingual News Article Similarity
This task evaluates whether pairs of news articles, written in different languages, cover the same story. It focuses on document-level similarity, where systems rate article pairs on a 4-point scale from most to least similar.
Metric(s) : Pearson
WiNo-X LM - Pronoun Resolution
Predict the correct referent (1 or 2) of a pronoun in a sentence by choosing between two candidates.
Metric(s) : Accuracy
WiNo-X MT - Pronoun Resolution
Choose which of two French translations uses the correct pronoun (il/elle) based on the intended referent in the original English sentence.
Metric(s) : Accuracy
WSD-Fr : Word Sense Disambiguation
WSD-Fr is a word sense disambiguation task where the model must identify the correct meaning of an ambiguous verb in context, as part of the FLUE benchmark.
Metric(s) : Exact Match Ratio
XNLI - The Cross-Lingual NLI Corpus
This task consists of pairs of sentences where the goal is to determine the relation between the two: entailment, neutral, or contradiction.
Metric(s) : Accuracy