Extending the Massive Text Embedding Benchmark to French: the datasets

Community Article Published January 12, 2024

Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, Gabriel Sequeira, and Wissam Siblini

Introduction

With the recent boom of natural language applications, the ability to select a method that generates high-quality text representations has become crucial. To help with this, the Massive Text Embedding Benchmark (MTEB) [1] was introduced. It allows the evaluation and comparison of text embedding methods on various NLP tasks and datasets. An embedding is a dense vector representation that captures the semantic meaning of a text and can be used for downstream NLP tasks such as text classification, information retrieval, machine translation, etc. MTEB originally compared 33 different models on 8 different tasks: bitext mining, classification, pair classification, retrieval, reranking, clustering, summarization and semantic textual similarity. Overall, it gathered 58 datasets across these tasks, most of them in English.

We extend this work to the French language. The project is available here 👉 https://github.com/Lyon-NLP/mteb-french. In order to compare embeddings obtained from texts in French, we identified 14 relevant datasets and created 3 new ones targeting the set of tasks used in MTEB. Of course, these datasets can also be used for a wide range of other applications.

This first article of a series focuses on presenting these datasets, their characteristics and the goal behind each task. By bringing all the information together in one place, we hope to make it easier to search for French datasets for NLP and encourage the evaluation of French embeddings.

So if you are building an NLP model or application targeting the French language, this article is for you! 😉

Datasets

In MTEB, the evaluation is divided into the 8 distinct tasks mentioned above. We will present the datasets according to these 8 tasks, but keep in mind that a given dataset could potentially be used in other ways.

Overview of the datasets for embedding evaluation in French. Monolingual datasets are shown in blue and multilingual ones in purple.

Bitext Mining

Given two sets of sentences, this task aims to find, for each sentence in the first set, its best match in the second set. Generally, the second set contains translations of the sentences from the first set. For the MTEB evaluation, a model is used to embed each sentence, and the closest pairs are then found using cosine similarity. The main metric computed for this evaluation is the F1 score.
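
To make the protocol concrete, here is a minimal sketch using the sentence-transformers library. The model name and the toy sentence pairs are placeholders, not the ones used in the benchmark.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholder multilingual model; any embedding model could be plugged in here.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

src = ["Le chat dort sur le canapé.", "Il pleut depuis ce matin."]
tgt = ["It has been raining since this morning.", "The cat is sleeping on the sofa."]
gold = np.array([1, 0])  # src[i] is aligned with tgt[gold[i]]

# Embed both sets; with L2-normalized vectors, the dot product equals cosine similarity.
src_emb = model.encode(src, normalize_embeddings=True)
tgt_emb = model.encode(tgt, normalize_embeddings=True)

# For each source sentence, pick the most similar target sentence.
predictions = (src_emb @ tgt_emb.T).argmax(axis=1)

# With exactly one prediction per source sentence, accuracy coincides with micro-F1.
print("score:", (predictions == gold).mean())
```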

Classification

For the classification task, we evaluate which embedding models are best suited for identifying which class a sentence belongs to, based on its vector representation. To that end, a model is used to embed a train and a test set. Then, a logistic regression classifier is trained on the train set and evaluated on the test set, using accuracy as the metric.
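
A minimal sketch of this protocol, assuming a sentence-transformers model and toy sentiment data (both are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder multilingual model and toy sentiment data.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

train_texts = ["Quel temps magnifique !", "Ce film était décevant.",
               "J'adore ce restaurant.", "Service très lent et impoli."]
train_labels = [1, 0, 1, 0]
test_texts = ["Une expérience formidable.", "Vraiment mauvais."]
test_labels = [1, 0]

# Embed both splits, fit a logistic regression on the train embeddings,
# and report accuracy on the test embeddings.
X_train = model.encode(train_texts)
X_test = model.encode(test_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```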

Pair classification

In this task, a pair of sentences is given with a label denoting whether the two sentences are duplicates or paraphrases of each other. Both sentences are embedded using a model, and the similarity between them is computed using several metrics such as cosine similarity, Euclidean distance, etc. The evaluation metric for this task is the average precision based on cosine similarity.
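
As an illustration, a small sketch of this scoring scheme with placeholder pairs and model:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics import average_precision_score

# Placeholder model and toy duplicate-question pairs.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

pairs = [
    ("Comment réinitialiser mon mot de passe ?", "Comment changer mon mot de passe ?"),
    ("Quelle heure est-il ?", "Où se trouve la gare ?"),
    ("Quel est le prix du billet ?", "Combien coûte le billet ?"),
    ("Le train est-il à l'heure ?", "Aimez-vous le chocolat ?"),
]
labels = [1, 0, 1, 0]  # 1 = duplicate/paraphrase, 0 = unrelated

# With normalized embeddings, the row-wise dot product is the cosine similarity.
emb1 = model.encode([p[0] for p in pairs], normalize_embeddings=True)
emb2 = model.encode([p[1] for p in pairs], normalize_embeddings=True)
scores = (emb1 * emb2).sum(axis=1)

print("average precision:", average_precision_score(labels, scores))
```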

Retrieval

Given a query, the retrieval task aims to find the most relevant documents (often paragraphs) in a corpus, using cosine similarity between vectors. Benchmarking models on this task is particularly interesting given their use in Retrieval-Augmented Generation (RAG) pipelines. Several metrics are used to evaluate this task, the main one being the Normalized Discounted Cumulative Gain (NDCG@10).
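
A minimal sketch of this evaluation, assuming a placeholder model, query, corpus and relevance judgments:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics import ndcg_score
import numpy as np

# Placeholder model, query, corpus and relevance judgments.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "Comment calculer l'aire d'un triangle ?"
corpus = [
    "L'aire d'un triangle vaut la base fois la hauteur divisée par deux.",
    "La Révolution française a débuté en 1789.",
    "Un triangle rectangle possède un angle droit.",
]
relevance = np.array([[2, 0, 1]])  # graded relevance of each document for the query

# Rank documents by cosine similarity to the query and score the ranking.
q_emb = model.encode([query], normalize_embeddings=True)
d_emb = model.encode(corpus, normalize_embeddings=True)
scores = q_emb @ d_emb.T  # shape (n_queries, n_documents)

print("NDCG@10:", ndcg_score(relevance, scores, k=10))
```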

Reranking

The goal of the reranking task is to rank a small set of documents by relevance with respect to a given query. Reranking is often used in recommender systems, or as a complement to the retrieval task. In the context of MTEB, the aim is to evaluate the models' ability to produce embeddings whose cosine similarity correlates with each document's relevance to the query.

To evaluate this task, each dataset in the original MTEB benchmark is composed of queries, each paired with a few positive (i.e. relevant) documents and some negative (i.e. irrelevant) documents. Despite our efforts, we did not find any relevant French dataset structured this way. Consequently, we decided to build our own using the AlloProf and Syntec retrieval datasets. These already contain queries and positive documents, so we applied the following process to generate the negatives: we embedded the corpus of documents and the queries with an embedding model, then computed the cosine similarity between every document and every query. Documents outside a query's top-10 most similar documents were labeled as negatives for that query.
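
The following sketch illustrates that negative-mining step; the model name and the placeholder corpus are illustrative, not the exact setup used for the AlloProf and Syntec reranking datasets.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholder model, queries and corpus.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

queries = ["Comment fonctionne la photosynthèse ?"]
corpus = [f"document {i}" for i in range(50)]  # stand-in for the retrieval corpus

q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(corpus, normalize_embeddings=True)
similarities = q_emb @ d_emb.T  # shape (n_queries, n_documents)

# Documents outside a query's top-10 most similar documents become its negatives.
negatives = []
for sims in similarities:
    top10 = set(np.argsort(-sims)[:10])
    negatives.append([corpus[i] for i in range(len(corpus)) if i not in top10])

print(len(negatives[0]), "negative documents for the first query")
```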

Clustering

This task aims to group sentences or paragraphs into meaningful clusters. To do so, the texts are embedded and a k-means model is fitted with the known number of clusters.

The metric used to score the model is the v-measure, which does not depend on the cluster labels (it is invariant to label permutations).
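
A minimal sketch of this protocol, with a placeholder model and toy texts from two topics:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

# Placeholder model and toy texts from two topics (cooking vs. sport).
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

texts = ["Recette de tarte aux pommes", "Gâteau au chocolat facile",
         "Résultats du match de football", "Le tournoi de tennis commence demain"]
true_labels = [0, 0, 1, 1]

# Embed the texts, run k-means with the known number of clusters,
# and score the clustering with the v-measure.
embeddings = model.encode(texts)
pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print("v-measure:", v_measure_score(true_labels, pred_labels))
```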

Summarization

This task aims to score a machine-generated summary based on its similarity to human-written summaries. To do that, all summaries are embedded, and the cosine similarity between each machine-generated summary and the human-written summaries is computed. The highest similarity is kept as the machine-generated summary's score. Pearson and Spearman correlations with the ground-truth human assessments are used to evaluate the computed scores. This task is close to the STS one.
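
A small sketch of this scoring scheme, with a placeholder model, toy summaries and made-up human relevance scores:

```python
from sentence_transformers import SentenceTransformer
from scipy.stats import pearsonr, spearmanr

# Placeholder model, summaries and human relevance scores.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

human_summaries = ["Le gouvernement annonce une baisse des impôts.",
                   "Les impôts vont baisser selon le gouvernement."]
machine_summaries = ["Baisse des impôts annoncée par le gouvernement.",
                     "Le gouvernement pourrait réduire certains impôts.",
                     "Le chat dort sur le canapé."]
human_scores = [4.5, 3.5, 1.0]  # average relevance given by annotators

# Each machine summary is scored with its best cosine similarity to the human summaries.
h_emb = model.encode(human_summaries, normalize_embeddings=True)
m_emb = model.encode(machine_summaries, normalize_embeddings=True)
model_scores = (m_emb @ h_emb.T).max(axis=1)

print("pearson:", pearsonr(model_scores, human_scores)[0])
print("spearman:", spearmanr(model_scores, human_scores)[0])
```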

  • SummEval [14]: This dataset consists of 100 news articles from the CNN/DailyMail dataset. Each news article comes with 10 human-written summaries and 16 machine-generated summaries, annotated by 8 annotators for coherence, consistency, fluency, and relevance. As this dataset is only available in English, we translated it into French using DeepL. Human-written and machine-generated summaries are embedded and compared with cosine similarity, and the average relevance from the expert annotations is used as the ground-truth assessment.

    Link to dataset: https://huggingface.co/datasets/lyon-nlp/summarization-summeval-fr-p2p

Semantic Textual Similarity (STS)

This task aims to compute a continuous similarity score for a pair of sentences. Here, pairs of sentences are labeled with a score between 1 and 5 (lower means low similarity, higher means high similarity). The score derived from the embeddings is compared with the labeled score using Pearson and Spearman correlations.
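
A minimal sketch of this evaluation, assuming a placeholder model, sentence pairs and gold scores:

```python
from sentence_transformers import SentenceTransformer
from scipy.stats import pearsonr, spearmanr

# Placeholder model, sentence pairs and gold similarity scores.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

pairs = [("Un homme joue de la guitare.", "Une personne joue d'un instrument."),
         ("Un chien court dans le parc.", "La bourse a chuté aujourd'hui."),
         ("Elle prépare un gâteau.", "Elle cuisine un dessert.")]
gold = [4.2, 0.5, 4.0]

# Cosine similarity between the two embeddings of each pair is the model's score.
emb1 = model.encode([p[0] for p in pairs], normalize_embeddings=True)
emb2 = model.encode([p[1] for p in pairs], normalize_embeddings=True)
scores = (emb1 * emb2).sum(axis=1)

print("pearson:", pearsonr(scores, gold)[0])
print("spearman:", spearmanr(scores, gold)[0])
```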

  • STS Benchmark Multilingual: This dataset is a mix of several English datasets used in the STS tasks organized as part of SemEval between 2012 and 2017. It includes text from image captions, news headlines and user forums, as pairs of sentences with a similarity score. The English version was translated into French with DeepL. We use the 1,379 samples of the test set.

    Link to dataset: https://huggingface.co/datasets/stsb_multi_mt

  • STS22 Crosslingual: This dataset consists of pairs of news articles, with their similarity labeled with a score between 0 and 5. It covers 10 languages. For the French evaluation of models, we only use the French subset, which is composed of 104 article pairs.

    Link to dataset: https://huggingface.co/datasets/mteb/sts22-crosslingual-sts/viewer/fr

  • SICK-FR: The Sentences Involving Compositional Knowledge (SICK) dataset consists of about 10,000 English sentence pairs that include many examples of lexical, syntactic and semantic phenomena. Each pair is labeled with both a "relatedness in meaning" score (on a 5-point rating scale) and an entailment relation (with three possible gold labels: entailment, contradiction, and neutral). For the purpose of this benchmark, we use SICK-FR, a French translation of SICK, along with the "relatedness in meaning" score.

    Link to dataset: https://huggingface.co/datasets/Lajavaness/SICK-fr

Conclusion

We started this initiative after realizing that it was often difficult to select the right NLP method for French applications. Of course, there are many good multilingual models out there. But when looking closely at their training process, it appears that most of the training data is actually in English. And since the benchmarks evaluating these models are also in English, their performance in French is hard to assess.

One of the reasons for this might be the lack of good-quality French datasets. Indeed, many French datasets are either too specialized in a specific domain to be used in a benchmark, or do not come in a "ready-to-use" format and require significant cleaning and formatting work.

Identifying and preparing relevant French datasets for MTEB-French was not a trivial task, and we hope that this work can help the community accelerate the evaluation of models. The next step in the MTEB-French implementation is to identify relevant models to evaluate.

The models chosen, along with justification for their selection, will be the topic of the next article. Stay tuned! 😎

Bibliography

[1] Muennighoff, Niklas et al. “MTEB: Massive Text Embedding Benchmark.” Conference of the European Chapter of the Association for Computational Linguistics (2022).

[2] Bawden, Rachel et al. “DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation.” Language Resources and Evaluation 55 (2019): 635 - 660.

[3] NLLB Team et al. “No Language Left Behind: Scaling Human-Centered Machine Translation.” ArXiv abs/2207.04672 (2022).

[4] Goyal, Naman et al. “The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation.” Transactions of the Association for Computational Linguistics 10 (2021): 522-538.

[5] Guzmán, Francisco et al. “Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English.” ArXiv abs/1902.01382 (2019).

[6] Adelani, David Ifeoluwa et al. “MasakhaNEWS: News Topic Classification for African languages.” ArXiv abs/2304.09972 (2023).

[7] FitzGerald, Jack G. M. et al. “MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages.” Annual Meeting of the Association for Computational Linguistics (2023).

[8] Bastianelli, Emanuele et al. “SLURP: A Spoken Language Understanding Resource Package.” Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)

[9] Xia, Menglin and Emilio Monti. “Multilingual Neural Semantic Parsing for Low-Resourced Languages.” Conference on Lexical and Computational Semantics (2021).

[10] Creutz, Mathias. “Open Subtitles Paraphrase Corpus for Six Languages.” Conference on Language Resources and Evaluation (LREC 2018).

[11] Lefebvre-Brossard, Antoine et al. “Alloprof: a new French question-answer education dataset and its use in an information retrieval case study.” ArXiv abs/2302.07738 (2023).

[12] Louis, Antoine et al. “A Statutory Article Retrieval Dataset in French.” Annual Meeting of the Association for Computational Linguistics (2021).

[13] Scialom, Thomas et al. “MLSUM: The Multilingual Summarization Corpus.” Conference on Empirical Methods in Natural Language Processing (2020).

[14] Fabbri, A. R. et al. “SummEval: Re-evaluating Summarization Evaluation.” Transactions of the Association for Computational Linguistics 9 (2020): 391-409.