A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs


Data creation and verification

To construct ECLeKTic, we started by selecting Wikipedia articles that exist in only a single language, for each of 12 languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish. These pages are often based on topics most salient to speakers of that language, but they may well include information of interest to others around the world. Of course, models may learn about these topics from other sources, but since it is not possible to analyze the training data of every LLM, we use presence in Wikipedia as a proxy for whether a model has seen information in a particular language. Under this assumption, focusing on this kind of content means that a model would need to internally transfer knowledge from the source language to the other 11 target languages in order to solve ECLeKTic's QA task.

Specifically, we analyzed the July 2023 snapshot of Wikipedia. For each language, we selected 100 random articles that contained at least 200 characters, had at least 100 views during 2023, and, most importantly, did not have equivalent articles in any of the other 11 languages. From each selected article we extracted the first ten sentences. Human annotators then filtered and corrected question-answer pairs that Gemini generated from a single fact mentioned in these sentences. The annotators, each a native speaker of the relevant language, first ensured that the question was answerable in a closed-book setting, i.e., that it neither referred explicitly to the surrounding context in the Wikipedia article nor mentioned the answer. Second, they validated that the question related to information particularly salient to speakers of the language in question, rather than to general knowledge such as science or current events. Questions and answers that did not meet these criteria were discarded. Third, in a process called decontextualization, the annotators confirmed that the question contains all the information needed to remain answerable when translated. For example, a question in Hebrew relating to the “supreme court” was disambiguated by the annotators to explicitly mention “the Israeli supreme court”. Named entities were clarified similarly, so a question referring to “Ambev” was modified to refer to “the Brazilian brewing company, Ambev”.
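The article-selection step can be sketched as a simple filter. This is an illustrative sketch only, assuming a generic per-article record: the field names (`num_chars`, `views_2023`, `interlanguage_links`) and the sampling helper are hypothetical, not the actual dump-processing tooling.

```python
import random

# The 12 ECLeKTic languages, as ISO codes (assumed mapping).
ALL_LANGUAGES = [
    "en", "fr", "de", "he", "hi", "id",
    "it", "ja", "ko", "zh", "pt", "es",
]

def eligible(article, source_lang):
    """Apply the three selection criteria described above."""
    other_langs = set(ALL_LANGUAGES) - {source_lang}
    has_equivalent = bool(set(article["interlanguage_links"]) & other_langs)
    return (
        article["num_chars"] >= 200        # long enough to contain a usable fact
        and article["views_2023"] >= 100   # minimally popular during 2023
        and not has_equivalent             # unique to this language's Wikipedia
    )

def sample_articles(articles, source_lang, k=100, seed=0):
    """Draw k random eligible articles for one source language."""
    pool = [a for a in articles if eligible(a, source_lang)]
    return random.Random(seed).sample(pool, k)
```

The key criterion is the last one: an article linked to an equivalent page in any of the other 11 languages is rejected, since its content would no longer serve as language-exclusive knowledge.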

Finally, each retained question-answer pair was automatically translated into the other 11 languages. A second set of human annotators verified the translations and modified them when needed. At this stage, some examples were also discarded if they proved untranslatable, for example, when a question explicitly refers to the meaning of a word in the source language.

Based on this approach, the final ECLeKTic dataset consists of 384 unique questions and 4224 translated examples, i.e., each question rendered in the 11 languages other than its source.
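The example count follows directly from the construction: each retained question is expanded into one example per target language. A minimal sketch, assuming each question is a record carrying its source-language code:

```python
# ISO codes for the 12 ECLeKTic languages (assumed mapping).
LANGUAGES = ["en", "fr", "de", "he", "hi", "id",
             "it", "ja", "ko", "zh", "pt", "es"]

def expand(questions):
    """Emit one translated example per (question, target language) pair,
    skipping the question's own source language."""
    return [
        {**q, "target_lang": target}
        for q in questions
        for target in LANGUAGES
        if target != q["source_lang"]
    ]
```

With the 384 retained questions, this expansion yields 384 × 11 = 4224 translated examples, matching the released dataset size.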
