GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.

Preprint, 2022, arXiv, Open access

Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou

Computation and Language, Machine Learning, Artificial Intelligence

Signal facts

What is known right now

Open access, 77 authors, 3 topics

Citations: 0, Reviews: 0, Saves: 0, Code: not linked, Dataset: not linked, Institutions: 0
