Source author record

Steffen Eger

Steffen Eger appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Computation and Language, math.CO, Discrete Mathematics, Machine Learning, Artificial Intelligence, Computer Vision, math.PR, Multiagent Systems, nlin.AO, Social and Information Networks, Cryptography and Security, Data Structures and Algorithms, Digital Libraries, Information Retrieval, physics.soc-ph

Catalog footprint

What is connected

20 works

15 topics

4 close collaborators

preprint 2026 arXiv

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running orders of magnitude faster than GPT-5 at inference time and approaching the robustness of much larger closed-source judges.

preprint 2024 arXiv

Is there really a Citation Age Bias in NLP?

Citations are a key ingredient of scientific research to relate a paper to others published in the community. Recently, it has been noted that there is a citation age bias in the Natural Language Processing (NLP) community, one of the currently fastest growing AI subfields, in that the mean age of the bibliography of NLP papers has become ever younger in the last few years, leading to 'citation amnesia' in which older knowledge is increasingly forgotten. In this work, we put such claims into perspective by analyzing the bibliography of $\sim$300k papers across 15 different scientific fields submitted to the popular preprint server arXiv in the time period from 2013 to 2022. We find that all AI subfields (in particular: cs.AI, cs.CL, cs.CV, cs.LG) have similar trends of citation amnesia, in which the age of the bibliography has roughly halved in the last 10 years (from above 12 in 2013 to below 7 in 2022), on average. Rather than diagnosing this as a citation age bias in the NLP community, we believe this pattern is an artefact of the dynamics of these research fields, in which new knowledge is produced in ever shorter time intervals.

preprint 2022 arXiv

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.

preprint 2022 arXiv

Towards Explainable Evaluation Metrics for Natural Language Generation

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are transparent. To foster more widespread acceptance of the novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties and propose key goals of explainable machine translation evaluation metrics. We also provide a synthesizing overview of recent approaches for explainable machine translation metrics and discuss how they relate to those goals and properties. Further, we conduct our own novel experiments, which (among other things) find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics, as they are not meaning-preserving. Finally, we provide a vision of future approaches to explainable evaluation metrics and their evaluation. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent text generation systems.

preprint 2020 arXiv

On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.

preprint 2020 arXiv

SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization

We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations (e.g. preferences, ratings, etc.). We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques. Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18-39%. Furthermore, we use SUPERT as rewards to guide a neural-based reinforcement learning summarizer, yielding favorable performance compared to the state-of-the-art unsupervised summarizers. All source code is available at https://github.com/yg211/acl20-ref-free-eval.

preprint 2020 arXiv

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems

Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual input perturbations demonstrate. We then investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82%. We then explore three shielding methods (visual character embeddings, adversarial training, and rule-based recovery), which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks.

preprint 2017 arXiv

What is the Essence of a Claim? Cross-Domain Claim Identification

Argument mining has become a popular research area in NLP. It typically includes the identification of argumentative components, e.g. claims, as the central component of an argument. We perform a qualitative analysis across six different datasets and show that these appear to conceptualize claims quite differently. To learn about the consequences of such different conceptualizations of claim for practical applications, we carried out extensive experiments using state-of-the-art feature-rich and deep learning systems, to identify claims in a cross-domain fashion. While the divergent perception of claims in different datasets is indeed harmful to cross-domain classification, we show that there are shared properties on the lexical level as well as system configurations that can help to overcome these gaps.

preprint 2016 arXiv

Complex Decomposition of the Negative Distance kernel

The Support Vector Machine (SVM) has become a very popular machine learning method for text classification. One reason for this relates to the range of existing kernels which allow for classifying data that is not linearly separable. The linear, polynomial and RBF (Gaussian Radial Basis Function) kernels are commonly used and serve as a basis of comparison in our study. We show how to derive the primal form of the quadratic Power Kernel (PK), also called the Negative Euclidean Distance Kernel (NDK), by means of complex numbers. We exemplify the NDK in the framework of text categorization using the Dewey Decimal Classification (DDC) as the target scheme. Our evaluation shows that the power kernel produces F-scores that are comparable to the reference kernels, but is, except for the linear kernel, faster to compute. Finally, we show how to extend the NDK-approach by including the Mahalanobis distance.

preprint 2016 arXiv

Identities for partial Bell polynomials derived from identities for weighted integer compositions

We discuss closed-form formulas for the (n; k)-th partial Bell polynomials derived in Cvijovic. We show that partial Bell polynomials are special cases of weighted integer compositions, and demonstrate how the identities for partial Bell polynomials easily follow from more general identities for weighted integer compositions. We also provide short and elegant probabilistic proofs of the latter, in terms of sums of discrete integer-valued random variables. Finally, we outline further identities for the partial Bell polynomials.

preprint 2016 arXiv

Language classification from bilingual word embedding graphs

We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between downstream task performance and second language similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity and geography/contact.

preprint 2016 arXiv

On the Number of Many-to-Many Alignments of Multiple Sequences

We count the number of alignments of $N \ge 1$ sequences when match-up types are from a specified set $S\subseteq \mathbb{N}^N$. Equivalently, we count the number of nonnegative integer matrices whose rows sum to a given fixed vector and each of whose columns lie in $S$. We provide a new asymptotic formula for the case $S=\{(s_1,\ldots,s_N) \:|\: 1\le s_i\le 2\}$.

preprint 2016 arXiv

Opinion dynamics and wisdom under out-group discrimination

We study a DeGroot-like opinion dynamics model in which agents may oppose other agents. As an underlying motivation, in our setup, agents want to adjust their opinions to match those of the agents of their 'in-group' and, in addition, they want to adjust their opinions to match the 'inverse' of those of the agents of their 'out-group'. Our paradigm can account for persistent disagreement in connected societies as well as bi- and multi-polarization. Outcomes depend upon network structure and the choice of deviation function modeling the mode of opposition between agents. For a particular choice of deviation function, which we call soft opposition, we derive necessary and sufficient conditions for long-run polarization. We also consider social influence (who are the opinion leaders in the network?) as well as the question of wisdom in our naive learning paradigm, finding that wisdom is difficult to attain when there exist sufficiently strong negative relations between agents.
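The in-group/out-group update described above can be illustrated with a minimal sketch. It assumes opinions lie in [-1, 1] and uses plain negation as the 'inverse' of an out-group opinion; that choice, and the names `step`, `run`, `weights` and `opposes`, are hypothetical illustrations and not the paper's deviation functions or its soft opposition rule.

```python
from typing import List


def step(opinions: List[float],
         weights: List[List[float]],
         opposes: List[List[bool]]) -> List[float]:
    """One DeGroot-style averaging round with opposition.

    weights[i][j] is the weight agent i places on agent j (rows sum to 1).
    opposes[i][j] is True when agent j belongs to agent i's out-group.
    Assumption: the 'inverse' of an opinion x is simply -x.
    """
    n = len(opinions)
    updated = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            # In-group opinions are taken as-is; out-group opinions are negated.
            signal = -opinions[j] if opposes[i][j] else opinions[j]
            total += weights[i][j] * signal
        updated.append(total)
    return updated


def run(opinions: List[float],
        weights: List[List[float]],
        opposes: List[List[bool]],
        rounds: int) -> List[float]:
    """Apply `step` repeatedly and return the final opinion vector."""
    for _ in range(rounds):
        opinions = step(opinions, weights, opposes)
    return opinions


# Two agents who weight themselves and each other equally.
W = [[0.5, 0.5], [0.5, 0.5]]
mutual_opposition = [[False, True], [True, False]]
no_opposition = [[False, False], [False, False]]

polarized = run([1.0, -1.0], W, mutual_opposition, rounds=10)
consensus = run([1.0, -1.0], W, no_opposition, rounds=10)
```

In this toy setting the two mutually opposed agents keep their initial opinions indefinitely, whereas the same weights without opposition average to a consensus in one round, which is the kind of persistent disagreement the abstract refers to.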

preprint 2016 arXiv

Still not there? Comparing Traditional Sequence-to-Sequence Models to Encoder-Decoder Neural Networks on Monotone String Translation Tasks

We analyze the performance of encoder-decoder neural models and compare them with well-known established methods. The latter represent different classes of traditional approaches that are applied to the monotone sequence-to-sequence tasks OCR post-correction, spelling correction, grapheme-to-phoneme conversion, and lemmatization. Such tasks are of practical relevance for various higher-level research fields including digital humanities, automatic text correction, and speech recognition. We investigate how well generic deep-learning approaches adapt to these tasks, and how they perform in comparison with established and more specialized methods, including our own adaptation of pruned CRFs.

preprint 2016 arXiv

Stirling's approximation for central extended binomial coefficients

We derive asymptotic formulas for central extended binomial coefficients, which are generalizations of binomial coefficients. To do so, we relate the exact distribution of the sum of independent discrete uniform random variables to the asymptotic distribution, obtained from the Central Limit Theorem and a local limit variant.
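The link between exact counts and the Central Limit Theorem can be checked numerically. The sketch below assumes the extended binomial coefficients are the coefficients of $(1 + x + \dots + x^l)^k$, so that dividing by $(l+1)^k$ gives the distribution of a sum of $k$ independent uniform variables on $\{0,\ldots,l\}$. The approximation used is only the leading local-limit term, not necessarily the formula derived in the paper, and the function names are hypothetical.

```python
import math
from typing import List


def extended_binomial_row(k: int, l: int) -> List[int]:
    """Coefficients of (1 + x + ... + x^l)^k, lowest degree first."""
    row = [1]
    for _ in range(k):
        nxt = [0] * (len(row) + l)
        for j, c in enumerate(row):
            for s in range(l + 1):
                nxt[j + s] += c
        row = nxt
    return row


def central_exact(k: int, l: int) -> int:
    """Exact central coefficient; requires k*l to be even."""
    if (k * l) % 2 != 0:
        raise ValueError("k*l must be even for a single central coefficient")
    return extended_binomial_row(k, l)[k * l // 2]


def central_normal_approx(k: int, l: int) -> float:
    """Leading-order estimate from the local limit theorem.

    A uniform variable on {0, ..., l} has variance l*(l+2)/12, so the
    sum of k of them has density about 1/sqrt(2*pi*k*variance) at its mean.
    """
    variance = l * (l + 2) / 12.0
    return (l + 1) ** k / math.sqrt(2.0 * math.pi * k * variance)


# Compare the exact value with the estimate for k = 200 parts of size at most 2.
ratio = central_exact(200, 2) / central_normal_approx(200, 2)
```

For `l = 1` the estimate reduces to the familiar Stirling approximation of the central binomial coefficient, $2^k \sqrt{2/(\pi k)}$.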

preprint 2015 arXiv

Some Elementary Congruences for the Number of Weighted Integer Compositions

An integer composition of a nonnegative integer $n$ is a tuple $(\pi_1,\ldots,\pi_k)$ of nonnegative integers whose sum is $n$; the $\pi_i$'s are called the parts of the composition. For a fixed number $k$ of parts, the number of $f$-weighted integer compositions (also called $f$-colored integer compositions in the literature), in which each part size $s$ may occur in $f(s)$ different colors, is given by the extended binomial coefficient $\binom{k}{n}_{f}$. We derive several congruence properties for $\binom{k}{n}_{f}$, most of which are analogous to those for ordinary binomial coefficients. Among them are the parity of $\binom{k}{n}_{f}$, Babbage's congruence, and Lucas' theorem. We also give congruences for $c_{f}(n)$, the number of $f$-weighted integer compositions with arbitrarily many parts, and for extended binomial coefficient sums. We close with an application of our results to prime criteria for weighted integer compositions.
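The definition above can be made concrete with a small counting sketch. It relies on the standard fact that the number of $f$-weighted compositions of $n$ with $k$ parts is the coefficient of $x^n$ in $\left(\sum_s f(s)\,x^s\right)^k$; the function name `extended_binomial` and the example weight functions are hypothetical choices for illustration.

```python
from typing import Callable


def extended_binomial(k: int, n: int, f: Callable[[int], int]) -> int:
    """Count f-weighted compositions of n with exactly k nonnegative parts.

    Each part of size s may appear in f(s) colors, so the count equals
    the coefficient of x^n in (sum_s f(s) * x^s)^k.
    """
    weights = [f(s) for s in range(n + 1)]
    # counts[j] = number of weighted ways to reach sum j with the parts so far.
    counts = [1] + [0] * n  # zero parts: only the empty composition of 0
    for _ in range(k):
        nxt = [0] * (n + 1)
        for j, c in enumerate(counts):
            if c == 0:
                continue
            for s in range(n - j + 1):
                nxt[j + s] += c * weights[s]
        counts = nxt
    return counts[n]


# Parts restricted to {0, 1}: recovers the ordinary binomial coefficient.
ordinary = extended_binomial(5, 2, lambda s: 1 if s <= 1 else 0)

# Parts restricted to {0, 1, 2}: trinomial coefficients.
trinomial = extended_binomial(2, 2, lambda s: 1 if s <= 2 else 0)

# Every part size allowed once: compositions of 4 into 3 nonnegative parts.
unrestricted = extended_binomial(3, 4, lambda s: 1)
```

With parts restricted to sizes 0 and 1 the count reduces to the ordinary binomial coefficient, which is why congruences for $\binom{k}{n}_{f}$ can mirror the classical ones.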

preprint 2014 arXiv

Corrections to the results derived in "A Unified Approach to Algorithms Generating Unrestricted and Restricted Integer Compositions and Integer Partitions"; and a comparison of four restricted integer composition generation algorithms

In this note, I discuss results on integer compositions/partitions given in the paper "A Unified Approach to Algorithms Generating Unrestricted and Restricted Integer Compositions and Integer Partitions". I also experiment with four different generation algorithms for restricted integer compositions and find the algorithm designed in the named paper to be pretty slow, comparatively. Some of my comments may be subjective.

preprint 2014 arXiv

Deriving Faà di Bruno's formula for the derivative of a composite function via compositions of integers

We give yet another proof for Faà di Bruno's formula for higher derivatives of composite functions. Our proof technique relies on reinterpreting the composition of two power series as the generating function for weighted integer compositions, for which a Faà di Bruno-like formula is quite naturally established.

preprint 2013 arXiv

(Failure of the) Wisdom of the crowds in an endogenous opinion dynamics model with multiply biased agents

We study an endogenous opinion (or, belief) dynamics model where we endogenize the social network that models the link ('trust') weights between agents. Our network adjustment mechanism is simple: an agent increases her weight for another agent if that agent has been close to the truth (hence, our adjustment criterion is 'past performance'). Moreover, we consider multiply biased agents that do not learn in a fully rational manner but are subject to persuasion bias (they learn in a DeGroot manner, via a simple 'rule of thumb') and that have biased initial beliefs. In addition, we also study this setup under conformity, opposition, and homophily, which are recently suggested variants of DeGroot learning in social networks, thereby taking into account further biases agents are susceptible to. Our main focus is on crowd wisdom, that is, on the question of whether agents biased in these ways can adequately aggregate dispersed information and, consequently, learn the true states of the topics they communicate about. In particular, we present several conditions under which wisdom fails.

preprint 2012 arXiv

Asymptotic normality of integer compositions inside a rectangle

Among all restricted integer compositions with at most $m$ parts, each of which has size at most $l$, choose one uniformly at random. Which integer does this composition represent? In the current note, we show that the underlying distribution is, for large $m$ and $l$, approximately normal with mean value $\frac{ml}{2}$.
