Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
37works
0followers
16topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

37 published item(s)

preprint2026arXiv

Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models' compression ability in terms of (i) preserving semantically important information (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM - Qwen3-4B - and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of CR adherence and maximizing the downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr's generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.

preprint2026arXiv

Integrating Machine-Generated Short Descriptions into the Wikipedia Android App: A Pilot Deployment of Descartes

Short descriptions are a key part of the Wikipedia user experience, but their coverage remains uneven across languages and topics. In previous work, we introduced Descartes, a multilingual model for generating short descriptions. In this report, we present the results of a pilot deployment of Descartes in the Wikipedia Android app, where editors were offered suggestions based on outputs from Descartes while editing short descriptions. The experiment spanned 12 languages, with over 3,900 articles and 375 editors participating. Overall, 90% of accepted Descartes descriptions were rated at least 3 out of 5 in quality, and their average ratings were comparable to human-written ones. Editors adopted machine suggestions both directly and with modifications, while the rate of reverts and reports remained low. The pilot also revealed practical considerations for deployment, including latency, language-specific gaps, and the need for safeguards around sensitive topics. These results indicate that Descartes's short descriptions can support editors in reducing content gaps, provided that technical, design, and community guardrails are in place.

preprint2026arXiv

Tracing Persona Vectors Through LLM Pretraining

How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.

preprint2024arXiv

Deplatforming Norm-Violating Influencers on Social Media Reduces Overall Online Attention Toward Them

From politicians to podcast hosts, online platforms have systematically banned (``deplatformed'') influential users for breaking platform guidelines. Previous inquiries on the effectiveness of this intervention are inconclusive because 1) they consider only few deplatforming events; 2) they consider only overt engagement traces (e.g., likes and posts) but not passive engagement (e.g., views); 3) they do not consider all the potential places users impacted by the deplatforming event might migrate to. We address these limitations in a longitudinal, quasi-experimental study of 165 deplatforming events targeted at 101 influencers. We collect deplatforming events from Reddit posts and then manually curate the data, ensuring the correctness of a large dataset of deplatforming events. Then, we link these events to Google Trends and Wikipedia page views, platform-agnostic measures of online attention that capture the general public's interest in specific influencers. Through a difference-in-differences approach, we find that deplatforming reduces online attention toward influencers. After 12 months, we estimate that online attention toward deplatformed influencers is reduced by -63% (95% CI [-75%,-46%]) on Google and by -43% (95% CI [-57%,-24%]) on Wikipedia. Further, as we study over a hundred deplatforming events, we can analyze in which cases deplatforming is more or less impactful, revealing nuances about the intervention. Notably, we find that both permanent and temporary deplatforming reduce online attention toward influencers; Overall, this work contributes to the ongoing effort to map the effectiveness of content moderation interventions, driving platform governance away from speculation.

preprint2023arXiv

A Large-Scale Characterization of How Readers Browse Wikipedia

Despite the importance and pervasiveness of Wikipedia as one of the largest platforms for open knowledge, surprisingly little is known about how people navigate its content when seeking information. To bridge this gap, we present the first systematic large-scale analysis of how readers browse Wikipedia. Using billions of page requests from Wikipedia's server logs, we measure how readers reach articles, how they transition between articles, and how these patterns combine into more complex navigation paths. We find that navigation behavior is characterized by highly diverse structures. Although most navigation paths are shallow, comprising a single pageload, there is much variety, and the depth and shape of paths vary systematically with topic, device type, and time of day. We show that Wikipedia navigation paths commonly mesh with external pages as part of a larger online ecosystem, and we describe how naturally occurring navigation paths are distinct from targeted navigation in lab-based settings. Our results further suggest that navigation is abandoned when readers reach low-quality pages. Taken together, these insights contribute to a more systematic understanding of readers' information needs and allow for improving their experience on Wikipedia and the Web in general.

preprint2022arXiv

Anticipated versus Actual Effects of Platform Design Change: A Case Study of Twitter's Character Limit

The design of online platforms is both critically important and challenging, as any changes may lead to unintended consequences, and it can be hard to predict how users will react. Here we conduct a case study of a particularly important real-world platform design change: Twitter's decision to double the character limit from 140 to 280 characters to soothe users' need to ''cram'' or ''squeeze'' their tweets, informed by modeling of historical user behavior. In our analysis, we contrast Twitter's anticipated pre-intervention predictions about user behavior with actual post-intervention user behavior: Did the platform design change lead to the intended user behavior shifts, or did a gap between anticipated and actual behavior emerge? Did different user groups react differently? We find that even though users do not ''cram'' as much under 280 characters as they used to under 140 characters, emergent ``cramming'' at the new limit seems to not have been taken into account when designing the platform change. Furthermore, investigating textual features, we find that, although post-intervention ''crammed'' tweets are longer, their syntactic and semantic characteristics remain similar and indicative of ''squeezing''. Applying the same approach as Twitter policy-makers, we create updated counterfactual estimates and find that the character limit would need to be increased further to reduce cramming that re-emerged at the new limit. We contribute to the rich literature studying online user behavior with an empirical study that reveals a dynamic interaction between platform design and user behavior, with immediate policy and practical implications for the design of socio-technical systems.

preprint2022arXiv

Biased Bytes: On the Validity of Estimating Food Consumption from Digital Traces

Given that measuring food consumption at a population scale is a challenging task, researchers have begun to explore digital traces (e.g., from social media or from food-tracking applications) as potential proxies. However, it remains unclear to what extent digital traces reflect real food consumption. The present study aims to bridge this gap by quantifying the link between dietary behaviors as captured via social media (Twitter) v.s. a food-tracking application (MyFoodRepo). We focus on the case of Switzerland and contrast images of foods collected through the two platforms, by designing and deploying a novel crowdsourcing framework for estimating biases with respect to nutritional properties and appearance. We find that the food type distributions in social media v.s. food tracking diverge; e.g., bread is 2.5 times more frequent among consumed and tracked foods than on Twitter, whereas cake is 12 times more frequent on Twitter. Controlling for the different food type distributions, we contrast consumed and tracked foods of a given type with foods shared on Twitter. Across food types, food posted on Twitter is perceived as tastier, more caloric, less healthy, less likely to have been consumed at home, more complex, and larger-portioned, compared to consumed and tracked foods. The fact that there is a divergence between food consumption as measured via the two platforms implies that at least one of the two is not a faithful representation of the true food consumption in the general Swiss population. Thus, researchers should be attentive and aim to establish evidence of validity before using digital traces as a proxy for the true food consumption of a general population. We conclude by discussing the potential sources of these biases and their implications, outlining pitfalls and threats to validity, and proposing actionable ways for overcoming them.

preprint2022arXiv

Can online attention signals help fact-checkers fact-check?

Recent research suggests that not all fact-checking efforts are equal: when and what is fact-checked plays a pivotal role in effectively correcting misconceptions. In that context, signals capturing how much attention specific topics receive on the Internet have the potential to study (and possibly support) fact-checking efforts. This paper proposes a framework to study fact-checking with online attention signals. The framework consists of: 1) extracting claims from fact-checking efforts; 2) linking such claims with knowledge graph entities; and 3) estimating the online attention these entities receive. We use this framework to conduct a preliminary study of a dataset of 879 COVID-19-related fact-checks done in 2020 by 81 international organizations. Our findings suggest that there is often a disconnect between online attention and fact-checking efforts. For example, in around 40% of countries that fact-checked ten or more claims, half or more than half of the ten most popular claims were not fact-checked. Our analysis also shows that claims are first fact-checked after receiving, on average, 35% of the total online attention they would eventually receive in 2020. Yet, there is a considerable variation among claims: some were fact-checked before receiving a surge of misinformation-induced online attention; others are fact-checked much later. Overall, our work suggests that the incorporation of online attention signals may help organizations assess their fact-checking efforts and choose what and when to fact-check claims or stories. Also, in the context of international collaboration, where claims are fact-checked multiple times across different countries, online attention could help organizations keep track of which claims are "migrating" between countries.

preprint2022arXiv

Efficient Entity Candidate Generation for Low-Resource Languages

Candidate generation is a crucial module in entity linking. It also plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases. Nevertheless, it has often been overlooked in the monolingual English entity linking literature, as naive approaches obtain very good performance. Unfortunately, the existing approaches for English cannot be successfully transferred to poorly resourced languages. This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking with a focus on low-resource languages. Among other contributions, we point out limitations in the evaluation conducted in previous works. We introduce a characterization of queries into types based on their difficulty, which improves the interpretability of the performance of different methods. We also propose a light-weight and simple solution based on the construction of indexes whose design is motivated by more complex transfer learning based neural approaches. A thorough empirical analysis on 9 real-world datasets under 2 evaluation settings shows that our simple solution outperforms the state-of-the-art approach in terms of both quality and efficiency for almost all datasets and query types.

preprint2022arXiv

ESC-Rules: Explainable, Semantically Constrained Rule Sets

We describe a novel approach to explainable prediction of a continuous variable based on learning fuzzy weighted rules. Our model trains a set of weighted rules to maximise prediction accuracy and minimise an ontology-based 'semantic loss' function including user-specified constraints on the rules that should be learned in order to maximise the explainability of the resulting rule set from a user perspective. This system fuses quantitative sub-symbolic learning with symbolic learning and constraints based on domain knowledge. We illustrate our system on a case study in predicting the outcomes of behavioural interventions for smoking cessation, and show that it outperforms other interpretable approaches, achieving performance close to that of a deep learning model, while offering transparent explainability that is an essential requirement for decision-makers in the health domain.

preprint2022arXiv

GenIE: Generative Information Extraction

Structured and grounded representation of text is typically formalized by closed information extraction, the problem of extracting an exhaustive set of (subject, relation, object) triplets that are consistent with a predefined set of entities and relations from a knowledge base schema. Most existing works are pipelines prone to error accumulation, and all approaches are only applicable to unrealistically small numbers of entities and relations. We introduce GenIE (generative information extraction), the first end-to-end autoregressive formulation of closed information extraction. GenIE naturally exploits the language knowledge from the pre-trained transformer by autoregressively generating relations and entities in textual form. Thanks to a new bi-level constrained generation strategy, only triplets consistent with the predefined knowledge base schema are produced. Our experiments show that GenIE is state-of-the-art on closed information extraction, generalizes from fewer training data points than baselines, and scales to a previously unmanageable number of entities and relations. With this work, closed information extraction becomes practical in realistic scenarios, providing new opportunities for downstream tasks. Finally, this work paves the way towards a unified end-to-end approach to the core tasks of information extraction. Code, data and models available at https://github.com/epfl-dlab/GenIE.

preprint2022arXiv

Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions

"Wiki rabbit holes" are informally defined as navigation paths followed by Wikipedia readers that lead them to long explorations, sometimes involving unexpected articles. Although wiki rabbit holes are a popular concept in Internet culture, our current understanding of their dynamics is based on anecdotal reports only. To bridge this gap, this paper provides a large-scale quantitative characterization of the navigation traces of readers who fell into a wiki rabbit hole. First, we represent user sessions as navigation trees and operationalize the concept of wiki rabbit holes based on the depth of these trees. Then, we characterize rabbit hole sessions in terms of structural patterns, time properties, and topical exploration. We find that article layout influences the structure of rabbit hole sessions and that the fraction of rabbit hole sessions is higher during the night. Moreover, readers are more likely to fall into a rabbit hole starting from articles about entertainment, sports, politics, and history. Finally, we observe that, on average, readers tend to stay focused on one topic by remaining in the semantic neighborhood of the first articles even during rabbit hole sessions. These findings contribute to our understanding of Wikipedia readers' information needs and user behavior on the Web.

preprint2022arXiv

Homepage2Vec: Language-Agnostic Website Embedding and Classification

Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries

preprint2022arXiv

On the Context-Free Ambiguity of Emoji

Emojis come with prepacked semantics making them great candidates to create new forms of more accessible communications. Yet, little is known about how much of this emojis semantic is agreed upon by humans, outside of textual contexts. Thus, we collected a crowdsourced dataset of one-word emoji descriptions for 1,289 emojis presented to participants with no surrounding text. The emojis and their interpretations were then examined for ambiguity. We find that with 30 annotations per emoji, 16 emojis (1.2%) are completely unambiguous, whereas 55 emojis (4.3%) are so ambiguous that their descriptions are indistinguishable from randomly chosen descriptions. Most of studied emojis are spread out between the two extremes. Furthermore, investigating the ambiguity of different types of emojis, we find that an important factor is the extent to which an emoji has an embedded symbolical meaning drawn from an established code-book of symbols. We conclude by discussing design implications.

preprint2022arXiv

Population-scale dietary interests during the COVID-19 pandemic

The SARS-CoV-2 virus has altered people's lives around the world. Here we document population-wide shifts in dietary interests in 18 countries in 2020, as revealed through time series of Google search volumes. We find that during the first wave of the COVID-19 pandemic there was an overall surge in food interest, larger and longer-lasting than the surge during typical end-of-year holidays in Western countries. The shock of decreased mobility manifested as a drastic increase in interest in consuming food at home and a corresponding decrease in consuming food outside of home. The largest (up to threefold) increases occurred for calorie-dense carbohydrate-based foods such as pastries, bakery products, bread, and pies. The observed shifts in dietary interests have the potential to globally affect food consumption and health outcomes. These findings can inform governmental and organizational decisions regarding measures to mitigate the effects of the COVID-19 pandemic on diet and nutrition.

preprint2022arXiv

Post Approvals in Online Communities

In many online communities, community leaders (i.e., moderators and administrators) can proactively filter undesired content by requiring posts to be approved before publication. But although many communities adopt post approvals, there has been little research on its impact on community behavior. Through a longitudinal analysis of 233,402 Facebook Groups, we examined 1) the factors that led to a community adopting post approvals and 2) how the setting shaped subsequent user activity and moderation in the group. We find that communities that adopted post approvals tended to do so following sudden increases in user activity (e.g., comments) and moderation (e.g., reported posts). This adoption of post approvals led to fewer but higher-quality posts. Though fewer posts were shared after adoption, not only did community members write more comments, use more reactions, and spend more time on the posts that were shared, they also reported these posts less. Further, post approvals did not significantly increase the average time leaders spent in the group, though groups that enabled the setting tended to appoint more leaders. Last, the impact of post approvals varied with both group size and how the setting was used, e.g.,, group size mediates whether leaders spent more or less time in the group following the adoption of the setting. Our findings suggest ways that proactive content moderation may be improved to better support online communities.

preprint2022arXiv

Quote Erat Demonstrandum: A Web Interface for Exploring the Quotebank Corpus

The use of attributed quotes is the most direct and least filtered pathway of information propagation in news. Consequently, quotes play a central role in the conception, reception, and analysis of news stories. Since quotes provide a more direct window into a speaker's mind than regular reporting, they are a valuable resource for journalists and researchers alike. While substantial research efforts have been devoted to methods for the automated extraction of quotes from news and their attribution to speakers, few comprehensive corpora of attributed quotes from contemporary sources are available to the public. Here, we present an adaptive web interface for searching Quotebank, a massive collection of quotes from the news, which we make available at https://quotebank.dlab.tools.

preprint2022arXiv

Strong Heuristics for Named Entity Linking

Named entity linking (NEL) in news is a challenging endeavour due to the frequency of unseen and emerging entities, which necessitates the use of unsupervised or zero-shot methods. However, such methods tend to come with caveats, such as no integration of suitable knowledge bases (like Wikidata) for emerging entities, a lack of scalability, and poor interpretability. Here, we consider person disambiguation in Quotebank, a massive corpus of speaker-attributed quotations from the news, and investigate the suitability of intuitive, lightweight, and scalable heuristics for NEL in web-scale corpora. Our best performing heuristic disambiguates 94% and 63% of the mentions on Quotebank and the AIDA-CoNLL benchmark, respectively. Additionally, the proposed heuristics compare favourably to the state-of-the-art unsupervised and zero-shot methods, Eigenthemes and mGENRE, respectively, thereby serving as strong baselines for unsupervised and zero-shot entity linking.

preprint2022arXiv

Wikipedia Reader Navigation: When Synthetic Data Is Enough

Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example for how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy.

preprint2021arXiv

A pole to pole pressure temperature map of Saturn's thermosphere from Cassini Grand Finale data

Temperatures of the outer planet thermospheres exceed those predicted by solar heating alone by several hundred degrees. Enough energy is deposited at auroral regions to heat the entire thermosphere, but models predict that equatorward distribution is inhibited by strong Coriolis forces and ion drag. A better understanding of auroral energy deposition and circulation are critical to solving this so-called energy crisis. Stellar occultations observed by the Ultraviolet Imaging Spectrograph instrument during the Cassini Grand Finale were designed to map the thermosphere from pole to pole. We analyze these observations, together with earlier observations from 2016 and 2017, to create a two-dimensional map of densities and temperatures in Saturns thermosphere as a function of latitude and depth. The observed temperatures at auroral latitudes are cooler and peak at higher altitudes and lower latitudes than predicted by models, leading to a shallower meridional pressure gradient. Under modified geostrophy, we infer slower westward zonal winds that extend to lower latitudes than predicted, supporting equatorward flow from approximately 70 to 30 degrees latitude in both hemispheres. We also show evidence of atmospheric waves in the data that can contribute to equatorward redistribution of energy through zonal drag.

preprint2021arXiv

Calibration of Google Trends Time Series

Google Trends is a tool that allows researchers to analyze the popularity of Google search queries across time and space. In a single request, users can obtain time series for up to 5 queries on a common scale, normalized to the range from 0 to 100 and rounded to integer precision. Despite the overall value of Google Trends, rounding causes major problems, to the extent that entirely uninformative, all-zero time series may be returned for unpopular queries when requested together with more popular queries. We address this issue by proposing Google Trends Anchor Bank (G-TAB), an efficient solution for the calibration of Google Trends data. Our method expresses the popularity of an arbitrary number of queries on a common scale without being compromised by rounding errors. The method proceeds in two phases. In the offline preprocessing phase, an "anchor bank" is constructed, a set of queries spanning the full spectrum of popularity, all calibrated against a common reference query by carefully chaining together multiple Google Trends requests. In the online deployment phase, any given search query is calibrated by performing an efficient binary search in the anchor bank. Each search step requires one Google Trends request, but few steps suffice, as we demonstrate in an empirical evaluation. We make our code publicly available as an easy-to-use library at https://github.com/epfl-dlab/GoogleTrendsAnchorBank.

preprint2021arXiv

Crosslingual Topic Modeling with WikiPDA

We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use library publicly available at https://github.com/epfl-dlab/WikiPDA.

preprint2021arXiv

Formation of Social Ties Influences Food Choice: A Campus-Wide Longitudinal Study

Nutrition is a key determinant of long-term health, and social influence has long been theorized to be a key determinant of nutrition. It has been difficult to quantify the postulated role of social influence on nutrition using traditional methods such as surveys, due to the typically small scale and short duration of studies. To overcome these limitations, we leverage a novel source of data: logs of 38 million food purchases made over an 8-year period on the Ecole Polytechnique Federale de Lausanne (EPFL) university campus, linked to anonymized individuals via the smartcards used to make on-campus purchases. In a longitudinal observational study, we ask: How is a person's food choice affected by eating with someone else whose own food choice is healthy vs. unhealthy? To estimate causal effects from the passively observed log data, we control confounds in a matched quasi-experimental design: we identify focal users who at first do not have any regular eating partners but then start eating with a fixed partner regularly, and we match focal users into comparison pairs such that paired users are nearly identical with respect to covariates measured before acquiring the partner, where the two focal users' new eating partners diverge in the healthiness of their respective food choice. A difference-in-differences analysis of the paired data yields clear evidence of social influence: focal users acquiring a healthy-eating partner change their habits significantly more toward healthy foods than focal users acquiring an unhealthy-eating partner. We further identify foods whose purchase frequency is impacted significantly by the eating partner's healthiness of food choice. Beyond the main results, the work demonstrates the utility of passively sensed food purchase logs for deriving insights, with the potential of informing the design of public health interventions and food offerings.

preprint2021arXiv

Low-Rank Subspaces for Unsupervised Entity Linking

Entity linking is an important problem with many applications. Most previous solutions were designed for settings where annotated training data is available, which is, however, not the case in numerous domains. We propose a light-weight and scalable entity linking method, Eigenthemes, that relies solely on the availability of entity names and a referent knowledge base. Eigenthemes exploits the fact that the entities that are truly mentioned in a document (the "gold entities") tend to form a semantically dense subset of the set of all candidate entities in the document. Geometrically speaking, when representing entities as vectors via some given embedding, the gold entities tend to lie in a low-rank subspace of the full embedding space. Eigenthemes identifies this subspace using the singular value decomposition and scores candidate entities according to their proximity to the subspace. On the empirical front, we introduce multiple strong baselines that compare favorably to (and sometimes even outperform) the existing state of the art. Extensive experiments on benchmark datasets from a variety of real-world domains showcase the effectiveness of our approach.

preprint2021arXiv

On the Value of Wikipedia as a Gateway to the Web

By linking to external websites, Wikipedia can act as a gateway to the Web. To date, however, little is known about the amount of traffic generated by Wikipedia's external links. We fill this gap in a detailed analysis of usage logs gathered from Wikipedia users' client devices. Our analysis proceeds in three steps: First, we quantify the level of engagement with external links, finding that, in one month, English Wikipedia generated 43M clicks to external websites, in roughly even parts via links in infoboxes, cited references, and article bodies. Official links listed in infoboxes have by far the highest click-through rate (CTR), 2.47% on average. In particular, official links associated with articles about businesses, educational institutions, and websites have the highest CTR, whereas official links associated with articles about geographical content, television, and music have the lowest CTR. Second, we investigate patterns of engagement with external links, finding that Wikipedia frequently serves as a stepping stone between search engines and third-party websites, effectively fulfilling information needs that search engines do not meet. Third, we quantify the hypothetical economic value of the clicks received by external websites from English Wikipedia, by estimating that the respective website owners would need to pay a total of $7--13 million per month to obtain the same volume of traffic via sponsored search. Overall, these findings shed light on Wikipedia's role not only as an important source of information, but also as a high-traffic gateway to the broader Web ecosystem.

preprint2020arXiv

Adoption of Twitter's New Length Limit: Is 280 the New 140?

In November 2017, Twitter doubled the maximum allowed tweet length from 140 to 280 characters, a drastic switch on one of the world's most influential social media platforms. In the first long-term study of how the new length limit was adopted by Twitter users, we ask: Does the effect of the new length limit resemble that of the old one? Or did the doubling of the limit fundamentally change how Twitter is shaped by the limited length of posted content? By analyzing Twitter's publicly available 1% sample over a period of around 3 years, we find that, when the length limit was raised from 140 to 280 characters, the prevalence of tweets around 140 characters dropped immediately, while the prevalence of tweets around 280 characters rose steadily for about 6 months. Despite this rise, tweets approaching the length limit have been far less frequent after than before the switch. We find widely different adoption rates across languages and client-device types. The prevalence of tweets around 140 characters before the switch in a given language is strongly correlated with the prevalence of tweets around 280 characters after the switch in the same language, and very long tweets are vastly more popular on Web clients than on mobile clients. Moreover, tweets of around 280 characters after the switch are syntactically and semantically similar to tweets of around 140 characters before the switch, manifesting patterns of message squeezing in both cases. Taken together, these findings suggest that the new 280-character limit constitutes a new, less intrusive version of the old 140-character limit. The length limit remains an important factor that should be considered in all studies using Twitter data.

preprint2020arXiv

Behavior Cloning in OpenAI using Case Based Reasoning

Learning from Observation (LfO), also known as Behavioral Cloning, is an approach for building software agents by recording the behavior of an expert (human or artificial) and using the recorded data to generate the required behavior. jLOAF is a platform that uses Case-Based Reasoning to achieve LfO. In this paper we interface jLOAF with the popular OpenAI Gym environment. Our experimental results show how our approach can be used to provide a baseline for comparison in this domain, as well as identify the strengths and weaknesses when dealing with environmental complexity.

preprint2020arXiv

Experts and authorities receive disproportionate attention on Twitter during the COVID-19 crisis

Timely access to accurate information is crucial during the COVID-19 pandemic. Prompted by key stakeholders' cautioning against an "infodemic", we study information sharing on Twitter from January through May 2020. We observe an overall surge in the volume of general as well as COVID-19-related tweets around peak lockdown in March/April 2020. With respect to engagement (retweets and likes), accounts related to healthcare, science, government and politics received by far the largest boosts, whereas accounts related to religion and sports saw a relative decrease in engagement. While the threat of an "infodemic" remains, our results show that social media also provide a platform for experts and public authorities to be widely heard during a global crisis.

preprint2020arXiv

Fixation properties of rock-paper-scissors games in fluctuating populations

Rock-paper-scissors games metaphorically model cyclic dominance in ecology and microbiology. In a static environment, these models are characterized by fixation probabilities obeying two different "laws" in large and small well-mixed populations. Here, we investigate the evolution of these three-species models subject to a randomly switching carrying capacity modeling the endless change between states of resources scarcity and abundance. Focusing mainly on the zero-sum rock-paper-scissors game, equivalent to the cyclic Lotka-Volterra model, we study how the ${\it coupling}$ of demographic and environmental noise influences the fixation properties. More specifically, we investigate which species is the most likely to prevail in a population of fluctuating size and how the outcome depends on the environmental variability. We show that demographic noise coupled with environmental randomness "levels the field" of cyclic competition by balancing the effect of selection. In particular, we show that fast switching effectively reduces the selection intensity proportionally to the variance of the carrying capacity. We determine the conditions under which new fixation scenarios arise, where the most likely species to prevail changes with the rate of switching and the variance of the carrying capacity. Random switching has a limited effect on the mean fixation time that scales linearly with the average population size. Hence, environmental randomness makes the cyclic competition more egalitarian, but does not prolong the species coexistence. We also show how the fixation probabilities of close-to-zero-sum rock-paper-scissors games can be obtained from those of the zero-sum model by rescaling the selection intensity.

preprint2020arXiv

Global gender differences in Wikipedia readership

Wikipedia represents the largest and most popular source of encyclopedic knowledge in the world today, aiming to provide equal access to information worldwide. From a global online survey of 65,031 readers of Wikipedia and their corresponding reading logs, we present novel evidence of gender differences in Wikipedia readership and how they manifest in records of user behavior. More specifically we report that (1) women are underrepresented among readers of Wikipedia, (2) women view fewer pages per reading session than men do, (3) men and women visit Wikipedia for similar reasons, and (4) men and women exhibit specific topical preferences. Our findings lay the foundation for identifying pathways toward knowledge equity in the usage of online encyclopedic knowledge.

preprint2020arXiv

Learning High Order Feature Interactions with Fine Control Kernels

We provide a methodology for learning sparse statistical models that use as features all possible multiplicative interactions among an underlying atomic set of features. While the resulting optimization problems are exponentially sized, our methodology leads to algorithms that can often solve these problems exactly or provide approximate solutions based on combining highly correlated features. We also introduce an algorithmic paradigm, the Fine Control Kernel framework, so named because it is based on Fenchel Duality and is reminiscent of kernel methods. Its theory is tailored to large sparse learning problems, and it leads to efficient feature screening rules for interactions. These rules are inspired by the Apriori algorithm for market basket analysis -- which also falls under the purview of Fine Control Kernels, and can be applied to a plurality of learning problems including the Lasso and sparse matrix estimation. Experiments on biomedical datasets demonstrate the efficacy of our methodology in deriving algorithms that efficiently produce interactions models which achieve state-of-the-art accuracy and are interpretable.

preprint2020arXiv

On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.

preprint2020arXiv

Population Dynamics in a Changing Environment: Random versus Periodic Switching

Environmental changes greatly influence the evolution of populations. Here, we study the dynamics of a population of two strains, one growing slightly faster than the other, competing for resources in a time-varying binary environment modeled by a carrying capacity switching either randomly or periodically between states of abundance and scarcity. The population dynamics is characterized by demographic noise (birth and death events) coupled to a varying environment. We elucidate the similarities and differences of the evolution subject to a stochastically- and periodically-varying environment. Importantly, the population size distribution is generally found to be broader under intermediate and fast random switching than under periodic variations, which results in markedly different asymptotic behaviors between the fixation probability of random and periodic switching. We also determine the detailed conditions under which the fixation probability of the slow strain is maximal.

preprint2020arXiv

Privacy-Preserving Classification with Secret Vector Machines

Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services. User-held data is, however, often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm, but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious. We thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM's distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is suitable for production environments.

preprint2020arXiv

Quantifying Engagement with Citations on Wikipedia

Wikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway to secondary sources: according to Wikipedia's guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the very heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all interactions with links leading from English Wikipedia articles to cited references during one month, and conducted the first analysis of readers' interaction with citations on Wikipedia. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.29% overall; 0.56% on desktop; 0.13% on mobile). Matched observational studies of the factors associated with reference clicking reveal that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that recent content, open access sources and references about life events (births, deaths, marriages, etc) are particularly popular. Taken together, our findings open the door to a deeper understanding of Wikipedia's role in a global information economy where reliability is ever less certain, and source attribution ever more vital.

preprint2020arXiv

Robust Cross-lingual Embeddings from Parallel Sentences

Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume word embedding spaces are isomorphic between different languages, which has been shown not to hold in practice (Søgaard et al., 2018), and fundamentally limits their performance. This motivates investigating joint learning methods which can overcome this impediment, by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches while maintaining parity with the current state-of-the-art methods on word-translation. It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task, requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method leads to a much more pronounced improvement in the the quality of monolingual word vectors compared to other competing methods.

preprint2020arXiv

WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and mod-ules). Hence, researchers who intend to analyze Wikipediaas seen by its readers should work with HTML, rather than wikitext. Since Wikipedia's revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia's REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts ofdata and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia's full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext and that the missing links are important for user navigation.