Source author record

Filippo Menczer

Filippo Menczer appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Social and Information Networks, physics.soc-ph, cs.CY, physics.data-an, Human-Computer Interaction, Digital Libraries, Information Retrieval, Multiagent Systems, Machine Learning, Artificial Intelligence, Computation and Language, Networking and Internet Architecture

Catalog footprint

What is connected

42 works, 12 topics, 4 close collaborators

preprint, 2026, arXiv

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

Social media are shifting towards pluralism: community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

preprint, 2023, arXiv

Social Bots: Detection and Challenges

While social media are a key source of data for computational social science, their ease of manipulation by malicious actors threatens the integrity of online information exchanges and their analysis. In this chapter, we focus on malicious social bots, a prominent vehicle for such manipulation. We start by discussing recent studies about the presence and actions of social bots in various online discussions to show their real-world implications and the need for detection methods. Then we discuss the challenges of bot detection methods and use Botometer, a publicly available bot detection tool, as a case study to describe recent developments in this area. We close with a practical guide on how to handle social bots in social media research.

preprint, 2022, arXiv

Botometer 101: Social bot practicum for computational social scientists

Social bots have become an important component of online social media. Deceptive bots, in particular, can manipulate online discussions of important issues ranging from elections to public health, threatening the constructive exchange of information. Their ubiquity makes them an interesting research subject and requires researchers to properly handle them when conducting studies using social media data. Therefore, it is important for researchers to gain access to bot detection tools that are reliable and easy to use. This paper aims to provide an introductory tutorial of Botometer, a public tool for bot detection on Twitter, for readers who are new to this topic and may not be familiar with programming and machine learning. We introduce how Botometer works, the different ways users can access it, and present a case study as a demonstration. Readers can use the case study code as a template for their own research. We also discuss recommended practice for using Botometer.

preprint, 2022, arXiv

Can crowdsourcing rescue the social marketplace of ideas?

Facebook and Twitter recently announced community-based review platforms to address misinformation. We provide an overview of the potential affordances of such community-based approaches to content moderation based on past research and preliminary analysis of Twitter's Birdwatch data. While our analysis generally supports a community-based approach to content moderation, it also warns against potential pitfalls, particularly when the implementation of the new infrastructure focuses on crowd-based "validation" rather than "collaboration." We call for multidisciplinary research utilizing methods from complex systems studies, behavioural sociology, and computational social science to advance the research on crowd-based content moderation.

preprint, 2022, arXiv

Manipulating Twitter Through Deletions

Research into influence campaigns on Twitter has mostly relied on identifying malicious activities from tweets obtained via public APIs. These APIs provide access to public tweets that have not been deleted. However, bad actors can delete content strategically to manipulate the system. Unfortunately, estimates based on publicly available Twitter data underestimate the true deletion volume. Here, we provide the first exhaustive, large-scale analysis of anomalous deletion patterns involving more than a billion deletions by over 11 million accounts. We find that a small fraction of accounts delete a large number of tweets daily. We also uncover two abusive behaviors that exploit deletions. First, limits on tweet volume are circumvented, allowing certain accounts to flood the network with over 26 thousand daily tweets. Second, coordinated networks of accounts engage in repetitive likes and unlikes of content that is eventually deleted, which can manipulate ranking algorithms. These kinds of abuse can be exploited to amplify content and inflate popularity, while evading detection. Our study provides platforms and researchers with new methods for identifying social media abuse.

preprint, 2022, arXiv

Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal

Widespread uptake of vaccines is necessary to achieve herd immunity. However, uptake rates have varied across U.S. states during the first six months of the COVID-19 vaccination program. Misbeliefs may play an important role in vaccine hesitancy, and there is a need to understand relationships between misinformation, beliefs, behaviors, and health outcomes. Here we investigate the extent to which COVID-19 vaccination rates and vaccine hesitancy are associated with levels of online misinformation about vaccines. We also look for evidence of directionality from online misinformation to vaccine hesitancy. We find a negative relationship between misinformation and vaccination uptake rates. Online misinformation is also correlated with vaccine hesitancy rates taken from survey data. Associations between vaccine outcomes and misinformation remain significant when accounting for political as well as demographic and socioeconomic factors. While vaccine hesitancy is strongly associated with Republican vote share, we observe that the effect of online misinformation on hesitancy is strongest across Democratic rather than Republican counties. Granger causality analysis shows evidence for a directional relationship from online misinformation to vaccine hesitancy. Our results support a need for interventions that address misbeliefs, allowing individuals to make better-informed health decisions.

preprint, 2021, arXiv

Right and left, partisanship predicts (asymmetric) vulnerability to misinformation

We analyze the relationship between partisanship, echo chambers, and vulnerability to online misinformation by studying news sharing behavior on Twitter. While our results confirm prior findings that online misinformation sharing is strongly correlated with right-leaning partisanship, we also uncover a similar, though weaker trend among left-leaning users. Because of the correlation between a user's partisanship and their position within a partisan echo chamber, these types of influence are confounded. To disentangle their effects, we perform a regression analysis and find that vulnerability to misinformation is most strongly influenced by partisanship for both left- and right-leaning users.

preprint, 2020, arXiv

Exposure to Social Engagement Metrics Increases Vulnerability to Misinformation

News feeds in virtually all social media platforms include engagement metrics, such as the number of times each post is liked and shared. We find that exposure to these social engagement signals increases the vulnerability of users to misinformation. This finding has important implications for the design of social media interactions in the misinformation age. To reduce the spread of misinformation, we call for technology platforms to rethink the display of social engagement metrics. Further research is needed to investigate whether and how engagement metrics can be presented without amplifying the spread of low-credibility information.

preprint, 2020, arXiv

How Twitter Data Sampling Biases U.S. Voter Behavior Characterizations

Online social media are key platforms for the public to discuss political issues. As a result, researchers have used data from these platforms to analyze public opinions and forecast election results. Recent studies reveal the existence of inauthentic actors such as malicious social bots and trolls, suggesting that not every message is a genuine expression from a legitimate user. However, the prevalence of inauthentic activities in social data streams is still unclear, making it difficult to gauge biases of analyses based on such data. In this paper, we aim to close this gap using Twitter data from the 2018 U.S. midterm elections. Hyperactive accounts are over-represented in volume samples. We compare their characteristics with those of randomly sampled accounts and self-identified voters using a fast and low-cost heuristic. We show that hyperactive accounts are more likely to exhibit various suspicious behaviors and share low-credibility information compared to likely voters. Random accounts are more similar to likely voters, although they have slightly higher chances to display suspicious behaviors. Our work provides insights into biased voter characterizations when using online observations, underlining the importance of accounting for inauthentic actors in studies of political issues based on social media data.

preprint, 2020, arXiv

Unveiling Coordinated Groups Behind White Helmets Disinformation

Propaganda, disinformation, manipulation, and polarization are the modern illnesses of a society increasingly dependent on social media as a source of news. In this paper, we explore the disinformation campaign, sponsored by Russia and allies, against the Syria Civil Defense (a.k.a. the White Helmets). We unveil coordinated groups using automatic retweets and content duplication to promote narratives and/or accounts. The results also reveal distinct promoting strategies, ranging from the small groups sharing the exact same text repeatedly, to complex "news website factories" where dozens of accounts synchronously spread the same news from multiple sites.

preprint, 2019, arXiv

Bot Electioneering Volume: Visualizing Social Bot Activity During Elections

It has been widely recognized that automated bots may have a significant impact on the outcomes of national events. It is important to raise public awareness about the threat of bots on social media during these important events, such as the 2018 US midterm election. To this end, we deployed a web application to help the public explore the activities of likely bots on Twitter on a daily basis. The application, called Bot Electioneering Volume (BEV), reports on the level of likely bot activities and visualizes the topics targeted by them. With this paper we release our code base for the BEV framework, with the goal of facilitating future efforts to combat malicious bots on social media.

preprint, 2019, arXiv

Scalable and Generalizable Social Bot Detection through Data Selection

Efficient and reliable social bot classification is crucial for detecting information manipulation on social media. Despite rapid development, state-of-the-art bot detection models still face generalization and scalability challenges, which greatly limit their applications. In this paper we propose a framework that uses minimal account metadata, enabling efficient analysis that scales up to handle the full stream of public tweets of Twitter in real time. To ensure model accuracy, we build a rich collection of labeled datasets for training and validation. We deploy a strict validation system so that model performance on unseen datasets is also optimized, in addition to traditional cross-validation. We find that strategically selecting a subset of training data yields better model accuracy and generalization than exhaustively training on all available data. Thanks to the simplicity of the proposed model, its logic can be interpreted to provide insights into social bot characteristics.

preprint, 2016, arXiv

Hoaxy: A Platform for Tracking Online Misinformation

Massive amounts of misinformation have been observed to spread in uncontrolled fashion across social media. Examples include rumors, hoaxes, fake news, and conspiracy theories. At the same time, several journalistic organizations devote significant efforts to high-quality fact checking of online claims. The resulting information cascades contain instances of both accurate and inaccurate information, unfold over multiple time scales, and often reach audiences of considerable size. All these factors pose challenges for the study of the social dynamics of online news sharing. Here we introduce Hoaxy, a platform for the collection, detection, and analysis of online misinformation and its related fact-checking efforts. We discuss the design of the platform and present a preliminary analysis of a sample of public tweets containing both fake news and fact checking. We find that, in the aggregate, the sharing of fact-checking content typically lags that of misinformation by 10 to 20 hours. Moreover, fake news is dominated by very active users, while fact checking is a more grass-roots activity. With the increasing risks connected to massive online misinformation, social news observatories have the potential to help researchers, journalists, and the general public understand the dynamics of real and fake news sharing.

preprint, 2016, arXiv

Kinsey Reporter: Citizen Science for Sex Research

Kinsey Reporter is a global mobile app to share, explore, and visualize anonymous data about sex. Reports are submitted via smartphone, then visualized on a website or downloaded for offline analysis. In this paper we present the major features of the Kinsey Reporter citizen science platform designed to preserve the anonymity of its contributors, and preliminary data analyses that suggest questions for future research.

preprint, 2016, arXiv

On the influence of social bots in online protests. Preliminary findings of a Mexican case study

Social bots can affect online communication among humans. We study this phenomenon by focusing on #YaMeCanse, the most active protest hashtag in the history of Twitter in Mexico. Accounts using the hashtag are classified using the BotOrNot bot detection tool. Our preliminary analysis suggests that bots played a critical role in disrupting online communication about the protest movement.

preprint, 2016, arXiv

Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data

Machine learning methods are used to discover complex nonlinear relationships in biological and medical data. However, sophisticated learning models are computationally infeasible for data with millions of features. Here we introduce the first feature selection method for nonlinear learning problems that can scale up to large, ultra-high dimensional biological data. More specifically, we scale up the novel Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) to handle millions of features with tens of thousands of samples. The proposed method is guaranteed to find an optimal subset of maximally predictive features with minimal redundancy, yielding higher predictive power and improved interpretability. Its effectiveness is demonstrated through applications to classify phenotypes based on module expression in human prostate cancer patients and to detect enzymes among protein structures. We achieve high accuracy with as few as 20 out of one million features, a dimensionality reduction of 99.998%. Our algorithm can be implemented on commodity cloud computing platforms. The dramatic reduction of features may lead to the ubiquitous deployment of sophisticated prediction models in mobile health care applications.

preprint, 2016, arXiv

Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia

Contributing to the writing of history has never been as easy as it is today thanks to Wikipedia, a community-created encyclopedia that aims to document the world's knowledge from a neutral point of view. Though everyone can participate, it is well known that the editor community has a narrow diversity, with a majority of white male editors. While this participatory gender gap has been studied extensively in the literature, this work sets out to assess potential gender inequalities in Wikipedia articles along different dimensions: notability, topical focus, linguistic bias, structural properties, and meta-data presentation. We find that (i) women in Wikipedia are more notable than men, which we interpret as the outcome of a subtle glass ceiling effect; (ii) family-, gender-, and relationship-related topics are more present in biographies about women; (iii) linguistic bias manifests in Wikipedia since abstract terms tend to be used to describe positive aspects in the biographies of men and negative aspects in the biographies of women; and (iv) there are structural differences in terms of meta-data and hyperlinks, which have consequences for information-seeking activities. While some differences are expected, due to historical and social contexts, other differences are attributable to Wikipedia editors. The implications of such differences are discussed having Wikipedia contribution policies in mind. We hope that the present work will contribute to increased awareness about, first, gender issues in the content of Wikipedia, and second, the different levels on which gender biases can manifest on the Web.

preprint, 2015, arXiv

Computational fact checking from knowledge networks

Traditional fact checking by expert journalists cannot keep up with the enormous volume of information that is now generated online. Computational fact checking may significantly enhance our ability to evaluate the veracity of dubious information. Here we show that the complexities of human fact checking can be approximated quite well by finding the shortest path between concept nodes under properly defined semantic proximity metrics on knowledge graphs. Framed as a network problem this approach is feasible with efficient computational techniques. We evaluate this approach by examining tens of thousands of claims related to history, entertainment, geography, and biographical information using a public knowledge graph extracted from Wikipedia. Statements independently known to be true consistently receive higher support via our method than do false ones. These findings represent a significant step toward scalable computational fact-checking methods that may one day mitigate the spread of harmful misinformation.
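The shortest-path idea in this abstract can be illustrated with a small sketch. The abstract does not spell out the proximity metric, so the one used here is an assumption: a path costs the sum of log(degree) over its intermediate nodes, which penalizes paths through generic, highly connected concepts, and the score is 1 / (1 + cost). The names `build_graph`, `truth_score`, and the toy graph `kg` are hypothetical and do not come from the paper.

```python
import heapq
import math


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def truth_score(graph, subject, obj):
    """Return a proximity score in [0, 1] between two concept nodes.

    Assumed metric (not taken from the abstract): the cost of a path is the
    sum of log(degree) over its intermediate nodes, and the score is
    1 / (1 + cost of the cheapest path). Returns 0.0 if no path exists.
    """
    if subject not in graph or obj not in graph:
        return 0.0
    if subject == obj:
        return 1.0
    best = {subject: 0.0}
    heap = [(0.0, subject)]
    while heap:  # Dijkstra's algorithm with node-based costs
        cost, node = heapq.heappop(heap)
        if node == obj:
            return 1.0 / (1.0 + cost)
        if cost > best[node]:
            continue  # stale heap entry
        for nxt in graph[node]:
            # The endpoints are free; only intermediate nodes add cost.
            step = 0.0 if nxt == obj else math.log(len(graph[nxt]))
            new_cost = cost + step
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return 0.0


# Toy knowledge graph; "City" is a generic hub, "Italy" a specific concept.
kg = build_graph([
    ("Rome", "Italy"), ("Italy", "Europe"), ("France", "Europe"),
    ("Paris", "France"), ("Rome", "City"), ("Paris", "City"),
    ("Tokyo", "City"), ("Mars", "Phobos"),
])
```

In this toy graph a directly linked pair such as Rome and Italy scores 1.0, and a pair connected only through the generic hub "City" scores lower than a pair connected through a more specific concept.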

preprint, 2015, arXiv

First Women, Second Sex: Gender Bias in Wikipedia

Contributing to history has never been as easy as it is today. Anyone with access to the Web is able to play a part on Wikipedia, an open and free encyclopedia. Wikipedia, available in many languages, is one of the most visited websites in the world and arguably one of the primary sources of knowledge on the Web. However, not everyone is contributing to Wikipedia from a diversity point of view; several groups are severely underrepresented. One of those groups is women, who make up approximately 16% of the current contributor community, meaning that most of the content is written by men. In addition, although there are specific guidelines of verifiability, notability, and neutral point of view that must be adhered to by Wikipedia content, these guidelines are supervised and enforced by men. In this paper, we propose that gender bias is not about participation and representation only, but also about characterization of women. We approach the analysis of gender bias by defining a methodology for comparing the characterizations of men and women in biographies in three aspects: meta-data, language, and network structure. Our results show that, indeed, there are differences in characterization and structure. Some of these differences are reflected from the off-line world documented by Wikipedia, but other differences can be attributed to gender bias in Wikipedia content. We contextualize these differences in feminist theory and discuss their implications for Wikipedia policy.

preprint, 2015, arXiv

Measuring Online Social Bubbles

Social media have quickly become a prevalent channel to access information, spread ideas, and influence opinions. However, it has been suggested that social and algorithmic filtering may cause exposure to less diverse points of view, and even foster polarization and misinformation. Here we explore and validate this hypothesis quantitatively for the first time, at the collective and individual levels, by mining three massive datasets of web traffic, search logs, and Twitter posts. Our analysis shows that collectively, people access information from a significantly narrower spectrum of sources through social media and email, compared to search. The significance of this finding for individual exposure is revealed by investigating the relationship between the diversity of information sources experienced by users at the collective and individual level. There is a strong correlation between collective and individual diversity, supporting the notion that when we use social media we find ourselves inside "social bubbles". Our results could lead to a deeper understanding of how technology biases our exposure to new information.

preprint, 2014, arXiv

Connecting Dream Networks Across Cultures

Many species dream, yet there remain many open research questions in the study of dreams. The symbolism of dreams and their interpretation is present in cultures throughout history. Analysis of online data sources for dream interpretation using network science leads to understanding symbolism in dreams and their associated meaning. In this study, we introduce dream interpretation networks for English, Chinese and Arabic that represent different cultures from various parts of the world. We analyze communities in these networks, finding that symbols within a community are semantically related. The central nodes in communities give insight about cultures and symbols in dreams. The community structure of different networks highlights cultural similarities and differences. Interconnections between different networks are also identified by translating symbols from different languages into English. Structural correlations across networks point out relationships between cultures. Similarities between network communities are also investigated by analysis of sentiment in symbol interpretations. We find that interpretations within a community tend to have similar sentiment. Furthermore, we cluster communities based on their sentiment, yielding three main categories of positive, negative, and neutral dream symbols.

preprint, 2014, arXiv

Evolution of Online User Behavior During a Social Upheaval

Social media represent powerful tools of mass communication and information diffusion. They played a pivotal role during recent social uprisings and political mobilizations across the world. Here we present a study of the Gezi Park movement in Turkey through the lens of Twitter. We analyze over 2.3 million tweets produced during the 25 days of protest that occurred between May and June 2013. We first characterize the spatio-temporal nature of the conversation about the Gezi Park demonstrations, showing that similarity in trends of discussion mirrors geographic cues. We then describe the characteristics of the users involved in this conversation and what roles they played. We study how roles and individual influence evolved during the period of the upheaval. This analysis reveals that the conversation becomes more democratic as events unfold, with a redistribution of influence over time in the user population. We conclude by observing how the online and offline worlds are tightly intertwined, showing that exogenous events, such as political speeches or police actions, affect social media conversations and trigger changes in individual behavior.

preprint, 2014, arXiv

Fast filtering and animation of large dynamic networks

Detecting and visualizing the most relevant changes in an evolving network is an open challenge in several domains. We present a fast algorithm that filters subsets of the strongest nodes and edges representing an evolving weighted graph and visualizes it by either creating a movie or streaming it to an interactive network visualization tool. The algorithm is an approximation of an exponential sliding time window that scales linearly with the number of interactions. We compare the algorithm against rectangular and exponential sliding time-window methods. Our network filtering algorithm: i) captures persistent trends in the structure of dynamic weighted networks, ii) smooths transitions between the snapshots of a dynamic network, and iii) uses limited memory and processor time. The algorithm is publicly available as open-source software.
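For orientation, a plain exponential sliding time window over edge weights might look like the sketch below. This is the baseline the abstract compares against, not the authors' approximation, whose details the abstract does not give. The class name `DecayingEdgeFilter`, the time constant `tau`, and the assumption that interactions arrive in non-decreasing time order are all illustrative choices.

```python
import math


class DecayingEdgeFilter:
    """Keep an exponentially decayed weight per undirected edge.

    Assumptions (illustrative only): timestamps arrive in non-decreasing
    order, and `tau` is the decay time constant in the same units.
    """

    def __init__(self, tau):
        self.tau = tau
        self._edges = {}  # (u, v) -> (score, time of last update)

    def add(self, u, v, t, weight=1.0):
        """Record one interaction between u and v at time t."""
        key = (u, v) if u <= v else (v, u)
        score, last = self._edges.get(key, (0.0, t))
        # Decay lazily: only the touched edge is updated, so the total
        # work grows linearly with the number of interactions.
        decayed = score * math.exp(-(t - last) / self.tau)
        self._edges[key] = (decayed + weight, t)

    def strongest(self, now, k):
        """Return the k edges with the highest decayed score at time `now`."""
        scored = [
            (score * math.exp(-(now - last) / self.tau), edge)
            for edge, (score, last) in self._edges.items()
        ]
        scored.sort(reverse=True)
        return [(edge, score) for score, edge in scored[:k]]


# Example: an old, repeated interaction loses to a single recent one.
window = DecayingEdgeFilter(tau=10.0)
window.add("a", "b", t=0)
window.add("b", "a", t=1)
window.add("c", "d", t=100)
top_edges = window.strongest(now=100, k=1)
```

Calling `strongest` at successive times yields the filtered snapshots that a movie or a streaming visualization would draw.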

preprint, 2014, arXiv

Predicting Successful Memes using Network and Community Structure

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

preprint, 2014, arXiv

Quality versus quantity in scientific impact

Citation metrics are becoming pervasive in the quantitative evaluation of scholars, journals and institutions. More than ever before, hiring, promotion, and funding decisions rely on a variety of impact metrics that cannot disentangle quality from quantity of scientific output, and are biased by factors such as discipline and academic age. Biases affecting the evaluation of single papers are compounded when one aggregates citation-based metrics across an entire publication record. It is not trivial to compare the quality of two scholars that during their careers have published at different rates in different disciplines in different periods of time. We propose a novel solution based on the generation of a statistical baseline specifically tailored to the academic profile of each researcher. Our method can decouple the roles of quantity and quality of publications to explain how a certain level of impact is achieved. The method is flexible enough to allow for the evaluation of, and fair comparison among, arbitrary collections of papers (scholar publication records, journals, and entire institutions), and can be extended to simultaneously suppress any source of bias. We show that our method can capture the quality of the work of Nobel laureates irrespective of number of publications, academic age, and discipline, even when traditional metrics indicate low impact in absolute terms. We further apply our methodology to almost a million scholars and over six thousand journals to measure the impact that cannot be explained by the volume of publications alone.

preprint, 2014, arXiv

The production of information in the attention economy

Online traces of human activity offer novel opportunities to study the dynamics of complex knowledge exchange networks, and in particular how the relationship between demand and supply of information is mediated by competition for our limited individual attention. The emergent patterns of collective attention determine what new information is generated and consumed. Can we measure the relationship between demand and supply for new information about a topic? Here we propose a normalization method to compare attention burst statistics across topics that have a heterogeneous distribution of attention. Through analysis of a massive dataset on traffic to Wikipedia, we find that the production of new knowledge is associated with significant shifts of collective attention, which we take as a proxy for its demand. What we observe is consistent with a scenario in which the allocation of attention toward a topic stimulates the demand for information about it, and in turn the supply of further novel information. Our attempt to quantify demand and supply of information, and our finding about their temporal ordering, may lead to the development of the fundamental laws of the attention economy, and a better understanding of the social exchange of knowledge in online and offline information networks.

preprint, 2014, arXiv

Topicality and Social Impact: Diverse Messages but Focused Messengers

Are users who comment on a variety of matters more likely to achieve high influence than those who delve into one focused field? Do general Twitter hashtags, such as #lol, tend to be more popular than novel ones, such as #instantlyinlove? Questions like these demand a way to detect topics hidden behind messages associated with an individual or a hashtag, and a gauge of similarity among these topics. Here we develop such an approach to identify clusters of similar hashtags by detecting communities in the hashtag co-occurrence network. Then the topical diversity of a user's interests is quantified by the entropy of her hashtags across different topic clusters. A similar measure is applied to hashtags, based on co-occurring tags. We find that high topical diversity of early adopters or co-occurring tags implies high future popularity of hashtags. In contrast, low diversity helps an individual accumulate social influence. In short, diverse messages and focused messengers are more likely to gain impact.
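The entropy-based diversity measure described in this abstract can be sketched in a few lines. The sketch assumes the hashtag-to-cluster mapping has already been produced by community detection on the co-occurrence network. The function name `topical_diversity`, the sample mapping `cluster_of`, and the choice of base-2 logarithm are illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter


def topical_diversity(hashtags, cluster_of):
    """Shannon entropy (in bits) of a user's hashtags across topic clusters.

    `cluster_of` maps each hashtag to a topic cluster label; hashtags
    missing from the mapping are ignored. Returns 0.0 for an empty input.
    """
    counts = Counter(cluster_of[tag] for tag in hashtags if tag in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Equivalent to -sum(p * log2(p)) with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical cluster assignment for a handful of hashtags.
cluster_of = {
    "#nba": "sports", "#lakers": "sports",
    "#election": "politics", "#oscars": "film",
}

focused_user = ["#nba", "#nba", "#lakers"]
diverse_user = ["#nba", "#election", "#oscars", "#lakers"]
```

A user whose hashtags all fall in one cluster gets an entropy of zero, and the value grows as usage spreads more evenly over more clusters. The same function applies to a hashtag if its co-occurring tags are passed in place of a user's hashtags.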

preprint, 2013, arXiv

The Digital Evolution of Occupy Wall Street

We examine the temporal evolution of digital communication activity relating to the American anti-capitalist movement Occupy Wall Street. Using a high-volume sample from the microblogging site Twitter, we investigate changes in Occupy participant engagement, interests, and social connectivity over a fifteen-month period starting three months prior to the movement's first protest action. The results of this analysis indicate that, on Twitter, the Occupy movement tended to elicit participation from a set of highly interconnected users with pre-existing interests in domestic politics and foreign social movements. These users, while highly vocal in the months immediately following the birth of the movement, appear to have lost interest in Occupy-related communication over the remainder of the study period.

preprint, 2013, arXiv

The Geospatial Characteristics of a Social Movement Communication Network

Social movements rely in large measure on networked communication technologies to organize and disseminate information relating to the movements' objectives. In this work we seek to understand how the goals and needs of a protest movement are reflected in the geographic patterns of its communication network, and how these patterns differ from those of stable political communication. To this end, we examine an online communication network reconstructed from over 600,000 tweets from a thirty-six week period covering the birth and maturation of the American anticapitalist movement, Occupy Wall Street. We find that, compared to a network of stable domestic political communication, the Occupy Wall Street network exhibits higher levels of locality and a hub and spoke structure, in which the majority of non-local attention is allocated to high-profile locations such as New York, California, and Washington D.C. Moreover, we observe that information flows across state boundaries are more likely to contain framing language and references to the media, while communication among individuals in the same state is more likely to reference protest action and specific places and times. Tying these results to social movement theory, we propose that these features reflect the movement's efforts to mobilize resources at the local level and to develop narrative frames that reinforce collective purpose at the national level.

preprint 2013 arXiv

The Role of Information Diffusion in the Evolution of Social Networks

Every day millions of users are connected through online social networks, generating a rich trove of data that allows us to study the mechanisms behind human interactions. Triadic closure has been treated as the major mechanism for creating social links: if Alice follows Bob and Bob follows Charlie, Alice will follow Charlie. Here we present an analysis of longitudinal micro-blogging data, revealing a more nuanced view of the strategies employed by users when expanding their social circles. While the network structure affects the spread of information among users, the network is in turn shaped by this communication activity. This suggests a link creation mechanism whereby Alice is more likely to follow Charlie after seeing many messages by Charlie. We characterize users with a set of parameters associated with different link creation strategies, estimated by a Maximum-Likelihood approach. Triadic closure does have a strong effect on link formation, but shortcuts based on traffic are another key factor in interpreting network evolution. However, individual strategies for following other users are highly heterogeneous. Link creation behaviors can be summarized by classifying users in different categories with distinct structural and behavioral characteristics. Users who are popular, active, and influential tend to create traffic-based shortcuts, making the information diffusion process more efficient in the network.

preprint 2013 arXiv

Universality of scholarly impact metrics

Given the growing use of impact metrics in the evaluation of scholars, journals, academic institutions, and even countries, there is a critical need for means to compare scientific impact across disciplinary boundaries. Unfortunately, citation-based metrics are strongly biased by diverse field sizes and publication and citation practices. As a result, we have witnessed an explosion in the number of newly proposed metrics that claim to be "universal." However, there is currently no way to objectively assess whether a normalized metric can actually compensate for disciplinary bias. We introduce a new method to assess the universality of any scholarly impact metric, and apply it to evaluate a number of established metrics. We also define a very simple new metric h_s, which proves to be universal, thus allowing the impact of scholars to be compared across scientific disciplines. These results move us closer to a formal methodology in the measure of scholarly impact.

preprint 2013 arXiv

Virality Prediction and Community Structure in Social Networks

How does network structure affect diffusion? Recent studies suggest that the answer depends on the type of contagion. Complex contagions, unlike infectious diseases (simple contagions), are affected by social reinforcement and homophily. Hence, the spread within highly clustered communities is enhanced, while diffusion across communities is hampered. A common hypothesis is that memes and behaviors are complex contagions. We show that, while most memes indeed behave like complex contagions, a few viral memes spread across many communities, like diseases. We demonstrate that the future popularity of a meme can be predicted by quantifying its early spreading pattern in terms of community concentration. The more communities a meme permeates, the more viral it is. We present a practical method to translate data about community structure into predictive knowledge about what information will spread widely. This connection may lead to significant advances in computational social science, social media analytics, and marketing applications.
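
As a rough illustration of "community concentration" in early spreading, the sketch below counts how many communities a meme's early adopters reach and what share of them sits in the largest one. This is only one plausible reading of the idea; the function name, the `community_of` mapping, and the toy data are assumptions for illustration, not the paper's actual features or predictor.

```python
from collections import Counter


def community_concentration(early_adopters, community_of):
    """Summarize how concentrated a meme's early adopters are.

    early_adopters: user ids of the first adopters of a meme.
    community_of: hypothetical mapping from user id to community id,
                  e.g. the output of any community detection method.
    Returns (number of communities reached, share in the largest one).
    """
    counts = Counter(community_of[u] for u in early_adopters if u in community_of)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    dominant_share = max(counts.values()) / total
    return len(counts), dominant_share


# Toy community assignment (an assumption for this example only).
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2}

trapped_meme = ["a", "b", "c"]   # all early adopters in one community
spreading_meme = ["a", "d", "e"]  # early adopters in three communities

print(community_concentration(trapped_meme, community_of))
print(community_concentration(spreading_meme, community_of))
```

Under the abstract's finding, a meme whose early adopters span more communities, with a lower dominant share, would be expected to be the more viral of the two.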

preprint 2012 arXiv

Context Visualization for Social Bookmark Management

We present the design of a new social bookmark manager, named GalViz, as part of the interface of the GiveALink system. Unlike the interfaces of traditional social tagging tools, which usually display information in a list view, GalViz visualizes tags, resources, social links, and social context in an interactive network, combined with the tag cloud. Evaluations through a scenario case study and log analysis provide evidence of the effectiveness of our design.

preprint 2012 arXiv

Partisan Asymmetries in Online Political Activity

We examine partisan differences in the behavior, communication patterns and social interactions of more than 18,000 politically-active Twitter users to produce evidence that points to changing levels of partisan engagement with the American online political landscape. Analysis of a network defined by the communication activity of these users in proximity to the 2010 midterm congressional elections reveals a highly segregated, well clustered partisan community structure. Using cluster membership as a high-fidelity (87% accuracy) proxy for political affiliation, we characterize a wide range of differences in the behavior, communication and social connectivity of left- and right-leaning Twitter users. We find that in contrast to the online political dynamics of the 2008 campaign, right-leaning Twitter users exhibit greater levels of political activity, a more tightly interconnected social structure, and a communication network topology that facilitates the rapid and broad dissemination of political information.

preprint 2012 arXiv

Social Dynamics of Science

The birth and decline of disciplines are critical to science and society. However, no quantitative model to date allows us to validate competing theories of whether the emergence of scientific disciplines drives or follows the formation of social communities of scholars. Here we propose an agent-based model based on a social dynamics of science, in which the evolution of disciplines is guided mainly by the social interactions among scientists. We find that such a social theory can account for a number of stylized facts about the relationships between disciplines, authors, and publications. These results provide strong quantitative support for the key role of social interactions in shaping the dynamics of science. A "science of science" must gauge the role of exogenous events, such as scientific discoveries and technological advances, against this purely social baseline.

preprint 2012 arXiv

Visualizing Communication on Social Media: Making Big Data Accessible

The broad adoption of the web as a communication medium has made it possible to study social behavior at a new scale. With social media networks such as Twitter, we can collect large data sets of online discourse. Social science researchers and journalists, however, may not have tools available to make sense of large amounts of data or of the structure of large social networks. In this paper, we describe our recent extensions to Truthy, a system for collecting and analyzing political discourse on Twitter. We introduce several new analytical perspectives on online discourse with the goal of facilitating collaboration between individuals in the computational and social sciences. The design decisions described in this article are motivated by real-world use cases developed in collaboration with colleagues at the Indiana University School of Journalism.

preprint 2010 arXiv

Agents, Bookmarks and Clicks: A topical model of Web traffic

Analysis of aggregate and individual Web traffic has shown that PageRank is a poor model of how people navigate the Web. Using the empirical traffic patterns generated by a thousand users, we characterize several properties of Web traffic that cannot be reproduced by Markovian models. We examine both aggregate statistics capturing collective behavior, such as page and link traffic, and individual statistics, such as entropy and session size. No model currently explains all of these empirical observations simultaneously. We show that all of these traffic patterns can be explained by an agent-based model that takes into account several realistic browsing behaviors. First, agents maintain individual lists of bookmarks (a non-Markovian memory mechanism) that are used as teleportation targets. Second, agents can retreat along visited links, a branching mechanism that also allows us to reproduce behaviors such as the use of a back button and tabbed browsing. Finally, agents are sustained by visiting novel pages of topical interest, with adjacent pages being more topically related to each other than distant ones. This modulates the probability that an agent continues to browse or starts a new session, allowing us to recreate heterogeneous session lengths. The resulting model is capable of reproducing the collective and individual behaviors we observe in the empirical data, reconciling the narrowly focused browsing patterns of individual users with the extreme heterogeneity of aggregate traffic measurements. This result allows us to identify a few salient features that are necessary and sufficient to interpret the browsing patterns observed in our data. In addition to the descriptive and explanatory power of such a model, our results may lead the way to more sophisticated, realistic, and effective ranking and crawling algorithms.

preprint 2010 arXiv

Characterizing and modeling the dynamics of online popularity

Online popularity has enormous impact on opinions, culture, policy, and profits. We provide a quantitative, large scale, temporal analysis of the dynamics of online content popularity in two massive model systems, the Wikipedia and an entire country's Web space. We find that the dynamics of popularity are characterized by bursts, displaying characteristic features of critical systems such as fat-tailed distributions of magnitude and inter-event time. We propose a minimal model combining the classic preferential popularity increase mechanism with the occurrence of random popularity shifts due to exogenous factors. The model recovers the critical features observed in the empirical analysis of the systems analyzed here, highlighting the key factors needed in the description of popularity dynamics.

preprint 2010 arXiv

Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams

Online social media are complementing and in some cases replacing person-to-person social interaction and redefining the diffusion of information. In particular, microblogs have become crucial grounds on which public relations, marketing, and political battles are fought. We introduce an extensible framework that will enable the real-time analysis of meme diffusion in social media by mining, visualizing, mapping, classifying, and modeling massive streams of public microblogging events. We describe a Web service that leverages this framework to track political memes in Twitter and help detect astroturfing, smear campaigns, and other misinformation in the context of U.S. political elections. We present some cases of abusive behaviors uncovered by our service. Finally, we discuss promising preliminary results on the detection of suspicious memes via supervised learning based on features extracted from the topology of the diffusion networks, sentiment analysis, and crowdsourced annotations.

preprint 2010 arXiv

Folks in Folksonomies: Social Link Prediction from Shared Metadata

Web 2.0 applications have attracted a considerable amount of attention because their open-ended nature allows users to create light-weight semantic scaffolding to organize and share content. To date, the interplay of the social and semantic components of social media has been only partially explored. Here we focus on Flickr and Last.fm, two social media systems in which we can relate the tagging activity of the users with an explicit representation of their social network. We show that a substantial level of local lexical and topical alignment is observable among users who lie close to each other in the social network. We introduce a null model that preserves user activity while removing local correlations, allowing us to disentangle the actual local alignment between users from statistical effects due to the assortative mixing of user activity and centrality in the social network. This analysis suggests that users with similar topical interests are more likely to be friends, and therefore semantic similarity measures among users based solely on their annotation metadata should be predictive of social links. We test this hypothesis on the Last.fm data set, confirming that the social network constructed from semantic similarity captures actual friendship more accurately than Last.fm's suggestions based on listening patterns.

preprint 2010 arXiv

What's in a Session: Tracking Individual Behavior on the Web

We examine the properties of all HTTP requests generated by a thousand undergraduates over a span of two months. Preserving user identity in the data set allows us to discover novel properties of Web traffic that directly affect models of hypertext navigation. We find that the popularity of Web sites -- the number of users who contribute to their traffic -- lacks any intrinsic mean and may be unbounded. Further, many aspects of the browsing behavior of individual users can be approximated by log-normal distributions even though their aggregate behavior is scale-free. Finally, we show that users' click streams cannot be cleanly segmented into sessions using timeouts, affecting any attempt to model hypertext navigation using statistics of individual sessions. We propose a strictly logical definition of sessions based on browsing activity as revealed by referrer URLs; a user may have several active sessions in their click stream at any one time. We demonstrate that applying a timeout to these logical sessions affects their statistics to a lesser extent than a purely timeout-based mechanism.
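
The referrer-based notion of a logical session can be sketched as follows. This is a simplified interpretation under stated assumptions, not the paper's procedure: a request with no referrer, or with a referrer the user has not visited, opens a new session, and any other request joins the session in which its referrer was most recently seen. The function name and the example URLs are hypothetical.

```python
def assign_sessions(requests):
    """Assign a logical session id to each request in one user's click stream.

    requests: list of (url, referrer) pairs in time order; referrer is None
              when the request carries no referrer.
    Returns a list of session ids parallel to `requests`.
    """
    session_of_url = {}  # most recent session in which each URL was visited
    session_ids = []
    next_id = 0
    for url, referrer in requests:
        if referrer is not None and referrer in session_of_url:
            # Continue the session that the referring page belongs to.
            sid = session_of_url[referrer]
        else:
            # Assumption: no usable referrer means a new session starts.
            sid = next_id
            next_id += 1
        session_of_url[url] = sid
        session_ids.append(sid)
    return session_ids


# Two interleaved browsing threads, e.g. in two tabs (made-up URLs).
clicks = [
    ("news.example/home", None),
    ("mail.example/inbox", None),
    ("news.example/story", "news.example/home"),
    ("mail.example/msg1", "mail.example/inbox"),
    ("news.example/story2", "news.example/story"),
]

print(assign_sessions(clicks))
```

Because sessions are keyed on referrers rather than on time gaps, the two interleaved threads stay separate, which matches the abstract's point that a user may have several active sessions at once.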

preprint 2009 arXiv

Remembering what we like: Toward an agent-based model of Web traffic

Analysis of aggregate Web traffic has shown that PageRank is a poor model of how people actually navigate the Web. Using the empirical traffic patterns generated by a thousand users over the course of two months, we characterize the properties of Web traffic that cannot be reproduced by Markovian models, in which destinations are independent of past decisions. In particular, we show that the diversity of sites visited by individual users is smaller and more broadly distributed than predicted by the PageRank model; that link traffic is more broadly distributed than predicted; and that the time between consecutive visits to the same site by a user is less broadly distributed than predicted. To account for these discrepancies, we introduce a more realistic navigation model in which agents maintain individual lists of bookmarks that are used as teleportation targets. The model can also account for branching, a traffic property caused by browser features such as tabs and the back button. The model reproduces aggregate traffic patterns such as site popularity, while also generating more accurate predictions of diversity, link traffic, and return time distributions. This model for the first time allows us to capture the extreme heterogeneity of aggregate traffic measurements while explaining the more narrowly focused browsing patterns of individual users.

42 published item(s)

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

Social Bots: Detection and Challenges

Botometer 101: Social bot practicum for computational social scientists

Can crowdsourcing rescue the social marketplace of ideas?

Manipulating Twitter Through Deletions

Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal

Right and left, partisanship predicts (asymmetric) vulnerability to misinformation

Exposure to Social Engagement Metrics Increases Vulnerability to Misinformation

How Twitter Data Sampling Biases U.S. Voter Behavior Characterizations

Unveiling Coordinated Groups Behind White Helmets Disinformation

Bot Electioneering Volume: Visualizing Social Bot Activity During Elections

Scalable and Generalizable Social Bot Detection through Data Selection

Hoaxy: A Platform for Tracking Online Misinformation

Kinsey Reporter: Citizen Science for Sex Research

On the influence of social bots in online protests. Preliminary findings of a Mexican case study

Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data

Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia

Computational fact checking from knowledge networks

First Women, Second Sex: Gender Bias in Wikipedia

Measuring Online Social Bubbles

Connecting Dream Networks Across Cultures

Evolution of Online User Behavior During a Social Upheaval

Fast filtering and animation of large dynamic networks

Predicting Successful Memes using Network and Community Structure

Quality versus quantity in scientific impact

The production of information in the attention economy

Topicality and Social Impact: Diverse Messages but Focused Messengers

The Digital Evolution of Occupy Wall Street

The Geospatial Characteristics of a Social Movement Communication Network

The Role of Information Diffusion in the Evolution of Social Networks

Universality of scholarly impact metrics

Virality Prediction and Community Structure in Social Networks

Context Visualization for Social Bookmark Management

Partisan Asymmetries in Online Political Activity

Social Dynamics of Science

Visualizing Communication on Social Media: Making Big Data Accessible

Agents, Bookmarks and Clicks: A topical model of Web traffic

Characterizing and modeling the dynamics of online popularity

Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams

Folks in Folksonomies: Social Link Prediction from Shared Metadata

What's in a Session: Tracking Individual Behavior on the Web

Remembering what we like: Toward an agent-based model of Web traffic