Source author record

Kristina Lerman

Kristina Lerman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

66 works

27 topics

4 close collaborators

preprint · 2022 · arXiv

A Survey on Bias and Fairness in Machine Learning

With the widespread use of AI systems and applications in our everyday lives, it is important to take fairness issues into consideration while designing and engineering these types of systems. Such systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that the decisions do not reflect discriminatory behavior toward certain groups or populations. We have recently seen work in machine learning, natural language processing, and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming aware of the biases that these applications can contain and have attempted to address them. In this survey we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. In addition to that, we examined different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and how they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We are hoping that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.

preprint · 2022 · arXiv

Emergent Instabilities in Algorithmic Feedback Loops

Algorithms that aid human tasks, such as recommendation systems, are ubiquitous. They appear in everything from social media to streaming videos to online shopping. However, the feedback loop between people and algorithms is poorly understood and can amplify cognitive and social biases (algorithmic confounding), leading to unexpected outcomes. In this work, we explore algorithmic confounding in collaborative filtering-based recommendation algorithms through teacher-student learning simulations. Namely, a student collaborative filtering-based model, trained on simulated choices, is used by the recommendation algorithm to recommend items to agents. Agents might choose some of these items, according to an underlying teacher model, with new choices then fed back into the student model as new training data (approximating online machine learning). These simulations demonstrate how algorithmic confounding produces erroneous recommendations which in turn lead to instability, i.e., wide variations in an item's popularity between each simulation realization. We use the simulations to demonstrate a novel approach to training collaborative filtering models that can create more stable and accurate recommendations. Our methodology is general enough that it can be extended to other socio-technical systems in order to better quantify and improve the stability of algorithms. These results highlight the need to account for emergent behaviors from interactions between people and algorithms.

preprint · 2022 · arXiv

Infusing Knowledge from Wikipedia to Enhance Stance Detection

Stance detection infers a text author's attitude towards a target. This is challenging when the model lacks background knowledge about the target. Here, we show how background knowledge from Wikipedia can help enhance the performance on stance detection. We introduce Wikipedia Stance Detection BERT (WS-BERT) that infuses the knowledge into stance encoding. Extensive results on three benchmark datasets covering social media discussions and online debates indicate that our model significantly outperforms the state-of-the-art methods on target-specific stance detection, cross-target stance detection, and zero/few-shot stance detection.

preprint · 2022 · arXiv

Road Network Evolution in the Urban and Rural United States Since 1900

Road networks represent a key component of human settlements, such as cities, towns, and villages, that mediate pollution and congestion, as well as economic development. However, little is known about the long-term development trajectories of road networks in rural and urban settings. We leverage novel spatial data sources to reconstruct and analyze road networks in more than 850 US cities and over 2,500 US counties since 1900. Our analysis reveals significant variations in the structure of roads both within cities and across the conterminous US. Despite differences in the evolution of these networks, there are commonalities and strong geographic patterns. These results persist across the rural-urban continuum and are therefore not just a product of accelerated urban growth. These findings refine and extend existing knowledge and illuminate the need for policies for urban and rural planning including the critical assessment of new development trends.

preprint · 2021 · arXiv

A Model of Densifying Collaboration Networks

Research collaborations provide the foundation for scientific advances, but we have only recently begun to understand how they form and grow on a global scale. Here we analyze a model of the growth of research collaboration networks to explain the empirical observations that the number of collaborations scales superlinearly with institution size, though at different rates (heterogeneous densification), the number of institutions grows as a power of the number of researchers (Heaps' law) and institution sizes approximate Zipf's law. This model has three mechanisms: (i) researchers are preferentially hired by large institutions, (ii) new institutions trigger more potential institutions, and (iii) researchers collaborate with friends-of-friends. We show agreement between these assumptions and empirical data, through analysis of co-authorship networks spanning two centuries. We then develop a theoretical understanding of this model, which reveals emergent heterogeneous scaling such that the number of collaborations between institutions scales with an institution's size.

preprint · 2021 · arXiv

Individualized Context-Aware Tensor Factorization for Online Games Predictions

Individual behavior and decisions are substantially influenced by their contexts, such as location, environment, and time. Changes along these dimensions can be readily observed in Multiplayer Online Battle Arena (MOBA) games, where players face different in-game settings for each match and are subject to frequent game patches. Existing methods utilizing contextual information generalize the effect of a context over the entire population, but contextual information tailored to each individual can be more effective. To achieve this, we present the Neural Individualized Context-aware Embeddings (NICE) model for predicting user performance and game outcomes. Our proposed method identifies individual behavioral differences in different contexts by learning latent representations of users and contexts through non-negative tensor factorization. Using a dataset from the MOBA game League of Legends, we demonstrate that our model substantially improves the prediction of winning outcome, individual user performance, and user engagement.

preprint · 2021 · arXiv

The Emergence of Heterogeneous Scaling in Research Institutions

Research institutions provide the infrastructure for scientific discovery, yet their role in the production of knowledge is not well characterized. To address this gap, we analyze interactions of researchers within and between institutions from millions of scientific papers. Our analysis reveals that the number of collaborations scales superlinearly with institution size, though at different rates (heterogeneous densification). We also find that the number of institutions scales with the number of researchers as a power law (Heaps' law) and institution sizes approximate Zipf's law. These patterns can be reproduced by a simple model with three mechanisms: (i) researchers collaborate with friends-of-friends, (ii) new institutions trigger more potential institutions, and (iii) researchers are preferentially hired by large institutions. This model reveals an economy of scale in research: larger institutions grow faster and amplify collaborations. Our work provides a new understanding of emergent behavior in research institutions and how they facilitate innovation.
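
As an illustration of what "scales superlinearly" means in the abstract above, the following is a minimal sketch that estimates a power-law exponent by least squares on log-transformed data. The function name `fit_power_law_exponent`, the synthetic institution sizes, and the exponent 1.2 are assumptions chosen for the example; they are not taken from the paper, and the authors' fitting procedure may differ.

```python
import math


def fit_power_law_exponent(sizes, counts):
    """Return the least-squares slope of log(counts) against log(sizes).

    For data following counts = c * sizes**beta, the slope equals beta.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic (hypothetical) data: collaborations grow as size**1.2.
sizes = [10, 100, 1000, 10000]
collaborations = [3.0 * s ** 1.2 for s in sizes]

exponent = fit_power_law_exponent(sizes, collaborations)
print(f"estimated exponent: {exponent:.3f}")
```

An exponent above 1 corresponds to superlinear scaling: collaborations grow faster than institution size.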

preprint · 2020 · arXiv

Challenges in Forecasting Malicious Events from Incomplete Data

The ability to accurately predict cyber-attacks would enable organizations to mitigate their growing threat and avert the financial losses and disruptions they cause. But how predictable are cyber-attacks? Researchers have attempted to combine external data (ranging from vulnerability disclosures to discussions on Twitter and the darkweb) with machine learning algorithms to learn indicators of impending cyber-attacks. However, successful cyber-attacks represent a tiny fraction of all attempted attacks: the vast majority are stopped, or filtered by the security appliances deployed at the target. As we show in this paper, the process of filtering reduces the predictability of cyber-attacks. The small number of attacks that do penetrate the target's defenses follow a different generative process compared to the whole data, which is much harder to learn for predictive models. This could be caused by the fact that the resulting time series also depends on the filtering process in addition to all the different factors that the original time series depended on. We empirically quantify the loss of predictability due to filtering using real-world data from two organizations. Our work identifies the limits to forecasting cyber-attacks from highly filtered data.

preprint · 2020 · arXiv

Learning Behavioral Representations from Wearable Sensors

Continuous collection of physiological data from wearable sensors enables temporal characterization of individual behaviors. Understanding the relation between an individual's behavioral patterns and psychological states can help identify strategies to improve quality of life. One challenge in analyzing physiological data is extracting the underlying behavioral states from the temporal sensor signals and interpreting them. Here, we use a non-parametric Bayesian approach to model sensor data from multiple people and discover the dynamic behaviors they share. We apply this method to data collected from sensors worn by a population of hospital workers and show that the learned states can cluster participants into meaningful groups and better predict their cognitive and psychological states. This method offers a way to learn interpretable compact behavioral representations from multivariate sensor signals.

preprint · 2020 · arXiv

Predictability limit of partially observed systems

Applications from finance to epidemiology and cyber-security require accurate forecasts of dynamic phenomena, which are often only partially observed. We demonstrate that a system's predictability degrades as a function of temporal sampling, regardless of the adopted forecasting model. We quantify the loss of predictability due to sampling, and show that it cannot be recovered by using external signals. We validate the generality of our theoretical findings in real-world partially observed systems representing infectious disease outbreaks, online discussions, and software development projects. On a variety of prediction tasks (forecasting new infections, the popularity of topics in online discussions, or interest in cryptocurrency projects), predictability irrecoverably decays as a function of sampling, unveiling fundamental predictability limits in partially observed systems.

preprint · 2020 · arXiv

Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set

At the time of this writing, the novel coronavirus (COVID-19) pandemic outbreak has already put tremendous strain on many countries' citizens, resources and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. With people forced out of public spaces, much conversation about these phenomena now occurs online, e.g., on social media platforms like Twitter. In this paper, we describe a multilingual coronavirus (COVID-19) Twitter dataset that we have been continuously collecting since January 22, 2020. We are making our dataset available to the research community (https://github.com/echen102/COVID-19-TweetIDs). It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This dataset could also help track scientific coronavirus misinformation and unverified rumors, or enable the understanding of fear and panic, and undoubtedly more. Ultimately, this dataset may contribute towards enabling informed solutions and prescribing targeted policy interventions to fight this global crisis.

preprint · 2020 · arXiv

Unequal Impact and Spatial Aggregation Distort COVID-19 Growth Rates

The COVID-19 pandemic has emerged as a global public health crisis. To make decisions about mitigation strategies and to understand the disease dynamics, policy makers and epidemiologists must know how the disease is spreading in their communities. We analyze confirmed infections and deaths over multiple geographic scales to show that COVID-19's impact is highly unequal: many subregions have nearly zero infections, and others are hot spots. We attribute the effect to a Reed-Hughes-like mechanism in which disease arrives at different times and grows exponentially. Hot spots, however, appear to grow faster than neighboring subregions and dominate spatially aggregated statistics, thereby amplifying growth rates. The staggered spread of COVID-19 can also make aggregated growth rates appear higher even when subregions grow at the same rate. Public policy, economic analysis and epidemic modeling need to account for potential distortions introduced by spatial aggregation.
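
The claim that staggered spread can inflate aggregated growth rates can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that every subregion starts with one case on its arrival day and then multiplies by the same daily factor; the function name `aggregate_cases`, the factor 1.2, and the arrival schedule are hypothetical and not drawn from the paper's data or model.

```python
def aggregate_cases(growth_factor, arrival_days, num_days):
    """Sum cases over subregions that all grow at the same daily factor.

    Each subregion has 1 case on its arrival day and none before it.
    """
    totals = []
    for day in range(num_days):
        total = sum(
            growth_factor ** (day - arrival)
            for arrival in arrival_days
            if arrival <= day
        )
        totals.append(total)
    return totals


growth_factor = 1.2                 # identical in every subregion (assumption)
arrival_days = list(range(10))      # one new subregion is seeded each day
totals = aggregate_cases(growth_factor, arrival_days, num_days=10)

# Day-over-day growth factor of the spatially aggregated series.
aggregate_factors = [later / earlier for earlier, later in zip(totals, totals[1:])]
for day, factor in enumerate(aggregate_factors, start=1):
    print(f"day {day}: aggregate factor {factor:.3f} vs local factor {growth_factor}")
```

In this toy setup every aggregate factor exceeds the local factor because newly seeded subregions keep adding cases on top of local growth.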

preprint · 2019 · arXiv

Friendship Paradox Biases Perceptions in Directed Networks

How popular a topic or an opinion appears to be in a network can be very different from its actual popularity. For example, in an online network of a social media platform, the number of people who mention a topic in their posts (i.e., its global popularity) can be dramatically different from how people see it in their social feeds (i.e., its perceived popularity), where the feeds aggregate their friends' posts. We trace the origin of this discrepancy to the friendship paradox in directed networks, which states that people are less popular than their friends (or followers) are, on average. We identify conditions on network structure that give rise to this perception bias, and validate the findings empirically using data from Twitter. Within messages posted by Twitter users in our sample, we identify topics that appear more frequently within the users' social feeds than they do globally, i.e., among all posts. In addition, we present a polling algorithm that leverages the friendship paradox to obtain a statistically efficient estimate of a topic's global prevalence from biased perceptions of individuals. We characterize the bias of the polling estimate, provide an upper bound for its variance, and validate the algorithm's efficiency through synthetic polling experiments on our Twitter data. Our paper elucidates the non-intuitive ways in which the structure of directed networks can distort social perceptions and resulting behaviors.

preprint · 2016 · arXiv

Assessing the Navigational Effects of Click Biases and Link Insertion on the Web

Websites have an inherent interest in steering user navigation in order to, for example, increase sales of specific products or categories, or to guide users towards specific information. In general, website administrators can use the following two strategies to influence their visitors' navigation behavior. First, they can introduce click biases to reinforce specific links on their website by changing their visual appearance, for example, by locating them on the top of the page. Second, they can utilize link insertion to generate new paths for users to navigate over. In this paper, we present a novel approach for measuring the potential effects of these two strategies on user navigation. Our results suggest that, depending on the pages for which we want to increase user visits, optimal link modification strategies vary. Moreover, simple topological measures can be used as proxies for assessing the impact of the intended changes on the navigation of users, even before these changes are implemented.

preprint · 2016 · arXiv

Attention Inequality in Social Media

Social media can be viewed as a social system where the currency is attention. People post content and interact with others to attract attention and gain new followers. In this paper, we examine the distribution of attention across a large sample of users of the popular social media site Twitter. Through empirical analysis of these data we conclude that attention is very unequally distributed: the top 20% of Twitter users own more than 96% of all followers, 93% of the retweets, and 93% of the mentions. We investigate the mechanisms that lead to attention inequality and find that it results from the "rich-get-richer" and "poor-get-poorer" dynamics of attention diffusion. Namely, users who are "rich" in attention, because they are often mentioned and retweeted, are more likely to gain new followers, while those who are "poor" in attention are likely to lose followers. We develop a phenomenological model that quantifies attention diffusion and network dynamics, and solve it to study how attention inequality grows over time in a dynamic environment of social media.
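
A statistic of the form "the top 20% of users own X% of all followers" can be computed as sketched below. The function name `top_share` and the follower counts are made up for illustration and are not the paper's data; the rounding rule for the size of the top group is likewise an assumption.

```python
def top_share(values, top_fraction):
    """Return the share of the total held by the top `top_fraction` of entries."""
    ranked = sorted(values, reverse=True)
    # Size of the top group, at least one entry (rounding rule is an assumption).
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical follower counts for ten users.
followers = [900, 50, 20, 10, 5, 5, 4, 3, 2, 1]

share = top_share(followers, 0.20)
print(f"top 20% of users hold {share:.0%} of all followers")
```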

preprint · 2016 · arXiv

Disentangling the Effects of Social Signals

Peer recommendation is a crowdsourcing task that leverages the opinions of many to identify interesting content online, such as news, images, or videos. Peer recommendation applications often use social signals, e.g., the number of prior recommendations, to guide people to the more interesting content. How people react to social signals, in combination with content quality and its presentation order, determines the outcomes of peer recommendation, i.e., item popularity. Using Amazon Mechanical Turk, we experimentally measure the effects of social signals in peer recommendation. Specifically, after controlling for variation due to item content and its position, we find that social signals affect item popularity about half as much as position and content do. These effects are somewhat correlated, so social signals exacerbate the "rich get richer" phenomenon, which results in a wider variance of popularity. Further, social signals change individual preferences, creating a "herding" effect that biases people's judgments about the content. Despite this, we find that social signals improve the efficiency of peer recommendation by reducing the effort devoted to evaluating content while maintaining recommendation quality.

preprint · 2016 · arXiv

Emotions, Demographics and Sociability in Twitter Interactions

The social connections people form online affect the quality of information they receive and their online experience. Although a host of socioeconomic and cognitive factors were implicated in the formation of offline social ties, few of them have been empirically validated, particularly in an online setting. In this study, we analyze a large corpus of geo-referenced messages, or tweets, posted by social media users from a major US metropolitan area. We linked these tweets to US Census data through their locations. This allowed us to measure emotions expressed in the tweets posted from an area, the structure of social connections, and also use that area's socioeconomic characteristics in analysis. We find that at an aggregate level, places where social media users engage more deeply with less diverse social contacts are those where they express more negative emotions, like sadness and anger. Demographics also have an impact: these places have residents with lower household income and education levels. Conversely, places where people engage less frequently but with diverse contacts have happier, more positive messages posted from them and also have better educated, younger, more affluent residents. Results suggest that cognitive factors and offline characteristics affect the quality of online interactions. Our work highlights the value of linking social media data to traditional data sources, such as US Census, to drive novel analysis of online behavior.

preprint · 2016 · arXiv

Information is not a Virus, and Other Consequences of Human Cognitive Limits

The many decisions people make about what to pay attention to online shape the spread of information in online social networks. Due to the constraints of available time and cognitive resources, the ease of discovery strongly impacts how people allocate their attention to social media content. As a consequence, the position of information in an individual's social feed, as well as explicit social signals about its popularity, determine whether it will be seen, and the likelihood that it will be shared with followers. Accounting for these cognitive limits simplifies mechanics of information diffusion in online social networks and explains puzzling empirical observations: (i) information generally fails to spread in social media and (ii) highly connected people are less likely to re-share information. Studies of information diffusion on different social media platforms reviewed here suggest that the interplay between human cognitive limits and network structure differentiates the spread of information from other social contagions, such as the spread of a virus through a population.

preprint · 2016 · arXiv

Neighbor-Neighbor Correlations Explain Measurement Bias in Networks

In numerous physical models on networks, dynamics are based on interactions that exclusively involve properties of a node's nearest neighbors. However, a node's local view of its neighbors may systematically bias perceptions of network connectivity or the prevalence of certain traits. We investigate the strong friendship paradox, which occurs when the majority of a node's neighbors have more neighbors than does the node itself. We develop a model to predict the magnitude of the paradox, showing that it is enhanced by negative correlations between degrees of neighboring nodes. We then show that by including neighbor-neighbor correlations, which are degree correlations one step beyond those of neighboring nodes, we accurately predict the impact of the strong friendship paradox in real-world networks. Understanding how the paradox biases local observations can inform better measurements of network structure and our understanding of collective phenomena.
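
The definition of the strong friendship paradox given above (a majority of a node's neighbors have more neighbors than the node itself) can be checked directly on a small graph. The sketch below is illustrative only: the function name `strong_paradox_fraction` and the star graph are assumptions, and it does not implement the paper's predictive model.

```python
def strong_paradox_fraction(adjacency):
    """Fraction of nodes whose neighbors mostly have a higher degree.

    `adjacency` maps each node to the set of its neighbors (undirected graph).
    """
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    affected = 0
    for node, neighbors in adjacency.items():
        if not neighbors:
            continue  # isolated nodes have no neighbors to compare against
        more_connected = sum(1 for nb in neighbors if degree[nb] > degree[node])
        if more_connected > len(neighbors) / 2:
            affected += 1
    return affected / len(adjacency)


# Hypothetical star graph: hub 0 connected to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

fraction = strong_paradox_fraction(star)
print(f"{fraction:.0%} of nodes experience the strong friendship paradox")
```

In the star graph every leaf sees only the hub, which has more neighbors than the leaf does, so the four leaves experience the paradox while the hub does not.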

preprint · 2016 · arXiv

Network Composition from Multi-layer Data

It is common for people to access multiple social networks, for example, using phone, email, and social media. Together, the multi-layer social interactions form an "integrated social network." How can we extend well-developed knowledge about single-layer networks, including vertex centrality and community structure, to such heterogeneous structures? In this paper, we approach these challenges by proposing a principled framework of network composition based on a unified dynamical process. Mathematically, we consider the following abstract problem: Given multi-layer network data and additional parameters for intra- and inter-layer dynamics, construct a (single) weighted network that best integrates the joint process. We use transformations of dynamics to unify heterogeneous layers under a common dynamics. For inter-layer compositions, we will consider several cases as the inter-layer dynamics plays different roles in various social or technological networks. Empirically, we provide examples to highlight the usefulness of this framework for network analysis and network design.

preprint · 2015 · arXiv

Evolution of Conversations in the Age of Email Overload

Email is a ubiquitous communications tool in the workplace and plays an important role in social interactions. Previous studies of email were largely based on surveys and limited to relatively small populations of email users within organizations. In this paper, we report results of a large-scale study of more than 2 million users exchanging 16 billion emails over several months. We quantitatively characterize the replying behavior in conversations within pairs of users. In particular, we study the time it takes the user to reply to a received message and the length of the reply sent. We consider a variety of factors that affect the reply time and length, such as the stage of the conversation, user demographics, and use of portable devices. In addition, we study how increasing load affects emailing behavior. We find that as users receive more email messages in a day, they reply to a smaller fraction of them, using shorter replies. However, their responsiveness remains intact, and they may even reply to emails faster. Finally, we predict the time to reply, length of reply, and whether the reply ends a conversation. We demonstrate considerable improvement over the baseline in all three prediction tasks, showing that the factors we uncover play a significant role in determining replying behavior. We rank these factors based on their predictive power. Our findings have important implications for understanding human behavior and designing better email management applications for tasks like ranking unread emails.

preprint · 2015 · arXiv

Geography of Emotion: Where in a City are People Happier?

Location-sharing services were built upon people's desire to share their activities and locations with others. By "checking-in" to a place, such as a restaurant, a park, gym, or train station, people disclose where they are, thereby providing valuable information about land use and utilization of services in urban areas. This information may, in turn, be used to design smarter, happier, more equitable cities. We use data from the Foursquare location-sharing service to identify areas within a major US metropolitan area with many check-ins, i.e., areas that people like to use. We then use data from the Twitter microblogging platform to analyze the properties of these areas. Specifically, we have extracted a large corpus of geo-tagged messages, called tweets, from a major metropolitan area and linked them to US Census data through their locations. This allows us to measure the sentiment expressed in tweets that are posted from a specific area, and also use that area's demographic properties in analysis. Our results reveal that areas with many check-ins are different from other areas within the metropolitan region. In particular, these areas have happier tweets, which also encourage people from other areas to commute longer distances to these places. These findings shed light on human mobility patterns, as well as how physical environment influences human emotions.

preprint · 2015 · arXiv

Portrait of an Online Shopper: Understanding and Predicting Consumer Behavior

Consumer spending accounts for a large fraction of US economic activity. Increasingly, consumer activity is moving to the web, where digital traces of shopping and purchases provide valuable data about consumer behavior. We analyze these data extracted from emails and combine them with demographic information to characterize, model, and predict consumer behavior. Breaking down purchasing by age and gender, we find that the amount of money spent on online purchases grows sharply with age, peaking in the late 30s. Men are more frequent online purchasers and spend more money when compared to women. Linking online shopping to income, we find that shoppers from more affluent areas purchase more expensive items and buy them more frequently, resulting in significantly more money spent on online purchases. We also look at dynamics of purchasing behavior and observe daily and weekly cycles in purchasing behavior, similarly to other online activities. More specifically, we observe temporal patterns in purchasing behavior suggesting shoppers have finite budgets: the more expensive an item, the longer the shopper waits since the last purchase to buy it. We also observe that shoppers who email each other purchase more similar items than socially unconnected shoppers, and this effect is particularly evident among women. Finally, we build a model to predict when shoppers will make a purchase and how much they will spend on it. We find that temporal features improve prediction accuracy over competitive baselines. A better understanding of consumer behavior can help improve marketing efforts and make online shopping more pleasant and efficient.

preprint · 2015 · arXiv

Structural Properties of Ego Networks

The structure of real-world social networks in large part determines the evolution of social phenomena, including opinion formation, diffusion of information and influence, and the spread of disease. Globally, network structure is characterized by features such as degree distribution, degree assortativity, and clustering coefficient. However, information about global structure is usually not available to each vertex. Instead, each vertex's knowledge is generally limited to the locally observable portion of the network consisting of the subgraph over its immediate neighbors. Such subgraphs, known as ego networks, have properties that can differ substantially from those of the global network. In this paper, we study the structural properties of ego networks and show how they relate to the global properties of networks from which they are derived. Through empirical comparisons and mathematical derivations, we show that structural features, similar to static attributes, suffer from paradoxes. We quantify the differences between global information about network structure and local estimates. This knowledge allows us to better identify and correct the biases arising from incomplete local information.

preprint · 2015 · arXiv

The Interplay Between Dynamics and Networks: Centrality, Communities, and Cheeger Inequality

We study the interplay between a dynamic process and the structure of the network on which it is defined. Specifically, we examine the impact of this interaction on the quality-measure of network clusters and node centrality. This enables us to effectively identify network communities and important nodes participating in the dynamics. As the first step towards this objective, we introduce an umbrella framework for defining and characterizing an ensemble of dynamic processes on a network. This framework generalizes the traditional Laplacian framework to continuous-time biased random walks and also allows us to model some epidemic processes over a network. For each dynamic process in our framework, we can define a function that measures the quality of every subset of nodes as a potential cluster (or community) with respect to this process on a given network. This subset-quality function generalizes the traditional conductance measure for graph partitioning. We partially justify our choice of the quality function by showing that the classic Cheeger's inequality, which relates the conductance of the best cluster in a network with a spectral quantity of its Laplacian matrix, can be extended from the Laplacian-conductance setting to this more general setting.

preprint2015arXiv

The Majority Illusion in Social Networks

Social behaviors are often contagious, spreading through a population as individuals imitate the decisions and choices of others. A variety of global phenomena, from innovation adoption to the emergence of social norms and political movements, arise as a result of people following a simple local rule, such as copy what others are doing. However, individuals often lack global knowledge of the behaviors of others and must estimate them from the observations of their friends' behaviors. In some cases, the structure of the underlying social network can dramatically skew an individual's local observations, making a behavior appear far more common locally than it is globally. We trace the origins of this phenomenon, which we call "the majority illusion," to the friendship paradox in social networks. As a result of this paradox, a behavior that is globally rare may be systematically overrepresented in the local neighborhoods of many people, i.e., among their friends. Thus, the "majority illusion" may facilitate the spread of social contagions in networks and also explain why systematic biases in social perceptions, for example, of risky behavior, arise. Using synthetic and real-world networks, we explore how the "majority illusion" depends on network structure and develop a statistical model to calculate its magnitude in a network.
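
The effect described above can be illustrated on a toy graph. This is a minimal sketch, not the statistical model from the paper: it assumes an undirected graph stored as an adjacency dictionary and counts a node as "seeing a majority" when more than half of its neighbors are active. The function name and the strict one-half threshold are assumptions made for illustration.

```python
def majority_illusion_fraction(adj, active):
    """Fraction of nodes for which more than half of the neighbors are active."""
    fooled = 0
    for node, neighbors in adj.items():
        if not neighbors:
            continue  # isolated nodes observe nothing
        share = sum(1 for n in neighbors if n in active) / len(neighbors)
        if share > 0.5:
            fooled += 1
    return fooled / len(adj)


# Star graph: hub 0 linked to nine leaves; only the hub is active.
adj = {0: list(range(1, 10))}
adj.update({leaf: [0] for leaf in range(1, 10)})
active = {0}

global_prevalence = len(active) / len(adj)
local_majority = majority_illusion_fraction(adj, active)
print(global_prevalence, local_majority)
```

In this star graph only one node in ten is active, yet every leaf observes an all-active neighborhood, because the single active node is also the highest-degree node.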

preprint2015arXiv

User Effort and Network Structure Mediate Access to Information in Networks

Individuals' access to information in a social network depends on how it is distributed and where in the network individuals position themselves. However, individuals have limited capacity to manage their social connections and process information. In this work, we study how this limited capacity and network structure interact to affect the diversity of information social media users receive. Previous studies of the role of networks in information access were limited in their ability to measure the diversity of information. We address this problem by learning the topics of interest to social media users by observing messages they share online with their followers. We present a probabilistic model that incorporates human cognitive constraints in a generative model of information sharing. We then use the topics learned by the model to measure the diversity of information users receive from their social media contacts. We confirm that users in structurally diverse network positions, which bridge otherwise disconnected regions of the follower graph, are exposed to more diverse information. In addition, we identify user effort as an important variable that mediates access to diverse information in social media. Users who invest more effort into their activity on the site not only place themselves in more structurally diverse positions within the network than the less engaged users, but they also receive more diverse information when located in similar network positions. These findings indicate that the relationship between network structure and access to information in networks is more nuanced than previously thought.

preprint2014arXiv

Network Weirdness: Exploring the Origins of Network Paradoxes

Social networks have many counter-intuitive properties, including the "friendship paradox" that states, on average, your friends have more friends than you do. Recently, a variety of other paradoxes were demonstrated in online social networks. This paper explores the origins of these network paradoxes. Specifically, we ask whether they arise from mathematical properties of the networks or whether they have a behavioral origin. We show that sampling from heavy-tailed distributions always gives rise to a paradox in the mean, but not the median. We propose a strong form of network paradoxes, based on utilizing the median, and validate it empirically using data from two online social networks. Specifically, we show that for any user the majority of the user's friends and followers have more friends, followers, etc. than the user, and that this cannot be explained by statistical properties of sampling. Next, we explore the behavioral origins of the paradoxes by using the shuffle test to remove correlations between node degrees and attributes. We find that paradoxes for the mean persist in the shuffled network, but not for the median. We demonstrate that strong paradoxes arise due to the assortativity of user attributes, including degree, and correlation between degree and attribute.
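
The distinction between the mean-based and the median-based (strong) paradox can be seen on a small hand-built graph. This is an illustrative sketch under stated assumptions, not the paper's analysis or data: the graph, the node names, and the helper `paradox_holds` are all hypothetical.

```python
from statistics import mean, median


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def paradox_holds(adj, node, summary):
    """True if the summary (mean or median) of friends' degrees exceeds the node's degree."""
    friend_degrees = [len(adj[f]) for f in adj[node]]
    return summary(friend_degrees) > len(adj[node])


# Node "a" has three friends: one hub with degree 10 and two leaves with degree 1.
edges = [("hub", "a"), ("a", "x"), ("a", "y")]
edges += [("hub", f"leaf{i}") for i in range(9)]
adj = build_adjacency(edges)

weak = paradox_holds(adj, "a", mean)      # mean of [10, 1, 1] versus degree 3
strong = paradox_holds(adj, "a", median)  # median of [10, 1, 1] versus degree 3
print(weak, strong)
```

A single high-degree friend is enough to pull the mean above the node's own degree, while the median stays below it, which is why the median-based form is the stricter test.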

preprint2014arXiv

Partitioning Networks with Node Attributes by Compressing Information Flow

Real-world networks are often organized as modules or communities of similar nodes that serve as functional units. These networks are also rich in content, with nodes having distinguishing features or attributes. In order to discover a network's modular structure, it is necessary to take into account not only its links but also node attributes. We describe an information-theoretic method that identifies modules by compressing descriptions of information flow on a network. Our formulation introduces node content into the description of information flow, which we then minimize to discover groups of nodes with similar attributes that also tend to trap the flow of information. The method has several advantages: it is conceptually simple and does not require ad-hoc parameters to specify the number of modules or to control the relative contribution of links and node attributes to network structure. We apply the proposed method to partition real-world networks with known community structure. We demonstrate that adding node attributes helps recover the underlying community structure in content-rich networks more effectively than using links alone. In addition, we show that our method is faster and more accurate than alternative state-of-the-art algorithms.

preprint2014arXiv

Rethinking Centrality: The Role of Dynamical Processes in Social Network Analysis

Many popular measures used in social network analysis, including centrality, are based on the random walk. The random walk is a model of a stochastic process where a node interacts with one other node at a time. However, the random walk may not be appropriate for modeling social phenomena, including epidemics and information diffusion, in which one node may interact with many others at the same time, for example, by broadcasting the virus or information to its neighbors. To produce meaningful results, social network analysis algorithms have to take into account the nature of interactions between the nodes. In this paper we classify dynamical processes as conservative and non-conservative and relate them to well-known measures of centrality used in network analysis: PageRank and Alpha-Centrality. We demonstrate, by ranking users in online social networks used for broadcasting information, that non-conservative Alpha-Centrality generally leads to a better agreement with an empirical ranking scheme than the conservative PageRank.

preprint2014arXiv

The Impact of Network Flows on Community Formation in Models of Opinion Dynamics

We study dynamics of opinion formation in a network of coupled agents. As the network evolves to a steady state, opinions of agents within the same community converge faster than those of other agents. This framework allows us to study how network topology and network flow, which mediates the transfer of opinions between agents, both affect the formation of communities. In traditional models of opinion dynamics, agents are coupled via conservative flows, which result in one-to-one opinion transfer. However, social interactions are often non-conservative, resulting in one-to-many transfer of opinions. We study opinion formation in networks using one-to-one and one-to-many interactions and show that they lead to different community structure within the same network.

preprint2014arXiv

Tripartite Graph Clustering for Dynamic Sentiment Analysis on Social Media

The growing popularity of social media (e.g., Twitter) allows users to easily share information with each other and influence others by expressing their own sentiments on various subjects. In this work, we propose an unsupervised tri-clustering framework, which analyzes both user-level and tweet-level sentiments through co-clustering of a tripartite graph. A compelling feature of the proposed framework is that the quality of sentiment clustering of tweets, users, and features can be mutually improved by joint clustering. We further investigate the evolution of user-level sentiments and latent feature vectors in an online framework and devise an efficient online algorithm to sequentially update the clustering of tweets, users and features with newly arrived data. The online framework not only provides better quality of both dynamic user-level and tweet-level sentiment analysis, but also improves the computational and storage efficiency. We verified the effectiveness and efficiency of the proposed approaches on the November 2012 California ballot Twitter data.

preprint2013arXiv

Attention and Visibility in an Information Rich World

As the rate of content production grows, we must make a staggering number of daily decisions about what information is worth acting on. For any flourishing online social media system, users can barely keep up with the new content shared by friends. How does the user-interface design help or hinder users' ability to find interesting content? We analyze the choices people make about which information to propagate on the social media sites Twitter and Digg. We observe regularities in behavior which can be attributed directly to cognitive limitations of humans, resulting from the different visibility policies of each site. We quantify how people divide their limited attention among competing sources of information, and we show how the user-interface design can mediate information spread.

preprint2013arXiv

Friendship Paradox Redux: Your Friends Are More Interesting Than You

Feld's friendship paradox states that "your friends have more friends than you, on average." This paradox arises because extremely popular people, despite being rare, are overrepresented when averaging over friends. Using a sample of the Twitter firehose, we confirm that the friendship paradox holds for >98% of Twitter users. Because of the directed nature of the follower graph on Twitter, we are further able to confirm more detailed forms of the friendship paradox: everyone you follow or who follows you has more friends and followers than you. This is likely caused by a correlation we demonstrate between Twitter activity, number of friends, and number of followers. In addition, we discover two new paradoxes: the virality paradox that states "your friends receive more viral content than you, on average," and the activity paradox, which states "your friends are more active than you, on average." The latter paradox is important in regulating online communication. It may result in users having difficulty maintaining optimal incoming information rates, because following additional users causes the volume of incoming tweets to increase super-linearly. While users may compensate for increased information flow by increasing their own activity, users become information overloaded when they receive more information than they are able or willing to process. We compare the average size of cascades that are sent and received by overloaded and underloaded users, and we show that overloaded users post and receive larger cascades and are poor detectors of small cascades.

preprint2013arXiv

LA-CTR: A Limited Attention Collaborative Topic Regression for Social Media

Probabilistic models can learn users' preferences from the history of their item adoptions on a social media site, and in turn, recommend new items to users based on learned preferences. However, current models ignore psychological factors that play an important role in shaping online social behavior. One such factor is attention, the mechanism that integrates perceptual and cognitive features to select the items the user will consciously process and may eventually adopt. Recent research has shown that people have finite attention, which constrains their online interactions, and that they divide their limited attention non-uniformly over other people. We propose a collaborative topic regression model that incorporates limited, non-uniformly divided attention. We show that the proposed model is able to learn more accurate user preferences than state-of-the-art models, which do not take human cognitive factors into account. Specifically, we analyze voting on news items on the social news aggregator and show that our model is better able to predict held out votes than alternate models. Our study demonstrates that psycho-socially motivated models are better able to describe and predict observed behavior than models which only consider latent social structure and content.

preprint2013arXiv

LA-LDA: A Limited Attention Topic Model for Social Recommendation

Social media users have finite attention which limits the number of incoming messages from friends they can process. Moreover, they pay more attention to opinions and recommendations of some friends than others. In this paper, we propose LA-LDA, a latent topic model which incorporates limited, non-uniformly divided attention in the diffusion process by which opinions and information spread on the social network. We show that our proposed model is able to learn more accurate user models from users' social network and item adoption behavior than models which do not take limited attention into account. We analyze voting on news items on the social news aggregator Digg and show that our proposed model is better able to predict held out votes than alternative models. Our study demonstrates that psycho-socially motivated models have better ability to describe and predict observed behavior than models which only consider topics.

preprint2013arXiv

Limited Attention and Centrality in Social Networks

How does one find important or influential people in an online social network? Researchers have proposed a variety of centrality measures to identify individuals that are, for example, often visited by a random walk, infected in an epidemic, or receive many messages from friends. Recent research suggests that social media users' capacity to respond to an incoming message is constrained by their finite attention, which they divide over all incoming information, i.e., information sent by users they follow. We propose a new measure of centrality, a limited-attention version of Bonacich's Alpha-centrality, that models the effect of limited attention on epidemic diffusion. The new measure describes a process in which nodes broadcast messages to their out-neighbors, but the neighbors' ability to receive the message depends on the number of in-neighbors they have. We evaluate the proposed measure on real-world online social networks and show that it can better reproduce an empirical influence ranking of users than other popular centrality measures.

preprint2013arXiv

Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs

Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract follower graphs of active Digg and Twitter users and track how interest in news stories cascades through the graph. We compare and contrast properties of information cascades on both sites and elucidate what they tell us about dynamics of information flow on networks.

preprint2013arXiv

Spectral Clustering with Epidemic Diffusion

Spectral clustering is widely used to partition graphs into distinct modules or communities. Existing methods for spectral clustering use the eigenvalues and eigenvectors of the graph Laplacian, an operator that is closely associated with random walks on graphs. We propose a new spectral partitioning method that exploits the properties of epidemic diffusion. An epidemic is a dynamic process that, unlike the random walk, simultaneously transitions to all the neighbors of a given node. We show that the replicator, an operator describing epidemic diffusion, is equivalent to the symmetric normalized Laplacian of a reweighted graph with edges reweighted by the eigenvector centralities of their incident nodes. Thus, more weight is given to edges connecting more central nodes. We describe a method that partitions the nodes based on the componentwise ratio of the replicator's second eigenvector to the first, and compare its performance to traditional spectral clustering techniques on synthetic graphs with known community structure. We demonstrate that the replicator gives preference to dense, clique-like structures, enabling it to more effectively discover communities that may be obscured by dense intercommunity linking.

preprint2013arXiv

Stochastic Models Predict User Behavior in Social Media

User response to contributed content in online social media depends on many factors. These include how the site lays out new content, how frequently the user visits the site, how many friends the user follows, how active these friends are, as well as how interesting or useful the content is to the user. We present a stochastic modeling framework that relates a user's behavior to details of the site's user interface and user activity and describe a procedure for estimating model parameters from available data. We apply the model to study discussions of controversial topics on Twitter, specifically, to predict how followers of an advocate for a topic respond to the advocate's posts. We show that a model of user behavior that explicitly accounts for a user transitioning through a series of states before responding to an advocate's post better predicts response than models that fail to take these states into account. We demonstrate other benefits of stochastic models, such as their ability to identify users who are highly interested in the advocate's posts.

preprint2013arXiv

Structural and Cognitive Bottlenecks to Information Access in Social Networks

Information in networks is non-uniformly distributed, enabling individuals in certain network positions to get preferential access to information. Social scientists have developed influential theories about the role of network structure in information access. These theories were validated through numerous studies, which examined how individuals leverage their social networks for competitive advantage, such as a new job or higher compensation. It is not clear how these theories generalize to online networks, which differ from real-world social networks in important respects, including asymmetry of social links. We address this problem by analyzing how users of the social news aggregator Digg adopt stories recommended by friends, i.e., users they follow. We measure the impact different factors, such as network position and activity rate, have on access to novel information, which in Digg's case means the set of distinct news stories. We show that a user can improve his information access by linking to active users, though this becomes less effective as the number of friends, or their activity, grows due to structural network constraints. These constraints arise because users in structurally diverse positions within the follower graph have topically diverse interests from their friends. Moreover, though in most cases a user's friends are exposed to almost all the information available in the network, after they make their recommendations, the user sees only a small fraction of the available information. Our study suggests that cognitive and structural bottlenecks limit access to novel information in online social networks.

preprint2013arXiv

The Simple Rules of Social Contagion

It is commonly believed that information spreads between individuals like a pathogen, with each exposure by an informed friend potentially resulting in a naive individual becoming infected. However, empirical studies of social media suggest that individual response to repeated exposure to information is significantly more complex than the prediction of the pathogen model. As a proxy for intervention experiments, we compare user responses to multiple exposures on two different social media sites, Twitter and Digg. We show that the position of the exposing messages on the user-interface strongly affects social contagion. Accounting for this visibility significantly simplifies the dynamics of social contagion. The likelihood an individual will spread information increases monotonically with exposure, while explicit feedback about how many friends have previously spread it increases the likelihood of a response. We apply our model to real-time forecasting of user behavior.

preprint2012arXiv

How Visibility and Divided Attention Constrain Social Contagion

How far and how fast does information spread in social media? Researchers have recently examined a number of factors that affect information diffusion in online social networks, including: the novelty of information, users' activity levels, who they pay attention to, and how they respond to friends' recommendations. Using URLs as markers of information, we carry out a detailed study of retweeting, the primary mechanism by which information spreads on the Twitter follower graph. Our empirical study examines how users respond to an incoming stimulus, i.e., a tweet (message) from a friend, and reveals that the "principle of least effort" combined with limited attention plays a dominant role in retweeting behavior. Specifically, we observe that users retweet information when it is most visible, such as when it is near the top of their Twitter stream. Moreover, our measurements quantify how a user's limited attention is divided among incoming tweets, providing novel evidence that highly connected individuals are less likely to propagate an arbitrary tweet. Our study indicates that the finite ability to process incoming information constrains social contagion, and we conclude that rapid decay of visibility is the primary barrier to information propagation online.

preprint2012arXiv

Impact of Dynamic Interactions on Multi-Scale Analysis of Community Structure in Networks

To find interesting structure in networks, community detection algorithms have to take into account not only the network topology, but also dynamics of interactions between nodes. We investigate this claim using the paradigm of synchronization in a network of coupled oscillators. As the network evolves to a global steady state, nodes belonging to the same community synchronize faster than nodes belonging to different communities. Traditionally, nodes in network synchronization models are coupled via one-to-one, or conservative interactions. However, social interactions are often one-to-many, as for example, in social media, where users broadcast messages to all their followers. We formulate a novel model of synchronization in a network of coupled oscillators in which the oscillators are coupled via one-to-many, or non-conservative interactions. We study the dynamics of different interaction models and contrast their spectral properties. To find multi-scale community structure in a network of interacting nodes, we define a similarity function that measures the degree to which nodes are synchronized and use it to hierarchically cluster nodes. We study real-world social networks, including networks of two social media providers. To evaluate the quality of the discovered communities in a social media network we propose a community quality metric based on user activity. We find that conservative and non-conservative interaction models lead to dramatically different views of community structure even within the same network. Our work offers a novel mathematical framework for exploring the relationship between network structure, topology and dynamics.

preprint2012arXiv

Network Structure, Topology and Dynamics in Generalized Models of Synchronization

We explore the interplay of network structure, topology, and dynamic interactions between nodes using the paradigm of distributed synchronization in a network of coupled oscillators. As the network evolves to a global steady state, interconnected oscillators synchronize in stages, revealing the network's underlying community structure. Traditional models of synchronization assume that interactions between nodes are mediated by a conservative process, such as diffusion. However, social and biological processes are often non-conservative. We propose a new model of synchronization in a network of oscillators coupled via non-conservative processes. We study dynamics of synchronization of synthetic and real-world networks and show that different synchronization models reveal different structures within the same network.

preprint2012arXiv

Social Dynamics of Digg

Online social media provide multiple ways to find interesting content. One important method is highlighting content recommended by a user's friends. We examine this process on one such site, the news aggregator Digg. With a stochastic model of user behavior, we distinguish the effects of the content visibility and interestingness to users. We find a wide range of interest and distinguish stories primarily of interest to a user's friends from those of interest to the entire user community. We show how this model predicts a story's eventual popularity from users' early reactions to it, and estimate the prediction reliability. This modeling framework can help evaluate alternative design choices for displaying content on the site.

preprint2011arXiv

Entropy-based Classification of 'Retweeting' Activity on Twitter

Twitter is used for a variety of reasons, including information dissemination, marketing, political organizing and to spread propaganda, spamming, promotion, conversations, and so on. Characterizing these activities and categorizing associated user generated content is a challenging task. We present an information-theoretic approach to classification of user activity on Twitter. We focus on tweets that contain embedded URLs and study their collective 'retweeting' dynamics. We identify two features, time-interval and user entropy, which we use to classify retweeting activity. We achieve good separation of different activities using just these two features and are able to categorize content based on the collective user response it generates. We have identified five distinct categories of retweeting activity on Twitter: automatic/robotic activity, newsworthy information dissemination, advertising and promotion, campaigns, and parasitic advertisement. In the course of our investigations, we have shown how Twitter can be exploited for promotional and spam-like activities. The content-independent, entropy-based activity classification method is computationally efficient, scalable and robust to sampling and missing data. It has many applications, including automatic spam-detection, trend identification, trust management, user-modeling, social search and content classification on online social media.
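
The abstract does not spell out how the user entropy feature is computed. As a rough illustration only, the sketch below assumes it is the Shannon entropy (in bits) of the distribution of retweets over the users who posted them; the function name `user_entropy` and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def user_entropy(retweeters):
    """Shannon entropy, in bits, of how retweets are spread over users."""
    counts = Counter(retweeters)
    total = sum(counts.values())
    # p * log2(1 / p) summed over users, with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


# One account retweeting the same URL repeatedly (robotic-looking activity).
single_source = ["bot", "bot", "bot", "bot"]
# Four distinct accounts retweeting once each (broad dissemination).
many_sources = ["ann", "bob", "cat", "dan"]

print(user_entropy(single_source), user_entropy(many_sources))
```

Under this assumed definition, activity dominated by one account scores zero, while retweets spread evenly over many accounts score high, which is the kind of separation a content-independent classifier could exploit.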

preprint2011arXiv

Leveraging User Diversity to Harvest Knowledge on the Social Web

Social web users are a very diverse group with varying interests, levels of expertise, enthusiasm, and expressiveness. As a result, the quality of content and annotations they create to organize content is also highly variable. While several approaches have been proposed to mine social annotations, for example, to learn folksonomies that reflect how people relate narrower concepts to broader ones, these methods treat all users and the annotations they create uniformly. We propose a framework to automatically identify experts, i.e., knowledgeable users who create high quality annotations, and use their knowledge to guide folksonomy learning. We evaluate the approach on a large body of social annotations extracted from the photosharing site Flickr. We show that using expert knowledge leads to more detailed and accurate folksonomies. Moreover, we show that including annotations from non-expert, or novice, users leads to more comprehensive folksonomies than experts' knowledge alone.

preprint2011arXiv

Non-Conservative Diffusion and its Application to Social Network Analysis

The random walk is fundamental to modeling dynamic processes on networks. Metrics based on the random walk have been used in many applications from image processing to Web page ranking. However, how appropriate are random walks to modeling and analyzing social networks? We argue that unlike a random walk, which conserves the quantity diffusing on a network, many interesting social phenomena, such as the spread of information or disease on a social network, are fundamentally non-conservative. When an individual infects her neighbor with a virus, the total amount of infection increases. We classify diffusion processes as conservative and non-conservative and show how these differences impact the choice of metrics used for network analysis, as well as our understanding of network structure and behavior. We show that Alpha-Centrality, which mathematically describes non-conservative diffusion, leads to new insights into the behavior of spreading processes on networks. We give a scalable approximate algorithm for computing the Alpha-Centrality in a massive graph. We validate our approach on real-world online social networks of Digg. We show that a non-conservative metric, such as Alpha-Centrality, produces better agreement with an empirical measure of influence than conservative metrics, such as PageRank. We hope that our investigation will inspire further exploration into the realms of conservative and non-conservative metrics in social network analysis.
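
The difference between conservative and non-conservative diffusion can be checked with a single update step on a toy graph. This is a minimal sketch and not the paper's formulation or its approximate algorithm: the broadcast update below (each node keeps its weight and passes a fraction `alpha` to every neighbor) is an assumed, simplified form chosen for illustration, and all names are hypothetical.

```python
def random_walk_step(adj, weights):
    """Conservative: each node splits its weight evenly among its neighbors."""
    new = {node: 0.0 for node in adj}
    for node, neighbors in adj.items():
        for nb in neighbors:
            new[nb] += weights[node] / len(neighbors)
    return new


def broadcast_step(adj, weights, alpha):
    """Non-conservative: each node keeps its weight and sends alpha * weight to every neighbor."""
    new = dict(weights)
    for node, neighbors in adj.items():
        for nb in neighbors:
            new[nb] += alpha * weights[node]
    return new


# Triangle graph with all of the initial weight on node "a".
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
start = {"a": 1.0, "b": 0.0, "c": 0.0}

walk_total = sum(random_walk_step(adj, start).values())
broadcast_total = sum(broadcast_step(adj, start, alpha=0.5).values())
print(walk_total, broadcast_total)
```

The random-walk step leaves the total unchanged, whereas the broadcast step increases it, mirroring the observation that infecting a neighbor raises the total amount of infection.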

preprint2011arXiv

Using Proximity to Predict Activity in Social Networks

The structure of a social network contains information useful for predicting its evolution. Nodes that are "close" in some sense are more likely to become linked in the future than more distant nodes. We show that structural information can also help predict node activity. We use proximity to capture the degree to which two nodes are "close" to each other in the network. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that can occur between network nodes. We argue that the "closer" nodes are in a social network, the more similar will be their activity. We study this claim using data about URL recommendation on social media sites Digg and Twitter. We show that structural proximity of two users in the follower graph is related to similarity of their activity, i.e., how many URLs they both recommend. We also show that given friends' activity, knowing their proximity to the user can help better predict which URLs the user will recommend. We compare the performance of different proximity metrics on the activity prediction task and find that some metrics lead to substantial performance improvements.

preprint2011arXiv

What Stops Social Epidemics?

Theoretical progress in understanding the dynamics of spreading processes on graphs suggests the existence of an epidemic threshold below which no epidemics form and above which epidemics spread to a significant fraction of the graph. We have observed information cascades on the social media site Digg that spread fast enough for one initial spreader to infect hundreds of people, yet end up affecting only 0.1% of the entire network. We find that two effects, previously studied in isolation, combine cooperatively to drastically limit the final size of cascades on Digg. First, because of the highly clustered structure of the Digg network, most people who are aware of a story have been exposed to it via multiple friends. This structure lowers the epidemic threshold while moderately slowing the overall growth of cascades. In addition, we find that the mechanism for social contagion on Digg points to a fundamental difference between information spread and other contagion processes: despite multiple opportunities for infection within a social group, people are less likely to become spreaders of information with repeated exposure. The consequences of this mechanism become more pronounced for more clustered graphs. Ultimately, this effect severely curtails the size of social epidemics on Digg.

preprint2010arXiv

A Framework for Quantitative Analysis of Cascades on Networks

How does information flow in online social networks? How do the structure and size of the information cascade evolve in time? How can we efficiently mine the information contained in cascade dynamics? We approach these questions empirically and present an efficient and scalable mathematical framework for quantitative analysis of cascades on networks. We define a cascade generating function that captures the details of the microscopic dynamics of the cascades. We show that this function can also be used to compute the macroscopic properties of cascades, such as their size, spread, diameter, number of paths, and average path length. We present an algorithm to efficiently compute the cascade generating function and demonstrate that, while significantly compressing information within a cascade, it nevertheless allows us to accurately reconstruct its structure. We use this framework to study information dynamics on the social network of Digg. Digg allows users to post and vote on stories, and easily see the stories that friends have voted on. As a story spreads on Digg through voting, it generates cascades. We extract cascades of more than 3,500 Digg stories and calculate their macroscopic and microscopic properties. We identify several trends in cascade dynamics: spreading via chaining, branching and community. We discuss how these affect the spread of the story through the Digg social network. Our computational framework is general and offers a practical solution to quantitative analysis of the microscopic structure of even very large cascades.

preprint2010arXiv

A Parameterized Centrality Metric for Network Analysis

A variety of metrics have been proposed to measure the relative importance of nodes in a network. One of these, alpha-centrality [Bonacich, 2001], measures the number of attenuated paths that exist between nodes. We introduce a normalized version of this metric and use it to study network structure, specifically, to rank nodes and find community structure of the network. In particular, we extend the modularity-maximization method [Newman and Girvan, 2004] for community detection to use this metric as the measure of node connectivity. Normalized alpha-centrality is a powerful tool for network analysis, since it contains a tunable parameter that sets the length scale of interactions. Studying how rankings and discovered communities change when this parameter is varied allows us to identify locally and globally important nodes and structures. We apply the proposed method to several benchmark networks and show that it leads to better insight into network structure than alternative methods.
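
The idea of counting attenuated paths can be sketched numerically. This is a minimal illustration that assumes the series form C = A + alpha A^2 + alpha^2 A^3 + ... and a normalization by the sum of all entries; the function names and the 4-node path graph are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def attenuated_path_counts(A, alpha):
    """Return C = sum_{k>=1} alpha^(k-1) A^k = A (I - alpha A)^{-1}.

    C[i, j] counts paths from i to j, with a path of length k weighted by alpha^(k-1).
    """
    A = np.asarray(A, dtype=float)
    spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
    if alpha * spectral_radius >= 1:
        raise ValueError("series diverges: need alpha < 1 / spectral radius")
    n = A.shape[0]
    return A @ np.linalg.inv(np.eye(n) - alpha * A)


def normalized_scores(A, alpha):
    """Per-node share of all attenuated paths (assumed normalization)."""
    C = attenuated_path_counts(A, alpha)
    return C.sum(axis=0) / C.sum()


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
path_graph = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

local = normalized_scores(path_graph, alpha=0.0)   # only direct links count
longer = normalized_scores(path_graph, alpha=0.5)  # longer paths contribute too
print(local)
print(longer)
```

With `alpha = 0` the score reduces to each node's share of the edges, and increasing `alpha` lets longer paths contribute, which is the sense in which the parameter sets the length scale of interactions.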

preprint2010arXiv

A Probabilistic Approach for Learning Folksonomies from Structured Data

Learning structured representations has emerged as an important problem in many domains, including document and Web data mining, bioinformatics, and image analysis. One approach to learning complex structures is to integrate many smaller, incomplete and noisy structure fragments. In this work, we present an unsupervised probabilistic approach that extends affinity propagation to combine the small ontological fragments into a collection of integrated, consistent, and larger folksonomies. This is a challenging task because the method must aggregate similar structures while avoiding structural inconsistencies and handling noise. We validate the approach on a real-world social media dataset, comprised of shallow personal hierarchies specified by many individual users, collected from the photosharing website Flickr. Our empirical results show that our proposed approach is able to construct deeper and denser structures, compared to an approach using only the standard affinity propagation algorithm. Additionally, the approach yields better overall integration quality than a state-of-the-art approach based on incremental relational clustering.

preprint2010arXiv

Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata

Many social Web sites allow users to annotate the content with descriptive metadata, such as tags, and more recently to organize content hierarchically. These types of structured metadata provide valuable evidence for learning how a community organizes knowledge. For instance, we can aggregate many personal hierarchies into a common taxonomy, also known as a folksonomy, that will aid users in visualizing and browsing social content, and also help them organize their own content. However, learning from social metadata presents several challenges, since it is sparse, shallow, ambiguous, noisy, and inconsistent. We describe an approach to folksonomy learning based on relational clustering, which exploits structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr, and demonstrate that the proposed approach addresses the challenges. Moreover, compared to previous work, the approach produces larger, more accurate folksonomies and scales better.

preprint2010arXiv

Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks

Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract social networks of active users on Digg and Twitter, and track how interest in news stories spreads among them. We show that social networks play a crucial role in the spread of information on these sites, and that network structure affects dynamics of information flow.

preprint2010arXiv

Integrating Structured Metadata with Relational Affinity Propagation

Structured and semi-structured data describing entities, taxonomies and ontologies appear in many domains. There is huge interest in integrating structured information from multiple sources; however, integrating structured data to infer complex common structures is a difficult task because the integration must aggregate similar structures while avoiding structural inconsistencies that may appear when the data is combined. In this work, we study the integration of structured social metadata: shallow personal hierarchies specified by many individual users on the Social Web, and focus on inferring a collection of integrated, consistent taxonomies. We frame this task as an optimization problem with structural constraints. We propose a new inference algorithm, which we refer to as Relational Affinity Propagation (RAP), that extends affinity propagation (Frey and Dueck 2007) by introducing structural constraints. We validate the approach on a real-world social media dataset, collected from the photosharing website Flickr. Our empirical results show that our proposed approach is able to construct deeper and denser structures compared to an approach using only the standard affinity propagation algorithm.

preprint2010arXiv

Modeling Social Annotation: a Bayesian Approach

Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, e.g., Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users can potentially be used to infer categorical knowledge, classify documents or recommend new relevant information. Traditional text inference methods do not make the best use of social annotation, since they do not take into account variations in individual users' perspectives and vocabulary. In previous work, we introduced a simple probabilistic model that takes interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.

preprint2010arXiv

Predicting Influential Users in Online Social Networks

Who are the influential people in an online social network? The answer to this question depends not only on the structure of the network, but also on details of the dynamic processes occurring on it. We classify these processes as conservative and non-conservative. A random walk on a network is an example of a conservative dynamic process, while information spread is non-conservative. The influence models used to rank network nodes can be similarly classified, depending on the dynamic process they implicitly emulate. We claim that in order to correctly rank network nodes, the influence model has to match the details of the dynamic process. We study a real-world network on the social news aggregator Digg, which allows users to post and vote for news stories. We empirically define influence as the number of in-network votes a user's post generates. This influence measure, and the resulting ranking, arises entirely from the dynamics of voting on Digg, which represents non-conservative information flow. We then compare predictions of different influence models with this empirical estimate of influence. The results show that non-conservative models are better able to predict influential users on Digg. We find that the normalized alpha-centrality metric is one of the best predictors of influence. We also present a simple algorithm for computing this metric and the associated mathematical formulation and analytical proofs.

preprint2010arXiv

Using a Model of Social Dynamics to Predict Popularity of News

preprint2010arXiv

Using Stochastic Models to Describe and Predict Social Dynamics of Web Users

Popularity of content in social media is unequally distributed, with some items receiving a disproportionate share of attention from users. Predicting which newly-submitted items will become popular is critically important for both hosts of social media content and its consumers. Accurate and timely prediction would enable hosts to maximize revenue through differential pricing for access to content or ad placement. Prediction would also give consumers an important tool for filtering the ever-growing amount of content. Predicting popularity of content in social media, however, is challenging due to the complex interactions between content quality and how the social media site chooses to highlight content. Moreover, most social media sites also selectively present content that has been highly rated by similar users, whose similarity is indicated implicitly by their behavior or explicitly by links in a social network. While these factors make it difficult to predict popularity a priori, we show that stochastic models of user behavior on these sites allow predicting popularity based on early user reactions to new content. By incorporating the various mechanisms through which web sites display content, such models improve on predictions based on simply extrapolating from the early votes. Using data from one such site, the news aggregator Digg, we show how a stochastic model of user behavior distinguishes the effect of the increased visibility due to the network from how interested users are in the content. We find a wide range of interest, distinguishing stories primarily of interest to users in the network ("niche interests") from those of more general interest to the user community. This distinction is useful for predicting a story's eventual popularity from users' early reactions to the story.

preprint2009arXiv

Structure of Heterogeneous Networks

Heterogeneous networks play a key role in the evolution of communities and the decisions individuals make. These networks link different types of entities, for example, people and the events they attend. Network analysis algorithms usually project such networks onto simple graphs composed of entities of a single type. In the process, they conflate relations between entities of different types and lose important structural information. We develop a mathematical framework that can be used to compactly represent and analyze heterogeneous networks that combine multiple entity and link types. We generalize Bonacich centrality, which measures connectivity between nodes by the number of paths between them, to heterogeneous networks and use this measure to study network structure. Specifically, we extend the popular modularity-maximization method for community detection to use this centrality metric. We also rank nodes based on their connectivity to other nodes. One advantage of this centrality metric is that it has a tunable parameter we can use to set the length scale of interactions. Studying how rankings change with this parameter allows us to identify important nodes in the network. We apply the proposed method to analyze the structure of several heterogeneous networks. We show that exploiting additional sources of evidence corresponding to links between, as well as among, different entity types yields new insights into network structure.
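
The information loss caused by projecting a heterogeneous network onto a single entity type can be seen in a small example. This sketch assumes a people-by-events incidence matrix and a binarized one-mode projection; both toy networks and the `project` helper are hypothetical and are not taken from the paper.

```python
import numpy as np


def project(B):
    """Binarized one-mode projection of a people-by-events incidence matrix:
    two people are linked if they attended at least one event together."""
    co_attendance = B @ B.T
    np.fill_diagonal(co_attendance, 0)  # drop self-links
    return (co_attendance > 0).astype(int)


# Network 1: three people all attend the same single event.
one_event = np.array([
    [1],
    [1],
    [1],
])

# Network 2: three events, each attended by a different pair of people.
three_events = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
])

same_projection = np.array_equal(project(one_event), project(three_events))
print(same_projection)
```

Both networks project onto the same triangle of people, and since every pair shares exactly one event in each case, even a weighted projection would not tell them apart. Keeping the event nodes preserves the difference.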

preprint2007arXiv

Exploiting Social Annotation for Automatic Resource Discovery

Information integration applications, such as mediators or mashups, that require access to information resources currently rely on users manually discovering and integrating them in the application. Manual resource discovery is a slow process, requiring the user to sift through results obtained via keyword-based search. Although search methods have advanced to include evidence from a document's contents, its metadata, and the contents and link structure of the referring pages, they still do not adequately cover information sources, often called "the hidden Web", that dynamically generate documents in response to a query. The recently popular social bookmarking sites, which allow users to annotate and share metadata about various information sources, provide rich evidence for resource discovery. In this paper, we describe a probabilistic model of the user annotation process in the social bookmarking system del.icio.us. We then use the model to automatically find resources relevant to a particular information domain. Our experimental results on data obtained from del.icio.us show this approach to be a promising method for helping automate the resource discovery task.

preprint2007arXiv

Social Information Processing in Social News Aggregation

The rise of social media sites, such as blogs, wikis, Digg and Flickr among others, underscores the transformation of the Web into a participatory medium in which users are collaboratively creating, evaluating and distributing information. The innovations introduced by social media have led to a new paradigm for interacting with information, which we call 'social information processing'. In this paper, we study how the social news aggregator Digg exploits social information processing to solve the problems of document recommendation and rating. First, we show, by tracking stories over time, that social networks play an important role in document recommendation. The second contribution of this paper consists of two mathematical models. The first model describes how collaborative rating and promotion of stories emerge from the independent decisions made by many users. The second model describes how a user's influence, the number of promoted stories and the user's social network, changes in time. We find qualitative agreement between predictions of the model and user data gathered from Digg.

preprint2007arXiv

User Participation in Social Media: Digg Study

The social news aggregator Digg allows users to submit and moderate stories by voting on (digging) them. As is true of most social sites, user participation on Digg is non-uniformly distributed, with a few users contributing a disproportionate fraction of content. We studied user participation on Digg to see whether it is motivated by competition, fueled by user ranking, or by social factors, such as community acceptance. For our study we collected activity data of the top users weekly over the course of a year. We computed the number of stories users submitted, dugg or commented on weekly. We report a spike in user activity in September 2006, followed by a gradual decline, which seems unaffected by the elimination of user ranking. The spike can be explained by a controversy that broke out at the beginning of September 2006. We believe that the lasting acrimony created by this incident led to a decline in top user participation on Digg.

preprint2006arXiv

Modeling and Mathematical Analysis of Swarms of Microscopic Robots

The biologically-inspired swarm paradigm is being used to design self-organizing systems of locally interacting artificial agents. A major difficulty in designing swarms with desired characteristics is understanding the causal relation between individual agent and collective behaviors. Mathematical analysis of swarm dynamics can address this difficulty and provide insight into system design. This paper proposes a framework for mathematical modeling of swarms of microscopic robots that may one day be useful in medical applications. While such devices do not yet exist, the modeling approach can be helpful in identifying various design trade-offs for the robots and be a useful guide for their eventual fabrication. Specifically, we examine microscopic robots that reside in a fluid, for example, a bloodstream, and are able to detect and respond to different chemicals. We present the general mathematical model of a scenario in which robots locate a chemical source. We solve the scenario in one dimension and show how the results can be used to evaluate certain design decisions.

66 published item(s)

A Survey on Bias and Fairness in Machine Learning

Emergent Instabilities in Algorithmic Feedback Loops

Infusing Knowledge from Wikipedia to Enhance Stance Detection

Road Network Evolution in the Urban and Rural United States Since 1900

A Model of Densifying Collaboration Networks

Individualized Context-Aware Tensor Factorization for Online Games Predictions

The Emergence of Heterogeneous Scaling in Research Institutions

Challenges in Forecasting Malicious Events from Incomplete Data

Learning Behavioral Representations from Wearable Sensors

Predictability limit of partially observed systems

Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set

Unequal Impact and Spatial Aggregation Distort COVID-19 Growth Rates

Friendship Paradox Biases Perceptions in Directed Networks

Assessing the Navigational Effects of Click Biases and Link Insertion on the Web

Attention Inequality in Social Media

Disentangling the Effects of Social Signals

Emotions, Demographics and Sociability in Twitter Interactions

Information is not a Virus, and Other Consequences of Human Cognitive Limits

Neighbor-Neighbor Correlations Explain Measurement Bias in Networks

Network Composition from Multi-layer Data

Evolution of Conversations in the Age of Email Overload

Geography of Emotion: Where in a City are People Happier?

Portrait of an Online Shopper: Understanding and Predicting Consumer Behavior

Structural Properties of Ego Networks

The Interplay Between Dynamics and Networks: Centrality, Communities, and Cheeger Inequality

The Majority Illusion in Social Networks

User Effort and Network Structure Mediate Access to Information in Networks

Network Weirdness: Exploring the Origins of Network Paradoxes

Partitioning Networks with Node Attributes by Compressing Information Flow

Rethinking Centrality: The Role of Dynamical Processes in Social Network Analysis

The Impact of Network Flows on Community Formation in Models of Opinion Dynamics

Tripartite Graph Clustering for Dynamic Sentiment Analysis on Social Media

Attention and Visibility in an Information Rich World

Friendship Paradox Redux: Your Friends Are More Interesting Than You

LA-CTR: A Limited Attention Collaborative Topic Regression for Social Media

LA-LDA: A Limited Attention Topic Model for Social Recommendation

Limited Attention and Centrality in Social Networks

Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs

Spectral Clustering with Epidemic Diffusion

Stochastic Models Predict User Behavior in Social Media

Structural and Cognitive Bottlenecks to Information Access in Social Networks

The Simple Rules of Social Contagion

How Visibility and Divided Attention Constrain Social Contagion

Impact of Dynamic Interactions on Multi-Scale Analysis of Community Structure in Networks

Network Structure, Topology and Dynamics in Generalized Models of Synchronization

Social Dynamics of Digg

Entropy-based Classification of 'Retweeting' Activity on Twitter

Leveraging User Diversity to Harvest Knowledge on the Social Web

Non-Conservative Diffusion and its Application to Social Network Analysis

Using Proximity to Predict Activity in Social Networks

What Stops Social Epidemics?

A Framework for Quantitative Analysis of Cascades on Networks

A Parameterized Centrality Metric for Network Analysis

A Probabilistic Approach for Learning Folksonomies from Structured Data

Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata

Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks

Integrating Structured Metadata with Relational Affinity Propagation

Modeling Social Annotation: a Bayesian Approach

Predicting Influential Users in Online Social Networks

Using a Model of Social Dynamics to Predict Popularity of News

Using Stochastic Models to Describe and Predict Social Dynamics of Web Users

Structure of Heterogeneous Networks

Exploiting Social Annotation for Automatic Resource Discovery

Social Information Processing in Social News Aggregation

User Participation in Social Media: Digg Study

Modeling and Mathematical Analysis of Swarms of Microscopic Robots