Source author record

Rumi Ghosh

Rumi Ghosh appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

physics.soc-ph, Social and Information Networks, cs.CY, physics.data-an, Machine Learning, Artificial Intelligence, cond-mat.dis-nn, Digital Libraries, Discrete Mathematics, nlin.CD, physics.comp-ph

Catalog footprint

What is connected

21 works

11 topics

4 close collaborators

preprint, 2016, arXiv

Causal Discovery for Manufacturing Domains

Yield and quality improvement is of paramount importance to any manufacturing company. One of the ways of improving yield is through discovery of the root causal factors affecting yield. We propose the use of data-driven interpretable causal models to identify key factors affecting yield. We focus on factors that are measured in different stages of production and testing in the manufacturing cycle of a product. We apply causal structure learning techniques on real data collected from a production line. Specifically, the goal of this work is to learn interpretable causal models from observational data produced by manufacturing lines. Emphasis has been given to the interpretability of the models to make them actionable in the field of manufacturing. We highlight the challenges presented by assembly line data and propose ways to alleviate them. We also identify unique characteristics of data originating from assembly lines and how to leverage them in order to improve causal discovery. Standard evaluation techniques for causal structure learning show that the learned causal models seem to closely represent the underlying latent causal relationship between different factors in the production process. These results were also validated by manufacturing domain experts who found them promising. This work demonstrates how data mining and knowledge discovery can be used for root cause analysis in the domain of manufacturing and connected industry.

preprint, 2016, arXiv

Dealing with Class Imbalance using Thresholding

We propose thresholding as an approach to deal with class imbalance. We define the concept of thresholding as a process of determining a decision boundary in the presence of a tunable parameter. The threshold is the maximum value of this tunable parameter where the conditions of a certain decision are satisfied. We show that thresholding is applicable not only for linear classifiers but also for non-linear classifiers. We show that this is the implicit assumption for many approaches to deal with class imbalance in linear classifiers. We then extend this paradigm beyond linear classification and show how non-linear classification can be dealt with under this umbrella framework of thresholding. The proposed method can be used for outlier detection in many real-life scenarios, such as manufacturing. In advanced manufacturing units, where the manufacturing process has matured over time, the instances (or parts) of the product that need to be rejected (based on a strict regime of quality tests) become relatively rare and are defined as outliers. How can these rare parts or outliers be detected beforehand? How can the combinations of conditions leading to these outliers be detected? These are the questions motivating our research. This paper focuses on predicting outliers, and the conditions leading to them, using classification. The classes are good parts (those passing the quality tests) and bad parts (those failing the quality tests, which can be considered outliers). The rarity of outliers transforms this problem into a class-imbalanced classification problem.
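
The thresholding idea above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the function name `pick_threshold`, the recall condition, and the toy scores are all assumptions. It picks the largest threshold at which a chosen condition (here, a minimum recall on the rare "bad" class) still holds.

```python
def pick_threshold(scores, labels, min_recall):
    """Return the largest threshold t such that predicting 'bad' when
    score >= t still catches at least min_recall of the bad parts.
    Returns None if there are no bad parts or the target cannot be met."""
    n_bad = sum(labels)
    if n_bad == 0:
        return None
    # Candidate thresholds: the observed scores, tried from highest to lowest.
    for t in sorted(set(scores), reverse=True):
        caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if caught / n_bad >= min_recall:
            return t
    return None

# Hypothetical classifier scores; label 1 marks a rare bad part.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.15, 0.25, 0.40]
labels = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1]

strict = pick_threshold(scores, labels, min_recall=1.0)
relaxed = pick_threshold(scores, labels, min_recall=0.6)
```

With these toy values a default cutoff of 0.5 would miss the bad part scored 0.40, while the threshold chosen for full recall does not.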

preprint, 2015, arXiv

Attention decay in science

The exponential growth in the number of scientific papers makes it increasingly difficult for researchers to keep track of all the publications relevant to their work. Consequently, the attention that can be devoted to individual papers, measured by their citation counts, is bound to decay rapidly. In this work we make a thorough study of the life-cycle of papers in different disciplines. Typically, the citation rate of a paper increases up to a few years after its publication, reaches a peak and then decreases rapidly. This decay can be described by an exponential or a power law behavior, as in ultradiffusive processes, with exponential fitting better than power law for the majority of cases. The decay is also becoming faster over the years, signaling that nowadays papers are forgotten more quickly. However, when time is counted in terms of the number of published papers, the rate of decay of citations is fairly independent of the period considered. This indicates that the attention of scholars depends on the number of published items, and not on real time.
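
One simple way to compare an exponential and a power-law description of a decay is to fit a straight line in the appropriate log space and compare residuals. The sketch below assumes synthetic data generated from an exponential and a crude sum-of-squares criterion; the fitting procedure used in the paper is not described in the abstract.

```python
import math

def linear_fit_sse(xs, ys):
    """Least-squares line y = a + b*x; return (a, b, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

# Hypothetical citation rates for years 1..10 after the peak,
# generated here from an exponential with rate 0.3 (not real data).
years = list(range(1, 11))
rates = [100.0 * math.exp(-0.3 * t) for t in years]
log_rates = [math.log(c) for c in rates]

# Exponential decay c(t) = c0 * exp(-k * t): log c is linear in t.
_, slope_exp, sse_exp = linear_fit_sse(years, log_rates)
# Power-law decay c(t) = c0 * t**(-gamma): log c is linear in log t.
_, slope_pow, sse_pow = linear_fit_sse([math.log(t) for t in years], log_rates)

better = "exponential" if sse_exp < sse_pow else "power law"
```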

preprint, 2015, arXiv

The Interplay Between Dynamics and Networks: Centrality, Communities, and Cheeger Inequality

We study the interplay between a dynamic process and the structure of the network on which it is defined. Specifically, we examine the impact of this interaction on the quality-measure of network clusters and node centrality. This enables us to effectively identify network communities and important nodes participating in the dynamics. As the first step towards this objective, we introduce an umbrella framework for defining and characterizing an ensemble of dynamic processes on a network. This framework generalizes the traditional Laplacian framework to continuous-time biased random walks and also allows us to model some epidemic processes over a network. For each dynamic process in our framework, we can define a function that measures the quality of every subset of nodes as a potential cluster (or community) with respect to this process on a given network. This subset-quality function generalizes the traditional conductance measure for graph partitioning. We partially justify our choice of the quality function by showing that the classic Cheeger's inequality, which relates the conductance of the best cluster in a network with a spectral quantity of its Laplacian matrix, can be extended from the Laplacian-conductance setting to this more general setting.
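
For reference, the classic Laplacian-conductance setting that the abstract generalizes can be checked numerically on a tiny graph. The sketch below covers only the standard inequality λ₂/2 ≤ φ ≤ √(2λ₂) for the normalized Laplacian, not the paper's generalized quality function; the toy graph and the brute-force search are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def conductance(A, S):
    """cut(S, complement) / min(vol(S), vol(complement)) for an undirected graph."""
    n = A.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[list(S)] = True
    cut = A[inside][:, ~inside].sum()
    deg = A.sum(axis=1)
    return cut / min(deg[inside].sum(), deg[~inside].sum())

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Brute force over all nonempty proper subsets (feasible only for tiny graphs).
phi = min(
    conductance(A, S)
    for k in range(1, n)
    for S in itertools.combinations(range(n), k)
)

# Second-smallest eigenvalue of the symmetric normalized Laplacian.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt
lam2 = np.linalg.eigvalsh(L_norm)[1]  # eigenvalues come back in ascending order
```

The best cluster here is one of the triangles, cut from the rest by the single bridging edge.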

preprint, 2014, arXiv

Information Relaxation is Ultradiffusive

We investigate how the overall response to a piece of information (a story or an article) evolves and relaxes as a function of time in social networks like Reddit, Digg and YouTube. This response or popularity is measured in terms of the number of votes/comments that the story (or article) accrued over time. We find that the temporal evolution of popularity can be described by a universal function whose parameters depend upon the system under consideration. Unlike most previous studies, which empirically investigated the dynamics of voting behavior, we also give a theoretical interpretation of the observed behavior using ultradiffusion. Whether it is the inter-arrival time between two consecutive votes on a story on Reddit or the comments on a video shared on YouTube, there is always a hierarchy of time scales in information propagation. One vote/comment might occur almost simultaneously with the previous, whereas another vote/comment might occur hours after the preceding one. This hierarchy of time scales leads us to believe that the dynamical response of users to information is ultradiffusive in nature. We show that an ultradiffusion-based stochastic process can be used to rationalize the observed temporal evolution.

preprint, 2014, arXiv

Rethinking Centrality: The Role of Dynamical Processes in Social Network Analysis

Many popular measures used in social network analysis, including centrality, are based on the random walk. The random walk is a model of a stochastic process where a node interacts with one other node at a time. However, the random walk may not be appropriate for modeling social phenomena, including epidemics and information diffusion, in which one node may interact with many others at the same time, for example, by broadcasting the virus or information to its neighbors. To produce meaningful results, social network analysis algorithms have to take into account the nature of interactions between the nodes. In this paper we classify dynamical processes as conservative and non-conservative and relate them to well-known measures of centrality used in network analysis: PageRank and Alpha-Centrality. We demonstrate, by ranking users in online social networks used for broadcasting information, that non-conservative Alpha-Centrality generally leads to a better agreement with an empirical ranking scheme than the conservative PageRank.
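
The distinction between conservative and non-conservative processes can be made concrete with a single update step. In this minimal sketch (toy graph and function names are assumptions), a random-walk step splits a node's quantity among its neighbors, while a broadcast step sends a full copy to each of them.

```python
# Hypothetical undirected toy graph: node 0 is a hub connected to 1, 2, 3.
neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

def random_walk_step(x):
    """Conservative: each node splits its quantity equally among its neighbors."""
    out = {v: 0.0 for v in neighbors}
    for u, nbrs in neighbors.items():
        for v in nbrs:
            out[v] += x[u] / len(nbrs)
    return out

def broadcast_step(x):
    """Non-conservative: each node sends a full copy to every neighbor."""
    out = {v: 0.0 for v in neighbors}
    for u, nbrs in neighbors.items():
        for v in nbrs:
            out[v] += x[u]
    return out

start = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}  # one unit placed on the hub
walked = random_walk_step(start)
broadcast = broadcast_step(start)
```

After one step the random walk still holds one unit in total, whereas the broadcast has tripled it.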

preprint, 2014, arXiv

The Impact of Network Flows on Community Formation in Models of Opinion Dynamics

We study dynamics of opinion formation in a network of coupled agents. As the network evolves to a steady state, opinions of agents within the same community converge faster than those of other agents. This framework allows us to study how network topology and network flow, which mediates the transfer of opinions between agents, both affect the formation of communities. In traditional models of opinion dynamics, agents are coupled via conservative flows, which result in one-to-one opinion transfer. However, social interactions are often non-conservative, resulting in one-to-many transfer of opinions. We study opinion formation in networks using one-to-one and one-to-many interactions and show that they lead to different community structure within the same network.
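
The claim that opinions converge faster inside a community than across communities can be illustrated with a generic linear consensus model. This is an assumption standing in for the conservative, one-to-one case only; the paper's own equations and its one-to-many model are not reproduced here, and the graph and initial opinions are made up.

```python
# Hypothetical toy graph: two 4-node cliques (communities) joined by the edge 3-4.
communities = [[0, 1, 2, 3], [4, 5, 6, 7]]
edges = [(u, v) for c in communities for i, u in enumerate(c) for v in c[i + 1:]]
edges.append((3, 4))

def simulate(x, steps, dt=0.01):
    """Euler integration of dx_i/dt = sum over neighbors j of (x_j - x_i)."""
    x = list(x)
    for _ in range(steps):
        dx = [0.0] * len(x)
        for u, v in edges:
            dx[u] += x[v] - x[u]
            dx[v] += x[u] - x[v]
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x

def spread(x, nodes):
    """Largest disagreement among the given nodes."""
    vals = [x[i] for i in nodes]
    return max(vals) - min(vals)

def mean(x, nodes):
    return sum(x[i] for i in nodes) / len(nodes)

x0 = [1.0, 0.0, 0.8, 0.2, -1.0, 0.1, -0.7, -0.4]  # arbitrary initial opinions
x1 = simulate(x0, steps=100)                       # state at time t = 1.0

within = max(spread(x1, c) for c in communities)
between = abs(mean(x1, communities[0]) - mean(x1, communities[1]))
```

At this point in the run the disagreement inside each clique has mostly vanished while the two cliques still hold clearly different average opinions.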

preprint, 2013, arXiv

Limited Attention and Centrality in Social Networks

How does one find important or influential people in an online social network? Researchers have proposed a variety of centrality measures to identify individuals that are, for example, often visited by a random walk, infected in an epidemic, or receive many messages from friends. Recent research suggests that a social media user's capacity to respond to an incoming message is constrained by their finite attention, which they divide over all incoming information, i.e., information sent by users they follow. We propose a new measure of centrality, a limited-attention version of Bonacich's Alpha-centrality, that models the effect of limited attention on epidemic diffusion. The new measure describes a process in which nodes broadcast messages to their out-neighbors, but the neighbors' ability to receive the message depends on the number of in-neighbors they have. We evaluate the proposed measure on real-world online social networks and show that it can better reproduce an empirical influence ranking of users than other popular centrality measures.

preprint, 2013, arXiv

Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs

Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract follower graphs of active Digg and Twitter users and track how interest in news stories cascades through the graph. We compare and contrast properties of information cascades on both sites and elucidate what they tell us about dynamics of information flow on networks.

preprint, 2013, arXiv

Spectral Clustering with Epidemic Diffusion

Spectral clustering is widely used to partition graphs into distinct modules or communities. Existing methods for spectral clustering use the eigenvalues and eigenvectors of the graph Laplacian, an operator that is closely associated with random walks on graphs. We propose a new spectral partitioning method that exploits the properties of epidemic diffusion. An epidemic is a dynamic process that, unlike the random walk, simultaneously transitions to all the neighbors of a given node. We show that the replicator, an operator describing epidemic diffusion, is equivalent to the symmetric normalized Laplacian of a reweighted graph with edges reweighted by the eigenvector centralities of their incident nodes. Thus, more weight is given to edges connecting more central nodes. We describe a method that partitions the nodes based on the componentwise ratio of the replicator's second eigenvector to the first, and compare its performance to traditional spectral clustering techniques on synthetic graphs with known community structure. We demonstrate that the replicator gives preference to dense, clique-like structures, enabling it to more effectively discover communities that may be obscured by dense intercommunity linking.
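
The reweighting described above has a property that is easy to verify numerically. If each edge (i, j) is weighted by the product of the eigenvector centralities of its endpoints, the symmetric normalized Laplacian of the reweighted graph reduces to I − A/λ_max. The sketch below checks only this algebraic identity on a made-up graph; how it maps onto the paper's exact replicator operator is an assumption.

```python
import numpy as np

# Hypothetical small connected, undirected graph (a triangle with a pendant path).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Eigenvector centrality: the leading eigenvector of A.
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
lam_max = eigvals[-1]
v = np.abs(eigvecs[:, -1])            # fix the sign so all entries are positive

# Reweight each edge (i, j) by the centralities of its endpoints.
W = A * np.outer(v, v)

# Symmetric normalized Laplacian of the reweighted graph.
d = W.sum(axis=1)
L_sym = np.eye(n) - W / np.sqrt(np.outer(d, d))

# Expected closed form for this construction.
target = np.eye(n) - A / lam_max
```

The identity follows because the weighted degree of node i is λ_max·vᵢ², so the normalization cancels the edge weights exactly.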

preprint, 2012, arXiv

Impact of Dynamic Interactions on Multi-Scale Analysis of Community Structure in Networks

To find interesting structure in networks, community detection algorithms have to take into account not only the network topology, but also dynamics of interactions between nodes. We investigate this claim using the paradigm of synchronization in a network of coupled oscillators. As the network evolves to a global steady state, nodes belonging to the same community synchronize faster than nodes belonging to different communities. Traditionally, nodes in network synchronization models are coupled via one-to-one, or conservative interactions. However, social interactions are often one-to-many, as for example, in social media, where users broadcast messages to all their followers. We formulate a novel model of synchronization in a network of coupled oscillators in which the oscillators are coupled via one-to-many, or non-conservative interactions. We study the dynamics of different interaction models and contrast their spectral properties. To find multi-scale community structure in a network of interacting nodes, we define a similarity function that measures the degree to which nodes are synchronized and use it to hierarchically cluster nodes. We study real-world social networks, including networks of two social media providers. To evaluate the quality of the discovered communities in a social media network we propose a community quality metric based on user activity. We find that conservative and non-conservative interaction models lead to dramatically different views of community structure even within the same network. Our work offers a novel mathematical framework for exploring the relationship between network structure, topology and dynamics.

preprint, 2012, arXiv

Network Structure, Topology and Dynamics in Generalized Models of Synchronization

We explore the interplay of network structure, topology, and dynamic interactions between nodes using the paradigm of distributed synchronization in a network of coupled oscillators. As the network evolves to a global steady state, interconnected oscillators synchronize in stages, revealing the network's underlying community structure. Traditional models of synchronization assume that interactions between nodes are mediated by a conservative process, such as diffusion. However, social and biological processes are often non-conservative. We propose a new model of synchronization in a network of oscillators coupled via non-conservative processes. We study dynamics of synchronization of synthetic and real-world networks and show that different synchronization models reveal different structures within the same network.

preprint, 2011, arXiv

Entropy-based Classification of 'Retweeting' Activity on Twitter

Twitter is used for a variety of purposes, including information dissemination, marketing, political organizing, spreading propaganda, spamming, promotion, and conversations. Characterizing these activities and categorizing associated user generated content is a challenging task. We present an information-theoretic approach to classification of user activity on Twitter. We focus on tweets that contain embedded URLs and study their collective 'retweeting' dynamics. We identify two features, time-interval and user entropy, which we use to classify retweeting activity. We achieve good separation of different activities using just these two features and are able to categorize content based on the collective user response it generates. We have identified five distinct categories of retweeting activity on Twitter: automatic/robotic activity, newsworthy information dissemination, advertising and promotion, campaigns, and parasitic advertisement. In the course of our investigations, we have shown how Twitter can be exploited for promotional and spam-like activities. The content-independent, entropy-based activity classification method is computationally efficient, scalable and robust to sampling and missing data. It has many applications, including automatic spam-detection, trend identification, trust management, user-modeling, social search and content classification on online social media.
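
The two features named above can be sketched as Shannon entropies over the retweets of a single URL. The binning of time gaps into 60-second buckets, the function names, and the example events are assumptions made for illustration; the paper's exact feature definitions are not given in the abstract.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def retweet_features(events, bin_seconds=60):
    """events: list of (timestamp_seconds, user_id) for one URL, in any order.
    Returns (time_interval_entropy, user_entropy)."""
    events = sorted(events)
    times = [t for t, _ in events]
    users = [u for _, u in events]
    # Bin the gaps between consecutive retweets so the distribution is discrete.
    gaps = [(b - a) // bin_seconds for a, b in zip(times, times[1:])]
    return shannon_entropy(gaps), shannon_entropy(users)

# Hypothetical examples: one account reposting every 10 minutes,
# and four different users retweeting at irregular times.
bot = [(600 * k, "bot") for k in range(8)]
organic = [(0, "a"), (30, "b"), (400, "c"), (2000, "d")]

bot_time, bot_user = retweet_features(bot)
org_time, org_user = retweet_features(organic)
```

Regular, single-account activity yields zero entropy on both features, while the irregular multi-user example scores higher on both.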

preprint, 2011, arXiv

Non-Conservative Diffusion and its Application to Social Network Analysis

The random walk is fundamental to modeling dynamic processes on networks. Metrics based on the random walk have been used in many applications from image processing to Web page ranking. However, how appropriate are random walks for modeling and analyzing social networks? We argue that unlike a random walk, which conserves the quantity diffusing on a network, many interesting social phenomena, such as the spread of information or disease on a social network, are fundamentally non-conservative. When an individual infects her neighbor with a virus, the total amount of infection increases. We classify diffusion processes as conservative and non-conservative and show how these differences impact the choice of metrics used for network analysis, as well as our understanding of network structure and behavior. We show that Alpha-Centrality, which mathematically describes non-conservative diffusion, leads to new insights into the behavior of spreading processes on networks. We give a scalable approximate algorithm for computing the Alpha-Centrality in a massive graph. We validate our approach on real-world online social networks of Digg. We show that a non-conservative metric, such as Alpha-Centrality, produces better agreement with an empirical measure of influence than conservative metrics, such as PageRank. We hope that our investigation will inspire further exploration into the realms of conservative and non-conservative metrics in social network analysis.
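
Alpha-centrality in its standard form solves x = e + αAᵀx, which expands into a series over attenuated paths. The sketch below truncates that series; it is a generic substitute, since the abstract does not describe the paper's scalable approximate algorithm. The edge-direction convention, the function name, and the toy graph are assumptions.

```python
def alpha_centrality(in_neighbors, alpha, n_terms=50):
    """Truncated series x = e + alpha*A^T e + alpha^2*(A^T)^2 e + ...
    in_neighbors[v] lists the nodes with an edge pointing to v.
    The full series converges only if alpha < 1 / (largest eigenvalue of A)."""
    nodes = list(in_neighbors)
    term = {v: 1.0 for v in nodes}  # current term of the series, starting at e
    total = dict(term)
    for _ in range(n_terms):
        term = {v: alpha * sum(term[u] for u in in_neighbors[v]) for v in nodes}
        for v in nodes:
            total[v] += term[v]
    return total

# Hypothetical directed toy graph: 0 -> 1, 0 -> 2, 1 -> 2.
# It is acyclic, so the series terminates after a few terms.
in_neighbors = {0: [], 1: [0], 2: [0, 1]}
scores = alpha_centrality(in_neighbors, alpha=0.5)
```

Node 2 scores highest because it is reached by two direct paths and one attenuated two-step path.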

preprint, 2011, arXiv

Using Proximity to Predict Activity in Social Networks

The structure of a social network contains information useful for predicting its evolution. Nodes that are "close" in some sense are more likely to become linked in the future than more distant nodes. We show that structural information can also help predict node activity. We use proximity to capture the degree to which two nodes are "close" to each other in the network. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that can occur between network nodes. We argue that the "closer" nodes are in a social network, the more similar will be their activity. We study this claim using data about URL recommendation on social media sites Digg and Twitter. We show that structural proximity of two users in the follower graph is related to similarity of their activity, i.e., how many URLs they both recommend. We also show that given friends' activity, knowing their proximity to the user can help better predict which URLs the user will recommend. We compare the performance of different proximity metrics on the activity prediction task and find that some metrics lead to substantial performance improvements.
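
Neighborhood overlap, the standard proximity metric mentioned above, can be sketched together with the activity similarity it is compared against. The Jaccard form of the overlap, the function names, and the toy follower and recommendation data are assumptions; the new metrics introduced in the paper are not reproduced.

```python
def neighborhood_overlap(follows, u, v):
    """Jaccard overlap of the sets of accounts that u and v follow."""
    a, b = follows[u], follows[v]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def activity_similarity(recommended, u, v):
    """Number of URLs that both users recommended."""
    return len(recommended[u] & recommended[v])

# Hypothetical follower graph and URL recommendations.
follows = {"ann": {"x", "y", "z"}, "bob": {"y", "z", "w"}, "cat": {"q"}}
recommended = {"ann": {"u1", "u2"}, "bob": {"u2", "u3"}, "cat": {"u9"}}

close_pair = neighborhood_overlap(follows, "ann", "bob")
far_pair = neighborhood_overlap(follows, "ann", "cat")
```

In this toy data the structurally closer pair also shares a recommended URL, which is the kind of relationship the abstract reports testing on Digg and Twitter.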

preprint, 2011, arXiv

What Stops Social Epidemics?

Theoretical progress in understanding the dynamics of spreading processes on graphs suggests the existence of an epidemic threshold below which no epidemics form and above which epidemics spread to a significant fraction of the graph. We have observed information cascades on the social media site Digg that spread fast enough for one initial spreader to infect hundreds of people, yet end up affecting only 0.1% of the entire network. We find that two effects, previously studied in isolation, combine cooperatively to drastically limit the final size of cascades on Digg. First, because of the highly clustered structure of the Digg network, most people who are aware of a story have been exposed to it via multiple friends. This structure lowers the epidemic threshold while moderately slowing the overall growth of cascades. In addition, we find that the mechanism for social contagion on Digg points to a fundamental difference between information spread and other contagion processes: despite multiple opportunities for infection within a social group, people are less likely to become spreaders of information with repeated exposure. The consequences of this mechanism become more pronounced for more clustered graphs. Ultimately, this effect severely curtails the size of social epidemics on Digg.

preprint, 2010, arXiv

A Framework for Quantitative Analysis of Cascades on Networks

How does information flow in online social networks? How do the structure and size of the information cascade evolve in time? How can we efficiently mine the information contained in cascade dynamics? We approach these questions empirically and present an efficient and scalable mathematical framework for quantitative analysis of cascades on networks. We define a cascade generating function that captures the details of the microscopic dynamics of the cascades. We show that this function can also be used to compute the macroscopic properties of cascades, such as their size, spread, diameter, number of paths, and average path length. We present an algorithm to efficiently compute the cascade generating function and demonstrate that while significantly compressing information within a cascade, it nevertheless allows us to accurately reconstruct its structure. We use this framework to study information dynamics on the social network of Digg. Digg allows users to post and vote on stories, and easily see the stories that friends have voted on. As a story spreads on Digg through voting, it generates cascades. We extract cascades of more than 3,500 Digg stories and calculate their macroscopic and microscopic properties. We identify several trends in cascade dynamics: spreading via chaining, branching and community. We discuss how these affect the spread of the story through the Digg social network. Our computational framework is general and offers a practical solution to quantitative analysis of the microscopic structure of even very large cascades.

preprint, 2010, arXiv

A Parameterized Centrality Metric for Network Analysis

A variety of metrics have been proposed to measure the relative importance of nodes in a network. One of these, alpha-centrality [Bonacich, 2001], measures the number of attenuated paths that exist between nodes. We introduce a normalized version of this metric and use it to study network structure, specifically, to rank nodes and find community structure of the network. In particular, we extend the modularity-maximization method [Newman and Girvan, 2004] for community detection to use this metric as the measure of node connectivity. Normalized alpha-centrality is a powerful tool for network analysis, since it contains a tunable parameter that sets the length scale of interactions. Studying how rankings and discovered communities change when this parameter is varied allows us to identify locally and globally important nodes and structures. We apply the proposed method to several benchmark networks and show that it leads to better insight into network structure than alternative methods.

preprint, 2010, arXiv

Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks

Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract social networks of active users on Digg and Twitter, and track how interest in news stories spreads among them. We show that social networks play a crucial role in the spread of information on these sites, and that network structure affects dynamics of information flow.

preprint, 2010, arXiv

Predicting Influential Users in Online Social Networks

Who are the influential people in an online social network? The answer to this question depends not only on the structure of the network, but also on details of the dynamic processes occurring on it. We classify these processes as conservative and non-conservative. A random walk on a network is an example of a conservative dynamic process, while information spread is non-conservative. The influence models used to rank network nodes can be similarly classified, depending on the dynamic process they implicitly emulate. We claim that in order to correctly rank network nodes, the influence model has to match the details of the dynamic process. We study a real-world network on the social news aggregator Digg, which allows users to post and vote for news stories. We empirically define influence as the number of in-network votes a user's post generates. This influence measure, and the resulting ranking, arises entirely from the dynamics of voting on Digg, which represents non-conservative information flow. We then compare predictions of different influence models with this empirical estimate of influence. The results show that non-conservative models are better able to predict influential users on Digg. We find that the normalized alpha-centrality metric is one of the best predictors of influence. We also present a simple algorithm for computing this metric and the associated mathematical formulation and analytical proofs.

preprint, 2009, arXiv

Structure of Heterogeneous Networks

Heterogeneous networks play a key role in the evolution of communities and the decisions individuals make. These networks link different types of entities, for example, people and the events they attend. Network analysis algorithms usually project such networks onto simple graphs composed of entities of a single type. In the process, they conflate relations between entities of different types and lose important structural information. We develop a mathematical framework that can be used to compactly represent and analyze heterogeneous networks that combine multiple entity and link types. We generalize Bonacich centrality, which measures connectivity between nodes by the number of paths between them, to heterogeneous networks and use this measure to study network structure. Specifically, we extend the popular modularity-maximization method for community detection to use this centrality metric. We also rank nodes based on their connectivity to other nodes. One advantage of this centrality metric is that it has a tunable parameter we can use to set the length scale of interactions. Studying how rankings change with this parameter allows us to identify important nodes in the network. We apply the proposed method to analyze the structure of several heterogeneous networks. We show that exploiting additional sources of evidence corresponding to links between, as well as among, different entity types yields new insights into network structure.
