Source author record

John C. S. Lui

John C. S. Lui appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Catalog footprint

What is connected

39 works

23 topics

4 close collaborators

preprint, 2022, arXiv

Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

In decentralized learning, a network of nodes cooperate to minimize an overall objective function that is usually the finite-sum of their local objectives, and incorporates a non-smooth regularization term for the better generalization ability. Decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning models, while the convergence rate is retarded by the variance of stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate the decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator in each node, which tracks the local full gradient periodically, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and controlling the bounds of error sequences, we prove that DPSVRG converges at the rate of $O(1/T)$ for general convex objectives plus a non-smooth term with $T$ as the number of iterations, while DSPG converges at the rate $O(\frac{1}{\sqrt{T}})$. Our experiments on different applications, network topologies and learning models demonstrate that DPSVRG converges much faster than DSPG, and the loss function of DPSVRG decreases smoothly along with the training epochs.
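For orientation, the periodic gradient-tracking correction described in this abstract resembles the standard single-node proximal SVRG update, sketched below. This is the textbook form and not necessarily the exact DPSVRG update. The symbols $\tilde{x}$ (snapshot point), $\eta$ (step size), $h$ (non-smooth regularizer) and $i_t$ (sampled component index) are notation assumed here, not taken from the abstract.

$$
v_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x}), \qquad x_{t+1} = \operatorname{prox}_{\eta h}\left(x_t - \eta v_t\right)
$$

Here $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ and $\tilde{x}$ is refreshed periodically. The estimator is unbiased, $\mathbb{E}[v_t] = \nabla f(x_t)$, and its variance shrinks as $x_t$ and $\tilde{x}$ both approach a minimizer, which is the usual intuition for why variance reduction improves on a plain stochastic proximal step.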

preprint, 2022, arXiv

Federated Online Clustering of Bandits

Contextual multi-armed bandit (MAB) is an important sequential decision-making problem in recommendation systems. A line of works, called the clustering of bandits (CLUB), utilize the collaborative effect over users and dramatically improve the recommendation quality. Owing to the increasing application scale and public concerns about privacy, there is a growing demand to keep user data decentralized and push bandit learning to the local server side. Existing CLUB algorithms, however, are designed under the centralized setting where data are available at a central server. We focus on studying the federated online clustering of bandit (FCLUB) problem, which aims to minimize the total regret while satisfying privacy and communication considerations. We design a new phase-based scheme for cluster detection and a novel asynchronous communication protocol for cooperative bandit learning for this problem. To protect users' privacy, previous differential privacy (DP) definitions are not very suitable, and we propose a new DP notion that acts on the user cluster level. We provide rigorous proofs to show that our algorithm simultaneously achieves (clustered) DP, sublinear communication complexity and sublinear regret. Finally, experimental evaluations show our superior performance compared with benchmark algorithms.

preprint, 2022, arXiv

LPC-AD: Fast and Accurate Multivariate Time Series Anomaly Detection via Latent Predictive Coding

This paper proposes LPC-AD, a fast and accurate multivariate time series (MTS) anomaly detection method. LPC-AD is motivated by the ever-increasing needs for fast and accurate MTS anomaly detection methods to support fast troubleshooting in cloud computing, micro-service systems, etc. LPC-AD is fast in the sense that it reduces the training time by as much as 38.2% compared to the state-of-the-art (SOTA) deep learning methods that focus on training speed. LPC-AD is accurate in the sense that it improves the detection accuracy by as much as 18.9% compared to SOTA sophisticated deep learning methods that focus on enhancing detection accuracy. Methodologically, LPC-AD contributes a generic architecture LPC-Reconstruct for one to attain different trade-offs between training speed and detection accuracy. More specifically, LPC-Reconstruct is built on ideas from autoencoder for reducing redundancy in time series, latent predictive coding for capturing temporal dependence in MTS, and randomized perturbation for avoiding overfitting of anomalous dependence in the training data. We present simple instantiations of LPC-Reconstruct to attain fast training speed, where we propose a simple randomized perturbation method. The superior performance of LPC-AD over SOTA methods is validated by extensive experiments on four large real-world datasets. Experiment results also show the necessity and benefit of each component of the LPC-Reconstruct architecture and that LPC-AD is robust to hyperparameters.

preprint, 2022, arXiv

Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications

Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative rewards. Existing MMAB models mostly assume when more than one player pulls the same arm, they either have a collision and obtain zero rewards, or have no collision and gain independent rewards, both of which are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension to the collision and non-collision settings. Each shareable arm has finite shareable resources and a "per-load" reward random variable, both of which are unknown to players. The reward from a shareable arm is equal to the "per-load" reward multiplied by the minimum between the number of players pulling the arm and the arm's maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable arm problem under these two cases of feedback respectively and prove that both algorithms have logarithmic regrets that are tight in the number of rounds. We conduct simulations to validate both algorithms' performance and show their utilities in wireless networking and edge computing.
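The reward rule stated in this abstract (per-load reward multiplied by the minimum of the number of players pulling the arm and the arm's shareable resources) can be sketched as a small function. The function and parameter names are illustrative assumptions, and the per-load reward is passed in as an already-sampled value instead of being drawn from a distribution.

```python
def shareable_arm_reward(per_load_reward: float, num_players: int, capacity: int) -> float:
    """Reward of one shareable arm in one round.

    per_load_reward: realized "per-load" reward for this round (assumed already sampled).
    num_players:     number of players pulling the arm this round.
    capacity:        the arm's maximal shareable resources.
    """
    if num_players < 0 or capacity < 0:
        raise ValueError("num_players and capacity must be non-negative")
    # Only min(num_players, capacity) units of load actually earn reward.
    return per_load_reward * min(num_players, capacity)


# Three players on an arm with two units of resource: only two loads are served.
crowded = shareable_arm_reward(0.5, num_players=3, capacity=2)
# One player on the same arm: the arm is under-used.
sparse = shareable_arm_reward(0.5, num_players=1, capacity=2)
```

Adding players beyond the capacity does not increase the arm's reward, which is why the players benefit from learning each arm's capacity.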

preprint, 2022, arXiv

Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms

We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it is the "per-load" reward multiplied by either the number of plays pulling the arm, or its reward capacity when the number of plays exceeds the capacity limit. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound of learning the capacity from load-dependent rewards and also a regret lower bound of this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound is the same as that of the regret lower bound, and its second and third terms also evidently correspond to those of the lower bound. Extensive experiments validate our algorithm's performance and also its gain in 5G & 4G base station selection.

preprint, 2022, arXiv

Online Competitive Influence Maximization

Online influence maximization has attracted much attention as a way to maximize influence spread through a social network while learning the values of unknown network parameters. Most previous works focus on single-item diffusion. In this paper, we introduce a new Online Competitive Influence Maximization (OCIM) problem, where two competing items (e.g., products, news stories) propagate in the same network and influence probabilities on edges are unknown. We adopt a combinatorial multi-armed bandit (CMAB) framework for OCIM, but unlike the non-competitive setting, the important monotonicity property (influence spread increases when influence probabilities on edges increase) no longer holds due to the competitive nature of propagation, which brings a significant new challenge to the problem. We provide a nontrivial proof showing that the Triggering Probability Modulated (TPM) condition for CMAB still holds in OCIM, which is instrumental for our proposed algorithms OCIM-TS and OCIM-OFU to achieve sublinear Bayesian and frequentist regret, respectively. We also design an OCIM-ETC algorithm that requires less feedback and easier offline computation, at the expense of a worse frequentist regret bound. Experimental evaluations demonstrate the effectiveness of our algorithms.

preprint, 2020, arXiv

Conversational Contextual Bandit: Algorithm and Application

Contextual bandit algorithms provide principled online learning solutions to balance the exploitation-exploration trade-off in various applications such as recommender systems. However, the learning speed of the traditional contextual bandit algorithms is often slow due to the need for extensive exploration. This poses a critical issue in applications like recommender systems, since users may need to provide feedbacks on a lot of uninterested items. To accelerate the learning speed, we generalize contextual bandit to conversational contextual bandit. Conversational contextual bandit leverages not only behavioral feedbacks on arms (e.g., articles in news recommendation), but also occasional conversational feedbacks on key-terms from the user. Here, a key-term can relate to a subset of arms, for example, a category of articles in news recommendation. We then design the Conversational UCB algorithm (ConUCB) to address two challenges in conversational contextual bandit: (1) which key-terms to select to conduct conversation, (2) how to leverage conversational feedbacks to accelerate the speed of bandit learning. We theoretically prove that ConUCB can achieve a smaller regret upper bound than the traditional contextual bandit algorithm LinUCB, which implies a faster learning speed. Experiments on synthetic data, as well as real datasets from Yelp and Toutiao, demonstrate the efficacy of the ConUCB algorithm.

preprint, 2020, arXiv

Online VNF Chaining and Predictive Scheduling: Optimality and Trade-offs

For NFV systems, the key design space includes the function chaining for network requests and resource scheduling for servers. The problem is challenging since NFV systems usually require multiple (often conflicting) design objectives and the computational efficiency of real-time decision making with limited information. Furthermore, the benefits of predictive scheduling to NFV systems still remain unexplored. In this paper, we propose POSCARS, an efficient predictive and online service chaining and resource scheduling scheme that achieves tunable trade-offs among various system metrics with queue stability guarantee. Through a careful choice of granularity in system modeling, we acquire a better understanding of the trade-offs in our design space. By a non-trivial transformation, we decouple the complex optimization problem into a series of online sub-problems to achieve the optimality with only limited information. By employing randomized load balancing techniques, we propose three variants of POSCARS to reduce the overheads of decision making. Theoretical analysis and simulations show that POSCARS and its variants require only mild-value of future information to achieve near-optimal system cost with an ultra-low request response time.

preprint, 2020, arXiv

Quantifying Deployability & Evolvability of Future Internet Architectures via Economic Models

Emerging new applications demand the current Internet to provide new functionalities. Although many future Internet architectures and protocols have been proposed to fulfill such needs, ISPs have been reluctant to deploy many of these architectures. We believe technical issues are not the main reasons as many of these new proposals are technically sound. In this paper, we take an economic perspective and seek to answer: Why most new Internet architectures failed to be deployed? How to enhance the deployability of a new architecture? We develop a game-theoretic model to characterize the outcome of an architecture's deployment through the equilibrium of ISPs' decisions. This model enables us to: (1) analyze several key factors of the deployability of a new architecture such as the number of critical ISPs and the change of routing path; (2) explain the deploying outcomes of some previously proposed architectures/protocols such as IPv6, DiffServ, CDN, etc., and shed light on the "Internet flattening phenomenon"; (3) predict the deployability of a new architecture such as NDN, and compare its deployability with competing architectures. Our study suggests that the difficulty to deploy a new Internet architecture comes from the "coordination" of distributed ISPs. Finally, we design a coordination mechanism to enhance the deployability of new architectures.

preprint, 2016, arXiv

A Fast Sampling Method of Exploring Graphlet Degrees of Large Directed and Undirected Graphs

Exploring small connected and induced subgraph patterns (CIS patterns, or graphlets) has recently attracted considerable attention. Despite recent efforts on computing the number of instances a specific graphlet appears in a large graph (i.e., the total number of CISes isomorphic to the graphlet), little attention has been paid to characterizing a node's graphlet degree, i.e., the number of CISes isomorphic to the graphlet that include the node, which is an important metric for analyzing complex networks such as social and biological networks. Similar to global graphlet counting, it is challenging to compute node graphlet degrees for a large graph due to the combinatorial nature of the problem. Unfortunately, previous methods of computing global graphlet counts are not suited to solve this problem. In this paper we propose sampling methods to estimate node graphlet degrees for undirected and directed graphs, and analyze the error of our estimates. To the best of our knowledge, we are the first to study this problem and give a fast scalable solution. We conduct experiments on a variety of real-world datasets that demonstrate that our methods accurately and efficiently estimate node graphlet degrees for graphs with millions of edges.

preprint, 2016, arXiv

A General Framework for Estimating Graphlet Statistics via Random Walk

Graphlets are induced subgraph patterns and have been frequently applied to characterize the local topology structures of graphs across various domains, e.g., online social networks (OSNs) and biological networks. Discovering and computing graphlet statistics are highly challenging. First, the massive size of real-world graphs makes the exact computation of graphlets extremely expensive. Secondly, the graph topology may not be readily available so one has to resort to web crawling using the available application programming interfaces (APIs). In this work, we propose a general and novel framework to estimate graphlet statistics of "any size". Our framework is based on collecting samples through consecutive steps of random walks. We derive an analytical bound on the sample size (via the Chernoff-Hoeffding technique) to guarantee the convergence of our unbiased estimator. To further improve the accuracy, we introduce two novel optimization techniques to reduce the lower bound on the sample size. Experimental evaluations demonstrate that our methods outperform the state-of-the-art method up to an order of magnitude both in terms of accuracy and time cost.

preprint, 2016, arXiv

A Unified Framework for Information Consumption Based on Markov Chains

This paper establishes a Markov chain model as a unified framework for understanding information consumption processes in complex networks, with clear implications to the Internet and big-data technologies. In particular, the proposed model is the first one to address the formation mechanism of the "trichotomy" in observed probability density functions from empirical data of various social and technical networks. Both simulation and experimental results demonstrate a good match of the proposed model with real datasets, showing its superiority over the classical power-law models.

preprint, 2016, arXiv

Monet: A User-oriented Behavior-based Malware Variants Detection System for Android

Android, the most popular mobile OS, has around 78% of the mobile market share. Due to its popularity, it attracts many malware attacks. In fact, people have discovered around one million new malware samples per quarter, and it was reported that over 98% of these new malware samples are in fact "derivatives" (or variants) from existing malware families. In this paper, we first show that runtime behaviors of malware's core functionalities are in fact similar within a malware family. Hence, we propose a framework to combine "runtime behavior" with "static structures" to detect malware variants. We present the design and implementation of MONET, which has a client and a backend server module. The client module is a lightweight, in-device app for behavior monitoring and signature generation, and we realize this using two novel interception techniques. The backend server is responsible for large scale malware detection. We collect 3723 malware samples and top 500 benign apps to carry out extensive experiments of detecting malware variants and defending against malware transformation. Our experiments show that MONET can achieve around 99% accuracy in detecting malware variants. Furthermore, it can defend against 10 different obfuscation and transformation techniques, while incurring only around 7% performance overhead and about 3% battery overhead. More importantly, MONET will automatically alert users with intrusion details so as to prevent further malicious behaviors.

preprint, 2016, arXiv

PowerWalk: Scalable Personalized PageRank via Random Walks with Vertex-Centric Decomposition

Most methods for Personalized PageRank (PPR) precompute and store all accurate PPR vectors, and at query time, return the ones of interest directly. However, the storage and computation of all accurate PPR vectors can be prohibitive for large graphs, especially in caching them in memory for real-time online querying. In this paper, we propose a distributed framework that strikes a better balance between offline indexing and online querying. The offline indexing attains a fingerprint of the PPR vector of each vertex by performing billions of "short" random walks in parallel across a cluster of machines. We prove that our indexing method has an exponential convergence, achieving the same precision as previous methods using a much smaller number of random walks. At query time, the new PPR vector is composed by a linear combination of related fingerprints, in a highly efficient vertex-centric decomposition manner. Interestingly, the resulting PPR vector is much more accurate than its offline counterpart because it actually uses more random walks in its estimation. More importantly, we show that such decomposition for a batch of queries can be very efficiently processed using a shared decomposition. Our implementation, PowerWalk, takes advantage of advanced distributed graph engines and it outperforms the state-of-the-art algorithms by orders of magnitude. In particular, it responds to tens of thousands of queries on graphs with billions of edges in just a few seconds.
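As background for the random-walk idea in this abstract, the following is a minimal sketch of the generic Monte Carlo estimator for a single PPR vector. It is not PowerWalk's indexing or vertex-centric decomposition. The function name, the adjacency-list input format, the default restart probability of 0.15, and the choice to stop a walk at a vertex with no out-neighbors are all assumptions made for illustration.

```python
import random
from collections import Counter


def monte_carlo_ppr(adj, source, num_walks=10000, restart_prob=0.15, seed=0):
    """Estimate the PPR vector of `source` from the endpoints of short random walks.

    adj: dict mapping each vertex to a list of its out-neighbors (assumed format).
    Each walk stops with probability `restart_prob` before every step, so its
    length is geometric; the endpoint distribution approximates PPR from `source`.
    """
    rng = random.Random(seed)
    endpoints = Counter()
    for _ in range(num_walks):
        node = source
        while rng.random() >= restart_prob:
            neighbors = adj.get(node)
            if not neighbors:
                break  # dangling vertex: end the walk here (one possible convention)
            node = rng.choice(neighbors)
        endpoints[node] += 1
    return {v: count / num_walks for v, count in endpoints.items()}


# Tiny directed cycle 0 -> 1 -> 2 -> 0, queried from vertex 0.
cycle = {0: [1], 1: [2], 2: [0]}
ppr_from_0 = monte_carlo_ppr(cycle, source=0, num_walks=20000)
```

More walks give a more accurate estimate, which matches the abstract's remark that a vector assembled from more random walks is more accurate.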

preprint, 2016, arXiv

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computation, but are not suitable for processing light-workload graph queries where only a small fraction of vertices need to be accessed. The programming paradigm adopted by these systems can seriously under-utilize the resources in a cluster for graph query processing. In this work, we develop a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in the design of its computing model. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.

preprint, 2016, arXiv

Stochastic Modeling of Hybrid Cache Systems

In recent years, there has been an increasing demand for big memory systems so as to perform large scale data analytics. Since DRAM memories are expensive, some researchers are suggesting the use of other memory systems such as non-volatile memory (NVM) technology to build large-memory computing systems. However, whether the NVM technology can be a viable alternative (both economically and technically) to DRAM remains an open question. To answer this question, it is important to consider how to design a memory system from a "system perspective", that is, incorporating different performance characteristics and price ratios from hybrid memory devices. This paper presents an analytical model of a "hybrid page cache system" so as to understand the diverse design space and performance impact of a hybrid cache system. We consider (1) various architectural choices, (2) design strategies, and (3) configuration of different memory devices. Using this model, we provide guidelines on how to design hybrid page cache to reach a good trade-off between high system throughput (in I/O per sec or IOPS) and fast cache reactivity which is defined by the time to fill the cache. We also show how one can configure the DRAM capacity and NVM capacity under a fixed budget. We pick PCM as an example for NVM and conduct numerical analysis. Our analysis indicates that incorporating PCM in a page cache system significantly improves the system performance, and it also shows a larger benefit from allocating more PCM in page cache in some cases. Besides, for the common setting of performance-price ratio of PCM, "flat architecture" is the better choice, but "layered architecture" outperforms it if PCM write performance can be significantly improved in the future.

preprint, 2015, arXiv

Mathematical Modeling of Insurance Mechanisms for E-commerce Systems

Electronic commerce (a.k.a. E-commerce) systems such as eBay and Taobao of Alibaba are becoming increasingly popular. Having an effective reputation system is critical to this type of internet service because it can assist buyers to evaluate the trustworthiness of sellers, and it can also improve the revenue for reputable sellers and E-commerce operators. We formulate a stochastic model to analyze an eBay-like reputation system and propose four measures to quantify its effectiveness: (1) new seller ramp up time, (2) new seller drop out probability, (3) long term profit gains for sellers, and (4) average per seller transaction gains for the E-commerce operator. Through our analysis, we identify key factors which influence these four measures. We propose a new insurance mechanism which consists of an insurance protocol and a transaction mechanism to improve the above four measures. We show that our insurance mechanism can reduce the ramp up time by around 87.2%, and guarantee new sellers ramp up before the deadline $T_w$ with a high probability (close to 1.0). It also increases the long term profit gains and average per seller transaction gains by at least 95.3%.

preprint, 2015, arXiv

Minfer: Inferring Motif Statistics From Sampled Edges

Characterizing motif (i.e., locally connected subgraph patterns) statistics is important for understanding complex networks such as online social networks and communication networks. Previous work made the strong assumption that the graph topology of interest is known, and that the dataset either fits into main memory or is stored on disk such that it is not expensive to obtain all neighbors of any given node. In practice, researchers have to deal with the situation where the graph topology is unknown, either because the graph is dynamic, or because it is expensive to collect and store all topological and meta information on disk. Hence, what is available to researchers is only a snapshot of the graph generated by sampling edges from the graph at random, which we call a "RESampled graph". Clearly, a RESampled graph's motif statistics may be quite different from those of the underlying original graph. To solve this challenge, we propose a framework and implement a system called Minfer, which can take the given RESampled graph and accurately infer the underlying graph's motif statistics. We also use Fisher information to bound the error of our estimates. Experiments using large scale datasets show our method to be accurate.

preprint, 2015, arXiv

Tracking Triadic Cardinality Distributions for Burst Detection in Social Activity Streams

In everyday life, we often observe unusually frequent interactions among people before or during important events, e.g., we receive/send more greetings from/to our friends on Christmas Day, than usual. We also observe that some videos suddenly go viral through people's sharing in online social networks (OSNs). Do these seemingly different phenomena share a common structure? All these phenomena are associated with sudden surges of user activities in networks, which we call "bursts" in this work. We find that the emergence of a burst is accompanied with the formation of triangles in networks. This finding motivates us to propose a new method to detect bursts in OSNs. We first introduce a new measure, "triadic cardinality distribution", corresponding to the fractions of nodes with different numbers of triangles, i.e., triadic cardinalities, within a network. We demonstrate that this distribution changes when a burst occurs, and is naturally immunized against spamming social-bot attacks. Hence, by tracking triadic cardinality distributions, we can reliably detect bursts in OSNs. To avoid handling massive activity data generated by OSN users, we design an efficient sample-estimate solution to estimate the triadic cardinality distribution from sampled data. Extensive experiments conducted on real data demonstrate the usefulness of this triadic cardinality distribution and the effectiveness of our sample-estimate solution.
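The "triadic cardinality distribution" defined in this abstract (the fraction of nodes having each possible number of triangles) can be computed exactly on a small static graph as sketched below. This is the direct definition on an undirected edge list, not the paper's sample-estimate solution for streams. The function names and the edge-list input are assumptions, and nodes that appear in no edge are not counted.

```python
from collections import Counter
from itertools import combinations


def triadic_cardinalities(edges):
    """Map each node to the number of triangles it belongs to (undirected graph)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A node's triangles are the pairs of its neighbors that are themselves adjacent.
    return {
        node: sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        for node, neighbors in adj.items()
    }


def triadic_cardinality_distribution(edges):
    """Fraction of nodes with each triangle count, keyed by that count."""
    counts = triadic_cardinalities(edges)
    histogram = Counter(counts.values())
    total = len(counts)
    return {k: histogram[k] / total for k in sorted(histogram)}


# One triangle (1, 2, 3) plus a pendant node 4 attached to node 3.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
distribution = triadic_cardinality_distribution(example_edges)
# Nodes 1, 2 and 3 each sit in one triangle and node 4 in none: {0: 0.25, 1: 0.75}
```

Tracking how this distribution shifts between consecutive time windows is the signal the abstract proposes for detecting bursts.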

preprint, 2014, arXiv

Algorithmic Design for Competitive Influence Maximization Problems

Given the popularity of viral marketing campaigns in online social networks, finding an effective method to identify a set of most influential nodes so as to compete well with other viral marketing competitors is of utmost importance. We propose a "General Competitive Independent Cascade (GCIC)" model to describe the general influence propagation of two competing sources in the same network. We formulate the "Competitive Influence Maximization (CIM)" problem as follows: Under a prespecified influence propagation model and given that the competitor's seed set is known, how to find a seed set of $k$ nodes so as to trigger the largest influence cascade? We propose a general algorithmic framework TCIM for the CIM problem under the GCIC model. TCIM returns a $(1-1/e-\epsilon)$-approximate solution with probability at least $1-n^{-\ell}$, and has an efficient time complexity of $O(c(k+\ell)(m+n)\log n/\epsilon^2)$, where $c$ depends on the specific propagation model and may also depend on $k$ and the underlying network $G$. To the best of our knowledge, this is the first general algorithmic framework that has both a $(1-1/e-\epsilon)$ performance guarantee and practical efficiency. We conduct extensive experiments on real-world datasets under three specific influence propagation models, and show the efficiency and accuracy of our framework. In particular, we achieve up to four orders of magnitude speedup as compared to the previous state-of-the-art algorithms with an approximation guarantee.

preprint, 2014, arXiv

Block-Structured Supermarket Models

Supermarket models are a class of parallel queueing networks with an adaptive control scheme that play a key role in the study of resource management of, for example, computer networks, manufacturing systems and transportation networks. When the arrival processes are non-Poisson and the service times are non-exponential, analysis of such a supermarket model is always limited, interesting, and challenging. This paper describes a supermarket model with non-Poisson inputs: Markovian Arrival Processes (MAPs) and with non-exponential service times: Phase-type (PH) distributions, and provides a generalized matrix-analytic method which is first combined with the operator semigroup and the mean-field limit. When discussing such a more general supermarket model, this paper makes some new results and advances as follows: (1) Providing a detailed probability analysis for setting up an infinite-dimensional system of differential vector equations satisfied by the expected fraction vector, where "the invariance of environment factors" is given as an important result. (2) Introducing the phase-type structure to the operator semigroup and to the mean-field limit, where a Lipschitz condition can be obtained by means of a unified matrix-differential algorithm. (3) Using the matrix-analytic method to compute the fixed point, which leads to performance computation of this system. Finally, we use some numerical examples to illustrate how the performance measures of this supermarket model depend on the non-Poisson inputs and on the non-exponential service times. Thus the results of this paper shed new light on the influence of non-Poisson inputs and of non-exponential service times on performance measures of more general supermarket models.

preprint, 2014, arXiv

Design of Efficient Sampling Methods on Hybrid Social-Affiliation Networks

Graph sampling via crawling has become increasingly popular and important in the study of measuring various characteristics of large scale complex networks. While powerful, it is known to be challenging when the graph is loosely connected or disconnected which slows down the convergence of random walks and can cause poor estimation accuracy. In this work, we observe that the graph under study, or called target graph, usually does not exist in isolation. In many situations, the target graph is related to an auxiliary graph and an affiliation graph, and the target graph becomes well connected when we view it from the perspective of these three graphs together, or called a hybrid social-affiliation graph in this paper. When directly sampling the target graph is difficult or inefficient, we can indirectly sample it efficiently with the assistances of the other two graphs. We design three sampling methods on such a hybrid social-affiliation network. Experiments conducted on both synthetic and real datasets demonstrate the effectiveness of our proposed methods.

preprint 2014 arXiv

Efficiently Estimating Motif Statistics of Large Networks

Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays the massive size of some critical networks, which are often stored in already overloaded relational databases, effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling algorithms that efficiently and accurately estimate subgraph statistics of massive networks from as few queried nodes as possible. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude fewer queries (samples) than the current state-of-the-art algorithms.

preprint 2014 arXiv

Friends or Foes: Distributed and Randomized Algorithms to Determine Dishonest Recommenders in Online Social Networks

Viral marketing is becoming important due to the popularity of online social networks (OSNs). Companies may provide incentives (e.g., via free samples of a product) to a small group of users in an OSN, and these users provide recommendations to their friends, which eventually increases the overall sales of a given product. Nevertheless, this also opens the door to "malicious behaviors": dishonest users may intentionally give misleading recommendations to their friends so as to distort the normal sales distribution. In this paper, we propose a detection framework to identify dishonest users in OSNs. In particular, we present a set of fully distributed and randomized algorithms, and quantify their performance by deriving the probability of false positive, the probability of false negative, and the distribution of the number of detection rounds. Extensive simulations are also carried out to illustrate the impact of misleading recommendations and the effectiveness of our detection algorithms. The methodology we present here will enhance the security level of viral marketing in OSNs.

preprint 2014 arXiv

The Chaos of Propagation in a Retrial Supermarket Model

By decomposing the total orbit into $N$ sub-orbits (or simply orbits), one for each of the $N$ servers, and comparing the numbers of customers in these orbits, we introduce a retrial supermarket model of $N$ identical servers, in which two probing-server choice numbers are designed for dynamically allocating each primary arrival and each retrial arrival, respectively, into these orbits when the chosen servers are all busy. The two choice numbers are designed to improve the performance measures of this retrial supermarket model. This paper analyzes a simple and basic retrial supermarket model of $N$ identical servers, that is, one with Poisson arrivals and with exponential service and retrial times. To this end, we first provide a detailed probability computation to set up an infinite-dimensional system of differential equations (or mean-field equations) satisfied by the expected fraction vector. Then, as $N$ goes to infinity, we apply the operator semigroup to obtain the mean-field limit (or chaos of propagation) for the sequence of Markov processes which express the state of this retrial supermarket model. Specifically, some simple and basic conditions for the mean-field limit as well as for the Lipschitz condition are established through the first two moments of the queue length in any orbit. Finally, we show that the fixed point satisfies a system of nonlinear equations which is an interesting networking generalization of the tail equations given in the M/M/1 retrial queue, and we also use the fixed point to give a performance analysis of this retrial supermarket model through numerical computation.

preprint 2013 arXiv

DroidAnalytics: A Signature Based Analytic System to Collect, Extract, Analyze and Associate Android Malware

Smartphones and mobile devices are rapidly becoming indispensable for many users. Unfortunately, they have also become fertile ground for hackers to deploy malware and spread viruses. There is an urgent need for a "security analytic & forensic system" that helps analysts examine, dissect, associate and correlate large numbers of mobile applications. An effective analytic system needs to address the following questions: How to automatically collect and manage a high volume of mobile malware? How to analyze a zero-day suspicious application, and compare or associate it with existing malware families in the database? How to perform information retrieval so as to reveal malicious logic similar to that of existing malware, and to quickly identify new malicious code segments? In this paper, we present the design and implementation of DroidAnalytics, a signature based analytic system to automatically collect, manage, analyze and extract Android malware. The system helps analysts retrieve, associate and reveal malicious logic at the "opcode level". We demonstrate the efficacy of DroidAnalytics using 150,368 Android applications, and successfully identify 2,494 Android malware samples from 102 different families, with 342 of them being zero-day malware samples from six different families. To the best of our knowledge, this is the first reported Android malware analysis/detection at such a large scale. The evaluation shows that DroidAnalytics is a valuable tool and is effective in analyzing malware repackaging and mutations.

preprint 2013 arXiv

Mathematical Modeling of Product Rating: Sufficiency, Misbehavior and Aggregation Rules

Many web services such as eBay, TripAdvisor and Epinions provide historical product ratings so that users can evaluate the quality of products. Product ratings are important since they affect how well a product will be adopted by the market. The challenge is that we only have "partial information" on these ratings: each user provides ratings for only a "small subset of products". Under this partial information setting, we explore a number of fundamental questions: What is the "minimum number of ratings" a product needs so that one can make a reliable evaluation of its quality? How may users' misbehavior (such as cheating) in product rating affect the evaluation result? To answer these questions, we present a formal mathematical model of product evaluation based on partial information. We derive theoretical bounds on the minimum number of ratings needed to produce a reliable indicator of a product's quality. We also extend our model to accommodate users' misbehavior in product rating. We carry out experiments using both synthetic and real-world data (from TripAdvisor, Amazon and eBay) to validate our model, and also show that aggregating product ratings with the "majority rating rule" produces more reliable and robust product evaluation results than the "average rating rule".
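
The contrast between the two aggregation rules can be illustrated with a minimal sketch. It assumes that the "majority rating rule" reports the most frequent rating and that the "average rating rule" reports the arithmetic mean; the function names and the sample ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean


def average_rule(ratings):
    """Aggregate ratings by their arithmetic mean (assumed 'average rating rule')."""
    return mean(ratings)


def majority_rule(ratings):
    """Aggregate ratings by the most frequent value (assumed 'majority rating rule').

    Ties are broken toward the smallest rating, an arbitrary choice for this sketch.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


# Hypothetical data: seven honest ratings plus three cheaters who rate 1.
honest = [4, 4, 4, 4, 4, 5, 3]
cheaters = [1, 1, 1]
polluted = honest + cheaters

print("average, honest only  :", average_rule(honest))
print("average, with cheaters:", average_rule(polluted))
print("majority, with cheaters:", majority_rule(polluted))
```

In this toy data the three cheating ratings pull the average from 4 down to 3.1, while the most frequent rating stays at 4. This only illustrates the intuition; the paper's model and bounds are not reproduced here.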

preprint 2013 arXiv

Practical Characterization of Large Networks Using Neighborhood Information

Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and offer no theoretical guarantee on accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize the Web APIs available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees as an unbiased estimator, and that it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy as state-of-the-art methods.
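
For context, node label density can already be estimated without bias by a plain random walk if each visited node is re-weighted by the inverse of its degree. The sketch below shows that standard baseline only; it is not the neighborhood-augmented method of the paper, and the toy graph, labels and function name are hypothetical.

```python
import random


def rw_label_density(adj, labels, target, steps, seed=0):
    """Estimate the fraction of nodes carrying `target` with a re-weighted random walk.

    A simple random walk visits a node with probability proportional to its
    degree, so each visit is weighted by 1/degree to undo that bias.
    """
    rng = random.Random(seed)
    node = rng.choice(sorted(adj))
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(steps):
        node = rng.choice(adj[node])      # move to a uniformly chosen neighbor
        w = 1.0 / len(adj[node])          # inverse-degree weight of the visited node
        if labels[node] == target:
            weighted_hits += w
        total_weight += w
    return weighted_hits / total_weight


# Hypothetical undirected, connected toy graph given as adjacency lists.
adj = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [3],
}
labels = {0: "a", 1: "b", 2: "a", 3: "b", 4: "b"}

estimate = rw_label_density(adj, labels, "a", steps=200_000)
print("estimated density of label 'a':", round(estimate, 3), "(true value 0.4)")
```

Without the inverse-degree weights the walk would converge to the degree-weighted share of label "a", which is 6/12 = 0.5 on this graph rather than the true 2/5.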

preprint 2013 arXiv

Sampling Content Distributed Over Graphs

Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal methodology to characterize the vast amount of content distributed over these networks. Due to the large scale of these networks, exhaustive enumeration of this content is computationally prohibitive. In this paper, we show how one can obtain content properties by sampling only a small fraction of vertices. We first show that naively applied sampling can produce a huge bias in content statistics (e.g., the average number of content duplications). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that one needs to sample most vertices in the graph to obtain accurate statistics using such a method. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), to measure content characteristics using available information in sampled contents. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We perform experiments to show that WCE and SCE are cost effective and also "asymptotically unbiased". Our methodology provides a new tool for researchers to efficiently query content distributed in large scale networks.

preprint 2013 arXiv

Social Sensor Placement in Large Scale Networks: A Graph Sampling Perspective

Sensor placement for the purpose of detecting and tracking news outbreaks and preventing rumor spreading is a challenging problem in a large scale online social network (OSN). This problem is a kind of subset selection problem: choosing a small set of items from a large population so as to maximize some prespecified set function. However, it is known to be NP-complete. Existing heuristics are very costly, especially for modern OSNs, which usually contain hundreds of millions of users. This paper aims to design methods to find good solutions that trade off efficiency and accuracy well. We first show that it is possible to obtain a high quality solution with a probabilistic guarantee from a "candidate set" of the underlying social network. By exploring this candidate set, one can increase the efficiency of placing social sensors. We also present how this candidate set can be obtained using "graph sampling", which has the advantage over previous methods of not requiring prior knowledge of the complete network topology. Experiments carried out on two real datasets demonstrate not only the accuracy and efficiency of our approach, but also its effectiveness in detecting and predicting news outbreaks.

preprint 2013 arXiv

Stochastic Analysis on RAID Reliability for Solid-State Drives

Solid-state drives (SSDs) have been widely deployed in desktops and data centers. However, SSDs suffer from bit errors, and the bit error rate is time dependent since it increases as an SSD wears down. Traditional storage systems mainly use parity-based RAID to provide reliability guarantees by striping redundancy across multiple devices, but the effectiveness of RAID in SSDs remains debatable, as parity updates aggravate the wear and bit error rates of SSDs. In particular, an open problem is how different parity distributions over multiple devices, such as the even distribution suggested by conventional wisdom or the uneven distributions proposed in recent RAID schemes for SSDs, may influence the reliability of an SSD RAID array. To address this fundamental problem, we propose the first analytical model to quantify the reliability dynamics of an SSD RAID array. Specifically, we develop a "non-homogeneous" continuous time Markov chain model, and derive the transient reliability solution. We validate our model via trace-driven simulations and conduct numerical analysis to provide insights into the reliability dynamics of SSD RAID arrays under different parity distributions and subject to different bit error rates and array configurations. Designers can use our model to decide the appropriate parity distribution based on their reliability requirements.

preprint 2013 arXiv

Stochastic Modeling of Large-Scale Solid-State Storage Systems: Analysis, Design Tradeoffs and Optimization

Solid state drives (SSDs) have seen wide deployment in mobile devices, desktops, and data centers due to their high I/O performance and low energy consumption. As SSDs write data out-of-place, garbage collection (GC) is required to erase and reclaim space with invalid data. However, GC incurs additional writes that hinder the I/O performance, while SSD blocks can only endure a finite number of erasures. Thus, there is a performance-durability tradeoff in the design space of GC. To characterize the optimal tradeoff, this paper formulates an analytical model that explores the full optimal design space of any GC algorithm. We first present a stochastic Markov chain model that captures the I/O dynamics of large-scale SSDs, and adapt the mean-field approach to derive the asymptotic steady-state performance. We further prove the model convergence and generalize the model for all types of workload. Inspired by this model, we propose a randomized greedy algorithm (RGA) that can operate along the optimal tradeoff curve with a tunable parameter. Using trace-driven simulation on DiskSim with SSD add-ons, we demonstrate how RGA can be parameterized to realize the performance-durability tradeoff.

preprint 2012 arXiv

Mathematical Modeling of Competitive Group Recommendation Systems with Application to Peer Review Systems

In this paper, we present a mathematical model to capture various factors which may influence the accuracy of a competitive group recommendation system. We apply this model to peer review systems, e.g., conference or research grant review, which are an essential component of our scientific community. We explore a number of important questions: How will the number of reviews per paper affect the accuracy of the overall recommendation? Will the score aggregation policy influence the final recommendation? How may reviewers' preferences affect the accuracy of the final recommendation? To answer these questions, we formally analyze our model. Through this analysis, we gain insight into how to design a randomized algorithm which is both computationally efficient and asymptotically accurate in evaluating the accuracy of a competitive group recommendation system. We obtain a number of interesting observations; for example, for a medium tier conference, three reviews per paper are sufficient for a high accuracy recommendation. For prestigious conferences, one may need at least seven reviews per paper to achieve high accuracy. We also propose a heterogeneous review strategy which requires equal or less reviewing workload, but can improve over a homogeneous review strategy in recommendation accuracy by as much as 30%. We believe our models and methodology are important building blocks to study competitive group recommendation systems.

preprint 2012 arXiv

On the Evolution of the Internet Economic Ecosystem

The evolution of the Internet has manifested itself in many ways: the traffic characteristics, the interconnection topologies and the business relationships among the autonomous components. It is important to understand why (and how) this evolution came about, and how the interplay of these dynamics may affect future evolution and services. We propose a network aware, macroscopic model that captures the characteristics and interactions of the application and network providers, and show how it leads to a market equilibrium of the ecosystem. By analyzing the driving forces and the dynamics of the market equilibrium, we obtain some fundamental understandings of the cause and effect of the Internet evolution, which explain why some historical and recent evolutions have happened. Furthermore, by projecting the likely future evolutions, our model can help application and network providers to make informed business decisions so as to succeed in this competitive ecosystem.

preprint 2011 arXiv

A Matrix-Analytic Solution for Randomized Load Balancing Models with Phase-Type Service Times

In this paper, we provide a matrix-analytic solution for randomized load balancing models (also known as supermarket models) with phase-type (PH) service times. Generalizing the service times to the phase-type distribution makes the analysis of the supermarket models more difficult and challenging than that of the exponential service time case which has been extensively discussed in the literature. We first describe the supermarket model as a system of differential vector equations, and provide a doubly exponential solution to the fixed point of the system of differential vector equations. Then we analyze the exponential convergence of the current location of the supermarket model to its fixed point. Finally, we present numerical examples to illustrate our approach and show its effectiveness in analyzing the randomized load balancing schemes with non-exponential service requirements.
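
The term "doubly exponential" is easiest to see in the classical exponential-service special case, which the abstract cites as the well-studied baseline. The sketch below assumes Poisson arrivals of rate rho < 1 per server, unit-rate exponential service, and d probed servers per arrival; the fixed point for the fraction of queues with at least k jobs is then rho^((d^k - 1)/(d - 1)). The PH-service result of the paper is a matrix generalization that is not reproduced here, and the function names are hypothetical.

```python
def tail_fixed_point(rho, d, kmax):
    """Fixed-point fractions pi_k of queues holding at least k jobs, k = 0..kmax.

    Classical exponential-service case: pi_k = rho ** ((d**k - 1) / (d - 1)),
    which reduces to the geometric tail rho ** k when d == 1.
    """
    if not 0 < rho < 1 or d < 1:
        raise ValueError("need 0 < rho < 1 and d >= 1")
    tails = []
    for k in range(kmax + 1):
        # (d**k - 1) / (d - 1) is the integer 1 + d + ... + d**(k-1).
        exponent = k if d == 1 else (d ** k - 1) // (d - 1)
        tails.append(rho ** exponent)
    return tails


def mean_field_residual(pi, rho, d, k):
    """Right-hand side of the mean-field ODE at level k; zero at a fixed point."""
    return rho * (pi[k - 1] ** d - pi[k] ** d) - (pi[k] - pi[k + 1])


pi_two_choices = tail_fixed_point(0.5, 2, 5)
pi_one_choice = tail_fixed_point(0.5, 1, 5)
print("d = 2:", pi_two_choices)
print("d = 1:", pi_one_choice)
```

With rho = 0.5, probing two servers gives a tail that falls as 0.5, 0.125, 0.0078125 and so on, compared with the geometric 0.5, 0.25, 0.125 of a single random choice.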

preprint 2011 arXiv

Online Robust Subspace Tracking from Partial Information

This paper presents GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm), an efficient and robust online algorithm for tracking subspaces from highly incomplete information. The algorithm uses a robust $l^1$-norm cost function in order to estimate and track non-stationary subspaces when the streaming data vectors are corrupted with outliers. We apply GRASTA to the problems of robust matrix completion and real-time separation of background from foreground in video. In this second application, we show that GRASTA performs high-quality separation of moving objects from background at exceptional speeds: In one popular benchmark video example, GRASTA achieves a rate of 57 frames per second, even when run in MATLAB on a personal laptop.

preprint 2010 arXiv

Doubly Exponential Solution for Randomized Load Balancing Models with Markovian Arrival Processes and PH Service Times

In this paper, we provide a novel matrix-analytic approach for studying doubly exponential solutions of randomized load balancing models (also known as supermarket models) with Markovian arrival processes (MAPs) and phase-type (PH) service times. We describe the supermarket model as a system of differential vector equations by means of density dependent jump Markov processes, and obtain a closed-form solution with a doubly exponential structure to the fixed point of the system of differential vector equations. Based on this, we show that the fixed point can be decomposed into the product of two factors reflecting arrival information and service information, and further find that the doubly exponential solution to the fixed point is not always unique for more general supermarket models. Furthermore, we analyze the exponential convergence of the current location of the supermarket model to its fixed point, and apply the Kurtz Theorem to study the density dependent jump Markov process given in the supermarket model with MAPs and PH service times, which leads to the Lipschitz condition under which the fraction measure of the supermarket model converges weakly to the system of differential vector equations. This paper provides a new understanding of how workload probing can help in load balancing jobs with non-Poisson arrivals and non-exponential service times.

preprint 2009 arXiv

On Oligopoly Spectrum Allocation Game in Cognitive Radio Networks with Capacity Constraints

Dynamic spectrum sharing is a promising technology to improve spectrum utilization in future wireless networks. Flexible spectrum management provides new opportunities for licensed primary users and unlicensed secondary users to reallocate the spectrum resource efficiently. In this paper, we present an oligopoly pricing framework for dynamic spectrum allocation in which the primary users sell excess spectrum to the secondary users for monetary return. We present two approaches, the strict constraints (type-I) and the QoS penalty (type-II), to model the realistic situation that the primary users have limited capacities. In the oligopoly model with strict constraints, we propose a low-complexity searching method to obtain the Nash Equilibrium and prove its uniqueness. When the model is reduced to a duopoly game, we analytically show the interesting gaps in the leader-follower pricing strategy. In the QoS penalty based oligopoly model, a novel variable transformation method is developed to derive the unique Nash Equilibrium. When the market information is limited, we provide three myopically optimal algorithms, "StrictBEST", "StrictBR" and "QoSBEST", that enable price adjustment for duopoly primary users based on the Best Response Function (BRF) and the bounded rationality (BR) principles. Numerical results validate the effectiveness of our analysis and demonstrate the fast convergence of "StrictBEST" as well as "QoSBEST" to the Nash Equilibrium. For the "StrictBR" algorithm, we reveal the chaotic behaviors of dynamic price adaptation in response to the learning rates.

preprint 2008 arXiv

Understanding the Paradoxical Effects of Power Control on the Capacity of Wireless Networks

Recent works show conflicting results: network capacity may increase or decrease with higher transmission power under different scenarios. In this work, we want to understand this paradox. Specifically, we address the following questions: (1) Theoretically, should we increase or decrease transmission power to maximize network capacity? (2) Theoretically, how much network capacity gain can we achieve by power control? (3) Under realistic situations, how do power control, link scheduling and routing interact with each other? Under which scenarios can we expect a large capacity gain by using higher transmission power? To answer these questions, we first prove that the optimal network capacity is a non-decreasing function of transmission power. Second, we prove that the optimal network capacity can be increased without bound by higher transmission power in some network configurations. However, when nodes are distributed uniformly, the gain in optimal network capacity from higher transmission power is upper-bounded by a positive constant. Third, we discuss why network capacity in practice may increase or decrease with higher transmission power under different scenarios using carrier sensing and minimum hop-count routing. Extensive simulations are carried out to verify our analysis.

39 published item(s)

Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

Federated Online Clustering of Bandits

LPC-AD: Fast and Accurate Multivariate Time Series Anomaly Detection via Latent Predictive Coding

Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications

Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms

Online Competitive Influence Maximization

Conversational Contextual Bandit: Algorithm and Application

Online VNF Chaining and Predictive Scheduling: Optimality and Trade-offs

Quantifying Deployability & Evolvability of Future Internet Architectures via Economic Models

A Fast Sampling Method of Exploring Graphlet Degrees of Large Directed and Undirected Graphs

A General Framework for Estimating Graphlet Statistics via Random Walk

A Unified Framework for Information Consumption Based on Markov Chains

Monet: A User-oriented Behavior-based Malware Variants Detection System for Android

PowerWalk: Scalable Personalized PageRank via Random Walks with Vertex-Centric Decomposition

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

Stochastic Modeling of Hybrid Cache Systems

Mathematical Modeling of Insurance Mechanisms for E-commerce Systems

Minfer: Inferring Motif Statistics From Sampled Edges

Tracking Triadic Cardinality Distributions for Burst Detection in Social Activity Streams

Algorithmic Design for Competitive Influence Maximization Problems

Block-Structured Supermarket Models

Design of Efficient Sampling Methods on Hybrid Social-Affiliation Networks

Efficiently Estimating Motif Statistics of Large Networks

Friends or Foes: Distributed and Randomized Algorithms to Determine Dishonest Recommenders in Online Social Networks

The Chaos of Propagation in a Retrial Supermarket Model

DroidAnalytics: A Signature Based Analytic System to Collect, Extract, Analyze and Associate Android Malware

Mathematical Modeling of Product Rating: Sufficiency, Misbehavior and Aggregation Rules

Practical Characterization of Large Networks Using Neighborhood Information

Sampling Content Distributed Over Graphs

Social Sensor Placement in Large Scale Networks: A Graph Sampling Perspective

Stochastic Analysis on RAID Reliability for Solid-State Drives

Stochastic Modeling of Large-Scale Solid-State Storage Systems: Analysis, Design Tradeoffs and Optimization

Mathematical Modeling of Competitive Group Recommendation Systems with Application to Peer Review Systems

On the Evolution of the Internet Economic Ecosystem

A Matrix-Analytic Solution for Randomized Load Balancing Models with Phase-Type Service Times

Online Robust Subspace Tracking from Partial Information

Doubly Exponential Solution for Randomized Load Balancing Models with Markovian Arrival Processes and PH Service Times

On Oligopoly Spectrum Allocation Game in Cognitive Radio Networks with Capacity Constraints

Understanding the Paradoxical Effects of Power Control on the Capacity of Wireless Networks