Source author record

Ruoming Jin

Ruoming Jin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Social and Information Networks; Databases; Machine Learning; physics.soc-ph; Artificial Intelligence; Cryptography and Security; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Information Retrieval

Catalog footprint

What is connected

13 works

9 topics

4 close collaborators

preprint, 2022, arXiv

Lifelong DP: Consistently Bounded Differential Privacy in Lifelong Machine Learning

In this paper, we show that the process of continually learning new tasks and memorizing previous tasks introduces unknown privacy risks and makes it challenging to bound the privacy loss. Based upon this, we introduce a formal definition of Lifelong DP, in which the participation of any data tuple in the training set of any task is protected, under a consistently bounded DP protection, given a growing stream of tasks. A consistently bounded DP means having only one fixed value of the DP privacy budget, regardless of the number of tasks. To preserve Lifelong DP, we propose a scalable and heterogeneous algorithm, called L2DP-ML, with streaming batch training, to efficiently train and continue releasing new versions of an L2M model, given the heterogeneity in terms of data sizes and the training order of tasks, without affecting the DP protection of the private training set. An end-to-end theoretical analysis and thorough evaluations show that our mechanism is significantly better than baseline approaches in preserving Lifelong DP. The implementation of L2DP-ML is available at: https://github.com/haiphanNJIT/PrivateDeepLearning.

preprint, 2022, arXiv

Speed-ANN: Low-Latency and High-Accuracy Nearest Neighbor Search via Intra-Query Parallelism

Nearest Neighbor Search (NNS) has recently drawn a rapid increase of interest due to its core role in managing high-dimensional vector data in data science and AI applications. The interest is fueled by the success of neural embedding, where deep learning models transform unstructured data into semantically correlated feature vectors for data analysis, e.g., recommending popular items. Among several categories of methods for fast NNS, similarity graphs are one of the most successful algorithmic trends. Several of the most popular and top-performing similarity graphs, such as NSG and HNSW, at their core employ best-first traversal along the underlying graph indices to search for near neighbors. Maximizing the performance of the search is essential for many tasks, especially in the large-scale and high-recall regime. In this work, we provide an in-depth examination of the state-of-the-art similarity search algorithms, revealing their challenges in leveraging multi-core processors to speed up search. We also explore whether similarity graph search is robust to deviation from maintaining strict order by allowing multiple walkers to simultaneously advance the search frontier. Based on our insights, we propose Speed-ANN, a parallel similarity search algorithm that exploits hidden intra-query parallelism and the memory hierarchy, allowing similarity search to take advantage of multiple CPU cores to significantly accelerate search speed while achieving high accuracy. We evaluate Speed-ANN on a wide range of datasets, ranging from a million to a billion data points, and show that it achieves shorter query latency than both NSG and HNSW. In addition, with multicore support, we show that our approach offers lower search latency than a highly optimized GPU implementation and scales well as the number of hardware resources (e.g., CPU cores) and the graph size increase.
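The best-first traversal mentioned in this abstract can be illustrated with a minimal, sequential sketch. This is a generic single-walker search over an adjacency-list graph, not the Speed-ANN algorithm and not the NSG or HNSW implementations. The function name `best_first_search`, the `queue_size` parameter, the squared Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import heapq

def best_first_search(graph, vectors, query, entry, k, queue_size):
    """Generic best-first search on a similarity graph (illustrative only).

    graph: dict mapping node id -> list of neighbor ids
    vectors: dict mapping node id -> tuple of floats
    Returns up to k node ids, closest to `query` first.
    """
    def dist(i):
        # Squared Euclidean distance (an assumed metric).
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap (negated): best nodes found so far

    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop when the closest candidate is worse than the worst kept result.
        if len(results) >= queue_size and d > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < queue_size or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > queue_size:
                    heapq.heappop(results)  # drop the current worst result

    ordered = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in ordered][:k]

# Toy data (hypothetical): five points on a line, linked as a chain.
VECTORS = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,), 4: (4.0,)}
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
NEAREST = best_first_search(GRAPH, VECTORS, (3.2,), entry=0, k=2, queue_size=3)
```

The loop expands one node at a time in strict distance order, which is the ordering constraint the abstract proposes to relax by letting multiple walkers advance the frontier at once.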

preprint, 2021, arXiv

On Estimating Recommendation Evaluation Metrics under Sampling

Since the recent study by Krichene and Rendle (2020) on sampling-based top-k evaluation metrics for recommendation, there has been much debate on the validity of using sampling to evaluate recommendation algorithms. Though their work and the recent work of Li et al. (2020) have proposed some basic approaches for mapping the sampling-based metrics to their global counterparts, which rank the entire set of items, there is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation. The proposed approaches either are rather uninformative (linking sampling to metric evaluation) or can only work on simple metrics, such as Recall/Precision (Krichene and Rendle 2020; Li et al. 2020). In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics. Since this question is closely related to the underlying mechanism of sampling for recommendation, tackling it can help better understand the power of sampling and can help resolve the questions of whether and how we should use sampling for evaluating recommendation. We introduce two approaches, based on MLE (Maximal Likelihood Estimation) and its weighted variants and on the ME (Maximal Entropy) principle, to recover the empirical rank distribution, and then utilize them for metrics estimation. The experimental results show the advantages of using the new approaches for evaluating recommendation algorithms based on top-k metrics.

preprint, 2020, arXiv

Scalable Differential Privacy with Certified Robustness in Adversarial Learning

In this paper, we aim to develop a scalable algorithm to preserve differential privacy (DP) in adversarial learning for deep neural networks (DNNs), with certified robustness to adversarial examples. By leveraging the sequential composition theory in DP, we randomize both input and latent spaces to strengthen our certified robustness bounds. To address the trade-off among model utility, privacy loss, and robustness, we design an original adversarial objective function, based on the post-processing property in DP, to tighten the sensitivity of our model. A new stochastic batch training is proposed to apply our mechanism on large DNNs and datasets, by bypassing the vanilla iterative batch-by-batch training in DP DNNs. An end-to-end theoretical analysis and evaluations show that our mechanism notably improves the robustness and scalability of DP DNNs.

preprint, 2015, arXiv

A Deep Embedding Model for Co-occurrence Learning

Co-occurrence data is a common and important information source in many areas, such as word co-occurrence in sentences, friend co-occurrence in social networks, and product co-occurrence in commercial transaction data, and it contains rich correlation and clustering information about the items. In this paper, we study co-occurrence data using a general energy-based probabilistic model, and we analyze three different categories of energy-based models, namely, the $L_1$, $L_2$ and $L_k$ models, which are able to capture different levels of dependency in the co-occurrence data. We also discuss how several typical existing models are related to these three types of energy models, including the Fully Visible Boltzmann Machine (FVBM) ($L_2$), Matrix Factorization ($L_2$), Log-BiLinear (LBL) models ($L_2$), and the Restricted Boltzmann Machine (RBM) model ($L_k$). Then, we propose a Deep Embedding Model (DEM) (an $L_k$ model) from the energy model in a principled manner. Furthermore, motivated by the observation that the partition function in the energy model is intractable and the fact that the major objective of modeling the co-occurrence data is to predict using the conditional probability, we apply the maximum pseudo-likelihood method to learn DEM. As a consequence, the developed model and its learning method naturally avoid the above difficulties and can be easily used to compute the conditional probability in prediction. Interestingly, our method is equivalent to learning a special structured deep neural network using back-propagation and a special sampling strategy, which makes it scalable on large-scale datasets. Finally, in the experiments, we show that the DEM can achieve comparable or better results than state-of-the-art methods on datasets across several application domains.

preprint, 2013, arXiv

Hub-Accelerator: Fast and Exact Shortest Path Computation in Large Social Networks

Shortest path computation is one of the most fundamental operations for managing and analyzing large social networks. Though existing techniques are quite effective for finding the shortest path on large but sparse road networks, social graphs have quite different characteristics: they are generally non-spatial, non-weighted, scale-free, and they exhibit small-world properties in addition to their massive size. In particular, the existence of hubs, those vertices with a large number of connections, explodes the search space, making the shortest path computation surprisingly challenging. In this paper, we introduce a set of novel techniques centered around hubs, collectively referred to as the Hub-Accelerator framework, to compute the k-degree shortest path (finding the shortest path between two vertices if their distance is within k). These techniques enable us to significantly reduce the search space by either greatly limiting the expansion scope of hubs (using the novel distance-preserving Hub-Network concept) or completely pruning away the hubs in the online search (using the Hub2-Labeling approach). The Hub-Accelerator approaches are more than two orders of magnitude faster than BFS and the state-of-the-art approximate shortest path method Sketch for the shortest path computation. The Hub-Network approach does not introduce additional index cost with light pre-computation cost; the index size and index construction cost of Hub2-Labeling are also moderate and better than or comparable to the approximation indexing Sketch method.

preprint, 2013, arXiv

Large Scale Real-time Ridesharing with Service Guarantee on Road Networks

The mean occupancy rate of personal vehicle trips in the United States is only 1.6 persons per vehicle mile. Urban traffic gridlock is a familiar scene. Ridesharing has the potential to solve many environmental, congestion, and energy problems. In this paper, we introduce the problem of large scale real-time ridesharing with service guarantee on road networks. Servers and trip requests are dynamically matched while waiting time and service time constraints of trips are satisfied. We first propose two basic algorithms: a branch-and-bound algorithm and an integer programming algorithm. However, these algorithm structures do not adapt well to the dynamic nature of the ridesharing problem. Thus, we then propose a kinetic tree algorithm capable of better scheduling dynamic requests and adjusting routes on-the-fly. We perform experiments on a large real taxi dataset from Shanghai. The results show that the kinetic tree algorithm is faster than other algorithms in response time.

preprint, 2013, arXiv

Limiting the Neighborhood: De-Small-World Network for Outbreak Prevention

In this work, we study a basic and practically important strategy to help prevent and/or delay an outbreak in the context of networks: limiting the contact between individuals. In this paper, we introduce the average neighborhood size as a new measure for the degree of being small-world and utilize it to formally define the de-small-world network problem. We also prove the NP-hardness of the general reachable pair cut problem and propose a greedy edge betweenness based approach as the benchmark in selecting the candidate edges for solving our problem. Furthermore, we transform the de-small-world network problem into an OR-AND Boolean function maximization problem, which is also NP-hard. In addition, we develop a numerical relaxation approach to solve the Boolean function maximization and the de-small-world problem. Also, we introduce the short-betweenness, which measures the edge importance in terms of all short paths with distance no greater than a certain threshold, and utilize it to speed up our numerical relaxation approach. The experimental evaluation demonstrates the effectiveness and efficiency of our approaches.

preprint, 2013, arXiv

Simple, Fast, and Scalable Reachability Oracle

A reachability oracle (or hop labeling) assigns each vertex v two sets of vertices, $L_{out}(v)$ and $L_{in}(v)$, such that u reaches v iff $L_{out}(u) \cap L_{in}(v) \neq \emptyset$. Despite their simplicity and elegance, reachability oracles have failed to become efficient in the more than ten years since their introduction: the main problem is high construction cost, which stems from a set-cover framework and the need to materialize transitive closure. In this paper, we present two simple and efficient labeling algorithms, Hierarchical-Labeling and Distribution-Labeling, which can work on massive real-world graphs: their construction is an order of magnitude faster than the set-cover based labeling approach, and transitive closure materialization is not needed. On large graphs, their index sizes and their query performance can now beat the state-of-the-art transitive closure compression and online search approaches.
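The query rule in this abstract (u reaches v iff the out-label of u and the in-label of v intersect) can be shown with a small sketch. The labels below are hand-written for a three-vertex chain a -> b -> c. They are hypothetical illustration data, not the output of Hierarchical-Labeling or Distribution-Labeling. The sketch also assumes the common convention that every vertex appears in its own labels.

```python
from typing import Dict, Hashable, Set

Labels = Dict[Hashable, Set[Hashable]]

def reaches(u: Hashable, v: Hashable, l_out: Labels, l_in: Labels) -> bool:
    """Answer a reachability query from precomputed hop labels.

    u reaches v iff L_out(u) and L_in(v) share at least one vertex.
    """
    return not l_out[u].isdisjoint(l_in[v])

# Hand-crafted labels for the chain a -> b -> c, using b as the shared hop.
L_OUT: Labels = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
L_IN: Labels = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}

A_REACHES_C = reaches("a", "c", L_OUT, L_IN)  # labels share the hop vertex b
C_REACHES_A = reaches("c", "a", L_OUT, L_IN)  # labels are disjoint
```

A query costs one set intersection test and needs no graph traversal, which is why the construction cost of the labels, rather than the query itself, is the bottleneck the abstract discusses.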

preprint, 2012, arXiv

Network Backbone Discovery Using Edge Clustering

In this paper, we investigate the problem of network backbone discovery. In complex systems, a "backbone" takes a central role in carrying out the system functionality and carries the bulk of system traffic. It also both simplifies and highlights the underlying networking structure. Here, we propose an integrated graph theoretical and information theoretical network backbone model. We develop an efficient mining algorithm based on a Kullback-Leibler divergence optimization procedure and a maximal weight connected subgraph discovery procedure. A detailed experimental evaluation demonstrates both the effectiveness and efficiency of our approach. The case studies in real-world domains further illustrate the usefulness of the discovered network backbones.

preprint, 2011, arXiv

Axiomatic Ranking of Network Role Similarity

A key task in social network and other complex network analysis is role analysis: describing and categorizing nodes according to how they interact with other nodes. Two nodes have the same role if they interact with equivalent sets of neighbors. The most fundamental role equivalence is automorphic equivalence. Unfortunately, the fastest algorithms known for graph automorphism are nonpolynomial. Moreover, since exact equivalence may be rare, a more meaningful task is to measure the role similarity between any two nodes. This task is closely related to the structural or link-based similarity problem that SimRank attempts to solve. However, SimRank and most of its offshoots are not sufficient because they do not fully recognize automorphically or structurally equivalent nodes. In this paper we tackle two problems. First, what are the necessary properties for a role similarity measure or metric? Second, how can we derive a role similarity measure satisfying these properties? For the first problem, we justify several axiomatic properties necessary for a role similarity measure or metric: range, maximal similarity, automorphic equivalence, transitive similarity, and the triangle inequality. For the second problem, we present RoleSim, a new similarity metric with a simple iterative computational method. We rigorously prove that RoleSim satisfies all the axiomatic properties. We also introduce an iceberg RoleSim algorithm which is guaranteed to discover all pairs with a RoleSim score no less than a user-defined threshold $\theta$ without computing the RoleSim for every pair. We demonstrate the superior interpretative power of RoleSim on both synthetic and real datasets.

preprint, 2011, arXiv

Distance Preserving Graph Simplification

Large graphs are difficult to represent, visualize, and understand. In this paper, we introduce "gate graph" - a new approach to perform graph simplification. A gate graph provides a simplified topological view of the original graph. Specifically, we construct a gate graph from a large graph so that for any "non-local" vertex pair (distance higher than some threshold) in the original graph, their shortest-path distance can be recovered by consecutive "local" walks through the gate vertices in the gate graph. We perform a theoretical investigation on the gate-vertex set discovery problem. We characterize its computational complexity and reveal the upper bound of minimum gate-vertex set using VC-dimension theory. We propose an efficient mining algorithm to discover a gate-vertex set with guaranteed logarithmic bound. We further present a fast technique for pruning redundant edges in a gate graph. The detailed experimental results using both real and synthetic graphs demonstrate the effectiveness and efficiency of our approach.

preprint, 2011, arXiv

Relational Approach for Shortest Path Discovery over Large Graphs

With the rapid growth of large graphs, we cannot assume that graphs can still be fully loaded into memory, so disk-based graph operations are inevitable. In this paper, we take shortest path discovery as an example to investigate the technical issues that arise when leveraging the existing infrastructure of relational databases (RDB) in graph data management. Based on the observation that a variety of graph search queries can be implemented by iterative operations, including selecting frontier nodes from visited nodes, making expansion from the selected frontier nodes, and merging the expanded nodes into the visited ones, we introduce a relational FEM framework with three corresponding operators to implement graph search tasks in the RDB context. We show that new features such as the window function and the merge statement introduced by recent SQL standards can not only simplify the expression but also improve the performance of the FEM framework. In addition, we propose two optimization strategies specific to shortest path discovery inside the FEM framework. First, we adopt a bi-directional set Dijkstra's algorithm for path finding. The bi-directional strategy can reduce the search space, and set Dijkstra's algorithm finds the shortest path in a set-at-a-time fashion. Second, we introduce an index named SegTable to preserve the local shortest segments, and exploit SegTable to further improve the performance. The extensive experimental results illustrate that our relational approach with the optimization strategies achieves high scalability and performance.
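The three iterative operations named in this abstract (select frontier nodes, expand them, merge the results into the visited set) can be sketched in plain Python. This is only an in-memory illustration of the set-at-a-time loop, under several assumptions: the paper expresses these steps as relational operators in SQL, which is not reproduced here; the search below is single-directional rather than bi-directional; it uses no SegTable index; and the function name `fem_shortest_distance` and the toy edge list are hypothetical.

```python
import math

def fem_shortest_distance(edges, source, target):
    """Set-at-a-time Dijkstra written as a select/expand/merge loop.

    edges: list of (u, v, weight) rows with non-negative weights,
           standing in for an edge table.
    Returns the shortest distance, or math.inf if target is unreachable.
    """
    # "Visited" table: node -> [best known distance, finalized flag].
    visited = {source: [0.0, False]}

    while True:
        # F: select every unfinalized node that has the minimum distance.
        open_nodes = {n: d for n, (d, done) in visited.items() if not done}
        if not open_nodes:
            return math.inf
        dmin = min(open_nodes.values())
        frontier = {n for n, d in open_nodes.items() if d == dmin}
        if target in frontier:
            return dmin

        # E: expand the whole frontier in one pass over the edge rows,
        # keeping the smallest candidate distance per destination.
        expanded = {}
        for u, v, w in edges:
            if u in frontier:
                cand = dmin + w
                if cand < expanded.get(v, math.inf):
                    expanded[v] = cand

        # M: merge expanded rows into the visited table.
        for n in frontier:
            visited[n][1] = True
        for v, d in expanded.items():
            if v not in visited:
                visited[v] = [d, False]
            elif not visited[v][1] and d < visited[v][0]:
                visited[v][0] = d

# Toy edge table (hypothetical data).
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
DIST_A_D = fem_shortest_distance(EDGES, "a", "d")
```

Each iteration touches a set of nodes rather than a single node, which is the property that lets the same loop be phrased as a join (expand) followed by an upsert (merge) in a relational engine.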
