Source author record

Jianzhong Li

Jianzhong Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Databases; Computational Complexity; Machine Learning; Artificial Intelligence; Computation and Language; Computational Geometry; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture; physics.app-ph; physics.plasm-ph; Social and Information Networks

Catalog footprint

What is connected

16 works

12 topics

4 close collaborators


preprint · 2024 · arXiv

The PCP-like Theorem for Sub-linear Time Inapproximability

In this paper we propose the PCP-like theorem for sub-linear time inapproximability. Abboud et al. have devised the distributed PCP framework for proving sub-quadratic time inapproximability. Here we try to go further in this direction. Starting from SETH, we first find a problem denoted as Ext-$k$-SAT, which cannot be computed in linear time, then devise an efficient MA-like protocol for this problem. To use this protocol to prove the sub-linear time inapproximability of other problems, we devise a new kind of reduction denoted as Ext-reduction, which is different from existing reduction techniques. We also define two new hardness classes, the problems in which can be computed in linear time, but cannot be efficiently approximated in sub-linear time. Some problems are shown to be in the newly defined hardness classes.

preprint · 2022 · arXiv

A New Model for Massively Parallel Computation Considering both Communication and IO Cost

In the research area of parallel computation, the communication cost has been extensively studied, while the IO cost has been neglected. For big data computation, the assumption that the data fits in main memory no longer holds, and external memory must be used. Therefore, it is necessary to bring the IO cost into the parallel computation model. In this paper, we propose the first parallel computation model which takes IO cost as well as non-uniform communication cost into consideration. Based on the new model, we raise several new problems which aim to minimize the IO and communication cost on the new model. We prove the hardness of these new problems, then design and analyze approximation algorithms for solving them.

preprint · 2022 · arXiv

Dynamic Approximate Maximum Independent Set on Massive Graphs

Computing a maximum independent set (MaxIS) is a fundamental NP-hard problem in graph theory, which has important applications in a wide spectrum of fields. Since graphs in many applications are changing frequently over time, the problem of maintaining a MaxIS over dynamic graphs has attracted increasing attention over the past few years. Due to the intractability of maintaining an exact MaxIS, this paper aims to develop efficient algorithms that can maintain an approximate MaxIS with a theoretical accuracy guarantee. In particular, we propose a framework that maintains a $(\frac{\Delta}{2} + 1)$-approximate MaxIS over dynamic graphs and prove that it achieves a constant approximation ratio in many real-world networks. To the best of our knowledge, this is the first non-trivial approximability result for the dynamic MaxIS problem. Following the framework, we implement an efficient linear-time dynamic algorithm and a more effective dynamic algorithm with near-linear expected time complexity. Our thorough experiments over real and synthetic graphs demonstrate the effectiveness and efficiency of the proposed algorithms, especially when the graph is highly dynamic.

preprint · 2022 · arXiv

PCP Theorems, SETH and More: Towards Proving Sub-linear Time Inapproximability

In this paper we propose the PCP-like theorem for sub-linear time inapproximability. Abboud et al. have devised the distributed PCP framework for sub-quadratic time inapproximability. We show that the distributed PCP theorem can be generalized for proving arbitrary polynomial time inapproximability, but fails in the linear case. We prove the sub-linear PCP theorem by adapting from an MA-protocol for the Set Containment problem, and show how to use the theorem to prove both existing and new inapproximability results, exhibiting the power of the sub-linear PCP theorem. Considering the emerging research works on sub-linear time algorithms, the sub-linear PCP theorem is important in guiding the research in sub-linear time approximation algorithms.

preprint · 2022 · arXiv

Rank-Regret Minimization

Multi-criteria decision-making often requires finding a small representative set from the database. A recently proposed method is the regret minimization set (RMS) query. RMS returns a size $r$ subset $S$ of dataset $D$ that minimizes the regret-ratio (the difference between the score of top-1 in $S$ and the score of top-1 in $D$, for any possible utility function). RMS is not shift invariant, causing inconsistency in results. Further, existing work showed that the regret-ratio is often a made-up number and users may mistake its absolute value. Instead, users do understand the notion of rank. Thus it considered the problem of finding the minimal set $S$ with a rank-regret (the rank of top-1 tuple of $S$ in the sorted list of $D$) at most $k$, called the rank-regret representative (RRR) problem. Corresponding to RMS, we focus on the min-error version of RRR, called the rank-regret minimization (RRM) problem, which finds a size $r$ set to minimize the maximum rank-regret for all utility functions. Further, we generalize RRM and propose the restricted RRM (i.e., RRRM) problem to optimize the rank-regret for functions restricted in a given space. Previous studies on both RMS and RRR did not consider the restricted function space. The solution for RRRM usually has a lower regret level and can better serve the specific preferences of some users. Note that RRM and RRRM are shift invariant. In 2D space, we design a dynamic programming algorithm 2DRRM to return the optimal solution for RRM. In HD space, we propose an algorithm HDRRM that introduces a double approximation guarantee on rank-regret. Both 2DRRM and HDRRM are applicable for RRRM. Extensive experiments on the synthetic and real datasets verify the efficiency and effectiveness of our algorithms. In particular, HDRRM always has the best output quality in experiments.
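The rank-regret definition above can be illustrated with a small sketch. It assumes linear utility functions given as weight vectors and evaluates only a finite, hand-picked sample of them, whereas the abstract considers all utility functions; the names `score`, `rank_regret`, and `max_rank_regret` are hypothetical and are not taken from the paper. It is not the 2DRRM or HDRRM algorithm, only the quantity they aim to minimize.

```python
from typing import Sequence


def score(weights: Sequence[float], point: Sequence[float]) -> float:
    """Linear utility of a tuple (assumption: utilities are linear)."""
    return sum(w * x for w, x in zip(weights, point))


def rank_regret(dataset, subset_idx, weights) -> int:
    """1-based rank, within the whole dataset, of the best tuple of the subset."""
    best_in_subset = max(score(weights, dataset[i]) for i in subset_idx)
    # Rank = 1 + number of tuples in D that score strictly higher.
    return 1 + sum(1 for p in dataset if score(weights, p) > best_in_subset)


def max_rank_regret(dataset, subset_idx, utilities) -> int:
    """Worst rank-regret over a finite sample of utility functions."""
    return max(rank_regret(dataset, subset_idx, w) for w in utilities)


# Toy data: four 2D tuples, a size-1 candidate set, three sampled utilities.
D = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.2, 0.3)]
S = [2]  # index of the tuple (0.6, 0.6)
utilities = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

print(max_rank_regret(D, S, utilities))  # prints 2

# Shifting every tuple by a constant leaves all ranks unchanged.
D_shifted = [(x + 10.0, y + 10.0) for x, y in D]
print(max_rank_regret(D_shifted, S, utilities))  # prints 2
```

Because the measure depends only on the ordering of tuples, adding the same constant to every tuple does not change it, which is the shift invariance the abstract notes for RRM.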

preprint · 2022 · arXiv

Turing Machines with Two-level Memory: A Deep Look into the Input/Output Complexity

The input/output complexity, which is the complexity of data exchange between the main memory and the external memory, has been studied extensively by earlier researchers. However, the existing works did not consider the input/output complexity from a computation model point of view. In this paper we remedy this by proposing three variants of the Turing machine that include external memory and a mechanism for exchanging data between main memory and external memory. Based on these new models, the input/output complexity is studied in depth. We discuss the relationship between input/output complexity and other complexity measures such as time complexity and parameterized complexity, which was not considered by earlier researchers. We also define the external access trace complexity, which reflects the physical behavior of magnetic disks and gives theoretical evidence for IO-efficient algorithms.

preprint · 2020 · arXiv

A Sub-linear Time Algorithm for Approximating k-Nearest-Neighbor with Full Quality Guarantee

In this paper we propose an algorithm for the approximate k-Nearest-Neighbors problem. According to existing research, there are two kinds of approximation criteria. One is the distance criterion, and the other is the recall criterion. All former algorithms suffer from the problem that there are no theoretical guarantees for the two approximation criteria. The algorithm proposed in this paper unifies the two kinds of approximation criteria, and has full theoretical guarantees. Furthermore, the query time of the algorithm is sub-linear. As far as we know, it is the first algorithm that achieves both sub-linear query time and a full theoretical approximation guarantee.

preprint · 2020 · arXiv

Auto-Model: Utilizing Research Papers and HPO Techniques to Deal with the CASH problem

In many fields, a large number of algorithms with completely different hyperparameters have been developed to address the same type of problem. Choosing the algorithm and hyperparameter setting correctly can greatly improve overall performance, but users often fail to do so due to a lack of knowledge. How to help users effectively and quickly select a suitable algorithm and hyperparameter setting for a given task instance is an important research topic, known as the CASH problem. In this paper, we design the Auto-Model approach, which makes full use of known information in related research papers and introduces hyperparameter optimization techniques, to solve the CASH problem effectively. Auto-Model tremendously reduces the cost of algorithm implementations and the hyperparameter configuration space, and is thus capable of dealing with the CASH problem efficiently and easily. To demonstrate the benefit of Auto-Model, we compare it with the classical Auto-Weka approach. The experimental results show that our proposed approach can provide superior results and achieves better performance in a short time.

preprint · 2020 · arXiv

Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

Data inconsistency evaluation and repair are major concerns in data quality management. As the basic computing task, optimal subset repair is not only applied for cost estimation during database repair, but also directly used to derive the evaluation of database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples whose removal from an inconsistent database leaves a consistent subset. Tight bounds on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, together with a fast estimation of the size of an optimal subset repair. First, we strengthen the dichotomy for the optimal subset repair computation problem: we show that it is not only APX-complete, but also NP-hard to approximate an optimal subset repair within a factor better than $17/16$ in most cases. Second, we show a $(2-0.5^{\sigma-1})$-approximation whenever $\sigma$ functional dependencies are given, and a $(2-\eta_k+\frac{\eta_k}{k})$-approximation when an $\eta_k$-portion of tuples have the $k$-quasi-Turán property for some $k>1$. Finally, we show a sublinear estimator of the size of an optimal S-repair for subset queries; it outputs an estimate with a ratio of $2n+\epsilon n$ with high probability, thus deriving an estimate of the FD-inconsistency degree with a ratio of $2+\epsilon$. To support a variety of subset queries for FD-inconsistency evaluation, we unify them as the $\subseteq$-oracle, which can answer membership queries and return $p$ uniformly sampled tuples whenever given a number $p$. Experiments are conducted on range queries as an implementation of the $\subseteq$-oracle, and the results show the efficiency of our FD-inconsistency degree estimator.
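To make the subset repair task concrete, here is a minimal sketch of the textbook baseline, not the improved approximations described in the abstract. It relies on the standard observation that violations of functional dependencies are pairwise conflicts between tuples, so removing both endpoints of a greedily built maximal matching in the conflict graph yields a consistent subset and removes at most twice as many tuples as an optimal repair. The row format, the FD encoding `(lhs_attributes, rhs_attribute)`, and the function names are assumptions made for illustration.

```python
from itertools import combinations


def conflicts(rows, fds):
    """Pairs of row indices that jointly violate some FD lhs -> rhs."""
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        for lhs, rhs in fds:
            same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
            if same_lhs and rows[i][rhs] != rows[j][rhs]:
                pairs.append((i, j))
                break  # one violated FD is enough to mark the pair
    return pairs


def two_approx_repair(rows, fds):
    """Indices to delete: both endpoints of a greedy maximal matching."""
    removed = set()
    for i, j in conflicts(rows, fds):
        if i not in removed and j not in removed:
            removed.update((i, j))
    return removed


# Toy instance with the single FD zip -> city.
rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "02101", "city": "Boston"},
]
fds = [(("zip",), "city")]

removed = two_approx_repair(rows, fds)
kept = [r for i, r in enumerate(rows) if i not in removed]
print(sorted(removed))       # prints [0, 1]
print(conflicts(kept, fds))  # prints []
```

On this instance an optimal repair deletes a single tuple, while the baseline deletes two, which stays within the factor of 2 that the abstract's results improve upon.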

preprint · 2020 · arXiv

PHOTOPiC: Calculate photo-ionization functions and model coefficients for gas discharge simulations

A program to compute photo-ionization functions and fitting parameters for an efficient photo-ionization model is presented. The code integrates the product of the spectrum emission intensity, the photo-ionization yield and the absorption coefficient to calculate the photo-ionization function of each gas and the total photo-ionization function of the mixture. The coefficients of the Helmholtz photo-ionization model are obtained by fitting the total photo-ionization function. A database covering $\rm N_2$, $\rm O_2$, $\rm CO_2$ and $\rm H_2O$ molecules is included and can be modified by users. The program provides more accurate photo-ionization functions and source terms for plasma fluid models.

preprint · 2016 · arXiv

Data Source Selection for Information Integration in Big Data Era

In the big data era, information integration often requires abundant data extracted from massive data sources. Due to the large number of data sources, data source selection plays a crucial role in information integration, since it is costly and even impossible to access all data sources. Data source selection should consider both efficiency and effectiveness. For efficiency, the approach should achieve high performance and scale to a large number of data sources. For effectiveness, data quality and the overlap of sources must be considered, since data quality varies greatly across data sources, with significant differences in the accuracy and coverage of the data provided, and the overlap of sources can even lower the quality of data integrated from the selected data sources. In this paper, we study the source selection problem in the big data era and propose methods which can scale to datasets with up to millions of data sources and guarantee the quality of results. Motivated by this, we propose a new objective function that takes the expected number of true values a source can provide as the criterion for evaluating the contribution of a data source. Based on our proposed index, we present a scalable algorithm and two pruning strategies to improve the efficiency without sacrificing precision. Experimental results on both real-world and synthetic data sets show that our methods can efficiently select sources providing a large proportion of true values and can scale to massive data sources.

preprint · 2016 · arXiv

Efficient Entity Resolution on Heterogeneous Records

Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each other. To leverage such records better, most existing work assumes that schema matching and data exchange have been done to convert records under different schemas to those under a predefined schema. However, we observe that schema matching would lose information in some cases, which could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper, we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework to iteratively find records which refer to the same entity. Regarding efficiency, we build an index to generate candidates and accelerate similarity computation. Evaluations on real-world datasets show the effectiveness and efficiency of our methods.

preprint · 2016 · arXiv

Technique Report: Scheduling Flows with Multiple Service Frequency Constraints

With the fast development of wireless technologies, wireless applications have entered various areas of people's lives with a wide range of capabilities. Guaranteeing Quality-of-Service (QoS) is the key to the success of those applications. One of the QoS requirements, service frequency, is very important for tasks including multimedia transmission in the Internet of Things. A service frequency constraint denotes the length of the time period during which a link can transmit at least once. Unfortunately, it has not been well addressed yet. Therefore, this paper proposes a new framework to schedule multiple transmitting flows in wireless networks considering a service frequency constraint for each link. In our model, the constraints for flows are heterogeneous due to the diversity of users' behaviors. We first introduce a new definition of network stability with service frequency constraints and demonstrate that the novel scheduling policy is throughput-optimal in one fundamental category of network models. After that, we discuss the performance of a wireless network with service frequency constraints in terms of capacity region and total queue length. Finally, a series of evaluations indicates that the proposed scheduling policy can guarantee service frequency and achieve good performance in terms of the queue length of each flow.

preprint · 2015 · arXiv

Efficient Influence Maximization in Weighted Independent Cascade Model

The influence maximization (IM) problem is to find a seed set in a social network which achieves the maximal influence spread. This problem plays an important role in viral marketing. Numerous models have been proposed to solve this problem. However, none of them considers the attributes of nodes. Focusing only on the structure of the network makes it difficult to apply these models to real-world applications. Motivated by this, we present the weighted independent cascade (WIC) model, a novel cascade model which extends the applicability of the independent cascade (IC) model by attaching attributes to the nodes. The IM problem in the WIC model is to maximize the value of the nodes which are influenced. This problem is NP-hard. To solve this problem, we present a basic greedy algorithm and the Weight Reset (WR) algorithm. Moreover, we propose the Bounded Weight Reset (BWR) algorithm to further improve the efficiency by bounding the diffusion node influence. We prove that BWR is a fully polynomial-time approximation scheme (FPTAS). Experimentally, we show that with the additional node attribute, the solution achieved by the WIC model outperforms that of the IC model in nearly 90%. The experimental results show that BWR can achieve excellent approximation and is more than three orders of magnitude faster than the greedy algorithm with little sacrifice of accuracy. In particular, BWR can handle large networks with millions of nodes in several tens of seconds while keeping rather high accuracy. These results demonstrate that BWR can solve the IM problem effectively and efficiently.
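The objective described above, the total value of influenced nodes, can be sketched with a Monte Carlo simulation of an independent cascade plus a plain greedy seed selection. This is only an illustration under assumptions: the graph encoding (`node -> list of (neighbor, probability)`), the node weights, and all function names are hypothetical, and the code is the generic greedy baseline rather than the WR or BWR algorithms from the paper.

```python
import random


def simulate_weighted_spread(graph, weights, seeds, rng):
    """One independent-cascade run; returns the total weight of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return sum(weights[v] for v in active)


def estimate_spread(graph, weights, seeds, runs=200, seed=0):
    """Average weighted spread over `runs` simulations (fixed seed for repeatability)."""
    rng = random.Random(seed)
    total = sum(simulate_weighted_spread(graph, weights, seeds, rng) for _ in range(runs))
    return total / runs


def greedy_seeds(graph, weights, k, runs=200):
    """Pick k seeds, each time adding the node with the best estimated spread."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in weights if v not in chosen]
        best = max(candidates, key=lambda v: estimate_spread(graph, weights, chosen + [v], runs))
        chosen.append(best)
    return chosen


# Toy network with deterministic edges (probability 1.0) so the result is exact.
graph = {"a": [("b", 1.0)], "c": [("d", 1.0)]}
weights = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 5.0}

print(greedy_seeds(graph, weights, k=1))         # prints ['c']
print(estimate_spread(graph, weights, ["c"]))    # prints 6.0
```

In the toy network, seeding `c` reaches the high-value node `d`, so it beats seeding `a` even though both seeds activate the same number of nodes; this is the kind of difference that node weights introduce relative to counting activated nodes.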

preprint · 2015 · arXiv

SimRank Computation on Uncertain Graphs

SimRank is a similarity measure between vertices in a graph, which has become a fundamental technique in graph analytics. Recently, many algorithms have been proposed for efficient evaluation of SimRank similarities. However, the existing SimRank computation algorithms either overlook uncertainty in graph structures or are based on an unreasonable assumption (Du et al.). In this paper, we study SimRank similarities on uncertain graphs based on the possible worlds model of uncertain graphs. Following the random-walk-based formulation of SimRank on deterministic graphs and the possible worlds model of uncertain graphs, we define random walks on uncertain graphs for the first time and show that our definition of random walks satisfies the Markov property. We formulate the SimRank measure based on random walks on uncertain graphs. We discover a critical difference between random walks on uncertain graphs and random walks on deterministic graphs, which makes all existing SimRank computation algorithms on deterministic graphs inapplicable to uncertain graphs. To efficiently compute SimRank similarities, we propose three algorithms, namely the baseline algorithm with high accuracy, the sampling algorithm with high efficiency, and the two-phase algorithm with efficiency comparable to the sampling algorithm and about an order of magnitude smaller relative error than the sampling algorithm. The extensive experiments and case studies verify the effectiveness of our SimRank measure and the efficiency of our SimRank computation algorithms.
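For background, the following sketch computes SimRank on an ordinary deterministic graph using the standard iterative definition: a vertex has similarity 1 with itself, and two distinct vertices have similarity equal to a decay factor times the average similarity of their in-neighbor pairs. It does not implement the uncertain-graph measure or any of the three algorithms from the abstract; the function name, the `in_neighbors` dictionary format, and the default decay factor of 0.8 are assumptions made for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Iterative SimRank on a deterministic graph given as node -> list of in-neighbors."""
    nodes = list(in_neighbors)
    # Start from the identity: each vertex is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A vertex without in-neighbors is similar only to itself.
                    new[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph: u points to both a and b, so a and b share their only in-neighbor.
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph)
print(scores["a"]["b"])  # prints 0.8
print(scores["u"]["a"])  # prints 0.0
```

Here `a` and `b` reach the maximum similarity two distinct vertices can have, namely the decay factor itself, because their single in-neighbor is the same vertex.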

preprint · 2012 · arXiv

Efficient Subgraph Matching on Billion Node Graphs

The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
