Source author record

Yufei Tao

Yufei Tao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Databases · Data Structures and Algorithms · Machine Learning

Catalog footprint

What is connected

9 works

3 topics

4 close collaborators


preprint · 2026 · arXiv

Distributed Learning with Adversarial Gradient Perturbations

Privacy concerns in distributed learning often lead clients to return intentionally altered gradient information. We consider the problem of learning convex and $L$-smooth functions under adversarial gradient perturbation, where a client's gradient reply to a server query can deviate arbitrarily from the true gradient subject to a distance bound. Our study focuses on two fundamental questions: (i) what is the smallest achievable sub-optimality gap (i.e., excess error in optimization) under such responses, and (ii) how many queries are sufficient to guarantee a given sub-optimality gap? We establish tight feasibility thresholds on the sub-optimality gap and provide algorithms that achieve these thresholds with provable query complexity guarantees.
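The setting above can be illustrated with a toy sketch. It is not the paper's algorithm or its bound: the objective `f(x) = 0.5 * ||x||^2`, the particular adversary in `adversarial_reply`, and all function names are assumptions chosen for illustration. The adversary shrinks every gradient reply by exactly `delta` in Euclidean norm, which keeps the reply within the allowed distance of the true gradient.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def true_grad(x):
    # Gradient of the assumed objective f(x) = 0.5 * ||x||^2 (convex, 1-smooth).
    return list(x)

def adversarial_reply(g, delta):
    # Hypothetical adversary: shrink the true gradient toward zero.
    # The reply is at distance min(||g||, delta) from g, so it respects the bound.
    n = norm(g)
    if n <= delta:
        return [0.0] * len(g)
    scale = (n - delta) / n
    return [scale * c for c in g]

def descend(x0, delta, steps, lr=1.0):
    # Plain gradient descent that only ever sees the perturbed replies.
    x = list(x0)
    for _ in range(steps):
        reply = adversarial_reply(true_grad(x), delta)
        x = [xi - lr * gi for xi, gi in zip(x, reply)]
    return x

def gap(x):
    # Sub-optimality gap f(x) - f(x*), where the minimizer x* is the origin.
    return 0.5 * norm(x) ** 2

x_final = descend([3.0, 4.0], delta=0.5, steps=50)
print(gap(x_final))
```

In this toy instance the iterate stalls at distance `delta` from the minimizer, so plain gradient descent is left with a gap of about `0.5 * delta**2`. This only shows that bounded perturbations can leave a nonzero gap for one particular method; the tight thresholds are the subject of the paper.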

preprint · 2023 · arXiv

Space-Query Tradeoffs in Range Subgraph Counting and Listing

This paper initiates the study of range subgraph counting and range subgraph listing, both of which are motivated by the significant demand in practice for graph analytics on subgraphs pertinent to only selected, as opposed to all, vertices. In the first problem, there is an undirected graph $G$ where each vertex carries a real-valued attribute. Given an interval $q$ and a pattern $Q$, a query counts the number of occurrences of $Q$ in the subgraph of $G$ induced by the vertices whose attributes fall in $q$. The second problem has the same setup except that a query needs to enumerate (rather than count) those occurrences with a small delay. In both problems, our goal is to understand the tradeoff between space usage and query cost, or more specifically: (i) given a target on query efficiency, how much pre-computed information about $G$ must we store? (ii) Or conversely, given a budget on space usage, what is the best query time we can hope for? We establish a suite of upper- and lower-bound results on such tradeoffs for various query patterns.
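To make the counting query concrete, here is a brute-force sketch for the case where the pattern $Q$ is a triangle. It stores nothing in advance and scans the graph on every query, so it sits at the "no extra space, slow query" end of the tradeoff; the function name, the closed interval, and the sample graph are assumptions for illustration only.

```python
from itertools import combinations

def count_triangles_in_range(edges, attr, lo, hi):
    """Count triangles in the subgraph induced by vertices with lo <= attr <= hi."""
    keep = {v for v, a in attr.items() if lo <= a <= hi}
    # Adjacency restricted to the selected vertices.
    adj = {v: set() for v in keep}
    for u, v in edges:
        if u in keep and v in keep and u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Test every unordered vertex triple once.
    count = 0
    for u, v, w in combinations(sorted(keep), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            count += 1
    return count

# Hypothetical sample: two triangles sharing vertex "c".
attr = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
print(count_triangles_in_range(edges, attr, 1.0, 3.0))
```

A query structure in the sense of the abstract would trade stored information for a faster answer than this cubic scan.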

preprint · 2022 · arXiv

Parallel Acyclic Joins with Canonical Edge Covers

In PODS'21, Hu presented an algorithm in the massively parallel computation (MPC) model that processes any acyclic join with an asymptotically optimal load. In this paper, we present an alternative analysis of her algorithm. The novelty of our analysis is in the revelation of a new mathematical structure -- which we name "canonical edge cover" -- for acyclic hypergraphs. We prove non-trivial properties for canonical edge covers that offer us a graph-theoretic perspective about why Hu's algorithm works.

preprint · 2014 · arXiv

A Dynamic I/O-Efficient Structure for One-Dimensional Top-k Range Reporting

We present a structure in external memory for "top-k range reporting", which uses linear space, answers a query in O(lg_B n + k/B) I/Os, and supports an update in O(lg_B n) amortized I/Os, where n is the input size, and B is the block size. This improves the state of the art which incurs O(lg^2_B n) amortized I/Os per update.

preprint · 2013 · arXiv

I/O-Efficient Planar Range Skyline and Attrition Priority Queues

In the planar range skyline reporting problem, we store a set $P$ of $n$ 2D points in a structure such that, given a query rectangle $Q = [a_1, a_2] \times [b_1, b_2]$, the maxima (a.k.a. skyline) of $P \cap Q$ can be reported efficiently. The query is 3-sided if an edge of $Q$ is grounded, giving rise to two variants: top-open ($b_2 = \infty$) and left-open ($a_1 = -\infty$) queries. All our results are in external memory under the $O(n/B)$ space budget, for both the static and dynamic settings:

* For static $P$, we give structures that answer top-open queries in $O(\log_B n + k/B)$, $O(\log\log_B U + k/B)$, and $O(1 + k/B)$ I/Os when the universe is $\mathbb{R}^2$, a $U \times U$ grid, and a rank space grid $[O(n)]^2$, respectively (where $k$ is the number of reported points). The query complexity is optimal in all cases.
* We show that the left-open case is harder, such that any linear-size structure must incur $\Omega((n/B)^\epsilon + k/B)$ I/Os for a query. We show that this case is as difficult as the general 4-sided queries, for which we give a static structure with the optimal query cost $O((n/B)^\epsilon + k/B)$.
* We give a dynamic structure that supports top-open queries in $O(\log_{2B^\epsilon}(n/B) + k/B^{1-\epsilon})$ I/Os, and updates in $O(\log_{2B^\epsilon}(n/B))$ I/Os, for any $\epsilon$ satisfying $0 \le \epsilon \le 1$. This leads to a dynamic structure for 4-sided queries with optimal query cost $O((n/B)^\epsilon + k/B)$, and amortized update cost $O(\log(n/B))$.

As a contribution of independent interest, we propose an I/O-efficient version of the fundamental structure priority queue with attrition (PQA). Our PQA supports FindMin, DeleteMin, and InsertAndAttrite all in $O(1)$ worst case I/Os, and $O(1/B)$ amortized I/Os per operation. We also add the new CatenateAndAttrite operation that catenates two PQAs in $O(1)$ worst case and $O(1/B)$ amortized I/Os. This operation is a non-trivial extension to the classic PQA of Sundar, even in internal memory.

preprint · 2013 · arXiv

Optimal Planar Range Skyline Reporting with Linear Space in External Memory

Let $P$ be a set of $n$ points in $\mathbb{R}^2$. Given a rectangle $Q = [\alpha_1, \alpha_2] \times [\beta_1, \beta_2]$, a range skyline query returns the maxima of the points in $P \cap Q$. An important variant is the so-called top-open query, where $Q$ is a 3-sided rectangle whose upper edge is grounded at $y = \infty$ (that is, $\beta_2 = \infty$). These queries are crucial in numerous database applications. In internal memory, extensive research has been devoted to designing data structures that can answer such queries efficiently. In contrast, currently there is no clear understanding about their exact complexities in external memory. This paper presents several structures of linear size for answering the above queries with the optimal I/O cost. We show that a top-open query can be solved in $O(\log_B n + k/B)$ I/Os, where $B$ is the block size and $k$ is the number of points in the query result. The query cost can be made $O(\log\log_B U + k/B)$ when the data points lie in a $U \times U$ grid for some integer $U \ge n$, and further lowered to $O(1 + k/B)$ if $U = O(n)$. The same efficiency also applies to 3-sided queries where $Q$ is a right-open rectangle. However, the hardness of the problem increases if $Q$ is a left- or bottom-open 3-sided rectangle. We prove that any linear-size structure must perform $\Omega((n/B)^\epsilon + k/B)$ I/Os to solve such a query in the worst case, where $\epsilon > 0$ can be an arbitrarily small constant. In fact, left- and bottom-open queries are just as difficult as general (4-sided) queries, for which we give a linear-size structure with query time $O((n/B)^\epsilon + k/B)$. Interestingly, this indicates that 4-sided range skyline queries have exactly the same hardness as 4-sided range reporting (where the goal is to report simply the whole $P \cap Q$). That is, the skyline requirement does not alter the problem difficulty at all.
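As a reference for what a range skyline query returns, the following in-memory sketch filters the points to the rectangle and then sweeps from right to left, keeping each point whose y-coordinate exceeds every y seen so far. It is a plain baseline, not one of the external-memory structures described above; the function name, the closed rectangle boundaries, and the sample points are assumptions.

```python
def range_skyline(points, x1, x2, y1, y2=float("inf")):
    """Return the maxima of the points inside [x1, x2] x [y1, y2], sorted by x.

    Leaving y2 at its default of infinity gives a top-open (3-sided) query.
    """
    inside = [(x, y) for x, y in points if x1 <= x <= x2 and y1 <= y <= y2]
    # Visit points by decreasing x; ties on x are visited by decreasing y.
    inside.sort(key=lambda p: (-p[0], -p[1]))
    skyline, best_y = [], float("-inf")
    for x, y in inside:
        # A point is maximal iff no point to its right (or at the same x) is as high.
        if y > best_y:
            skyline.append((x, y))
            best_y = y
    return skyline[::-1]

# Hypothetical sample data.
pts = [(1, 9), (2, 4), (3, 7), (4, 5), (5, 6), (6, 1), (7, 3)]
print(range_skyline(pts, 2, 6, 2))      # top-open query
print(range_skyline(pts, 2, 6, 2, 6))   # 4-sided query
```

Each query here costs a full scan plus a sort of the matching points, which is the cost the linear-size structures in the abstract are designed to avoid.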

preprint · 2012 · arXiv

A Scalable Algorithm for Maximizing Range Sum in Spatial Databases

This paper investigates the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangular region r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based applications such as finding the best place for a new franchise store with a limited delivery range and finding the most attractive place for a tourist with a limited reachable range. However, the problem has been studied mainly in theory, particularly in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms that do not guarantee scalability. In this paper, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem, which is optimal in terms of the I/O complexity. Furthermore, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem, which is a circular version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm. From extensive experimental results, we show that the ExactMaxRS algorithm is two orders of magnitude faster than methods adapted from existing algorithms, and that the approximation bound in practice is much better than the theoretical bound of the ApproxMaxCRS algorithm.
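The MaxRS problem statement can be sketched with a brute-force in-memory baseline. This is not ExactMaxRS. It assumes positive weights and a closed rectangle, in which case some optimal placement has its left edge on a point's x-coordinate and its bottom edge on a point's y-coordinate, so only those corners need to be tried. The function name and sample data are hypothetical.

```python
def max_range_sum(points, width, height):
    """Best lower-left corner for a width x height rectangle over (x, y, weight) points.

    Assumes positive weights and closed rectangle boundaries.
    """
    best_sum, best_corner = 0, None
    xs = sorted({x for x, _, _ in points})
    ys = sorted({y for _, y, _ in points})
    for x0 in xs:
        for y0 in ys:
            # Total weight covered when the rectangle's lower-left corner is (x0, y0).
            total = sum(w for x, y, w in points
                        if x0 <= x <= x0 + width and y0 <= y <= y0 + height)
            if total > best_sum:
                best_sum, best_corner = total, (x0, y0)
    return best_sum, best_corner

# Hypothetical sample: three nearby points and one heavy outlier.
sample = [(0, 0, 1), (1, 1, 2), (2, 2, 3), (10, 10, 4)]
print(max_range_sum(sample, 2, 2))
```

The triple loop takes cubic time in the number of points and keeps everything in memory, which is exactly the scalability limitation the abstract contrasts with an external-memory algorithm.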

preprint · 2012 · arXiv

Optimal Algorithms for Crawling a Hidden Database in the Web

A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source does not proceed by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, together with other factors such as the possibility that the interface answers a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal -- i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
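The crawling task can be sketched under strong simplifying assumptions that are not taken from the paper: the hidden database holds distinct integers, the interface accepts a half-open range `[lo, hi)`, and it returns at most `k` matches together with a flag saying whether results were cut off. The crawler below halves any range that overflows. It is a naive baseline with hypothetical names, not the optimal algorithms the abstract refers to.

```python
def make_top_k_interface(values, k):
    """Hypothetical search interface: at most k matches plus an overflow flag."""
    def query(lo, hi):
        matches = sorted(v for v in values if lo <= v < hi)
        return matches[:k], len(matches) > k
    return query

def crawl(query, lo, hi, log=None):
    """Extract every value in [lo, hi) by halving any range whose answer overflows."""
    if log is not None:
        log.append((lo, hi))  # record each query issued, to measure the cost
    found, overflow = query(lo, hi)
    if not overflow:
        return found
    # Values are distinct integers, so a unit-width range never overflows for k >= 1
    # and the recursion terminates.
    mid = (lo + hi) // 2
    return crawl(query, lo, mid, log) + crawl(query, mid, hi, log)

hidden = [3, 8, 15, 16, 23, 42, 57, 91]
issued = []
recovered = crawl(make_top_k_interface(hidden, k=2), 0, 128, issued)
print(recovered, len(issued))
```

The length of `issued` is the number of queries spent, which is the quantity the paper's upper and lower bounds concern.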

preprint · 2010 · arXiv

Transparent Anonymization: Thwarting Adversaries Who Know the Algorithm

Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks that exploit various characteristics of the anonymization mechanism. This paper provides a practical solution to the above problem. First, we propose an analytical model for evaluating disclosure risks when an adversary knows everything in the anonymization process except the sensitive values. Based on this model, we develop a privacy principle, transparent l-diversity, which ensures privacy protection against such powerful adversaries. We identify three algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through extensive experiments with real data.
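For background, the sketch below checks the classic distinct l-diversity condition on a generalized table: every group of rows sharing the same quasi-identifier values must contain at least l distinct sensitive values. This is the baseline notion only. It does not model an adversary who knows the anonymization algorithm, so it is not a check for transparent l-diversity. The function name, column names, and sample rows are hypothetical.

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, qi_columns, sensitive_column, l):
    """True if every quasi-identifier group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in qi_columns)
        groups[key].add(row[sensitive_column])
    return all(len(values) >= l for values in groups.values())

# Hypothetical generalized table with two quasi-identifier groups.
table = [
    {"age": "20-29", "zip": "537**", "disease": "flu"},
    {"age": "20-29", "zip": "537**", "disease": "asthma"},
    {"age": "30-39", "zip": "538**", "disease": "flu"},
    {"age": "30-39", "zip": "538**", "disease": "diabetes"},
    {"age": "30-39", "zip": "538**", "disease": "flu"},
]
print(is_distinct_l_diverse(table, ["age", "zip"], "disease", 2))
```

A table can pass this check and still leak information once the adversary reasons about how the groups were formed, which is the gap the abstract describes.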
