Source author record

Xiaojun Lin

Xiaojun Lin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Information Theory; math.IT; Networking and Internet Architecture; Computer Science and Game Theory; Discrete Mathematics; math.OC; math.ST; Other Computer Science; Performance; Programming Languages; q-fin.RM; Social and Information Networks; Software Engineering; Statistics Theory; Systems and Control

Catalog footprint

What is connected

11 works

18 topics

4 close collaborators


Preprint, 2024, arXiv

CodeFuse-Query: A Data-Centric Static Code Analysis System for Large-Scale Organizations

In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design. CodeFuse-Query reimagines code analysis as a data computation task, supporting scans of over 10 billion lines of code daily and more than 300 different tasks. It optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces task types designed specifically for Code Change, underscoring its domain-optimized design. The system's logic-oriented facet employs Datalog, utilizing a unique two-tiered schema, COREF, to convert source code into data facts. Through Godel, a distinctive language, CodeFuse-Query enables formulation of complex tasks as logical expressions, harnessing Datalog's declarative prowess. This paper provides empirical evidence of CodeFuse-Query's transformative approach, demonstrating its robustness, scalability, and efficiency. We also highlight its real-world impact and diverse applications, emphasizing its potential to reshape the landscape of static code analysis in the context of large-scale software development. Furthermore, in the spirit of collaboration and advancing the field, our project is open-sourced and the repository is available for public access.

Preprint, 2022, arXiv

A Case for Sampling Based Learning Techniques in Coflow Scheduling

Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling. We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed and a publicly available production cluster trace from Facebook shows that, compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in the average (P90) case by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.
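The sampling idea in this abstract can be illustrated with a minimal sketch. It is not Philae's implementation: the `Coflow` class, the function names, and the `num_samples` parameter are hypothetical, the total size is assumed to be the sampled average flow size times the flow count, and coflow contention (which the abstract says also enters the ordering) is ignored.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Coflow:
    """Hypothetical coflow record: a name plus the true size of each flow."""
    name: str
    flow_sizes: List[float]


def estimate_coflow_size(coflow: Coflow, num_samples: int, rng: random.Random) -> float:
    """Estimate total coflow size from a few sampled flows (assumed estimator)."""
    if not coflow.flow_sizes:
        raise ValueError("coflow must contain at least one flow")
    k = min(num_samples, len(coflow.flow_sizes))
    sampled = rng.sample(coflow.flow_sizes, k)   # flows that are pre-scheduled
    avg_flow_size = sum(sampled) / k             # learned average flow size
    return avg_flow_size * len(coflow.flow_sizes)


def shortest_coflow_first(
    coflows: List[Coflow], num_samples: int = 2, seed: int = 0
) -> Tuple[List[str], Dict[str, float]]:
    """Order coflows by estimated size, smallest first (contention ignored)."""
    rng = random.Random(seed)
    estimates = {c.name: estimate_coflow_size(c, num_samples, rng) for c in coflows}
    order = sorted(coflows, key=lambda c: estimates[c.name])
    return [c.name for c in order], estimates


if __name__ == "__main__":
    small = Coflow("small", [1.0] * 4)
    big = Coflow("big", [100.0] * 10)
    print(shortest_coflow_first([big, small]))
```

With uniform flow sizes the estimate is exact regardless of which flows are sampled; with skewed sizes the estimate depends on the sample, which is the robustness question the abstract addresses.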

Preprint, 2022, arXiv

On the Generalization Power of the Overfitted Three-Layer Neural Tangent Kernel Model

In this paper, we study the generalization performance of overparameterized 3-layer NTK models. We show that, for a specific set of ground-truth functions (which we refer to as the "learnable set"), the test error of the overfitted 3-layer NTK is upper bounded by an expression that decreases with the number of neurons of the two hidden layers. Unlike the 2-layer NTK, which has only one hidden layer, the 3-layer NTK involves interactions between two hidden layers. Our upper bound reveals that, between the two hidden layers, the test error descends faster with respect to the number of neurons in the second hidden layer (the one closer to the output) than with respect to that in the first hidden layer (the one closer to the input). We also show that the learnable set of 3-layer NTK without bias is no smaller than that of 2-layer NTK models with various choices of bias in the neurons. However, in terms of the actual generalization performance, our results suggest that 3-layer NTK is much less sensitive to the choices of bias than 2-layer NTK, especially when the input dimension is large.

Preprint, 2021, arXiv

Graph Matching with Partially-Correct Seeds

Graph matching aims to find the latent vertex correspondence between two edge-correlated graphs and has found numerous applications across different fields. In this paper, we study a seeded graph matching problem, which assumes that a set of seeds, i.e., pre-mapped vertex-pairs, is given in advance. While most previous work requires all seeds to be correct, we focus on the setting where the seeds are partially correct. Specifically, consider two correlated graphs whose edges are sampled independently from a parent Erdős-Rényi graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown $β$ fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $2$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results in \cite{10.14778/2794367.2794371,lubars2018correcting} when $p\ge n^{-5/6}$. For instance, when $p$ is a constant or $p=n^{-3/4}$, we show that only $Ω(\sqrt{n\log n})$ correct seeds suffice for perfect matching, while the previously best-known results demand $Ω(n)$ and $Ω(n^{3/4}\log n)$ correct seeds, respectively. Numerical experiments corroborate our theoretical findings, demonstrating the superiority of our $2$-hop algorithm on a variety of synthetic and real graphs.
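The $1$-hop idea described above, counting for each candidate pair the seeds that are neighbors of both vertices, can be sketched as follows. This is an illustration only and not the paper's algorithm: the adjacency-dict representation, the function names, and the greedy highest-count-first assignment rule are all assumptions made here.

```python
from typing import Dict, Hashable, Set, Tuple

Node = Hashable
Graph = Dict[Node, Set[Node]]  # undirected graph: node -> set of neighbours


def one_hop_seed_counts(
    adj1: Graph, adj2: Graph, seeds: Dict[Node, Node]
) -> Dict[Tuple[Node, Node], int]:
    """For each pair (u, v), count seeds (s, t) with s adjacent to u and t adjacent to v."""
    counts: Dict[Tuple[Node, Node], int] = {}
    for u, neighbours in adj1.items():
        for s in neighbours:
            t = seeds.get(s)          # seed image of s; may be an incorrect seed
            if t is None:
                continue
            for v in adj2.get(t, ()):
                counts[(u, v)] = counts.get((u, v), 0) + 1
    return counts


def greedy_match(counts: Dict[Tuple[Node, Node], int]) -> Dict[Node, Node]:
    """Assumed matching rule: take pairs in decreasing count order, one-to-one."""
    used1, used2, matching = set(), set(), {}
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for (u, v), _ in ranked:
        if u not in used1 and v not in used2:
            matching[u] = v
            used1.add(u)
            used2.add(v)
    return matching


if __name__ == "__main__":
    # Two identical toy graphs; vertices 0..2 are seeds, 3..5 are to be matched.
    edges = [(3, 0), (3, 1), (4, 1), (4, 2), (5, 0), (5, 2)]
    g: Graph = {i: set() for i in range(6)}
    for a, b in edges:
        g[a].add(b)
        g[b].add(a)
    print(greedy_match(one_hop_seed_counts(g, g, {0: 0, 1: 1, 2: 2})))
```

In the toy graph each unmatched vertex is adjacent to a distinct pair of seeds, so the true pair collects two seed witnesses and every wrong pair at most one.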

Preprint, 2021, arXiv

The Power of $D$-hops in Matching Power-Law Graphs

This paper studies seeded graph matching for power-law graphs. Assume that two edge-correlated graphs are independently edge-sampled from a common parent graph with a power-law degree distribution. A set of correctly matched vertex-pairs is chosen at random and revealed as initial seeds. Our goal is to use the seeds to recover the remaining latent vertex correspondence between the two graphs. Departing from the existing approaches that focus on the use of high-degree seeds in $1$-hop neighborhoods, we develop an efficient algorithm that exploits the low-degree seeds in suitably-defined $D$-hop neighborhoods. Specifically, we first match a set of vertex-pairs with appropriate degrees (which we refer to as the first slice) based on the number of low-degree seeds in their $D$-hop neighborhoods. This significantly reduces the number of initial seeds needed to trigger a cascading process to match the rest of the graphs. Under the Chung-Lu random graph model with $n$ vertices, max degree $Θ(\sqrt{n})$, and the power-law exponent $2<β<3$, we show that as soon as $D> \frac{4-β}{3-β}$, by optimally choosing the first slice, with high probability our algorithm can correctly match a constant fraction of the true pairs without any error, provided with only $Ω((\log n)^{4-β})$ initial seeds. Our result achieves an exponential reduction in the seed size requirement, as the best previously known result requires $n^{1/2+ε}$ seeds (for any small constant $ε>0$). Performance evaluation with synthetic and real data further corroborates the improved performance of our algorithm.
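As a worked instance of the hop-count condition stated above, the following evaluates the threshold for two exponents. The values of $β$ are chosen here purely for illustration, and $D$ is taken to be an integer hop count.

```latex
\[
\beta = 2.5:\quad \frac{4-\beta}{3-\beta} = \frac{1.5}{0.5} = 3
\;\Longrightarrow\; D \ge 4,
\qquad \text{seeds: } \Omega\big((\log n)^{1.5}\big),
\]
\[
\beta = 2.2:\quad \frac{4-\beta}{3-\beta} = \frac{1.8}{0.8} = 2.25
\;\Longrightarrow\; D \ge 3,
\qquad \text{seeds: } \Omega\big((\log n)^{1.8}\big).
\]
```

The threshold grows without bound as $β$ approaches $3$ and tends to $2$ as $β$ approaches $2$, so heavier-tailed degree distributions need fewer hops under this condition.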

Preprint, 2020, arXiv

Learning Large Electrical Loads via Flexible Contracts with Commitment

Large electricity customers (e.g., large data centers) can exhibit huge and variable electricity demands, which pose significant challenges for electricity suppliers in planning for sufficient capacity. Thus, it is desirable to design incentive and coordination mechanisms between the customers and the supplier to lower the capacity cost. This paper proposes a novel scheme based on flexible contracts. Unlike existing demand-side management schemes in the literature, a flexible contract leads to information revelation. That is, a customer committing to a flexible contract reveals valuable information about its future demand to the supplier. Such information revelation allows the customers and the supplier to share the risk of future demand uncertainty. On the other hand, the customer will still retain its autonomy in operation. We address two key challenges for the design of optimal flexible contracts: i) the contract design is a non-convex optimization problem and is intractable for a large number of customer types, and ii) the design should be robust to unexpected or adverse responses of the customers, i.e., a customer facing more than one contract yielding the same benefit may choose the contract less favorable to the supplier. We address these challenges by proposing sub-optimal contracts of low computational complexity that can achieve a provable fraction of the performance gain under the global optimum.

Preprint, 2015, arXiv

Proactive Demand Response for Data Centers: A Win-Win Solution

In order to reduce the energy cost of data centers, recent studies suggest distributing computation workload among multiple geographically dispersed data centers, by exploiting the electricity price difference. However, the impact of data center load redistribution on the power grid is not well understood yet. This paper takes the first step towards tackling this important issue, by studying how the power grid can take advantage of the data centers' load distribution proactively for the purpose of power load balancing. We model the interactions between power grid and data centers as a two-stage problem, where the utility company chooses proper pricing mechanisms to balance the electric power load in the first stage, and the data centers seek to minimize their total energy cost by responding to the prices in the second stage. We show that the two-stage problem is a bilevel quadratic program, which is NP-hard and cannot be solved using standard convex optimization techniques. We introduce benchmark problems to derive upper and lower bounds for the solution of the two-stage problem. We further propose a branch and bound algorithm to attain the globally optimal solution, and propose a heuristic algorithm with low computational complexity to obtain an alternative close-to-optimal solution. We also study the impact of background load prediction error using the theoretical framework of robust optimization. The simulation results demonstrate that our proposed scheme can not only improve the power grid reliability but also reduce the energy cost of data centers.

Preprint, 2014, arXiv

Achieving Optimal Throughput and Near-Optimal Asymptotic Delay Performance in Multi-Channel Wireless Networks with Low Complexity: A Practical Greedy Scheduling Policy

In this paper, we focus on the scheduling problem in multi-channel wireless networks, e.g., the downlink of a single cell in fourth generation (4G) OFDM-based cellular networks. Our goal is to design practical scheduling policies that can achieve provably good performance in terms of both throughput and delay, at a low complexity. While a class of $O(n^{2.5} \log n)$-complexity hybrid scheduling policies has recently been developed to guarantee both rate-function delay optimality (in the many-channel many-user asymptotic regime) and throughput optimality (in the general non-asymptotic setting), their practical complexity is typically high. To address this issue, we develop a simple greedy policy called Delay-based Server-Side-Greedy (D-SSG) with a lower complexity of $2n^2+2n$, and rigorously prove that D-SSG not only achieves throughput optimality, but also guarantees near-optimal asymptotic delay performance. Specifically, we show that the rate-function attained by D-SSG for any delay-violation threshold $b$ is no smaller than the maximum achievable rate-function by any scheduling policy for threshold $b-1$. Thus, we are able to achieve a reduction in complexity (from $O(n^{2.5} \log n)$ of the hybrid policies to $2n^2 + 2n$) with a minimal drop in the delay performance. More importantly, in practice, D-SSG generally has a substantially lower complexity than the hybrid policies that typically have a large constant factor hidden in the $O(\cdot)$ notation. Finally, we conduct numerical simulations to validate our theoretical results in various scenarios. The simulation results show that D-SSG not only guarantees a near-optimal rate-function, but also empirically is virtually indistinguishable from delay-optimal policies.

Preprint, 2014, arXiv

Optimal Monitoring and Mitigation of Systemic Risk in Financial Networks

This paper studies the problem of optimally allocating a cash injection into a financial system in distress. Given a one-period borrower-lender network in which all debts are due at the same time and have the same seniority, we address the problem of allocating a fixed amount of cash among the nodes to minimize the weighted sum of unpaid liabilities. Assuming all the loan amounts and asset values are fixed and that there are no bankruptcy costs, we show that this problem is equivalent to a linear program. We develop a duality-based distributed algorithm to solve it, which is useful for applications where it is desirable to avoid centralized data gathering and computation. We also consider the problem of minimizing the expectation of the weighted sum of unpaid liabilities under the assumption that the net external asset holdings of all institutions are stochastic. We show that this problem is a two-stage stochastic linear program. To solve it, we develop two algorithms, one based on Benders decomposition and the other on projected stochastic gradient descent. We show that if the defaulting nodes never pay anything, the deterministic optimal cash injection allocation problem is an NP-hard mixed-integer linear program. However, modern optimization software enables the computation of very accurate solutions to this problem on a personal computer in a few seconds for network sizes comparable with the size of the US banking system. In addition, we address the problem of allocating the cash injection amount so as to minimize the number of nodes in default. For this problem, we develop two heuristic algorithms: a reweighted $\ell_1$ minimization algorithm and a greedy algorithm. We illustrate these two algorithms using three synthetic network structures for which the optimal solution can be calculated exactly. We also compare these two algorithms on three types of random networks which are more complex.

Preprint, 2013, arXiv

Low-Complexity Scheduling Policies for Achieving Throughput and Asymptotic Delay Optimality in Multi-Channel Wireless Networks

In this paper, we study the scheduling problem for downlink transmission in a multi-channel (e.g., OFDM-based) wireless network. We focus on a single cell, with the aim of developing a unifying framework for designing low-complexity scheduling policies that can provide optimal performance in terms of both throughput and delay. We develop new easy-to-verify sufficient conditions for rate-function delay optimality (in the many-channel many-user asymptotic regime) and throughput optimality (in the general non-asymptotic setting), respectively. The sufficient conditions allow us to prove rate-function delay optimality for a class of Oldest Packets First (OPF) policies and throughput optimality for a large class of Maximum Weight in the Fluid limit (MWF) policies, respectively. By exploiting the special features of our carefully chosen sufficient conditions and intelligently combining policies from the classes of OPF and MWF policies, we design hybrid policies that are both rate-function delay-optimal and throughput-optimal with a complexity of $O(n^{2.5} \log n)$, where $n$ is the number of channels or users. Our sufficient condition is also used to show that a previously proposed policy called Delay Weighted Matching (DWM) is rate-function delay-optimal. However, DWM incurs a high complexity of $O(n^5)$. Thus, our approach yields significantly lower complexity than the only previously designed delay and throughput optimal scheduling policy. We also conduct numerical experiments to validate our theoretical results.

Preprint, 2013, arXiv

Online Energy Generation Scheduling for Microgrids with Intermittent Energy Sources and Co-Generation

Microgrids represent an emerging paradigm of future electric power systems that can utilize both distributed and centralized generation. Two recent trends in microgrids are the integration of local renewable energy sources (such as wind farms) and the use of co-generation (i.e., to supply both electricity and heat). However, these trends also bring unprecedented challenges to the design of intelligent control strategies for microgrids. Traditional generation scheduling paradigms rely on perfect prediction of future electricity supply and demand. They are no longer applicable to microgrids with unpredictable renewable energy supply and with co-generation (which must account for both electricity and heat demand). In this paper, we study online algorithms for the microgrid generation scheduling problem with intermittent renewable energy sources and co-generation, with the goal of maximizing the cost-savings with local generation. Based on the insights from the structure of the offline optimal solution, we propose a class of competitive online algorithms, called CHASE (Competitive Heuristic Algorithm for Scheduling Energy-generation), that track the offline optimal in an online fashion. Under typical settings, we show that CHASE achieves the best competitive ratio among all deterministic online algorithms, and the ratio is no larger than a small constant 3.
