Source author record

Dan Alistarh

Dan Alistarh appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning; Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Artificial Intelligence; Computer Vision; Computation and Language; Hardware Architecture; math.OC; Neural and Evolutionary Computing

Catalog footprint

What is connected

29 works

9 topics

4 close collaborators

preprint, 2026, arXiv

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation. Source code is available at https://github.com/IST-DASLab/GridGames.
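The per-group grid choice described above can be sketched in a few lines. This is a hedged illustration only: `FP4_GRID` lists the standard E2M1 magnitudes, while `ALT_GRID`, the absmax scaling rule, and the squared-error selection criterion are assumptions made for the example and are not the grids or rules from the paper.

```python
# Hypothetical sketch: pick the better of several 4-bit grids for one group.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]     # E2M1 magnitudes
ALT_GRID = [0.0, 0.25, 0.75, 1.25, 2.0, 3.0, 4.5, 6.0]  # made-up second grid

def quantize_group(values, grid):
    """Scale the group so its absmax maps to the grid maximum, then round
    each value to the nearest signed grid point. Returns (dequantized, error)."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values), 0.0
    scale = absmax / grid[-1]
    out = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    err = sum((a - b) ** 2 for a, b in zip(values, out))
    return out, err

def quantize_best_of(values, grids):
    """Try every grid and keep the one with the smallest squared error.
    The returned index is what a selector bit in the scale would encode."""
    results = [quantize_group(values, g) for g in grids]
    best = min(range(len(grids)), key=lambda i: results[i][1])
    return best, results[best][0], results[best][1]
```

By construction, the selected grid never has a larger group error than any single fixed grid from the same set, which is the intuition behind letting each group choose.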

preprint, 2026, arXiv

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to a lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
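To make the role of the diagonal matrix concrete, here is a minimal sketch of the weight update $B \, \mathrm{diag}(p) \, A$ truncated to a chosen sub-rank. The function name `lora_update`, the truncation convention (keep the first `rank` components), and the example values of `p` are assumptions for illustration; the paper's actual construction of $P$ is not reproduced here.

```python
def lora_update(B, p, A, rank=None):
    """Compute B[:, :r] @ diag(p[:r]) @ A[:r, :] with plain Python lists.

    B is d_out x r_max, A is r_max x d_in, p holds the diagonal of P.
    Setting p to all ones gives the ordinary LoRA product B @ A.
    """
    r = len(p) if rank is None else rank
    rows, cols = len(B), len(A[0])
    return [
        [sum(B[i][k] * p[k] * A[k][j] for k in range(r)) for j in range(cols)]
        for i in range(rows)
    ]
```

With `p = [1, 1]` the result equals the plain product of the two adapters, and passing `rank=1` evaluates only the first sub-rank, which mirrors selecting a smaller adapter at inference time.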

preprint, 2026, arXiv

Model Compression with Exact Budget Constraints via Riemannian Manifolds

Assigning one of K options to each of N groups under a total cost budget is a recurring problem in efficient AI, including mixed-precision quantization, non-uniform pruning, and expert selection. The objective, typically model loss, depends jointly on all assignments and does not decompose across groups, preventing combinatorial solvers from directly optimizing the true objective and forcing reliance on proxy formulations. Methods such as evolutionary search evaluate the actual loss but lack gradient information, while penalty-based approaches enforce the budget only approximately and often require extensive hyperparameter tuning. We present a new approach by showing that, under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space with unusually simple geometry. The normal vector admits a closed-form expression, shifting logits along the cost vector changes expected cost monotonically, and vector transport reduces to a single inner product. Building on these properties, we propose Riemannian Constrained Optimization (RCO), which augments a standard Adam step with tangent projection, binary-search retraction, and momentum transport. Combined with Gumbel straight-through estimation and budget-constrained dynamic programming for discrete feasibility, RCO enables first-order optimization of the actual loss under exact budget enforcement without introducing constraint-specific hyperparameters. Across both synthetic benchmarks and realistic LLM compression settings, RCO matches or exceeds state-of-the-art methods while often requiring substantially less wall-clock time. Source code is available at https://github.com/IST-DASLab/RCO.
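The monotonicity property mentioned above (shifting logits along the cost vector changes the expected cost monotonically) is what makes a bisection-based retraction possible. The sketch below illustrates only that step, under stated assumptions: the function names, the search bracket `[-50, 50]`, and the iteration count are hypothetical, and the tangent projection, momentum transport, and Gumbel straight-through parts of RCO are not shown.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def expected_cost(logits, costs, t=0.0):
    """Expected total cost when every logit row is shifted by t * cost row."""
    total = 0.0
    for z_row, c_row in zip(logits, costs):
        probs = softmax([z + t * c for z, c in zip(z_row, c_row)])
        total += sum(p * c for p, c in zip(probs, c_row))
    return total

def retract_to_budget(logits, costs, budget, lo=-50.0, hi=50.0, iters=60):
    """Bisection on the shift t. The derivative of the expected cost with
    respect to t is a sum of per-group cost variances, hence non-negative,
    so the expected cost is monotone in t."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_cost(logits, costs, mid) > budget:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return [[z + t * c for z, c in zip(z_row, c_row)]
            for z_row, c_row in zip(logits, costs)]
```

The budget must lie between the smallest and largest achievable total cost for the bracket to contain a solution.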

preprint, 2026, arXiv

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.

preprint, 2026, arXiv

Statistically-Lossless Quantization of Large Language Models

Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR >= 0.99 indicates 99% agreement). Third, we prove a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at https://github.com/IST-DASLab/SLQ.
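For two discrete distributions, the maximum probability that coupled samples agree is a standard quantity: it equals the sum of elementwise minima, or one minus the total variation distance. The sketch below uses that identity for a single position and, as an assumption, averages it over positions to get an EAR-style number; the function names and the averaging scheme are illustrative and may differ from the paper's exact definition.

```python
def acceptance_rate(p, q):
    """Agreement probability under a maximal coupling of p and q:
    sum_i min(p_i, q_i) = 1 - TV(p, q)."""
    return sum(min(a, b) for a, b in zip(p, q))

def expected_acceptance_rate(pairs):
    """Average the per-position agreement over (original, quantized)
    next-token distribution pairs. Averaging uniformly is an assumption."""
    rates = [acceptance_rate(p, q) for p, q in pairs]
    return sum(rates) / len(rates)
```

For example, `p = [0.5, 0.3, 0.2]` and `q = [0.4, 0.4, 0.2]` agree with probability 0.4 + 0.3 + 0.2 = 0.9 under the best coupling.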

preprint, 2023, arXiv

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [LeCun, Denker, and Solla, 1990] extended to also cover weight quantization at the scale of modern DNNs. From the practical perspective, our experimental results show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods, and that it can enable the accurate compound application of both pruning and quantization in a post-training setting.

preprint, 2022, arXiv

Asynchronous Decentralized SGD with Quantized and Local Updates

Decentralized optimization is emerging as a viable alternative for scalable distributed machine learning, but also introduces new challenges in terms of synchronization costs. To this end, several communication-reduction techniques, such as non-blocking communication, quantization, and local steps, have been explored in the decentralized setting. Due to the complexity of analyzing optimization in such a relaxed setting, this line of work often assumes global communication rounds, which require additional synchronization. In this paper, we consider decentralized optimization in the simpler, but harder to analyze, asynchronous gossip model, in which communication occurs in discrete, randomly chosen pairings among nodes. Perhaps surprisingly, we show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous. Our analysis is based on a new connection with multi-dimensional load-balancing processes. We implement this algorithm and deploy it in a super-computing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully-tuned large-batch SGD for certain tasks.

preprint, 2022, arXiv

CGX: Adaptive System Support for Communication-Efficient Deep Learning

The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order of magnitude price difference between "cloud-grade" servers with such support, relative to their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training and larger-scale multi-node training. CGX is based on two technical advances. At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes or significant amounts of training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.

preprint, 2022, arXiv

Distributionally Linearizable Data Structures

Relaxed concurrent data structures have become increasingly popular, due to their scalability in graph processing and machine learning applications. Despite considerable interest, there exist families of natural, high performing randomized relaxed concurrent data structures, such as the popular MultiQueue pattern for implementing relaxed priority queue data structures, for which no guarantees are known in the concurrent setting. Our main contribution is in showing for the first time that, under a set of analytic assumptions, a family of relaxed concurrent data structures, including variants of MultiQueues, but also a new approximate counting algorithm we call the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions. We formalize these guarantees via a new correctness condition called distributional linearizability, tailored to concurrent implementations with randomized relaxations. Our result is based on a new analysis of an asynchronous variant of the classic power-of-two-choices load balancing algorithm, in which placement choices can be based on inconsistent, outdated information (this result may be of independent interest). We validate our results empirically, showing that the MultiCounter algorithm can implement scalable relaxed timestamps, which in turn can improve the performance of the classic TL2 transactional algorithm by up to 3 times, for some settings of parameters.
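A sequential sketch can convey the idea of an approximate counter built on two-choice load balancing. Everything below is an assumption made for illustration: the class layout, the rule "increment the smaller of two randomly sampled counters", and the rule "read one random counter and scale by the number of counters". It is single-threaded, so it does not model the stale reads that the concurrent analysis has to handle.

```python
import random

class MultiCounter:
    """Approximate counter spread over m sub-counters (sequential sketch)."""

    def __init__(self, m, seed=None):
        self.counters = [0] * m
        self.rng = random.Random(seed)

    def increment(self):
        # Two-choice rule: sample two sub-counters, bump the smaller one.
        i = self.rng.randrange(len(self.counters))
        j = self.rng.randrange(len(self.counters))
        target = i if self.counters[i] <= self.counters[j] else j
        self.counters[target] += 1

    def read(self):
        # Estimate the total from a single random sub-counter.
        i = self.rng.randrange(len(self.counters))
        return self.counters[i] * len(self.counters)
```

Because the two-choice rule keeps the sub-counters close to each other, the estimate returned by `read()` stays near the true count; in a concurrent version each thread would touch only one or two sub-counters per operation, which is where the scalability comes from.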

preprint, 2022, arXiv

How Well Do Sparse Imagenet Models Transfer?

Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets. Generally, more accurate models on the "upstream" dataset tend to provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset, which have been pruned - that is, compressed by sparsifying their connections. We consider transfer using unstructured pruned models obtained by applying several state-of-the-art pruning methods, including magnitude-based, second-order, re-growth, lottery-ticket, and regularization approaches, in the context of twelve standard transfer tasks. In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups. At the same time, we observe and analyze significant differences in the behaviour of different pruning methods.

preprint, 2022, arXiv

Lower Bounds for Shared-Memory Leader Election under Bounded Write Contention

This paper gives tight logarithmic lower bounds on the solo step complexity of leader election in an asynchronous shared-memory model with single-writer multi-reader (SWMR) registers, for randomized obstruction-free algorithms. The approach extends to lower bounds for randomized obstruction-free algorithms using multi-writer registers under bounded write concurrency, showing a trade-off between the solo step complexity of a leader election algorithm, and the worst-case contention incurred by a processor in an execution.

preprint, 2022, arXiv

Robust Comparison in Population Protocols

There has recently been a surge of interest in the computational and complexity properties of the population model, which assumes $n$ anonymous, computationally-bounded nodes, interacting at random, and attempting to jointly compute global predicates. Significant work has gone towards investigating majority and consensus dynamics in this model: assuming that each node is initially in one of two states $X$ or $Y$, determine which state had higher initial count. In this paper, we consider a natural generalization of majority/consensus, which we call comparison. We are given two baseline states, $X_0$ and $Y_0$, present in any initial configuration in fixed, possibly small counts. Importantly, one of these states has higher count than the other: we will assume $|X_0| \ge C |Y_0|$ for some constant $C$. The challenge is to design a protocol which can quickly and reliably decide which of the baseline states $X_0$ and $Y_0$ has higher initial count. We propose a simple algorithm solving comparison: the baseline algorithm uses $O(\log n)$ states per node, and converges in $O(\log n)$ (parallel) time, with high probability, to a state where the whole population votes on opinions $X$ or $Y$ at rates proportional to the initial $|X_0|$ vs. $|Y_0|$ concentrations. We then describe how such output can then be used to solve comparison. The algorithm is self-stabilizing, in the sense that it converges to the correct decision even if the relative counts of baseline states $X_0$ and $Y_0$ change dynamically during the execution, and leak-robust, in the sense that it can withstand spurious faulty reactions. Our analysis relies on a new martingale concentration result which relates the evolution of a population protocol to its expected (steady-state) analysis, which should be broadly applicable in the context of population protocols and opinion dynamics.

preprint, 2022, arXiv

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

Powered by the simplicity of lock-free asynchrony, Hogwild! is a go-to approach to parallelize SGD over a shared-memory setting. Despite its popularity and concomitant extensions, such as PASSM+ wherein concurrent processes update a shared model with partitioned gradients, scaling it to decentralized workers has surprisingly been relatively unexplored. To our knowledge, there is no convergence theory of such methods, nor systematic numerical comparisons evaluating speed-up. In this paper, we propose an algorithm incorporating a decentralized distributed-memory computing architecture with each node itself running multiprocessing parallel shared-memory SGD. Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models. We prove that our method guarantees ergodic convergence rates for non-convex objectives. On the practical side, we show that the proposed method exhibits improved throughput and competitive accuracy for standard image classification benchmarks on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our code is available at https://github.com/bapi/LPP-SGD.

preprint, 2022, arXiv

SPDY: Accurate Pruning with Speedup Guarantees

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular. At the same time, there is rapidly-growing computational support for efficiently executing the unstructured-sparse models obtained via pruning. Yet, most existing pruning methods minimize just the number of remaining weights, i.e. the size of the model, rather than optimizing for inference time. We address this gap by introducing SPDY, a new compression method which automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system, while minimizing accuracy loss. SPDY is composed of two new techniques: the first is an efficient dynamic programming algorithm for solving the speedup-constrained layer-wise compression problem assuming a set of given layer-wise sensitivity scores; the second is a local search procedure for determining accurate layer-wise sensitivity scores. Experiments across popular vision and language models show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios, and is compatible with most existing pruning approaches. We also extend our approach to the recently-proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern.
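The speedup-constrained layer-wise problem can be illustrated as a multiple-choice knapsack solved by dynamic programming. The sketch below is not SPDY's algorithm: it assumes integer-discretized per-layer runtimes, an additive error model over layers, and hypothetical names (`choose_levels`, `options`, `budget`).

```python
def choose_levels(options, budget):
    """Pick one (time_cost, error) option per layer so that the total time
    is at most `budget` and the summed error is minimal.

    options[l] is a list of (int time_cost, float error) pairs for layer l.
    Returns (total_error, chosen_indices), or None if the budget is infeasible.
    """
    INF = float("inf")
    dp = [0.0] * (budget + 1)  # dp[t]: best error so far with time <= t
    back = []                  # back[l][t]: option chosen for layer l at time t
    for layer in options:
        new = [INF] * (budget + 1)
        pick = [-1] * (budget + 1)
        for t in range(budget + 1):
            for k, (cost, err) in enumerate(layer):
                if cost <= t and dp[t - cost] + err < new[t]:
                    new[t] = dp[t - cost] + err
                    pick[t] = k
        dp = new
        back.append(pick)
    if dp[budget] == INF:
        return None
    # Walk the back-pointers from the last layer to the first.
    choice, t = [], budget
    for layer, pick in zip(reversed(options), reversed(back)):
        k = pick[t]
        choice.append(k)
        t -= layer[k][0]
    choice.reverse()
    return dp[budget], choice
```

The runtime is proportional to the number of layers times the budget times the options per layer, so the granularity of the time discretization controls the cost of the search.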

preprint, 2021, arXiv

Relaxed Scheduling for Scalable Belief Propagation

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence. In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.

preprint, 2021, arXiv

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.

preprint, 2020, arXiv

Analysis and Evaluation of Non-Blocking Interpolation Search Trees

We start by summarizing the recently proposed implementation of the first non-blocking concurrent interpolation search tree (C-IST) data structure. We then analyze the individual operations of the C-IST, and show that they are correct and linearizable. We furthermore show that lookup (and several other non-destructive operations) are wait-free, and that the insert and delete operations are lock-free. We continue by showing that the C-IST has the following properties. For arbitrary key distributions, this data structure ensures worst-case $O(\log n + p)$ amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected $O(\log \log n + p)$ time, and insertion and deletion run in expected amortized $O(\log \log n + p)$ time, where $p$ is a bound on the number of threads. Finally, we present an extended experimental evaluation of the non-blocking IST performance.

preprint, 2020, arXiv

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As most popular contemporary deep neural networks lead to nonsmooth and nonconvex objectives, there is now a pressing need for such convergence guarantees. In this paper, we analyze for the first time the convergence of stochastic asynchronous optimization for this general class of objectives. In particular, we focus on stochastic subgradient methods allowing for block variable partitioning, where the shared-memory-based model is asynchronously updated by concurrent processes. To this end, we first introduce a probabilistic model which captures key features of real asynchronous scheduling between concurrent processes; under this model, we establish convergence with probability one to an invariant set for stochastic subgradient methods with momentum. From the practical perspective, one issue with the family of methods we consider is that it is not efficiently supported by machine learning frameworks, as they mostly focus on distributed data-parallel strategies. To address this, we propose a new implementation strategy for shared-memory based training of deep neural networks, whereby concurrent parameter servers are utilized to train a partitioned but shared model in single- and multi-GPU settings. Based on this implementation, we achieve on average 1.2x speed-up in comparison to state-of-the-art training methods for popular image classification tasks without compromising accuracy.

preprint, 2020, arXiv

Dynamic Averaging Load Balancing on Cycles

We consider the following dynamic load-balancing process: given an underlying graph $G$ with $n$ nodes, in each step $t\geq 0$, one unit of load is created, and placed at a randomly chosen graph node. In the same step, the chosen node picks a random neighbor, and the two nodes balance their loads by averaging them. We are interested in the expected gap between the minimum and maximum loads at nodes as the process progresses, and its dependence on $n$ and on the graph structure. Similar variants of the above graphical balanced allocation process have been studied by Peres, Talwar, and Wieder, and by Sauerwald and Sun for regular graphs. These authors left as open the question of characterizing the gap in the case of cycle graphs in the dynamic case, where weights are created during the algorithm's execution. For this case, the only known upper bound is of $\mathcal{O}(n \log n)$, following from a majorization argument due to Peres, Talwar, and Wieder, which analyzes a related graphical allocation process. In this paper, we provide an upper bound of $\mathcal{O}(\sqrt{n} \log n)$ on the expected gap of the above process for cycles of length $n$. We introduce a new potential analysis technique, which enables us to bound the difference in load between $k$-hop neighbors on the cycle, for any $k \leq n / 2$. We complement this with a "gap covering" argument, which bounds the maximum value of the gap by bounding its value across all possible subsets of a certain structure, and recursively bounding the gaps within each subset. We provide analytical and experimental evidence that our upper bound on the gap is tight up to a logarithmic factor.
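The process described above is simple enough to simulate directly. The sketch below follows the stated steps on a cycle (add one unit of load at a random node, then average with a random neighbor); the function name and the use of a seeded `random.Random` are choices made for this example.

```python
import random

def simulate_cycle_averaging(n, steps, seed=None):
    """Run the dynamic averaging process on a cycle of n nodes."""
    rng = random.Random(seed)
    load = [0.0] * n
    for _ in range(steps):
        i = rng.randrange(n)                   # node receiving the new unit
        load[i] += 1.0
        j = (i + rng.choice((-1, 1))) % n      # random neighbor on the cycle
        avg = 0.5 * (load[i] + load[j])
        load[i] = load[j] = avg                # the two nodes average their loads
    return load
```

Calling `max(load) - min(load)` on the result gives the gap whose expectation the abstract bounds; averaging preserves the total load, so the sum of the returned list equals the number of steps up to floating-point rounding.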

preprint, 2020, arXiv

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the sequential variant of such algorithms usually specifies a fixed (but sometimes random) order in which the tasks should be performed, a standard approach to parallelizing such algorithms is to relax this constraint to allow for out-of-order parallel execution. This is the case for parallel implementations of Dijkstra's single-source shortest-paths algorithm (SSSP), and for parallel Delaunay mesh triangulation. While many software frameworks parallelize incremental computation in this way, it is still not well understood whether this relaxed ordering approach can still provide any complexity guarantees. In this paper, we address this problem, and analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers. We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n) \, \mathrm{poly}(k))$, where $n$ is the number of tasks to be executed. For SSSP, we show that the additional work is $O(\mathrm{poly}(k) \, d_{\max} / w_{\min})$, where $d_{\max}$ is the maximum distance between two nodes, and $w_{\min}$ is the minimum such distance. In practical settings where $n \gg k$, this suggests that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler. On the negative side, we provide lower bounds showing that certain algorithms will inherently incur a non-trivial amount of wasted work due to scheduler relaxation, even for relatively benign relaxed schedulers.
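A small sequential model helps show what "wasted work" means for SSSP under a relaxed scheduler. The sketch below emulates a $k$-relaxed priority queue by returning a uniformly random element among the $k$ smallest; that emulation, the function name, and the use of the pop count as a work measure are assumptions for illustration, not the schedulers analyzed in the paper.

```python
import heapq
import random

def relaxed_sssp(graph, source, k, seed=None):
    """Shortest paths with a k-relaxed queue (sequential emulation).

    graph maps a node to a list of (neighbor, non-negative weight) pairs.
    Returns (distances, number_of_pops); pops beyond the number of
    reachable nodes correspond to stale or repeated (wasted) tasks.
    """
    rng = random.Random(seed)
    dist = {source: 0}
    heap = [(0, source)]
    pops = 0
    while heap:
        # Remove the k smallest entries, run one at random, reinsert the rest.
        batch = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
        d, u = batch.pop(rng.randrange(len(batch)))
        for item in batch:
            heapq.heappush(heap, item)
        pops += 1
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter path to u was already found
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist, pops
```

With `k = 1` this reduces to the usual lazy-deletion Dijkstra; for larger `k` the distances are still correct because every improvement is re-queued, but nodes may be processed more than once.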

preprint, 2020, arXiv

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. One key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained employing variants of stochastic gradient descent (SGD) based optimization. In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models. The proposed framework de-clutters the implementation-specific convergence analysis and provides an abstraction to derive convergence bounds. We utilize the framework to analyze a sparsification scheme for distributed SGD methods in an asynchronous setting for convex and non-convex objectives. We implement the distributed SGD variant to train deep CNN models in an asynchronous shared-memory setting. Empirical results show that error-feedback may not necessarily help in improving the convergence of sparsified asynchronous distributed SGD, which corroborates an insight suggested by our convergence analysis.

preprint · 2020 · arXiv

Fast General Distributed Transactions with Opacity using Global Time

Transactions can simplify distributed applications by hiding data distribution, concurrency, and failures from the application developer. Ideally the developer would see the abstraction of a single large machine that runs transactions sequentially and never fails. This requires the transactional subsystem to provide opacity (strict serializability for both committed and aborted transactions), as well as transparent fault tolerance with high availability. As even the best abstractions are unlikely to be used if they perform poorly, the system must also provide high performance. Existing distributed transactional designs either weaken this abstraction or are not designed for the best performance within a data center. This paper extends the design of FaRM - which provides strict serializability only for committed transactions - to provide opacity while maintaining FaRM's high throughput, low latency, and high availability within a modern data center. It uses timestamp ordering based on real time with clocks synchronized to within tens of microseconds across a cluster, and a failover protocol to ensure correctness across clock master failures. FaRM with opacity can commit 5.4 million neworder transactions per second when running the TPC-C transaction mix on 90 machines with 3-way replication.

preprint · 2020 · arXiv

On the Sample Complexity of Adversarial Multi-Source PAC Learning

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms. Specifically, we analyze the situation in which a learning system obtains datasets from multiple sources, some of which might be biased or even adversarially perturbed. It is known that in the single-source case, an adversary with the power to corrupt a fixed fraction of the training data can prevent PAC-learnability, that is, even in the limit of infinitely much training data, no learning system can approach the optimal test error. In this work we show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources. Our main results are a generalization bound that provides finite-sample guarantees for this learning setting, as well as corresponding lower bounds. Besides establishing PAC-learnability our results also show that in a cooperative learning setting sharing data with other parties has provable benefits, even if some participants are malicious.

preprint · 2020 · arXiv

Stochastic Gradient Langevin with Delayed Gradients

Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regard to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates. Given the size of many practical problems, parallelizing across several asynchronously running processors is a popular strategy for reducing the end-to-end computation time of stochastic optimization algorithms. In this paper, we are the first to investigate the effect of asynchronous computation, in particular, the evaluation of stochastic Langevin gradients at delayed iterates, on the convergence in measure. For this, we exploit recent results modeling Langevin dynamics as solving a convex optimization problem on the space of measures. We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time. We confirm our theoretical results with numerical experiments on some practical problems.
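As a rough illustration of a Langevin update that evaluates the gradient at a delayed iterate, the following Python sketch samples from a one-dimensional standard Gaussian. It is a toy under stated assumptions, not the paper's algorithm or experiments: the gradient is exact rather than stochastic, the delay is a fixed number of steps, and the name `delayed_sgld` and all parameter values are hypothetical.

```python
import math
import random
from collections import deque


def delayed_sgld(grad, x0, step, delay, n_steps, seed=0):
    """Langevin iteration whose gradient is evaluated `delay` steps in the past.

    Update: x_{t+1} = x_t - step * grad(x_{t-delay}) + sqrt(2 * step) * noise.
    Iterates before time 0 are treated as x0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # history[0] is the iterate from `delay` steps ago, history[-1] the current one
    history = deque([x0] * (delay + 1), maxlen=delay + 1)
    x = x0
    samples = []
    for _ in range(n_steps):
        stale_x = history[0]
        x = x - step * grad(stale_x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        history.append(x)
        samples.append(x)
    return samples


# Target: standard normal, whose potential x^2 / 2 has gradient x.
samples = delayed_sgld(lambda x: x, x0=0.0, step=0.01, delay=5, n_steps=50_000)
kept = samples[1_000:]  # drop a short burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"mean={mean:.3f} var={var:.3f}")
```

Comparing the printed moments for `delay=0` and for a small positive delay gives a quick feel for how much the stale gradient perturbs the sampled distribution in this toy case.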

preprint · 2020 · arXiv

The Splay-List: A Distribution-Adaptive Concurrent Skip-List

The design and implementation of efficient concurrent data structures have seen significant attention. However, most of this work has focused on concurrent data structures providing good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Efficient distribution-adaptive data structures are known in the sequential case, e.g., splay trees; however, they are often hard to translate efficiently to the concurrent case. In this paper, we investigate distribution-adaptive concurrent data structures and propose a new design called the splay-list. At a high level, the splay-list is similar to a standard skip-list, with the key distinction that the height of each element adapts dynamically to its access rate: popular elements "move up," whereas rarely-accessed elements decrease in height. We show that the splay-list provides order-optimal amortized complexity bounds for a subset of operations while being amenable to efficient concurrent implementation. Experimental results show that the splay-list can leverage distribution-adaptivity to improve on the performance of classic concurrent designs, and can outperform the only previously-known distribution-adaptive design in certain settings.

preprint · 2020 · arXiv

Why Extension-Based Proofs Fail

We introduce extension-based proofs, a class of impossibility proofs that includes valency arguments. They are modelled as an interaction between a prover and a protocol. Using proofs based on combinatorial topology, it has been shown that it is impossible to deterministically solve k-set agreement among n > k > 1 processes in a wait-free manner in certain asynchronous models. However, it was unknown whether proofs based on simpler techniques were possible. We show that this impossibility result cannot be obtained for one of these models by an extension-based proof and, hence, extension-based proofs are limited in power.

preprint · 2015 · arXiv

How to Elect a Leader Faster than a Tournament

The problem of electing a leader from among $n$ contenders is one of the fundamental questions in distributed computing. In its simplest formulation, the task is as follows: given $n$ processors, all participants must eventually return a win or lose indication, such that a single contender may win. Despite a considerable amount of work on leader election, the following question is still open: can we elect a leader in an asynchronous fault-prone system faster than just running a $Θ(\log n)$-time tournament, against a strong adaptive adversary? In this paper, we answer this question in the affirmative, improving on a decades-old upper bound. We introduce two new algorithmic ideas to reduce the time complexity of electing a leader to $O(\log^* n)$, using $O(n^2)$ point-to-point messages. A non-trivial application of our algorithm is a new upper bound for the tight renaming problem, assigning $n$ items to the $n$ participants in expected $O(\log^2 n)$ time and $O(n^2)$ messages. We complement our results with a lower bound of $Ω(n^2)$ messages for solving these two problems, closing the question of their message complexity.

preprint · 2014 · arXiv

The LevelArray: A Fast, Practical Long-Lived Renaming Algorithm

The long-lived renaming problem appears in shared-memory systems where a set of threads need to register and deregister frequently from the computation, while concurrent operations scan the set of currently registered threads. Instances of this problem show up in concurrent implementations of transactional memory, flat combining, thread barriers, and memory reclamation schemes for lock-free data structures. In this paper, we analyze a randomized solution for long-lived renaming. The algorithmic technique we consider, called the LevelArray, has previously been used for hashing and one-shot (single-use) renaming. Our main contribution is to prove that, in long-lived executions, where processes may register and deregister polynomially many times, the technique guarantees constant steps on average and O(log log n) steps with high probability for registering, unit cost for deregistering, and O(n) steps for collect queries, where n is an upper bound on the number of processes that may be active at any point in time. We also show that the algorithm has the surprising property that it is self-healing: under reasonable assumptions on the schedule, operations running while the data structure is in a degraded state implicitly help the data structure re-balance itself. This subtle mechanism obviates the need for expensive periodic rebuilding procedures. Our benchmarks validate this approach, showing that, for typical use parameters, the average number of steps a process takes to register is less than two and the worst-case number of steps is bounded by six, even in executions with billions of operations. We contrast this with other randomized implementations, whose worst-case behavior we show to be unreliable, and with deterministic implementations, whose cost is linear in n.
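The register, deregister and collect operations described above can be illustrated with a small sequential sketch in Python. It is only loosely modeled on the abstract and is not the paper's algorithm. The level sizes (starting at `2 * n` and halving), the single random probe per level, and the linear-scan fallback are all assumptions of this sketch, and a plain assignment stands in for the atomic test-and-set a concurrent version would need.

```python
import random


class LevelArraySketch:
    """Sequential toy: slots grouped into levels of geometrically shrinking size."""

    def __init__(self, n, seed=0):
        self.rng = random.Random(seed)
        self.levels = []
        size = 2 * n  # assumed first-level size, not taken from the paper
        while size >= 1:
            self.levels.append([None] * size)
            size //= 2

    def register(self, thread_id):
        """Probe one random slot per level; return ((level, slot), steps)."""
        steps = 0
        for li, level in enumerate(self.levels):
            steps += 1
            slot = self.rng.randrange(len(level))
            if level[slot] is None:  # stands in for an atomic test-and-set
                level[slot] = thread_id
                return (li, slot), steps
        # Fallback assumed for this sketch only: scan for any free slot.
        for li, level in enumerate(self.levels):
            for slot, owner in enumerate(level):
                steps += 1
                if owner is None:
                    level[slot] = thread_id
                    return (li, slot), steps
        raise RuntimeError("no free slot: too many registered threads")

    def deregister(self, name):
        """Release a slot with a single write."""
        li, slot = name
        self.levels[li][slot] = None

    def collect(self):
        """Scan every slot and return the currently registered thread ids."""
        return [o for level in self.levels for o in level if o is not None]


arr = LevelArraySketch(n=64)
names = {}
total_steps = 0
for tid in range(64):
    names[tid], steps = arr.register(tid)
    total_steps += steps
print("average register steps:", total_steps / 64)
```

The total number of slots is a constant multiple of `n`, so `collect` scans O(n) slots, and `deregister` is a single write, matching the unit cost stated in the abstract.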

preprint · 2013 · arXiv

Are Lock-Free Concurrent Algorithms Practically Wait-Free?

Lock-free concurrent algorithms guarantee that some concurrent operation will always make progress in a finite number of steps. Yet programmers prefer to treat concurrent code as if it were wait-free, guaranteeing that all operations always make progress. Unfortunately, designing wait-free algorithms is generally a very complex task, and the resulting algorithms are not always efficient. While obtaining efficient wait-free algorithms has been a long-time goal for the theory community, most non-blocking commercial code is only lock-free. This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice, they will get wait-free progress. Our main contribution is a new way of analyzing a general class of lock-free algorithms under a stochastic scheduler. Our analysis relates the individual performance of processes with the global performance of the system using Markov chain lifting between a complex per-process chain and a simpler system progress chain. We show that lock-free algorithms are not only wait-free with probability 1, but that in fact a general subset of lock-free algorithms can be closely bounded in terms of the average number of steps required until an operation completes. To the best of our knowledge, this is the first attempt to analyze progress conditions, typically stated in relation to a worst case adversary, in a stochastic model capturing their expected asymptotic behavior.
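The claim that lock-free code tends to behave as if it were wait-free under a stochastic scheduler can be explored with a small simulation. The Python sketch below is an illustration only, not the paper's model or analysis. It assumes a uniform random scheduler and a hypothetical read-then-compare-and-swap loop on a shared counter, and the name `simulate_cas_counter` is invented for this example.

```python
import random


def simulate_cas_counter(n_procs, n_steps, seed=0):
    """Simulate a read/CAS increment loop under a uniform random scheduler.

    At each step one process is picked uniformly at random. It either reads
    the shared counter or attempts a compare-and-swap against the value it
    read earlier. Returns (counter, completed), where completed[p] is the
    number of increments process p finished.
    """
    rng = random.Random(seed)
    counter = 0
    local = [None] * n_procs  # value last read by each process, None if it must read
    completed = [0] * n_procs
    for _ in range(n_steps):
        p = rng.randrange(n_procs)
        if local[p] is None:
            local[p] = counter  # read step
        elif local[p] == counter:
            counter += 1  # CAS succeeds: the operation completes
            completed[p] += 1
            local[p] = None
        else:
            local[p] = None  # CAS fails: another process got there first, retry
    return counter, completed


counter, completed = simulate_cas_counter(n_procs=8, n_steps=20_000)
print("total increments:", counter)
print("fewest increments by one process:", min(completed))
```

In an adversarial schedule a single process could fail its CAS forever. In this randomized toy every process keeps completing operations, which is the kind of behavior the abstract describes.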

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Source provenance

Where this author record came from

arxiv · confidence 95%

external id: arxiv:2605.02404:author:3:dan-alistarh

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.07850:author:3:dan-alistarh

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.12327:author:6:dan-alistarh

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.00649:author:2:dan-alistarh

Imported May 20, 2026 · Synced May 20, 2026

6 works

Giorgi Nadiradze

Researcher

Giorgi Nadiradze contributes to research discovery and scholarly infrastructure.

Open to collaborate

4 works

Bapi Chatterjee

Researcher

Bapi Chatterjee contributes to research discovery and scholarly infrastructure.

Open to collaborate

4 works

Vyacheslav Kungurtsev

Researcher

Vyacheslav Kungurtsev contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Elias Frantar

Researcher

Elias Frantar contributes to research discovery and scholarly infrastructure.

Open to collaborate

29 published item(s)

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

Model Compression with Exact Budget Constraints via Riemannian Manifolds

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Statistically-Lossless Quantization of Large Language Models

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

Asynchronous Decentralized SGD with Quantized and Local Updates

CGX: Adaptive System Support for Communication-Efficient Deep Learning

Distributionally Linearizable Data Structures

How Well Do Sparse Imagenet Models Transfer?

Lower Bounds for Shared-Memory Leader Election under Bounded Write Contention

Robust Comparison in Population Protocols

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

SPDY: Accurate Pruning with Speedup Guarantees

Relaxed Scheduling for Scalable Belief Propagation

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Analysis and Evaluation of Non-Blocking Interpolation Search Trees

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Dynamic Averaging Load Balancing on Cycles

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

Fast General Distributed Transactions with Opacity using Global Time

On the Sample Complexity of Adversarial Multi-Source PAC Learning

Stochastic Gradient Langevin with Delayed Gradients

The Splay-List: A Distribution-Adaptive Concurrent Skip-List

Why Extension-Based Proofs Fail

How to Elect a Leader Faster than a Tournament

The LevelArray: A Fast, Practical Long-Lived Renaming Algorithm

Are Lock-Free Concurrent Algorithms Practically Wait-Free?