Source author record

Mher Safaryan

Mher Safaryan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; math.OC; math.CA; Artificial Intelligence; Computation and Language; Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

6 works

6 topics

4 close collaborators


preprint · 2026 · arXiv

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to a lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA simply by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
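
The idea of placing a diagonal matrix $P$ between the two LoRA adapters can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy matrices, and the values chosen for the diagonal of $P$ are all assumptions made for illustration, and the abstract does not say how $P$ is actually constructed. The sketch only shows the algebra: the weight update is $B P A$, an identity $P$ gives the plain LoRA product $BA$, and a lower sub-rank uses the leading columns of $B$ and rows of $A$.

```python
# Illustrative sketch only: names, shapes, and the choice of P are assumptions,
# not taken from the MatryoshkaLoRA code base.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def scaled_lora_update(B, A, p, rank=None):
    """Return B[:, :rank] @ diag(p[:rank]) @ A[:rank, :].

    B is d x r, A is r x k, and p holds the r diagonal entries of P.
    rank=None uses all r components.
    """
    r = len(p) if rank is None else rank
    B_sub = [row[:r] for row in B]
    # Scaling row i of A by p[i] is the same as multiplying by diag(p) on the left.
    PA_sub = [[p[i] * a for a in A[i]] for i in range(r)]
    return matmul(B_sub, PA_sub)

B = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]            # d = 3, r = 2
A = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0]]    # r = 2, k = 4

full = scaled_lora_update(B, A, p=[1.0, 1.0])            # P = I: plain product B A
rank1 = scaled_lora_update(B, A, p=[1.0, 1.0], rank=1)   # leading sub-rank only
```

With this layout, selecting a smaller rank at inference time requires no retraining, only slicing. How well the sliced adapter performs depends on how $P$ is chosen during training, which is the subject of the paper and is not reproduced here.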

preprint · 2022 · arXiv

Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation

Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of Richtarik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression and lazy aggregation, at our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.

preprint · 2022 · arXiv

FedNL: Making Newton-Type Methods Applicable to Federated Learning

Inspired by recent work of Islamov et al. (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors. Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, which is a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression. We prove local convergence rates that are independent of the condition number, the number of training data points, and compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum. Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.
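
As a concrete picture of the contractive compressors mentioned above, here is a minimal Top-$K$ sketch. It is not taken from the FedNL code: the function names and the toy input are hypothetical, and for simplicity it operates on a flat list of numbers (a local Hessian would first be flattened). Top-$K$ keeps the $K$ entries of largest magnitude and zeroes the rest, which satisfies the standard contraction bound $\|C(x) - x\|^2 \le (1 - K/d)\,\|x\|^2$ for an input with $d$ entries.

```python
# Illustrative sketch only: function names and the example input are hypothetical.

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and len(x)")
    # Indices sorted by decreasing magnitude; ties keep their original order.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)

x = [0.5, -3.0, 1.0, 2.0]
c = top_k(x, 2)                                   # only -3.0 and 2.0 survive
err = sq_norm([a - b for a, b in zip(c, x)])      # squared compression error
bound = (1 - 2 / len(x)) * sq_norm(x)             # contraction bound with K = 2, d = 4
```

Only the $K$ surviving values and their positions need to be transmitted, which is where the communication saving comes from.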

preprint · 2022 · arXiv

On Estimates for Maximal Operators Associated with Tangential Regions

The thesis comprises three chapters. Chapter 1 investigates generalizations of the theorem of Fatou for convolution type integral operators with general approximate identities. It introduces $λ(r)$-convergence, which is a generalization of non-tangential convergence in the unit disc. The connections between general approximate identities and optimal convergence regions for such operators are described in different functional spaces. Chapter 2 studies some generalizations of the theorem of Littlewood, which is an important complement to the theorem of Fatou, constructing an analytic function that diverges almost everywhere along a given tangential curve. The same convolution type integral operators are considered with more general kernels than approximate identities. Two kinds of generalizations of the theorem of Littlewood are obtained, possessing the everywhere divergence property. Chapter 3 is devoted to some questions of equivalence of differentiation bases in $\mathbb{R}^n$. The complete equivalence of the basis of rare dyadic rectangles and the basis of complete dyadic rectangles in $\mathbb{R}^2$ is investigated. In addition, quasi-equivalence between two differentiation bases in $\mathbb{R}^n$ is introduced, and the set of functions that such bases differentiate is considered.

preprint · 2021 · arXiv

Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization

Large scale distributed optimization has become the default tool for the training of supervised machine learning models with a large number of parameters and training data. Recent advancements in the field provide several mechanisms for speeding up the training, including compressed communication, variance reduction and acceleration. However, none of these methods is capable of exploiting the inherently rich data-dependent smoothness structure of the local losses beyond standard smoothness constants. In this paper, we argue that when training supervised models, smoothness matrices -- information-rich generalizations of the ubiquitous smoothness constants -- can and should be exploited for further dramatic gains, both in theory and practice. In order to further alleviate the communication burden inherent in distributed optimization, we propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses. To showcase the power of this tool, we describe how our sparsification technique can be adapted to three distributed optimization algorithms -- DCGD, DIANA and ADIANA -- yielding significant savings in terms of communication complexity. The new methods always outperform the baselines, often dramatically so.

preprint · 2019 · arXiv

On Generalizations of Fatou's Theorem in $L^p$ for Convolution Integrals with General Kernels

We prove a Fatou-type theorem on almost everywhere convergence of convolution integrals in the spaces $L^p\,(1<p<\infty)$ for general kernels forming an approximate identity. For a wide class of kernels we show that the obtained convergence regions are optimal in some sense. We also establish weak boundedness of the corresponding maximal operator in $L^p\,(1\le p<\infty)$.
