Source author record

Yury Demidovich

Yury Demidovich appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; math.CO; math.OC

Catalog footprint

What is connected

3 works

5 topics

4 close collaborators

Preprint, 2026, arXiv

Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.
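The decoupling of direction and scale described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it works on plain vectors, not Muon's matrix-valued updates, and the helper names `normalized_direction` and `normalized_step`, as well as the two geometries shown (Euclidean and max-norm), are assumptions chosen for illustration. The point is only that the step length equals the chosen scale regardless of how large the momentum vector is, which is why the scale must be tuned or adapted separately.

```python
import math


def normalized_direction(m, geometry="l2"):
    """Unit-norm direction derived from a momentum vector m (hypothetical helper).

    "l2":   m / ||m||_2, a unit vector in the Euclidean norm.
    "linf": sign(m), a unit vector in the max norm (for nonzero m).
    """
    if geometry == "l2":
        norm = math.sqrt(sum(v * v for v in m))
        return [v / norm for v in m] if norm > 0 else [0.0] * len(m)
    if geometry == "linf":
        return [float((v > 0) - (v < 0)) for v in m]
    raise ValueError(f"unknown geometry: {geometry}")


def normalized_step(x, m, scale, geometry="l2"):
    """Move x by exactly `scale` (in the chosen norm) along the normalized direction."""
    d = normalized_direction(m, geometry)
    return [xi - scale * di for xi, di in zip(x, d)]


# The step is identical for a momentum vector and a 100x larger copy of it:
# only the direction of m matters, the length of the move is set by `scale`.
small = normalized_step([0.0, 0.0], [3.0, 4.0], scale=0.5)
large = normalized_step([0.0, 0.0], [300.0, 400.0], scale=0.5)
```

Because the magnitude of the gradient never enters the update, a fixed `scale` that is too large or too small cannot be corrected by the gradient itself; the adaptive rules summarized above address this by choosing the scale during training.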

Preprint, 2026, arXiv

MAST: Model-Agnostic Sparsified Training

We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish the insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications, covering several important techniques such as Dropout and Sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
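One way to picture a formulation that combines a pre-trained model with random sketch operators is sketched below. The abstract does not state the objective, so everything here is an assumption made for illustration: the loss is evaluated at `v + S (x - v)`, where `v` stands for the pre-trained weights and `S` is a random diagonal sketch, and the names `bernoulli_sketch`, `sketched_gradient` and `sgd_step` are hypothetical. Under that assumption the chain rule applies the same sketch to the gradient, so both the model and the gradient are sparsified.

```python
import random


def bernoulli_sketch(dim, keep_prob, rng):
    """Diagonal of a random sketch S (assumed form, for illustration).

    Each entry is 1/keep_prob with probability keep_prob and 0 otherwise,
    so the sketch equals the identity in expectation.
    """
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in range(dim)]


def sketched_gradient(grad_f, x, v, s):
    """Gradient of x -> f(v + S (x - v)) for a diagonal sketch S with diagonal s.

    By the chain rule this is S * grad_f(v + S (x - v)); coordinates dropped
    by the sketch receive a zero gradient.
    """
    y = [vi + si * (xi - vi) for xi, vi, si in zip(x, v, s)]
    g = grad_f(y)
    return [si * gi for si, gi in zip(s, g)]


def sgd_step(grad_f, x, v, s, lr):
    """One plain SGD step on the sketched objective."""
    g = sketched_gradient(grad_f, x, v, s)
    return [xi - lr * gi for xi, gi in zip(x, g)]


# Toy quadratic f(y) = 0.5 * ||y - 1||^2, whose gradient is y - 1.
grad_quadratic = lambda y: [yi - 1.0 for yi in y]

rng = random.Random(0)
x, v = [2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]
for _ in range(5):
    s = bernoulli_sketch(len(x), keep_prob=0.5, rng=rng)
    x = sgd_step(grad_quadratic, x, v, s, lr=0.1)
```

With a Bernoulli sketch of this kind, each step resembles training with Dropout-style masking around the pre-trained point `v`, which is the kind of technique the abstract says the framework covers.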

Preprint, 2022, arXiv

Cycle saturation in random graphs

For a fixed graph $F,$ the minimum number of edges in an edge-maximal $F$-free subgraph of $G$ is called the $F$-saturation number. The asymptotics of the $F$-saturation number of the binomial random graph $G(n,p)$ for constant $p\in(0,1)$ is known for complete graphs $F=K_m$ and stars $F=K_{1,m}.$ This paper is devoted to the case when the pattern graph $F$ is a simple cycle $C_m.$ We prove that, for $m\geqslant 5,$ whp $\mathrm{sat}\left(G\left(n,p\right),C_m\right) = n+\Theta\left(\frac{n}{\ln n}\right).$ Also we find $c=c(p)$ such that whp $\frac{3}{2}n(1+o(1))\leqslant\mathrm{sat}\left(G\left(n,p\right),C_4\right)\leqslant cn(1+o(1)).$ In particular, whp $\mathrm{sat}\left(G\left(n,\frac{1}{2}\right),C_4\right)\leqslant\frac{27}{14}n(1+o(1)).$
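The definition in the abstract (an edge-maximal $C_m$-free subgraph of a host graph $G$) can be checked by brute force on small examples. The sketch below is illustrative only and is not taken from the paper; the function names `has_cycle_of_length` and `is_saturated` are hypothetical, vertices are assumed to be the integers 0 to n-1, and the search is exponential, so it is suitable only for tiny graphs.

```python
def has_cycle_of_length(adj, m):
    """Return True if the graph contains a simple cycle on exactly m >= 3 vertices."""
    def extend(start, current, visited, length):
        # `length` counts the vertices on the current path from `start`.
        for nxt in adj[current]:
            if nxt == start and length == m:
                return True
            # Only visit vertices larger than `start`, so each cycle is
            # searched from its smallest vertex.
            if nxt not in visited and nxt > start and length < m:
                if extend(start, nxt, visited | {nxt}, length + 1):
                    return True
        return False

    return any(extend(s, s, {s}, 1) for s in adj)


def is_saturated(n, host_edges, sub_edges, m):
    """Check that the subgraph is C_m-free and that adding any missing host edge creates a C_m."""
    host = {tuple(sorted(e)) for e in host_edges}
    sub = {tuple(sorted(e)) for e in sub_edges}

    def adjacency(edges):
        adj = {v: set() for v in range(n)}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        return adj

    if has_cycle_of_length(adjacency(sub), m):
        return False
    return all(has_cycle_of_length(adjacency(sub | {e}), m) for e in host - sub)


# Host graph: the complete graph K_4. Candidate subgraph: the star centered at vertex 0.
k4 = [(u, v) for u in range(4) for v in range(u + 1, 4)]
star = [(0, 1), (0, 2), (0, 3)]
star_is_c3_saturated = is_saturated(4, k4, star, 3)
star_is_c4_saturated = is_saturated(4, k4, star, 4)
```

In this example the star is triangle-saturated in $K_4$, since any added edge closes a triangle through the center, but it is not $C_4$-saturated, since adding one edge between two leaves creates no 4-cycle. The saturation number is the smallest edge count among all subgraphs that pass this check.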