Researcher profile

Vyacheslav Kungurtsev

Vyacheslav Kungurtsev contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
10topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2026arXiv

A Three-Tier Time-Scale Architecture for Controlling Complex Nonlinear Systems

This letter proposes a three-tier computational architecture for the real-time control of nonlinear complex systems, such as time-dependent PDEs. There is an important class of such problems for which existing single- and two-time-scale approaches are fundamentally insufficient due to lack of a priori system knowledge, computational complexity, model fidelity requirements, and uncertainty. The proposed architecture consists of an offline, meso-scale, and real-time layer of computation, with distinct roles for each layer and specific information flow between them. The result is a practical systems-level paradigm that enables real-time operation of complex nonlinear control problems.

preprint2026arXiv

An efficient penalty decomposition algorithm for minimization over sparse symmetric sets

This paper proposes an improved quasi-Newton penalty decomposition algorithm for the minimization of continuously differentiable functions, possibly nonconvex, over sparse symmetric sets. The method solves a sequence of penalty subproblems approximately via a two-block decomposition scheme: the first subproblem admits a closed-form solution without sparsity constraints, while the second subproblem is handled through an efficient sparse projection over the symmetric feasible set. Under a new assumption on the gradient of the objective function, weaker than global Lipschitz continuity from the origin, we establish that accumulation points of the outer iterates are basic feasible and cardinality-constrained Mordukhovich stationarity points. To ensure robustness and efficiency in finite-precision arithmetic, the algorithm incorporates several practical enhancements, including an enhanced line search strategy based on either backtracking or extrapolation, and four inexpensive diagonal Hessian approximations derived from differences of previous iterates and gradients or from eigenvalue-distribution information. Numerical experiments on a diverse benchmark of $30$ synthetic and data-driven test problems, including machine-learning datasets from the UCI repository and sparse symmetric instances with dimensions ranging from $10$ to $500$, demonstrate that the proposed algorithm is competitive with several state-of-the-art methods in terms of efficiency, robustness, and strong stationarity.

preprint2025arXiv

Towards Real Time Control of Water Engineering with Nonlinear Hyperbolic Partial Differential Equations

This paper examines aspirational requirements for software addressing mixed-integer optimization problems constrained by the nonlinear Shallow Water partial differential equations (PDEs), motivated by applications such as river-flow management in hydropower cascades. Realistic deployment of such software would require the simultaneous treatment of nonlinear and potentially non-smooth PDE dynamics, limited theoretical guarantees on the existence and regularity of control-to-state mappings under varying boundary conditions, and computational performance compatible with operational decision-making. In addition, practical settings motivate consideration of uncertainty arising from forecasts of demand, inflows, and environmental conditions. At present, the theoretical foundations, numerical optimization methods, and large-scale scientific computing tools required to address these challenges in a unified and tractable manner remain the subject of ongoing research across the associated research communities. Rather than proposing a complete solution, this work uses the problem as a case study to identify and organize the mathematical, algorithmic, and computational components that would be necessary for its realization. The resulting framework highlights open challenges and intermediate research directions, and may inform both more circumscribed related problems and the design of future large-scale collaborative efforts aimed at addressing such objectives.

preprint2022arXiv

Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks

Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points - i.e., points where the tangent of the ReLU network estimator changes - between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.

preprint2022arXiv

On the effect of perturbations in first-order optimization methods with inertia and Hessian driven damping

Second-order continuous-time dissipative dynamical systems with viscous and Hessian driven damping have inspired effective first-order algorithms for solving convex optimization problems. While preserving the fast convergence properties of the Nesterov-type acceleration, the Hessian driven damping makes it possible to significantly attenuate the oscillations. To study the stability of these algorithms with respect to perturbations, we analyze the behaviour of the corresponding continuous systems when the gradient computation is subject to exogenous additive errors. We provide a quantitative analysis of the asymptotic behaviour of two types of systems, those with implicit and explicit Hessian driven damping. We consider convex, strongly convex, and non-smooth objective functions defined on a real Hilbert space and show that, depending on the formulation, different integrability conditions on the perturbations are sufficient to maintain the convergence rates of the systems. We highlight the differences between the implicit and explicit Hessian damping, and in particular point out that the assumptions on the objective and perturbations needed in the implicit case are more stringent than in the explicit case.

preprint2022arXiv

Retraction based Direct Search Methods for Derivative Free Riemannian Optimization

Direct search methods represent a robust and reliable class of algorithms for solving black-box optimization problems. In this paper, we explore the application of those strategies to Riemannian optimization, wherein minimization is to be performed with respect to variables restricted to lie on a manifold. More specifically, we consider classic and line search extrapolated variants of direct search, and, by making use of retractions, we devise tailored strategies for the minimization of both smooth and nonsmooth functions. As such we analyze, for the first time in the literature, a class of retraction based algorithms for minimizing nonsmooth objectives on a Riemannian manifold without having access to (sub)derivatives. Along with convergence guarantees we provide a set of numerical performance illustrations on a standard set of problems.

preprint2022arXiv

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

Powered by the simplicity of lock-free asynchrony, Hogwilld! is a go-to approach to parallelize SGD over a shared-memory setting. Despite its popularity and concomitant extensions, such as PASSM+ wherein concurrent processes update a shared model with partitioned gradients, scaling it to decentralized workers has surprisingly been relatively unexplored. To our knowledge, there is no convergence theory of such methods, nor systematic numerical comparisons evaluating speed-up. In this paper, we propose an algorithm incorporating decentralized distributed memory computing architecture with each node running multiprocessing parallel shared-memory SGD itself. Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared-memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models. We prove that our method guarantees ergodic convergence rates for non-convex objectives. On the practical side, we show that the proposed method exhibits improved throughput and competitive accuracy for standard image classification benchmarks on the CIFAR-10, CIFAR-100, and Imagenet datasets. Our code is available at https://github.com/bapi/LPP-SGD.

preprint2022arXiv

Sensitivity Assisted Alternating Directions Method of Multipliers for Distributed Optimization and Statistical Learning

This paper considers the problem of distributed model fitting using the alternating directions method of multipliers (ADMM). ADMM splits the learning problem into several smaller subproblems, usually by partitioning the data samples. The different subproblems can be solved in parallel by a set of worker computing nodes coordinated by a master node, and the subproblems are repeatedly solved until convergence. At each iteration, the worker nodes must solve a convex optimization problem whose difficulty increases with the size of the problem. In this paper, we propose a sensitivity-assisted ADMM algorithm that leverages the parametric sensitivities such that the subproblems solutions can be approximated using a tangential predictor, thus easing the computational burden to computing one linear solve. We study the convergence properties of the proposed sensitivity-assisted ADMM algorithm. The numerical performance of the algorithm is illustrated on a nonlinear parameter estimation problem, and a multilayer perceptron learning problem.

preprint2021arXiv

Decentralized Langevin Dynamics over a Directed Graph

The prevalence of technologies in the space of the Internet of Things and use of multi-processing computing platforms to aid in the computation required to perform learning and inference from large volumes of data has necessitated the extensive study of algorithms on decentralized platforms. In these settings, computing nodes send and receive data across graph-structured communication links, and using a combination of local computation and consensus-seeking communication, cooperately solve a problem of interest. Recently, Langevin dynamics as a tool for high dimensional sampling and posterior Bayesian inference has been studied in the context of a decentralized operation. However, this work has been limited to undirected graphs, wherein all communication is two-sided, i.e., if node A can send data to node B, then node B can also send data to node A. We extend the state of the art in considering Langevin dynamics on directed graphs.

preprint2020arXiv

A Two-Step Pre-Processing for Semidefinite Programming

In semidefinite programming (SDP), a number of pre-processing techniques have been developed including chordal-completion procedures, which reduce the dimension of individual constraints by exploiting sparsity therein, and facial reduction, which reduces the dimension of the problem by removing redundant rows and columns. This paper suggest that these work in a complementary manner and that facial reduction should be used after chordal-completion procedures. In computational experiments on SDP instances from the SDPLib, a benchmark, and structured instances from polynomial and binary quadratic optimisation, we show that such two-step pre-processing with a standard interior-point method outperforms the interior point method, with or without the traditional pre-processing.

preprint2020arXiv

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As most popular contemporary deep neural networks lead to nonsmooth and nonconvex objectives, there is now a pressing need for such convergence guarantees. In this paper, we analyze for the first time the convergence of stochastic asynchronous optimization for this general class of objectives. In particular, we focus on stochastic subgradient methods allowing for block variable partitioning, where the shared-memory-based model is asynchronously updated by concurrent processes. To this end, we first introduce a probabilistic model which captures key features of real asynchronous scheduling between concurrent processes; under this model, we establish convergence with probability one to an invariant set for stochastic subgradient methods with momentum. From the practical perspective, one issue with the family of methods we consider is that it is not efficiently supported by machine learning frameworks, as they mostly focus on distributed data-parallel strategies. To address this, we propose a new implementation strategy for shared-memory based training of deep neural networks, whereby concurrent parameter servers are utilized to train a partitioned but shared model in single- and multi-GPU settings. Based on this implementation, we achieve on average 1.2x speed-up in comparison to state-of-the-art training methods for popular image classification tasks without compromising accuracy.

preprint2020arXiv

Decentralized Langevin Dynamics

Langevin MCMC gradient optimization is a class of increasingly popular methods for estimating a posterior distribution. This paper addresses the algorithm as applied in a decentralized setting, wherein data is distributed across a network of agents which act to cooperatively solve the problem using peer-to-peer gossip communication. We show, theoretically, results in 1) the time-complexity to $ε$-consensus for the continuous time stochastic differential equation, 2) convergence rate in $L^2$ norm to consensus for the discrete implementation as defined by the Euler-Maruyama discretization and 3) convergence rate in the Wasserstein metric to the optimal stationary distribution for the discretized dynamics.

preprint2020arXiv

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. One key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained employing variants of stochastic gradient descent (SGD) based optimization. In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models. The proposed framework de-clutters the implementation-specific convergence analysis and provides an abstraction to derive convergence bounds. We utilize the framework to analyze a sparsification scheme for distributed SGD methods in an asynchronous setting for convex and non-convex objectives. We implement the distributed SGD variant to train deep CNN models in an asynchronous shared-memory setting. Empirical results show that error-feedback may not necessarily help in improving the convergence of sparsified asynchronous distributed SGD, which corroborates an insight suggested by our convergence analysis.

preprint2020arXiv

Ghost Penalties in Nonconvex Constrained Optimization: Diminishing Stepsizes and Iteration Complexity

We consider nonconvex constrained optimization problems and propose a new approach to the convergence analysis based on penalty functions. We make use of classical penalty functions in an unconventional way, in that penalty functions only enter in the theoretical analysis of convergence while the algorithm itself is penalty-free. Based on this idea, we are able to establish several new results, including the first general analysis for diminishing stepsize methods in nonconvex, constrained optimization, showing convergence to generalized stationary points, and a complexity study for SQP-type algorithms.

preprint2020arXiv

Reinforcement Learning Based on Real-Time Iteration NMPC

Reinforcement Learning (RL) has proven a stunning ability to learn optimal policies from data without any prior knowledge on the process. The main drawback of RL is that it is typically very difficult to guarantee stability and safety. On the other hand, Nonlinear Model Predictive Control (NMPC) is an advanced model-based control technique which does guarantee safety and stability, but only yields optimality for the nominal model. Therefore, it has been recently proposed to use NMPC as a function approximator within RL. While the ability of this approach to yield good performance has been demonstrated, the main drawback hindering its applicability is related to the computational burden of NMPC, which has to be solved to full convergence. In practice, however, computationally efficient algorithms such as the Real-Time Iteration (RTI) scheme are deployed in order to return an approximate NMPC solution in very short time. In this paper we bridge this gap by extending the existing theoretical framework to also cover RL based on RTI NMPC. We demonstrate the effectiveness of this new RL approach with a nontrivial example modeling a challenging nonlinear system subject to stochastic perturbations with the objective of optimizing an economic cost.

preprint2020arXiv

Second-order Guarantees of Distributed Gradient Algorithms

We consider distributed smooth nonconvex unconstrained optimization over networks, modeled as a connected graph. We examine the behavior of distributed gradient-based algorithms near strict saddle points. Specifically, we establish that (i) the renowned Distributed Gradient Descent (DGD) algorithm likely converges to a neighborhood of a Second-order Stationary (SoS) solution; and (ii) the more recent class of distributed algorithms based on gradient tracking--implementable also over digraphs--likely converges to exact SoS solutions, thus avoiding (strict) saddle-points. Furthermore, new convergence rate results to first-order critical points is established for the latter class of algorithms.

preprint2020arXiv

Stochastic Gradient Langevin with Delayed Gradients

Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regards to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates. Given the size of many practical problems, parallelizing across several asynchronously running processors is a popular strategy for reducing the end-to-end computation time of stochastic optimization algorithms. In this paper, we are the first to investigate the effect of asynchronous computation, in particular, the evaluation of stochastic Langevin gradients at delayed iterates, on the convergence in measure. For this, we exploit recent results modeling Langevin dynamics as solving a convex optimization problem on the space of measures. We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time. We confirm our theoretical results with numerical experiments on some practical problems.