Researcher profile

Sandeep Silwal

Sandeep Silwal contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2026arXiv

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

preprint2026arXiv

DynMuon: A Dynamic Spectral Shaping View of Muon

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=UΣV^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $UΣ^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss.

preprint2023arXiv

Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

We study fundamental problems in linear algebra, such as finding a maximal linearly independent subset of rows or columns (a basis), solving linear regression, or computing a subspace embedding. For these problems, we consider input matrices $\mathbf{A}\in\mathbb{R}^{n\times d}$ with $n > d$. The input can be read in $\text{nnz}(\mathbf{A})$ time, which denotes the number of nonzero entries of $\mathbf{A}$. In this paper, we show that beyond the time required to read the input matrix, these fundamental linear algebra problems can be solved in $d^ω$ time, i.e., where $ω\approx 2.37$ is the current matrix-multiplication exponent. To do so, we introduce a constant-factor subspace embedding with the optimal $m=\mathcal{O}(d)$ number of rows, and which can be applied in time $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}α\right) + d^{2 + α}\text{poly}(\log d)$ for any trade-off parameter $α>0$, tightening a recent result by Chepurko et. al. [SODA 2022] that achieves an $\exp(\text{poly}(\log\log n))$ distortion with $m=d\cdot\text{poly}(\log\log d)$ rows in $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}α+d^{2+α+o(1)}\right)$ time. Our subspace embedding uses a recently shown property of stacked Subsampled Randomized Hadamard Transforms (SRHT), which actually increase the input dimension, to "spread" the mass of an input vector among a large number of coordinates, followed by random sampling. To control the effects of random sampling, we use fast semidefinite programming to reweight the rows. We then use our constant-factor subspace embedding to give the first optimal runtime algorithms for finding a maximal linearly independent subset of columns, regression, and leverage score sampling. To do so, we also introduce a novel subroutine that iteratively grows a set of independent rows, which may be of independent interest.

preprint2022arXiv

Faster Fundamental Graph Algorithms via Learned Predictions

We consider the question of speeding up classic graph algorithms with machine-learned predictions. In this model, algorithms are furnished with extra advice learned from past or similar instances. Given the additional information, we aim to improve upon the traditional worst-case run-time guarantees. Our contributions are the following: (i) We give a faster algorithm for minimum-weight bipartite matching via learned duals, improving the recent result by Dinitz, Im, Lavastida, Moseley and Vassilvitskii (NeurIPS, 2021); (ii) We extend the learned dual approach to the single-source shortest path problem (with negative edge lengths), achieving an almost linear runtime given sufficiently accurate predictions which improves upon the classic fastest algorithm due to Goldberg (SIAM J. Comput., 1995); (iii) We provide a general reduction-based framework for learning-based graph algorithms, leading to new algorithms for degree-constrained subgraph and minimum-cost $0$-$1$ flow, based on reductions to bipartite matching and the shortest path problem. Finally, we give a set of general learnability theorems, showing that the predictions required by our algorithms can be efficiently learned in a PAC fashion.

preprint2022arXiv

Hardness and Algorithms for Robust and Sparse Optimization

We explore algorithms and limitations for sparse optimization problems such as sparse linear regression and robust linear regression. The goal of the sparse linear regression problem is to identify a small number of key features, while the goal of the robust linear regression problem is to identify a small number of erroneous measurements. Specifically, the sparse linear regression problem seeks a $k$-sparse vector $x\in\mathbb{R}^d$ to minimize $\|Ax-b\|_2$, given an input matrix $A\in\mathbb{R}^{n\times d}$ and a target vector $b\in\mathbb{R}^n$, while the robust linear regression problem seeks a set $S$ that ignores at most $k$ rows and a vector $x$ to minimize $\|(Ax-b)_S\|_2$. We first show bicriteria, NP-hardness of approximation for robust regression building on the work of [OWZ15] which implies a similar result for sparse regression. We further show fine-grained hardness of robust regression through a reduction from the minimum-weight $k$-clique conjecture. On the positive side, we give an algorithm for robust regression that achieves arbitrarily accurate additive error and uses runtime that closely matches the lower bound from the fine-grained hardness result, as well as an algorithm for sparse regression with similar runtime. Both our upper and lower bounds rely on a general reduction from robust linear regression to sparse regression that we introduce. Our algorithms, inspired by the 3SUM problem, use approximate nearest neighbor data structures and may be of independent interest for solving sparse optimization problems. For instance, we demonstrate that our techniques can also be used for the well-studied sparse PCA problem.

preprint2022arXiv

Learning-Augmented $k$-means Clustering

$k$-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the $k$-means problem on worst-case inputs. To overcome this barrier, we consider a scenario where "advice" is provided to help perform clustering. Specifically, we consider the $k$-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some, possibly adversarial, error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naïvely following the accurate predictor can still lead to a high clustering cost. Thus if the predictor is sufficiently accurate, we can retrieve a close to optimal clustering with nearly optimal runtime, breaking known computational barriers for algorithms that do not have access to such advice. We evaluate our algorithms on real datasets and show significant improvements in the quality of clustering.

preprint2022arXiv

Motif Cut Sparsifiers

A motif is a frequently occurring subgraph of a given directed or undirected graph $G$. Motifs capture higher order organizational structure of $G$ beyond edge relationships, and, therefore, have found wide applications such as in graph clustering, community detection, and analysis of biological and physical networks to name a few. In these applications, the cut structure of motifs plays a crucial role as vertices are partitioned into clusters by cuts whose conductance is based on the number of instances of a particular motif, as opposed to just the number of edges, crossing the cuts. In this paper, we introduce the concept of a motif cut sparsifier. We show that one can compute in polynomial time a sparse weighted subgraph $G'$ with only $\widetilde{O}(n/ε^2)$ edges such that for every cut, the weighted number of copies of $M$ crossing the cut in $G'$ is within a $1+ε$ factor of the number of copies of $M$ crossing the cut in $G$, for every constant size motif $M$. Our work carefully combines the viewpoints of both graph sparsification and hypergraph sparsification. We sample edges which requires us to extend and strengthen the concept of cut sparsifiers introduced in the seminal work of to the motif setting. We adapt the importance sampling framework through the viewpoint of hypergraph sparsification by deriving the edge sampling probabilities from the strong connectivity values of a hypergraph whose hyperedges represent motif instances. Finally, an iterative sparsification primitive inspired by both viewpoints is used to reduce the number of edges in $G$ to nearly linear. In addition, we present a strong lower bound ruling out a similar result for sparsification with respect to induced occurrences of motifs.

preprint2022arXiv

Triangle and Four Cycle Counting with Predictions in Graph Streams

We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.

preprint2020arXiv

A note on the universality of ESDs of inhomogeneous random matrices

In this short note, we extend the celebrated results of Tao and Vu, and Krishnapur on the universality of empirical spectral distributions to a wide class of inhomogeneous complex random matrices, by showing that a technical and hard-to-verify Fourier domination assumption may be replaced simply by a natural uniform anti-concentration assumption. Along the way, we show that inhomogeneous complex random matrices, whose expected squared Hilbert-Schmidt norm is quadratic in the dimension, and whose entries (after symmetrization) are uniformly anti-concentrated at $0$ and infinity, typically have smallest singular value $Ω(n^{-1/2})$. The rate $n^{-1/2}$ is sharp, and closes a gap in the literature. Our proofs closely follow recent works of Livshyts, and Livshyts, Tikhomirov, and Vershynin on inhomogeneous real random matrices. The new ingredient is a couple of anti-concentration inequalities for sums of independent, but not necessarily identically distributed, complex random variables, which may also be useful in other contexts.

preprint2018arXiv

Directed Random Geometric Graphs

Many real-world networks are intrinsically directed. Such networks include activation of genes, hyperlinks on the internet, and the network of followers on Twitter among many others. The challenge, however, is to create a network model that has many of the properties of real-world networks such as powerlaw degree distributions and the small-world property. To meet these challenges, we introduce the \textit{Directed} Random Geometric Graph (DRGG) model, which is an extension of the random geometric graph model. We prove that it is scale-free with respect to the indegree distribution, has binomial outdegree distribution, has a high clustering coefficient, has few edges and is likely small-world. These are some of the main features of aforementioned real world networks. We empirically observe that word association networks have many of the theoretical properties of the DRGG model.