Source author record

Yash Kanoria

Yash Kanoria appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Science and Game Theory, math.OC, econ.TH, Machine Learning, math.PR, Artificial Intelligence, cond-mat.dis-nn, Data Structures and Algorithms, Discrete Mathematics, Methodology

Catalog footprint

What is connected

8 works

10 topics

4 close collaborators

preprint, 2026, arXiv

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
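As a rough illustration of combining the two estimator types named in the abstract, the sketch below treats a one-step toy problem. It is not the authors' HPO implementation: the quadratic reward, the `targets` values, and all function names are assumptions made for illustration. The discrete choice receives a score-function gradient and the continuous action receives a pathwise gradient through a reparameterized Gaussian. Because there is only a single step, the cross term discussed in the abstract does not arise here.

```python
import math
import random


def mixed_gradient(theta, mu, sigma, targets, n_samples=200_000, seed=0):
    """Monte Carlo gradient of E[reward] w.r.t. (theta, mu) in a one-step toy problem.

    Hypothetical setup (not from the paper):
      d ~ Bernoulli(sigmoid(theta))   discrete action, score-function gradient
      a = mu + sigma * eps            continuous action, pathwise gradient
      reward = -(a - targets[d])**2   smooth in a, discontinuous in d
    """
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))  # P(d = 1)
    g_theta = 0.0
    g_mu = 0.0
    for _ in range(n_samples):
        d = 1 if rng.random() < p else 0
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps  # reparameterization: da/dmu = 1
        reward = -(a - targets[d]) ** 2
        # Score function: d/dtheta log pi(d) = d - p for a sigmoid-Bernoulli.
        g_theta += reward * (d - p)
        # Pathwise: d(reward)/da * da/dmu = -2 * (a - target).
        g_mu += -2.0 * (a - targets[d])
    return g_theta / n_samples, g_mu / n_samples


def exact_gradient(theta, mu, targets):
    """Closed-form gradient of the same toy objective, for comparison."""
    p = 1.0 / (1.0 + math.exp(-theta))
    c0, c1 = targets
    g_theta = p * (1.0 - p) * ((mu - c0) ** 2 - (mu - c1) ** 2)
    g_mu = -2.0 * p * (mu - c1) - 2.0 * (1.0 - p) * (mu - c0)
    return g_theta, g_mu
```

Calling `mixed_gradient(0.0, 0.0, 0.5, (-1.0, 2.0))` and `exact_gradient(0.0, 0.0, (-1.0, 2.0))` lets the sampled estimate be compared against the closed form; both estimators are unbiased in this toy setting, so the two should agree up to Monte Carlo noise.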

preprint, 2023, arXiv

The Competition for Partners in Matching Markets

We study the competition for partners in two-sided matching markets with heterogeneous agent preferences, with a focus on how the equilibrium outcomes depend on the connectivity in the market. We model random partially connected markets, with each agent having an average degree $d$ in a random (undirected) graph, and a uniformly random preference ranking over their neighbors in the graph. We formally characterize stable matchings in large random markets with small imbalance and find a threshold in the connectivity $d$ at $\log^2 n$ (where $n$ is the number of agents on one side of the market) which separates a "weak competition" regime, where agents on both sides of the market do equally well, from a "strong competition" regime, where agents on the short (long) side of the market enjoy a significant advantage (disadvantage). Numerical simulations confirm and sharpen our theoretical predictions, and demonstrate robustness to our assumptions. We leverage our characterizations in two ways: First, we derive prescriptive insights into how to design the connectivity of the market to trade off optimally between the average agent welfare achieved and the number of agents who remain unmatched in the market. For most market primitives, we find that the optimal connectivity should lie in the weak competition regime or at the threshold between the regimes. Second, our analysis uncovers a new conceptual principle governing whether the short side enjoys a significant advantage in a given matching market, which can moreover be applied as a diagnostic tool given only basic summary statistics for the market. Counterfactual analyses using data on centralized high school admissions in a major USA city show the practical value of both our design insights and our diagnostic principle.
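For readers who want to experiment with this setting, here is a minimal sketch of a random partially connected market together with proposer-side deferred acceptance (Gale-Shapley), a standard way to compute a stable matching. The edge-probability convention `d / max(n_a, n_b)` and every function name are assumptions for illustration; the paper's exact model and simulation code may differ.

```python
import random


def random_market(n_a, n_b, d, seed=0):
    """Random bipartite market with uniformly random rankings over neighbors.

    Assumption: each cross pair is linked independently with probability
    d / max(n_a, n_b), so the average degree is roughly d.
    """
    rng = random.Random(seed)
    p = min(1.0, d / max(n_a, n_b))
    prefs_a = [[] for _ in range(n_a)]
    prefs_b = [[] for _ in range(n_b)]
    for a in range(n_a):
        for b in range(n_b):
            if rng.random() < p:
                prefs_a[a].append(b)
                prefs_b[b].append(a)
    for ranking in prefs_a + prefs_b:
        rng.shuffle(ranking)  # most preferred neighbor first
    return prefs_a, prefs_b


def deferred_acceptance(prefs_a, prefs_b):
    """Side A proposes; returns (match_a, match_b) with None for unmatched agents."""
    rank_b = [{a: r for r, a in enumerate(p)} for p in prefs_b]
    next_idx = [0] * len(prefs_a)
    match_b = [None] * len(prefs_b)
    free = list(range(len(prefs_a)))
    while free:
        a = free.pop()
        if next_idx[a] >= len(prefs_a[a]):
            continue  # a has exhausted its list and stays unmatched
        b = prefs_a[a][next_idx[a]]
        next_idx[a] += 1
        current = match_b[b]
        if current is None:
            match_b[b] = a
        elif rank_b[b][a] < rank_b[b][current]:
            match_b[b] = a
            free.append(current)  # displaced proposer tries again
        else:
            free.append(a)  # rejected, will propose to the next neighbor
    match_a = [None] * len(prefs_a)
    for b, a in enumerate(match_b):
        if a is not None:
            match_a[a] = b
    return match_a, match_b


def is_stable(prefs_a, prefs_b, match_a, match_b):
    """True if no linked pair (a, b) would both rather be with each other."""
    rank_a = [{b: r for r, b in enumerate(p)} for p in prefs_a]
    rank_b = [{a: r for r, a in enumerate(p)} for p in prefs_b]
    for a, ranking in enumerate(prefs_a):
        for b in ranking:
            a_wants = match_a[a] is None or rank_a[a][b] < rank_a[a][match_a[a]]
            b_wants = match_b[b] is None or rank_b[b][a] < rank_b[b][match_b[b]]
            if a_wants and b_wants:
                return False
    return True
```

Varying `d` and the imbalance `n_b - n_a` in `random_market`, then comparing the partner ranks achieved on each side, is one way to explore the regimes the abstract describes.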

preprint, 2022, arXiv

Blind Dynamic Resource Allocation in Closed Networks via Mirror Backpressure

We study the problem of maximizing payoff generated over a period of time in a general class of closed queueing networks with a finite, fixed number of supply units which circulate in the system. Demand arrives stochastically, and serving a demand unit (customer) causes a supply unit to relocate from the "origin" to the "destination" of the customer. The key challenge is to manage the distribution of supply in the network. We consider general controls including customer entry control, pricing, and assignment. Motivating applications include shared transportation platforms and scrip systems. Inspired by the mirror descent algorithm for optimization and the backpressure policy for network control, we introduce a rich family of Mirror Backpressure (MBP) control policies. The MBP policies are simple and practical, and crucially do not need any statistical knowledge of the demand (customer) arrival rates (these rates are permitted to vary in time). Under mild conditions, we propose MBP policies that are provably near optimal. Specifically, our policies lose at most $O(\frac{K}{T}+\frac{1}{K} + \sqrt{ηK})$ payoff per customer relative to the optimal policy that knows the demand arrival rates, where $K$ is the number of supply units, $T$ is the total number of customers over the time horizon, and $η$ is the demand process' average rate of change per customer arrival. An adaptation of MBP is found to perform well in numerical experiments based on data from ride-hailing.

preprint, 2022, arXiv

Decentralized Online Convex Optimization in Networked Systems

We study the problem of networked online convex optimization, where each agent individually decides on an action at every time step and agents cooperatively seek to minimize the total global cost over a finite horizon. The global cost is made up of three types of local costs: convex node costs, temporal interaction costs, and spatial interaction costs. In deciding their individual action at each time, an agent has access to predictions of local cost functions for the next $k$ time steps in an $r$-hop neighborhood. Our work proposes a novel online algorithm, Localized Predictive Control (LPC), which generalizes predictive control to multi-agent systems. We show that LPC achieves a competitive ratio of $1 + \tilde{O}(ρ_T^k) + \tilde{O}(ρ_S^r)$ in an adversarial setting, where $ρ_T$ and $ρ_S$ are constants in $(0, 1)$ that increase with the relative strength of temporal and spatial interaction costs, respectively. This is the first competitive ratio bound on decentralized predictive control for networked online convex optimization. Further, we show that the dependence on $k$ and $r$ in our results is near optimal by lower bounding the competitive ratio of any decentralized online algorithm.

preprint, 2020, arXiv

Dynamic Reserve Prices for Repeated Auctions: Learning from Bids

A large fraction of online advertisement is sold via repeated second price auctions. In these auctions, the reserve price is the main tool for the auctioneer to boost revenues. In this work, we investigate the following question: Can changing the reserve prices based on the previous bids improve the revenue of the auction, taking into account the long-term incentives and strategic behavior of the bidders? We show that if the distribution of the valuations is known and satisfies the standard regularity assumptions, then the optimal mechanism has a constant reserve. However, when there is uncertainty in the distribution of the valuations, previous bids can be used to learn the distribution of the valuations and to update the reserve price. We present a simple, approximately incentive-compatible, and asymptotically optimal dynamic reserve mechanism that can significantly improve the revenue over the best static reserve. The paper is from July 2014 (our submission to WINE 2014), posted later here on arXiv to complement the 1-page abstract in the WINE 2014 proceedings.
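To make the role of the reserve price concrete, the sketch below computes revenue in a single second-price auction with a reserve and picks the best static reserve from a log of past bids. This is only a naive baseline under the assumption that bidders bid truthfully and do not react to the learning rule; it is not the dynamic mechanism the paper proposes, and the function names are hypothetical.

```python
def second_price_revenue(bids, reserve):
    """Revenue of one second-price auction with a reserve price.

    The item sells only if the top bid meets the reserve; the winner pays
    the larger of the reserve and the second-highest bid.
    """
    eligible = sorted((b for b in bids if b >= reserve), reverse=True)
    if not eligible:
        return 0.0  # no bid meets the reserve, item goes unsold
    if len(eligible) == 1:
        return float(reserve)  # only the winner cleared the reserve
    return float(eligible[1])  # second-highest bid is at least the reserve


def best_static_reserve(history, candidates):
    """Candidate reserve with the highest average revenue over past auctions.

    Assumption: `history` is a non-empty list of bid lists, treated as
    non-strategic samples (ties between candidates go to the earliest one).
    """
    def average_revenue(reserve):
        return sum(second_price_revenue(b, reserve) for b in history) / len(history)

    return max(candidates, key=average_revenue)
```

For example, `second_price_revenue([3, 5, 1], 4)` charges the reserve because only the top bid clears it, while a reserve of 2 leaves the price at the second-highest bid.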

preprint, 2020, arXiv

Matching while Learning

We consider the problem faced by a service platform that needs to match limited supply with demand but also to learn the attributes of new users in order to match them better in the future. We introduce a benchmark model with heterogeneous "workers" (demand) and a limited supply of "jobs" that arrive over time. Job types are known to the platform, but worker types are unknown and must be learned by observing match outcomes. Workers depart after performing a certain number of jobs. The expected payoff from a match depends on the pair of types and the goal is to maximize the steady-state rate of accumulation of payoff. Though we use terminology inspired by labor markets, our framework applies more broadly to platforms where a limited supply of heterogeneous products is matched to users over time. Our main contribution is a complete characterization of the structure of the optimal policy in the limit that each worker performs many jobs. The platform faces a trade-off for each worker between myopically maximizing payoffs (exploitation) and learning the type of the worker (exploration). This creates a multitude of multi-armed bandit problems, one for each worker, coupled together by the constraint on availability of jobs of different types (capacity constraints). We find that the platform should estimate a shadow price for each job type, and use the payoffs adjusted by these prices, first, to determine its learning goals and then, for each worker, (i) to balance learning with payoffs during the "exploration phase," and (ii) to myopically match after it has achieved its learning goals during the "exploitation phase."

preprint, 2015, arXiv

The set of solutions of random XORSAT formulae

The XOR-satisfiability (XORSAT) problem requires finding an assignment of $n$ Boolean variables that satisfies $m$ exclusive OR (XOR) clauses, whereby each clause constrains a subset of the variables. We consider random XORSAT instances, drawn uniformly at random from the ensemble of formulae containing $n$ variables and $m$ clauses of size $k$. This model presents several structural similarities to other ensembles of constraint satisfaction problems, such as $k$-satisfiability ($k$-SAT), hypergraph bicoloring and graph coloring. For many of these ensembles, as the number of constraints per variable grows, the set of solutions shatters into an exponential number of well-separated components. This phenomenon appears to be related to the difficulty of solving random instances of such problems. We prove a complete characterization of this clustering phase transition for random $k$-XORSAT. In particular, we prove that the clustering threshold is sharp and determine its exact location. We prove that the set of solutions has large conductance below this threshold and that each of the clusters has large conductance above the same threshold. Our proof constructs a very sparse basis for the set of solutions (or the subset within a cluster). This construction is intimately tied to the construction of specific subgraphs of the hypergraph associated with an instance of $k$-XORSAT. In order to study such subgraphs, we establish novel local weak convergence results for them.
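Since an XORSAT instance is a linear system over GF(2), its solution set can be counted by Gaussian elimination. The sketch below generates a random $k$-XORSAT instance and counts its solutions; the clause representation (a tuple of variable indices plus a parity bit) and the function names are assumptions for illustration, and the code does not attempt the clustering analysis the paper carries out.

```python
import random


def random_xorsat(n, m, k, seed=0):
    """m random clauses, each the XOR of k distinct variables set equal to a random bit."""
    rng = random.Random(seed)
    return [(tuple(rng.sample(range(n), k)), rng.randrange(2)) for _ in range(m)]


def count_solutions(n, clauses):
    """Number of assignments of n Boolean variables satisfying every XOR clause.

    Each clause is stored as a bitmask over the variables plus a parity bit,
    and rows are reduced against pivots indexed by their highest set bit.
    """
    pivots = {}  # highest set bit -> (mask, parity)
    for variables, parity in clauses:
        mask = 0
        for v in variables:
            mask ^= 1 << v  # a variable appearing twice cancels out
        parity &= 1
        while mask:
            top = mask.bit_length() - 1
            if top not in pivots:
                pivots[top] = (mask, parity)  # new independent equation
                break
            pivot_mask, pivot_parity = pivots[top]
            mask ^= pivot_mask
            parity ^= pivot_parity
        else:
            # Row reduced to 0 = parity: inconsistent if parity is 1.
            if parity:
                return 0
    # A consistent system of rank r leaves n - r free variables.
    return 2 ** (n - len(pivots))
```

For instance, the three clauses `x0 ^ x1 = 1`, `x1 ^ x2 = 1`, `x0 ^ x2 = 0` have rank 2 and therefore two solutions on three variables, while flipping the last parity bit makes the system unsatisfiable.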

preprint, 2014, arXiv

The size of the core in assignment markets

Assignment markets involve matching with transfers, as in labor markets and housing markets. We consider a two-sided assignment market with agent types and stochastic structure similar to models used in empirical studies, and characterize the size of the core in such markets. Each agent has a randomly drawn productivity with respect to each type of agent on the other side. The value generated from a match between a pair of agents is the sum of the two productivity terms, each of which depends only on the type but not the identity of one of the agents, and a third deterministic term driven by the pair of types. We allow the number of agents to grow, keeping the number of agent types fixed. Let $n$ be the number of agents and $K$ be the number of types on the side of the market with more types. We find, under reasonable assumptions, that the relative variation in utility per agent over core outcomes is bounded as $O^*(1/n^{1/K})$, where polylogarithmic factors have been suppressed. Further, we show that this bound is tight in the worst case. We also provide a tighter bound under more restrictive assumptions. Our results provide partial justification for the typical assumption of a unique core outcome in empirical studies.