Output list
Preprint
Posted to a preprint site 02/13/2026
ArXiv.org
In this work, we study the sample complexity of obtaining a Nash equilibrium (NE) estimate in two-player zero-sum matrix games with noisy feedback. Specifically, we propose a novel algorithm that repeatedly solves linear programs (LPs) to obtain an NE estimate with bias at most $\varepsilon$ with a sample complexity of $O\left(\frac{m_1 m_2}{\varepsilon\min\{\delta^2,\sigma_0^2,\sigma^3\}} \log\frac{m_1 m_2}{\varepsilon}\right)$ for general $m_1 \times m_2$ game matrices, where $\sigma$, $\sigma_0$, $\delta$ are problem-dependent constants. To our knowledge, this is the first instance-dependent sample complexity bound for finding an NE estimate with $\varepsilon$ bias in general-dimension matrix games with noisy feedback and potentially non-unique equilibria. Our algorithm builds on recent advances in online resource allocation and operates in two stages: (1) identifying the support set of an NE, and (2) computing the unique NE restricted to this support. Both stages rely on a careful analysis of LP solutions derived from noisy samples.
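To make the "repeatedly solve LPs on noisy samples" idea concrete, here is a minimal sketch (not the paper's two-stage algorithm) that computes the row player's max-min strategy of an empirical game matrix by a single LP; the matrix values and noise level are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def max_min_strategy(A):
    """Row player's max-min mixed strategy for payoff matrix A (m1 x m2)."""
    m1, m2 = A.shape
    # Variables: (x_1, ..., x_m1, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(m1), [-1.0]])
    # Constraints v <= (A^T x)_j for every column j:  -A^T x + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((m2, 1))])
    b_ub = np.zeros(m2)
    # Probability simplex: sum(x) = 1, x >= 0; v is free.
    A_eq = np.concatenate([np.ones(m1), [0.0]]).reshape(1, -1)
    bounds = [(0, None)] * m1 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:m1], res.x[m1]

rng = np.random.default_rng(0)
A_true = rng.uniform(-1, 1, size=(4, 5))
A_hat = A_true + 0.05 * rng.standard_normal((4, 5))  # noisy empirical matrix
x_hat, value_hat = max_min_strategy(A_hat)
print(x_hat.round(3), round(float(value_hat), 3))
```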
Preprint
Pricing Query Complexity of Multiplicative Revenue Approximation
Posted to a preprint site 02/11/2026
ArXiv.org
We study the pricing query complexity of revenue maximization for a single buyer whose private valuation is drawn from an unknown distribution. In this setting, the seller must learn the optimal monopoly price by posting prices and observing only binary purchase decisions, rather than the realized valuations. Prior work has established tight query complexity bounds for learning a near-optimal price with additive error $\varepsilon$ when the valuation distribution is supported on $[0,1]$. However, our understanding of how to learn a near-optimal price that achieves at least a $(1-\varepsilon)$ fraction of the optimal revenue remains limited. In this paper, we study the pricing query complexity of the single-buyer revenue maximization problem under such multiplicative error guarantees in several settings. Observe that when pricing queries are the only source of information about the buyer's distribution, no algorithm can achieve a non-trivial approximation, since the scale of the distribution cannot be learned from pricing queries alone. Motivated by this fundamental impossibility, we consider two natural and well-motivated models that provide "scale hints": (i) a one-sample hint, in which the algorithm observes a single realized valuation before making pricing queries; and (ii) a value-range hint, in which the valuation support is known to lie within $[1, H]$. For each type of hint, we establish pricing query complexity guarantees that are tight up to polylogarithmic factors for several classes of distributions, including monotone hazard rate (MHR) distributions, regular distributions, and general distributions.
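The following sketch illustrates the interaction model only: the seller posts a price and sees the binary decision $\mathbf{1}\{v \ge p\}$, never the valuation itself. The grid-search estimator, distribution, and query budget are all hypothetical and are not the paper's (near-)optimal algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
sample_valuation = lambda: rng.uniform(1.0, 10.0)  # support in [1, H], H = 10

def pricing_query(price):
    """One query: post a price, observe only the binary purchase decision."""
    return float(sample_valuation() >= price)

def estimate_price(grid, queries_per_price):
    best_price, best_rev = None, -1.0
    for p in grid:
        # Hypothetical estimate of the purchase probability Pr[v >= p].
        q_hat = np.mean([pricing_query(p) for _ in range(queries_per_price)])
        if p * q_hat > best_rev:  # empirical revenue p * Pr[v >= p]
            best_price, best_rev = p, p * q_hat
    return best_price, best_rev

p_star, rev = estimate_price(np.linspace(1.0, 10.0, 19), 2000)
print(p_star, round(rev, 3))
```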
Preprint
Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback
Posted to a preprint site 02/09/2026
ArXiv.org
In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
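For concreteness, here is a minimal sketch of the standard inverse-gap-weighting (IGW) exploration rule the abstract's policy-optimization step builds on: actions with a larger estimated reward gap are played with proportionally smaller probability. The reward estimates are hypothetical, and the paper's latent-reward decoding for MDPs is omitted.

```python
import numpy as np

def igw_distribution(reward_est, gamma):
    """IGW exploration distribution over K actions, given reward estimates."""
    K = len(reward_est)
    best = int(np.argmax(reward_est))
    # Probability inversely proportional to the estimated gap to the best action.
    p = 1.0 / (K + gamma * (reward_est[best] - reward_est))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()  # remaining mass goes to the greedy action
    return p

p = igw_distribution(np.array([0.9, 0.5, 0.2]), gamma=20.0)
print(p.round(3), p.sum())
```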
Preprint
Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach
Posted to a preprint site 02/06/2026
ArXiv.org
We study distributed adversarial bandits, where $N$ agents cooperate to minimize the global average loss while observing only their own local losses. We show that the minimax regret for this problem is $\tilde{\Theta}\big(\sqrt{(\rho^{-1/2}+K/N)T}\big)$, where $T$ is the horizon, $K$ is the number of actions, and $\rho$ is the spectral gap of the communication matrix. Our algorithm, based on a novel black-box reduction to bandits with delayed feedback, requires agents to communicate only through gossip. It achieves an upper bound that significantly improves over the previous best bound $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$ of Yi and Vojnovic (2023). We complement this result with a matching lower bound, showing that the problem's difficulty decomposes into a communication cost $\rho^{-1/4}\sqrt{T}$ and a bandit cost $\sqrt{KT/N}$. We further demonstrate the versatility of our approach by deriving first-order and best-of-both-worlds bounds in the distributed adversarial setting. Finally, we extend our framework to distributed linear bandits in $\mathbb{R}^d$, obtaining a regret bound of $\tilde{O}\big(\sqrt{(\rho^{-1/2}+1/N)dT}\big)$, achieved with only $O(d)$ communication cost per agent and per round via a volumetric spanner.
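Below is a minimal sketch of the gossip primitive the algorithm communicates through: each agent repeatedly mixes its local estimate with its neighbors' via a doubly stochastic matrix $W$, and all agents converge to the network average at a rate governed by the spectral gap. The ring topology and the estimates are hypothetical; the bandit reduction itself is omitted.

```python
import numpy as np

N, K = 8, 3
# Doubly stochastic gossip matrix for a ring: average with both neighbors.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 1 / 2
    W[i, (i - 1) % N] = 1 / 4
    W[i, (i + 1) % N] = 1 / 4

estimates = np.random.default_rng(2).random((N, K))  # one row per agent
for _ in range(20):
    estimates = W @ estimates  # one gossip round
# All rows approach the global average; the gap shrinks with the spectral gap.
print(np.abs(estimates - estimates.mean(axis=0)).max())
```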
Preprint
Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory
Posted to a preprint site 02/06/2026
ArXiv.org
In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $\lambda_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}\big(\sqrt{(1+P_T)(T+\sum_t \lambda_t)}\big)$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds. This recovers the optimal guarantees for both static and dynamic regret in standard OCO as a special case where $\lambda_t = 0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.
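For reference, the quantity being bounded can be written as follows, assuming the standard definition of dynamic regret against a comparator sequence $u_1,\dots,u_T$ with path length $P_T = \sum_t \|u_t - u_{t-1}\|$ (a sketch; the paper's exact norm and conventions may differ):

```latex
\mathrm{Reg}_T(u_{1:T})
  = \sum_{t=1}^{T} \big( f_t(x_t) - f_t(u_t) \big)
  + \sum_{t=1}^{T} \lambda_t \,\lVert x_t - x_{t-1} \rVert .
```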
Preprint
Decentralized Online Convex Optimization with Unknown Feedback Delays
Posted to a preprint site 01/12/2026
ArXiv.org
Decentralized online convex optimization (D-OCO), where multiple agents within a network collaboratively learn optimal decisions in real time, arises naturally in applications such as federated learning, sensor networks, and multi-agent control. In this paper, we study D-OCO under unknown, time- and agent-varying feedback delays. While recent work has addressed this problem (Nguyen et al., 2024), existing algorithms assume prior knowledge of the total delay over agents and still suffer from suboptimal dependence on both the delay and network parameters. To overcome these limitations, we propose a novel algorithm that achieves an improved regret bound of $O\big(N\sqrt{d_{\mathrm{tot}}} + N\sqrt{T}/(1-\sigma_2)^{1/4}\big)$, where $T$ is the total horizon, $d_{\mathrm{tot}}$ denotes the average total delay across agents, $N$ is the number of agents, and $1-\sigma_2$ is the spectral gap of the network. Our approach builds upon recent advances in D-OCO (Wan et al., 2024a), but crucially incorporates an adaptive learning rate mechanism via a decentralized communication protocol. This enables each agent to estimate delays locally using a gossip-based strategy without prior knowledge of the total delay. We further extend our framework to the strongly convex setting and derive a sharper regret bound of $O\big(\frac{N\delta_{\max}\ln T}{\alpha}\big)$, where $\alpha$ is the strong convexity parameter and $\delta_{\max}$ is the maximum number of missing observations averaged over agents. We also show that our upper bounds for both settings are tight up to logarithmic factors. Experimental results validate the effectiveness of our approach, showing improvements over existing benchmark algorithms.
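The sketch below illustrates the flavor of a delay-adaptive step size: an agent tracks how many of its own gradients are still outstanding and shrinks its learning rate accordingly, with no prior knowledge of the total delay. The schedule is illustrative only and is not the paper's tuning or its gossip-based delay estimation.

```python
import math

class DelayAdaptiveStep:
    def __init__(self):
        self.arrived = 0      # gradients received so far
        self.outstanding = 0  # queries sent but not yet answered

    def on_query(self):
        self.outstanding += 1

    def on_feedback(self):
        self.outstanding -= 1
        self.arrived += 1

    def eta(self):
        # Step size decays with both progress and currently missing feedback
        # (a hypothetical schedule, not the paper's exact one).
        return 1.0 / math.sqrt(1 + self.arrived + self.outstanding)
```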
Preprint
Exploiting Curvature in Online Convex Optimization with Delayed Feedback
Posted to a preprint site 06/09/2025
ArXiv.org
In this work, we study the online convex optimization problem with curved losses and delayed feedback. When losses are strongly convex, existing approaches obtain regret bounds of order $d_{\max}\ln T$, where $d_{\max}$ is the maximum delay and $T$ is the time horizon. However, in many cases, this guarantee can be much worse than $\sqrt{d_{\mathrm{tot}}}$ as obtained by a delayed version of online gradient descent, where $d_{\mathrm{tot}}$ is the total delay. We bridge this gap by proposing a variant of follow-the-regularized-leader that obtains regret of order $\min\{\sigma_{\max}\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $\sigma_{\max}$ is the maximum number of missing observations. We then consider exp-concave losses and extend the Online Newton Step algorithm to handle delays with an adaptive learning rate tuning, achieving regret $\min\{d_{\max} n \ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $n$ is the dimension. To our knowledge, this is the first algorithm to achieve such a regret bound for exp-concave losses. We further consider the problem of unconstrained online linear regression and achieve a similar guarantee by designing a variant of the Vovk-Azoury-Warmuth forecaster with a clipping trick. Finally, we implement our algorithms and conduct experiments under various types of delay and losses, showing improved performance over existing methods.
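To make the feedback protocol concrete, here is a minimal sketch of delayed online gradient descent on a toy quadratic: the gradient computed at round $t$ arrives $d_t$ rounds later, and the learner updates only with gradients that have arrived. The paper's FTRL variant and adaptive tuning are omitted; the losses and delays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, x = 200, np.zeros(2)
inbox = {}  # arrival round -> list of gradients
for t in range(T):
    theta = np.array([1.0, -0.5])    # minimizer of the toy quadratic
    grad = x - theta                 # gradient of 0.5 * ||x - theta||^2
    delay = int(rng.integers(0, 5))  # d_t: feedback arrives at round t + d_t
    inbox.setdefault(t + delay, []).append(grad)
    eta = 1.0 / np.sqrt(t + 1)
    for g in inbox.pop(t, []):       # use only gradients that have arrived
        x = x - eta * g
print(x.round(3))
```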
Preprint
Comparator-Adaptive Φ-Regret: Improved Bounds, Simpler Algorithms, and Applications to Games
Posted to a preprint site 05/22/2025
ArXiv.org
In the classic expert problem, Φ-regret measures the gap between the learner's total loss and that achieved by applying the best action transformation φ ∈ Φ. A recent work by Lu et al. [2025] introduces an adaptive algorithm whose regret against a comparator depends on a certain sparsity-based complexity measure of φ, (almost) recovering and interpolating optimal bounds for standard regret notions such as external, internal, and swap regret. In this work, we propose a general idea to achieve an even better comparator-adaptive Φ-regret bound via much simpler algorithms compared to Lu et al. [2025]. Specifically, we discover a prior distribution over all possible binary transformations and show that it suffices to achieve prior-dependent regret against these transformations. Then, we propose two concrete and efficient algorithms to do so: the first learns over multiple copies of a prior-aware variant of the Kernelized MWU algorithm of Farina et al. [2022], and the second learns over multiple copies of a prior-aware variant of the BM-reduction [Blum and Mansour, 2007]. To further showcase the power of our methods and their advantages over Lu et al. [2025] beyond simplicity and better regret bounds, we also show that our second approach can be extended to the game setting to achieve an accelerated and adaptive convergence rate to Φ-equilibria for a class of general-sum games. When specialized to correlated equilibria, our bound improves over the existing ones from Anagnostides et al. [2022a,b].
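Both algorithms in the abstract are built from the multiplicative-weights (Hedge) update; here is a minimal sketch of that building block on illustrative losses. The kernelized, prior-aware variants and the BM-reduction machinery are omitted.

```python
import numpy as np

def hedge(loss_rounds, eta):
    """Run Hedge over a sequence of loss vectors; yield the play distribution per round."""
    K = len(loss_rounds[0])
    w = np.ones(K)
    plays = []
    for loss in loss_rounds:
        p = w / w.sum()
        plays.append(p)
        w = w * np.exp(-eta * np.asarray(loss))  # exponential reweighting
    return plays

losses = [[0.2, 0.7, 0.5], [0.9, 0.1, 0.4], [0.3, 0.3, 0.8]]
for p in hedge(losses, eta=0.5):
    print(p.round(3))
```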
Preprint
Contextual Linear Bandits with Delay as Payoff
Posted to a preprint site 02/17/2025
ArXiv.org
A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is loss, we also show a further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct a least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
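To illustrate the phased-elimination skeleton only: in each phase the learner plays the retained actions more often, forms tighter estimates, and drops actions whose estimated gap exceeds the phase's confidence width. The volumetric spanner, the payoff-dependent delays, and the paper's exact confidence widths are all omitted; the means and widths below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
true_means = np.array([0.9, 0.7, 0.65, 0.2])
active = list(range(len(true_means)))

for phase in range(1, 10):
    n = 2 ** phase                        # plays per retained action this phase
    width = np.sqrt(2 * np.log(100) / n)  # illustrative confidence width
    # Empirical mean of n noisy plays per action (noise std 1 per play).
    est = {a: rng.normal(true_means[a], 1 / np.sqrt(n)) for a in active}
    best = max(est.values())
    active = [a for a in active if est[a] >= best - 2 * width]
print("surviving actions:", active)
```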
Preprint
Alternating Regret for Online Convex Optimization
Posted to a preprint site 02/17/2025
ArXiv.org
Motivated by alternating learning dynamics in two-player games, a recent work by Cevher et al. (2024) shows that $o(\sqrt{T})$ alternating regret is possible for any $T$-round adversarial Online Linear Optimization (OLO) problem, and leaves as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves $\tilde{O}(d^{2/3}T^{1/3})$ alternating regret for any adversarial $d$-dimensional OCO problem. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum game, or a coarse correlated equilibrium for any convex two-player general-sum game, at a rate of $\tilde{O}(d^{2/3}/T^{2/3})$. To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the $\ell_2$ ball, our algorithm achieves $\tilde{O}(T^{2/5})$ alternating regret with no dimension dependence (and a better $\tilde{O}(T^{1/3})$ bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising $\Omega(\sqrt{T})$ lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.
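For reference, the Follow-the-Regularized-Leader update named in the abstract takes the standard form below, assuming decision set $\mathcal{X}$, regularizer $\psi$ (whose convex conjugate is third-order smooth in the paper's construction), and learning rate $\eta > 0$; this is the generic template, not the paper's specific instantiations:

```latex
x_{t+1} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{X}}
  \; \sum_{s=1}^{t} f_s(x) \;+\; \frac{1}{\eta}\,\psi(x).
```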