Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
24works
0followers
16topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

24 published item(s)

preprint2026arXiv

BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.

preprint2026arXiv

BeamCKMDiff: Beam-Aware Channel Knowledge Map Construction via Diffusion Transformer

Channel knowledge map (CKM) is emerging as a critical enabler for environment-aware 6G networks, offering a site-specific database to significantly reduce pilot overhead. However, existing CKM construction methods typically rely on sparse sampling measurements and are restricted to either omnidirectional maps or discrete codebooks, hindering the exploitation of beamforming gain. To address these limitations, we propose BeamCKMDiff, a generative framework for constructing high-fidelity CKMs conditioned on arbitrary continuous beamforming vectors without site-specific sampling. Specifically, we incorporate a novel adaptive layer normalization (adaLN) mechanism into the noise prediction network of the Diffusion Transformer (DiT). This mechanism injects continuous beam embeddings as {global control parameters}, effectively steering the generative process to capture the complex coupling between beam patterns and environmental geometries. Simulation results demonstrate that BeamCKMDiff significantly outperforms state-of-the-art baselines, achieving superior reconstruction accuracy in capturing main lobes and side lobes.

preprint2026arXiv

FocalOrder: Focal Preference Optimization for Reading Order Detection

Reading order detection is the foundation of document understanding. Most existing methods rely on uniform supervision, implicitly assuming a constant difficulty distribution across layout regions. In this work, we challenge this assumption by revealing a critical flaw: \textbf{Positional Disparity}, a phenomenon where models demonstrate mastery over the deterministic start and end regions but suffer a performance collapse in the complex intermediate sections. This degradation arises because standard training allows the massive volume of easy patterns to drown out the learning signals from difficult layouts. To address this, we propose \textbf{FocalOrder}, a framework driven by \textbf{Focal Preference Optimization (FPO)}. Specifically, FocalOrder employs adaptive difficulty discovery with exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions, while introducing a difficulty-calibrated pairwise ranking objective to enforce global logical consistency. Extensive experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc. Our compact model not only outperforms competitive specialized baselines but also significantly surpasses large-scale general VLMs. These results demonstrate that aligning the optimization with intrinsic structural ambiguity of documents is critical for mastering complex document structures.

preprint2026arXiv

LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.

preprint2026arXiv

PARL: Position-Aware Relation Learning Network for Document Layout Analysis

Document layout analysis aims to detect and categorize structural elements (e.g., titles, tables, figures) in scanned or digital documents. Popular methods often rely on high-quality Optical Character Recognition (OCR) to merge visual features with extracted text. This dependency introduces two major drawbacks: propagation of text recognition errors and substantial computational overhead, limiting the robustness and practical applicability of multimodal approaches. In contrast to the prevailing multimodal trend, we argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. To this end, we propose PARL (Position-Aware Relation Learning Network), a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Specifically, we first introduce a Bidirectional Spatial Position-Guided Deformable Attention module to embed explicit positional dependencies among layout elements directly into visual features. Second, we design a Graph Refinement Classifier (GRC) to refine predictions by modeling contextual relationships through a dynamically constructed layout graph. Extensive experiments show PARL achieves state-of-the-art results. It establishes a new benchmark for vision-only methods on DocLayNet and, notably, surpasses even strong multimodal models on M6Doc. Crucially, PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models (256M), demonstrating that sophisticated visual structure modeling can be both more efficient and robust than multimodal fusion.

preprint2026arXiv

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.

preprint2026arXiv

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.

preprint2023arXiv

Automated Ring-Diagram Framework for Classifying CP Invariants

In this study, we introduce a transformative, automated framework for classifying basis invariants in generic field theories. Utilising a novel ring-diagram methodology accompanied by the well-known Cayley-Hamilton theorem, our approach uniquely enables the identification of basic invariants and their CP-property characterisation. Critically, our framework also unveils previously concealed attributes of established techniques reliant on the Hilbert-Poincaré series and its associated Plethystic logarithm. This paradigm shift has broad implications for the deeper understanding and more accurate classification of CP invariants in generic field theories.

preprint2023arXiv

Optimization of Image Transmission in a Cooperative Semantic Communication Networks

In this paper, a semantic communication framework for image transmission is developed. In the investigated framework, a set of servers cooperatively transmit images to a set of users utilizing semantic communication techniques. To evaluate the performance of studied semantic communication system, a multimodal metric is proposed to measure the correlation between the extracted semantic information and the original image. To meet the ISS requirement of each user, each server must jointly determine the semantic information to be transmitted and the resource blocks (RBs) used for semantic information transmission. We formulate this problem as an optimization problem aiming to minimize each server's transmission latency while reaching the ISS requirement. To solve this problem, a value decomposition based entropy-maximized multi-agent reinforcement learning (RL) is proposed, which enables servers to coordinate for training and execute RB allocation in a distributed manner to approach to a globally optimal performance with less training iterations. Compared to traditional multi-agent RL, the proposed RL improves the valuable action exploration of servers and the probability of finding a globally optimal RB allocation policy based on local observation. Simulation results show that the proposed algorithm can reduce the transmission delay by up to 16.1% compared to traditional multi-agent RL.

preprint2022arXiv

Active Learning for Contextual Search with Binary Feedbacks

In this paper, we study the learning problem in contextual search, which is motivated by applications such as first-price auction, personalized medicine experiments, and feature-based pricing experiments. In particular, for a sequence of arriving context vectors, with each context associated with an underlying value, the decision-maker either makes a query at a certain point or skips the context. The decision-maker will only observe the binary feedback on the relationship between the query point and the value associated with the context. We study a PAC learning setting, where the goal is to learn the underlying mean value function in context with a minimum number of queries. To address this challenge, we propose a tri-section search approach combined with a margin-based active learning method. We show that the algorithm only needs to make $O(1/\varepsilon^2)$ queries to achieve an $ε$-estimation accuracy. This sample complexity significantly reduces the required sample complexity in the passive setting, at least $Ω(1/\varepsilon^4)$.

preprint2022arXiv

Cracking White-box DNN Watermarks via Invariant Neuron Transforms

Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNN) becomes a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activation) of the target model. Sacrificing less functionality and involving more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts and applications in the industry. In this paper, we present the first effective removal attack which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we for the first time discover their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered by our proposed chain of invariant neuron transforms. On $9$ state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to be almost random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge on the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.

preprint2022arXiv

Meta-Reinforcement Learning for Reliable Communication in THz/VLC Wireless VR Networks

In this paper, the problem of enhancing the quality of virtual reality (VR) services is studied for an indoor terahertz (THz)/visible light communication (VLC) wireless network. In the studied model, small base stations (SBSs) transmit high-quality VR images to VR users over THz bands and light-emitting diodes (LEDs) provide accurate indoor positioning services for them using VLC. Here, VR users move in real time and their movement patterns change over time according to their applications, where both THz and VLC links can be blocked by the bodies of VR users. To control the energy consumption of the studied THz/VLC wireless VR network, VLC access points (VAPs) must be selectively turned on so as to ensure accurate and extensive positioning for VR users. Based on the user positions, each SBS must generate corresponding VR images and establish THz links without body blockage to transmit the VR content. The problem is formulated as an optimization problem whose goal is to maximize the reliability of the VR network by selecting the appropriate VAPs to be turned on and controlling the user association with SBSs. To solve this problem, a policy gradient-based reinforcement learning (RL) algorithm that adopts a meta-learning approach is proposed. The proposed meta policy gradient (MPG) algorithm enables the trained policy to quickly adapt to new user movement patterns. In order to solve the problem of maximizing the average number of successfully served users for VR scenarios with a large number of users, a dual method based MPG algorithm (D-MPG) with a low complexity is proposed. Simulation results demonstrate that, compared to the trust region policy optimization algorithm (TRPO), the proposed MPG and D-MPG algorithms yield up to 26.8% and 21.9% improvement in the reliability as well as 81.2% and 87.5% gains in the convergence speed, respectively.

preprint2022arXiv

Performance Optimization for Semantic Communications: An Attention-based Reinforcement Learning Approach

In this paper, a semantic communication framework is proposed for textual data transmission. In the studied model, a base station (BS) extracts the semantic information from textual data, and transmits it to each user. The semantic information is modeled by a knowledge graph (KG) that consists of a set of semantic triples. After receiving the semantic information, each user recovers the original text using a graph-to-text generation model. To measure the performance of the considered semantic communication framework, a metric of semantic similarity (MSS) that jointly captures the semantic accuracy and completeness of the recovered text is proposed. Due to wireless resource limitations, the BS may not be able to transmit the entire semantic information to each user and satisfy the transmission delay constraint. Hence, the BS must select an appropriate resource block for each user as well as determine and transmit part of the semantic information to the users. As such, we formulate an optimization problem whose goal is to maximize the total MSS by jointly optimizing the resource allocation policy and determining the partial semantic information to be transmitted. To solve this problem, a proximal-policy-optimization-based reinforcement learning (RL) algorithm integrated with an attention network is proposed. The proposed algorithm can evaluate the importance of each triple in the semantic information using an attention network and then, build a relationship between the importance distribution of the triples in the semantic information and the total MSS. Compared to traditional RL algorithms, the proposed algorithm can dynamically adjust its learning rate thus ensuring convergence to a locally optimal solution.

preprint2022arXiv

Robust Dynamic Assortment Optimization in the Presence of Outlier Customers

We consider the dynamic assortment optimization problem under the multinomial logit model (MNL) with unknown utility parameters. The main question investigated in this paper is model mis-specification under the $\varepsilon$-contamination model, which is a fundamental model in robust statistics and machine learning. In particular, throughout a selling horizon of length $T$, we assume that customers make purchases according to a well specified underlying multinomial logit choice model in a $(1-\varepsilon)$-fraction of the time periods, and make arbitrary purchasing decisions instead in the remaining $\varepsilon$-fraction of the time periods. In this model, we develop a new robust online assortment optimization policy via an active elimination strategy. We establish both upper and lower bounds on the regret, and show that our policy is optimal up to logarithmic factor in $T$ when the assortment capacity is constant. %% capacity of assortments has a constant upper limit. We further develop a fully adaptive policy that does not require any prior knowledge of the contamination parameter $\varepsilon$. In the case of the existence a sub-optimality gap between optimal and sub-optimal products, we also established gap-dependent logarithmic regret upper bounds and lower bounds in both the known-$\varepsilon$ and unknown-$\varepsilon$ cases. Our simulation study shows that our policy outperforms the existing policies based on upper confidence bounds (UCB) and Thompson sampling.

preprint2021arXiv

Adversarial Combinatorial Bandits with General Non-linear Reward Functions

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on adversarial linear combinatorial bandit. {The adversarial combinatorial bandit with general non-linear reward is an important open problem in bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback.} We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetildeΘ_{d}(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d< K$, and $Θ_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial. {Both bounds are significantly different from the bound $O(\sqrt{\mathrm{poly}(N,K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures.} Our result also finds applications to adversarial assortment optimization problem in online recommendation. We show that in the worst-case of adversarial assortment problem, the optimal algorithm must treat each individual $\binom{N}{K}$ assortment as independent.

preprint2021arXiv

Dynamic Assortment Selection under the Nested Logit Models

We study a stylized dynamic assortment planning problem during a selling season of finite length $T$. At each time period, the seller offers an arriving customer an assortment of substitutable products and the customer makes the purchase among offered products according to a discrete choice model. The goal of the seller is to maximize the expected revenue, or equivalently, to minimize the worst-case expected regret. One key challenge is that utilities of products are unknown to the seller and need to be learned. Although the dynamic assortment planning problem has received increasing attention in revenue management, most existing work is based on the multinomial logit choice models (MNL). In this paper, we study the problem of dynamic assortment planning under a more general choice model -- the nested logit model, which models hierarchical choice behavior and is ``the most widely used member of the GEV (generalized extreme value) family&#39;&#39;. By leveraging the revenue-ordered structure of the optimal assortment within each nest, we develop a novel upper confidence bound (UCB) policy with an aggregated estimation scheme. Our policy simultaneously learns customers&#39; choice behavior and makes dynamic decisions on assortments based on the current knowledge. It achieves the accumulated regret at the order of $\tilde{O}(\sqrt{MNT})$, where $M$ is the number of nests and $N$ is the number of products in each nest. We further provide a lower bound result of $Ω(\sqrt{MT})$, which shows the near optimality of the upper bound when $T$ is much larger than $M$ and $N$. When the number of items per nest $N$ is large, we further provide a discretization heuristic for better performance of our algorithm. Numerical results are presented to demonstrate the empirical performance of our proposed algorithms.

preprint2021arXiv

Tight Regret Bounds for Infinite-armed Linear Contextual Bandits

Linear contextual bandit is an important class of sequential decision making problems with a wide range of applications to recommender systems, online advertising, healthcare, and many other machine learning related tasks. While there is a lot of prior research, tight regret bounds of linear contextual bandit with infinite action sets remain open. In this paper, we address this open problem by considering the linear contextual bandit with (changing) infinite action sets. We prove a regret upper bound on the order of $O(\sqrt{d^2T\log T})\times \text{poly}(\log\log T)$ where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the previous lower bound of $Ω(\sqrt{d^2 T\log T})$ in [Li et al., 2019] up to iterated logarithmic terms.

preprint2020arXiv

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman ranks, we develop an online learning algorithm that learns the optimal value function while at the same time achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed in (Jiang et al., 2017), known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods from (Dudik et al., 2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first $\sqrt{n}$-regret result for reinforcement learning in stochastic MDPs with general value function approximation.

preprint2020arXiv

Constant Regret Re-solving Heuristics for Price-based Revenue Management

Price-based revenue management is an important problem in operations management with many practical applications. The problem considers a retailer who sells a product (or multiple products) over $T$ consecutive time periods and is subject to constraints on the initial inventory levels. While the optimal pricing policy could be obtained via dynamic programming, such an approach is sometimes undesirable because of high computational costs. Approximate policies, such as the re-solving heuristics, are often applied as computationally tractable alternatives. In this paper, we show the following two results. First, we prove that a natural re-solving heuristic attains $O(1)$ regret compared to the value of the optimal policy. This improves the $O(\ln T)$ regret upper bound established in the prior work of \cite{jasin2014reoptimization}. Second, we prove that there is an $Ω(\ln T)$ gap between the value of the optimal policy and that of the fluid model. This complements our upper bound result by showing that the fluid is not an adequate information-relaxed benchmark when analyzing price-based revenue management algorithms.

preprint2020arXiv

Deep Learning for Optimal Deployment of UAVs with Visible Light Communications

In this paper, the problem of dynamical deployment of unmanned aerial vehicles (UAVs) equipped with visible light communication (VLC) capabilities for optimizing the energy efficiency of UAV-enabled networks is studied. In the studied model, the UAVs can simultaneously provide communications and illumination to service ground users. Since ambient illumination increases the interference over VLC links while reducing the illumination threshold of the UAVs, it is necessary to consider the illumination distribution of the target area for UAV deployment optimization. This problem is formulated as an optimization problem which jointly optimizes UAV deployment, user association, and power efficiency while meeting the illumination and communication requirements of users. To solve this problem, an algorithm that combines the machine learning framework of gated recurrent units (GRUs) with convolutional neural networks (CNNs) is proposed. Using GRUs and CNNs, the UAVs can model the long-term historical illumination distribution and predict the future illumination distribution. Given the prediction of illumination distribution, the original nonconvex optimization problem can be divided into two sub-problems and is then solved using a low-complexity, iterative algorithm. Then, the proposed algorithm enables UAVs to determine the their deployment and user association to minimize the total transmit power. Simulation results using real data from the Earth observations group (EOG) at NOAA/NCEI show that the proposed approach can achieve up to 68.9% reduction in total transmit power compared to a conventional optimal UAV deployment that does not consider the illumination distribution and user association.

preprint2020arXiv

Near-Linear Time Local Polynomial Nonparametric Estimation with Box Kernels

Local polynomial regression (Fan and Gijbels 1996) is an important class of methods for nonparametric density estimation and regression problems. However, straightforward implementation of local polynomial regression has quadratic time complexity which hinders its applicability in large-scale data analysis. In this paper, we significantly accelerate the computation of local polynomial estimates by novel applications of multi-dimensional binary indexed trees (Fenwick 1994). Both time and space complexity of our proposed algorithm is nearly linear in the number of input data points. Simulation results confirm the efficiency and effectiveness of our proposed approach.

preprint2020arXiv

Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection

We consider the stochastic contextual bandit problem under the high dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging as we need to balance exploration and exploitation. We propose doubly growing epochs and estimating the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $ \tilde{\mathcal{O}}(s\sqrt{T})$ regret with high probability, which is nearly independent in the ``ambient&#39;&#39; regression model dimension $d$. We further attain a sharper $\tilde{\mathcal{O}}(\sqrt{sT})$ regret by using the \textsc{SupLinUCB} framework and match the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.

preprint2020arXiv

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $Ω(\sqrt{dT (\log T) (\log n)})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ factors from previous analysis, and our information-theoretical lower bound also improves previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from classical multi-armed bandits in which no logarithmic $T$ term is present in minimax regret. Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB on the upper bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side.

preprint2020arXiv

Uncertainty Quantification for Demand Prediction in Contextual Dynamic Pricing

Data-driven sequential decision has found a wide range of applications in modern operations management, such as dynamic pricing, inventory control, and assortment optimization. Most existing research on data-driven sequential decision focuses on designing an online policy to maximize the revenue. However, the research on uncertainty quantification on the underlying true model function (e.g., demand function), a critical problem for practitioners, has not been well explored. In this paper, using the problem of demand function prediction in dynamic pricing as the motivating example, we study the problem of constructing accurate confidence intervals for the demand function. The main challenge is that sequentially collected data leads to significant distributional bias in the maximum likelihood estimator or the empirical risk minimization estimate, making classical statistics approaches such as the Wald&#39;s test no longer valid. We address this challenge by developing a debiased approach and provide the asymptotic normality guarantee of the debiased estimator. Based this the debiased estimator, we provide both point-wise and uniform confidence intervals of the demand function.