Researcher profile

Bowen Xiao

Bowen Xiao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2026arXiv

Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift

Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language model with human preference. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases simultaneously during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we defined as \textit{Catastrophic Preference Shift}, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. Such a failure mode is a key limitation shared across BT-style direct preference learning methods, due to the fundamental conflict between the unconstrained discriminative alignment and generative foundational capabilities, ultimately leading to severe performance degradation (e.g., SimPO suffers a significant drop in reasoning accuracy from 73.5\% to 37.5\%). We analyze existing BT-style methods from the probability evolution perspective and theoretically prove that these methods exhibit over-reliance on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO effectively stabilizes and enhances the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens up new avenues towards more reliable and interpretable language model alignment.

preprint2022arXiv

FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance

As deep reinforcement learning (DRL) has been recognized as an effective approach in quantitative finance, getting hands-on experiences is attractive to beginners. However, to train a practical DRL trading agent that decides where to trade, at what price, and what quantity involves error-prone and arduous development and debugging. In this paper, we introduce a DRL library FinRL that facilitates beginners to expose themselves to quantitative finance and to develop their own stock trading strategies. Along with easily-reproducible tutorials, FinRL library allows users to streamline their own developments and to compare with existing schemes easily. Within FinRL, virtual environments are configured with stock market datasets, trading agents are trained with neural networks, and extensive backtesting is analyzed via trading performance. Moreover, it incorporates important trading constraints such as transaction cost, market liquidity and the investor's degree of risk-aversion. FinRL is featured with completeness, hands-on tutorial and reproducibility that favors beginners: (i) at multiple levels of time granularity, FinRL simulates trading environments across various stock markets, including NASDAQ-100, DJIA, S&P 500, HSI, SSE 50, and CSI 300; (ii) organized in a layered architecture with modular structure, FinRL provides fine-tuned state-of-the-art DRL algorithms (DQN, DDPG, PPO, SAC, A2C, TD3, etc.), commonly-used reward functions and standard evaluation baselines to alleviate the debugging workloads and promote the reproducibility, and (iii) being highly extendable, FinRL reserves a complete set of user-import interfaces. Furthermore, we incorporated three application demonstrations, namely single stock trading, multiple stock trading, and portfolio allocation. The FinRL library will be available on Github at link https://github.com/AI4Finance-LLC/FinRL-Library.

preprint2019arXiv

EdgeToll: A Blockchain-based Toll Collection System for Public Sharing of Heterogeneous Edges

Edge computing is a novel paradigm designed toimprove the quality of service for latency sensitive cloud applications. However, the state-of-the-art edge services are designedfor specific applications, which are isolated from each other.To better improve the utilization level of edge nodes, publicresource sharing among edges from distinct service providersshould be encouraged economically. In this work, we employ thepayment channel techniques to design and implement EdgeToll,a blockchain-based toll collection system for heterogeneous public edge sharing. Test-bed has been developed to validate theproposal and preliminary experiments have been conducted todemonstrate the time and cost efficiency of the system.