Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
28works
0followers
17topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

28 published item(s)

preprint2026arXiv

EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we succeed to train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.

preprint2022arXiv

Analyzing Micro-Founded General Equilibrium Models with Many Agents using Deep Reinforcement Learning

Real economies can be modeled as a sequential imperfect-information game with many heterogeneous agents, such as consumers, firms, and governments. Dynamic general equilibrium (DGE) models are often used for macroeconomic analysis in this setting. However, finding general equilibria is challenging using existing theoretical or computational methods, especially when using microfoundations to model individual agents. Here, we show how to use deep multi-agent reinforcement learning (MARL) to find $ε$-meta-equilibria over agent types in microfounded DGE models. Whereas standard MARL fails to learn non-trivial solutions, our structured learning curricula enable stable convergence to meaningful solutions. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., continuous market clearing, that are commonly used for analytical tractability. Furthermore, our end-to-end GPU implementation enables fast real-time convergence with a large number of RL economic agents. We showcase our approach in open and closed real-business-cycle (RBC) models with 100 worker-consumers, 10 firms, and a social planner who taxes and redistributes. We validate the learned solutions are $ε$-meta-equilibria through best-response analyses, show that they align with economic intuitions, and show our approach can learn a spectrum of qualitatively distinct $ε$-meta-equilibria in open RBC models. As such, we show that hardware-accelerated MARL is a promising framework for modeling the complexity of economies based on microfoundations.

preprint2022arXiv

Application of Color Block Code in Image Scaling

Aiming at the high cost of embedding annotation watermark in a narrow small area and the information distortion caused by the change of annotation watermark image resolution, this paper proposes a color block code technology, which uses location information and color code to form recognizable graphics, which can not only simplify the annotation graphics, but also ensure the recognition efficiency. First, the constituent elements of color block code are designed, and then the coding and decoding method of color block code is proposed. Experiments show that color block code has high anti-scaling and anti-interference, and can be widely used in the labeling of small object surface and low resolution image.

preprint2022arXiv

Detecting and Monitoring Tidal Dissipation of Hot Jupiters in the Era of SiTian

Transit Timing Variation (TTV) of hot Jupiters provides direct observational evidence of planet tidal dissipation. Detecting tidal dissipation through TTV needs high precision transit timings and long timing baselines. In this work, we predict and discuss the potential scientific contribution of SiTian Survey in detecting and analyzing exoplanet TTV. We develop a tidal dissipation detection pipeline for SiTian Survey that aims at time-domain astronomy with 72 1-meter optical telescopes. The pipeline includes the modules of light curve deblending, transit timing obtaining, and TTV modeling. SiTian is capable to detect more than 25,000 exoplanets among which we expect $\sim$50 sources showing evidence of tidal dissipation. We present detection and analysis of tidal dissipating targets, based on simulated SiTian light curves of XO-3b and WASP-161b. The transit light curve modeling gives consistent results within 1$σ$ to input values of simulated light curves. Also, the parameter uncertainties predicted by Monte-Carlo Markov Chain are consistent with the distribution obtained from simulating and modeling the light curve 1000 times. The timing precision of SiTian observations is $\sim$ 0.5 minutes with one transit visit. We show that differences between TTV origins, e.g., tidal dissipation, apsidal precession, multiple planets, would be significant, considering the timing precision and baseline. The detection rate of tidal dissipating hot Jupiters would answer a crucial question of whether the planet migrates at an early formation stage or random stages due to perturbations, e.g., planet scattering, secular interaction. SiTian identified targets would be constructive given that the sample would extend tenfold.

preprint2022arXiv

Efficient and Differentiable Conformal Prediction with General Function Classes

Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.

preprint2022arXiv

J-PLUS: Support Vector Regression to Measure Stellar Parameters

Context. Stellar parameters are among the most important characteristics in studies of stars, which are based on atmosphere models in traditional methods. However, time cost and brightness limits restrain the efficiency of spectral observations. The J-PLUS is an observational campaign that aims to obtain photometry in 12 bands. Owing to its characteristics, J-PLUS data have become a valuable resource for studies of stars. Machine learning provides powerful tools to efficiently analyse large data sets, such as the one from J-PLUS, and enable us to expand the research domain to stellar parameters. Aims. The main goal of this study is to construct a SVR algorithm to estimate stellar parameters of the stars in the first data release of the J-PLUS observational campaign. Methods. The training data for the parameters regressions is featured with 12-waveband photometry from J-PLUS, and is cross-identified with spectrum-based catalogs. These catalogs are from the LAMOST, the APOGEE, and the SEGUE. We then label them with the stellar effective temperature, the surface gravity and the metallicity. Ten percent of the sample is held out to apply a blind test. We develop a new method, a multi-model approach in order to fully take into account the uncertainties of both the magnitudes and stellar parameters. The method utilizes more than two hundred models to apply the uncertainty analysis. Results. We present a catalog of 2,493,424 stars with the Root Mean Square Error of 160K in the effective temperature regression, 0.35 in the surface gravity regression and 0.25 in the metallicity regression. We also discuss the advantages of this multi-model approach and compare it to other machine-learning methods.

preprint2022arXiv

Local Calibration: Metrics and Recalibration

Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel local recalibration method LoRe, to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.

preprint2022arXiv

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $μ$ close to the optimal policy $π_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $μ$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $μ$ and $π_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $Ω(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $μ$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $μ$ only satisfies concentrability partially up to a certain time step.

preprint2022arXiv

Privacy protection based on mask template

Powerful recognition algorithms are widely used in the Internet or important medical systems, which poses a serious threat to personal privacy. Although the law provides for diversity protection, e.g. The General Data Protection Regulation (GDPR) in Europe and Articles 1032 to 1039 of the civil code in China. However, as an important privacy disclosure event, biometric data is often hidden, which is difficult for the owner to detect and trace to the source. Human biometrics generally exist in images. In order to avoid the disclosure of personal privacy, we should prevent unauthorized recognition algorithms from acquiring the real features of the original image.

preprint2022arXiv

Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games

Imperfect-Information Extensive-Form Games (IIEFGs) is a prevalent model for real-world games involving imperfect information and sequential plays. The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs. However, existing algorithms for finding an EFCE require full feedback from the game, and it remains open how to efficiently learn the EFCE in the more challenging bandit feedback setting where the game can only be learned by observations from repeated playing. This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback. We begin by proposing $K$-EFCE -- a more generalized definition that allows players to observe and deviate from the recommended actions for $K$ times. The $K$-EFCE includes the EFCE as a special case at $K=1$, and is an increasingly stricter notion of equilibrium as $K$ increases. We then design an uncoupled no-regret algorithm that finds an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K}/\varepsilon^2)$ iterations in the full feedback setting, where $X_i$ and $A_i$ are the number of information sets and actions for the $i$-th player. Our algorithm works by minimizing a wide-range regret at each information set that takes into account all possible recommendation histories. Finally, we design a sample-based variant of our algorithm that learns an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K+1}/\varepsilon^2)$ episodes of play in the bandit feedback setting. When specialized to $K=1$, this gives the first sample-efficient algorithm for learning EFCE from bandit feedback.

preprint2022arXiv

Stellar chromospheric activities revealed from the LAMOST-K2 time-domain survey

By using the LAMOST time-domain survey data, we study stellar activities based on the $\rm{H_α}$ lines for about 2000 stars in four $K$2 plates. Two indices, $R_{\rm{Hα}}^{'}$ and $R_{\rm{Hα}}^{+}$, are computed from LAMOST spectra, the former of which is derived by excluding the photospheric contributions to the $\rm{H_α}$ lines, while the latter is derived by further subtracting the non-dynamo driven chromospheric emission. Meanwhile, the periodicity and variation amplitudes are computed from \emph{K2} light curves. Both the $R_{\rm{Hα}}^{'}$-Ro relation and $R_{\rm{Hα}}^{+}$-Ro relation show complicated profiles in the non-saturated decay region. Hot stars show flatter slopes and higher activity level than cool stars, and the behaviour is more notable in the $R_{\rm{Hα}}^{+}$-$R_{o}$ relation. This is consistent with recent studies using other activity proxies, including $L_{\rm{x}}/L_{\rm{bol}}$, $R_{\rm{HK}}^{'}$ and amplitudes of optical light curves. % This may suggest different kinds of stars follow different power laws in the decay region. Most of our targets have multiple observations, and some of them exhibit significant variability of ${\rm{Hα}}$ emissions, which may cause the large scatters shown in the decay region. We find three targets exhibiting positive correlation in rotational phase, possibly indicating that their optical light curves are dominated by hot faculae rather than cool starspots.

preprint2022arXiv

Unifying Cross-lingual Summarization and Machine Translation with Compression Rate

Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.

preprint2022arXiv

When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

Multi-agent reinforcement learning has made substantial empirical progresses in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $ε$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / ε^2)$ episodes, and an $ε$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / ε^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexities polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine which minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $ε$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S\sum_{i\le m} A_i / ε^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $ε$), which only depends linearly in $\sum_{i\le m} A_i$ and significantly improves over existing efficient algorithm in the $ε$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.

preprint2021arXiv

A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an $ε$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/ε^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/ε^2)$, and is the first that matches the information-theoretic lower bound $Ω(H^3S(A+B)/ε^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min \{A,B\}=o(H^3)$, and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.

preprint2021arXiv

Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models

Recent work showed that there could be a large gap between the classical uniform convergence bound and the actual test error of zero-training-error predictors (interpolators) such as deep neural networks. To better understand this gap, we study the uniform convergence in the nonlinear random feature model and perform a precise theoretical analysis on how uniform convergence depends on the sample size and the number of parameters. We derive and prove analytical expressions for three quantities in this model: 1) classical uniform convergence over norm balls, 2) uniform convergence over interpolators in the norm ball (recently proposed by Zhou et al. (2020)), and 3) the risk of minimum norm interpolator. We show that, in the setting where the classical uniform convergence bound is vacuous (diverges to $\infty$), uniform convergence over the interpolators still gives a non-trivial bound of the test error of interpolating solutions. We also showcase a different setting where classical uniform convergence bound is non-vacuous, but uniform convergence over interpolators can give an improved sample complexity guarantee. Our result provides a first exact comparison between the test errors and uniform convergence bounds for interpolators beyond simple linear models.

preprint2021arXiv

How Important is the Train-Validation Split in Meta-Learning?

Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct \emph{train-train method}, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta testing, and this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a \emph{strictly better} excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.

preprint2021arXiv

J-PLUS: Support Vector Machine Applied to STAR-GALAXY-QSOClassification

Context. In modern astronomy, machine learning has proved to be efficient and effective to mine the big data from the newesttelescopes. Spectral surveys enable us to characterize millions of objects, while long exposure time observations and wide surveysconstrain their strides from millions to billions. Aims.In this study, we construct a supervised machine learning algorithm, to classify the objects in the Javalambre Photometric LocalUniverse Survey first data release (J-PLUS DR1). Methods.The sample set is featured with 12-waveband photometry, and magnitudes are labeled with spectrum-based catalogs, in-cluding Sloan Digital Sky Survey spectroscopic data, Large Sky Area Multi-Object Fiber Spectroscopic Telescope, and VERONCAT- Veron Catalog of Quasars & AGN. The performance of the classifier is presented with applications of blind test validations basedon RAdial Velocity Extension, Kepler Input Catalog, 2 MASS Redshift Survey, and the UV-bright Quasar Survey. A new algorithmis applied to constrain the extrapolation that could decrease accuracies for many machine learning classifiers. Results.The accuracies of the classifier are 96.5% in blind test and 97.0% in training cross validation. The F1-scores for each classare presented to show the precision of the classifier. We also discuss different methods to constrain the po

preprint2021arXiv

LTD064402+245919: A Subgiant with a 1-3 M$_{\odot}$ Undetected Companion Identified from LAMOST-TD Data

Single-line spectroscopic binaries recently contribute to the stellar-mass black hole discovery, independently of the X-ray transient method. We report the identification of a single-line binary system LTD064402+245919, with an orbital period of 14.50 days. The observed component is a subgiant with a mass of 2.77$\pm$0.68M$_{\odot}$, radius 15.5$\pm$2.5R$_{\odot}$, effective temperature $T_{\rm eff}$ 4500$\pm$200K, and surface gravity log\emph{g} 2.5$\pm$0.25dex. The discovery makes use of the LAMOST time-domain (LAMOST-TD) and ZTF survey. Our general-purpose software pipeline applies the Lomb-Scargle periodogram to determine the orbital period and uses machine-learning to classify the variable type from the folded light curves. We apply a combined model to estimate the orbital parameters from both the light and radial velocity curves, taking constraints on the primary star mass, mass function, and detection limit of secondary luminosity into consideration. We obtain a radial velocity semi-amplitude of 44.6$\pm$1.5 km s$^{-1}$, mass ratio of 0.73$\pm$0.07, and an undetected component mass of 2.02$\pm$0.49M$_{\odot}$ when the type of the undetected component is not set. We conclude that the inclination is not well constrained, and that the secondary mass is larger than 1M$_{\odot}$ when the undetected component is modelled as a compact object. According to our investigations using an MCMC simulation, increasing the spectra SNR by a factor of 3 would enable the secondary light to be distinguished (if present). The algorithm and software in this work are able to serve as general-purpose tools for the identification of compact objects quiescent in X-rays.

preprint2021arXiv

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $ε$-optimal policy with $\widetilde{O}(H^2/d_mε^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $Ω(H^2/d_mε^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

preprint2021arXiv

Towards Understanding Hierarchical Learning: Benefits of Neural Representations

Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations are not explained by recent theories that relate them to "shallow learners" such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$. We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite width limit), when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.

preprint2021arXiv

White dwarfs identified in LAMOST Data Release 5

In this paper, we report white dwarfs identified in the 5th Data Release of the Large Area Multi-Object fibre Spectroscopic Telescope, including spectral types of DA, DB, DC, DZ, and so on. There are 2 625 DA spectra of 2 281 DA stars, 182 DB spectra of 166 DB stars, 62 DC spectra of 58 DC stars, 36 DZ spectra of 33 DZ stars and many other types identified, in addition to our previous paper (Data Release 2). Among those sources, 393 DA stars and 46 DB stars are new identifications after cross-matching with the literature. In order to select DA candidates, we use the classification result from the LAMOST pipeline, colour-colour cut method and a random forest machine learning method. For DBs, since there is no template for DB in the pipeline model, a random forest machine learning method is chosen to select candidates. All the WD candidates have been visually checked individually. The parameters of effective temperature, surface gravity, mass, and cooling age have been estimated for relatively high signal-to-noise ratio DAs and DBs. The peaks of the DA and DB mass distributions are found to be around 0.62Msun and 0.65Msun, respectively. Finally, the data and method we used to select white dwarf candidates for the second phase of LAMOST survey are also addressed in this paper.

preprint2020arXiv

A Catalog of Short Period Spectroscopic and Eclipsing Binaries Identified from the LAMOST & PTF Surveys

Binaries play key roles in determining stellar parameters and exploring stellar evolution models. We build a catalog of 88 eclipsing binaries with spectroscopic information, taking advantage of observations from both the Large Sky Area Multi-Object fiber Spectroscopic Telescope (LAMOST) and the Palomar Transient Factory (PTF) surveys. A software pipeline is constructed to identify binary candidates by examining their light curves. The orbital periods of binaries are derived from the Lomb-Scargle method. The key distinguishing features of eclipsing binaries are recognized by a new filter \textit{Flat Test}. We classify the eclipsing binaries by applying Fourier analysis on the light curves. Among all the binary stars, 13 binaries are identified as eclipsing binaries for the first time. The catalog contains information: position, primary eclipsing magnitude and time, eclipsing depth, the number of photometry and radial velocity observations, largest radial velocity difference, binary type, the effective temperature of observable star $T_{\rm eff}$, and surface gravity of observable star log \emph{g}. The false-positive probability is calculated by using both a Monte Carlo simulation and real data from the SDSS Stripe 82 Standard Catalog. The binaries in the catalog are mostly with a period of less than one day. The period distribution shows a 0.22-day cut-off which is consistent with the low probability of an eclipsing binary rotating with such a period.

preprint2020arXiv

Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks

Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by he Neural Tangent Kernels (NTKs). NTK theory leads to concrete convergence and generalization results, yet the empirical performance of neural networks are observed to exceed their linearized models, suggesting insufficiency of this theory. Towards closing this gap, we investigate the training of over-parametrized neural networks that are beyond the NTK regime yet still governed by the Taylor expansion of the network. We bring forward the idea of \emph{randomizing} the neural networks, which allows them to escape their NTK and couple with quadratic models. We show that the optimization landscape of randomized two-layer networks are nice and amenable to escaping-saddle algorithms. We prove concrete generalization and expressivity results on these randomized networks, which lead to sample complexity bounds (of learning certain simple functions) that match the NTK and can in addition be better by a dimension factor when mild distributional assumptions are present. We demonstrate that our randomization technique can be generalized systematically beyond the quadratic case, by using it to find networks that are coupled with higher-order terms in their Taylor series.

preprint2020arXiv

Near-Optimal Reinforcement Learning with Self-Play

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $Ω(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.

preprint2020arXiv

Provable Self-Play Algorithms for Competitive Reinforcement Learning

Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of exisiting theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and show that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, where the regret is measured by the agent's performance against a \emph{fully adversarial} opponent who can exploit the agent's strategy at \emph{any} step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.

preprint2020arXiv

Provably Efficient Q-Learning with Low Switching Cost

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is $O(H^3SA\log K)$, and we provide a lower bound of $Ω(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting, which yields nontrivial results that improve upon prior work in certain aspects.

preprint2020arXiv

Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width

We propose \emph{Taylorized training} as an initiative towards better understanding neural network training at finite width. Taylorized training involves training the $k$-th order Taylor expansion of the neural network at initialization, and is a principled extension of linearized training---a recently proposed theory for understanding the success of deep learning. We experiment with Taylorized training on modern neural network architectures, and show that Taylorized training (1) agrees with full neural network training increasingly better as we increase $k$, and (2) can significantly close the performance gap between linearized and full training. Compared with linearized training, higher-order training works in more realistic settings such as standard parameterization and large (initial) learning rate. We complement our experiments with theoretical results showing that the approximation error of $k$-th order Taylorized models decay exponentially over $k$ in wide neural networks.

preprint2019arXiv

Machine Learning Regression of extinction in the second $Gaia$ Data Release

Machine learning has become a popular tool to help us make better decisions and predictions, based on experiences, observations and analysing patterns within a given data set without explicitly functions. In this paper, we describe an application of the supervised machine-learning algorithm to the extinction regression for the second Gaia data release, based on the combination of Large Sky Area Multi-Object Fiber Spectroscopic Telescope, Sloan Extension for Galactic Understanding and Exploration and the Apache Point Observatory Galactic Evolution Experiment. The derived extinction in our training sample is consistent with other spectrum-based estimates, and its standard deviation of the cross validations is 0.0127 mag. A blind test is carried out using the RAdial Velocity Experiment catalog, and the standard deviation is 0.0372 mag. Such precise training sample enable us to regress the extinction, E(BP-RP), for 133 million stars in the second Gaia data release. Of these, 106 million stars have the uncertainties less than 0.1 mag, which suffer less bias from the external regression. We also find that there are high deviations between the extinctions form photometry-based methods, and between spectrum- and photometry-based methods. This implies that spectrum-based method could bring more signal to a regressing model than multi-band photometry, and a higher signal-to-noise ratio would acquire a more reliable result.