Researcher profile

Xiao Fang

Xiao Fang contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
17 works
0 followers
11 topics
4 close collaborators

Published work

17 published items

preprint, 2024, arXiv

Data Valuation for Vertical Federated Learning: A Model-free and Privacy-preserving Method

Vertical federated learning (VFL) is a promising paradigm for predictive analytics, empowering an organization (i.e., task party) to enhance its predictive models through collaborations with multiple data suppliers (i.e., data parties) in a decentralized and privacy-preserving way. Despite the fast-growing interest in VFL, the lack of effective and secure tools for assessing the value of data owned by data parties hinders the application of VFL in business contexts. In response, we propose FedValue, a privacy-preserving, task-specific but model-free data valuation method for VFL, which consists of a data valuation metric and a federated computation method. Specifically, we first introduce a novel data valuation metric, namely MShapley-CMI. The metric evaluates a data party's contribution to a predictive analytics task without needing to execute a machine learning model, making it well-suited for real-world applications of VFL. Next, we develop an innovative federated computation method that calculates the MShapley-CMI value for each data party in a privacy-preserving manner. Extensive experiments conducted on six public datasets validate the efficacy of FedValue for data valuation in the context of VFL. In addition, we illustrate the practical utility of FedValue with a case study involving federated movie recommendations.
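
The abstract does not define MShapley-CMI, so the following is only a generic sketch of the classical Shapley value that the metric's name alludes to, not the paper's method. The party names and the `toy_value` function are hypothetical; in particular, `toy_value` is a made-up stand-in, not the conditional-mutual-information quantity used by FedValue.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the players could join the coalition."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical value of a coalition of data parties (illustration only).
    A and B hold partly redundant information; C contributes nothing."""
    base = {"A": 4, "B": 2, "C": 0}
    v = sum(base[p] for p in coalition)
    if {"A", "B"} <= coalition:
        v -= 2  # overlap between A and B is not counted twice
    return v


phi = shapley_values(["A", "B", "C"], toy_value)
```

Exact enumeration costs factorial time in the number of parties, so a sketch like this is only practical for a handful of data parties.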

preprint, 2024, arXiv

Second-order Approximation of Exponential Random Graph Models

Exponential random graph models (ERGMs) are flexible probability models allowing edge dependency. However, it is known that, to a first-order approximation, many ERGMs behave like Erdős-Rényi random graphs, where edges are independent. In this paper, to distinguish ERGMs from Erdős-Rényi random graphs, we consider second-order approximations of ERGMs using two-stars and triangles. We prove that the second-order approximation indeed achieves second-order accuracy in the triangle-free case. The new approximation is formally obtained by Hoeffding decomposition and rigorously justified using Stein's method.

preprint, 2022, arXiv

Billion-user Customer Lifetime Value Prediction: An Industrial-scale Solution from Kuaishou

Customer lifetime value (LTV) is the expected total revenue that a single user can bring to a business. It is widely used in a variety of business scenarios to make operational decisions when acquiring new customers. Modeling LTV is a challenging problem, due to its complex and mutable data distribution. Existing approaches either directly learn from posterior feature distributions or leverage statistical models that make strong assumptions on prior distributions, both of which fail to capture those mutable distributions. In this paper, we propose a complete set of industrial-level LTV modeling solutions. Specifically, we introduce an Order Dependency Monotonic Network (ODMN) that models the ordered dependencies between LTVs of different time spans, which greatly improves model performance. We further introduce a Multi Distribution Multi Experts (MDME) module based on the divide-and-conquer idea, which transforms the severely imbalanced distribution modeling problem into a series of relatively balanced sub-distribution modeling problems and hence greatly reduces the modeling complexity. In addition, a novel evaluation metric, Mutual Gini, is introduced to better measure the distribution difference between the estimated value and the ground-truth label based on the Lorenz curve. The ODMN framework has been successfully deployed in many business scenarios of Kuaishou and achieved great performance. Extensive experiments on real-world industrial data demonstrate the superiority of the proposed methods compared to state-of-the-art baselines including ZILN and Two-Stage XGBoost models.

preprint, 2022, arXiv

Cosmology with the Roman Space Telescope -- Synergies with CMB lensing

We explore synergies between the Nancy Grace Roman Space Telescope and CMB lensing data to constrain dark energy and modified gravity scenarios. A simulated likelihood analysis of the galaxy clustering and weak lensing data from the Roman Space Telescope High Latitude Survey combined with CMB lensing data from the Simons Observatory is undertaken, marginalizing over important astrophysical effects and calibration uncertainties. Included in the modeling are the effects of baryons on small-scale clustering, scale-dependent growth suppression by neutrinos, as well as uncertainties in the galaxy clustering biases, in the intrinsic alignment contributions to the lensing signal, in the redshift distributions, and in the galaxy shape calibration. The addition of CMB lensing roughly doubles the dark energy figure-of-merit from Roman photometric survey data alone, varying from a factor of 1.7 to 2.4 improvement depending on the particular Roman survey configuration. Alternatively, the inclusion of CMB lensing information can compensate for uncertainties in the Roman galaxy shape calibration if it falls below the design goals. Furthermore, we report the first forecast of Roman constraints on a model-independent structure growth, parameterized by $σ_8 (z)$, and on the Hu-Sawicki f(R) gravity as well as an improved forecast of the phenomenological $(Σ_0,μ_0)$ model. We find that CMB lensing plays a crucial role in constraining $σ_8(z)$ at z>2, with percent-level constraints forecasted out to z=4. CMB lensing information does not improve constraints on the f(R) models substantially. It does, however, increase the $(Σ_0,μ_0)$ figure-of-merit by a factor of about 1.5.

preprint, 2022, arXiv

Exploiting Expert Knowledge for Assigning Firms to Industries: A Novel Deep Learning Method

Industry assignment, which assigns firms to industries according to a predefined Industry Classification System (ICS), is fundamental to a large number of critical business practices, ranging from operations and strategic decision making by firms to economic analyses by government agencies. Three types of expert knowledge are essential to effective industry assignment: definition-based knowledge (i.e., expert definitions of each industry), structure-based knowledge (i.e., structural relationships among industries as specified in an ICS), and assignment-based knowledge (i.e., prior firm-industry assignments performed by domain experts). Existing industry assignment methods utilize only assignment-based knowledge to learn a model that classifies unassigned firms to industries, and overlook definition-based and structure-based knowledge. Moreover, these methods only consider which industry a firm has been assigned to, but ignore the time-specificity of assignment-based knowledge, i.e., when the assignment occurs. To address the limitations of existing methods, we propose a novel deep learning-based method that not only seamlessly integrates the three types of knowledge for industry assignment but also takes the time-specificity of assignment-based knowledge into account. Methodologically, our method features two innovations: dynamic industry representation and hierarchical assignment. The former represents an industry as a sequence of time-specific vectors by integrating the three types of knowledge through our proposed temporal and spatial aggregation mechanisms. The latter takes industry and firm representations as inputs, computes the probability of assigning a firm to different industries, and assigns the firm to the industry with the highest probability.

preprint, 2022, arXiv

From $p$-Wasserstein Bounds to Moderate Deviations

We use a new method via $p$-Wasserstein bounds to prove Cramér-type moderate deviations in (multivariate) normal approximations. In the classical setting that $W$ is a standardized sum of $n$ independent and identically distributed (i.i.d.) random variables with sub-exponential tails, our method recovers the optimal range of $0\leq x=o(n^{1/6})$ and the near optimal error rate $O(1)(1+x)(\log n+x^2)/\sqrt{n}$ for $P(W>x)/(1-Φ(x))\to 1$, where $Φ$ is the standard normal distribution function. Our method also works for dependent random variables (vectors) and we give applications to the combinatorial central limit theorem, Wiener chaos, homogeneous sums and local dependence. The key step of our method is to show that the $p$-Wasserstein distance between the distribution of the random variable (vector) of interest and a normal distribution grows like $O(p^αΔ)$, $1\leq p\leq p_0$, for some constants $α, Δ$ and $p_0$. In the above i.i.d. setting, $α=1, Δ=1/\sqrt{n}, p_0=n^{1/3}$. For this purpose, we obtain general $p$-Wasserstein bounds in (multivariate) normal approximations using Stein's method.

preprint, 2022, arXiv

High order steady-state diffusion approximations

We derive and analyze new diffusion approximations of stationary distributions of Markov chains that are based on second- and higher-order terms in the expansion of the Markov chain generator. Our approximations achieve a higher degree of accuracy compared to diffusion approximations widely used for the past fifty years, while retaining a similar computational complexity. To support our approximations, we present a combination of theoretical and numerical results across three different models. Our approximations are derived recursively through Stein/Poisson equations, and the theoretical results are proved using Stein's method.

preprint, 2022, arXiv

High-dimensional properties for empirical priors in linear regression with unknown error variance

We study full Bayesian procedures for high-dimensional linear regression. We adopt data-dependent empirical priors introduced in [1]. In their paper, these priors have nice posterior contraction properties and are easy to compute. Our paper extends their theoretical results to the case of unknown error variance. Under a suitable sparsity assumption, we achieve model selection consistency, posterior contraction rates, and a Bernstein-von Mises theorem by analyzing the multivariate t-distribution.

preprint, 2022, arXiv

Posterior Consistency for Bayesian Relevance Vector Machines

Statistical modeling and inference problems with sample sizes substantially smaller than the number of available covariates are challenging. Chakraborty et al. (2012) did a full hierarchical Bayesian analysis of nonlinear regression in such situations using relevance vector machines based on reproducing kernel Hilbert space (RKHS). But they did not provide any theoretical properties associated with their procedure. The present paper revisits their problem, introduces a new class of global-local priors different from theirs, and provides results on posterior consistency as well as posterior contraction rates.

preprint, 2021, arXiv

2D-FFTLog: Efficient computation of real space covariance matrices for galaxy clustering and weak lensing

Accurate covariance matrices for two-point functions are critical for inferring cosmological parameters in likelihood analyses of large-scale structure surveys. Among various approaches to obtaining the covariance, analytic computation is much faster and less noisy than estimation from data or simulations. However, the transform of covariances from Fourier space to real space involves integrals with two Bessel integrals, which are numerically slow and easily affected by numerical uncertainties. Inaccurate covariances may lead to significant errors in the inference of the cosmological parameters. In this paper, we introduce a 2D-FFTLog algorithm for efficient, accurate and numerically stable computation of non-Gaussian real space covariances for both 3D and projected statistics. The 2D-FFTLog algorithm is easily extended to perform real space bin-averaging. We apply the algorithm to the covariances for galaxy clustering and weak lensing for a Dark Energy Survey Year 3-like and a Rubin Observatory's Legacy Survey of Space and Time Year 1-like survey, and demonstrate that for both surveys, our algorithm can produce numerically stable angular bin-averaged covariances with the flat sky approximation, which are sufficiently accurate for inferring cosmological parameters. The code CosmoCov for computing the real space covariances with or without the flat sky approximation is released along with this paper.

preprint, 2021, arXiv

Large-dimensional Central Limit Theorem with Fourth-moment Error Bounds on Convex Sets and Balls

We prove the large-dimensional Gaussian approximation of a sum of $n$ independent random vectors in $\mathbb{R}^d$ together with fourth-moment error bounds on convex sets and Euclidean balls. We show that compared with classical third-moment bounds, our bounds have near-optimal dependence on $n$ and can achieve improved dependence on the dimension $d$. For centered balls, we obtain an additional error bound that has a sub-optimal dependence on $n$, but recovers the known result of the validity of the Gaussian approximation if and only if $d=o(n)$. We discuss an application to the bootstrap. We prove our main results using Stein's method.

preprint, 2020, arXiv

Arcsine laws for random walks generated from random permutations with applications to genomics

A classical result for the simple symmetric random walk with $2n$ steps is that the number of steps above the origin, the time of the last visit to the origin, and the time of the maximum height all have exactly the same distribution and converge when scaled to the arcsine law. Motivated by applications in genomics, we study the distributions of these statistics for the non-Markovian random walk generated from the ascents and descents of a uniform random permutation and a Mallows($q$) permutation and show that they have the same asymptotic distributions as for the simple random walk. We also give an unexpected conjecture, along with numerical evidence and a partial proof in special cases, for the result that the number of steps above the origin by step $2n$ for the uniform permutation generated walk has exactly the same discrete arcsine distribution as for the simple random walk, even though the other statistics for these walks have very different laws. We also give explicit error bounds to the limit theorems using Stein's method for the arcsine distribution, as well as functional central limit theorems and a strong embedding of the Mallows$(q)$ permutation which is of independent interest.
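
The classical result mentioned at the start of this abstract can be checked exactly for small n. The sketch below covers only the simple symmetric random walk, not the permutation-generated walks studied in the paper. It assumes the usual convention that a step counts as "above the origin" when either of its endpoints is strictly positive, and compares brute-force enumeration with the discrete arcsine probabilities C(2k, k) C(2n-2k, n-k) / 4^n. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb


def time_above_distribution(n):
    """Exact law of the number of steps above the origin for a 2n-step
    simple symmetric random walk, by enumerating all 2^(2n) paths.
    Entry k is the probability that exactly 2k steps lie above the origin."""
    counts = [0] * (n + 1)
    for steps in product((1, -1), repeat=2 * n):
        position, above = 0, 0
        for s in steps:
            previous, position = position, position + s
            # A step is above the origin if either endpoint is positive.
            if previous > 0 or position > 0:
                above += 1
        counts[above // 2] += 1  # the count is always even
    total = 2 ** (2 * n)
    return [Fraction(c, total) for c in counts]


def discrete_arcsine(n):
    """Discrete arcsine probabilities C(2k,k) * C(2n-2k,n-k) / 4^n, k = 0..n."""
    return [Fraction(comb(2 * k, k) * comb(2 * (n - k), n - k), 4 ** n)
            for k in range(n + 1)]
```

Enumeration grows as 4^n, so this check is only feasible for small n; it is meant to make the statement concrete rather than to serve as a practical computation.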

preprint, 2020, arXiv

Beyond Limber: Efficient computation of angular power spectra for galaxy clustering and weak lensing

Angular two-point statistics of large-scale structure observables are important cosmological probes. To reach the high accuracy required by the statistical precision of future surveys, some of these statistics may need to be computed without the commonly employed Limber approximation; the exact computation, however, requires integration over Bessel functions, and a brute-force evaluation is slow to converge. We present a new method based on our generalized FFTLog algorithm for the efficient computation of angular power spectra beyond the Limber approximation. The new method significantly simplifies the calculation and improves the numerical speed and stability. It is easily extended to handle integrals involving derivatives of Bessel functions, making it equally applicable to numerically more challenging cases such as contributions from redshift-space distortions and Doppler effects. We implement our method for galaxy clustering and galaxy-galaxy lensing power spectra. We find that using the Limber approximation for galaxy clustering in future analyses like LSST Year 1 and DES Year 6 may cause significant biases in cosmological parameters, indicating that going beyond the Limber approximation is necessary for these analyses.

preprint, 2020, arXiv

High-dimensional Central Limit Theorems by Stein's Method

We obtain explicit error bounds for the $d$-dimensional normal approximation on hyperrectangles for a random vector that has a Stein kernel, or admits an exchangeable pair coupling, or is a non-linear statistic of independent random variables or a sum of $n$ locally dependent random vectors. We assume the approximating normal distribution has a non-singular covariance matrix. The error bounds vanish even when the dimension $d$ is much larger than the sample size $n$. We prove our main results using the approach of Götze (1991) in Stein's method, together with modifications of an estimate of Anderson, Hall and Titterington (1998) and a smoothing inequality of Bhattacharya and Rao (1976). For sums of $n$ independent and identically distributed isotropic random vectors having a log-concave density, we obtain an error bound that is optimal up to a $\log n$ factor. We also discuss an application to multiple Wiener-Itô integrals.

preprint, 2020, arXiv

New error bounds in multivariate normal approximations via exchangeable pairs with applications to Wishart matrices and fourth moment theorems

We extend Stein's celebrated Wasserstein bound for normal approximation via exchangeable pairs to the multi-dimensional setting. As an intermediate step, we exploit the symmetry of exchangeable pairs to obtain an error bound for smooth test functions. We also obtain a continuous version of the multi-dimensional Wasserstein bound in terms of fourth moments. We apply the main results to multivariate normal approximations to Wishart matrices of size $n$ and degree $d$, where we obtain the optimal convergence rate $\sqrt{n^3/d}$ under only moment assumptions, and to quadratic forms and Poisson functionals, where we strengthen a few of the fourth moment bounds in the literature on the Wasserstein distance.

preprint, 2020, arXiv

Normal Approximation and Fourth Moment Theorems for Monochromatic Triangles

Given a graph sequence $\{G_n\}_{n \geq 1}$, denote by $T_3(G_n)$ the number of monochromatic triangles in a uniformly random coloring of the vertices of $G_n$ with $c \geq 2$ colors. This arises as a generalization of the birthday paradox, where $G_n$ corresponds to a friendship network and $T_3(G_n)$ counts the number of triples of friends with matching birthdays. In this paper we prove a central limit theorem (CLT) for $T_3(G_n)$ with explicit error rates. The proof involves constructing a martingale difference sequence by carefully ordering the vertices of $G_n$, based on a certain combinatorial score function, and using a quantitative version of the martingale CLT. We then relate this error term to the well-known fourth moment phenomenon, which, interestingly, holds only when the number of colors $c \geq 5$. We also show that the convergence of the fourth moment is necessary to obtain a Gaussian limit for any $c \geq 2$, which, together with the above result, implies that the fourth-moment condition characterizes the limiting normal distribution of $T_3(G_n)$, whenever $c \geq 5$. Finally, to illustrate the promise of our approach, we include an alternative proof of the CLT for the number of monochromatic edges, which provides quantitative rates for the results obtained in Bhattacharya et al. (2017).
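
To make the statistic $T_3(G_n)$ concrete, here is a minimal sketch that counts monochromatic triangles in a small graph and computes the exact mean over all colorings. It illustrates the definition only, not the paper's martingale argument. The function names are hypothetical. By linearity of expectation the mean should equal the number of triangles divided by $c^2$, since each triangle is monochromatic with probability $1/c^2$.

```python
from fractions import Fraction
from itertools import combinations, product


def triangles(n_vertices, edges):
    """List all vertex triples of the graph that form a triangle."""
    edge_set = {frozenset(e) for e in edges}
    return [t for t in combinations(range(n_vertices), 3)
            if all(frozenset(pair) in edge_set for pair in combinations(t, 2))]


def monochromatic_triangles(tris, coloring):
    """T_3 for one coloring: triangles whose three vertices share a color."""
    return sum(1 for u, v, w in tris
               if coloring[u] == coloring[v] == coloring[w])


def exact_mean_t3(n_vertices, edges, c):
    """Exact expectation of T_3 under a uniformly random coloring with c
    colors, obtained by enumerating all c^n_vertices colorings."""
    tris = triangles(n_vertices, edges)
    total = sum(monochromatic_triangles(tris, coloring)
                for coloring in product(range(c), repeat=n_vertices))
    return Fraction(total, c ** n_vertices)


# Complete graph on 4 vertices: it contains 4 triangles.
k4_edges = list(combinations(range(4), 2))
mean_two_colors = exact_mean_t3(4, k4_edges, 2)
```

In the birthday-paradox reading from the abstract, the vertices are people, the edges are friendships, and the colors are birthdays.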

preprint, 2019, arXiv

An efficient method for mapping the 12C+12C molecular resonances at low energies

The 12C+12C fusion reaction is famous for its complicated molecular resonances, and plays an important role in both nuclear structure and astrophysics. It is extremely difficult to measure the cross sections of 12C+12C fusion at energies of astrophysical relevance due to very low reaction yields. To measure the complicated resonant structure existing in this important reaction, an efficient thick target method has been developed and applied for the first time at energies Ec.m. < 5.3 MeV. A scan of the cross sections over a relatively wide range of energies can be carried out using only a single beam energy. The result of the measurement at Ec.m. = 4.1 MeV is compared with other results from previous work. This method would be useful for searching for potential resonances of 12C+12C in the energy range 1 MeV < Ec.m. < 3 MeV.