Source author record

Bin Yu

Bin Yu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Catalog footprint

What is connected

62 works

32 topics

4 close collaborators

preprint, 2026, arXiv

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines

preprint, 2026, arXiv

MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

preprint, 2026, arXiv

PhysBrain 1.0 Technical Report

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

preprint, 2022, arXiv

Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves accuracy, as well as its interpretability by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on Github (github.com/csinva/imodels)
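
The shrinkage rule described above can be sketched for a single root-to-leaf path. This is a minimal illustration, not the imodels implementation: the function name `hs_predict` and its inputs are hypothetical, and the update form (each parent-to-child change in node mean divided by 1 + lam / N(parent)) is an assumption about the rule that the abstract summarizes.

```python
def hs_predict(node_means, node_counts, lam):
    """Shrunk prediction for one root-to-leaf path (illustrative sketch).

    node_means[i]  -- sample mean of the response in the i-th node on the path
    node_counts[i] -- number of training points falling in that node
    lam            -- regularization parameter (lam = 0 means no shrinkage)
    """
    pred = node_means[0]  # start from the root mean
    for parent in range(len(node_means) - 1):
        step = node_means[parent + 1] - node_means[parent]
        # Assumed rule: damp each step more when the parent holds few samples.
        pred += step / (1.0 + lam / node_counts[parent])
    return pred
```

With `lam = 0` the function returns the unshrunk leaf mean; as `lam` grows, the prediction moves toward the root mean, and the tree structure itself is never modified.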

preprint, 2022, arXiv

Instability, Computational Efficiency and Statistical Accuracy

Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps -- namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$.

preprint, 2022, arXiv

Learning Using Privileged Information for Zero-Shot Action Recognition

Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and aim to directly learn a mapping from a visual space to the semantic space. This approach has been challenged by the semantic gap between the visual space and semantic space. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and hence effectively assist the learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics during testing without explicitly extracting objects, and a cross-attention module is developed to augment visual features with the object semantics. Experiments on the Olympic Sports, HMDB51 and UCF101 datasets show that the proposed method outperforms the state-of-the-art methods by a large margin.

preprint, 2022, arXiv

Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests

Random Forests (RF) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative Random Forests (iRF) use a tree ensemble from iteratively modified RF to obtain predictive and stable non-linear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a novel discontinuous nonlinear regression model, called the Locally Spiky Sparse (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called Depth-Weighted Prevalence (DWP) for a set of signed features S. Intuitively speaking, DWP(S) measures how frequently features in S appear together in an RF tree ensemble. We prove that, with high probability, DWP(S) attains a universal upper bound that does not involve any model coefficients, if and only if S corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model even when some assumptions are violated.

preprint, 2022, arXiv

Seven Principles for Rapid-Response Data Science: Lessons Learned from Covid-19 Forecasting

In this article, we take a step back to distill seven principles out of our experience in the spring of 2020, when our 12-person rapid-response team used skills of data science and beyond to help distribute Covid PPE. This process included tapping into domain knowledge of epidemiology and medical logistics chains, curating a relevant data repository, developing models for short-term county-level death forecasting in the US, and building a website for sharing visualization (an automated AI machine). The principles are described in the context of working with Response4Life, a then-new nonprofit organization, to illustrate their necessity. Many of these principles overlap with those in standard data-science teams, but an emphasis is put on dealing with problems that require rapid response, often resembling agile software development.

preprint, 2022, arXiv

SOFFLFM: Super-resolution optical fluctuation Fourier light-field microscopy

Fourier light-field microscopy (FLFM) uses a micro-lens array (MLA) to segment the Fourier plane of the microscope objective lens to generate multiple two-dimensional perspective views, thereby reconstructing the three-dimensional (3D) structure of the sample using 3D deconvolution without scanning. However, the resolution of FLFM is still limited by diffraction and, furthermore, depends on the aperture division. To improve its resolution, super-resolution optical fluctuation Fourier light-field microscopy (SOFFLFM) is proposed here, in which the super-resolution optical fluctuation imaging (SOFI) method is introduced into FLFM. SOFFLFM applies higher-order cumulant statistical analysis to an image sequence collected by FLFM, and then carries out 3D deconvolution to reconstruct the 3D structure of the sample. The theoretical basis for the resolution improvement of SOFFLFM is explained and then verified with simulations. Simulation results demonstrate that SOFFLFM improves lateral and axial resolution by factors of more than $\sqrt{2}$ and 2 at the 2nd- and 4th-order cumulants, respectively, compared with FLFM.
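
The cumulant step mentioned above can be illustrated at its lowest order. The sketch below computes the per-pixel second-order (zero-lag) cumulant of an image sequence, which is just the temporal variance of each pixel. The function name `second_order_cumulant` and the nested-list image layout are hypothetical choices for illustration; the per-view processing, the 4th-order cumulant and the 3D deconvolution of the actual method are not shown.

```python
def second_order_cumulant(stack):
    """Per-pixel second-order cumulant (temporal variance) of an image stack.

    stack -- list of frames; each frame is a list of rows of pixel intensities
    """
    n_frames = len(stack)
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Intensity of one pixel across all frames.
            series = [frame[r][c] for frame in stack]
            mean = sum(series) / n_frames
            out[r][c] = sum((v - mean) ** 2 for v in series) / n_frames
    return out
```

Pixels whose intensity does not fluctuate get a value of zero, so a static background is suppressed while fluctuating emitters remain.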

preprint, 2022, arXiv

Towards Robust Waveform-Based Acoustic Models

We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.

preprint, 2021, arXiv

Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients

Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo sampling algorithm for drawing samples from smooth probability densities over continuous spaces. We study the variant most widely used in practice, Metropolized HMC with the Störmer-Verlet or leapfrog integrator, and make two primary contributions. First, we provide a non-asymptotic upper bound on the mixing time of the Metropolized HMC with explicit choices of step-size and number of leapfrog steps. This bound gives a precise quantification of the faster convergence of Metropolized HMC relative to simpler MCMC algorithms such as the Metropolized random walk, or Metropolized Langevin algorithm. Second, we provide a general framework for sharpening mixing time bounds of Markov chains initialized at a substantial distance from the target distribution over continuous spaces. We apply this sharpening device to the Metropolized random walk and Langevin algorithms, thereby obtaining improved mixing time bounds from a non-warm initial distribution.
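
For readers unfamiliar with the algorithm being analyzed, a minimal sketch of Metropolized HMC with the leapfrog integrator follows. It targets a one-dimensional standard Gaussian; the step size, the number of leapfrog steps and all function names are illustrative assumptions, not the choices prescribed by the paper's bounds.

```python
import math
import random


def leapfrog(x, p, grad_u, step, n_steps):
    """Leapfrog (Stormer-Verlet) integration of Hamiltonian dynamics."""
    p = p - 0.5 * step * grad_u(x)       # initial half step for momentum
    for i in range(n_steps):
        x = x + step * p                 # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(x)     # full step for momentum
    p = p - 0.5 * step * grad_u(x)       # final half step for momentum
    return x, p


def hmc_step(x, u, grad_u, step, n_steps, rng):
    """One Metropolized HMC transition for potential u = -log density."""
    p0 = rng.gauss(0.0, 1.0)             # resample momentum
    x_new, p_new = leapfrog(x, p0, grad_u, step, n_steps)
    h_old = u(x) + 0.5 * p0 * p0
    h_new = u(x_new) + 0.5 * p_new * p_new
    # Metropolis accept/reject corrects the discretization error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x


def u(x):
    return 0.5 * x * x                   # standard Gaussian potential


def grad_u(x):
    return x


def sample(n, step=0.3, n_steps=5, seed=0):
    rng = random.Random(seed)
    x = 3.0                              # deliberately off-center start
    draws = []
    for _ in range(n):
        x = hmc_step(x, u, grad_u, step, n_steps, rng)
        draws.append(x)
    return draws
```

Calling `sample(5000)` returns draws whose empirical mean and variance should be close to 0 and 1. The accept/reject step is what distinguishes the Metropolized variant from unadjusted HMC.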

preprint, 2020, arXiv

A Survey on Dynamic Network Embedding

Real-world networks are composed of diverse interacting and evolving entities, yet most existing research simply characterizes them as particular static networks, without considering the evolution trend in dynamic networks. Recently, significant progress has been made in tracking the properties of dynamic networks, exploiting changes of entities and links in the network to devise network embedding techniques. Compared to widely proposed static network embedding methods, dynamic network embedding endeavors to encode nodes as low-dimensional dense representations that effectively preserve the network structures and the temporal dynamics, which benefits a wide range of downstream machine learning tasks. In this paper, we conduct a systematic survey on dynamic network embedding. Specifically, we describe the basic concepts of dynamic network embedding and propose a novel taxonomy of existing dynamic network embedding techniques, covering matrix factorization based, Skip-Gram based, autoencoder based, neural network based and other embedding methods. Additionally, we carefully summarize the commonly used datasets and a wide variety of downstream tasks that dynamic network embedding can benefit. Finally, and most importantly, we identify several challenges that existing algorithms face and outline possible directions to facilitate future research, such as dynamic embedding models, large-scale dynamic networks, heterogeneous dynamic networks, dynamic attributed networks, task-oriented dynamic network embedding and more embedding spaces.

preprint, 2020, arXiv

A Systematic Study of the dust of Galactic Supernova Remnants I. The Distance and the Extinction

By combining the photometric, spectroscopic, and astrometric information of the stars in the sightline of Galactic supernova remnants (SNRs), the distances to and the extinctions of 32 SNRs are investigated. The stellar atmospheric parameters are from the SDSS-DR14/APOGEE and LAMOST-DR5/LEGUE spectroscopic surveys. The multi-band photometry, from optical to infrared, is collected from the Gaia, APASS, Pan-STARRS1, 2MASS, and WISE surveys. With the calibrated Gaia distances of individual stars, the distances to 15 of 32 SNRs are well determined from their produced extinction and association with molecular clouds. The upper limits of distance are derived for 3 SNRs. The color excess ratios $E(g_{\rm P1}-\lambda) / E(g_{\rm P1}-r_{\rm P1})$ of 32 SNRs are calculated, and their variation with wavebands is fitted by a simple dust model. The inferred dust grain size distribution bifurcates: while the graphite grains have comparable size to the average ISM dust, the silicate grains are generally larger. Along the way, the average extinction law from optical to near-infrared of the Milky Way is derived from the 1.3 million star sample and found to agree with the CCM89 law with $R_{\rm V}=3.15$.

preprint, 2020, arXiv

Classifying expanding attractors on figure eight knot complement space and non-transitive Anosov flows on Franks-Williams manifold

The path closure of the figure eight knot complement space, $N_0$, supports a natural DA (derived from Anosov) expanding attractor. Using this attractor, Franks and Williams constructed the first example of a non-transitive Anosov flow, on the manifold $M_0$ obtained by gluing two copies of $N_0$ through the identity map along their boundaries, called the Franks-Williams manifold. In this paper, our main goal is to classify expanding attractors on $N_0$ and non-transitive Anosov flows on $M_0$. We prove that, up to orbit-equivalence, the DA expanding attractor is the unique expanding attractor supported by $N_0$, and the non-transitive Anosov flow constructed by Franks and Williams is the unique non-transitive Anosov flow admitted by $M_0$. Moreover, more general cases are also discussed. In particular, we completely classify non-transitive Anosov flows on a family of infinitely many toroidal $3$-manifolds with two hyperbolic pieces, obtained by gluing two copies of $N_0$ through any gluing homeomorphism.

preprint, 2020, arXiv

Incremental causal effects

Causal evidence is needed to act and it is often enough for the evidence to point towards a direction of the effect of an action. For example, policymakers might be interested in estimating the effect of slightly increasing taxes on private spending across the whole population. We study identifiability and estimation of causal effects, where a continuous treatment is slightly shifted across the whole population (termed average partial effect or incremental causal effect). We show that incremental effects are identified under local ignorability and local overlap assumptions, where exchangeability and positivity only hold in a neighborhood of units. Average treatment effects are not identified under these assumptions. In this case, and under a smoothness condition, the incremental effect can be estimated via the average derivative. Moreover, we prove that in certain finite-sample observational settings, estimating the incremental effect is easier than estimating the average treatment effect in terms of asymptotic variance. For high-dimensional settings, we develop a simple feature transformation that allows for doubly-robust estimation and inference of incremental causal effects. Finally, we compare the behaviour of estimators of the incremental treatment effect and average treatment effect in experiments including data-inspired simulations.

preprint, 2020, arXiv

Singularity, Misspecification, and the Convergence Rate of EM

A line of recent work has analyzed the behavior of the Expectation-Maximization (EM) algorithm in the well-specified setting, in which the population likelihood is locally strongly concave around its maximizing argument. Examples include suitably separated Gaussian mixture models and mixtures of linear regressions. We consider over-specified settings in which the number of fitted components is larger than the number of components in the true distribution. Such misspecified settings can lead to singularity in the Fisher information matrix, and moreover, the maximum likelihood estimator based on $n$ i.i.d. samples in $d$ dimensions can have a non-standard $\mathcal{O}((d/n)^{\frac{1}{4}})$ rate of convergence. Focusing on the simple setting of two-component mixtures fit to a $d$-dimensional Gaussian distribution, we study the behavior of the EM algorithm both when the mixture weights are different (unbalanced case), and are equal (balanced case). Our analysis reveals a sharp distinction between these two cases: in the former, the EM algorithm converges geometrically to a point at Euclidean distance of $\mathcal{O}((d/n)^{\frac{1}{2}})$ from the true parameter, whereas in the latter case, the convergence rate is exponentially slower, and the fixed point has a much lower $\mathcal{O}((d/n)^{\frac{1}{4}})$ accuracy. Analysis of this singular case requires the introduction of some novel techniques: in particular, we make use of a careful form of localization in the associated empirical process, and develop a recursive argument to progressively sharpen the statistical rate.
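
The slow convergence in the balanced case can be explored numerically. The sketch below assumes a one-dimensional, unit-variance symmetric fit 0.5 N(theta, 1) + 0.5 N(-theta, 1) to data drawn from N(0, 1); this parametrization and the names `em_update` and `run_em` are illustrative assumptions, not taken from the abstract. Under that assumption the E- and M-steps collapse into the single update theta <- (1/n) * sum_i x_i * tanh(theta * x_i).

```python
import math
import random


def em_update(theta, xs):
    """One EM step for the assumed symmetric two-component fit.

    tanh(theta * x) equals (2 * responsibility - 1) for the +theta component,
    so the M-step is a responsibility-weighted mean of the data.
    """
    return sum(x * math.tanh(theta * x) for x in xs) / len(xs)


def run_em(xs, theta0=1.0, n_iter=200):
    """Iterate EM from theta0 and return the full path of iterates."""
    theta = theta0
    path = [theta]
    for _ in range(n_iter):
        theta = em_update(theta, xs)
        path.append(theta)
    return path


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # true model: one Gaussian
path = run_em(data)
```

Inspecting `path` shows the iterates drifting toward a small value near zero over many iterations rather than contracting geometrically, which is the qualitative behavior the abstract attributes to the singular, balanced setting.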

preprint, 2020, arXiv

Structural Compression of Convolutional Neural Networks

Deep convolutional neural networks (CNNs) have been successful in many tasks in machine vision; however, the millions of weights in the form of thousands of convolutional filters in CNNs make them difficult for human interpretation or understanding in science. In this article, we introduce CAR, a greedy structural compression scheme to obtain smaller and more interpretable CNNs, while achieving close to original accuracy. The compression is based on pruning filters with the least contribution to the classification accuracy. We demonstrate the interpretability of CAR-compressed CNNs by showing that our algorithm prunes filters with visually redundant functionalities such as color filters. These compressed networks are easier to interpret because they retain the filter diversity of uncompressed networks with an order of magnitude fewer filters. Finally, a variant of CAR is introduced to quantify the importance of each image category to each CNN filter. Specifically, the most and the least important class labels are shown to be meaningful interpretations of each filter.

preprint, 2019, arXiv

Veridical Data Science

Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprised of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. Moreover, we develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others and compare it to existing methods in high dimensional, sparse linear model simulations. Over a wide range of misspecified simulation models, PCS inference demonstrates favorable performance in terms of ROC curves. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.

preprint, 2017, arXiv

Iterative Random Forests to detect predictive and stable high-order interactions

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

preprint, 2016, arXiv

Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs

The sizes of Markov equivalence classes of directed acyclic graphs play important roles in measuring the uncertainty and complexity in causal learning. A Markov equivalence class can be represented by an essential graph and its undirected subgraphs determine the size of the class. In this paper, we develop a method to derive the formulas for counting the sizes of Markov equivalence classes. We first introduce a new concept of core graph. The size of a Markov equivalence class of interest is a polynomial of the number of vertices given its core graph. Then, we discuss the recursive and explicit formula of the polynomial, and provide an algorithm to derive the size formula via symbolic computation for any given core graph. The proposed size formula derivation sheds light on the relationships between the size of a Markov equivalence class and its representation graph, and makes size counting efficient, even when the essential graphs contain non-sparse undirected subgraphs.

preprint, 2016, arXiv

Local identifiability of $l_1$-minimization dictionary learning: a sufficient and almost necessary condition

We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$-minimization. We assume that $\mathbf x_i$'s are i.i.d. random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.

preprint, 2015, arXiv

A Novel Scattered Pilot Design for FBMC/OQAM Systems

Filter bank multi-carrier with offset quadrature amplitude modulation (FBMC/OQAM) has been heavily studied as an alternative waveform for 5G systems. Its advantages of higher spectrum efficiency, localized frequency response and insensitivity to synchronization errors may enable promising performance when orthogonal frequency division multiplexing (OFDM) fails. However, performing channel estimation under the intrinsic interference has been a fundamental obstacle towards adopting FBMC/OQAM in a practical system. Several schemes are available but their performance is far from satisfactory. In this paper, we show that the existing methods are trapped by the paradigm that a clean pilot is mandatory so as to explicitly carry a reference symbol to the receiver for the purpose of channel estimation. By breaking this paradigm, a novel dual dependent pilot scheme is proposed, which gives up the independent pilot and derives dual pilots from the imposed interference. By doing this, the interference between pilots can be fully utilized. Consequently, the new scheme significantly outperforms existing solutions and the simulation results show FBMC/OQAM can achieve close-to-OFDM performance in a practical system even in the presence of strong intrinsic interference.

preprint, 2015, arXiv

A spectral-like decomposition for transitive Anosov flows in dimension three

Given a (transitive or non-transitive) Anosov vector field $X$ on a closed three-dimensional manifold $M$, one may try to decompose $(M,X)$ by cutting $M$ along two-tori transverse to $X$. We prove that one can find a finite collection $\{T_1,\dots,T_n\}$ of pairwise disjoint, pairwise non-parallel incompressible tori transverse to $X$, such that the maximal invariant sets $\Lambda_1,\dots,\Lambda_m$ of the connected components $V_1,\dots,V_m$ of $M-(T_1\cup\dots\cup T_n)$ satisfy the following properties: (1) each $\Lambda_i$ is a compact invariant locally maximal transitive set for $X$; (2) the collection $\{\Lambda_1,\dots,\Lambda_m\}$ is canonically attached to the pair $(M,X)$ (i.e., it can be defined independently of the collection of tori $\{T_1,\dots,T_n\}$); (3) the $\Lambda_i$'s are the smallest possible: for every (possibly infinite) collection $\{S_i\}_{i\in I}$ of tori transverse to $X$, the $\Lambda_i$'s are contained in the maximal invariant set of $M-\cup_i S_i$. To a certain extent, the sets $\Lambda_1,\dots,\Lambda_m$ are analogs (for Anosov vector field in dimension 3) of the basic pieces which appear in the spectral decomposition of a non-transitive axiom A vector field. Then we discuss the uniqueness of such a decomposition: we prove that the pieces of the decomposition $V_1,\dots,V_m$, equipped with the restriction of the Anosov vector field $X$, are "almost unique up to topological equivalence".

preprint2015arXiv

Co-clustering for directed graphs: the Stochastic co-Blockmodel and spectral algorithm Di-Sim

Directed graphs have asymmetric connections, yet the current graph clustering methodologies cannot identify the potentially global structure of these asymmetries. We give a spectral algorithm called di-sim that builds on a dual measure of similarity that corresponds to how a node (i) sends and (ii) receives edges. Using di-sim, we analyze the global asymmetries in the networks of Enron emails, political blogs, and the C. elegans neural connectome. In each example, a small subset of nodes have persistent asymmetries; these nodes send edges with one cluster, but receive edges with another cluster. Previous approaches would have assigned these asymmetric nodes to only one cluster, failing to identify their sending/receiving asymmetries. Regularization and "projection" are two steps of di-sim that are essential for spectral clustering algorithms to work in practice. The theoretical results show that these steps make the algorithm weakly consistent under the degree corrected Stochastic co-Blockmodel, a model that generalizes the Stochastic Blockmodel to allow for both (i) degree heterogeneity and (ii) the global asymmetries that we intend to detect. The theoretical results make no assumptions on the smallest degree nodes. Instead, the theorem requires that the average degree grows sufficiently fast and that the weak consistency only applies to the subset of the nodes with sufficiently large leverage scores. The results also apply to bipartite graphs.

preprint2015arXiv

Estimation Stability with Cross Validation (ESCV)

Cross-validation (CV) is often used to select the regularization parameter in high dimensional problems. However, when applied to the sparse modeling method Lasso, CV leads to models that are unstable in high-dimensions, and consequently not suited for reliable interpretation. In this paper, we propose a model-free criterion ESCV based on a new estimation stability (ES) metric and CV. Our proposed ESCV finds a locally ES-optimal model smaller than the CV choice so that it fits the data and also enjoys the estimation stability property. We demonstrate that ESCV is an effective alternative to CV at a similar easily parallelizable computational cost. In particular, we compare the two approaches with respect to several performance measures when applied to the Lasso on both simulated and real data sets. For dependent predictors common in practice, our main finding is that ESCV cuts down false positive rates often by a large margin, while sacrificing little of true positive rates. ESCV usually outperforms CV in terms of parameter estimation while giving similar performance as CV in terms of prediction. For the two real data sets from neuroscience and cell biology, the models found by ESCV are less than half of the model sizes by CV. Judged based on subject knowledge, they are more plausible than those by CV as well. We also discuss some regularization parameter alignment issues that come up in both approaches.

preprint2015arXiv

Lasso adjustments of treatment effect estimates in randomized experiments

We provide a principled way for investigators to analyze randomized experiments when the number of covariates is large. Investigators often use linear multivariate regression to analyze randomized experiments instead of simply reporting the difference of means between treatment and control groups. Their aim is to reduce the variance of the estimated treatment effect by adjusting for covariates. If there are a large number of covariates relative to the number of observations, regression may perform poorly because of overfitting. In such cases, the Lasso may be helpful. We study the resulting Lasso-based treatment effect estimator under the Neyman-Rubin model of randomized experiments. We present theoretical conditions that guarantee that the estimator is more efficient than the simple difference-of-means estimator, and we provide a conservative estimator of the asymptotic variance, which can yield tighter confidence intervals than the difference-of-means estimator. Simulation and data examples show that Lasso-based adjustment can be advantageous even when the number of covariates is less than the number of observations. Specifically, a variant using Lasso for selection and OLS for estimation performs particularly well, and it chooses a smoothing parameter based on combined performance of Lasso and OLS.

preprint2015arXiv

Optimal Subsampling Approaches for Large Sample Linear Regression

A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenarios: approximating the full sample ordinary least-squares (OLS) estimator and estimating the coefficients in linear regression. We present two algorithms, a weighted estimation algorithm and an unweighted estimation algorithm, and analyze asymptotic behaviors of their resulting subsample estimators under general conditions. For the weighted estimation algorithm, we propose a criterion for selecting the optimal sampling probability by making use of the asymptotic results. On the basis of the criterion, we provide two novel subsampling methods, the optimal subsampling and the predictor-length subsampling methods. The predictor-length subsampling method is based on the L2 norm of predictors rather than leverage scores. Its computational cost is scalable. For the unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent to the full sample OLS estimator. However, it has better performance than the weighted estimation algorithm for estimating the coefficients. Simulation studies and a real data example are used to demonstrate the effectiveness of our proposed subsampling methods.
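As a rough illustration of the predictor-length idea described above, the following Python sketch draws a subsample with probabilities proportional to the L2 norm of each predictor row and then fits a weighted least-squares slope. The function names, the restriction to a single predictor without intercept, sampling with replacement, and the inverse-probability weights are all assumptions made for illustration; this is not the paper's algorithm or its optimality criterion.

```python
import math
import random

def predictor_length_probs(X):
    """Sampling probabilities proportional to the L2 norm of each row of X."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    total = sum(norms)
    return [n / total for n in norms]

def weighted_subsample_slope(x, y, r, seed=0):
    """Hypothetical helper: slope of y ~ beta * x from a weighted subsample.

    Draws r indices with replacement using predictor-length probabilities,
    then solves weighted least squares with weights 1 / probability
    (an assumed weighting, chosen to offset the unequal sampling).
    """
    rng = random.Random(seed)
    probs = predictor_length_probs([[xi] for xi in x])
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    num = sum(x[i] * y[i] / probs[i] for i in idx)
    den = sum(x[i] * x[i] / probs[i] for i in idx)
    return num / den

# Noiseless toy data with true slope 2.0 (illustrative values only).
x = [0.5 * i for i in range(1, 201)]
y = [2.0 * xi for xi in x]
probs = predictor_length_probs([[xi] for xi in x])
slope = weighted_subsample_slope(x, y, r=20)
```

On noiseless data the subsample slope matches the true slope up to floating-point rounding, whichever rows are drawn; with noisy data the choice of sampling probabilities affects the variance of the estimate.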

preprint2015arXiv

The geometry of kernelized spectral clustering

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of nonparametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two nonoverlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, and under the same assumption, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the fraction of samples mislabeled by spectral clustering under finite mixtures with nonparametric components.

preprint2014arXiv

Asymptotic Properties of Lasso+mLS and Lasso+Ridge in Sparse High-dimensional Linear Regression

We study the asymptotic properties of Lasso+mLS and Lasso+Ridge under the sparse high-dimensional linear regression model: Lasso selecting predictors and then modified Least Squares (mLS) or Ridge estimating their coefficients. First, we propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after Lasso+mLS and Lasso+Ridge. Second, we derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. More specifically, we show that their biases decay at an exponential rate and they can achieve the oracle convergence rate of $s/n$ (where $s$ is the number of nonzero regression coefficients and $n$ is the sample size) for mean squared error (MSE). Third, we show that Lasso+mLS and Lasso+Ridge are asymptotically normal. They have an oracle property in the sense that they can select the true predictors with probability converging to 1 and the estimates of nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance. In fact, our analysis is not limited to adopting Lasso in the selection stage, but is applicable to any other model selection criteria with exponentially decaying rates of the probability of selecting wrong models.

preprint2014arXiv

Concise comparative summaries (CCS) of large text corpora with a human experiment

In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word frequency based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive are then harvested as the summary. To validate our tool, we designed and conducted a human survey, using news articles from the New York Times international section, to compare the different summarizers with human understanding. We demonstrate our approach with two case studies, a media analysis of the framing of "Egypt" in the New York Times throughout the Arab Spring and an informal comparison of the New York Times' and Wall Street Journal's coverage of "energy." Overall, we find that the Lasso with $L^2$ normalization can be used effectively to summarize large corpora, regardless of document size.

preprint2014arXiv

Depth $0$ Nonsingular Morse Smale flows on $S^3$

In this paper, we first extend the concept of Lyapunov graph to weighted Lyapunov graph (abbreviated as WLG) for nonsingular Morse-Smale flows (abbreviated as NMS flows) on $S^3$. WLG is quite sensitive to NMS flows on $S^3$. For instance, WLG detects the indexed links of NMS flows. Then we use WLG and some other tools to describe nonsingular Morse-Smale flows without heteroclinic trajectories connecting saddle orbits (abbreviated as depth $0$ NMS flows). The work proceeds in the following directions: (1) we use WLG to list depth $0$ NMS flows on $S^3$; (2) with the help of WLG, and in comparison with Wada's algorithm, we provide a direct description of the (indexed) link of depth $0$ NMS flows; (3) to overcome the weakness that WLG cannot determine the topological equivalence class, we give a simplified Umanskii Theorem to decide when two depth $0$ NMS flows on $S^3$ are topologically equivalent; (4) using these results, we classify (up to topological equivalence) all depth $0$ NMS flows on $S^3$ with no more than 4 periodic orbits.

preprint2014arXiv

Error Rate Bounds and Iterative Weighted Majority Voting for Crowdsourcing

Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling, and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes our upper bound on the mean error rate of weighted majority voting in certain settings. We propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one-step version has a provable theoretical guarantee on the error rate. The IWMV method is intuitive and computationally simple. Experimental results on simulated and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has a much lower computational cost (around one hundred times faster) than the state-of-the-art methods.
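To make the aggregation rules named above concrete, here is a minimal Python sketch of plain and weighted majority voting for a single multi-class task. The function name, the dictionary layout, the caller-supplied worker weights and the tie-breaking rule are assumptions for illustration; it does not implement the iterative reweighting of IWMV or the oracle MAP rule.

```python
from collections import defaultdict

def weighted_majority_vote(labels, weights=None):
    """Aggregate one task's labels.

    labels:  dict mapping worker id -> label given by that worker.
    weights: optional dict mapping worker id -> non-negative weight;
             workers without an entry (or weights=None) count as 1.0,
             which reduces to plain majority voting.
    """
    scores = defaultdict(float)
    for worker, label in labels.items():
        w = 1.0 if weights is None else weights.get(worker, 1.0)
        scores[label] += w
    # Highest total weight wins; ties go to the smallest label (assumed rule).
    return min(scores, key=lambda lab: (-scores[lab], lab))

# Toy task with three hypothetical workers.
task = {"w1": "cat", "w2": "dog", "w3": "dog"}
plain = weighted_majority_vote(task)
weighted = weighted_majority_vote(task, {"w1": 3.0, "w2": 1.0, "w3": 1.0})
```

With equal weights the two "dog" votes win; giving worker `w1` a weight of 3.0 flips the aggregate to "cat", which shows how a reliability-based weighting can override a raw majority.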

preprint2014arXiv

Impact of regularization on Spectral Clustering

The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et al. (2012). Here, we provide an attempt at quantifying this improvement through theoretical analysis. Under the stochastic block model (SBM), and its extensions, previous results on spectral clustering relied on the minimum degree of the graph being sufficiently large for its good performance. By examining the scenario where the regularization parameter $τ$ is large we show that the minimum degree assumption can potentially be removed. As a special case, for an SBM with two blocks, the results require the maximum degree to be large (grow faster than $\log n$) as opposed to the minimum degree. More importantly, we show the usefulness of regularization in situations where not all nodes belong to well-defined clusters. Our results rely on a "bias-variance"-like trade-off that arises from understanding the concentration of the sample Laplacian and the eigen gap as a function of the regularization parameter. As a byproduct of our bounds, we propose a data-driven technique DKest (standing for estimated Davis-Kahan bounds) for choosing the regularization parameter. This technique is shown to work well through simulations and on a real data set.

preprint2014arXiv

Reversible MCMC on Markov equivalence classes of sparse directed acyclic graphs

Graphical models are popular statistical tools which are used to represent dependent or causal complex systems. Statistically equivalent causal or directed graphical models are said to belong to a Markov equivalence class. It is of great interest to describe and understand the space of such classes. However, with currently known algorithms, sampling over such classes is only feasible for graphs with fewer than approximately 20 vertices. In this paper, we design reversible irreducible Markov chains on the space of Markov equivalence classes by proposing a perfect set of operators that determine the transitions of the Markov chain. The stationary distribution of a proposed Markov chain has a closed form and can be computed easily. Specifically, we construct a concrete perfect set of operators on sparse Markov equivalence classes by introducing appropriate conditions on each possible operator. Algorithms and their accelerated versions are provided to efficiently generate Markov chains and to explore properties of Markov equivalence classes of sparse directed acyclic graphs (DAGs) with thousands of vertices. We find experimentally that in most Markov equivalence classes of sparse DAGs, (1) most edges are directed, (2) most undirected subgraphs are small and (3) the number of these undirected subgraphs grows approximately linearly with the number of vertices. The article contains supplement arXiv:1303.0632, http://dx.doi.org/10.1214/13-AOS1125SUPP

preprint2014arXiv

Statistical guarantees for the EM algorithm: From population to sample-based analysis

We develop a general framework for proving rigorous guarantees on the performance of the EM algorithm and a variant known as gradient EM. Our analysis is divided into two parts: a treatment of these algorithms at the population level (in the limit of infinite data), followed by results that apply to updates based on a finite set of samples. First, we characterize the domain of attraction of any global maximizer of the population likelihood. This characterization is based on a novel view of the EM updates as a perturbed form of likelihood ascent, or in parallel, of the gradient EM updates as a perturbed form of standard gradient ascent. Leveraging this characterization, we then provide non-asymptotic guarantees on the EM and gradient EM algorithms when applied to a finite set of samples. We develop consequences of our general theory for three canonical examples of incomplete-data problems: mixture of Gaussians, mixture of regressions, and linear regression with covariates missing completely at random. In each case, our theory guarantees that with a suitable initialization, a relatively small number of EM (or gradient EM) steps will yield (with high probability) an estimate that is within statistical error of the MLE. We provide simulations to confirm this theoretically predicted behavior.

preprint2014arXiv

The shuffle estimator for explainable variance in fMRI experiments

In computational neuroscience, it is important to estimate well the proportion of signal variance in the total variance of neural activity measurements. This explainable variance measure helps neuroscientists assess the adequacy of predictive models that describe how images are encoded in the brain. Complicating the estimation problem are strong noise correlations, which may confound the neural responses corresponding to the stimuli. If not properly taken into account, the correlations could inflate the explainable variance estimates and suggest false possible prediction accuracies. We propose a novel method to estimate the explainable variance in functional MRI (fMRI) brain activity measurements when there are strong correlations in the noise. Our shuffle estimator is nonparametric, unbiased, and built upon the random effect model reflecting the randomization in the fMRI data collection process. Leveraging symmetries in the measurements, our estimator is obtained by appropriately permuting the measurement vector in such a way that the noise covariance structure is intact but the explainable variance is changed after the permutation. This difference is then used to estimate the explainable variance. We validate the properties of the proposed method in simulation experiments. For the image-fMRI data, we show that the shuffle estimates can explain the variation in prediction accuracy for voxels within the primary visual cortex (V1) better than alternative parametric methods.

preprint2013arXiv

A Statistical Perspective on Algorithmic Leveraging

One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method. In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms. A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance.
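For readers unfamiliar with leverage scores, the sketch below computes them in the simplest case, a linear regression with an intercept and one predictor, and normalizes them into an importance sampling distribution. It relies on the standard closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$; the function names and toy data are assumptions, and the paper's two new leveraging algorithms are not reproduced here.

```python
def simple_regression_leverages(x):
    """Leverage scores for the model y = a + b * x (intercept plus slope)."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]

def leverage_sampling_probs(x):
    """Normalize leverages into sampling probabilities.

    The leverages sum to the number of fitted parameters (2 here),
    so dividing by their total gives a valid distribution.
    """
    h = simple_regression_leverages(x)
    total = sum(h)
    return [hi / total for hi in h]

# Toy predictor values; the last point lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
leverages = simple_regression_leverages(x)
probs = leverage_sampling_probs(x)
```

The outlying point at 10.0 receives the largest sampling probability, which is the behavior that distinguishes leverage-based sampling from uniform sampling.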

preprint2013arXiv

A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers

High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless $p/n\rightarrow0$, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank matrices and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates, in both $\ell_2$-error and related norms. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized M-estimators have fast convergence rates and which are optimal in many well-studied cases.

preprint2013arXiv

Early stopping and non-parametric regression: An optimal data-dependent stopping rule

The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ or the $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.

preprint2013arXiv

Error Rate Bounds in Crowdsourcing Models

Crowdsourcing is an effective tool for human-powered computation on many tasks challenging for computers. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of hyperplane binary labeling rules under the Dawid-Skene crowdsourcing model. The bounds can be applied to analyze many common prediction methods, including majority voting and weighted majority voting. These bounds could be useful for controlling the error rate and designing better algorithms. We show that the oracle Maximum A Posteriori (MAP) rule approximately optimizes our upper bound on the mean error rate for any hyperplane binary labeling rule, and propose a simple data-driven weighted majority voting (WMV) rule (called one-step WMV) that attempts to approximate the oracle MAP and has a provable theoretical guarantee on the error rate. Moreover, we use simulated and real data to demonstrate that the data-driven EM-MAP rule is a good approximation to the oracle MAP rule, and to demonstrate that the mean error rate of the data-driven EM-MAP rule is also bounded by the mean error rate bound of the oracle MAP rule with estimated parameters plugged into the bound.

preprint2013arXiv

Geometry of the faithfulness assumption in causal inference

Many algorithms for inferring causality rely heavily on the faithfulness assumption. The main justification for imposing this assumption is that the set of unfaithful distributions has Lebesgue measure zero, since it can be seen as a collection of hypersurfaces in a hypercube. However, due to sampling error the faithfulness condition alone is not sufficient for statistical estimation, and strong-faithfulness has been proposed and assumed to achieve uniform or high-dimensional consistency. In contrast to the plain faithfulness assumption, the set of distributions that is not strong-faithful has nonzero Lebesgue measure and in fact, can be surprisingly large as we show in this paper. We study the strong-faithfulness condition from a geometric and combinatorial point of view and give upper and lower bounds on the Lebesgue measure of strong-faithful distributions for various classes of directed acyclic graphs. Our results imply fundamental limitations for the PC-algorithm and potentially also for other algorithms based on partial correlation testing in the Gaussian case.

preprint2013arXiv

Scaling Analysis of Nanowire Phase Change Memory

This letter analyzes the scaling property of nanowire (NW) phase change memory (PCM) using analytic and numerical methods. The scaling scenarios of the three widely-used NW PCM operation schemes (constant electric field, voltage, and current) are studied and compared. It is shown that if the device size is downscaled by a factor of $1/k$ ($k>1$), the operation energy (current) will be reduced by more than $k^3$ ($k$) times, and the operation speed will be increased by $k^2$ times. It is also shown that more than 90% of operation energy is wasted as thermal flux into substrate and electrodes. We predict that, if the wasted thermal flux is effectively reduced by heat confinement technologies, the energy consumed per RESET operation can be decreased from about 1 pJ to less than 100 fJ. It is shown that reducing NW aspect ratio (AR) helps decrease PCM energy consumption. It is revealed that cross-cell thermal proximity disturbance is counter-intuitively alleviated by scaling, leading to a desirable scaling scenario.
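The scaling factors quoted above can be turned into quick arithmetic. The Python sketch below applies them to a set of baseline figures: energy divided by the cube of k, current divided by k, and operation time divided by the square of k. The function name and the baseline numbers are hypothetical, and treating "speed increased" as "operation time divided" is an assumption; since the abstract states reductions of "more than" these factors, the energy and current outputs should be read as upper bounds.

```python
def scale_pcm_operation(energy, current, op_time, k):
    """Apply the abstract's scaling factors for a size reduction of 1/k.

    Returns (energy_bound, current_bound, op_time), where the first two
    are upper bounds because the reductions are stated as "more than".
    Units are whatever the caller used for the baseline values.
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return energy / k**3, current / k, op_time / k**2

# Hypothetical baseline: 1.0 unit of energy, 8.0 of current, 4.0 of time.
energy_bound, current_bound, op_time = scale_pcm_operation(1.0, 8.0, 4.0, k=2)
```

Halving the device size (k = 2) therefore cuts the energy bound to one eighth of the baseline, the current bound to one half, and the operation time to one quarter.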

preprint2013arXiv

Stability

Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to "reasonable" perturbations to data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models. In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Last, a novel "stability" argument is seen to drive new results that shed light on the intriguing interactions between sample to sample variability and heavier tail error distribution (e.g., double-exponential) in high-dimensional regression models with $p$ predictors and $n$ independent samples. In particular, when $p/n\rightarrow κ\in(0.3,1)$ and the error distribution is double-exponential, the Ordinary Least Squares (OLS) is a better estimator than the Least Absolute Deviation (LAD) estimator.

preprint2013arXiv

Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called "path coding" penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs.

preprint2013arXiv

Supplement to "Reversible MCMC on Markov equivalence classes of sparse directed acyclic graphs"

This supplementary material includes three parts: some preliminary results, four examples, an experiment, three new algorithms, and all proofs of the results in the paper "Reversible MCMC on Markov equivalence classes of sparse directed acyclic graphs".

preprint2012arXiv

A Hierarchical Bayesian Approach for Aerosol Retrieval Using MISR Data

Atmospheric aerosols can cause serious damage to human health and life expectancy. Using the radiances observed by NASA's Multi-angle Imaging SpectroRadiometer (MISR), the current MISR operational algorithm retrieves Aerosol Optical Depth (AOD) at a spatial resolution of 17.6 km x 17.6 km. A systematic study of aerosols and their impact on public health, especially in highly-populated urban areas, requires a finer-resolution estimate of the spatial distribution of AOD values. We embed MISR's operational weighted least squares criterion and its forward simulations for AOD retrieval in a likelihood framework and further expand it into a Bayesian hierarchical model to adapt to a finer spatial scale of 4.4 km x 4.4 km. To take advantage of AOD's spatial smoothness, our method borrows strength from data at neighboring pixels by postulating a Gaussian Markov Random Field prior for AOD. Our model considers both AOD and aerosol mixing vectors as continuous variables. The inference of AOD and mixing vectors is carried out using Metropolis-within-Gibbs sampling methods. Retrieval uncertainties are quantified by posterior variabilities. We also implement a parallel MCMC algorithm to reduce computational cost. We assess the performance of our retrievals using ground-based measurements from the AErosol RObotic NETwork (AERONET), a hand-held sunphotometer and satellite images from Google Earth. Based on case studies in the greater Beijing area, China, we show that a 4.4 km resolution can improve the accuracy and coverage of remotely-sensed aerosol retrievals, as well as our understanding of the spatial and seasonal behaviors of aerosols. This improvement is particularly important during high-AOD events, which often indicate severe air pollution.

preprint2012arXiv

Complexity Analysis of the Lasso Regularization Path

The regularization path of the Lasso can be shown to be piecewise linear, making it possible to "follow" and explicitly compute the entire path. In this paper, we analyze this popular strategy and prove that its worst-case complexity is exponential in the number of variables. We then contrast this pessimistic result with an (optimistic) approximate analysis: we show that an approximate path with at most O(1/sqrt(epsilon)) linear segments can always be obtained, where every point on the path is guaranteed to be optimal up to a relative epsilon-duality gap. We complete our theoretical analysis with a practical algorithm to compute these approximate paths.
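The piecewise linearity mentioned in the abstract can be sketched as follows. This is an illustrative derivation, not taken from the paper: it assumes the common objective $\tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, and the symbols $J$ (active set), $s$ (sign vector) and $X_J$ (columns of $X$ indexed by $J$) are introduced here for illustration. It also assumes $X_J^\top X_J$ is invertible.

```latex
% Lasso solution at regularization level \lambda (assumed objective):
\[
\hat\beta(\lambda) \in \arg\min_{\beta}\; \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 .
\]
% On any interval of \lambda where the active set J and the signs
% s = \operatorname{sign}(\hat\beta_J) do not change, the optimality (KKT)
% conditions restricted to J give a linear system in \hat\beta_J:
\[
X_J^\top\bigl(y - X_J\hat\beta_J(\lambda)\bigr) = \lambda s
\quad\Longrightarrow\quad
\hat\beta_J(\lambda) = (X_J^\top X_J)^{-1}\bigl(X_J^\top y - \lambda s\bigr),
\qquad
\hat\beta_{J^c}(\lambda) = 0 .
\]
```

The right-hand side is affine in $\lambda$, so the path is a straight segment until a coordinate enters or leaves $J$. Each such change starts a new linear segment, and the number of segments is the quantity whose worst case the abstract describes as exponential.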

preprint2011arXiv

Chemical Vapor Deposition-Assembled Graphene Field-Effect Transistor on Hexagonal Boron Nitride

We investigate key electrical properties of monolayer graphene assembled by chemical-vapor-deposition (CVD) as impacted by supporting substrate material. Graphene field-effect transistors (GFETs) were fabricated with the carbon channel placed directly on hexagonal boron nitride (h-BN) and SiO2, respectively. Small-signal transconductance (gm) and effective carrier mobility (μeff) are improved by 8.5 and 4 times on h-BN, respectively, as compared with those on SiO2. Compared with a GFET with exfoliated graphene on SiO2, gm and μeff measured from a device with CVD graphene on an h-BN substrate exhibit comparable values. The experiment demonstrates the potential of employing h-BN as a platform material for large-area carbon electronics.

preprint2011arXiv

Electronic Transport in Monolayer Graphene with Extreme Physical Deformation: ab Initio Density Functional Calculation

Electronic transport properties of monolayer graphene with extreme physical bending up to a 90° angle are studied using ab initio first-principles calculations. We discuss how key structural parameters, including step height, curvature radius and bending angle, modify the transport properties of the deformed graphene sheet compared with the corresponding flat one. The local density of states reveals that energy state modification caused by the physical bending is highly localized. It is observed that the transport properties of bent graphene with a wide range of geometrical configurations are insensitive to the structural deformation in the low-energy transmission spectra, even in the extreme case of bending. The results support that graphene, with its superb electromechanical robustness, could serve as a viable material platform in a spectrum of applications such as photovoltaics, flexible electronics, OLED, and 3D electronic chips.

preprint2011arXiv

Encoding and decoding V1 fMRI responses to natural images with sparse nonparametric models

Functional MRI (fMRI) has become the most common method for investigating the human brain. However, fMRI data present some complications for statistical analysis and modeling. One recently developed approach to these data focuses on estimation of computational encoding models that describe how stimuli are transformed into brain activity measured in individual voxels. Here we aim at building encoding models for fMRI signals recorded in the primary visual cortex of the human brain. We use residual analyses to reveal systematic nonlinearity across voxels not taken into account by previous models. We then show how a sparse nonparametric method [J. Roy. Statist. Soc. Ser. B 71 (2009b) 1009-1030] can be used together with correlation screening to estimate nonlinear encoding models effectively. Our approach produces encoding models that predict about 25% more accurately than models estimated using other methods [Nature 452 (2008a) 352--355]. The estimated nonlinearity impacts the inferred properties of individual voxels, and it has a plausible biological interpretation. One benefit of quantitative encoding models is that estimated models can be used to decode brain activity, in order to identify which specific image was seen by an observer. Encoding models estimated by our approach also improve such image identification by about 12% when the correct image is one of 11,500 possible images.

preprint2011arXiv

Highly Conductive 3D Nano-Carbon: Stacked Multilayer Graphene System with Interlayer Decoupling

We investigate electrical conduction and breakdown behavior of 3D nano-carbon-stacked multilayer graphene (s-MLG) system with complete interlayer decoupling. The s-MLG is prepared by transferring and stacking large-area CVD-grown graphene monolayers, followed by wire patterning and plasma etching. Raman spectroscopy was used to confirm the layer number. The D-band peak indicates low defect level in the samples. Electrical current stressing induced doping is performed to shift the charge-neutrality Dirac point and decrease the graphene/metal contact resistance, improving the overall electrical conduction. Breakdown experiments show the current-carrying capacity of s-MLG is largely enhanced as compared with that of monolayer graphene.

preprint2011arXiv

Local Electrical Stress-Induced Doping and Formation of 2D Monolayer Graphene P-N Junction

We demonstrated doping in 2D monolayer graphene via local electrical stressing. The doping, confirmed by the resistance-voltage transfer characteristics of the graphene system, is observed to be continuously tunable from N-type to P-type as the electrical stressing level (voltage) increases. Two major physical mechanisms are proposed to interpret the observed phenomena: modifications of surface chemistry for N-type doping (at low-level stressing) and thermally-activated charge transfer from graphene to SiO2 substrate for P-type doping (at high-level stressing). The formation of a P-N junction on a 2D graphene monolayer is demonstrated with complementary doping based on locally applied electrical stressing.

preprint2011arXiv

Lyapunov graphs of nonsingular Smale flows on $S^{1}\times S^{2}$

In this paper, following J. Franks' work on Lyapunov graphs of nonsingular Smale flows on $S^3$, we study Lyapunov graphs of nonsingular Smale flows on $S^1 \times S^2$. More precisely, we determine necessary and sufficient conditions on an abstract Lyapunov graph to be associated with a nonsingular Smale flow on $S^1 \times S^2$. We also study the singular type vertices in Lyapunov graphs of nonsingular Smale flows on 3-manifolds.

preprint2011arXiv

Minimax-optimal rates for sparse additive models over kernel classes via convex programming

Sparse additive models are families of $d$-variate functions that have the additive decomposition $f^* = \sum_{j \in S} f^*_j$, where $S$ is an unknown subset of cardinality $s \ll d$. In this paper, we consider the case where each univariate component function $f^*_j$ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function $f^*$ based on kernels combined with $\ell_1$-type convex regularization. Working within a high-dimensional framework that allows both the dimension $d$ and sparsity $s$ to increase with $n$, we derive convergence rates (upper bounds) in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms over the class of sparse additive models with each univariate function $f^*_j$ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the $L^2(\mathbb{P})$ error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the multivariate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity $s = \Omega(\sqrt{n})$, showing that global boundedness is a significant restriction in the high-dimensional setting.

preprint2011arXiv

Multicolor Graphene Nanoribbon/Semiconductor Nanowire Heterojunction Light-Emitting Diodes

We report novel graphene nanoribbon (GNR)/semiconductor nanowire (SNW) heterojunction light-emitting diodes (LEDs) for the first time. The GNR and SNW have a face-to-face contact structure, which has the merit of a larger active region. ZnO, CdS, and CdSe NWs were employed in our case. At forward biases, the GNR/SNW heterojunction LEDs could emit light with wavelengths varying from ultraviolet (380 nm) to green (513 nm) to red (705 nm), which were determined by the band-gaps of the involved SNWs. The mechanism of light emission for the GNR/SNW heterojunction LED was discussed. Our approach can easily be extended to other semiconductor nano-materials. Moreover, our achievement opens the door to next-generation display technologies, including portable, "see-through", and conformable products.

preprint2011arXiv

Remembering Leo

I do not remember when was the first time that I met Leo, but I have a clear memory of going to Leo's office on the 4th floor of Evans Hall to talk to him in my second year in Berkeley's Ph.D. program in 1986. The details of the conversation are not retained but a visual image of his clean and orderly office remains, in a stark contrast to a high entropy state of the same office now being used by myself.

preprint2011arXiv

Spectral clustering and the high-dimensional stochastic blockmodel

Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The stochastic blockmodel [Social Networks 5 (1983) 109--137] is a social network model with well-defined communities; each node is a member of one community. For a network generated from the Stochastic Blockmodel, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the stochastic blockmodel, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original.
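The pipeline the abstract describes (sample a network from a stochastic blockmodel, embed nodes with leading eigenvectors of a normalized graph Laplacian, count misclustered nodes) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's procedure: the convention $L = D^{-1/2} A D^{-1/2}$, the two-block setup, the edge probabilities, and the helper names `sample_sbm` and `leading_eigenvectors` are all choices made here. For two blocks, the sign of the second eigenvector stands in for the k-means step that a general implementation would use.

```python
import numpy as np


def sample_sbm(block_sizes, p_in, p_out, rng):
    """Sample a symmetric adjacency matrix from a simple stochastic blockmodel.

    Hypothetical helper: nodes in the same block connect with probability
    p_in, nodes in different blocks with probability p_out; no self-loops.
    """
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Draw each undirected edge once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    adjacency = (upper | upper.T).astype(float)
    return adjacency, labels


def leading_eigenvectors(adjacency, k):
    """Eigenvectors of D^{-1/2} A D^{-1/2} with the k largest |eigenvalues|."""
    degrees = adjacency.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))
    laplacian = inv_sqrt_deg[:, None] * adjacency * inv_sqrt_deg[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    order = np.argsort(-np.abs(eigenvalues))[:k]
    return eigenvectors[:, order]


rng = np.random.default_rng(0)
adjacency, truth = sample_sbm([100, 100], p_in=0.5, p_out=0.05, rng=rng)
embedding = leading_eigenvectors(adjacency, k=2)

# Two-block shortcut: split nodes by the sign of the second eigenvector.
predicted = (embedding[:, 1] > 0).astype(int)

# Cluster labels are arbitrary, so count errors up to swapping the two labels.
misclustered = int(min(np.sum(predicted != truth), np.sum(predicted == truth)))
print("misclustered nodes:", misclustered)
```

With blocks this well separated, very few nodes should be misclustered. Shrinking the gap between `p_in` and `p_out`, or increasing the number of blocks, moves toward the regime the abstract's bounds address.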

preprint2011arXiv

The Templates of Nonsingular Smale Flows on Three Manifolds

In this paper, we first discuss some connections between template theory and the description of basic sets of Smale flows on 3-manifolds due to F. Béguin and C. Bonatti. The main tools we use are symbolic dynamics, template moves and some combinatorial surgeries. Second, we obtain some relationship between the surgeries and the number of $S^1 \times S^2$ factors of $M$ for a nonsingular Smale flow on a given closed orientable 3-manifold $M$. Besides these, we also prove that any template $T$ can model a basic set $Λ$ of a nonsingular Smale flow on $nS^1 \times S^2$ for some positive integer $n$.

preprint2010arXiv

A Simple and Scalable Graphene Patterning Method and Its Application in CdSe Nanobelt/Graphene Schottky Junction Solar Cells

We develop a simple and scalable graphene patterning method using electron-beam or ultraviolet lithography followed by a lift-off process. This method offers high pattern resolution and high alignment accuracy, requires no additional etching or harsh processing, applies to arbitrary substrates, and is compatible with Si microelectronic technology. It can be easily applied to diverse graphene-based devices, especially in array-based applications, where large-scale graphene patterns are desired. We have applied this method to fabricate CdSe nanobelt (NB)/graphene Schottky junction solar cells, which have potential application in integrated nano-optoelectronic systems. A typical as-fabricated solar cell shows excellent photovoltaic behavior with an open-circuit voltage of ~0.51 V, a short-circuit current density of ~5.75 mA/cm2, and an energy conversion efficiency of ~1.25%. We attribute the high performance of the cell to the as-patterned high-performance graphene, which can form an ideal Schottky contact with CdSe NB. Our results suggest that both the developed graphene patterning method and the as-fabricated CdSe NB/graphene Schottky junction solar cells have realistic application prospects.

preprint2010arXiv

Regular level sets of Lyapunov graphs of nonsingular Smale flows on 3-manifolds

In this paper, we first discuss the regular level set of a nonsingular Smale flow (NSF) on a 3-manifold. The main result about this topic is that a 3-manifold $M$ admits an NSF which has a regular level set homeomorphic to $(n+1)T^{2}$ $(n\in \mathbb{Z}, n\geq 0)$ if and only if $M=M'\sharp n S^{1}\times S^{2}$. Then we discuss how to realize a template as a basic set of an NSF on a 3-manifold. We focus on the connection between the genus of the template $T$ and the topological structure of the realizing 3-manifold $M$.

preprint2010arXiv

The Lasso under Heteroscedasticity

The performance of the Lasso is well understood under the assumptions of the standard linear model with homoscedastic noise. However, in several applications, the standard model does not describe the important features of the data. This paper examines how the Lasso performs on a non-standard model that is motivated by medical imaging applications. In these applications, the variance of the noise scales linearly with the expectation of the observation. Like all heteroscedastic models, the noise terms in this Poisson-like model are not independent of the design matrix. More specifically, this paper studies the sign consistency of the Lasso under a sparse Poisson-like model. In addition to studying sufficient conditions for the sign consistency of the Lasso estimate, this paper also gives necessary conditions for sign consistency. Both sets of conditions are comparable to results for the homoscedastic model, showing that when a measure of the signal to noise ratio is large, the Lasso performs well on both Poisson-like data and homoscedastic data. Simulations reveal that the Lasso performs equally well in terms of model selection performance on both Poisson-like data and homoscedastic data (with properly scaled noise variance), across a range of parameterizations. Taken as a whole, these results suggest that the Lasso is robust to the Poisson-like heteroscedastic noise.
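The defining property of the noise model, variance that grows linearly with the expected observation, can be illustrated with a short simulation. This is a sketch under assumptions and not the paper's exact model: the form `mean + sigma * sqrt(mean) * gaussian`, the helper name `poisson_like_response`, and the chosen means are introduced here for illustration only.

```python
import numpy as np


def poisson_like_response(mean, sigma, rng):
    """Draw observations whose noise variance is sigma^2 * mean.

    Hypothetical helper: Gaussian noise scaled by sqrt(mean), so the
    variance is proportional to the expectation, as for Poisson counts.
    """
    mean = np.asarray(mean, dtype=float)
    return mean + sigma * np.sqrt(mean) * rng.standard_normal(mean.shape)


rng = np.random.default_rng(0)
low = poisson_like_response(np.full(100_000, 4.0), sigma=1.0, rng=rng)
high = poisson_like_response(np.full(100_000, 16.0), sigma=1.0, rng=rng)

# The expectation is 4x larger in `high`, so its variance should be about 4x too.
variance_ratio = high.var() / low.var()
print("variance ratio:", round(float(variance_ratio), 2))
```

Because the noise scale depends on the mean, and the mean depends on the design matrix in a regression setting, the noise is not independent of the design, which is the feature the abstract highlights.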

preprint2009arXiv

Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls

Consider the standard linear regression model $y = X \beta^* + w$, where $y \in \mathbb{R}^n$ is an observation vector, $X \in \mathbb{R}^{n \times d}$ is a design matrix, $\beta^* \in \mathbb{R}^d$ is the unknown regression vector, and $w \sim \mathcal{N}(0, \sigma^2 I)$ is additive Gaussian noise. This paper studies the minimax rates of convergence for estimation of $\beta^*$ for $\ell_p$-losses and in the $\ell_2$-prediction loss, assuming that $\beta^*$ belongs to an $\ell_q$-ball $\mathbb{B}_q(R_q)$ for some $q \in [0,1]$. We show that under suitable regularity conditions on the design matrix $X$, the minimax error in $\ell_2$-loss and $\ell_2$-prediction loss scales as $R_q \big(\frac{\log d}{n}\big)^{1-\frac{q}{2}}$. In addition, we provide lower bounds on minimax risks in $\ell_p$-norms, for all $p \in [1, +\infty], p \neq q$. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls $\mathbb{B}_q(R_q)$, whereas our proofs of the upper bounds are direct and constructive, involving direct analysis of least-squares over $\ell_q$-balls. For the special case $q = 0$, a comparison with $\ell_2$-risks achieved by computationally efficient $\ell_1$-relaxations reveals that although such methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix $X$ than algorithms involving least-squares over the $\ell_0$-ball.
