Source author record

Yang Feng

Yang Feng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

61 works

21 topics

4 close collaborators

preprint · 2026 · arXiv

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.
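
The "flips above baseline jitter" measurement described above can be illustrated with a minimal sketch. It is not the paper's Policy Invariance Score or released protocol; the function names and the idea of estimating jitter by re-running the judge under the unchanged policy are assumptions made for illustration.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], other: Sequence[str]) -> float:
    """Fraction of trajectories whose verdict differs between two judge runs."""
    if len(baseline) != len(other):
        raise ValueError("verdict lists must have the same length")
    if not baseline:
        return 0.0
    flips = sum(b != o for b, o in zip(baseline, other))
    return flips / len(baseline)


def excess_flip_rate(
    baseline: Sequence[str],
    rewritten: Sequence[str],
    rerun: Sequence[str],
) -> float:
    """Flip rate under a policy rewrite minus the jitter of a plain re-run.

    `rerun` holds verdicts from running the judge again with the unchanged
    policy (an assumed way to estimate baseline jitter).
    """
    return flip_rate(baseline, rewritten) - flip_rate(baseline, rerun)


if __name__ == "__main__":
    base = ["safe", "unsafe", "safe", "safe"]
    rewritten = ["safe", "safe", "safe", "unsafe"]
    rerun = ["safe", "unsafe", "safe", "unsafe"]
    print(excess_flip_rate(base, rewritten, rerun))
```

A judge that is invariant to content-preserving rewrites would show an excess flip rate near zero under this kind of measurement.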

preprint · 2026 · arXiv

Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.

preprint · 2025 · arXiv

Large Language Models for Unit Test Generation: Achievements, Challenges, and Opportunities

Automated unit test generation is critical for software quality but traditional structure-driven methods often lack the semantic understanding required to produce realistic inputs and oracles. Large language models (LLMs) address this limitation by leveraging their extensive data-driven knowledge of code semantics and programming patterns. To analyze the state of the art in this domain, we conducted a systematic literature review of 115 publications published between May 2021 and August 2025. We propose a taxonomy based on the unit test generation lifecycle that divides the process into a generative phase for creating test artifacts and a quality assurance phase for refining them. Our analysis reveals that prompt engineering has emerged as the dominant utilization approach and accounts for 89% of the studies due to its flexibility. We find that iterative validation and repair loops have become the standard mechanism to ensure robust usability by significantly improving compilation and execution pass rates. However, critical challenges remain regarding the weak fault detection capabilities and the lack of standardized benchmarks. We conclude with a roadmap for future research that emphasizes the progression toward autonomous testing agents and hybrid systems combining LLMs with traditional software engineering tools.

preprint · 2022 · arXiv

AI-enabled Automatic Multimodal Fusion of Cone-Beam CT and Intraoral Scans for Intelligent 3D Tooth-Bone Reconstruction and Clinical Applications

A critical step in virtual dental treatment planning is to accurately delineate all tooth-bone structures from CBCT with high fidelity and accurate anatomical information. Previous studies have established several methods for CBCT segmentation using deep learning. However, the inherent resolution discrepancy of CBCT and the loss of occlusal and dentition information largely limited its clinical applicability. Here, we present a Deep Dental Multimodal Analysis (DDMA) framework consisting of a CBCT segmentation model, an intraoral scan (IOS) segmentation model (the most accurate digital dental model), and a fusion model to generate 3D fused crown-root-bone structures with high fidelity and accurate occlusal and dentition information. Our model was trained with a large-scale dataset with 503 CBCT and 28,559 IOS meshes manually annotated by experienced human experts. For CBCT segmentation, we use a five-fold cross validation test, each with 50 CBCT, and our model achieves an average Dice coefficient and IoU of 93.99% and 88.68%, respectively, significantly outperforming the baselines. For IOS segmentations, our model achieves an mIoU of 93.07% and 95.70% on the maxillary and mandible on a test set of 200 IOS meshes, which are 1.77% and 3.52% higher than the state-of-the-art method. Our DDMA framework takes about 20 to 25 minutes to generate the fused 3D mesh model following the sequential processing order, compared to over 5 hours by human experts. Notably, our framework has been incorporated into software by a clear aligner manufacturer, and real-world clinical cases demonstrate that our model can visualize crown-root-bone structures during the entire orthodontic treatment and can predict risks like dehiscence and fenestration. These findings demonstrate the potential of multi-modal deep learning to improve the quality of digital dental models and help dentists make better clinical decisions.
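
For readers unfamiliar with the two segmentation metrics quoted above, the following sketch computes the Dice coefficient and IoU for a pair of binary masks. It uses the standard definitions; the flat-list mask representation and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details from the paper.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (assumption).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
    print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

The two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same pair of masks.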

preprint · 2022 · arXiv

Direct visualization of percolating metal-insulator transition in V2O3 using scanning microwave impedance microscopy

Using the extensively studied V2O3 as a prototype system, we investigate the role of percolation in metal-insulator transition (MIT). We apply scanning microwave impedance microscopy to directly determine the metallic phase fraction p and relate it to the macroscopic conductance G, which shows a sudden jump when p reaches the percolation threshold. Interestingly, the conductance G exhibits a hysteretic behavior against p, suggesting two different percolating processes upon cooling and warming. Based on our image analysis and model simulation, we ascribe such hysteretic behavior to different domain nucleation and growth processes between cooling and warming, which is likely caused by the decoupled structural and electronic transitions in V2O3 during MIT. Our work provides a microscopic view of how the interplay of structural and electronic degrees of freedom affects MIT in strongly correlated systems.

preprint · 2022 · arXiv

Gaussian Multi-head Attention for Simultaneous Machine Translation

Simultaneous machine translation (SiMT) outputs translation while receiving the streaming source inputs, and hence needs a policy to determine where to start translating. The alignment between target and source words often implies the most informative source word for each target word, and hence provides the unified control over translation quality and latency, but unfortunately the existing SiMT methods do not explicitly model the alignment to perform the control. In this paper, we propose Gaussian Multi-head Attention (GMA) to develop a new SiMT policy by modeling alignment and translation in a unified manner. For SiMT policy, GMA models the aligned source position of each target word, and accordingly waits until its aligned position to start translating. To integrate the learning of alignment into the translation model, a Gaussian distribution centered on predicted aligned position is introduced as an alignment-related prior, which cooperates with translation-related soft attention to determine the final attention. Experiments on En-Vi and De-En tasks show that our method outperforms strong baselines on the trade-off between translation and latency.
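
The idea of an alignment-related Gaussian prior cooperating with soft attention can be sketched as follows. The abstract does not state the exact combination rule, so the elementwise product followed by renormalization used here is an assumption, as are the function names and the fixed `sigma`.

```python
import math
from typing import List


def gaussian_prior(center: float, sigma: float, src_len: int) -> List[float]:
    """Unnormalized Gaussian weights over source positions 0..src_len-1."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [math.exp(-((j - center) ** 2) / (2 * sigma ** 2)) for j in range(src_len)]


def combine(soft_attention: List[float], prior: List[float]) -> List[float]:
    """Assumed rule: multiply attention by the prior, then renormalize to sum to 1."""
    if len(soft_attention) != len(prior):
        raise ValueError("attention and prior must have the same length")
    product = [a * p for a, p in zip(soft_attention, prior)]
    total = sum(product)
    if total == 0:
        raise ValueError("combined attention has zero mass")
    return [v / total for v in product]


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    # Hypothetical predicted aligned position: source index 2.
    final = combine(uniform, gaussian_prior(center=2.0, sigma=1.0, src_len=4))
    print([round(v, 3) for v in final])
```

With uniform soft attention, the combined distribution peaks at the predicted aligned position, which is the behavior the prior is meant to encourage.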

preprint · 2022 · arXiv

GReS: Graphical Cross-domain Recommendation for Supply Chain Platform

Supply Chain Platforms (SCPs) provide downstream industries with numerous raw materials. Compared with traditional e-commerce platforms, data in SCPs is more sparse due to limited user interests. To tackle the data sparsity problem, one can apply Cross-Domain Recommendation (CDR) which improves the recommendation performance of the target domain with the source domain information. However, applying CDR to SCPs directly ignores the hierarchical structure of commodities in SCPs, which reduces the recommendation performance. To leverage this feature, in this paper, we take the catering platform as an example and propose GReS, a graphical cross-domain recommendation model. The model first constructs a tree-shaped graph to represent the hierarchy of different nodes of dishes and ingredients, and then applies our proposed Tree2vec method combining GCN and BERT models to embed the graph for recommendations. Experimental results on a commercial dataset show that GReS significantly outperforms state-of-the-art methods in Cross-Domain Recommendation for Supply Chain Platforms.

preprint · 2022 · arXiv

Influences of the dissipative topological edge state on quantized transport in MnBi2Te4

The beauty of the quantum Hall (QH) effect is the metrological precision of Hall resistance quantization that originates from the topological edge states. Understanding the factors that lead to quantization breakdown not only provides important insights on the nature of the topological protection of these edge states, but is beneficial for device applications involving such quantized transport. In this work, we combine conventional transport and real space conductivity mapping to investigate whether the quantization breakdown is tied to the disappearance of the edge state in the hotly studied MnBi2Te4 system. Our experimental results unambiguously show that a topological edge state does exist when quantization breakdown occurs. Such an edge state is dissipative in nature and could lead to a quantization breakdown due to its diffusive character causing overlapping with bulk and other edge states in real devices. Our findings draw attention to issues that are generally inaccessible in the transport study of QH, but can play important roles in practical measurements and device applications.

preprint · 2022 · arXiv

Mental Health Assessment for the Chatbots

Previous research on dialogue system assessment usually focuses on the quality evaluation (e.g., fluency, relevance) of responses generated by the chatbots, which are local and technical metrics. For a chatbot which responds to millions of online users including minors, we argue that it should have a healthy mental tendency in order to avoid the negative psychological impact on them. In this paper, we establish several mental health assessment dimensions for chatbots (depression, anxiety, alcohol addiction, empathy) and introduce the questionnaire-based mental health assessment methods. We conduct assessments on some well-known open-domain chatbots and find that there are severe mental health issues for all these chatbots. We attribute this to the neglect of the mental health risks during the dataset building and the model training procedures. We expect to attract researchers' attention to the serious mental health problems of chatbots and improve the chatbots' ability in positive emotional interaction.

preprint · 2022 · arXiv

Modeling Dual Read/Write Paths for Simultaneous Machine Translation

Simultaneous machine translation (SiMT) outputs translation while reading the source sentence and hence requires a policy to decide whether to wait for the next source word (READ) or generate a target word (WRITE), the actions of which form a read/write path. Although the read/write path is essential to SiMT performance, no direct supervision is given to the path in the existing methods. In this paper, we propose a method of dual-path SiMT which introduces duality constraints to direct the read/write path. According to duality constraints, the read/write path in source-to-target and target-to-source SiMT models can be mapped to each other. As a result, the two SiMT models can be optimized jointly by forcing their read/write paths to satisfy the mapping. Experiments on En-Vi and De-En tasks show that our method can outperform strong baselines under all latency levels.

preprint · 2022 · arXiv

Neural Machine Translation with Phrase-Level Universal Visual Representations

Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from shortage of sentence-image pairs. In this paper, we propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets so that MMT can break the limitation of paired sentence-image input. Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region, which can mitigate data sparsity. Furthermore, our method employs the conditional variational auto-encoder to learn visual representations which can filter redundant visual information and only retain visual information related to the phrase. Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets, especially when the textual context is limited.

preprint · 2022 · arXiv

One Reference Is Not Enough: Diverse Distillation with Reference Selection for Non-Autoregressive Translation

Non-autoregressive neural machine translation (NAT) suffers from the multi-modality problem: the source sentence may have multiple correct translations, but the loss function is calculated only according to the reference sentence. Sequence-level knowledge distillation makes the target more deterministic by replacing the target with the output from an autoregressive model. However, the multi-modality problem in the distilled dataset is still nonnegligible. Furthermore, learning from a specific teacher limits the upper bound of the model capability, restricting the potential of NAT models. In this paper, we argue that one reference is not enough and propose diverse distillation with reference selection (DDRS) for NAT. Specifically, we first propose a method called SeedDiv for diverse machine translation, which enables us to generate a dataset containing multiple high-quality reference translations for each source sentence. During the training, we compare the NAT output with all references and select the one that best fits the NAT output to train the model. Experiments on widely-used machine translation benchmarks demonstrate the effectiveness of DDRS, which achieves 29.82 BLEU with only one decoding pass on WMT14 En-De, improving the state-of-the-art performance for NAT by over 1 BLEU. Source code: https://github.com/ictnlp/DDRS-NAT
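
The reference-selection step ("select the one that best fits the NAT output") can be illustrated with a toy sketch. The fit measure used here, a unigram-overlap F1 on whitespace tokens, is a stand-in chosen for simplicity and is not claimed to be the criterion used by DDRS; all function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def overlap_f1(hypothesis: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized sentences."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    common = sum((hyp & ref).values())  # clipped count of shared tokens
    if common == 0:
        return 0.0
    precision = common / sum(hyp.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_reference(hypothesis: str, references: Sequence[str]) -> str:
    """Pick the reference that best fits the model output under overlap_f1.

    Raises ValueError if `references` is empty.
    """
    return max(references, key=lambda ref: overlap_f1(hypothesis, ref))


if __name__ == "__main__":
    output = "the cat sits on the mat"
    refs = ["a cat is sitting on a mat", "the cat sits on the rug"]
    print(select_reference(output, refs))
```

During training, the selected reference would then serve as the target for that example, so the model is not penalized for producing a valid translation that differs from one fixed reference.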

preprint · 2022 · arXiv

Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation

Neural networks tend to gradually forget the previously learned knowledge when learning multiple tasks sequentially from dynamic data distributions. This problem is called catastrophic forgetting, which is a fundamental challenge in the continual learning of neural networks. In this work, we observe that catastrophic forgetting not only occurs in continual learning but also affects the traditional static training. Neural networks, especially neural machine translation models, suffer from catastrophic forgetting even if they learn from a static training set. To be specific, the final model pays imbalanced attention to training samples, where recently exposed samples attract more attention than earlier samples. The underlying cause is that training samples do not get balanced training in each model update, so we name this problem imbalanced training. To alleviate this problem, we propose Complementary Online Knowledge Distillation (COKD), which uses dynamically updated teacher models trained on specific data orders to iteratively provide complementary knowledge to the student model. Experimental results on multiple machine translation tasks show that our method successfully alleviates the problem of imbalanced training and achieves substantial improvements over strong baseline systems.

preprint · 2022 · arXiv

Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

Simultaneous machine translation (SiMT) starts translating while receiving the streaming source inputs, and hence the source sentence is always incomplete during translation. Different from the full-sentence MT using the conventional seq-to-seq architecture, SiMT often applies prefix-to-prefix architecture, which forces each target word to only align with a partial source prefix to adapt to the incomplete source in streaming inputs. However, the source words in the front positions are always illusorily considered more important since they appear in more prefixes, resulting in position bias, which makes the model pay more attention to the front source positions in testing. In this paper, we first analyze the phenomenon of position bias in SiMT, and develop a Length-Aware Framework to reduce the position bias by bridging the structural gap between SiMT and full-sentence MT. Specifically, given the streaming inputs, we first predict the full-sentence length and then fill the future source position with positional encoding, thereby turning the streaming inputs into a pseudo full-sentence. The proposed framework can be integrated into most existing SiMT methods to further improve performance. Experiments on two representative SiMT methods, including the state-of-the-art adaptive policy, show that our method successfully reduces the position bias and thereby achieves better SiMT performance.

preprint · 2022 · arXiv

Relational Surrogate Loss Learning

Evaluation metrics in machine learning can rarely be used directly as loss functions, as they may be non-differentiable and non-decomposable, e.g., average precision and F1 score. This paper aims to address this problem by revisiting the surrogate loss learning, where a deep neural network is employed to approximate the evaluation metrics. Instead of pursuing an exact recovery of the evaluation metric through a deep neural network, we are reminded of the purpose of the existence of these evaluation metrics, which is to distinguish whether one model is better or worse than another. In this paper, we show that directly maintaining the relation of models between surrogate losses and metrics suffices, and propose a rank correlation-based optimization method to maximize this relation and learn surrogate losses. Compared to previous works, our method is much easier to optimize and enjoys significant efficiency and performance gains. Extensive experiments show that our method achieves improvements on various tasks including image classification and neural machine translation, and even outperforms state-of-the-art methods on human pose estimation and machine reading comprehension tasks. Code is available at: https://github.com/hunto/ReLoss.
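
The notion of preserving the relation between surrogate losses and metrics can be made concrete with a rank correlation. The sketch below computes the ordinary (non-differentiable) Spearman coefficient in pure Python; it is meant only to show what is being measured and does not reproduce the paper's optimization method. The example numbers are invented for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    metric = [0.9, 0.8, 0.7]          # hypothetical accuracies of three models
    surrogate_loss = [0.1, 0.2, 0.4]  # hypothetical surrogate losses
    print(spearman(metric, surrogate_loss))
```

A surrogate loss that ranks models in exactly the opposite order to a higher-is-better metric gives a coefficient of -1, which is the ideal case for a loss.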

preprint · 2022 · arXiv

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

How to learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy. Specifically, we mix up the representation sequences of different modalities, and take both unimodal speech sequences and multimodal mixed sequences as input to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy, and achieves significant improvements over a strong baseline on eight translation directions.

preprint · 2022 · arXiv

Transfer Learning under High-dimensional Generalized Linear Models

In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aim to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its $\ell_1/\ell_2$-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and source are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions. When we don't know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals of each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN.

preprint · 2022 · arXiv

Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy

Simultaneous machine translation (SiMT) generates translation before reading the entire source sentence and hence it has to trade off between translation quality and latency. To fulfill the requirements of different translation quality and latency in practical applications, the previous methods usually need to train multiple SiMT models for different latency levels, resulting in large computational costs. In this paper, we propose a universal SiMT model with Mixture-of-Experts Wait-k Policy to achieve the best translation quality under arbitrary latency with only one trained model. Specifically, our method employs multi-head attention to accomplish the mixture of experts where each head is treated as a wait-k expert with its own waiting words number, and given a test latency and source inputs, the weights of the experts are accordingly adjusted to produce the best translation. Experiments on three datasets show that our method outperforms all the strong baselines under different latency, including the state-of-the-art adaptive policy.
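
As background for the wait-k experts mentioned above, the following sketch generates the read/write schedule of a single fixed wait-k policy: read k source words first, then alternate writing and reading until the source is exhausted. It illustrates only the standard fixed policy, not the mixture-of-experts weighting proposed in the paper; the function name and action labels are illustrative.

```python
from typing import List


def wait_k_path(k: int, src_len: int, tgt_len: int) -> List[str]:
    """Sequence of READ/WRITE actions for a fixed wait-k policy."""
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read = 0
    for written in range(tgt_len):
        # Before emitting target word `written`, the policy must have read
        # k + written source words, capped at the source length.
        needed = min(k + written, src_len)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    print(wait_k_path(k=2, src_len=4, tgt_len=4))
```

A smaller k yields lower latency because the first WRITE happens sooner, while a larger k gives each target word more source context.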

preprint · 2021 · arXiv

Gate Tunable Supercurrent in Josephson Junctions Based on Bi2Te3 Topological Insulator Thin Films

We report transport measurements on Josephson junctions consisting of Bi2Te3 topological insulator (TI) thin films contacted by superconducting Nb electrodes. For a device with junction length L = 134 nm, the critical supercurrent Ic can be modulated by an electrical gate which tunes the carrier type and density of the TI film. Ic can reach a minimum when the TI is near the charge neutrality regime with the Fermi energy lying close to the Dirac point of the surface state. In the p-type regime the Josephson current can be well described by a short ballistic junction model. In the n-type regime the junction is ballistic at 0.7 K < T < 3.8 K while for T < 0.7 K the diffusive bulk modes emerge and contribute a larger Ic than the ballistic model. We attribute the lack of diffusive bulk modes in the p-type regime to the formation of p-n junctions. Our work provides new clues for search of Majorana zero mode in TI-based superconducting devices.

preprint · 2021 · arXiv

Layout and Image Recognition Driving Cross-Platform Automated Mobile Testing

The fragmentation problem has extended from Android to different platforms, such as iOS, mobile web, and even mini-programs within some applications (app). In such a situation, recording and replaying test scripts is a popular automated mobile app testing approach. But such an approach encounters severe problems when crossing platforms. Different versions of the same app need to be developed to support different platforms relying on different platform supports. Therefore, mobile app developers need to develop and maintain test scripts for multiple platforms aimed at exactly the same test requirements, greatly increasing testing costs. However, we discover that developers adopt highly similar user interface layouts for versions of the same app on different platforms. Such a phenomenon inspires us to replay test scripts from the perspective of similar UI layouts. We propose an image-driven mobile app testing framework, utilizing Widget Feature Matching and Layout Characterization Matching. We use computer vision technologies to perform UI feature comparison and layout hierarchy extraction on app screenshots to obtain UI structures with rich contextual information, including coordinates, relative relationship, etc. Based on acquired UI structures, we can form a platform-independent test script, and then locate the target widgets under test. Thus, the proposed framework non-intrusively replays test scripts according to a novel platform-independent test script model. We also design and implement a tool named LIT to put the proposed framework into practice, based on which we conduct an empirical study to evaluate the effectiveness and usability of the proposed testing framework. Results show that the overall replay accuracy reaches around 63.39% on Android (14% improvement over state-of-the-art approaches) and 21.83% on iOS (98% improvement over state-of-the-art approaches).

preprint · 2021 · arXiv

Learning to Select Context in a Hierarchical and Global Perspective for Open-domain Dialogue Generation

Open-domain multi-turn conversations mainly have three features, which are hierarchical semantic structure, redundant information, and long-term dependency. Grounded on these, selecting relevant context becomes a challenging step for multi-turn dialogue generation. However, existing methods cannot differentiate both useful words and utterances in long distances from a response. Besides, previous work just performs context selection based on a state in the decoder, which lacks global guidance and could lead the model to focus on irrelevant or unnecessary information. In this paper, we propose a novel model with hierarchical self-attention mechanism and distant supervision to not only detect relevant words and utterances in short and long distances, but also discern related information globally when decoding. Experimental results on two public datasets of both automatic and human evaluations show that our model significantly outperforms other baselines in terms of fluency, coherence, and informativeness.

preprint · 2021 · arXiv

PyART: Python API Recommendation in Real-Time

API recommendation in real-time is challenging for dynamic languages like Python. Many existing API recommendation techniques are highly effective, but they mainly support static languages. A few Python IDEs provide API recommendation functionalities based on type inference and training on a large corpus of Python libraries and third-party libraries. As such, they may fail to recommend or make poor recommendations when type information is missing or target APIs are project-specific. In this paper, we propose a novel approach, PyART, to recommend APIs for Python programs in real-time. It features a light-weight analysis to derive so-called optimistic data-flow, which is neither sound nor complete, but simulates the local data-flow information humans can derive. It extracts three kinds of features: data-flow, token similarity, and token co-occurrence, in the context of the program point where a recommendation is solicited. A predictive model is trained on these features using the Random Forest algorithm. Evaluation on 8 popular Python projects demonstrates that PyART can provide effective API recommendations. When historic commits can be leveraged, which is the target scenario of a state-of-the-art tool APIREC, our average top-1 accuracy is over 50% and average top-10 accuracy over 70%, outperforming APIREC and Intellicode (i.e., the recommendation component in Visual Studio) by 28.48%-39.05% for top-1 accuracy and 24.41%-30.49% for top-10 accuracy. In other applications such as when historic commits are not available and cross-project recommendation, PyART also shows better overall performance. The time to make a recommendation is less than a second on average, satisfying the real-time requirement.

preprint · 2021 · arXiv

RaSE: A Variable Screening Framework via Random Subspace Ensembles

Variable screening methods have been shown to be effective in dimension reduction under the ultra-high dimensional setting. Most existing screening methods are designed to rank the predictors according to their individual contributions to the response. As a result, variables that are marginally independent but jointly dependent with the response could be missed. In this work, we propose a new framework for variable screening, Random Subspace Ensemble (RaSE), which works by evaluating the quality of random subspaces that may cover multiple predictors. This new screening framework can be naturally combined with any subspace evaluation criterion, which leads to an array of screening methods. The framework is capable of identifying signals with no marginal effect or with high-order interaction effects. It is shown to enjoy the sure screening property and rank consistency. We also develop an iterative version of RaSE screening with theoretical support. Extensive simulation studies and real-data analysis show the effectiveness of the new screening framework.

preprint · 2021 · arXiv

The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases

With the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the U.S. and the rest of the world are experiencing a severe second wave of infections, the importance of assigning growth membership to counties and understanding the determinants of the growth are increasingly evident. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities. Lastly, we effectively predict the future growth of a given county with an LSTM using three social distancing scores. This comprehensive study captures the nature of counties' growth in cases at a very micro-level using growth communities, demographic factors, and social distancing performance to help government agencies utilize known information to make appropriate decisions regarding which potential counties to target resources and funding to.

preprint2020arXiv

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when chatting about a given video, which is organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce the multi-task learning method to learn joint representations among different modalities as well as generate informative and fluent responses. Our method extends the natural language generation pre-trained model to multimodal dialogue generation task. Our system achieves the best performance in both objective and subjective evaluations in the challenge.

preprint2020arXiv

CDL: Curriculum Dual Learning for Emotion-Controllable Response Generation

Emotion-controllable response generation is an attractive and valuable task that aims to make open-domain conversations more empathetic and engaging. Existing methods mainly enhance the emotion expression by adding regularization terms to standard cross-entropy loss and thus influence the training process. However, due to the lack of further consideration of content consistency, the common problem of response generation tasks, safe response, is intensified. Besides, query emotions that can help model the relationship between query and response are simply ignored in previous models, which would further hurt the coherence. To alleviate these problems, we propose a novel framework named Curriculum Dual Learning (CDL) which extends the emotion-controllable response generation to a dual task to generate emotional responses and emotional queries alternately. CDL utilizes two rewards focusing on emotion and content to improve the duality. Additionally, it applies curriculum learning to gradually generate high-quality responses based on the difficulties of expressing various emotions. Experimental results show that CDL significantly outperforms the baselines in terms of coherence, diversity, and relation to emotion factors.

preprint2020arXiv

DeepGini: Prioritizing Massive Tests to Enhance the Robustness of Deep Neural Networks

Deep neural networks (DNNs) have been deployed in many software systems to assist in various classification tasks. Along with their remarkable effectiveness in classification, DNNs could also exhibit incorrect behaviors and result in accidents and losses. Therefore, testing techniques that can detect incorrect DNN behaviors and improve DNN quality are extremely necessary and critical. However, the testing oracle, which defines the correct output for a given input, is often not available in automated testing. To obtain the oracle information, the testing tasks of DNN-based systems usually require expensive human efforts to label the testing data, which significantly slows down the process of quality assurance. To mitigate this problem, we propose DeepGini, a test prioritization technique designed based on a statistical perspective of DNN. Such a statistical perspective allows us to reduce the problem of measuring misclassification probability to the problem of measuring set impurity, which allows us to quickly identify possibly-misclassified tests. To evaluate, we conduct an extensive empirical study on popular datasets and prevalent DNN models. The experimental results demonstrate that DeepGini outperforms existing coverage-based techniques in prioritizing tests regarding both effectiveness and efficiency. Meanwhile, we observe that the tests prioritized at the front by DeepGini are more effective in improving the DNN quality in comparison with the coverage-based techniques.
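
The impurity-based prioritization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes "set impurity" means the Gini impurity of a test's softmax output, 1 - sum(p_i^2), and the function names and sample probabilities are hypothetical.

```python
def gini_score(probs):
    """Gini impurity of one softmax output vector: 1 - sum(p_i^2).

    Assumption: this is the impurity measure meant by the abstract.
    A confident prediction scores near 0; a uniform one scores highest.
    """
    return 1.0 - sum(p * p for p in probs)


def prioritize(outputs):
    """Return test indices ordered from most impure to least impure."""
    scores = [gini_score(p) for p in outputs]
    return sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)


# Hypothetical softmax outputs for three unlabeled tests.
outputs = [
    [0.98, 0.01, 0.01],  # confident prediction, low impurity
    [0.40, 0.35, 0.25],  # uncertain prediction, high impurity
    [0.70, 0.20, 0.10],  # in between
]
order = prioritize(outputs)
print(order)
```

Tests at the front of the ordering are the ones a labeling budget would be spent on first, since no oracle is needed to compute the score.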

preprint2020arXiv

Nested Model Averaging on Solution Path for High-dimensional Linear Regression

We study the nested model averaging method on the solution path for a high-dimensional linear regression problem. In particular, we propose to combine model averaging with regularized estimators (e.g., lasso and SLOPE) on the solution path for high-dimensional linear regression. In simulation studies, we first conduct a systematic investigation on the impact of predictor ordering on the behavior of nested model averaging, then show that nested model averaging with lasso and SLOPE compares favorably with other competing methods, including the infeasible lasso and SLOPE with the tuning parameter optimally selected. A real data analysis on predicting the per capita violent crime in the United States shows an outstanding performance of the nested model averaging with lasso.

preprint2020arXiv

Neyman-Pearson classification: parametrics and sample size requirement

The Neyman-Pearson (NP) paradigm in binary classification seeks classifiers that achieve a minimal type II error while enforcing the prioritized type I error controlled under some user-specified level $α$. This paradigm serves naturally in applications such as severe disease diagnosis and spam detection, where people have clear priorities among the two error types. Recently, Tong, Feng and Li (2018) proposed a nonparametric umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression, support vector machines, random forest) to respect the given type I error upper bound $α$ with high probability, without specific distributional assumptions on the features and the responses. Universal as the umbrella algorithm is, it demands an explicit minimum sample size requirement on class $0$, which is often the more scarce class, such as in rare disease diagnosis applications. In this work, we employ the parametric linear discriminant analysis (LDA) model and propose a new parametric thresholding algorithm, which does not need the minimum sample size requirements on class $0$ observations and thus is suitable for small sample applications such as rare disease diagnosis. Leveraging both the existing nonparametric and the newly proposed parametric thresholding rules, we propose four LDA-based NP classifiers, for both low- and high-dimensional settings. On the theoretical front, we prove NP oracle inequalities for one proposed classifier, where the rate for excess type II error benefits from the explicit parametric model assumption. Furthermore, as NP classifiers involve a sample splitting step of class $0$ observations, we construct a new adaptive sample splitting scheme that can be applied universally to NP classifiers, and this adaptive strategy reduces the type II error of these classifiers.
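
To make the minimum sample size requirement concrete, the sketch below follows the order-statistic thresholding rule as it is commonly stated for the nonparametric umbrella algorithm. The abstract does not give the formula, so the bound used here is an assumption, as are the tolerance parameter `delta` and all function names. With n held-out class 0 scores and the threshold set at the k-th smallest score, the probability that the type I error exceeds alpha is bounded by a binomial tail.

```python
import math


def violation_bound(n, k, alpha):
    """Upper bound on P(type I error > alpha) when the threshold is the
    k-th smallest of n held-out class 0 scores (assumed binomial tail)."""
    return sum(
        math.comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def umbrella_rank(n, alpha, delta):
    """Smallest rank k whose violation bound is at most delta, or None
    if even the largest score (k = n) cannot meet the tolerance."""
    for k in range(1, n + 1):
        if violation_bound(n, k, alpha) <= delta:
            return k
    return None


def min_class0_size(alpha, delta):
    """Smallest n with (1 - alpha)^n <= delta, so that a valid rank exists."""
    return math.ceil(math.log(delta) / math.log(1 - alpha))


print(min_class0_size(0.05, 0.05))
print(umbrella_rank(58, 0.05, 0.05), umbrella_rank(59, 0.05, 0.05))
```

Under these assumptions, a valid threshold exists only once the class 0 sample is large enough, which is the constraint the parametric LDA-based thresholding rule is designed to avoid.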

preprint2020arXiv

Towards Multimodal Response Generation with Exemplar Augmentation and Curriculum Optimization

Recently, variational auto-encoder (VAE) based approaches have made impressive progress on improving the diversity of generated responses. However, these methods usually suffer the cost of decreased relevance accompanied by diversity improvements. In this paper, we propose a novel multimodal response generation framework with exemplar augmentation and curriculum optimization to enhance relevance and diversity of generated responses. First, unlike existing VAE-based models that usually approximate a simple Gaussian posterior distribution, we present a Gaussian mixture posterior distribution (i.e., multimodal) to further boost response diversity, which helps capture complex semantics of responses. Then, to ensure that relevance does not decrease while diversity increases, we fully exploit similar examples (exemplars) retrieved from the training data into posterior distribution modeling to augment response relevance. Furthermore, to facilitate the convergence of Gaussian mixture prior and posterior distributions, we devise a curriculum optimization strategy to progressively train the model under multiple training criteria from easy to hard. Experimental results on widely used SwitchBoard and DailyDialog datasets demonstrate that our model achieves significant improvements compared to strong baselines in terms of diversity and relevance.

preprint2020arXiv

Unifying Specialist Image Embedding into Universal Image Embedding

Deep image embedding provides a way to measure the semantic similarity of two images. It plays a central role in many applications such as image search, face verification, and zero-shot learning. It is desirable to have a universal deep embedding model applicable to various domains of images. However, existing methods mainly rely on training specialist embedding models each of which is applicable to images from a single domain. In this paper, we study an important but unexplored task: how to train a single universal image embedding model to match the performance of several specialists on each specialist's domain. Simply fusing the training data from multiple domains cannot solve this problem because some domains become overfitted sooner when trained together using existing methods. Therefore, we propose to distill the knowledge in multiple specialists into a universal embedding to solve this problem. In contrast to existing embedding distillation methods that distill the absolute distances between images, we transform the absolute distances between images into a probabilistic distribution and minimize the KL-divergence between the distributions of the specialists and the universal embedding. Using several public datasets, we validate that our proposed method accomplishes the goal of universal image embedding.
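
The distillation idea above, turning distances into a probability distribution and matching distributions with KL divergence, can be illustrated as follows. This is a sketch under stated assumptions: the abstract does not say how distances are converted, so the softmax over negative distances, the `temperature` parameter, the function names, and the sample distances are all hypothetical choices made for illustration.

```python
import math


def distances_to_probs(dists, temperature=1.0):
    """Map an anchor's distances to other images into a distribution.

    Assumption: softmax over negative distances, so closer images get
    more probability mass. The max is subtracted for numerical stability.
    """
    logits = [-d / temperature for d in dists]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical distances from one anchor image to three other images.
specialist_dists = [0.2, 1.0, 2.5]  # teacher (specialist) embedding
universal_dists = [0.5, 0.9, 2.0]   # student (universal) embedding

loss = kl_divergence(
    distances_to_probs(specialist_dists),
    distances_to_probs(universal_dists),
)
print(loss)
```

Matching distributions rather than absolute distances means the universal embedding only has to reproduce each specialist's relative neighborhood structure, not its distance scale.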

preprint2020arXiv

Universal Model for Multi-Domain Medical Image Retrieval

Medical Image Retrieval (MIR) helps doctors quickly find similar patients' data, which can considerably aid the diagnosis process. MIR is becoming increasingly helpful due to the wide use of digital imaging modalities and the growth of the medical image repositories. However, the popularity of various digital imaging modalities in hospitals also poses several challenges to MIR. Usually, one image retrieval model is only trained to handle images from one modality or one source. When medical images need to be retrieved from several sources or domains, multiple retrieval models must be maintained, which is not cost-effective. In this paper, we study an important but unexplored task: how to train one MIR model that is applicable to medical images from multiple domains? Simply fusing the training data from multiple domains cannot solve this problem because some domains become over-fit sooner when trained together using existing methods. Therefore, we propose to distill the knowledge in multiple specialist MIR models into a single multi-domain MIR model via universal embedding to solve this problem. Using skin disease, x-ray, and retina image datasets, we validate that our proposed universal model can effectively accomplish multi-domain MIR.

preprint2020arXiv

Video-based Person Re-Identification using Gated Convolutional Recurrent Neural Networks

Deep neural networks have been successfully applied to solving the video-based person re-identification problem with impressive results reported. The existing networks for person re-id are designed to extract discriminative features that preserve the identity information. Usually, whole video frames are fed into the neural networks and all the regions in a frame are equally treated. This may be a suboptimal choice because many regions, e.g., background regions in the video, are not related to the person. Furthermore, the person of interest may be occluded by another person or something else. These unrelated regions may hinder person re-identification. In this paper, we introduce a novel gating mechanism to deep neural networks. Our gating mechanism will learn which regions are helpful for person re-identification and let these regions pass the gate. The unrelated background regions or occluding regions are filtered out by the gate. In each frame, the color channels and optical flow channels provide quite different information. To better leverage such information, we generate one gate using the color channels and another gate using the optical flow channels. These two gates are combined to provide a more reliable gate with a novel fusion method. Experimental results on two major datasets demonstrate the performance improvements due to the proposed gating mechanism.

preprint2016arXiv

Community detection with nodal information

Community detection is one of the fundamental problems in the study of network data. Most existing community detection approaches only consider edge information as inputs, and the output could be suboptimal when nodal information is available. In such cases, it is desirable to leverage nodal information for the improvement of community detection accuracy. Towards this goal, we propose a flexible network model incorporating nodal information, and develop likelihood-based inference methods. For the proposed methods, we establish favorable asymptotic properties as well as efficient algorithms for computation. Numerical experiments show the effectiveness of our methods in utilizing nodal information across a variety of simulated and real network data sets.

preprint2016arXiv

Do They All Look the Same? Deciphering Chinese, Japanese and Koreans by Fine-Grained Deep Learning

We study to what extent Chinese, Japanese and Korean faces can be classified and which facial attributes offer the most important cues. First, we propose a novel way of obtaining large numbers of facial images with nationality labels. Then we train state-of-the-art neural networks with these labeled images. We are able to achieve an accuracy of 75.03% in the classification task, with chance being 33.33% and human accuracy 38.89%. Further, we train multiple facial attribute classifiers to identify the most distinctive features for each group. We find that Chinese, Japanese and Koreans do exhibit substantial differences in certain attributes, such as bangs, smiling, and bushy eyebrows. Along the way, we uncover several gender-related cross-country patterns as well. Our work, which complements existing APIs such as Microsoft Cognitive Services and Face++, could find potential applications in tourism, e-commerce, social media marketing, criminal justice and even counter-terrorism.

preprint2016arXiv

Experimental Observation of the Quantum Anomalous Hall Effect in a Magnetic Topological Insulator

The quantized version of the anomalous Hall effect has been predicted to occur in magnetic topological insulators, but the experimental realization has been challenging. Here, we report the observation of the quantum anomalous Hall (QAH) effect in thin films of Cr-doped (Bi,Sb)2Te3, a magnetic topological insulator. At zero magnetic field, the gate-tuned anomalous Hall resistance reaches the predicted quantized value of h/e^2, accompanied by a considerable drop of the longitudinal resistance. Under a strong magnetic field, the longitudinal resistance vanishes whereas the Hall resistance remains at the quantized value. The realization of the QAH effect may lead to the development of low-power-consumption electronics.
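
For a sense of scale, the quantized value h/e^2 mentioned above can be evaluated directly from the exact SI values of the Planck constant and the elementary charge. The short calculation below is only a worked-arithmetic aid with hypothetical variable names; it is not taken from the paper.

```python
# Exact SI-defined constants (2019 redefinition).
PLANCK_H = 6.62607015e-34        # Planck constant, in J*s
ELEMENTARY_E = 1.602176634e-19   # elementary charge, in C

# Quantized Hall resistance h/e^2, in ohms.
quantum_resistance_ohm = PLANCK_H / ELEMENTARY_E ** 2
print(quantum_resistance_ohm)
```

The result is about 25.8 kΩ, the resistance scale at which the Hall plateau is expected to sit.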

preprint2016arXiv

Gender Politics in the 2016 U.S. Presidential Election: A Computer Vision Approach

Gender is playing an important role in the 2016 U.S. presidential election, especially with Hillary Clinton becoming the first female presidential nominee and Donald Trump being frequently accused of sexism. In this paper, we introduce computer vision to the study of gender politics and present an image-driven method that can measure the effects of gender in an accurate and timely manner. We first collect all the profile images of the candidates' Twitter followers. Then we train a convolutional neural network using images that contain gender labels. Lastly, we classify all the follower and unfollower images. Through two case studies, one on the `woman card' controversy and one on Sanders followers, we demonstrate how gender is informing the 2016 presidential election. Our framework of analysis can be readily generalized to other case studies and elections.

preprint2016arXiv

Model Selection for High Dimensional Quadratic Regression via Regularization

Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand high computational cost and hence are not feasible for high dimensional data. This paper focuses on scalable regularization methods for model selection in high dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called Regularization Algorithm under Marginality Principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate the performance of the methods.

preprint2016arXiv

Post Selection Shrinkage Estimation for High Dimensional Data Analysis

In high-dimensional data settings where $p\gg n$, many penalized regularization approaches were studied for simultaneous variable selection and estimation. However, with the existence of covariates with weak effect, many existing variable selection methods, including Lasso and its variants, cannot distinguish covariates with weak contributions from those with none. Thus, prediction based on a subset model of selected covariates only can be inefficient. In this paper, we propose a post selection shrinkage estimation strategy to improve the prediction performance of a selected subset model. Such a post selection shrinkage estimator (PSE) is data-adaptive and constructed by shrinking a post selection weighted ridge estimator in the direction of a selected candidate subset. Under an asymptotic distributional quadratic risk criterion, its prediction performance is explored analytically. We show that the proposed post selection PSE performs better than the post selection weighted ridge estimator. More importantly, it improves the prediction performance of any candidate subset model selected from most existing Lasso-type variable selection methods significantly. The relative performance of the post selection PSE is demonstrated by both simulation studies and real data analysis.

preprint2016arXiv

Pricing the Woman Card: Gender Politics between Hillary Clinton and Donald Trump

In this paper, we propose a data-driven method to measure the impact of the 'woman card' exchange between Hillary Clinton and Donald Trump. Building from a unique dataset of the two candidates' Twitter followers, we first examine the transition dynamics of the two candidates' Twitter followers one week before the exchange and one week after. Then we train a convolutional neural network to classify the gender of the followers and unfollowers, and study how women in particular are reacting to the 'woman card' exchange. Our study suggests that the 'woman card' comment has made women more likely to follow Hillary Clinton, less likely to unfollow her and that it has apparently not affected the gender composition of Trump followers.

preprint2016arXiv

Tactics and Tallies: Inferring Voter Preferences in the 2016 U.S. Presidential Primaries Using Sparse Learning

In this paper, we propose a web-centered framework to infer voter preferences for the 2016 U.S. presidential primaries. Using Twitter data collected from Sept. 2015 to March 2016, we first uncover the tweeting tactics of the candidates and then exploit the variations in the number of 'likes' to infer voters' preference. With sparse learning, we are able to reveal neutral topics as well as positive and negative ones. Methodologically, we are able to achieve a higher predictive power with sparse learning. Substantively, we show that for Hillary Clinton the (only) positive issue area is women's rights. We demonstrate that Hillary Clinton's tactic of linking herself to President Obama resonates well with her supporters but the same is not true for Bernie Sanders. In addition, we show that Donald Trump is a major topic for all the other candidates, and that the women's rights issue is equally emphasized in Sanders' campaign as in Clinton's.

preprint2016arXiv

When Do Luxury Cars Hit the Road? Findings by A Big Data Approach

In this paper, we focus on studying the appearing time of different kinds of cars on the road. This information will enable us to infer the life style of the car owners. The results can further be used to guide marketing towards car owners. Conventionally, this kind of study is carried out by sending out questionnaires, which is limited in scale and diversity. To solve this problem, we propose a fully automatic method to carry out this study. Our study is based on publicly available surveillance camera data. To make the results reliable, we only use the high resolution cameras (i.e., resolution greater than $1280 \times 720$). Images from the public cameras are downloaded every minute. After obtaining 50,000 images, we apply Faster R-CNN (region-based convolutional neural network) to detect the cars in the downloaded images and a fine-tuned VGG16 model is used to recognize the car makes. Based on the recognition results, we present a data-driven analysis on the relationship between car makes and their appearing times, with implications on lifestyles.

preprint2016arXiv

Will Sanders Supporters Jump Ship for Trump? Fine-grained Analysis of Twitter Followers

In this paper, we study the likelihood of Bernie Sanders supporters voting for Donald Trump instead of Hillary Clinton. Building from a unique time-series dataset of the three candidates' Twitter followers, which we make public here, we first study the proportion of Sanders followers who simultaneously follow Trump (but not Clinton) and how this evolves over time. Then we train a convolutional neural network to classify the gender of Sanders followers, and study whether men are more likely to jump ship for Trump than women. Our study shows that between March and May an increasing proportion of Sanders followers are following Trump (but not Clinton). The proportion of Sanders followers who follow Clinton but not Trump has actually decreased. Equally important, our study suggests that the jumping ship behavior will be affected by gender and that men are more likely to switch to Trump than women.

preprint2015arXiv

Bayesian quantile regression with approximate likelihood

Quantile regression is often used when a comprehensive relationship between a response variable and one or more explanatory variables is desired. The traditional frequentists' approach to quantile regression has been well developed around asymptotic theories and efficient algorithms. However, not much work has been published under the Bayesian framework. One challenging problem for Bayesian quantile regression is that the full likelihood has no parametric forms. In this paper, we propose a Bayesian quantile regression method, the linearly interpolated density (LID) method, which uses a linear interpolation of the quantiles to approximate the likelihood. Unlike most of the existing methods that aim at tackling one quantile at a time, our proposed method estimates the joint posterior distribution of multiple quantiles, leading to higher global efficiency for all quantiles of interest. Markov chain Monte Carlo algorithms are developed to carry out the proposed method. We provide convergence results that justify both the algorithmic convergence and statistical approximations to an integrated-likelihood-based posterior. From the simulation results, we verify that LID has a clear advantage over other existing methods in estimating quantities that relate to two or more quantiles.

preprint2015arXiv

Disentangling the magnetoelectric and thermoelectric transport in topological insulator thin films

We report transport studies on (Bi,Sb)2Te3 topological insulator thin films with tunable electronic band structure. We find a doping and temperature regime in which the Hall coefficient is negative indicative of electron-type carriers, whereas the Seebeck coefficient is positive indicative of hole-type carriers. This sign anomaly is due to the distinct transport behaviors of the bulk and surface states: the surface Dirac fermions dominate magnetoelectric transport while the thermoelectric effect is mainly determined by the bulk states. These findings may inspire new ideas for designing topological insulator-based high efficiency thermoelectric devices.

preprint2015arXiv

Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing.

preprint2015arXiv

How Many Communities Are There?

Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution; however, for stochastic blockmodels, the conditional independence assumption given the communities of the endpoints among different edges is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data. Supplementary materials containing the relevant computer code are available online.

preprint2015arXiv

Neyman-Pearson Classification under High-Dimensional Settings

Most existing binary classification methods target the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user-specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we used a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrated their numerical advantages in prioritized error control via both simulation and real data studies.

preprint2015arXiv

Observation of the zero Hall plateau in a quantum anomalous Hall insulator

The quantum anomalous Hall (QAH) effect in a magnetic topological insulator (TI) is a novel transport phenomenon in which the Hall resistance reaches the quantum plateau in the absence of an external magnetic field. Recently, this exotic effect has been discovered experimentally in an ultrathin film of the Bi2Te3 family TI with spontaneous ferromagnetic (FM) order. An important question concerning the QAH state is whether it is simply a zero-magnetic-field version of the quantum Hall (QH) effect, or if there is new physics beyond the conventional paradigm. Here we report experimental investigations on the quantum phase transition between the two opposite Hall plateaus of a QAH insulator caused by magnetization reversal. We observe a well-defined plateau with zero Hall conductivity over a range of magnetic field around coercivity, consistent with a recent theoretical prediction. The features of the zero Hall plateau are shown to be closely related to those of the QAH effect, but its temperature evolution exhibits quantitative differences from the network model for the conventional QH plateau transition. We propose that the chiral edge states residing at the magnetic domain boundaries, which are unique to a QAH insulator, are responsible for the zero Hall plateau. The rich magnetic domain dynamics makes the QAH effect a distinctive class of quantum phenomenon that may find novel applications in spintronics.

preprint2015arXiv

Simultaneous electrical-field-effect modulation of both top and bottom Dirac surface states of epitaxial thin films of three-dimensional topological insulators

It is crucial for the studies of the transport properties and quantum effects related to Dirac surface states of three-dimensional topological insulators (3D TIs) to be able to simultaneously tune the chemical potentials of both top and bottom surfaces of a 3D TI thin film. We have realized this in molecular beam epitaxy-grown thin films of 3D TIs, as well as magnetic 3D TIs, by fabricating dual-gate structures on them. The films could be tuned between n-type and p-type by each gate alone. Combined application of two gates can reduce the carrier density of a TI film to a much lower level than with only one of them and enhance the film resistance by 10,000%, implying that the Fermi level is tuned very close to the Dirac points of both top and bottom surface states without crossing any bulk band. The result promises applications of 3D TIs in field effect devices.

preprint2014arXiv

Electrically tuned magnetic order and magnetoresistance in a topological insulator

The Dirac-like surface states of the topological insulators (TIs) are protected by time reversal symmetry (TRS) and exhibit a host of novel properties. Introducing magnetism into TI, which breaks the TRS, is expected to create exotic topological magnetoelectric effects. A particularly intriguing phenomenon in this case is the magnetic field dependence of electrical resistance, or magnetoresistance (MR). The intricate interplay between topological protection and broken-TRS may lead to highly unconventional MR behaviour that can find unique applications in magnetic sensing and data storage. However, so far the MR of TI with spontaneously broken TRS is still poorly understood, mainly due to the lack of well-controlled experiments. In this work, we investigate the magneto transport properties of a ferromagnetic TI thin film fabricated into a field effect transistor device. We observe an unusually complex evolution of MR when the Fermi level (EF) is tuned across the Dirac point (DP) by gate voltage. In particular, MR tends to be positive when EF lies close to the DP but becomes negative at higher energies. This trend is opposite to that expected from the Berry phase picture for localization, but is intimately correlated with the gate-tuned magnetic order. We show that the underlying physics is the competition between the topology-induced weak antilocalization and magnetism-induced negative MR. The simultaneous electrical control of magnetic order and magneto transport facilitates future TI-based spintronic devices.

preprint2014arXiv

Model Selection in High-Dimensional Misspecified Models

Model selection is indispensable to high-dimensional sparse modeling in selecting the best set of covariates among a sequence of candidate models. Most existing work assumes implicitly that the model is correctly specified or of fixed dimensions. Yet model misspecification and high dimensionality are common in real applications. In this paper, we investigate two classical Kullback-Leibler divergence and Bayesian principles of model selection in the setting of high-dimensional misspecified models. Asymptotic expansions of these principles reveal that the effect of model misspecification is crucial and should be taken into account, leading to the generalized AIC and generalized BIC in high dimensions. With a natural choice of prior probabilities, we suggest the generalized BIC with prior probability which involves a logarithmic factor of the dimensionality in penalizing model complexity. We further establish the consistency of the covariance contrast matrix estimator in a general setting. Our results and new method are supported by numerical studies.

preprint2013arXiv

APPLE: Approximate Path for Penalized Likelihood Estimators

In high-dimensional data analysis, penalized likelihood estimators are shown to provide superior results in both variable selection and parameter estimation. A new algorithm, APPLE, is proposed for calculating the Approximate Path for Penalized Likelihood Estimators. Both the convex penalty (such as LASSO) and the nonconvex penalty (such as SCAD and MCP) cases are considered. The APPLE efficiently computes the solution path for the penalized likelihood estimator using a hybrid of the modified predictor-corrector method and the coordinate-descent algorithm. APPLE is compared with several well-known packages via simulation and analysis of two gene expression data sets.

preprint2013arXiv

Functional and Parametric Estimation in a Semi- and Nonparametric Model with Application to Mass-Spectrometry Data

Motivated by modeling and analysis of mass-spectrometry data, a semi- and nonparametric model is proposed that consists of a linear parametric component for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators are consistent and asymptotically normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.

preprint2013arXiv

Likelihood Adaptively Modified Penalties

A new family of penalty functions, adaptive to likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study stability properties of the penalized maximum likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including the parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed method has competitive performance in comparison with existing ones.

preprint2013arXiv

Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models

In this paper, for Lasso penalized linear regression models in high-dimensional settings, we propose a modified cross-validation method for selecting the penalty parameter. The methodology is extended to other penalties, such as Elastic Net. We conduct extensive simulation studies and real data analysis to compare the performance of the modified cross-validation method with other methods. It is shown that the popular $K$-fold cross-validation method includes many noise variables in the selected model, while the modified cross-validation works well in a wide range of coefficient and correlation settings. Supplemental materials containing the computer code are available online.

preprint2013arXiv

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Recently many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is no guideline on the number of folds for conducting cross-validation and there is no comparison between cross-validation and the methods based on bootstrap. Through extensive simulations, we suggest 10-fold cross-validation (nine-tenths for training and one-tenth for validation) be appropriate when the estimation accuracy is measured in the Frobenius norm, while 2-fold cross-validation (half for training and half for validation) or reverse 3-fold cross-validation (one-third for training and two-thirds for validation) be appropriate in the operator norm. We also suggest the "optimal" cross-validation be more appropriate than the methods based on bootstrap for both types of norm.
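The fold schemes named in this abstract differ only in how much of the sample goes to training. A minimal Python sketch of the three splits, assuming a plain contiguous partition of sample indices; the function name `cv_splits` and its `reverse` flag are illustrative and do not come from the paper:

```python
def cv_splits(n, k, reverse=False):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Standard k-fold: validate on one fold, train on the other k - 1.
    Reverse k-fold (hypothetical flag): train on one fold, validate on the rest.
    """
    indices = list(range(n))
    # Integer fold boundaries, so fold sizes differ by at most one.
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        held = indices[bounds[i]:bounds[i + 1]]
        rest = indices[:bounds[i]] + indices[bounds[i + 1]:]
        # In reverse mode the single held fold becomes the training set.
        yield (held, rest) if reverse else (rest, held)
```

With 30 samples, `cv_splits(30, 10)` trains on 27 and validates on 3 in each round, `cv_splits(30, 2)` uses 15 and 15, and `cv_splits(30, 3, reverse=True)` trains on 10 and validates on 20.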

preprint2011arXiv

A ROAD to Classification in High Dimensional Space

For high-dimensional classification, it is well known that naively performing the Fisher discriminant rule leads to poor results due to diverging spectra and noise accumulation. Therefore, researchers proposed independence rules to circumvent the diverse spectra, and sparse independence rules to mitigate the issue of noise accumulation. However, in biological applications, there are often a group of correlated genes responsible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification rates. The extent of such error rate reductions is unveiled by comparing the misclassification rates of the Fisher discriminant rule and the independence rule. To materialize the gain based on finite samples, a Regularized Optimal Affine Discriminant (ROAD) is proposed based on a covariance penalty. ROAD selects an increasing number of features as the penalization relaxes. Further benefits can be achieved when a screening method is employed to narrow the feature pool before hitting the ROAD. An efficient Constrained Coordinate Descent algorithm (CCD) is also developed to solve the associated optimization problems. Sampling properties of oracle type are established. Simulation studies and real data analysis support our theoretical results and demonstrate the advantages of the new classification procedure under a variety of correlation structures. A delicate result on continuous piecewise linear solution path for the ROAD optimization problem at the population level justifies the linear interpolation of the CCD algorithm.

preprint2011arXiv

Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models

A variable screening procedure via correlation learning was proposed by Fan and Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening is called NIS, a specific member of the sure independence screening. Several closely related variable screening procedures are proposed. Under the nonparametric additive models, it is shown that under some mild technical conditions, the proposed independence screening methods enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, an iterative nonparametric independence screening (INIS) is also proposed to enhance the finite sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.
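The correlation-learning step that this abstract starts from can be sketched as a simple ranking of covariates by their marginal correlation with the response. The Python below is only an illustration of that linear baseline, not of NIS itself, which replaces the correlation with a marginal nonparametric fit; the function name `marginal_screen` and the choice of absolute Pearson correlation as the ranking score are assumptions made for the example:

```python
import math

def marginal_screen(X, y, d):
    """Return the indices of the d columns of X most correlated with y.

    X is a list of rows and y a list of responses. Columns are ranked by the
    absolute Pearson correlation with y (a linear marginal utility).
    """
    n = len(y)
    y_mean = sum(y) / n
    y_centered = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y_centered))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_centered = [v - c_mean for v in col]
        c_norm = math.sqrt(sum(v * v for v in c_centered))
        if c_norm == 0 or y_norm == 0:
            corr = 0.0  # a constant column or response carries no signal
        else:
            dot = sum(a * b for a, b in zip(c_centered, y_centered))
            corr = dot / (c_norm * y_norm)
        scores.append((abs(corr), j))
    # Keep the d largest scores and report them in column order.
    top = sorted(scores, reverse=True)[:d]
    return sorted(j for _, j in top)
```

Screening of this kind only narrows the candidate set; a penalized fit on the retained columns would normally follow.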

preprint2010arXiv

High-dimensional variable selection for Cox's proportional hazards model

Variable selection in high dimensional space has challenged many contemporary statistical problems from many frontiers of scientific disciplines. Recent technological advances have made it possible to collect a huge amount of covariate information such as microarray, proteomic and SNP data via bioimaging technology while observing survival information on patients in clinical studies. Thus, the same challenge applies to the survival analysis in order to understand the association between genomics information and clinical information about the survival time. In this work, we extend the sure screening procedure of Fan and Lv (2008) to Cox's proportional hazards model with an iterative version available. Numerical simulation studies have shown encouraging performance of the proposed method in comparison with other techniques such as LASSO. This demonstrates the utility and versatility of the iterative sure independent screening scheme.

preprint2010arXiv

Nonparametric estimation of genewise variance for microarray data

Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we propose two novel nonparametric estimators for the genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated using data from the Microarray Quality Control (MAQC) project.

61 published items

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

Large Language Models for Unit Test Generation: Achievements, Challenges, and Opportunities

AI-enabled Automatic Multimodal Fusion of Cone-Beam CT and Intraoral Scans for Intelligent 3D Tooth-Bone Reconstruction and Clinical Applications

Direct visualization of percolating metal-insulator transition in V2O3 using scanning microwave impedance microscopy

Gaussian Multi-head Attention for Simultaneous Machine Translation

GReS: Graphical Cross-domain Recommendation for Supply Chain Platform

Influences of the dissipative topological edge state on quantized transport in MnBi2Te4

Mental Health Assessment for the Chatbots

Modeling Dual Read/Write Paths for Simultaneous Machine Translation

Neural Machine Translation with Phrase-Level Universal Visual Representations

One Reference Is Not Enough: Diverse Distillation with Reference Selection for Non-Autoregressive Translation

Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation

Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

Relational Surrogate Loss Learning

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Transfer Learning under High-dimensional Generalized Linear Models

Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy

Gate Tunable Supercurrent in Josephson Junctions Based on Bi2Te3 Topological Insulator Thin Films

Layout and Image Recognition Driving Cross-Platform Automated Mobile Testing

Learning to Select Context in a Hierarchical and Global Perspective for Open-domain Dialogue Generation

PyART: Python API Recommendation in Real-Time

RaSE: A Variable Screening Framework via Random Subspace Ensembles

The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

CDL: Curriculum Dual Learning for Emotion-Controllable Response Generation

DeepGini: Prioritizing Massive Tests to Enhance the Robustness of Deep Neural Networks

Nested Model Averaging on Solution Path for High-dimensional Linear Regression

Neyman-Pearson classification: parametrics and sample size requirement

Towards Multimodal Response Generation with Exemplar Augmentation and Curriculum Optimization

Unifying Specialist Image Embedding into Universal Image Embedding

Universal Model for Multi-Domain Medical Image Retrieval

Video-based Person Re-Identification using Gated Convolutional Recurrent Neural Networks

Community detection with nodal information

Do They All Look the Same? Deciphering Chinese, Japanese and Koreans by Fine-Grained Deep Learning

Experimental Observation of the Quantum Anomalous Hall Effect in a Magnetic Topological Insulator

Gender Politics in the 2016 U.S. Presidential Election: A Computer Vision Approach

Model Selection for High Dimensional Quadratic Regression via Regularization

Post Selection Shrinkage Estimation for High Dimensional Data Analysis

Pricing the Woman Card: Gender Politics between Hillary Clinton and Donald Trump

Tactics and Tallies: Inferring Voter Preferences in the 2016 U.S. Presidential Primaries Using Sparse Learning

When Do Luxury Cars Hit the Road? Findings by A Big Data Approach

Will Sanders Supporters Jump Ship for Trump? Fine-grained Analysis of Twitter Followers

Bayesian quantile regression with approximate likelihood

Disentangling the magnetoelectric and thermoelectric transport in topological insulator thin films

Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

How Many Communities Are There?

Neyman-Pearson Classification under High-Dimensional Settings

Observation of the zero Hall plateau in a quantum anomalous Hall insulator

Simultaneous electrical-field-effect modulation of both top and bottom Dirac surface states of epitaxial thin films of three-dimensional topological insulators

Electrically tuned magnetic order and magnetoresistance in a topological insulator

Model Selection in High-Dimensional Misspecified Models

APPLE: Approximate Path for Penalized Likelihood Estimators

Functional and Parametric Estimation in a Semi- and Nonparametric Model with Application to Mass-Spectrometry Data

Likelihood Adaptively Modified Penalties

Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

A ROAD to Classification in High Dimensional Space

Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models

High-dimensional variable selection for Cox's proportional hazards model

Nonparametric estimation of genewise variance for microarray data