Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
43 works
0 followers
25 topics
4 close collaborators

Published work

43 published item(s)

preprint · 2026 · arXiv

Forbidden second harmonics in centrosymmetric bilayer crystals

Optical spectroscopy based on second-order nonlinearity is a critical technique for characterizing two-dimensional (2D) crystals as well as bioimaging and quantum optics. It is generally believed that second-harmonic generation (SHG) in centrosymmetric crystals, such as graphene and other bilayer 2D crystals, is negligible without externally breaking the inversion symmetry. Here, we show that with a new homodyne detection technique, we can apparently circumvent this symmetry-imposed constraint and observe robust SHG in pristine centrosymmetric crystals, without any symmetry-breaking field. With its exceptional sensitivity, we resolve polarization-resolved SHG in bilayer hexagonal boron nitride (h-BN), bilayer 2H-WSe$_2$, and remarkably, Bernal-stacked bilayer graphene, allowing us to unambiguously identify the crystallographic orientation in these crystals via SHG for the first time. We also demonstrate that the new technique can be used to non-invasively detect uniaxial strain and optical geometric phase in these crystals. The observed SHG in our experiments is attributed to second-order nonlinearity in the quadrupole channel, which is controlled by the presence of the $C_2$ symmetry instead of the inversion symmetry. Our new technique expands the capability of nonlinear optical spectroscopy to encompass a large class of centrosymmetric materials that could never be measured before, and can be used for quantum sensing of moiré materials and twisted epitaxial films.

preprint · 2026 · arXiv

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed "code-like" safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.

preprint · 2026 · arXiv

Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.
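As a rough illustration of the algorithm this abstract names, and not of the paper's transformer construction, the sketch below runs normalized gradient descent on a logistic loss over a tiny linearly separable dataset. The function names, step size, step count, and toy data are all assumptions made for the example.

```python
import math

def logistic_loss_grad(w, xs, ys):
    """Gradient of the mean logistic loss log(1 + exp(-y * <w, x>)), labels in {-1, +1}."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
        coeff = -y / (1.0 + math.exp(margin))
        for j in range(len(w)):
            grad[j] += coeff * x[j] / len(xs)
    return grad

def normalized_gd(xs, ys, steps=20, lr=0.5):
    """Each step moves a fixed distance lr along the negative gradient direction."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = logistic_loss_grad(w, xs, ys)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # stationary point: no direction to move in
        w = [wi - lr * gi / norm for wi, gi in zip(w, g)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy, hypothetical data: two points per class, separable by the direction (1, 1).
xs = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [1, 1, -1, -1]
w = normalized_gd(xs, ys)
```

Dividing by the gradient norm keeps the step length constant even as the logistic gradient shrinks on well-classified data, which is the property that distinguishes this update from plain gradient descent.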

preprint · 2022 · arXiv

Benign Overfitting in Two-layer Convolutional Neural Networks

Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged studying "benign overfitting" from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.

preprint · 2022 · arXiv

Building Machine Translation Systems for the Next Thousand Languages

In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.

preprint · 2022 · arXiv

Description-Driven Task-Oriented Dialog Modeling

Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al.,2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks.

preprint · 2022 · arXiv

Micius quantum experiments in space

Quantum theory has been successfully validated in numerous laboratory experiments. But would such a theory, which excellently describes the behavior of microscopic physical systems, and its predicted phenomena such as quantum entanglement, still be applicable on very large length scales? From a practical perspective, how can quantum key distribution, where the security of establishing secret keys between distant parties is ensured by the laws of quantum mechanics, be made technologically useful on a global scale? Due to photon loss in optical fibers and terrestrial free space, the achievable distance using direct transmission of single photons has been limited to a few hundred kilometers. A promising route to testing quantum physics over long distances and in the relativistic regimes, and thus realizing flexible global-scale quantum networks, is via the use of satellites and space-based technologies, where a significant advantage is that the photon loss and turbulence predominantly occur in the lower ~10 km of the atmosphere, and most of the photons' transmission path in space is virtually in vacuum with almost zero absorption and decoherence. In this Article, we review the progress in free-space quantum experiments, with a focus on the fast-developing Micius satellite-based quantum communications. The prospects for space-ground integrated quantum networks and fundamental quantum optics experiments in space conceivable with satellites are discussed.

preprint · 2022 · arXiv

Multilingual Mix: Example Interpolation Improves Multilingual Neural Machine Translation

Multilingual neural machine translation models are trained to maximize the likelihood of a mix of examples drawn from multiple language pairs. The dominant inductive bias applied to these models is a shared vocabulary and a shared set of parameters across languages; the inputs and labels corresponding to examples drawn from different language pairs might still reside in distinct sub-spaces. In this paper, we introduce multilingual crossover encoder-decoder (mXEncDec) to fuse language pairs at an instance level. Our approach interpolates instances from different language pairs into joint 'crossover examples' in order to encourage sharing input and output spaces across languages. To ensure better fusion of examples in multilingual settings, we propose several techniques to improve example interpolation across dissimilar languages under heavy data imbalance. Experiments on a large-scale WMT multilingual dataset demonstrate that our approach significantly improves quality on English-to-Many, Many-to-English and zero-shot translation tasks (from +0.5 BLEU up to +5.5 BLEU points). Results on code-switching sets demonstrate the capability of our approach to improve model generalization to out-of-distribution multilingual examples. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.

preprint · 2022 · arXiv

Portable ground stations for space-to-ground quantum key distribution

Quantum key distribution (QKD) uses the fundamental principles of quantum mechanics to share unconditionally secure keys between distant users. Previous works based on the quantum science satellite "Micius" have initially demonstrated the feasibility of a global QKD network. However, the practical applications of space-based QKD still face many technical problems, such as the huge size and weight of ground stations required to receive quantum signals. Here, we report space-to-ground QKD demonstrations based on portable receiving ground stations. The portable ground station weighs less than 100 kg, occupies less than 1 m$^{3}$, and can be installed in no more than 12 hours; its weight, required space, and deployment time are all about two orders of magnitude lower than those of the previous systems. Moreover, the equipment is easy to handle and can be placed on the roof of buildings in a metropolis. Secure keys have been successfully generated from the "Micius" satellite to these portable ground stations at six different places in China, and an average final secure key length of around 50 kb can be obtained during one passage. Our results pave the way for, and greatly accelerate the practical application of, space-based QKD.

preprint · 2022 · arXiv

Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures

Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.

preprint · 2022 · arXiv

SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X - a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, measured by joint goal accuracy and a novel metric for measuring schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness.

preprint · 2022 · arXiv

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.

preprint · 2022 · arXiv

Spin Manipulation by Giant Valley-Zeeman Spin-Orbit Field in Atom-Thick WSe2

The phenomenon originating from spin-orbit coupling (SOC) provides energy-efficient strategies for spin manipulation and device applications. The broken inversion symmetry interface and resulting electric field induce a Rashba-type spin-orbit field (SOF), which has been demonstrated to generate spin-orbit torque for data storage applications. In this study, we found that spin flipping can be achieved by the valley-Zeeman SOF in monolayer WSe2 at room temperature, which manifests as a negative magnetoresistance in the vertical spin valve. Quantum transmission calculations based on an effective model near the K valley of WSe2 confirm the precessional spin transport of carriers under the giant SOF, which is estimated to be 650 T. In particular, the valley-Zeeman SOF-induced spin dynamics was demonstrated to be tunable with the layer number and stacking phase of WSe2 as well as the gate voltage, which provides a novel strategy for spin manipulation and can benefit the development of ultralow-power spintronic devices.

preprint · 2022 · arXiv

The geometry of integration in text classification RNNs

Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.

preprint · 2022 · arXiv

Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning

Achieving universal translation between all human language pairs is the holy-grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is forbiddingly limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation on several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.

preprint · 2022 · arXiv

Unsupervised Slot Schema Induction for Task-oriented Dialog

Carefully-designed schemas describing how to collect and annotate dialog corpora are a prerequisite towards building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To alleviate this expensive and time consuming process, we propose an unsupervised approach for slot schema induction from unlabeled dialog corpora. Leveraging in-domain language models and unsupervised parsing structures, our data-driven approach extracts candidate slots without constraints, followed by coarse-to-fine clustering to induce slot types. We compare our method against several strong supervised baselines, and show significant performance improvement in slot schema induction on MultiWoz and SGD datasets. We also demonstrate the effectiveness of induced schemas on downstream applications including dialog state tracking and response generation.

preprint · 2021 · arXiv

Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

We analyze the properties of gradient descent on convex surrogates for the zero-one loss for the agnostic learning of linear halfspaces. If $\mathsf{OPT}$ is the best classification error achieved by a halfspace, by appealing to the notion of soft margins we are able to show that gradient descent finds halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity for a broad class of distributions that includes log-concave isotropic distributions as a subclass. Along the way we answer a question recently posed by Ji et al. (2020) on how the tail behavior of a loss function can affect sample complexity and runtime guarantees for gradient descent.

preprint · 2021 · arXiv

Benign Overfitting in Adversarially Robust Linear Classification

"Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.

preprint · 2021 · arXiv

Echo State Speech Recognition

We propose automatic speech recognition (ASR) models inspired by the echo state network (ESN), in which a subset of recurrent neural network (RNN) layers in the models is randomly initialized and untrained. Our study focuses on RNN-T and Conformer models, and we show that model quality does not drop even when the decoder is fully randomized. Furthermore, such models can be trained more efficiently as the decoders do not need to be updated. By contrast, randomizing encoders hurts model quality, indicating that optimizing encoders and learning proper representations for acoustic inputs are more vital for speech recognition. Overall, we challenge the common practice of training all components of ASR models, and demonstrate that ESN-based models can perform equally well but enable more efficient training and storage than fully-trainable counterparts.

preprint · 2021 · arXiv

Fractional Chern insulators in magic-angle twisted bilayer graphene

Fractional Chern insulators (FCIs) are lattice analogues of fractional quantum Hall states that may provide a new avenue toward manipulating non-abelian excitations. Early theoretical studies have predicted their existence in systems with energetically flat Chern bands and highlighted the critical role of a particular quantum band geometry. Thus far, however, FCI states have only been observed in Bernal-stacked bilayer graphene aligned with hexagonal boron nitride (BLG/hBN), in which a very large magnetic field is responsible for the existence of the Chern bands, precluding the realization of FCIs at zero field and limiting its potential for applications. By contrast, magic angle twisted bilayer graphene (MATBG) supports flat Chern bands at zero magnetic field, and therefore offers a promising route toward stabilizing zero-field FCIs. Here we report the observation of eight FCI states at low magnetic field in MATBG enabled by high-resolution local compressibility measurements. The first of these states emerge at 5 T, and their appearance is accompanied by the simultaneous disappearance of nearby topologically-trivial charge density wave states. Unlike the BLG/hBN platform, we demonstrate that the principal role of the weak magnetic field here is merely to redistribute the Berry curvature of the native Chern bands and thereby realize a quantum band geometry favorable for the emergence of FCIs. Our findings strongly suggest that FCIs may be realized at zero magnetic field and pave the way for the exploration and manipulation of anyonic excitations in moiré systems with native flat Chern bands.

preprint · 2021 · arXiv

High-Temperature Structure Detection in Ferromagnets

This paper studies structure detection problems in high temperature ferromagnetic (positive interaction only) Ising models. The goal is to distinguish whether the underlying graph is empty, i.e., the model consists of independent Rademacher variables, versus the alternative that the underlying graph contains a subgraph of a certain structure. We give matching upper and lower minimax bounds under which testing this problem is possible/impossible respectively. Our results reveal that a key quantity called graph arboricity drives the testability of the problem. On the computational front, under a conjecture of the computational hardness of sparse principal component analysis, we prove that, unless the signal is strong enough, there are no polynomial time tests which are capable of testing this problem. In order to prove this result we exhibit a way to give sharp inequalities for the even moments of sums of i.i.d. Rademacher random variables which may be of independent interest.

preprint · 2021 · arXiv

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $ε^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. Specifically, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $ε^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.

preprint · 2021 · arXiv

Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise

We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent (SGD) following an arbitrary initialization. We prove that SGD produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave isotropic and hard margin distributions. Equivalently, such networks can generalize when the data distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. To the best of our knowledge, this is the first work to show that overparameterized neural networks trained by SGD can generalize when the data is corrupted with adversarial label noise.

preprint · 2021 · arXiv

Quantum Random Number Generation with Uncharacterized Laser and Sunlight

The entropy or randomness source is an essential ingredient in random number generation. Quantum random number generators generally require well modeled and calibrated light sources, such as a laser, to generate randomness. With uncharacterized light sources, such as sunlight or an uncharacterized laser, genuine randomness is hard to quantify or extract in practice owing to its unknown or complicated structure. We take a recently proposed source-independent randomness generation protocol, modify it theoretically to account for practical issues, and experimentally realize the modified scheme with an uncharacterized laser and a sunlight source. The extracted randomness is guaranteed to be secure independent of its source and the randomness generation speed reaches 1 Mbps, three orders of magnitude higher than the original realization. Our result signifies the power of quantum technology in randomness generation and paves the way to high-speed semi-self-testing quantum random number generators with practical light sources.

preprint · 2020 · arXiv

Agnostic Learning of a Single Neuron with Gradient Descent

We consider the problem of learning the best-fitting single neuron as measured by the expected square loss $\mathbb{E}_{(x,y)\sim \mathcal{D}}[(σ(w^\top x)-y)^2]$ over some unknown joint distribution $\mathcal{D}$ by using gradient descent to minimize the empirical risk induced by a set of i.i.d. samples $S\sim \mathcal{D}^n$. The activation function $σ$ is an arbitrary Lipschitz and non-decreasing function, making the optimization problem nonconvex and nonsmooth in general, and covers typical neural network activation functions and inverse link functions in the generalized linear model setting. In the agnostic PAC learning setting, where no assumption on the relationship between the labels $y$ and the input $x$ is made, if the optimal population risk is $\mathsf{OPT}$, we show that gradient descent achieves population risk $O(\mathsf{OPT})+ε$ in polynomial time and sample complexity when $σ$ is strictly increasing. For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+ε$. When labels take the form $y = σ(v^\top x) + ξ$ for zero-mean sub-Gaussian noise $ξ$, we show that the population risk guarantees for gradient descent improve to $\mathsf{OPT} + ε$. Our sample complexity and runtime guarantees are (almost) dimension independent, and when $σ$ is strictly increasing, require no distributional assumptions beyond boundedness. For ReLU, we show the same results under a nondegeneracy assumption for the marginal distribution of the input.
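A minimal sketch of the setting in this abstract, assuming a sigmoid activation (one example of a Lipschitz, strictly increasing $σ$) and a small noiseless toy dataset: plain gradient descent on the empirical square loss of a single neuron. The names, learning rate, and data below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def empirical_risk(w, xs, ys):
    """Mean square loss (sigma(<w, x>) - y)^2 over the sample."""
    return sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gd_single_neuron(xs, ys, lr=0.5, steps=200):
    """Gradient descent on the empirical risk, started from w = 0."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            s = sigmoid(dot(w, x))
            # Chain rule: d/dw (s - y)^2 = 2 (s - y) * s (1 - s) * x
            coeff = 2.0 * (s - y) * s * (1.0 - s) / len(xs)
            for j in range(len(w)):
                grad[j] += coeff * x[j]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical noiseless labels y = sigma(<v, x>) with v = (1, -1).
v = [1.0, -1.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
ys = [sigmoid(dot(v, x)) for x in xs]
w = gd_single_neuron(xs, ys)
```

The objective is nonconvex in `w` even for this simple model, which is why the guarantees in the abstract are stated relative to the best achievable risk $\mathsf{OPT}$ rather than as exact recovery.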

preprint2020arXiv

An explicit expression for Euclidean self-dual cyclic codes of length $2^k$ over Galois ring ${\rm GR}(4,m)$

For any positive integers $m$ and $k$, existing literature only determines the number of all Euclidean self-dual cyclic codes of length $2^k$ over the Galois ring ${\rm GR}(4,m)$, such as in [Des. Codes Cryptogr. (2012) 63:105--112]. Using properties for Kronecker products of matrices of a specific type and column vectors of these matrices, we give a simple and efficient method to construct all these self-dual cyclic codes precisely. On this basis, we provide an explicit expression to accurately represent all distinct Euclidean self-dual cyclic codes of length $2^k$ over ${\rm GR}(4,m)$, using combination numbers. As an application, we list all distinct Euclidean self-dual cyclic codes over ${\rm GR}(4,m)$ of length $2^k$ explicitly, for $k=4,5,6$.

preprint2020arXiv

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, converge quickly but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. How to close this generalization gap remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called the Partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain as fast a convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners can pick up adaptive gradient methods once again for faster training of deep neural networks.
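
The abstract does not spell out the update rule, so the following is only a sketch under an assumption: that the partial adaptive parameter $p$ replaces the square root in an Amsgrad-style step, giving $θ \leftarrow θ - α\, m / \hat{v}^{\,p}$. Under that assumption, $p = 1/2$ gives the Amsgrad-style update and $p \to 0$ approaches SGD with momentum. The function name, the hyperparameter defaults and the small `eps` term are illustrative choices.

```python
def padam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                   p=0.125, steps=300, eps=1e-8):
    """Partially adaptive momentum step (assumed form, see lead-in).

    grad_fn maps a list of parameters to a list of gradients.
    """
    m = [0.0] * len(theta)      # first-moment (momentum) estimate
    v = [0.0] * len(theta)      # second-moment estimate
    v_hat = [0.0] * len(theta)  # running maximum of v, as in Amsgrad
    for _ in range(steps):
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        v_hat = [max(a, b) for a, b in zip(v_hat, v)]
        # The exponent p in (0, 1/2] controls how adaptive the step is.
        theta = [ti - lr * mi / ((vh + eps) ** p)
                 for ti, mi, vh in zip(theta, m, v_hat)]
    return theta


# Toy usage: minimize f(theta) = sum(theta_i^2), whose gradient is 2 * theta.
theta_star = padam_minimize(lambda th: [2.0 * t for t in th], [3.0, -2.0])
```

On this toy quadratic the iterates approach the minimizer at the origin. The example says nothing about generalization, which is the paper's actual claim.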

preprint2020arXiv

Deep-Learning-Enabled Fast Optical Identification and Characterization of Two-Dimensional Materials

Advanced microscopy and/or spectroscopy tools play an indispensable role in nanoscience and nanotechnology research, as they provide rich information about the growth mechanism, chemical compositions, crystallography, and other important physical and chemical properties. However, the interpretation of imaging data heavily relies on the "intuition" of experienced researchers. As a result, many of the deep graphical features obtained through these tools are often unused because of difficulties in processing the data and finding the correlations. Such challenges can be well addressed by deep learning. In this work, we use the optical characterization of two-dimensional (2D) materials as a case study, and demonstrate a neural-network-based algorithm for the material and thickness identification of exfoliated 2D materials with high prediction accuracy and real-time processing capability. Further analysis shows that the trained network can extract deep graphical features such as contrast, color, edges, shapes, segment sizes and their distributions, based on which we develop an ensemble approach to predict the most relevant physical properties of 2D materials. Finally, a transfer learning technique is applied to adapt the pretrained network to other applications such as identifying layer numbers of a new 2D material, or materials produced by a different synthetic approach. Our artificial-intelligence-based material characterization approach is a powerful tool that would speed up the preparation and initial characterization of 2D materials and other nanomaterials, and potentially accelerate new material discoveries.

preprint2020arXiv

Echo State Neural Machine Translation

We present neural machine translation (NMT) models inspired by echo state networks (ESNs), named Echo State NMT (ESNMT), in which the encoder and decoder layer weights are randomly generated and then fixed throughout training. We show that even with this extremely simple model construction and training procedure, ESNMT can already reach 70-80% of the quality of fully trainable baselines. We examine how the spectral radius of the reservoir, a key quantity that characterizes the model, determines the model behavior. Our findings indicate that randomized networks can work well even for complicated sequence-to-sequence prediction NLP tasks.

preprint2020arXiv

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Evaluation of reconstruction performance illustrates that the new structure does not degrade the model while allowing better interpretability. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.

preprint2020arXiv

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., each phoneme). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples. This is accomplished by discretizing the latent features using vector quantization (VQ), and separately training an autoregressive (AR) prior model over the result. We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes. Experimental results show that the proposed model significantly improves the naturalness in random sample generation. Furthermore, initial experiments demonstrate that randomly sampling from the proposed model can be used as data augmentation to improve the ASR performance.
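
The discretization step mentioned here can be illustrated with a generic nearest-neighbour vector quantizer. This is a sketch of the standard technique rather than the paper's model: the codebook, the latent vectors and the function name are made up for the example, and in a real system the codebook would be learned.

```python
def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared L2 distance).

    Returns the chosen indices and the corresponding quantized vectors.
    """
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]


# Hypothetical 2-D codebook and per-token latent features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
latents = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]
indices, quantized = quantize(latents, codebook)
```

The resulting index sequence (one index per token) is the kind of discrete sequence over which a separate autoregressive prior could then be trained.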

preprint2020arXiv

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 33 BLEU on ro-en translation without any parallel data or back-translation.

preprint2020arXiv

Managing Recurrent Virtual Network Updates in Multi-Tenant Datacenters: A System Perspective

With the advent of software-defined networking, network configuration through programmable interfaces has become practical, leading to various on-demand opportunities for network routing updates in multi-tenant datacenters, where tenants have diverse requirements on network routings such as short latency, low path inflation, large bandwidth, high reliability, etc. Conventional solutions that rely on topology search coupled with an objective function to find desired routings have at least two shortcomings: (i) they run into scalability issues when handling consistent and frequent routing updates and (ii) they restrict the flexibility and capability to satisfy various routing requirements. To address these issues, this paper proposes a novel search and optimization decoupled design, which not only saves considerable topology search costs via search result reuse, but also avoids possible sub-optimality in greedy routing search algorithms by making decisions based on the global view of all possible routings. We implement a prototype of our proposed system, OpReduce, and perform extensive evaluations to validate its design goals.

preprint2020arXiv

Super-resolution single-photon imaging at 8.2 kilometers

Single-photon light detection and ranging (LiDAR), offering single-photon sensitivity and picosecond time resolution, has been widely adopted for active imaging applications. Long-range active imaging is a great challenge, because the spatial resolution degrades significantly with the imaging range due to the diffraction limit of the optics, and only weak echo signal photons return, mixed with strong background noise. Here we propose and demonstrate a photon-efficient LiDAR approach that can achieve sub-Rayleigh resolution imaging over long ranges. This approach exploits fine sub-pixel scanning and a deconvolution algorithm tailored to this long-range application. Using this approach, we experimentally demonstrated active three-dimensional (3D) single-photon imaging by recognizing different postures of a mannequin model at a stand-off distance of 8.2 km in both daylight and night. The observed spatial (transversal) resolution is about 5.5 cm at 8.2 km, which is about twice the system's resolution. This also beats the optical system's Rayleigh criterion. The results are valuable for geosciences and target recognition over long ranges.

preprint2020arXiv

Tunable Phase Boundaries and Ultra-Strong Coupling Superconductivity in Mirror Symmetric Magic-Angle Trilayer Graphene

Moiré superlattices have recently emerged as a novel platform where correlated physics and superconductivity can be studied with unprecedented tunability. Although correlated effects have been observed in several other moiré systems, magic-angle twisted bilayer graphene (MATBG) remains the only one where robust superconductivity has been reproducibly measured. Here we realize a new moiré superconductor, mirror symmetric magic-angle twisted trilayer graphene (MATTG) with dramatically richer tunability in electronic structure and superconducting properties. Hall effect and quantum oscillations measurements as a function of density and electric field allow us to determine the system's tunable phase boundaries in the normal state. Zero magnetic field resistivity measurements then reveal that the existence of superconductivity is intimately connected to the broken symmetry phase emerging from two carriers per moiré unit cell. Strikingly, we find that the superconducting phase gets suppressed and bounded at the van Hove singularities (vHs) partially surrounding the broken-symmetry phase, which is difficult to reconcile with weak-coupling BCS theory. Moreover, the extensive in situ tunability of our system allows us to achieve the ultra-strong coupling regime, characterized by a Ginzburg-Landau coherence length reaching the average inter-particle distance and very large $T_\mathrm{BKT}/T_{F}$ ratios in excess of 0.1, where $T_\mathrm{BKT}$ and $T_F$ are the Berezinskii-Kosterlitz-Thouless transition and Fermi temperatures, respectively. These observations suggest that MATTG can be electrically tuned close to the two-dimensional BCS-BEC crossover. Our results establish a new generation of tunable moiré superconductors with the potential to revolutionize our fundamental understanding and the applications of strong coupling superconductivity.

preprint2019arXiv

Cascade of Phase Transitions and Dirac Revivals in Magic Angle Graphene

Twisted bilayer graphene near the magic angle exhibits remarkably rich electron correlation physics, displaying insulating, magnetic, and superconducting phases. Here, using measurements of the local electronic compressibility, we reveal that these phases originate from a high-energy state with an unusual sequence of band populations. As carriers are added to the system, rather than filling all the four spin and valley flavors equally, we find that the population occurs through a sequence of sharp phase transitions, which appear as strong asymmetric jumps of the electronic compressibility near integer fillings of the moiré lattice. At each transition, a single spin/valley flavor takes all the carriers from its partially filled peers, "resetting" them back to the vicinity of the charge neutrality point. As a result, the Dirac-like character observed near charge neutrality reappears after each integer filling. Measurement of the in-plane magnetic field dependence of the chemical potential near filling factor one reveals a large spontaneous magnetization, further substantiating this picture of a cascade of symmetry breakings. The sequence of phase transitions and Dirac revivals is observed at temperatures well above the onset of the superconducting and correlated insulating states. This indicates that the state we reveal here, with its strongly broken electronic flavor symmetry and revived Dirac-like electronic character, is a key player in the physics of magic angle graphene, forming the parent state out of which the more fragile superconducting and correlated insulating ground states emerge.

preprint2019arXiv

Electric Field Tunable Correlated States and Magnetic Phase Transitions in Twisted Bilayer-Bilayer Graphene

The recent discovery of correlated insulator states and superconductivity in magic-angle twisted bilayer graphene has paved the way to the experimental investigation of electronic correlations in tunable flat band systems realized in twisted van der Waals heterostructures. This novel twist angle degree of freedom and control should be generalizable to other 2D systems, which may exhibit similar correlated physics behavior while at the same time enabling new techniques to tune and control the strength of electron-electron interactions. Here, we report on a new highly tunable correlated system based on small-angle twisted bilayer-bilayer graphene (TBBG), consisting of two rotated sheets of Bernal-stacked bilayer graphene. We find that TBBG exhibits a rich phase diagram, with tunable correlated insulator states that are highly sensitive both to twist angle and to the application of an electric displacement field, the latter reflecting the inherent polarizability of Bernal-stacked bilayer graphene. We find correlated insulator states that can be switched on and off by the displacement field at all integer electron fillings of the moiré unit cell. The response of these correlated states to magnetic fields points towards evidence of electrically switchable magnetism. Moreover, the strong dependence of the resistance at low temperature and near the correlated insulator states indicates possible proximity to a superconducting phase. Furthermore, in the regime of lower twist angles, TBBG shows multiple sets of flat bands near charge neutrality, resulting in numerous correlated states corresponding to half-filling of each of these flat bands. Our results pave the way to the exploration of novel twist-angle and electric-field controlled correlated phases of matter in novel multi-flat band twisted superlattices.

preprint2019arXiv

Mapping the twist angle and unconventional Landau levels in magic angle graphene

The emergence of flat electronic bands and of the recently discovered strongly correlated and superconducting phases in twisted bilayer graphene crucially depends on the interlayer twist angle upon approaching the magic angle $θ_M \approx 1.1°$. Although advanced fabrication methods allow alignment of graphene layers with global twist angle control of about $0.1°$, little information is currently available on the distribution of the local twist angles in actual magic angle twisted bilayer graphene (MATBG) transport devices. Here we map the local $θ$ variations in hBN encapsulated devices with relative precision better than $0.002°$ and spatial resolution of a few moiré periods. Utilizing a scanning nanoSQUID-on-tip, we attain tomographic imaging of the Landau levels in the quantum Hall state in MATBG, which provides a highly sensitive probe of the charge disorder and of the local band structure determined by the local $θ$. We find that even state-of-the-art devices, exhibiting high-quality global MATBG features including superconductivity, display significant variations in the local $θ$ with a span close to $0.1°$. Devices may even have substantial areas where no local MATBG behavior is detected, yet still display global MATBG characteristics in transport, highlighting the importance of percolation physics. The derived $θ$ maps reveal substantial gradients and a network of jumps. We show that the twist angle gradients generate large unscreened electric fields that drastically change the quantum Hall state by forming edge states in the bulk of the sample, and may also significantly affect the phase diagram of correlated and superconducting states. The findings call for exploration of band structure engineering utilizing twist-angle gradients and gate-tunable built-in planar electric fields for novel correlated phenomena and applications.

preprint2019arXiv

Single-photon computational 3D imaging at 45 km

Long-range active imaging has a variety of applications in remote sensing and target recognition. Single-photon LiDAR (light detection and ranging) offers single-photon sensitivity and picosecond timing resolution, which is desirable for high-precision three-dimensional (3D) imaging over long distances. Despite important progress, further extending the imaging range presents enormous challenges because only weak echo photons return and are mixed with strong noise. Herein, we tackled these challenges by constructing a high-efficiency, low-noise confocal single-photon LiDAR system, and developing a long-range-tailored computational algorithm that provides high photon efficiency and super-resolution in the transverse domain. Using this technique, we experimentally demonstrated active single-photon 3D-imaging at a distance of up to 45 km in an urban environment, with a low return-signal level of $\sim$1 photon per pixel. With refinements to the setup, our system could feasibly image at ranges of a few hundred kilometers, and thus represents a significant milestone towards rapid, low-power, and high-resolution LiDAR over extra-long ranges.

preprint2019arXiv

Spaceborne low-noise single-photon detection for satellite-based quantum communications

Single-photon detectors (SPDs) play important roles in highly sensitive detection applications, such as fluorescence spectroscopy, remote sensing and ranging, deep space optical communications, elementary particle detection, and quantum communications. However, the adverse conditions in space, such as the increased radiation flux and thermal vacuum, severely limit their noise performance, reliability, and lifetime. Herein, we present the first example of spaceborne, low-noise, high-reliability SPDs, based on commercial off-the-shelf (COTS) silicon avalanche photodiodes (APDs). Given the high noise-radiation sensitivity of silicon APDs, we have developed special shielding structures, multistage cooling technologies, and configurable driver electronics that significantly improved the COTS APD reliability and mitigated the SPD noise-radiation sensitivity. This led to a reduction of the expected in-orbit radiation-induced dark count rate (DCR) from ~219 counts per second (cps) per day to ~0.76 cps/day. During continuous in-orbit operation spanning 1029 days, the SPD DCR was maintained below 1000 cps; that is, the actual in-orbit radiation-induced DCR increment rate was ~0.54 cps/day, two orders of magnitude lower than that obtained with previous technologies, while the photon detection efficiency was > 45%. Our spaceborne, low-noise SPDs enabled feasible satellite-based up-link quantum communication, which was validated on the quantum experiment science satellite platform. Moreover, our SPDs open new windows of opportunity for space research and applications in deep-space optical communications, single-photon laser ranging, as well as for testing the fundamental principles of physics in space.
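
The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the radiation-induced DCR grows linearly over the mission, which the abstract does not state; the variable names are illustrative only.

```python
# Figures taken from the abstract (cps = counts per second).
expected_unshielded_cps_per_day = 219.0
expected_shielded_cps_per_day = 0.76
measured_cps_per_day = 0.54
mission_days = 1029

# Ratio between the expected rates before and after the mitigation measures.
design_reduction_factor = expected_unshielded_cps_per_day / expected_shielded_cps_per_day

# Ratio between the expected unmitigated rate and the measured in-orbit rate.
measured_reduction_factor = expected_unshielded_cps_per_day / measured_cps_per_day

# Accumulated radiation-induced DCR over the mission, assuming linear growth.
accumulated_dcr_cps = measured_cps_per_day * mission_days
```

Under the linear-growth assumption the accumulated increment comes to roughly 556 cps, which is compatible with the stated bound of 1000 cps provided the initial DCR was below about 444 cps. Both reduction factors are in the hundreds, in line with the "two orders of magnitude" wording.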

preprint2019arXiv

Strange metal in magic-angle graphene with near Planckian dissipation

Recent experiments on magic-angle twisted bilayer graphene have discovered correlated insulating behavior and superconductivity at a fractional filling of an isolated narrow band. In this paper we show that magic-angle bilayer graphene exhibits another hallmark of strongly correlated systems: a broad regime of $T$-linear resistivity above a small, density-dependent crossover temperature, for a range of fillings near the correlated insulator. We also extract a transport "scattering rate", which satisfies a near Planckian form that is universally related to the ratio $k_B T/\hbar$. Our results establish magic-angle bilayer graphene as a highly tunable platform to investigate strange metal behavior, which could shed light on this mysterious ubiquitous phase of correlated matter.
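
To give a sense of scale for the ratio $k_B T/\hbar$, the sketch below evaluates a rate of the form $Γ = C\,k_B T/\hbar$. The prefactor `C` is an assumption set to 1 for illustration (the abstract gives no value), and the function name is hypothetical; the physical constants are the exact SI values.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23         # k_B, exact SI value
PLANCK_J_S = 6.62607015e-34              # h, exact SI value
HBAR_J_S = PLANCK_J_S / (2.0 * math.pi)  # reduced Planck constant


def planckian_rate(temperature_k, prefactor=1.0):
    """Scattering rate C * k_B * T / hbar in 1/s; the prefactor C is assumed."""
    return prefactor * BOLTZMANN_J_PER_K * temperature_k / HBAR_J_S


rate_1k = planckian_rate(1.0)    # about 1.3e11 per second at 1 K
rate_10k = planckian_rate(10.0)  # ten times larger, since the rate is linear in T
```

A rate that is linear in temperature is consistent with the $T$-linear resistivity described above, if the other factors entering the resistivity are only weakly temperature dependent.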

preprint2019arXiv

Universal transfer and stacking technique of van der Waals heterostructures for spintronics

The key to achieving high-quality van der Waals heterostructure devices made from various two-dimensional (2D) materials lies in the control over clean and flexible interfaces. However, existing transfer methods based on different mediators have shortcomings, including the presence of residues, the unavailability of flexible interface engineering, and selectivity towards materials and substrates, since their adhesion differs considerably with the preparation conditions, from chemical vapor deposition (CVD) growth to mechanical exfoliation. In this paper, we introduce a more universal method using a prefabricated polyvinyl alcohol (PVA) film to transfer and stack 2D materials, whether they are prepared by CVD or exfoliation. This peel-off and drop-off technique promises an ideal interface of the materials without introducing contamination. In addition, the method exhibits micron-scale spatial transfer accuracy and meets special experimental conditions such as the preparation of twisted graphene and the construction of 2D/metal heterostructures. We illustrate the superiority of this method with a WSe$_2$ vertical spin valve device, whose performance verifies the applicability and advantages of such a method for spintronics. Our PVA-assisted transfer process will promote the development of high-performance 2D-material-based devices.