Source author record

Yong Cheng

Yong Cheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computation and Language, Machine Learning, Information Theory, math.IT, Artificial Intelligence, Cryptography and Security, math.LO

Catalog footprint

What is connected

14 works

7 topics

4 close collaborators


preprint, 2022, arXiv

BlindFL: Vertical Federated Machine Learning without Peeking into Your Data

Due to the rising concerns about privacy protection, how to build machine learning (ML) models over different data sources with security guarantees is gaining more popularity. Vertical federated learning (VFL) describes such a case where ML models are built upon the private data of different participating parties that own disjoint features for the same set of instances, which fits many real-world collaborative tasks. Nevertheless, we find that existing solutions for VFL either support limited kinds of input features or suffer from potential data leakage during the federated execution. To this end, this paper aims to investigate both the functionality and security of ML models in the VFL scenario. To be specific, we introduce BlindFL, a novel framework for VFL training and inference. First, to address the functionality of VFL models, we propose the federated source layers to unite the data from different parties. Various kinds of features can be supported efficiently by the federated source layers, including dense, sparse, numerical, and categorical features. Second, we carefully analyze the security during the federated execution and formalize the privacy requirements. Based on the analysis, we devise secure and accurate algorithm protocols, and further prove the security guarantees under the ideal-real simulation paradigm. Extensive experiments show that BlindFL supports diverse datasets and models efficiently while achieving robust privacy guarantees.

preprint, 2022, arXiv

Examining Scaling and Transfer of Language Model Architectures for Machine Translation

Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigating the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) Different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases, (ii) Several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality, and (iii) When paired with full-visible masking for source sequences, LMs could perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations.

preprint, 2022, arXiv

mSLAM: Massively multilingual joint pre-training for speech and text

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.

preprint, 2022, arXiv

Multilingual Mix: Example Interpolation Improves Multilingual Neural Machine Translation

Multilingual neural machine translation models are trained to maximize the likelihood of a mix of examples drawn from multiple language pairs. The dominant inductive bias applied to these models is a shared vocabulary and a shared set of parameters across languages; the inputs and labels corresponding to examples drawn from different language pairs might still reside in distinct sub-spaces. In this paper, we introduce multilingual crossover encoder-decoder (mXEncDec) to fuse language pairs at an instance level. Our approach interpolates instances from different language pairs into joint `crossover examples' in order to encourage sharing input and output spaces across languages. To ensure better fusion of examples in multilingual settings, we propose several techniques to improve example interpolation across dissimilar languages under heavy data imbalance. Experiments on a large-scale WMT multilingual dataset demonstrate that our approach significantly improves quality on English-to-Many, Many-to-English and zero-shot translation tasks (from +0.5 BLEU up to +5.5 BLEU points). Results on code-switching sets demonstrate the capability of our approach to improve model generalization to out-of-distribution multilingual examples. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.

preprint, 2020, arXiv

A Communication Efficient Collaborative Learning Framework for Distributed Features

We introduce a collaborative learning framework allowing multiple parties having different sets of attributes about the same user to jointly build models without exposing their raw data or model parameters. In particular, we propose a Federated Stochastic Block Coordinate Descent (FedBCD) algorithm, in which each party conducts multiple local updates before each communication to effectively reduce the number of communication rounds among parties, a principal bottleneck for collaborative learning problems. We analyze theoretically the impact of the number of local updates and show that when the batch size, sample size, and the local iterations are selected appropriately, within $T$ iterations, the algorithm performs $\mathcal{O}(\sqrt{T})$ communication rounds and achieves some $\mathcal{O}(1/\sqrt{T})$ accuracy (measured by the average of the gradient norm squared). The approach is supported by our empirical evaluations on a variety of tasks and datasets, demonstrating advantages over stochastic gradient descent (SGD) approaches.
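The trade-off between local updates and communication can be illustrated with simple arithmetic. The sketch below assumes, as a simplification of the abstract, that each communication round is followed by a fixed number of local updates. The function name and the choice of about $\sqrt{T}$ local updates per round are illustrative assumptions, not the paper's exact schedule.

```python
import math


def communication_rounds(total_iterations: int, local_updates: int) -> int:
    """Rounds needed when every round is followed by `local_updates` local steps.

    Hypothetical helper for illustration only; it is not part of FedBCD.
    """
    if total_iterations < 0 or local_updates < 1:
        raise ValueError("need total_iterations >= 0 and local_updates >= 1")
    return math.ceil(total_iterations / local_updates)


T = 10_000
q = math.isqrt(T)  # assumed: about sqrt(T) local updates per round

rounds_with_local_updates = communication_rounds(T, q)
rounds_without_local_updates = communication_rounds(T, 1)
print(rounds_with_local_updates, rounds_without_local_updates)
```

Under these assumptions, communicating after every iteration costs $T$ rounds, while batching about $\sqrt{T}$ local updates per round brings the count down to about $\sqrt{T}$, which matches the order stated in the abstract.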

preprint, 2020, arXiv

AdvAug: Robust Adversarial Augmentation for Neural Machine Translation

In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, of which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-to-sequence learning. Experiments on Chinese-English, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g. back-translation) without using extra corpora.
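The idea of sampling virtual sentences from an interpolated embedding space can be sketched with a generic mixup-style convex combination. This is not the AdvAug procedure itself: the function names, the Beta-distributed mixing weight, and the assumption that both sentences are already padded to the same shape are illustrative choices.

```python
import random


def interpolate_embeddings(emb_a, emb_b, lam):
    """Convex combination lam * emb_a + (1 - lam) * emb_b, position by position.

    Both inputs are lists of equal-length float vectors (assumed pre-padded).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(emb_a) != len(emb_b):
        raise ValueError("embedding sequences must have the same length")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(emb_a, emb_b)
    ]


def sample_virtual_sentence(emb_a, emb_b, alpha=0.4, rng=None):
    """Draw a mixing weight from Beta(alpha, alpha) and interpolate (assumed scheme)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    return interpolate_embeddings(emb_a, emb_b, lam)


sentence_a = [[1.0, 0.0]]
sentence_b = [[0.0, 1.0]]
print(interpolate_embeddings(sentence_a, sentence_b, 0.25))
```

Training on such interpolated points, rather than only on the observed sentence pairs, is what minimizing a vicinal risk amounts to in this simplified picture.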

preprint, 2020, arXiv

Learning to Detect Malicious Clients for Robust Federated Learning

Federated learning systems are vulnerable to attacks from malicious clients. As the central server in the system cannot govern the behaviors of the clients, a rogue client may initiate an attack by sending malicious model updates to the server, so as to degrade the learning performance or enforce targeted model poisoning attacks (a.k.a. backdoor attacks). Therefore, timely detecting these malicious model updates and the underlying attackers becomes critically important. In this work, we propose a new framework for robust federated learning where the central server learns to detect and remove the malicious model updates using a powerful detection model, leading to targeted defense. We evaluate our solution in both image classification and sentiment analysis tasks with a variety of machine learning models. Experimental results show that our solution ensures robust federated learning that is resilient to both the Byzantine attacks and the targeted model poisoning attacks.

preprint, 2016, arXiv

Agreement-based Joint Training for Bidirectional Attention-based Neural Machine Translation

The attentional mechanism has proven to be effective in improving end-to-end neural machine translation. However, due to the intricate structural divergence between natural languages, unidirectional attention-based models might only capture partial aspects of attentional regularities. We propose agreement-based joint training for bidirectional attention-based end-to-end neural machine translation. Instead of training source-to-target and target-to-source translation models independently, our approach encourages the two complementary models to agree on word alignment matrices on the same training data. Experiments on Chinese-English and English-French translation tasks show that agreement-based joint training significantly improves both alignment and translation quality over independent training.
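One way to picture "agreeing on word alignment matrices" is a penalty that compares the two models' attention matrices entry by entry. The sketch below uses a squared difference, which is only one possible agreement measure; the abstract does not say which measure the paper uses. The function name and the matrix indexing convention are assumptions.

```python
def agreement_penalty(src_to_tgt, tgt_to_src):
    """Sum of squared differences between two attention matrices.

    src_to_tgt[i][j]: weight linking source word i to target word j (assumed layout).
    tgt_to_src[j][i]: weight linking target word j to source word i (assumed layout).
    A value of 0.0 means the two directions agree on every alignment link.
    """
    total = 0.0
    for i, row in enumerate(src_to_tgt):
        for j, weight in enumerate(row):
            total += (weight - tgt_to_src[j][i]) ** 2
    return total


# One source word, two target words: the two directions disagree.
forward = [[0.5, 0.5]]
backward = [[1.0], [0.0]]
print(agreement_penalty(forward, backward))
```

In joint training, a term of this kind would be added to the two translation losses so that both models are pushed toward consistent alignments.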

preprint, 2016, arXiv

Minimum Risk Training for Neural Machine Translation

We propose minimum risk training for end-to-end neural machine translation. Unlike conventional maximum likelihood estimation, minimum risk training is capable of optimizing model parameters directly with respect to arbitrary evaluation metrics, which are not necessarily differentiable. Experiments show that our approach achieves significant improvements over maximum likelihood estimation on a state-of-the-art neural machine translation system across various language pairs. Transparent to architectures, our approach can be applied to more neural networks and potentially benefit more NLP tasks.
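The reason a non-differentiable metric can still be optimized is that the training objective is an expectation: the metric only supplies a fixed cost per candidate translation, and the model enters through the candidate probabilities. The sketch below computes such an expected risk over a small candidate set. The function name, the renormalization over sampled candidates, and the sharpness parameter `alpha` are assumptions made for illustration.

```python
import math


def expected_risk(log_probs, costs, alpha=1.0):
    """Expected cost of candidate translations under a renormalized distribution.

    log_probs: model log-probabilities of each candidate.
    costs: per-candidate cost, e.g. 1 - sentence-level BLEU (treated as constants).
    alpha: assumed sharpness factor applied before renormalizing.
    """
    if len(log_probs) != len(costs) or not log_probs:
        raise ValueError("need one cost per candidate and at least one candidate")
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    normalizer = sum(weights)
    return sum(w / normalizer * c for w, c in zip(weights, costs))


candidate_log_probs = [math.log(0.75), math.log(0.25)]
candidate_costs = [0.0, 1.0]  # the first candidate matches the reference
print(expected_risk(candidate_log_probs, candidate_costs))
```

Lowering this quantity means shifting probability mass toward low-cost candidates, and the gradient flows only through the probabilities, never through the metric.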

preprint, 2016, arXiv

Semi-Supervised Learning for Neural Machine Translation

While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semi-supervised approach for training NMT models on the concatenation of labeled (parallel corpora) and unlabeled (monolingual corpora) data. The central idea is to reconstruct the monolingual corpora using an autoencoder, in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively. Our approach can not only exploit the monolingual corpora of the target language, but also of the source language. Experiments on the Chinese-English dataset show that our approach achieves significant improvements over state-of-the-art SMT and NMT systems.

preprint, 2015, arXiv

Indestructibility properties of remarkable cardinals

Remarkable cardinals were introduced by Schindler, who showed that the existence of a remarkable cardinal is equiconsistent with the assertion that the theory of $L(\mathbb R)$ is absolute for proper forcing. Here, we study the indestructibility properties of remarkable cardinals. We show that if $\kappa$ is remarkable, then there is a forcing extension in which the remarkability of $\kappa$ becomes indestructible by all ${<}\kappa$-closed ${\leq}\kappa$-distributive forcing and all two-step iterations of the form ${\rm Add}(\kappa,\theta)*\dot{\mathbb R}$, where $\dot{\mathbb R}$ is forced to be ${<}\kappa$-closed and ${\leq}\kappa$-distributive. In the process, we introduce the notion of a remarkable Laver function and show that every remarkable cardinal carries such a function. We also show that remarkability is preserved by the canonical forcing of the ${\rm GCH}$.

preprint, 2011, arXiv

Distributive Network Utility Maximization (NUM) over Time-Varying Fading Channels

Distributed network utility maximization (NUM) has received an increasing intensity of interest over the past few years. Distributed solutions (e.g., the primal-dual gradient method) have been intensively investigated under fading channels. As such distributed solutions involve iterative updating and explicit message passing, it is unrealistic to assume that the wireless channel remains unchanged during the iterations. Unfortunately, the behavior of those distributed solutions under time-varying channels is in general unknown. In this paper, we shall investigate the convergence behavior and tracking errors of the iterative primal-dual scaled gradient algorithm (PDSGA) with dynamic scaling matrices (DSC) for solving distributive NUM problems under time-varying fading channels. We shall also study a specific application example, namely the multi-commodity flow control and multi-carrier power allocation problem in multi-hop ad hoc networks. Our analysis shows that the PDSGA converges to a limit region rather than a single point under the finite state Markov chain (FSMC) fading channels. We also show that the order of growth of the tracking errors is given by O(T/N), where T and N are the update interval and the average sojourn time of the FSMC, respectively. Based on this analysis, we derive a low complexity distributive adaptation algorithm for determining the adaptive scaling matrices, which can be implemented distributively at each transmitter. The numerical results show the superior performance of the proposed dynamic scaling matrix algorithm over several baseline schemes, such as the regular primal-dual gradient algorithm.
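For orientation, the plain primal-dual gradient method that the abstract names as a baseline can be sketched on a toy problem. Everything specific here is an assumption: a single shared link with a fixed capacity, logarithmic utilities, a static channel, and a constant step size. The time-varying channels and adaptive scaling matrices studied in the paper are not modeled.

```python
def primal_dual_gradient(num_flows, capacity, step=0.05, iterations=2000):
    """Maximize sum(log(x_i)) subject to sum(x_i) <= capacity (toy NUM instance).

    Each flow raises its rate along the gradient of its utility minus the link
    price; the link raises its price when the total load exceeds capacity.
    """
    rates = [0.5] * num_flows
    price = 1.0
    for _ in range(iterations):
        load = sum(rates)
        # Primal step: gradient of log(x) - price * x, kept strictly positive.
        new_rates = [max(1e-6, x + step * (1.0 / x - price)) for x in rates]
        # Dual step: price follows the constraint violation, projected onto >= 0.
        price = max(0.0, price + step * (load - capacity))
        rates = new_rates
    return rates, price


rates, price = primal_dual_gradient(num_flows=2, capacity=2.0)
print(rates, price)
```

With two flows and capacity 2, the optimum splits the link evenly, so both rates and the price approach 1. If the capacity changed during the iterations, as it does under fading, the iterates would be chasing a moving optimum, which is the tracking-error question the abstract raises.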

preprint, 2010, arXiv

A Scalable Limited Feedback Design for Network MIMO using Per-Cell Product Codebook

In network MIMO systems, channel state information is required at the transmitter side to multiplex users in the spatial domain. Since perfect channel knowledge is difficult to obtain in practice, limited feedback is a widely accepted solution. The dynamic number of cooperating BSs and heterogeneous path loss effects of network MIMO systems pose new challenges on limited feedback design. In this paper, we propose a scalable limited feedback design for network MIMO systems with multiple base stations, multiple users and multiple data streams for each user. We propose a limited feedback framework using per-cell product codebooks, along with a low-complexity feedback indices selection algorithm. We show that the proposed per-cell product codebook limited feedback design can asymptotically achieve the same performance as the joint-cell codebook approach. We also derive an asymptotic per-user throughput loss due to limited feedback with per-cell product codebooks. Based on that, we show that when the number of per-user feedback-bits $B_{k}$ is $\mathcal{O}\big( Nn_{T}n_{R}\log_{2}(\rho g_{k}^{sum})\big)$, the system operates in the noise-limited regime in which the per-user throughput is $\mathcal{O} \left( n_{R} \log_{2} \big( \frac{n_{R}\rho g_{k}^{sum}}{Nn_{T}} \big) \right)$. On the other hand, when the number of per-user feedback-bits $B_{k}$ does not scale with the system SNR $\rho$, the system operates in the interference-limited regime where the per-user throughput is $\mathcal{O}\left( \frac{n_{R}B_{k}}{(Nn_{T})^{2}} \right)$. Numerical results show that the proposed design is flexible enough to accommodate a dynamic number of cooperating BSs and achieves much better performance compared with other baselines (such as the Givens rotation approach).

preprint, 2010, arXiv

Distributive Power Control Algorithm for Multicarrier Interference Network over Time-Varying Fading Channels - Tracking Performance Analysis and Optimization

Distributed power control over interference-limited networks has received an increasing intensity of interest over the past few years. Distributed solutions (like the iterative water-filling, gradient projection, etc.) have been intensively investigated under quasi-static channels. However, as such distributed solutions involve iterative updating and explicit message passing, it is unrealistic to assume that the wireless channel remains unchanged during the iterations. Unfortunately, the behavior of those distributed solutions under time-varying channels is in general unknown. In this paper, we shall investigate the distributed scaled gradient projection algorithm (DSGPA) in a $K$-pair multicarrier interference network under a finite-state Markov channel (FSMC) model. We shall analyze the convergence property as well as tracking performance of the proposed DSGPA. Our analysis shows that the proposed DSGPA converges to a limit region rather than a single point under the FSMC model. We also show that the order of growth of the tracking errors is given by $\mathcal{O}\big(1 \big/ \bar{N}\big)$, where $\bar{N}$ is the average sojourn time of the FSMC. Based on the analysis, we shall derive the tracking error optimal scaling matrices via Markov decision process modeling. We shall show that the tracking error optimal scaling matrices can be implemented distributively at each transmitter. The numerical results show the superior performance of the proposed DSGPA over three baseline schemes, such as the gradient projection algorithm with a constant stepsize.
