Source author record

Lei Lin

Lei Lin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computation and Language, Information Theory, math.IT, Computer Vision, Data Structures and Algorithms, Databases, eess.SP, Information Retrieval, Machine Learning, Multimedia

Catalog footprint

What is connected

10 works

10 topics

4 close collaborators

preprint, 2022, arXiv

MDQE: A More Accurate Direct Pretraining for Machine Translation Quality Estimation

It is expensive to evaluate the results of Machine Translation (MT), which usually requires a manual translation as a reference. Machine Translation Quality Estimation (QE) is the task of predicting the quality of machine translations without relying on any reference. Recently, the predictor-estimator framework, which trains a predictor as a feature extractor and an estimator as a QE predictor, and pre-trained language models (PLMs) have achieved promising QE performance. However, we argue that there are still gaps between the predictor and the estimator in both data quality and training objectives, which prevent QE models from benefiting more directly from large parallel corpora. Building on previous related work that has alleviated these gaps to some extent, we propose a novel framework that provides a more accurate direct pretraining for QE tasks. In this framework, a generator is trained to produce pseudo data that is closer to real QE data, and an estimator is pretrained on these data with novel objectives that are the same as those of the QE task. Experiments on widely used benchmarks show that our proposed framework outperforms existing methods without using any pretrained models such as BERT.

preprint, 2020, arXiv

MCFlow: Monte Carlo Flow Models for Data Imputation

We consider the topic of data imputation, a foundational task in machine learning that addresses issues with missing data. To that end, we propose MCFlow, a deep framework for imputation that leverages normalizing flow generative models and Monte Carlo sampling. We address the causality dilemma that arises when training models with incomplete data by introducing an iterative learning scheme that alternately updates the density estimate and the values of the missing entries in the training data. We provide extensive empirical validation of the effectiveness of the proposed method on standard multivariate and image datasets, and benchmark its performance against state-of-the-art alternatives. We demonstrate that MCFlow is superior to competing methods in terms of the quality of the imputed data, as well as with regard to its ability to preserve the semantic structure of the data.

preprint, 2020, arXiv

Predicting Station-Level Bike-Sharing Demands Using Graph Convolutional Neural Network

This study proposes a novel Graph Convolutional Neural Network with Data-driven Graph Filter (GCNN-DDGF) model that can learn hidden heterogeneous pairwise correlations among stations to predict station-level hourly demand in a large-scale bike-sharing network. Two architectures of the GCNN-DDGF model are explored: GCNNreg-DDGF is a regular GCNN-DDGF model that contains the convolution and feedforward blocks; GCNNrec-DDGF additionally contains a recurrent block from the Long Short-term Memory neural network to capture temporal dependencies in the bike-sharing demand series. Furthermore, four GCNN models are proposed whose adjacency matrices are based on various bike-sharing system data, including the Spatial Distance matrix (SD), Demand matrix (DE), Average Trip Duration matrix (ATD), and Demand Correlation matrix (DC). These six GCNN models, along with seven other benchmark models, are built and compared using the Citi Bike dataset from New York City, which includes 272 stations and over 28 million transactions from 2013 to 2016. Results show that GCNNrec-DDGF performs best in terms of the Root Mean Square Error, the Mean Absolute Error, and the coefficient of determination (R²), followed by GCNNreg-DDGF. Both outperform the other models.
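The three evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration using their textbook definitions, not the authors' evaluation code; the function names and the toy demand values are assumptions made for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly demand at one station (observed vs. predicted).
observed = [3, 5, 2, 8]
predicted = [2, 5, 4, 7]
print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```

Lower RMSE and MAE and higher R² indicate a better fit, which is the sense in which the abstract ranks the models.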

preprint, 2018, arXiv

Music Sequence Prediction with Mixture Hidden Markov Models

Recommendation systems that automatically generate personalized music playlists for users have attracted tremendous attention in recent years. Nowadays, most music recommendation systems rely on item-based or user-based collaborative filtering or content-based approaches. In this paper, we propose a novel mixture hidden Markov model (HMM) for music play sequence prediction. We compare the mixture model with state-of-the-art methods and evaluate the predictions quantitatively and qualitatively on a large-scale real-world dataset in a Kaggle competition. Results show that our model significantly outperforms traditional methods as well as other competitors. We conclude by envisioning a next-generation music recommendation system that integrates our model with recent advances in deep learning, computer vision, and speech techniques, and has promising potential in both academia and industry.

preprint, 2016, arXiv

Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention

In this paper, we propose a sentence encoding-based model for recognizing textual entailment. In our approach, the encoding of a sentence is a two-stage process. First, average pooling is applied over a word-level bidirectional LSTM (biLSTM) to generate a first-stage sentence representation. Second, an attention mechanism replaces average pooling on the same sentence to produce a better representation. Instead of using the target sentence to attend to words in the source sentence, we use the sentence's first-stage representation to attend to the words that appear in the sentence itself, which we call "Inner-Attention". Experiments conducted on the Stanford Natural Language Inference (SNLI) corpus demonstrate the effectiveness of the "Inner-Attention" mechanism. With fewer parameters, our model outperforms the existing best sentence encoding-based approach by a large margin.
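The two-stage encoding described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the per-word biLSTM outputs `H` are already computed, and it assumes a tanh scoring layer with hypothetical weights `W_y`, `W_h`, and `w`; the exact parameterization is given in the paper.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inner_attention(H, W_y, W_h, w):
    """H holds one biLSTM output vector per word (assumed precomputed)."""
    dim = len(H[0])
    # Stage 1: average pooling gives the first-stage sentence representation.
    r_avg = [sum(h[d] for h in H) / len(H) for d in range(dim)]
    # Stage 2: score every word of the same sentence against r_avg.
    proj_avg = matvec(W_h, r_avg)
    scores = []
    for h in H:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_y, h), proj_avg)]
        scores.append(sum(w_k * m_k for w_k, m_k in zip(w, hidden)))
    alpha = softmax(scores)
    # Attention-weighted sum replaces the plain average.
    r_att = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]
    return r_avg, alpha, r_att

# Toy sentence of three words with 2-dimensional states and identity weights.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
r_avg, alpha, r_att = inner_attention(H, identity, identity, [1.0, 1.0])
print(alpha, r_att)
```

When the scoring vector `w` is all zeros, every word receives the same weight and the attended representation reduces to the average-pooled one, which is a quick sanity check on the sketch.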

preprint, 2014, arXiv

Radical-Enhanced Chinese Character Embedding

We present a method that leverages radicals for learning Chinese character embeddings. A radical is a semantic and phonetic component of a Chinese character. It plays an important role because characters with the same radical usually have similar semantic meaning and grammatical usage. However, existing Chinese processing algorithms typically regard the word or character as the basic unit and ignore the crucial radical information. In this paper, we fill this gap by leveraging radicals for learning continuous representations of Chinese characters. We develop a dedicated neural architecture to effectively learn character embeddings and apply it to Chinese character similarity judgement and Chinese word segmentation. Experimental results show that our radical-enhanced method outperforms existing embedding learning algorithms on both tasks.

preprint, 2013, arXiv

Accessible Capacity of Secondary Users

A new problem formulation is presented for the Gaussian interference channel (GIFC) with two pairs of users, which are distinguished as primary users and secondary users, respectively. The primary users employ an encoder-decoder pair that was originally designed to satisfy a given error performance requirement under the assumption that no interference exists from other users. In the scenario where the secondary users attempt to access the same medium, we are interested in the maximum transmission rate (defined as the accessible capacity) at which the secondary users can communicate reliably without violating the error performance requirement of the primary users, under the constraint that the primary encoder (not the decoder) is kept unchanged. By modeling the primary encoder as a generalized trellis code (GTC), we are able to treat the secondary link and the cross link from the secondary transmitter to the primary receiver as finite state channels (FSCs). Based on this, upper and lower bounds on the accessible capacity are derived. The impact of the primary users' error performance requirement on the accessible capacity is analyzed using the concept of interference margin. In the case of a non-trivial interference margin, the secondary message is split into common and private parts and then encoded by superposition coding, which delivers a lower bound on the accessible capacity. For some special cases, these bounds can be computed numerically using the BCJR algorithm. Numerical results are also provided to gain insight into the impact of the GTC and the error performance requirement on the accessible capacity.

preprint, 2013, arXiv

An information spectrum approach to the capacity region of GIFC

In this paper, we present a general formula for the capacity region of a general interference channel with two pairs of users. The formula shows that the capacity region is the union of a family of rectangles, where each rectangle is determined by a pair of spectral inf-mutual information rates. Although the presented formula is usually difficult to compute, it provides useful insights into interference channels. In particular, when the inputs are discrete ergodic Markov processes and the channel is stationary and memoryless, the formula can be evaluated using the BCJR algorithm. The formula also suggests that the simplest inner bounds (obtained by treating the interference as noise) could be improved by taking into account the structure of the interference processes. This is verified numerically by computing the mutual information rates for Gaussian interference channels with embedded convolutional codes. Moreover, we present a coding scheme to approach the theoretical achievable rate pairs. Numerical results show that a decoding gain can be achieved by considering the structure of the interference.

preprint, 2012, arXiv

An Information-Spectrum Approach to the Capacity Region of General Interference Channel

This paper is concerned with general interference channels characterized by a sequence of transition (conditional) probabilities. We present a general formula for the capacity region of the interference channel with two pairs of users. The formula shows that the capacity region is the union of a family of rectangles, where each rectangle is determined by a pair of spectral inf-mutual information rates. Although the presented formula is usually difficult to compute, it provides useful insights into interference channels. For example, the formula suggests that the simplest inner bounds (obtained by treating the interference as noise) could be improved by taking into account the structure of the interference processes. This is verified numerically by computing the mutual information rates for Gaussian interference channels with embedded convolutional codes.
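For orientation, the rectangle-union structure described in this abstract can be written out as below. This is a sketch in standard information-spectrum notation, not a quotation of the paper's theorem. The symbols are assumptions of this sketch: $\mathbf{X}_1, \mathbf{X}_2$ denote the two input processes (taken to be independent), $\mathbf{Y}_1, \mathbf{Y}_2$ the corresponding outputs, and $\text{p-}\liminf$ the limit inferior in probability.

```latex
% Capacity region as a union of rectangles, one per pair of input processes
\mathcal{C} \;=\; \bigcup_{(\mathbf{X}_1,\,\mathbf{X}_2)}
\Bigl\{ (R_1, R_2) \;:\;
  0 \le R_1 \le \underline{I}(\mathbf{X}_1;\mathbf{Y}_1),\;
  0 \le R_2 \le \underline{I}(\mathbf{X}_2;\mathbf{Y}_2) \Bigr\}

% Spectral inf-mutual information rate (standard definition)
\underline{I}(\mathbf{X};\mathbf{Y}) \;=\;
\text{p-}\liminf_{n\to\infty}\;
\frac{1}{n}\,\log\frac{P_{Y^n \mid X^n}(Y^n \mid X^n)}{P_{Y^n}(Y^n)}
```

Each choice of input pair contributes one rectangle whose side lengths are the two spectral inf-mutual information rates, which is why treating the interference as structured rather than as noise can enlarge the achievable set.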

preprint, 2010, arXiv

Efficient K-Nearest Neighbor Join Algorithms for High Dimensional Sparse Data

The K-Nearest Neighbor (KNN) join is an expensive but important operation in many data mining algorithms. Several recent applications need to perform KNN joins on high dimensional sparse data. Unfortunately, all existing KNN join algorithms are designed for low dimensional data. To fill this void, we investigate the KNN join problem for high dimensional sparse data. In this paper, we propose three KNN join algorithms: a brute force (BF) algorithm, an inverted index-based (IIB) algorithm, and an improved inverted index-based (IIIB) algorithm. Extensive experiments on both synthetic and real-world datasets were conducted to demonstrate the effectiveness of our algorithms for high dimensional sparse data.
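The contrast between the brute force and inverted index-based approaches can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's algorithms: it assumes sparse vectors stored as feature-to-weight dictionaries, dot product as the similarity measure, and ties broken by identifier. The function names are hypothetical, and the pruning used by the improved (IIIB) variant is not shown.

```python
import heapq
from collections import defaultdict

def dot(u, v):
    """Dot product of two sparse vectors stored as {feature: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[f] for f, w in u.items() if f in v)

def knn_join_brute_force(R, S, k):
    """For each r in R, compare against every s in S and keep the top k."""
    result = {}
    for r_id, r in R.items():
        scored = [(dot(r, s), s_id) for s_id, s in S.items()]
        top = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
        result[r_id] = [s_id for _, s_id in top]
    return result

def knn_join_inverted_index(R, S, k):
    """Index S by feature so each r only touches vectors sharing a feature."""
    index = defaultdict(list)
    for s_id, s in S.items():
        for f, w in s.items():
            index[f].append((s_id, w))
    result = {}
    for r_id, r in R.items():
        scores = defaultdict(float)
        for f, w in r.items():
            for s_id, s_w in index.get(f, ()):
                scores[s_id] += w * s_w  # accumulate partial dot products
        top = heapq.nsmallest(k, scores.items(), key=lambda t: (-t[1], t[0]))
        result[r_id] = [s_id for s_id, _ in top]
    return result

# Hypothetical toy data: two query vectors joined against four candidates.
R = {"r1": {"a": 1.0, "c": 2.0}, "r2": {"b": 3.0}}
S = {"s1": {"a": 2.0, "b": 1.0}, "s2": {"c": 1.0},
     "s3": {"b": 2.0, "c": 3.0}, "s4": {"d": 5.0}}
print(knn_join_brute_force(R, S, 2))
print(knn_join_inverted_index(R, S, 2))
```

In this sketch the indexed version never visits `s4`, which shares no feature with either query. Note that it only returns candidates with at least one shared feature, so it can return fewer than k neighbors when overlap is scarce.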