Source author record

Kai Fan

Kai Fan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Machine Learning Applications Artificial Intelligence Databases eess.AS Sound

Catalog footprint

What is connected

8works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A scalable pipeline for COVID-19: the case study of Germany, Czechia and Poland

Throughout the coronavirus disease 2019 (COVID-19) pandemic, decision makers have relied on forecasting models to determine and implement non-pharmaceutical interventions (NPI). In building the forecasting models, continuously updated datasets from various stakeholders including developers, analysts, and testers are required to provide precise predictions. Here we report the design of a scalable pipeline which serves as a data synchronization to support inter-country top-down spatiotemporal observations and forecasting models of COVID-19, named the where2test, for Germany, Czechia and Poland. We have built an operational data store (ODS) using PostgreSQL to continuously consolidate datasets from multiple data sources, perform collaborative work, facilitate high performance data analysis, and trace changes. The ODS has been built not only to store the COVID-19 data from Germany, Czechia, and Poland but also other areas. Employing the dimensional fact model, a schema of metadata is capable of synchronizing the various structures of data from those regions, and is scalable to the entire world. Next, the ODS is populated using batch Extract, Transfer, and Load (ETL) jobs. The SQL queries are subsequently created to reduce the need for pre-processing data for users. The data can then support not only forecasting using a version-controlled Arima-Holt model and other analyses to support decision making, but also risk calculator and optimisation apps. The data synchronization runs at a daily interval, which is displayed at https://www.where2test.de.

preprint2022arXiv

Unifying Cross-lingual Summarization and Machine Translation with Compression Rate

Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.

preprint2020arXiv

A Practical Framework for Relation Extraction with Noisy Labels Based on Doubly Transitional Loss

Either human annotation or rule based automatic labeling is an effective method to augment data for relation extraction. However, the inevitable wrong labeling problem for example by distant supervision may deteriorate the performance of many existing methods. To address this issue, we introduce a practical end-to-end deep learning framework, including a standard feature extractor and a novel noisy classifier with our proposed doubly transitional mechanism. One transition is basically parameterized by a non-linear transformation between hidden layers that implicitly represents the conversion between the true and noisy labels, and it can be readily optimized together with other model parameters. Another is an explicit probability transition matrix that captures the direct conversion between labels but needs to be derived from an EM algorithm. We conduct experiments on the NYT dataset and SemEval 2018 Task 7. The empirical results show comparable or better performance over state-of-the-art methods.

preprint2020arXiv

Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation

Many document-level neural machine translation (NMT) systems have explored the utility of context-aware architecture, usually requiring an increasing number of parameters and computational complexity. However, few attention is paid to the baseline model. In this paper, we research extensively the pros and cons of the standard transformer in document-level translation, and find that the auto-regressive property can simultaneously bring both the advantage of the consistency and the disadvantage of error accumulation. Therefore, we propose a surprisingly simple long-short term masking self-attention on top of the standard transformer to both effectively capture the long-range dependence and reduce the propagation of errors. We examine our approach on the two publicly available document-level datasets. We can achieve a strong result in BLEU and capture discourse phenomena.

preprint2020arXiv

Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

The performances of automatic speech recognition (ASR) systems are usually evaluated by the metric word error rate (WER) when the manually transcribed data are provided, which are, however, expensively available in the real scenario. In addition, the empirical distribution of WER for most ASR systems usually tends to put a significant mass near zero, making it difficult to simulate with a single continuous distribution. In order to address the two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model to predict the WER of the ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditional on speech features (speech-BERT). We adopt the pre-training strategy of token level mask language modeling for speech-BERT as well, and further fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. The experimental results show that our approach achieves better performance on WER prediction in the metrics of Pearson and MAE, compared with most existed quality estimation algorithms for ASR or machine translation.

preprint2015arXiv

$k$-means: Fighting against Degeneracy in Sequential Monte Carlo with an Application to Tracking

For regular particle filter algorithm or Sequential Monte Carlo (SMC) methods, the initial weights are traditionally dependent on the proposed distribution, the posterior distribution at the current timestamp in the sampled sequence, and the target is the posterior distribution of the previous timestamp. This is technically correct, but leads to algorithms which usually have practical issues with degeneracy, where all particles eventually collapse onto a single particle. In this paper, we propose and evaluate using $k$ means clustering to attack and even take advantage of this degeneracy. Specifically, we propose a Stochastic SMC algorithm which initializes the set of $k$ means, providing the initial centers chosen from the collapsed particles. To fight against degeneracy, we adjust the regular SMC weights, mediated by cluster proportions, and then correct them to retain the same expectation as before. We experimentally demonstrate that our approach has better performance than vanilla algorithms.

preprint2015arXiv

Bayesian Models for Heterogeneous Personalized Health Data

The purpose of this study is to leverage modern technology (such as mobile or web apps in Beckman et al. (2014)) to enrich epidemiology data and infer the transmission of disease. Homogeneity related research on population level has been intensively studied in previous work. In contrast, we develop hierarchical Graph-Coupled Hidden Markov Models (hGCHMMs) to simultaneously track the spread of infection in a small cell phone community and capture person-specific infection parameters by leveraging a link prior that incorporates additional covariates. We also reexamine the model evolution of the hGCHMM from simple HMMs and LDA, elucidating additional flexibility and interpretability. Due to the non-conjugacy of sparsely coupled HMMs, we design a new approximate distribution, allowing our approach to be more applicable to other application areas. Additionally, we investigate two common link functions, the beta-exponential prior and sigmoid function, both of which allow the development of a principled Bayesian hierarchical framework for disease transmission. The results of our model allow us to predict the probability of infection for each person on each day, and also to infer personal physical vulnerability and the relevant association with covariates. We demonstrate our approach experimentally on both simulation data and real epidemiological records.

preprint2014arXiv

A Novel Non-Parametric Approach to Compare Paired General Statistical Distributions between Two Interventions

Despite of many measures applied for determine the difference between two groups of observations, such as mean value, median value, sample stan- dard deviation and so on, we propose a novel non parametric transformation method based on Mallows distance to investigate the location and variance differences between the two groups. The convexity theory of this method is constructed and thus it is a viable alternative for data of any distribu- tions. In addition, we are able to establish the similar method under other distance measures, such as Kolmogorov-Smirnov distance. The application of our method in real data is performed as well.

Kai Fan

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

A scalable pipeline for COVID-19: the case study of Germany, Czechia and Poland

Unifying Cross-lingual Summarization and Machine Translation with Compression Rate

A Practical Framework for Relation Extraction with Noisy Labels Based on Doubly Transitional Loss

Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation

Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

$k$-means: Fighting against Degeneracy in Sequential Monte Carlo with an Application to Tracking

Bayesian Models for Heterogeneous Personalized Health Data

A Novel Non-Parametric Approach to Compare Paired General Statistical Distributions between Two Interventions