Source author record

Lan Jiang

Lan Jiang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Applications, Computation, math.NA, Methodology, Computation and Language, Computer Vision, Information Retrieval, Machine Learning, math.ST, Mathematical Software, physics.soc-ph, Social and Information Networks, Statistics Theory

Catalog footprint

What is connected

9 works

13 topics

4 close collaborators

Preprint, 2022, arXiv

On Length Divergence Bias in Textual Matching Models

Despite the remarkable success deep models have achieved in Textual Matching (TM) tasks, it remains unclear whether they truly understand language or measure the semantic similarity of texts by exploiting statistical bias in datasets. In this work, we provide a new perspective to study this issue via the length divergence bias. We find the length divergence heuristic widely exists in prevalent TM datasets, providing direct cues for prediction. To determine whether TM models have adopted such a heuristic, we introduce an adversarial evaluation scheme which invalidates the heuristic. In this adversarial setting, all TM models perform worse, indicating they have indeed adopted this heuristic. Through a well-designed probing experiment, we empirically validate that the bias of TM models can be attributed in part to extracting the text length information during training. To alleviate the length divergence bias, we propose an adversarial training method. The results demonstrate that we successfully improve the robustness and generalization ability of models at the same time.

Preprint, 2021, arXiv

Detecting Layout Templates in Complex Multiregion Files

Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread use makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performance in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.

Preprint, 2020, arXiv

Structural balance in signed digraphs: considering transitivity to measure balance in graphs constructed by using different link signing methods

Structural balance theory assumes triads in networks to gravitate towards stable configurations. The theory has been verified for undirected graphs. Since real-world networks are often directed, we introduce a novel method for considering both transitivity and sign consistency for calculating balance in signed digraphs. We test our approach on graphs that we constructed by using different methods for identifying edge signs: natural language processing to infer signs from underlying text data, and self-reported survey data. Our results show that for various social contexts and edge sign detection methods, balance is moderately high, ranging from 67.5% to 92.4%.
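
As a rough illustration of combining transitivity with sign consistency, the sketch below scores a signed digraph by the share of transitive triads whose signs agree. The abstract does not give the exact measure, so the definition used here is an assumption: a triad counts when edges a->b, b->c and a->c all exist, and it is treated as balanced when the sign of a->c equals the product of the other two signs. The function name `transitive_balance` and the edge dictionary layout are hypothetical.

```python
from itertools import permutations


def transitive_balance(edges):
    """Share of transitive triads that are sign-consistent.

    edges maps a directed pair (u, v) to +1 or -1.
    Assumed definition (not taken from the paper): a triad (a, b, c) is
    transitive when a->b, b->c and a->c all exist, and balanced when
    sign(a->c) == sign(a->b) * sign(b->c).
    Returns None when the graph contains no transitive triad.
    """
    nodes = {node for pair in edges for node in pair}
    total = 0
    balanced = 0
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edges and (b, c) in edges and (a, c) in edges:
            total += 1
            if edges[(a, b)] * edges[(b, c)] == edges[(a, c)]:
                balanced += 1
    return balanced / total if total else None


if __name__ == "__main__":
    toy = {("a", "b"): 1, ("b", "c"): -1, ("a", "c"): -1}
    print(transitive_balance(toy))
```

Enumerating ordered triples is cubic in the number of nodes, which is acceptable for a toy graph but would need an adjacency-based search for networks of realistic size.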

Preprint, 2016, arXiv

Tracking multiple moving objects in images using Markov Chain Monte Carlo

A new Bayesian state and parameter learning algorithm for multiple target tracking (MTT) models with image observations is proposed. Specifically, a Markov chain Monte Carlo algorithm is designed to sample from the posterior distribution of the unknown number of targets, their birth and death times, states and model parameters, which constitutes the complete solution to the tracking problem. The conventional approach is to pre-process the images to extract point observations and then perform tracking. We model the image generation process directly to avoid potential loss of information when extracting point observations. Numerical examples show that our algorithm has improved tracking performance over commonly used techniques, for both synthetic examples and real fluorescent microscopy data, especially in the case of dim targets with overlapping illuminated regions.

Preprint, 2015, arXiv

GAIL---Guaranteed Automatic Integration Library in MATLAB: Documentation for Version 2.1

Automatic and adaptive approximation, optimization, or integration of functions in a cone with guarantee of accuracy is a relatively new paradigm. Our purpose is to create an open-source MATLAB package, Guaranteed Automatic Integration Library (GAIL), following the philosophy of reproducible research and sustainable practices of robust scientific software development. Because we are convinced that true scholarship in the computational sciences is characterized by reliable reproducibility, we employ the best practices in mathematical research and software engineering known to us and available in MATLAB. This document describes the key features of functions in GAIL, which include one-dimensional function approximation and minimization using linear splines, one-dimensional numerical integration using the trapezoidal rule, and, last but not least, mean estimation and multidimensional integration by Monte Carlo or quasi-Monte Carlo methods.

Preprint, 2014, arXiv

Bayesian tracking and parameter learning for non-linear multiple target tracking models

We propose a new Bayesian tracking and parameter learning algorithm for non-linear non-Gaussian multiple target tracking (MTT) models. We design a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution of the target states, birth and death times, and association of observations to targets, which constitutes the solution to the tracking problem, as well as the model parameters. In the numerical section, we present performance comparisons with several competing techniques and demonstrate significant performance improvements in all cases.

Preprint, 2014, arXiv

Guaranteed Monte Carlo Methods for Bernoulli Random Variables

Simple Monte Carlo is a versatile computational method with a convergence rate of $O(n^{-1/2})$. It can be used to estimate the means of random variables whose distributions are unknown. Bernoulli random variables, $Y$, are widely used to model success (failure) of complex systems. Here $Y=1$ denotes a success (failure), and $p=\mathbb{E}(Y)$ denotes the probability of that success (failure). Another application of Bernoulli random variables is $Y=\mathbb{1}_{R}(\boldsymbol{X})$, where $p$ is the probability of $\boldsymbol{X}$ lying in the region $R$. This article explores how to estimate $p$ to a prescribed absolute error tolerance, $\varepsilon$, with a high level of confidence, $1-\alpha$. The proposed algorithm automatically determines the number of samples of $Y$ needed to reach the prescribed error tolerance with the specified confidence level by using Hoeffding's inequality. The algorithm described here has been implemented in MATLAB and is part of the Guaranteed Automatic Integration Library (GAIL).
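
The sample-size step can be sketched from the standard two-sided form of Hoeffding's inequality for variables in $[0,1]$, $P(|\hat{p}_n - p| \ge \varepsilon) \le 2\exp(-2n\varepsilon^2)$, which gives $n \ge \ln(2/\alpha)/(2\varepsilon^2)$. The Python below is a minimal sketch under that assumption, not the GAIL implementation (which is in MATLAB); the function names `hoeffding_sample_size` and `estimate_bernoulli_mean` are hypothetical.

```python
import math
import random


def hoeffding_sample_size(eps: float, alpha: float) -> int:
    """Smallest n with 2 * exp(-2 * n * eps**2) <= alpha."""
    if not (0.0 < eps < 1.0) or not (0.0 < alpha < 1.0):
        raise ValueError("eps and alpha must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps ** 2))


def estimate_bernoulli_mean(sample_y, eps: float, alpha: float):
    """Average n draws of a 0/1 sampler, with n fixed in advance by Hoeffding.

    sample_y is a hypothetical zero-argument callable returning 0 or 1.
    Returns the estimate of p and the number of samples used.
    """
    n = hoeffding_sample_size(eps, alpha)
    successes = sum(sample_y() for _ in range(n))
    return successes / n, n


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sampler with p = 0.3; the true p is unknown in real use.
    p_hat, n = estimate_bernoulli_mean(lambda: int(rng.random() < 0.3), 0.01, 0.05)
    print(n, p_hat)
```

Because the bound does not depend on $p$, the sample size is fixed before any data are drawn; halving $\varepsilon$ roughly quadruples $n$.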

Preprint, 2013, arXiv

Guaranteed Conservative Fixed Width Confidence Intervals Via Monte Carlo Sampling

Monte Carlo methods are used to approximate the means, $\mu$, of random variables $Y$, whose distributions are not known explicitly. The key idea is that the average of a random sample, $Y_1, \ldots, Y_n$, tends to $\mu$ as $n$ tends to infinity. This article explores how one can reliably construct a confidence interval for $\mu$ with a prescribed half-width (or error tolerance) $\varepsilon$. Our proposed two-stage algorithm assumes that the kurtosis of $Y$ does not exceed some user-specified bound. An initial independent and identically distributed (IID) sample is used to confidently estimate the variance of $Y$. A Berry-Esseen inequality then makes it possible to determine the size of the IID sample required to construct the desired confidence interval for $\mu$. We discuss the important case where $Y=f(\boldsymbol{X})$ and $\boldsymbol{X}$ is a random $d$-vector with probability density function $\rho$. In this case $\mu$ can be interpreted as the integral $\int_{\mathbb{R}^d} f(\boldsymbol{x}) \rho(\boldsymbol{x}) \, \mathrm{d}\boldsymbol{x}$, and the Monte Carlo method becomes a method for multidimensional cubature.
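
The two-stage structure can be illustrated with a simplified sketch. It does not reproduce the paper's algorithm: Chebyshev's inequality, $P(|\hat{\mu}_n - \mu| \ge \varepsilon) \le \sigma^2/(n\varepsilon^2)$, is swapped in for the Berry-Esseen step, and the kurtosis assumption is replaced by a fixed variance inflation factor. The function name `two_stage_mean`, the pilot size and the inflation value are all illustrative assumptions.

```python
import math
import random
import statistics


def two_stage_mean(sample_y, eps, alpha, n_pilot=1000, inflation=1.2):
    """Two-stage fixed-width mean estimate (simplified sketch).

    Stage 1: a pilot IID sample gives a sample variance, which is
    inflated by inflation**2 to act as a conservative variance bound
    (an assumption standing in for the paper's kurtosis-based argument).
    Stage 2: Chebyshev's inequality sets n >= var_bound / (alpha * eps**2),
    and a fresh sample of that size, independent of the pilot, is averaged.
    """
    pilot = [sample_y() for _ in range(n_pilot)]
    var_bound = inflation ** 2 * statistics.variance(pilot)
    n = max(1, math.ceil(var_bound / (alpha * eps ** 2)))
    mu_hat = sum(sample_y() for _ in range(n)) / n
    return mu_hat, n


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy case Y = f(X) with X uniform on [0, 1] and f(x) = x**2,
    # so mu is the integral of x**2 over [0, 1].
    mu_hat, n = two_stage_mean(lambda: rng.random() ** 2, eps=0.01, alpha=0.05)
    print(n, mu_hat)
```

Chebyshev's bound is looser than a Berry-Esseen based one, so this sketch generally asks for more samples than a sharper analysis would.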

Preprint, 2012, arXiv

Estimating the Static Parameters in Linear Gaussian Multiple Target Tracking Models

We present both offline and online maximum likelihood estimation (MLE) techniques for inferring the static parameters of a multiple target tracking (MTT) model with linear Gaussian dynamics. We present the batch and online versions of the expectation-maximisation (EM) algorithm for short and long data sets respectively, and we show how Monte Carlo approximations of these methods can be implemented. Performance is assessed in numerical examples using simulated data for various scenarios and a comparison with a Bayesian estimation procedure is also provided.