Source author record

Binhai Zhu

Binhai Zhu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Data Structures and Algorithms; Computational Geometry; Computational Complexity; Formal Languages and Automata Theory

Catalog footprint

9 works · 4 topics · 4 close collaborators

preprint · 2026 · arXiv

Computing Maximal Repeating Subsequences in a String

In this paper we initiate the study of computing a maximal (not necessarily maximum) repeating pattern in a single input string; the corresponding problems (e.g., computing a maximal common subsequence) have previously been studied only for two or more input strings, by Hirota and Sakai starting in 2019. Given an input string $S$ of length $n$, we can compute a maximal square subsequence of $S$ in $O(n\log n)$ time, greatly improving the $O(n^2)$ bound for computing the longest square subsequence of $S$. For a maximal $k$-repeating subsequence, our bound is $O(f(k)n\log n)$, where $f(k)$ is a computable function such that $f(k) < k\cdot 4^k$. This greatly improves the $O(n^{2k-1})$ bound for computing a longest $k$-repeating subsequence of $S$, for $k\geq 3$. Both results hold for the constrained case, i.e., when the solution must contain a subsequence $X$ of $S$, though with higher running times.
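
To make the "longest square subsequence" baseline concrete, here is a minimal sketch of a naive approach: split $S$ at every position and take the longest common subsequence (LCS) of the two halves. This is an illustrative cubic-time baseline, not the $O(n^2)$ method cited in the abstract and not the paper's $O(n\log n)$ maximal algorithm; the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_square_subsequence_length(s: str) -> int:
    """Length of a longest subsequence of s of the form XX.

    Assumption: naive baseline. Any square subsequence XX takes its first
    copy of X from some prefix s[:i] and its second copy from s[i:], so we
    try every split point and double the best LCS length.
    """
    best_half = max(
        (lcs_length(s[:i], s[i:]) for i in range(1, len(s))),
        default=0,
    )
    return 2 * best_half
```

A maximal square subsequence, by contrast, only needs to be non-extendable, which is why it can be found faster than a longest one.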

preprint · 2022 · arXiv

Beyond the Longest Letter-duplicated Subsequence Problem

Given a sequence $S$ of length $n$, a letter-duplicated subsequence is a subsequence of $S$ in the form of $x_1^{d_1}x_2^{d_2}\cdots x_k^{d_k}$ with $x_i\in\Sigma$, $x_j\neq x_{j+1}$ and $d_i\geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. A linear time algorithm for computing the longest letter-duplicated subsequence (LLDS) of $S$ can be easily obtained. In this paper, we focus on two variants of this problem. We first consider the constrained version when $\Sigma$ is unbounded, each letter appears in $S$ at least 6 times and all the letters in $\Sigma$ must appear in the solution. We show that the problem is NP-hard (a further twist indicates that the problem does not admit any polynomial time approximation). The reduction is from possibly the simplest version of SAT that is NP-complete, $(\leq 2,1,\leq 3)$-SAT, where each variable appears at most twice positively and exactly once negatively, and each clause contains at most three literals and some clauses must contain exactly two literals. (We hope that this technique will serve as a general tool to help us prove NP-hardness for some trickier sequence problems involving only one sequence, which are much harder than those with at least two input sequences; we apply it successfully at the end of the paper to some extra variations of the LLDS problem.) We then show that when each letter appears in $S$ at most 3 times, the problem admits a factor $1.5-O(\frac{1}{n})$ approximation. Finally, we consider the weighted version, where the weight of a block $x_i^{d_i}$ $(d_i\geq 2)$ could be any positive function which might not grow with $d_i$. We give a non-trivial $O(n^2)$ time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of $S$ whose weight is maximized.
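
The unconstrained, unweighted LLDS definition above can be illustrated with a short one-pass sketch. It is not taken from the paper; it relies on the observation that two adjacent blocks of the same letter simply merge into one longer block, so the condition $x_j\neq x_{j+1}$ never reduces the optimal length. The function name and the two dictionaries are hypothetical, and the running time is linear only in the expected sense (hash-map lookups).

```python
def llds_length(s: str) -> int:
    """Length of a longest letter-duplicated subsequence of s (sketch)."""
    best = 0      # longest complete LD-subsequence found in the prefix read so far
    pending = {}  # pending[c]: a complete LD-subsequence plus one extra trailing c
    done = {}     # done[c]: longest LD-subsequence whose last block consists of c's

    for c in s:
        candidates = []
        if c in done:
            candidates.append(done[c] + 1)     # lengthen the last block of c's
        if c in pending:
            candidates.append(pending[c] + 1)  # second c turns the pending letter into a block
        # This c may also start a new block after the best complete
        # solution so far (uses `best` from before this position).
        pending[c] = best + 1
        if candidates:
            done[c] = max(candidates)
            best = max(best, done[c])
    return best
```

For example, on `"aabab"` the sketch keeps `a a b b` (positions 1, 2, 3, 5) and returns 4, while on `"abcabc"` no two blocks fit one after another and it returns 2.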

preprint · 2020 · arXiv

Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

Recently, due to genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) are becoming more and more popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific gene or segment of interest. In this paper, we present two streams of results. The first is the negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem and they left the open question of whether the problem remains NP-hard if arbitrary duplications are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. We also prove that the parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs $C_1$ and $C_2$, compute two strings $S_1$ and $S_2$ with $cnp(S_1)=C_1$ and $cnp(S_2)=C_2$ such that the distance between $S_1$ and $S_2$, $d(S_1,S_2)$, is minimized. Here, $d(S_1,S_2)$ is a very general term, which means it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if $d(S_1,S_2)$ is measured by the breakpoint distance then the problem is polynomially solvable.
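
As a small illustration of the $cnp(\cdot)$ map used in the CNPC definition, the sketch below counts how often each gene occurs in a genome string. It assumes a genome is a string with one character per gene and that the order of the profile's components is given by a hypothetical `genes` argument; neither convention is fixed by the abstract.

```python
from collections import Counter


def cnp(genome: str, genes: str) -> list[int]:
    """Copy number profile of `genome` with respect to the gene order `genes`.

    Assumption: each character of `genome` is one gene, and component i of
    the profile is the number of copies of genes[i] (0 if it is absent).
    """
    counts = Counter(genome)
    return [counts[g] for g in genes]


# Two different genomes can conform to the same profile, which is why
# CNPC asks for the pair of conforming genomes that are closest to each other.
profile_1 = cnp("abbac", "abcd")
profile_2 = cnp("cabab", "abcd")
```

Here both `profile_1` and `profile_2` equal `[2, 2, 1, 0]`, even though the underlying genomes differ.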

preprint · 2020 · arXiv

On Computing a Center Persistence Diagram

Throughout this paper, a persistence diagram ${\cal P}$ is composed of a set $P$ of planar points (each corresponding to a topological feature) above the line $Y=X$, as well as the line $Y=X$ itself, i.e., ${\cal P}=P\cup\{(x,y)|y=x\}$. Given a set of persistence diagrams ${\cal P}_1,...,{\cal P}_m$, for the purpose of data reduction, one way to summarize their topological features is to first compute their center ${\cal C}$ under the bottleneck distance. We consider two discrete versions and one continuous version. For technical reasons, we first focus on the case when the $|P_i|$'s are all the same (i.e., all have the same size $n$), and the problem is to compute a center point set $C$ under the bottleneck matching distance. We show, by a non-trivial reduction from the Planar 3D-Matching problem, that this problem is NP-hard even when $m=3$ diagrams are given. This implies that the general center problem for persistence diagrams under the bottleneck distance, when the $P_i$'s possibly have different sizes, is also NP-hard when $m\geq 3$. On the positive side, we show that this problem is polynomially solvable when $m=2$ and admits a factor-2 approximation for $m\geq 3$. These positive results hold for any $L_p$ metric when the $P_i$'s are point sets of the same size, and also hold for the case when the $P_i$'s have different sizes in the $L_\infty$ metric (i.e., for the Center Persistence Diagram problem). This is the best possible in polynomial time for the Center Persistence Diagram under the bottleneck distance unless P = NP. All these results hold for both of the discrete versions as well as the continuous version; in fact, the NP-hardness and approximation results also hold under the Wasserstein distance for the continuous version.

preprint · 2016 · arXiv

Intermittent Map Matching with the Discrete Fréchet Distance

In this paper we focus on the map matching problem where the goal is to find a path through a planar graph such that the path through the vertices closely matches a given polygonal curve. The map matching problem is usually approached with the Fréchet distance matching the edges of the path as well. Here, we formally define the discrete map matching problem based on the discrete Fréchet distance. We then look at the complexities of some variations of the problem which allow for vertices in the graph to be unique or reused, and whether there is a bound on the length of the path or the number of vertices from the graph used in the path. We prove several of these problems to be NP-complete, and then conclude the paper with some open questions.
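
For readers unfamiliar with the discrete Fréchet distance that this and the following abstracts build on, here is a minimal sketch of the standard quadratic-time dynamic program for two point sequences. It assumes Euclidean distance and points given as coordinate tuples; it illustrates the distance itself, not any of the map matching variants studied in the paper.

```python
from math import dist


def discrete_frechet(P, Q):
    """Discrete Fréchet distance between non-empty point sequences P and Q.

    ca[i][j] is the smallest possible "leash length" needed to walk both
    sequences monotonically from their starts to P[i] and Q[j].
    """
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance on P, on Q, or on both, whichever is cheapest.
                ca[i][j] = max(
                    min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1]), d
                )
    return ca[n - 1][m - 1]
```

For two parallel horizontal chains one unit apart, such as `[(0, 0), (1, 0), (2, 0)]` and `[(0, 1), (1, 1), (2, 1)]`, the function returns `1.0`.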

preprint · 2015 · arXiv

Complexity and Algorithms for the Discrete Fréchet Distance Upper Bound with Imprecise Input

We study the problem of computing the upper bound of the discrete Fréchet distance for imprecise input, and prove that the problem is NP-hard. This solves an open problem posed in 2010 by Ahn et al. If shortcuts are allowed, we show that the upper bound of the discrete Fréchet distance with shortcuts for imprecise input can be computed in polynomial time and we present several efficient algorithms.

preprint · 2014 · arXiv

On the Chain Pair Simplification Problem

The problem of efficiently computing and visualizing the structural resemblance between a pair of protein backbones in 3D has led Bereg et al. to pose the Chain Pair Simplification problem (CPS). In this problem, given two polygonal chains $A$ and $B$ of lengths $m$ and $n$, respectively, one needs to simplify them simultaneously, such that each of the resulting simplified chains, $A'$ and $B'$, is of length at most $k$ and the discrete Fréchet distance between $A'$ and $B'$ is at most $\delta$, where $k$ and $\delta$ are given parameters. In this paper we study the complexity of CPS under the discrete Fréchet distance (CPS-3F), i.e., where the quality of the simplifications is also measured by the discrete Fréchet distance. Since CPS-3F was posed in 2008, its complexity has remained open. However, it was believed to be NP-complete, since CPS under the Hausdorff distance (CPS-2H) was shown to be NP-complete. We first prove that the weighted version of CPS-3F is indeed weakly NP-complete, even on the line, based on a reduction from the set partition problem. Then, we prove that CPS-3F is actually polynomially solvable, by presenting an $O(m^2n^2\min\{m,n\})$ time algorithm for the corresponding minimization problem. In fact, we prove a stronger statement, implying, for example, that if weights are assigned to the vertices of only one of the chains, then the problem remains polynomially solvable. We also study a few less rigid variants of CPS and present efficient solutions for them. Finally, we present some experimental results that suggest that (the minimization version of) CPS-3F is significantly better than previous algorithms for the motivating biological application.

preprint · 2013 · arXiv

Algorithms for Cut Problems on Trees

We study the Multicut on Trees and the Generalized Multiway Cut on Trees problems. For the Multicut on Trees problem, we present a parameterized algorithm that runs in time $O^{*}(\rho^k)$, where $\rho = \sqrt{\sqrt{2} + 1} \approx 1.555$ is the positive root of the polynomial $x^4-2x^2-1$. This improves the current-best algorithm of Chen et al. that runs in time $O^{*}(1.619^k)$. For the Generalized Multiway Cut on Trees problem, we show that this problem is solvable in polynomial time if the number of terminal sets is fixed; this answers an open question posed in a recent paper by Liu and Zhang. By reducing the Generalized Multiway Cut on Trees problem to the Multicut on Trees problem, our results give a parameterized algorithm that solves the Generalized Multiway Cut on Trees problem in time $O^{*}(\rho^k)$, where $\rho = \sqrt{\sqrt{2} + 1} \approx 1.555$.
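
The closed form for the branching constant can be checked directly. The short derivation below is only a verification of the stated root via the substitution $y = x^2$; it is not the paper's analysis of the algorithm.

```latex
\begin{align*}
x^4 - 2x^2 - 1 = 0
  &\iff y^2 - 2y - 1 = 0, \quad y = x^2, \\
  &\iff y = 1 \pm \sqrt{2}.
\end{align*}
Since $y = x^2 \ge 0$ for real $x$, only $y = 1 + \sqrt{2}$ is admissible, so
\[
  x = \sqrt{\sqrt{2} + 1} \approx 1.5538 .
\]
```

The numerical value is about 1.5538, so the 1.555 quoted in the abstract is a slightly rounded-up bound, still well below the earlier base of 1.619.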

preprint · 2011 · arXiv

Comparing Pedigree Graphs

Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this paper, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present 1) several algorithms that are fast and exact in various special cases, and 2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.