Researcher profile

Binhai Zhu

Binhai Zhu is the listed author of five arXiv preprints on string algorithms, genomic copy number profiles, persistence diagrams, and pedigree graph comparison.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified
Verification L1
Unclaimed author
5 works
0 followers
4 topics
4 close collaborators

Published work

5 published items

Preprint, 2026, arXiv

Computing Maximal Repeating Subsequences in a String

In this paper we initiate the study of computing a maximal (not necessarily maximum) repeating pattern in a single input string; the corresponding problems (e.g., computing a maximal common subsequence) have previously been studied only for two or more input strings, by Hirota and Sakai starting in 2019. Given an input string $S$ of length $n$, we can compute a maximal square subsequence of $S$ in $O(n\log n)$ time, greatly improving the $O(n^2)$ bound for computing the longest square subsequence of $S$. For a maximal $k$-repeating subsequence, our bound is $O(f(k)n\log n)$, where $f(k)$ is a computable function such that $f(k) < k\cdot 4^k$. This greatly improves the $O(n^{2k-1})$ bound for computing a longest $k$-repeating subsequence of $S$, for $k\geq 3$. Both results hold for the constrained case, i.e., when the solution must contain a subsequence $X$ of $S$, though with higher running times.
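To make the notion of a square subsequence (a subsequence of the form $XX$) concrete, the following is a minimal brute-force sketch for the longest square subsequence. It is not the algorithm from the paper: it simply tries every cut position and takes a longest common subsequence of the two sides, which costs $O(n^3)$ time overall. The function names are hypothetical and chosen only for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_square_subsequence_length(s: str) -> int:
    """Brute-force baseline: length of a longest subsequence of s of the form XX.

    Any square subsequence XX has its first copy of X entirely before some
    cut position and its second copy entirely after it, so it is enough to
    maximize LCS(s[:cut], s[cut:]) over all cuts.
    """
    best = 0
    for cut in range(1, len(s)):
        best = max(best, 2 * lcs_length(s[:cut], s[cut:]))
    return best
```

For example, `longest_square_subsequence_length("banana")` considers the cut `"ban" | "ana"`, whose longest common subsequence is `"an"`, giving the square subsequence `"anan"`.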

Preprint, 2022, arXiv

Beyond the Longest Letter-duplicated Subsequence Problem

Given a sequence $S$ of length $n$, a letter-duplicated subsequence is a subsequence of $S$ in the form of $x_1^{d_1}x_2^{d_2}\cdots x_k^{d_k}$ with $x_i\in\Sigma$, $x_j\neq x_{j+1}$ and $d_i\geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. A linear time algorithm for computing the longest letter-duplicated subsequence (LLDS) of $S$ can be easily obtained. In this paper, we focus on two variants of this problem. We first consider the constrained version when $\Sigma$ is unbounded, each letter appears in $S$ at least 6 times and all the letters in $\Sigma$ must appear in the solution. We show that the problem is NP-hard (a further twist indicates that the problem does not admit any polynomial time approximation). The reduction is from possibly the simplest version of SAT that is NP-complete, $(\leq 2,1,\leq 3)$-SAT, where each variable appears at most twice positively and exactly once negatively, and each clause contains at most three literals and some clauses must contain exactly two literals. (We hope that this technique, which we apply successfully at the end of the paper to some further variations of the LLDS problem, will serve as a general tool for proving NP-hardness of trickier sequence problems involving only one sequence, which are much harder to handle than those with at least two input sequences.) We then show that when each letter appears in $S$ at most 3 times, the problem admits a factor $1.5-O(\frac{1}{n})$ approximation. Finally, we consider the weighted version, where the weight of a block $x_i^{d_i}$ ($d_i\geq 2$) could be any positive function which might not grow with $d_i$. We give a non-trivial $O(n^2)$ time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of $S$ whose weight is maximized.
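As an illustration of the letter-duplicated form, here is a small sketch with a checker and a simple quadratic dynamic program for the unweighted, unconstrained LLDS length. This is an assumption-laden illustration, not the linear-time algorithm or the weighted $O(n^2)$ algorithm mentioned in the abstract; the function names are hypothetical. The DP relies on the observation that dropping the requirement $x_j\neq x_{j+1}$ does not change the optimum, since adjacent blocks of the same letter can be merged into one valid block.

```python
from itertools import groupby


def is_letter_duplicated(t: str) -> bool:
    """True if every maximal run of equal letters in t has length >= 2."""
    return all(len(list(run)) >= 2 for _, run in groupby(t))


def llds_length(s: str) -> int:
    """Length of a longest letter-duplicated subsequence of s (simple O(n^2) DP).

    best[i] is the optimum for the prefix s[:i]. The last block of a solution
    for s[:i] either does not use s[i-1], or it consists of every occurrence
    of the letter s[i-1] inside some window s[j:i], with at least two
    occurrences, appended to an optimal solution for s[:j].
    """
    n = len(s)
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]          # option: skip s[i-1]
        letter = s[i - 1]
        count = 0                      # occurrences of letter in s[j:i]
        for j in range(i - 1, -1, -1):
            if s[j] == letter:
                count += 1
            if count >= 2:
                best[i] = max(best[i], best[j] + count)
    return best[n]
```

For instance, on `"aabab"` the DP finds the subsequence `"aabb"` (the block `aa` followed by the two occurrences of `b`), which `is_letter_duplicated` accepts.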

Preprint, 2020, arXiv

Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

Recently, owing to genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) have become increasingly popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific gene or segment of interest. In this paper, we present two streams of results. The first consists of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem, and they left open the question of whether the problem remains NP-hard if arbitrary duplications are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. We also prove that the parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs $C_1$ and $C_2$, compute two strings $S_1$ and $S_2$ with $cnp(S_1)=C_1$ and $cnp(S_2)=C_2$ such that the distance between $S_1$ and $S_2$, $d(S_1,S_2)$, is minimized. Here, $d(S_1,S_2)$ is a very general term, which means it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if $d(S_1,S_2)$ is measured by the breakpoint distance then the problem is polynomially solvable.
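The map $cnp(S)$ used in the CNPC definition can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not from the abstract: each gene is represented by a single character, and the order of the components is fixed by a hypothetical `genes` argument.

```python
from collections import Counter
from typing import Tuple


def cnp(genome: str, genes: str) -> Tuple[int, ...]:
    """Copy number profile of a genome string.

    Assumption: each character of `genome` is one gene, and `genes` lists the
    genes of interest in the order of the profile's components. A gene that
    does not occur in the genome gets copy number 0.
    """
    counts = Counter(genome)
    return tuple(counts[g] for g in genes)
```

Under these assumptions, `cnp("abcab", "abc")` yields `(2, 2, 1)`, and two different strings such as `"abcab"` and `"aabbc"` share that profile, which is why CNPC asks for the pair of strings realizing the two given profiles at minimum distance.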

Preprint, 2020, arXiv

On Computing a Center Persistence Diagram

Throughout this paper, a persistence diagram ${\cal P}$ is composed of a set $P$ of planar points (each corresponding to a topological feature) above the line $Y=X$, as well as the line $Y=X$ itself, i.e., ${\cal P}=P\cup\{(x,y)|y=x\}$. Given a set of persistence diagrams ${\cal P}_1,...,{\cal P}_m$, for the purpose of data reduction, one way to summarize their topological features is to first compute their center ${\cal C}$ under the bottleneck distance. We consider two discrete versions and one continuous version. For technical reasons, we first focus on the case when the $|P_i|$'s are all the same (i.e., all have the same size $n$), and the problem is to compute a center point set $C$ under the bottleneck matching distance. We show, by a non-trivial reduction from the Planar 3D-Matching problem, that this problem is NP-hard even when $m=3$ diagrams are given. This implies that the general center problem for persistence diagrams under the bottleneck distance, when the $P_i$'s possibly have different sizes, is also NP-hard when $m\geq 3$. On the positive side, we show that this problem is polynomially solvable when $m=2$ and admits a factor-2 approximation for $m\geq 3$. These positive results hold for any $L_p$ metric when the $P_i$'s are point sets of the same size, and also hold for the case when the $P_i$'s have different sizes in the $L_\infty$ metric (i.e., for the Center Persistence Diagram problem). This is the best possible in polynomial time for the Center Persistence Diagram under the bottleneck distance unless P = NP. All these results hold for both of the discrete versions as well as the continuous version; in fact, the NP-hardness and approximation results also hold under the Wasserstein distance for the continuous version.
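For readers unfamiliar with the bottleneck distance between two persistence diagrams, the following brute-force sketch may help. It assumes the usual formulation with the $L_\infty$ ground metric, in which any point may alternatively be matched to the diagonal $Y=X$ at cost $(y-x)/2$. It enumerates all matchings, so it is only usable on tiny inputs, and it does not compute a center; the names are hypothetical.

```python
from itertools import permutations

DIAG = None  # stands for "some point on the diagonal Y = X"


def _cost(a, b) -> float:
    """L-infinity matching cost between two entries (points or DIAG)."""
    if a is DIAG and b is DIAG:
        return 0.0
    if a is DIAG:
        return (b[1] - b[0]) / 2   # L-infinity distance from b to the diagonal
    if b is DIAG:
        return (a[1] - a[0]) / 2
    return float(max(abs(a[0] - b[0]), abs(a[1] - b[1])))


def bottleneck_distance(p, q) -> float:
    """Brute-force bottleneck distance between point lists p and q.

    Each side is padded with one diagonal slot per point of the other side,
    so every point can be matched either to a real point or to the diagonal.
    The result is the minimum, over all perfect matchings, of the largest
    cost of a matched pair.
    """
    left = list(p) + [DIAG] * len(q)
    right = list(q) + [DIAG] * len(p)
    if not left:
        return 0.0
    return min(
        max(_cost(a, b) for a, b in zip(left, perm))
        for perm in permutations(right)
    )
```

With `p = [(0, 2)]` and `q = [(10, 12)]`, matching the two points directly costs 10, while sending each to the diagonal costs 1, so the function returns `1.0`.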

Preprint, 2011, arXiv

Comparing Pedigree Graphs

Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this paper, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present 1) several algorithms that are fast and exact in various special cases, and 2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.