Source author record

Or Zuk

Or Zuk appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Applications; Computation; Information Theory; math.IT; Artificial Intelligence; Genomics; math.PR; math.ST; Quantitative Methods; Statistics Theory; Computational Engineering, Finance, and Science; cond-mat.stat-mech; Methodology

Catalog footprint

What is connected

11 works

14 topics

4 close collaborators

preprint · 2026 · arXiv

Inverse Design for Conditional Distribution Matching

Generative models are powerful tools for sampling from a learned distribution $\mathcal{P}(Y \mid X)$, and inverse-design methods invert this map to find an input $x$ that produces a desired point output $y^*$. However, many design goals are naturally distributional rather than pointwise, incorporating the inherent uncertainty of $Y$ and targeting a specific form for it, a task not addressed by standard inverse design. To address this issue we introduce Conditional Distribution Matching (CDM), a new inverse-design problem class in generative modeling: given a joint distribution $\mathcal{P}(X, Y)$ and a target distribution $\mathcal{G}(Y)$, find an input $x^*$ whose induced conditional distribution $\mathcal{P}(Y \mid X = x^*)$ matches $\mathcal{G}$. We formally define two variants: Conditional Distribution Matching Sampling (CDMS) and Conditional Distribution Matching Optimization (CDMO). To solve these problems, we propose MLGD-F (Matching-Loss Guided Diffusion with a Fast inner sampler), a plug-and-play inference-time algorithm that combines a pretrained score-based diffusion model with a pretrained fast conditional sampler, requiring no additional training or fine-tuning. By leveraging single-step conditional sampling, MLGD-F enables tractable gradient computation, making the estimation of $\mathcal{P}(Y \mid X)$ both memory-efficient and computationally lightweight. We validate MLGD-F on synthetic benchmarks, structured image transformations, and generative editing optimization, demonstrating reliable recovery of inputs whose conditional distributions match diverse user-specified targets, including discrete mixtures and continuous low-rank supports.

preprint · 2026 · arXiv

Measuring and Decomposing Mode Separation via the Canonical Diffusion

Mode separation, namely how sharply a distribution fragments into barrier-separated clusters, is a fundamental geometric property of densities, difficult to quantify in high dimensions. It is structurally distinct from dispersion, yet existing tools fall short: differential entropy rises with spread regardless of fragmentation, PCA orders directions by variance regardless of barriers, and mutual information requires a mixture decomposition one usually does not have. We measure mode separation through a single stochastic process intrinsic to the density: a unique reversible diffusion with $f$ as its stationary distribution and constant scalar diffusion coefficient. We extract two readouts from its autocovariance matrix: SSA (Sum of Squared Autocorrelations), a scalar barrier-sensitive measure; and DA (Dominant Autocorrelation directions), linear projections ordered by metastability rather than variance. Under an isotropic-Gaussian null, we derive a closed-form spectrum for the empirical autocovariance that generalizes Marchenko--Pastur, with an analytic upper edge that selects the lag at which DA is read off. Both readouts use only samples and a score function, scaling to high dimensions through pretrained score-based generative models via Tweedie's identity. We apply our framework to three settings: (i) synthetic Gaussian mixtures, where SSA tracks mutual information; (ii) SDXL text-to-image generations, where SSA and DA capture structure that entropy and PCA miss; and (iii) molecular dynamics of alanine dipeptide, where DA recovers the known slow backbone dihedrals from static samples alone.

preprint · 2022 · arXiv

A correlation inequality for random points in a hypercube with some implications

Let $\prec$ be the product order on $\mathbb{R}^k$ and assume that $X_1,X_2,\ldots,X_n$ ($n\geq3$) are i.i.d. random vectors distributed uniformly in the unit hypercube $[0,1]^k$. Let $S$ be the (random) set of vectors in $\mathbb{R}^k$ that $\prec$-dominate all vectors in $\{X_3,\ldots,X_n\}$, and let $W$ be the set of vectors that are not $\prec$-dominated by any vector in $\{X_3,\ldots,X_n\}$. The main result of this work is the correlation inequality \begin{equation*} P(X_2\in W|X_1\in W)\leq P(X_2\in W|X_1\in S)\,. \end{equation*} For every $1\leq i \leq n$ let $E_{i,n}$ be the event that $X_i$ is not $\prec$-dominated by any of the other vectors in $\{X_1,\ldots,X_n\}$. The main inequality yields an elementary proof for the result that the events $E_{1,n}$ and $E_{2,n}$ are asymptotically independent as $n\to\infty$. Furthermore, we derive a related combinatorial formula for the variance of the sum $\sum_{i=1}^n \mathbf{1}_{E_{i,n}}$, i.e. the number of maxima under the product order $\prec$, and show that certain linear functionals of partial sums of $\{\mathbf{1}_{E_{i,n}};1\leq i\leq n\}$ are asymptotically normal as $n\to\infty$.
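The quantity $\sum_{i=1}^n \mathbf{1}_{E_{i,n}}$ in this abstract, the number of maxima under the product order, can be illustrated with a brute-force count. The following is a minimal sketch and not code from the paper; the function names, the seed, and the choices $n=200$, $k=3$ are illustrative assumptions, and domination is taken here to mean "at least as large in every coordinate and not identical".

```python
import random


def is_dominated(x, y):
    """True if y dominates x: y >= x in every coordinate and y differs from x.

    Assumption: this is one common reading of domination under the product
    order; for continuous uniform points, ties occur with probability zero.
    """
    return all(xj <= yj for xj, yj in zip(x, y)) and tuple(x) != tuple(y)


def count_maxima(points):
    """Count points that are not dominated by any other point (O(n^2 k))."""
    count = 0
    for i, x in enumerate(points):
        dominated = any(
            is_dominated(x, y) for j, y in enumerate(points) if j != i
        )
        if not dominated:
            count += 1
    return count


# Illustrative simulation: n i.i.d. uniform points in the unit hypercube [0, 1]^k.
rng = random.Random(1)
n, k = 200, 3
points = [tuple(rng.random() for _ in range(k)) for _ in range(n)]
num_maxima = count_maxima(points)
print(f"{num_maxima} of {n} points are maxima in dimension {k}")
```

Repeating the simulation over many seeds would give an empirical mean and variance of the number of maxima, which is the statistic whose variance formula and asymptotic normality the abstract describes.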

preprint · 2015 · arXiv

Clustering Noisy Signals with Structured Sparsity Using Time-Frequency Representation

We propose a simple and efficient time-series clustering framework particularly suited for low Signal-to-Noise Ratio (SNR), by simultaneous smoothing and dimensionality reduction aimed at preserving clustering information. We extend the sparse K-means algorithm by incorporating structured sparsity, and use it to exploit the multi-scale property of wavelets and group structure in multivariate signals. Finally, we extract features invariant to translation and scaling with the scattering transform, which corresponds to a convolutional network with filters given by a wavelet operator, and use the network's structure in sparse clustering. By promoting sparsity, this transform can yield a low-dimensional representation of signals that gives improved clustering results on several real datasets.

preprint · 2015 · arXiv

Low-Rank Matrix Recovery from Row-and-Column Affine Measurements

We propose and study a row-and-column affine measurement scheme for low-rank matrix recovery. Each measurement is a linear combination of elements in one row or one column of a matrix $X$. This setting arises naturally in applications from different domains. However, current algorithms developed for standard matrix recovery problems do not perform well in our case, hence the need for new algorithms and theory for our problem. We propose a simple algorithm for the problem based on Singular Value Decomposition (SVD) and least squares (LS), which we term \alg. We prove that (a simplified version of) our algorithm can recover $X$ exactly with the minimum possible number of measurements in the noiseless case. In the general noisy case, we prove performance guarantees on the reconstruction accuracy under the Frobenius norm. In simulations, our row-and-column design and the \alg algorithm show improved speed, and comparable and in some cases better accuracy, compared to standard measurement designs and algorithms. Our theoretical and experimental results suggest that the proposed row-and-column affine measurement scheme, together with our recovery algorithm, may provide a powerful framework for affine matrix reconstruction.

preprint · 2013 · arXiv

Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization

We describe the Microbial Community Reconstruction (MCR) Problem, which is fundamental for microbiome analysis. In this problem, the goal is to reconstruct the identity and frequency of species comprising a microbial community, using short sequence reads from Massively Parallel Sequencing (MPS) data obtained for specified genomic regions. We formulate the problem mathematically as a convex optimization problem and provide sufficient conditions for identifiability, namely the ability to reconstruct species identity and frequency correctly when the data size (number of reads) grows to infinity. We discuss different metrics for assessing the quality of the reconstructed solution, including a novel phylogenetically aware metric based on the Mahalanobis distance, and give upper bounds on the reconstruction error for a finite number of reads under different metrics. We propose a scalable divide-and-conquer algorithm for the problem using convex optimization, which enables us to handle large problems (with $\sim10^6$ species). We show using numerical simulations that for realistic scenarios, where the microbial communities are sparse, our algorithm gives solutions with high accuracy, both in terms of obtaining accurate frequencies and in terms of species phylogenetic resolution.

preprint · 2012 · arXiv

On the Number of Samples Needed to Learn the Correct Structure of a Bayesian Network

Bayesian Networks (BNs) are useful tools giving a natural and compact representation of joint probability distributions. In many applications one needs to learn a BN from data. In this context, it is important to understand the number of samples needed to guarantee successful learning. Previous work has studied the sample complexity of BNs, yet it mainly focused on the requirement that the learned distribution be close to the original distribution which generated the data. In this work, we study a different aspect of learning, namely the number of samples needed to learn the correct structure of the network. We give both asymptotic results, valid in the large-sample limit, and experimental results, demonstrating the learning behavior for feasible sample sizes. We show that structure learning is a more difficult task than approximating the correct distribution, in the sense that it requires a much larger number of samples, regardless of the computational power available to the learner.

preprint · 2012 · arXiv

Ranking Under Uncertainty

Ranking objects is a simple and natural procedure for organizing data. It is often performed by assigning a quality score to each object according to its relevance to the problem at hand. Ranking is widely used for object selection, when resources are limited and it is necessary to select a subset of the most relevant objects for further processing. In real-world situations, the objects' scores are often calculated from noisy measurements, casting doubt on the ranking reliability. We introduce an analytical method for assessing the influence of noise levels on the ranking reliability. We use two similarity measures for reliability evaluation, Top-K-List overlap and Kendall's tau measure, and show that the former is much more sensitive to noise than the latter. We apply our method to gene selection in a series of microarray experiments of several cancer types. The results indicate that the reliability of the lists obtained from these experiments is very poor, and that the experiment sizes necessary for attaining reasonably stable Top-K-Lists are much larger than those currently available. Simulations support our analytical results.
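The two reliability measures named in this abstract can be illustrated with a small simulation. This is a minimal sketch and not the paper's analytical method: it draws hypothetical true scores, produces two noisy replicates, and compares their rankings. The function names, the Gaussian score and noise models, and all parameter values are illustrative assumptions.

```python
import random


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of items shared by the top-k lists of two score vectors."""
    def top(scores):
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        return set(ranked[:k])
    return len(top(scores_a) & top(scores_b)) / k


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over all pairs: (concordant - discordant) / total pairs."""
    n = len(scores_a)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
            if sign > 0:
                balance += 1   # pair ordered the same way in both rankings
            elif sign < 0:
                balance -= 1   # pair ordered in opposite ways
    return balance / (n * (n - 1) / 2)


# Illustrative setup: 200 objects with Gaussian true scores, measured twice
# with independent Gaussian noise (sigma is an assumed noise level).
rng = random.Random(0)
true_scores = [rng.gauss(0.0, 1.0) for _ in range(200)]
sigma = 1.0
replicate_a = [s + rng.gauss(0.0, sigma) for s in true_scores]
replicate_b = [s + rng.gauss(0.0, sigma) for s in true_scores]

overlap = top_k_overlap(replicate_a, replicate_b, k=20)
tau = kendall_tau(replicate_a, replicate_b)
print(f"top-20 overlap: {overlap:.2f}, Kendall's tau: {tau:.2f}")
```

Varying `sigma` and the number of objects gives an empirical picture of how each measure responds to noise, which is the question the abstract treats analytically.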

preprint · 2011 · arXiv

FDR control with adaptive procedures and FDR monotonicity

The steep rise in availability and usage of high-throughput technologies in biology brought with it a clear need for methods to control the False Discovery Rate (FDR) in multiple tests. Benjamini and Hochberg (BH) introduced in 1995 a simple procedure and proved that it provided a bound on the expected value, $\mathit{FDR}\leq q$. Since then, many authors have tried to improve the BH bound, with one approach being to design adaptive procedures, which aim at estimating the number of true null hypotheses in order to get a better FDR bound. Our two main rigorous results are the following: (i) a theorem that provides a bound on the FDR for adaptive procedures that use any estimator for the number of true null hypotheses ($m_0$), (ii) a theorem that proves a monotonicity property of general BH-like procedures, both for the case where the hypotheses are independent. We also propose two improved procedures for which we prove FDR control for the independent case, and demonstrate their advantages over several available bounds, on simulated data and on a large number of gene expression data sets. Both applications are simple and involve a similar amount of computation as the original BH procedure. We compare the performance of our proposed procedures with BH and other procedures and find that in most cases we get more power for the same level of statistical significance.
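For orientation, the original BH step-up procedure that the adaptive procedures build on can be sketched as follows. This is a minimal sketch of standard BH only, not the improved procedures proposed in the paper; the function name `benjamini_hochberg` and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule.

    With m p-values sorted as p_(1) <= ... <= p_(m), BH finds the largest k
    such that p_(k) <= k * q / m and rejects the k smallest p-values.
    """
    m = len(p_values)
    # Hypothesis indices ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k_max])


# Made-up p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, q=0.05)
print("rejected hypothesis indices:", rejected)
```

Roughly speaking, an adaptive variant would use an estimate of $m_0$ in place of $m$ in the threshold; how to choose that estimator while keeping FDR control is the subject of the paper and is not shown here.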

preprint · 2010 · arXiv

Bacterial Community Reconstruction Using A Single Sequencing Reaction

Bacteria are the unseen majority on our planet, with millions of species and comprising most of the living protoplasm. While current methods enable in-depth study of a small number of communities, a simple tool for breadth studies of bacterial population composition in a large number of samples is lacking. We propose a novel approach for reconstruction of the composition of an unknown mixture of bacteria using a single Sanger-sequencing reaction of the mixture. This method is based on compressive sensing theory, which deals with reconstruction of a sparse signal using a small number of measurements. Utilizing the fact that in many cases each bacterial community comprises a small subset of the known bacterial species, we show the feasibility of this approach for determining the composition of a bacterial mixture. Using simulations, we show that sequencing a few hundred base-pairs of the 16S rRNA gene sequence may provide enough information for reconstruction of mixtures containing tens of species, out of tens of thousands, even in the presence of realistic measurement noise. Finally, we show initial promising results when applying our method to the reconstruction of a toy experimental mixture with five species. Our approach may offer a practical and efficient way to identify bacterial species compositions in biological samples.

preprint · 2005 · arXiv

Taylor series expansions for the entropy rate of Hidden Markov Processes

Finding the entropy rate of Hidden Markov Processes is an active research topic, of both theoretical and practical importance. A recently used approach is studying the asymptotic behavior of the entropy rate in various regimes. In this paper we generalize and prove a previous conjecture relating the entropy rate to entropies of finite systems. Building on our new theorems, we establish series expansions for the entropy rate in two different regimes. We also study the radius of convergence of the two series expansions.