Source author record

Jarek Duda

Jarek Duda appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Theory math.IT Machine Learning Data Structures and Algorithms eess.IV Multimedia Computational Engineering, Finance, and Science Computer Vision cond-mat.stat-mech Cryptography and Security Databases Genomics math-ph math.CO math.MP q-fin.ST quant-ph Quantitative Methods

Catalog footprint

What is connected

19works

18topics

3close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

PCA can be used for rotation invariant features, describing a shape with its $p_{ab}=E[(x_i-E[x_a])(x_b-E[x_b])]$ covariance matrix approximating shape by ellipsoid, allowing for rotation invariants like its traces of powers. However, real shapes are usually much more complicated, hence there is proposed its extension to e.g. $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ order-3 or higher tensors describing central moments, or polynomial times Gaussian allowing decodable shape descriptors of arbitrarily high accuracy, and their analogous rotation invariants. Its practical applications could be rotation-invariant features to include shape modulo rotation e.g. for molecular shape descriptors, or for up to rotation object recognition in 2D images/3D scans, or shape similarity metric allowing their inexpensive comparison (modulo rotation) without costly optimization over rotations.

preprint2022arXiv

Context binning, model clustering and adaptivity for data compression of genetic data

Rapid growth of genetic databases means huge savings from improvements in their data compression, what requires better inexpensive statistical models. This article proposes automatized optimizations e.g. of Markov-like models, especially context binning and model clustering. While it is popular to just remove low bits of the context, proposed context binning automatically optimizes such reduction as tabled: state=bin[context] determining probability distribution, this way extracting nearly all useful information also from very large contexts, into a relatively small number of states. The second proposed approach: model clustering uses k-means clustering in space of general statistical models, allowing to optimize a few models (as cluster centroids) to be chosen e.g. separately for each read. There are also briefly discussed some adaptivity techniques to include data non-stationarity.

preprint2022arXiv

Encoding of probability distributions for Asymmetric Numeral Systems

Many data compressors regularly encode probability distributions for entropy coding - requiring minimal description length type of optimizations. Canonical prefix/Huffman coding usually just writes lengths of bit sequences, this way approximating probabilities with powers-of-2. Operating on more accurate probabilities usually allows for better compression ratios, and is possible e.g. using arithmetic coding and Asymmetric Numeral Systems family. Especially the multiplication-free tabled variant of the latter (tANS) builds automaton often replacing Huffman coding due to better compression at similar computational cost - e.g. in popular Facebook Zstandard and Apple LZFSE compressors. There is discussed encoding of probability distributions for such applications, especially using Pyramid Vector Quantizer(PVQ)-based approach with deformation, bucket approximation, prefix trees, improving accuracy with additional bits, also tuned symbol spread for tANS.

preprint2022arXiv

Fast optimization of common basis for matrix set through Common Singular Value Decomposition

SVD (singular value decomposition) is one of the basic tools of machine learning, allowing to optimize basis for a given matrix. However, sometimes we have a set of matrices $\{A_k\}_k$ instead, and would like to optimize a single common basis for them: find orthogonal matrices $U$, $V$, such that $\{U^T A_k V\}$ set of matrices is somehow simpler. For example DCT-II is orthonormal basis of functions commonly used in image/video compression - as discussed here, this kind of basis can be quickly automatically optimized for a given dataset. While also discussed gradient descent optimization might be computationally costly, there is proposed CSVD (common SVD): fast general approach based on SVD. Specifically, we choose $U$ as built of eigenvectors of $\sum_i (w_k)^q (A_k A_k^T)^p$ and $V$ of $\sum_k (w_k)^q (A_k^T A_k)^p$, where $w_k$ are their weights, $p,q>0$ are some chosen powers e.g. 1/2, optionally with normalization e.g. $A \to A - rc^T$ where $r_i=\sum_j A_{ij}, c_j =\sum_i A_{ij}$.

preprint2022arXiv

Predicting probability distributions for cancer therapy drug selection optimization

Large variability between cell lines brings a difficult optimization problem of drug selection for cancer therapy. Standard approaches use prediction of value for this purpose, corresponding e.g. to expected value of their distribution. This article shows superiority of working on, predicting the entire probability distributions - proposing basic tools for this purpose. We are mostly interested in the best drug in their batch to be tested - proper optimization of their selection for extreme statistics requires knowledge of the entire probability distributions, which for distributions of drug properties among cell lines often turn out binomial, e.g. depending on corresponding gene. Hence for basic prediction mechanism there is proposed mixture of two Gaussians, trying to predict its weight based on additional information.

preprint2021arXiv

Improving distribution and flexible quantization for DCT coefficients

While it is a common knowledge that AC coefficients of Fourier-related transforms, like DCT-II of JPEG image compression, are from Laplace distribution, there was tested more general EPD (exponential power distribution) $ρ\sim \exp(-(|x-μ|/σ)^κ)$ family, leading to maximum likelihood estimated (MLE) $κ\approx 0.5$ instead of Laplace distribution $κ=1$ - such replacement gives $\approx 0.1$ bits/value mean savings (per pixel for grayscale, up to $3\times$ for RGB). There is also discussed predicting distributions (as $μ, σ, κ$ parameters) for DCT coefficients from already decoded coefficients in the current and neighboring DCT blocks. Predicting values $(μ)$ from neighboring blocks allows to reduce blocking artifacts, also improve compression ratio - for which prediction of uncertainty/width $σ$ alone provides much larger $\approx 0.5$ bits/value mean savings opportunity (often neglected). Especially for such continuous distributions, there is also discussed quantization approach through optimized continuous \emph{quantization density function} $q$, which inverse CDF (cumulative distribution function) $Q$ on regular lattice $\{Q^{-1}((i-1/2)/N):i=1\ldots N\}$ gives quantization nodes - allowing for flexible inexpensive choice of optimized (non-uniform) quantization - of varying size $N$, with rate-distortion control. Optimizing $q$ for distortion alone leads to significant improvement, however, at cost of increased entropy due to more uniform distribution. Optimizing both turns out leading to nearly uniform quantization here, with automatized tail handling.

preprint2020arXiv

Adaptive exponential power distribution with moving estimator for nonstationary time series

While standard estimation assumes that all datapoints are from probability distribution of the same fixed parameters $θ$, we will focus on maximum likelihood (ML) adaptive estimation for nonstationary time series: separately estimating parameters $θ_T$ for each time $T$ based on the earlier values $(x_t)_{t<T}$ using (exponential) moving ML estimator $θ_T=\arg\max_θl_T$ for $l_T=\sum_{t<T} η^{T-t} \ln(ρ_θ(x_t))$ and some $η\in(0,1]$. Computational cost of such moving estimator is generally much higher as we need to optimize log-likelihood multiple times, however, in many cases it can be made inexpensive thanks to dependencies. We focus on such example: $ρ(x)\propto \exp(-|(x-μ)/σ|^κ/κ)$ exponential power distribution (EPD) family, which covers wide range of tail behavior like Gaussian ($κ=2$) or Laplace ($κ=1$) distribution. It is also convenient for such adaptive estimation of scale parameter $σ$ as its standard ML estimation is $σ^κ$ being average $\|x-μ\|^κ$. By just replacing average with exponential moving average: $(σ_{T+1})^κ=η(σ_T)^κ+(1-η)|x_T-μ|^κ$ we can inexpensively make it adaptive. It is tested on daily log-return series for DJIA companies, leading to essentially better log-likelihoods than standard (static) estimation, with optimal $κ$ tails types varying between companies. Presented general alternative estimation philosophy provides tools which might be useful for building better models for analysis of nonstationary time-series.

preprint2020arXiv

Exploiting context dependence for image compression with upsampling

Image compression with upsampling encodes information to succeedingly increase image resolution, for example by encoding differences in FUIF and JPEG XL. It is useful for progressive decoding, also often can improve compression ratio - both for lossless compression and e.g. DC coefficients of lossy. However, the currently used solutions rather do not exploit context dependence for encoding of such upscaling information. This article discusses simple inexpensive general techniques for this purpose, which allowed to save on average $0.645$ bits/difference (between $0.138$ and $1.489$) for the last upscaling for 48 standard $512\times 512$ grayscale 8 bit images - compared to assumption of fixed Laplace distribution. Using least squares linear regression of context to predict center of Laplace distribution gave on average $0.393$ bits/difference savings. The remaining savings were obtained by additionally predicting width of this Laplace distribution, also using just the least squares linear regression. For RGB images, optimization of color transform alone gave mean $\approx 4.6\%$ size reduction comparing to standard YCrCb if using fixed transform, $\approx 6.3\%$ if optimizing transform individually for each image. Then further mean $\approx 10\%$ reduction was obtained if predicting Laplace parameters based on context. The presented simple inexpensive general methodology can be also used for different types of data like DCT coefficients in lossy image compression.

preprint2016arXiv

Distortion-Resistant Hashing for rapid search of similar DNA subsequence

One of the basic tasks in bioinformatics is localizing a short subsequence $S$, read while sequencing, in a long reference sequence $R$, like the human geneome. A natural rapid approach would be finding a hash value for $S$ and compare it with a prepared database of hash values for each of length $|S|$ subsequences of $R$. The problem with such approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired if having a hash function designed to tolerate some small distortion accordingly to an alignment metric (like Needleman-Wunch): designed to make that two similar sequences should most likely give the same hash value. This paper discusses construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on the rate distortion theory: in a nearly uniform subset of length $|S|$ sequences, the hash value represents the closest sequence to $S$. This gives some control of the distance of collisions: sequences having the same hash value.

preprint2016arXiv

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers

Nanopore sequencers are emerging as promising new platforms for high-throughput sequencing. As with other technologies, sequencer errors pose a major challenge for their effective use. In this paper, we present a novel information theoretic analysis of the impact of insertion-deletion (indel) errors in nanopore sequencers. In particular, we consider the following problems: (i) for given indel error characteristics and rate, what is the probability of accurate reconstruction as a function of sequence length; (ii) what is the number of `typical' sequences within the distortion bound induced by indel errors; (iii) using replicated extrusion (the process of passing a DNA strand through the nanopore), what is the number of replicas needed to reduce the distortion bound so that only one typical sequence exists within the distortion bound. Our results provide a number of important insights: (i) the maximum length of a sequence that can be accurately reconstructed in the presence of indel and substitution errors is relatively small; (ii) the number of typical sequences within the distortion bound is large; and (iii) replicated extrusion is an effective technique for unique reconstruction. In particular, we show that the number of replicas is a slow function (logarithmic) of sequence length -- implying that through replicated extrusion, we can sequence large reads using nanopore sequencers. Our model considers indel and substitution errors separately. In this sense, it can be viewed as providing (tight) bounds on reconstruction lengths and repetitions for accurate reconstruction when the two error modes are considered in a single model.

preprint2016arXiv

Nonuniform probability modulation for reducing energy consumption of remote sensors

One of the main goals of 5G wireless telecommunication technology is improving energy efficiency, especially of remote sensors which should be able for example to transmit on average 1bit/s for 10 years from a single AAA battery. There will be discussed using modulation with nonuniform probability distribution of symbols for improving energy efficiency of transmission at cost of reduced throughput. While the zero-signal (silence) has zero energy cost to emit, it can carry information if used alongside other symbols. If used more frequently than others, for example for majority of time slots or OFDM subcarriers, the number of bits transmitted per energy unit can be significantly increased. For example for hexagonal modulation and zero noise, this amount of bits per energy unit can be doubled by reducing throughput 2.7 times, thanks to using the zero-signal with probability $\approx$ 0.84. There will be discussed models and methods for such nonuniform probability modulations (NPM).

preprint2016arXiv

Practical estimation of rotation distance and induced partial order for binary trees

Tree rotations (left and right) are basic local deformations allowing to transform between two unlabeled binary trees of the same size. Hence, there is a natural problem of practically finding such transformation path with low number of rotations, the optimal minimal number is called the rotation distance. Such distance could be used for instance to quantify similarity between two trees for various machine learning problems, for example to compare hierarchical clusterings or arbitrarily chosen spanning trees of two graphs, like in SMILES notation popular for describing chemical molecules. There will be presented inexpensive practical greedy algorithm for finding a short rotation path, optimality of which has still to be determined. It uses introduced partial order for binary trees of the same size: $t_1 \leq t_2$ iff $t_2$ can be obtained from $t_1$ by a sequence of only right rotations. Intuitively, the shortest rotation path should go through the least upper bound or the greatest lower bound for this partial order. The algorithm finds a path through candidates for both points in representation of binary tree as stack graph: describing evolution of content of stack while processing a formula described by a given binary tree. The article is accompanied with Mathematica implementation of all used procedures (Appendix).

preprint2015arXiv

Designing dedicated data compression for physics experiments within FPGA already used for data acquisition

Physics experiments produce enormous amount of raw data, counted in petabytes per day. Hence, there is large effort to reduce this amount, mainly by using some filters. The situation can be improved by additionally applying some data compression techniques: removing redundancy and optimally encoding the actual information. Preferably, both filtering and data compression should fit in FPGA already used for data acquisition - reducing requirements of both data storage and networking architecture. We will briefly explain and discuss some basic techniques, for a better focus applied to design a dedicated data compression system basing on a sample data from a prototype of a tracking detector: 10000 events for 48 channels. We will focus on the time data here, which after neglecting the headers and applying data filtering, requires on average 1170 bits/event using the current coding. Encoding relative times (differences) and grouping data by channels, reduces this number to 798 bits/channel, still using fixed length coding: a fixed number of bits used for a given value. Using variable length Huffman coding to encode numbers of digital pulses for a channel and the most significant bits of values (simple binning) reduces further this number to 552 bits/event. Using adaptive binning: denser for frequent values, and an accurate entropy coder we get further down to 455 bits/event - this option can easily fit unused resources of FPGA currently used for data acquisition. Finally, using separate probability distributions for different channels, what could be done by a software compressor, leads to 437bits/event, what is 2.67 times less than the original 1170 bits/event.

preprint2015arXiv

Joint error correction enhancement of the fountain codes concept

Fountain codes like LT or Raptor codes, also known as rateless erasure codes, allow to encode a message as some number of packets, such that any large enough subset of these packets is sufficient to fully reconstruct the message. It requires undamaged packets, while the packets which were not lost are usually damaged in real scenarios. Hence, an additional error correction layer is often required: adding some level of redundancy to each packet to be able to repair eventual damages. This approach requires a priori knowledge of the final damage level of every packet - insufficient redundancy leads to packet loss, overprotection means suboptimal channel rate. However, the sender may have inaccurate or even no a priori information about the final damage levels, for example in applications like broadcasting, degradation of a storage medium or damage of picture watermarking. Joint Reconstruction Codes (JRC) setting is introduced and discussed in this paper for the purpose of removing the need of a priori knowledge of damage level and sub-optimality caused by overprotection and discarding underprotected packets. It is obtained by combining both processes: reconstruction from multiple packets and forward error correction. The decoder combines the resultant informational content of all received packets accordingly to their actual noise level, which can be estimated a posteriori individually for each packet. Assuming binary symmetric channel (BSC) of $ε$ bit-flip probability, every potentially damaged bit carries $R_0(ε)=1-h_1(ε)$ bits of information, where $h_1$ is the Shannon entropy. The minimal requirement to fully reconstruct the message is that the sum of rate $R_0(ε)$ over all bits is at least the size of the message. We will discuss sequential decoding for the reconstruction purpose, which statistical behavior can be estimated using Renyi entropy.

preprint2015arXiv

Normalized rotation shape descriptors and lossy compression of molecular shape

There is a common need to search of molecular databases for compounds resembling some shape, what suggests having similar biological activity while searching for new drugs. The large size of the databases requires fast methods for such initial screening, for example based on feature vectors constructed to fulfill the requirement that similar molecules should correspond to close vectors. Ultrafast Shape Recognition (USR) is a popular approach of this type. It uses vectors of 12 real number as 3 first moments of distances from 4 emphasized points. These coordinates might contain unnecessary correlations and does not allow to reconstruct the approximated shape. In contrast, spherical harmonic (SH) decomposition uses orthogonal coordinates, suggesting their independence and so lager informational content of the feature vector. There is usually considered rotationally invariant SH descriptors, what means discarding of some essential information. This article discusses framework for descriptors with normalized rotation, for example by using principal component analysis (PCA-SH). As one of the most interesting are ligands which have to slide into a protein, we will introduce descriptors optimized for such flat elongated shapes. Bent deformed cylinder (BDC) describes the molecule as a cylinder which was first bent, then deformed such that its cross-sections became ellipses of evolving shape. Legendre polynomials are used to describe the central axis of such bent cylinder. Additional polynomials are used to define evolution of such elliptic cross-section along the main axis. There will be also discussed bent cylindrical harmonics (BCH), which uses cross-sections described by cylindrical harmonics instead of ellipses. All these normalized rotation descriptors allow to reconstruct (decode) the approximated representation of the shape, hence can be also used for lossy compression purposes.

preprint2014arXiv

Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding

The modern data compression is mainly based on two approaches to entropy coding: Huffman (HC) and arithmetic/range coding (AC). The former is much faster, but approximates probabilities with powers of 2, usually leading to relatively low compression rates. The latter uses nearly exact probabilities - easily approaching theoretical compression rate limit (Shannon entropy), but at cost of much larger computational cost. Asymmetric numeral systems (ANS) is a new approach to accurate entropy coding, which allows to end this trade-off between speed and rate: the recent implementation [1] provides about $50\%$ faster decoding than HC for 256 size alphabet, with compression rate similar to provided by AC. This advantage is due to being simpler than AC: using single natural number as the state, instead of two to represent a range. Beside simplifying renormalization, it allows to put the entire behavior for given probability distribution into a relatively small table: defining entropy coding automaton. The memory cost of such table for 256 size alphabet is a few kilobytes. There is a large freedom while choosing a specific table - using pseudorandom number generator initialized with cryptographic key for this purpose allows to simultaneously encrypt the data. This article also introduces and discusses many other variants of this new entropy coding approach, which can provide direct alternatives for standard AC, for large alphabet range coding, or for approximated quasi arithmetic coding.

preprint2012arXiv

Embedding grayscale halftone pictures in QR Codes using Correction Trees

Barcodes like QR Codes have made that encoded messages have entered our everyday life, what suggests to attach them a second layer of information: directly available to human receiver for informational or marketing purposes. We will discuss a general problem of using codes with chosen statistical constrains, for example reproducing given grayscale picture using halftone technique. If both sender and receiver know these constrains, the optimal capacity can be easily approached by entropy coder. The problem is that this time only the sender knows them - we will refer to these scenarios as constrained coding. Kuznetsov and Tsybakov problem in which only the sender knows which bits are fixed can be seen as a special case, surprisingly approaching the same capacity as if both sides would know the constrains. We will analyze Correction Trees to approach analogous capacity in the general case - use weaker: statistical constrains, what allows to apply them to all bits. Finding satisfying coding is similar to finding the proper correction in error correction problem, but instead of single ensured possibility, there are now statistically expected some. While in standard steganography we hide information in the least important bits, this time we create codes resembling given picture - hide information in the freedom of realizing grayness by black and white pixels using halftone technique. We will also discuss combining with error correction and application to rate distortion problem.

preprint2012arXiv

From Maximal Entropy Random Walk to quantum thermodynamics

Surprisingly the looking natural random walk leading to Brownian motion occurs to be often biased in a very subtle way: usually refers to only approximate fulfillment of thermodynamical principles like maximizing uncertainty. Recently, a new philosophy of stochastic modeling was introduced, which by being mathematically similar to euclidean path integrals, finally fulfills these principles exactly. Their local behavior is usually similar, but may lead to completely different global properties. In contrast to Brownian motion leading to nearly uniform stationary density, this recent approach turns out in agreement with having strong localization properties thermodynamical predictions of quantum mechanics, like thermalizing to dynamical equilibrium state having probability density as the quantum ground state: squares of coordinates of the lowest energy eigenvector of the Bose-Hubbard Hamiltonian for single particle in discrete case, or of the standard Schrodinger operator while including potential and making infinitesimal limit. It also provides a natural intuition of the amplitudes' squares relating to probabilities. The present paper gathers, formalizes and extends these results. There are also introduced and discussed some new generalizations, like considering multiple particles with thermodynamical analogue of Pauli exclusion principle or time dependent cases, which allowed to introduce thermodynamical analogues of momentum operator, Ehrenfest equation and Heisenberg uncertainty principle.

preprint2012arXiv

Optimal compression of hash-origin prefix trees

There is a common problem of operating on hash values of elements of some database. In this paper there will be analyzed informational content of such general task and how to practically approach such found lower boundaries. Minimal prefix tree which distinguish elements turns out to require asymptotically only about 2.77544 bits per element, while standard approaches use a few times more. While being certain of working inside the database, the cost of distinguishability can be reduced further to about 2.33275 bits per elements. Increasing minimal depth of nodes to reduce probability of false positives leads to simple relation with average depth of such random tree, which is asymptotically larger by about 1.33275 bits than lg(n) of the perfect binary tree. This asymptotic case can be also seen as a way to optimally encode n large unordered numbers - saving lg(n!) bits of information about their ordering, which can be the major part of contained information. This ability itself allows to reduce memory requirements even to about 0.693 of required in Bloom filter for the same false positive probability.

Jarek Duda

What is connected

Connect this record

See the researcher in context

Building this map preview

19 published item(s)

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

Context binning, model clustering and adaptivity for data compression of genetic data

Encoding of probability distributions for Asymmetric Numeral Systems

Fast optimization of common basis for matrix set through Common Singular Value Decomposition

Predicting probability distributions for cancer therapy drug selection optimization

Improving distribution and flexible quantization for DCT coefficients

Adaptive exponential power distribution with moving estimator for nonstationary time series

Exploiting context dependence for image compression with upsampling

Distortion-Resistant Hashing for rapid search of similar DNA subsequence

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers

Nonuniform probability modulation for reducing energy consumption of remote sensors

Practical estimation of rotation distance and induced partial order for binary trees

Designing dedicated data compression for physics experiments within FPGA already used for data acquisition

Joint error correction enhancement of the fountain codes concept

Normalized rotation shape descriptors and lossy compression of molecular shape

Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding

Embedding grayscale halftone pictures in QR Codes using Correction Trees

From Maximal Entropy Random Walk to quantum thermodynamics

Optimal compression of hash-origin prefix trees