Empirical Lossless Compression Bound of a Data Sequence

Li, Lei M

Computer Science > Information Theory

Authors: Lei M Li

(Submitted on 2 Nov 2023 (v1), last revised 22 Jan 2024 (this version, v2))

Abstract: We consider the lossless compression bound of any individual data sequence. If we fit the data by a parametric model, the entropy quantity $nH({\hat \theta}_n)$ obtained by plugging in the maximum likelihood estimate underestimates the bound, where $n$ is the number of words. Shtarkov showed that the normalized maximum likelihood (NML) distribution, or equivalently its code length, is optimal in a minimax sense for any parametric family. Using local asymptotic normality, we show that the NML code length for exponential families is $nH(\hat \theta_n) +\frac{d}{2}\log \, \frac{n}{2\pi} +\log \int_{\Theta} |I(\theta)|^{1/2}\, d\theta+o(1)$, where $d$ is the model dimension or dictionary size, and $|I(\theta)|$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $nH({\hat \theta}_n) +\frac{d}{2}\log \, \frac{n}{2\pi} +\log \frac{|\, I({\hat \theta}_n)|^{1/2}}{w({\hat \theta}_n)}+o(1)$, where $w(\theta)$ is a prior. The asymptotics apply not only to discrete symbols but also to continuous data, provided the code length for the former is replaced by the description length for the latter. The analytical result is illustrated by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in phase with the amino acid codons. By contrast, the compression rates of pseudo-random sequences are larger than 1 regardless of the parsing model. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is notably more accurate when the dictionary size is relatively large.
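To make the two code-length formulas concrete, here is a minimal Python sketch for the simplest case, a multinomial (i.i.d. word) model. It is an illustration under stated assumptions, not the paper's own implementation. It assumes a dictionary of $k$ words with $d = k-1$ free parameters, uses the standard multinomial identity $\int_{\Theta} |I(\theta)|^{1/2}\, d\theta = \pi^{k/2}/\Gamma(k/2)$, and takes the Jeffreys (Dirichlet(1/2)) prior as the mixture prior $w(\theta)$. The function names `parse_words`, `plugin_entropy_bits`, `nml_bound_bits` and `mixture_code_bits` are hypothetical. All lengths are reported in bits, and the $o(1)$ term is dropped.

```python
import math
from collections import Counter


def parse_words(seq, word_len, phase=0):
    """Split seq into non-overlapping words of word_len, starting at offset phase."""
    stop = len(seq) - word_len + 1
    return [seq[i:i + word_len] for i in range(phase, stop, word_len)]


def plugin_entropy_bits(counts):
    """n * H(theta_hat): entropy with the maximum likelihood frequencies plugged in."""
    n = sum(counts.values())
    return -sum(c * math.log2(c / n) for c in counts.values())


def nml_bound_bits(counts, dict_size):
    """Asymptotic NML code length for a multinomial model (o(1) term dropped).

    Assumes d = dict_size - 1 and log of the Fisher-information integral equal to
    (k/2) * log(pi) - log Gamma(k/2), which holds for the multinomial family.
    """
    n = sum(counts.values())
    d = dict_size - 1
    log_fisher_integral = (dict_size / 2) * math.log(math.pi) - math.lgamma(dict_size / 2)
    penalty_nats = (d / 2) * math.log(n / (2 * math.pi)) + log_fisher_integral
    return plugin_entropy_bits(counts) + penalty_nats / math.log(2)


def mixture_code_bits(counts, dict_size, alpha=0.5):
    """Exact code length of the Dirichlet(alpha) mixture code (alpha = 1/2 is Jeffreys).

    This is the length obtained by sequentially predicting each next word with
    probability (count_so_far + alpha) / (words_so_far + dict_size * alpha).
    Words with zero count contribute a factor of 1 and can be omitted.
    """
    n = sum(counts.values())
    log_prob = math.lgamma(dict_size * alpha) - math.lgamma(n + dict_size * alpha)
    log_prob += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts.values())
    return -log_prob / math.log(2)


if __name__ == "__main__":
    # Toy sequence for illustration only; it is not data from the paper.
    seq = "ATGGCCAAAGCCATGAAAGCCTAA" * 20
    for phase in range(3):
        counts = Counter(parse_words(seq, 3, phase))
        print(phase,
              round(plugin_entropy_bits(counts), 1),
              round(nml_bound_bits(counts, 64), 1),
              round(mixture_code_bits(counts, 64), 1))
```

Dividing any of these lengths by the raw size of the sequence (2 bits per nucleotide is one natural choice for DNA, assumed here rather than taken from the paper) gives a compression rate that can be compared across parsing phases and word lengths. The asymptotic formula is only meaningful when the number of words is large relative to the dictionary size.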

Comments:	3 tables
Subjects:	Information Theory (cs.IT); Statistics Theory (math.ST)
MSC classes:	68P30, 94A15, 62B10
ACM classes:	G.0
Cite as:	arXiv:2311.01431 [cs.IT]
	(or arXiv:2311.01431v2 [cs.IT] for this version)

Submission history

From: Lei M Li
[v1] Thu, 2 Nov 2023 17:45:57 GMT (25kb)
[v2] Mon, 22 Jan 2024 17:21:02 GMT (25kb)
