Final Thesis within the Framework of the Reifeprüfung
Text Compression with Prediction Models
Ben Kinigadner
8D, 2024/25
Bundesgymnasium und Bundesrealgymnasium Wien 4
Wiedner Gymnasium / Sir Karl Popper Schule
A-1040 Wien, Wiedner Gürtel 68
Supervising teacher: Alexander Hoefler
Submitted on 14 February 2025
Abstract
This paper presents a novel approach to lossy text compression that
leverages character correlations and statistical probabilities. The
approach introduces a lightweight algorithm that compresses English text
by encoding pairs of characters into 4-bit representations and embedding
them into UTF-8 encoding. This method achieves a compression ratio of up
to 65.7% while maintaining 93.62% accuracy in character reconstruction.
The algorithm employs a state machine for encoding and decoding,
utilizing surrounding character context to predict the most probable
character during decompression. A multitude of optimization techniques,
including character pair combinations, activation functions, and
relative probability weighting, have been employed to enhance accuracy
and performance. These techniques will be explored in this paper.
This study demonstrates the feasibility of lossy text compression using
minimal computational resources, with performance comparable to
established lossless compression algorithms like ZIP. Potential
applications include instant messaging, data mining for large language
models, and search engine indexing.
This work serves as a proof of concept, opening avenues for further
research into more advanced probability interpretation methods, n-gram
analysis, and potential integration with natural language processing
techniques to further improve compression efficiency and accuracy.
Introduction
Recent advancements in machine learning, especially in large language
models (LLMs), have enabled computers to create coherent paragraphs of
text [cf. @cho2019coherentcohesivelongformtext 1-8]. They can start with
just a few characters and, leveraging predictive models, generate
complete English sentences [cf. @ContextDependentRecurrent2012 1-5].
This ability can be utilized to compress data by removing unnecessary
information, information that is above the Shannon entropy and is
theoretically redundant [cf. @shannon1948mathematical 50-61], which the
model can later fill in again [cf. @bellardLosslessDataCompression 1-9].
However, the current models, which are based on the transformer
architecture, require significant computational resources due to their
reliance on complex matrix operations [cf. @vaswani2023attentionneed
3-8].
There are already numerous methods of compressing text, but there is
always a trade-off between the computational costs of compression and
decompression and the resulting size [cf. @alakuijala2015comparison 1-6].
Until now, compression algorithms that specialize in text prediction
using generative models have been rather unattractive for lightweight
workloads, such as web traffic and simple archiving: compressing the
text requires a lot of computing power on the server side, and an entire
machine learning model has to be deployed, which can cause problems on
weaker end devices. This raises the question of how it is potentially
possible to predict the next letters with higher speed and sufficient
accuracy with less compute. While transformer-based algorithms can
effectively compress data, their high computational demands limit their
practical applications [cf. @LargeTextCompression]. To leverage these
predictive capabilities, it is crucial to reduce their computational
costs and increase their speed.
Achieving this objective necessitates a shift from substantial matrix
operations to the simplification of tasks into binary classification
problems. The objective of this paper is to demonstrate a proof of
concept that smaller, more efficient models can attain high accuracy
without the necessity of a tokenizer and by relying exclusively on
statistical methodologies. The aim is to demonstrate that substantial
speed improvements and high accuracy are attainable even in the initial
stages, rendering these models suitable for data compression and
decompression tasks and making this type of compression potentially
suitable for general use.
Background
In this section, underlying standards and algorithms are explained.
Compression Overview
In order to comprehend the functionality of compression algorithms, it
is necessary to categorize them into two distinct variants. This
classification facilitates a comprehensive understanding of their
applications and potential uses.
In most cases, it is important that decompression restores all data that
was present before compression. This is particularly important for
applications where it is not possible to guess what the data to be
compressed will be used for. Losing precision in already complicated
compiled programs or libraries could mean that they cannot be executed
at all. The same applies to proprietary file formats where
it is not fully known which bytes serve which purpose. This is why such
compression algorithms are particularly important in universal data
transfer protocols such as SFTP [cf. @RFC468FTP 1-7]. These compression
algorithms are often less specific and the compressed files stay
relatively large.
On the other hand, we as humans live in environments with high
information density. We have developed effective ways of filtering out
much of this data and in turn make sufficient predictions about the
dropped information [cf. @marzenEvolutionLossyCompression2017 1-8]. This
can even go so far that humans are able to see images with only a few
data points and recognize what they must represent [cf. @BF02833890
1-5]. Compression algorithms that make use of this ability and do not
preserve all the data, but enough to semantically reconstruct the data
to its original state, fall into the category of lossy compression
algorithms. The old file is therefore no longer fully reconstructable
for the computer, but should semantically mean the same thing to the
person viewing the file. One of the best-known representatives of these
lossy compression algorithms is JPEG.
Since this paper’s algorithm cannot provide 100% correctness in the
majority of cases, but the output may sometimes require human
interpretation or correction to fully match the original semantic
intent, this algorithm falls into this category. In essence, this means
that the information discarded by the algorithm may lie below the
entropy threshold, rendering it logically impossible to reconstruct.
Consequently, the function representing the compression algorithm is not
injective.
[cf. @salomon2002data 2-14]
UTF-8
| Length | Binary Representation |
|---|---|
| 8-bit | 0xxxxxxx |
| 16-bit | 110xxxxx 10xxxxxx |
| 24-bit | 1110xxxx 10xxxxxx 10xxxxxx |
| 32-bit | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
In order for an algorithm to make a valid prediction, it must know what
data it is dealing with and what kind of prediction it is allowed to
make. Although all compressed characters are completely in the ASCII
[cf. @ASCIIFormatNetwork1969 2-6] area, to allow smooth encoding and not
too many problems with unknown characters, my system is embedded in
UTF-8.
UTF-8, or Unicode Transformation Format – 8-bit, is a universal
standard for encoding almost all the world’s languages and additional
characters [cf. @yergeauUTF8TransformationFormat2003 2-11]. It is based
on ASCII and is therefore fully interoperable. It was created because
the ASCII set is very limited: it only contains American characters and
special characters such as ‘!’, ‘?’ and ‘*’. As each character occupies
8 bits, it could hold at most 256 different characters. Additionally,
since the first bit was originally reserved as a parity bit, ASCII
effectively contains only 128 characters, far too few to represent all
languages and special characters, let alone emojis.
UTF-8, on the other hand, uses variable lengths, from only 8 bits to
encode ASCII characters up to 32 bits to represent even the most fringe
characters. The leading bits of the first byte indicate how many bytes
the character occupies. For ASCII characters, the first bit, which was
usually a 0, has been reassigned so that it now indicates that the
current character is only 8 bits long. For longer characters, the number
of 1s before the first 0 in the lead byte indicates how many bytes are
needed to fully decode the current character. Table 1 visualizes this.
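This length rule can be sketched as a small Rust function. It is an
illustration only, not part of the compressor itself, and the function
name is my own:

```rust
// Determine how many bytes a UTF-8 character occupies,
// based solely on its first (lead) byte.
fn utf8_char_len(lead: u8) -> Option<usize> {
    if lead & 0b1000_0000 == 0 {
        Some(1) // 0xxxxxxx: plain ASCII, 8 bits
    } else if lead & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 110xxxxx: 16-bit character
    } else if lead & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 1110xxxx: 24-bit character
    } else if lead & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 11110xxx: 32-bit character
    } else {
        None // 10xxxxxx continuation bytes never start a character
    }
}

fn main() {
    // 'A' is ASCII, 'ü' needs two bytes, '€' three, '😀' four.
    for (ch, expected) in [('A', 1), ('ü', 2), ('€', 3), ('😀', 4)] {
        let lead = ch.to_string().as_bytes()[0];
        assert_eq!(utf8_char_len(lead), Some(expected));
        println!("{ch:?} -> {expected} byte(s)");
    }
}
```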
Related Work
Text compression algorithms can also be further divided into two
different groups, generative and semantic. These two terms are used here
as defined in Witten: Semantic and Generative Models for Lossy Text
Compression [cf. @wittenSemanticGenerativeModels1994 1-5].
Since generative text compression algorithms are leading in current
benchmarks, I mainly drew inspiration from this to build mine. The basis
of the code is therefore based on methods similar to those used in
machine learning. Since this approach tries to be faster than current
models such as NNCP, the model presented in this paper does not perform
just-in-time training on the input; while this might increase accuracy,
it would in turn cost a lot of computational power. Instead, a
pretrained model trained on generic English text is used.
Character Entropy
The foundation on which my text prediction algorithm is based is:
Shannon: Prediction and Entropy of Printed English [cf.
@shannonPredictionEntropyPrinted 50-61]. Shannon formulated the thesis
that the order of characters following each other is not random,
but that some are more probable than others. The probabilities are even
more precisely defined than the absolute probability of one letter
following the next. So if you have a chain of characters $p$, it may be
that ‘b’ is more likely to come next than ’e’, although without context
’e’ is much more likely since it is more often used in the English
language [cf. @FoundationsComputerSecurity 2-6].
It is also suggested that more context helps to interpret the next
letter with a higher probability. However, this always assumes that one
character is identified at a time and that fewer individual gaps are to
be filled, because the experiments in that paper were conducted
primarily on humans. It was assumed that someone who has mastered the
language has a good idea of which letters occur given the surrounding
ones.
Generative Text Compression
Generative text compression programs use the text prediction and
learning capabilities of neural networks. When compressing, they train
the model on the text file that is to be compressed and store the text
and its internal relations in floating-point values, the model’s
weights. The size of the compressed text is then almost only the size of
the trained weights and is therefore much smaller compared to other
algorithms. When decompressing, it is then possible to read the
compressed text from the trained weights again. As these methods involve
training machine learning models and calculating many matrix
multiplications, these algorithms require a relatively large amount of
compute power and memory.
A concrete approach is that of NNCP, in which a transformer model is
trained, which means that the weights can be trained in parallel [cf.
@bellardLosslessDataCompression 1-9]. This results in a large model,
which can then be used again to reproduce the same text. Due to the
possible hardware acceleration and parallel computing, it is somewhat
faster than other generative text compression algorithms, in exchange
for a lower compression ratio.
In contrast, there is CMIX, which does not rely on one large model but
on up to 2,115 different smaller ones, although not all of them have to
be used for text compression [cf. @ByronKnoll]. When compressing, if
there was no preprocessing, the
models are trained similarly to NNCP, although no hardware acceleration
is specifically used in this instance. The finished models then
represent the majority of the compressed size. When decompressing, the
trained models are then retrieved again to restore the file. Here,
different models are used bit by bit, which have specialized in specific
events during training. As a result, it is even better compressed than
NNCP, but has higher compute costs.
Methodology
As this approach to text compression is novel, there are many fundamental questions to address. This section will explore how such systems should fundamentally work, which characters to encode, and how to gather and clean the necessary data for such applications. By systematically addressing these aspects, I aim to provide a comprehensive framework for developing and refining this text compression approach. This will allow for steady improvements to be systematically evaluated by comparisons against previous iterations.
Encoding Foundation
Prediction models utilize preceding characters to determine the current
one, leveraging existing information to assess the most probable next
character with high precision. However, as the diversity of potential
characters increases, the likelihood of misguessing the correct
character also rises. Moreover, the computational expense escalates with
the need to calculate numerous possibilities for each of these diverse
characters. Given the algorithm’s goals of computational and memory
efficiency, limiting the number of possibilities that require
calculation significantly aids in achieving these aims.
When determining what should be compressible, it is essential to decide
which characters must be overall accepted in the filetype. Ideally, the
encoder should be capable of handling all text files to prevent errors
when encountering characters outside the defined compressible set.
Therefore, the goal is to integrate the compression scheme within a
larger encoding like UTF-8, much like ASCII itself is a subset of UTF-8.
With this approach, characters within the compressible list can be
encoded, while the rest are represented using UTF-8. Thus, the question
arises: which characters should be encodable and how?
Encodable Characters
Since I want the encoded character to take less disk space than UTF-8’s most compact encoding, which is 8 bits, there are multiple options, given that every compressed character uses the same number of bits:

- 6-bit per character: works if the algorithm always takes three bytes at once. This would result in 64 possibilities for every character.
- 4-bit per character: every byte contains two characters. This would result in 16 possibilities for every character.
- 3-bit per character: works again if the algorithm takes three bytes and splits them into eight characters. This would result in eight possibilities for every character.
- 2-bit per character: every byte contains four characters. This would result in four possibilities for every character.
Since the characters intended for inclusion comprise the 26 English base
characters, along with the Space and the Period, this totals to 28
different characters. An additional consideration is the encoded
sequence escape character, which necessitates an entire possible
character slot, as misinterpreting it could have catastrophic
consequences, as will be explained later in chapter 3.1.2.
Consequently, it is important to consider the number of states minus one
for the various options.
Given that this investigation aims to ascertain the feasibility of such
compression, allocating 4-bit per character appears to be the most
suitable option. With 15 different states available, which is nearly
half of 28, each state of the 4-bit character could potentially
accommodate two characters, leaving two other states to be encoded
individually. Considering that the Space and sentence separation
character stand as outliers, I will treat these two as singular
entities, with the base characters grouped in pairs of two and assigned
to individual states. So, with 29 characters (26 letters + Space +
Period + Escape), the 16 states of a 4-bit encoding accommodate 15
usable slots, balancing expressiveness and efficiency.
Which characters are encoded as which bits, and how the exact
combinations were derived, will be explained in chapter 4.3.
Decoding Foundation
The encoded format operates in two distinct states within the UTF-8
stream: a standard UTF-8 state and a special encoding state for
compressed characters. To ensure correct interpretation, a mechanism is
required to explicitly signal to the decoder which state is currently
active. This is crucial because attempting to decode specially encoded
characters with a standard UTF-8 decoder will yield errors, and
conversely, applying the special decoder to standard UTF-8 data will
produce nonsensical output.
To address this state differentiation, a designated state transition
marker, henceforth referred to as the break character, is introduced. The
decoding process commences in the standard UTF-8 state. Upon
encountering the break character for the first time, the decoder
transitions into the special decoding state. Subsequent encounters of
the break character trigger a reversion back to the standard UTF-8
state, effectively toggling between decoding modes.
However, a potential ambiguity arises from the structure of UTF-8
itself. While a single byte beginning with ‘1111’ could be used as a
break character in principle, this prefix also marks the start of 32-bit
UTF-8 encoded characters. To avoid misinterpreting the break character
as the beginning of a 32-bit UTF-8 sequence, a robust disambiguation
strategy is necessary.
Therefore, the state transition into the special decoding mode is
signaled by a sequence of two consecutive break characters, represented
as a single byte ‘11111111’. This double break character sequence is
highly unlikely to occur within standard UTF-8 encoded text and thus
serves as an unambiguous delimiter for initiating the special decoding
state. The single break character, represented as ‘1111’ within a byte,
is reserved for signaling the end of a sequence of specially encoded
characters and the return to the standard UTF-8 state.
For consistency and simplified decoding logic, even when an even number
of compressed characters are encoded consecutively, necessitating a byte
boundary immediately after the last character pair, a second break
character is appended. This ensures that the state transition back to
UTF-8 mode is consistently and explicitly marked, regardless of the
length of the preceding encoded sequence.
As a consequence of introducing these break characters, it is important
to note that while the special encoding achieves a 50% reduction in size
for the encoded characters themselves, the inclusion of break characters
introduces overhead. Therefore, to ensure effective compression, the
encoder is designed to only initiate the special encoding if a minimum
of twelve consecutive compressible characters are encountered. This
threshold mitigates the overhead of break characters and ensures a net
reduction in file size in practical scenarios.
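The framing described above can be sketched as follows. This is a
simplified illustration: the constant names, the helper function, and
the exact handling of an odd leftover nibble are assumptions for
demonstration, not the program’s actual code.

```rust
// Sketch of the break-character framing: a double break byte (0b1111_1111)
// opens an encoded run, and the break nibble (0b1111) closes it.
const DOUBLE_BREAK: u8 = 0b1111_1111;
const BREAK_NIBBLE: u8 = 0b1111;

// Frame a sequence of already-encoded 4-bit values into output bytes.
fn frame_encoded_run(nibbles: &[u8]) -> Vec<u8> {
    let mut out = vec![DOUBLE_BREAK]; // enter the special decoding state
    for pair in nibbles.chunks(2) {
        if pair.len() == 2 {
            out.push((pair[0] << 4) | pair[1]);
        } else {
            // Odd leftover nibble: collapse it with the closing break nibble.
            out.push((pair[0] << 4) | BREAK_NIBBLE);
            return out;
        }
    }
    // Even count: an explicit closing break byte is still appended, so the
    // switch back to UTF-8 mode is always marked.
    out.push((BREAK_NIBBLE << 4) | BREAK_NIBBLE);
    out
}

fn main() {
    let framed = frame_encoded_run(&[0b0000, 0b1001, 0b0101]);
    // 0xFF opens, 0x09 packs the first pair, 0x5F closes with the break nibble.
    assert_eq!(framed, vec![0xFF, 0x09, 0x5F]);
    println!("{framed:02X?}");
}
```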
Data
Following this, a significant amount of data is needed to achieve accurate compression and decompression with this statistical model. Since the algorithm is primarily data-driven, the required data is highly dependent on the algorithm and vice versa. Therefore, more data than needed has been collected, but only relevant snippets will be presented in section 3.3.2.
Which Data to Collect
The overall goal is to use probabilities to convert a half-precision
character to a fully defined one. This will be represented as a function
that takes in the half-precision bit and other environmental variables
to return a satisfying output. The data that needs to be collected
should help the function evaluate the specific half-precision character
in a specific context.
The easiest context to obtain, and which will be used, is the characters
surrounding the half-precision character. Therefore, the data needed
should represent the correlation between the surrounding characters and
the one in question. For example, if there is an "I" right after the
character in question, is it more likely to be an "F" or a "D"?
For a visual depiction, refer to figure 2.
There could be countless ways of grouping different characters together
and using this information to decode the half-precision character. How
these methods compare is a topic for other papers. This paper will use
the probabilities of a specific target character occurring, given the
context characters found at various distances. The collected data also
represents how much a given character influences the neighboring
characters.
Context Window
It might be possible but not feasible, given the collected data, to look
at all the characters before and all the characters after the character
in question. The first argument against this is the sheer amount of data
that would have to be collected. Given the function
$p(\text{char1}, \text{char2}, \text{distance})$, which takes in
character 1, character 2, and a distance, p returns the probability that
when we look at character 1 in an English text, how likely is it to
encounter char2 within the specified distance.
Disregarding the possibility that the text may contain characters
outside the compressible set of 29, the amount of data that has to be
stored grows linearly as 841 (that is, 29 × 29) times the distance.
Diagram 3 also shows how the divergence between the average entropy and
the distance-specific entropy of every character falls off the farther
away it is: the higher the value, the more predictable the 0th character
is from that position. Average entropy here means the overall average
occurrence of a character in a text, so the graphic shows the decay of
predictability with distance. Following this, it seemed most efficient
to provide the function with the four preceding characters and the three
subsequent ones, since for characters further apart the influence of the
middle character is negligible.
$$\begin{aligned}
& y = \frac{1}{29 \cdot 29} \sum^{28}_{a=0} \sum^{28}_{i=0} \left(p(c[a], c[i], x) - e_{a}[c[i]]\right)^2\\
& c = \text{array of decodable characters}\\
& e_{a} = \text{array of average entropy in English}\\
& p = p(\text{char}, \text{char}, \text{int}) \rightarrow \text{64-bit float}
\end{aligned}$$
It calculates the average squared distance between the observed values and the English text entropy for every character at every distance. The graphic plots the average of these distances against the distance at which it was calculated.
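For one fixed distance, this divergence can be computed as in the
following sketch, where the probability table `p` is replaced by a toy
in-memory matrix and the function name is illustrative:

```rust
// Mean squared divergence between distance-specific probabilities and the
// distance-independent average frequencies, per the formula above.
// `p[a][i]` stands for p(c[a], c[i], x) at one fixed distance x,
// `e[i]` for the average frequency of character c[i] in English.
fn divergence_at_distance(p: &[Vec<f64>], e: &[f64]) -> f64 {
    let n = e.len() as f64;
    let sum: f64 = p
        .iter()
        .map(|row| {
            row.iter()
                .zip(e)
                .map(|(pi, ei)| (pi - ei).powi(2))
                .sum::<f64>()
        })
        .sum();
    sum / (n * n) // average over all character pairs
}

fn main() {
    // Toy 2x2 example instead of the full 29x29 table.
    let p = vec![vec![0.5, 0.5], vec![1.0, 0.0]];
    let e = vec![0.5, 0.5];
    let y = divergence_at_distance(&p, &e);
    // Contributions: 0 + 0 + 0.25 + 0.25, averaged over 4 entries.
    assert!((y - 0.125).abs() < 1e-12);
    println!("y = {y}");
}
```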
Data Collection and Preprocessing
Large amounts of English text are needed to extract the frequency of each character following or preceding different characters and then normalize the occurrences. Since language models require the same basic information, it is easy to find such datasets. For this use case, the cleaned-up English Wikipedia was used, which comprises around 20 GB of English text data without significant markup. The data was then saved into three-dimensional HDF5 dataframes and manually verified before further processing. The dimensions of this data frame were: the character that will later be compressed, the character that appears in the first one’s proximity, and the distance at which the appearance occurred.
Processing
The data extraction was done as follows. A program iterates over a given
text, which in this case was the English Wikipedia dump.
When the program finds an acceptable character, it records the distances
to other nearby acceptable characters within the context window and
increments the corresponding counts. When finished, the program writes
this three-dimensional array, with the absolute values, to the
previously described HDF5 file. The absolute values describe how often
the characters are found in the text, but their distribution is very
uneven, and models based on them tend to just pick the on-average more
likely character of a character combination. This is why the data then
gets normalized based on the frequency distribution of neighboring
characters found for each base character at each specific distance.
Multiple methods of normalization were tested, but the following formula
resulted in the best outcome.
$x = \frac{x - \mu}{\sigma}$
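The counting and normalization steps can be sketched as follows. The
function names and the in-memory `HashMap` are illustrative assumptions;
the actual program writes the counts into the HDF5 array described
above.

```rust
use std::collections::HashMap;

// Count, for each (base, neighbor, distance) triple, how often `neighbor`
// occurs at that signed distance from `base`. The window -4..=3 mirrors
// the context window chosen in the text.
fn count_pairs(text: &str) -> HashMap<(char, char, i32), u64> {
    let chars: Vec<char> = text.chars().collect();
    let mut counts = HashMap::new();
    for (i, &base) in chars.iter().enumerate() {
        for d in -4i32..=3 {
            if d == 0 {
                continue;
            }
            let j = i as i32 + d;
            if j >= 0 && (j as usize) < chars.len() {
                *counts.entry((base, chars[j as usize], d)).or_insert(0) += 1;
            }
        }
    }
    counts
}

// Z-score normalization per the formula above: x' = (x - mean) / std_dev.
fn zscore(values: &[f64]) -> Vec<f64> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let sd = var.sqrt();
    values
        .iter()
        .map(|v| if sd == 0.0 { 0.0 } else { (v - mean) / sd })
        .collect()
}

fn main() {
    let counts = count_pairs("to be");
    // 'e' occurs once at distance +1 from 'b'.
    assert_eq!(counts[&('b', 'e', 1)], 1);
    let normalized = zscore(&[2.0, 4.0, 4.0, 6.0]);
    // Mean 4, standard deviation sqrt(2).
    assert!((normalized[0] + 2.0_f64.sqrt()).abs() < 1e-12);
    println!("{normalized:?}");
}
```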
The function $p(\text{char}, \text{char}, \text{int})$ described in figure 2 is now a three-dimensional index into this HDF5 matrix, and the output is the normalized data, saved as 64-bit floating points, that was collected during this process.
Demonstration
This section is a short demonstration of how the collected data looks.
The character b is used as an example. Graphics 4, 5, 6 and 7 show, if a
"b" is encountered in an English text, how likely it is to find a
specific character when going backwards 1-4 steps.
Validation
To test and validate the performance and correctness of the program at every stage of development, I built a small testing suite. The program is tested on a text file, encoding and decoding it. After the process is completed, the program stops the timer and checks for differences between the converted file and the original. This comparison is made easier by the fact that, regardless of how incorrect the result is, it will always contain the same number of characters. The program calculates an overall accuracy and prints which characters were guessed incorrectly and how often. Then, it saves the results to compare them to the next test run.
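The core accuracy metric can be sketched as a positional character
comparison, relying on the length-preserving property just mentioned
(the function name is illustrative, not the suite’s actual code):

```rust
// Character-level accuracy between the original and the round-tripped text.
// The lossy scheme preserves character count, so a positional comparison works.
fn accuracy(original: &str, decoded: &str) -> f64 {
    let orig: Vec<char> = original.chars().collect();
    let dec: Vec<char> = decoded.chars().collect();
    assert_eq!(orig.len(), dec.len(), "lossy codec must preserve length");
    let correct = orig.iter().zip(&dec).filter(|(a, b)| a == b).count();
    correct as f64 / orig.len() as f64
}

fn main() {
    // One misguessed character out of eleven.
    let acc = accuracy("hello world", "hallo world");
    assert!((acc - 10.0 / 11.0).abs() < 1e-12);
    println!("accuracy: {:.2}%", acc * 100.0);
}
```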
Encoder
The encoding process, depicted in figure 8, is managed by a state
machine operating in two distinct states: encodable and not encodable.
For a detailed discussion on input handling, refer to section 4.1. For
this algorithmic description, it is assumed that the program does
character-by-character processing of a UTF-8 string with indexed access
to all characters.
Initially, the encoder resides in the not encodable state. Incoming
characters are temporarily stored in a rotation array buffer. As long as
twelve consecutive encodable characters are not detected, the buffer’s
content is flushed, and characters are directly written to the output
file as standard UTF-8 encoded text.
Upon identifying twelve consecutive encodable characters, the state
machine transitions to the encodable state. Prior to entering this
state, the first four characters from the rotation array are written as
UTF-8 to provide context for the decoder. Subsequently, an encoding
break sequence is inserted to signal the commencement of the special
encoding. Character pairs from the rotation array are then processed by
the collapse function (detailed in section 4.2 and
visualized in figure 8), which converts
pairs of 4-bit representations into 8-bit units for output to the
encoded file. In the encodable state, the algorithm continues to check
each incoming character for encodability. For every two encodable
characters processed, one collapsed 8-bit unit, representing a character
pair from the buffer, is written to the encoded file. This ensures a
backlog of at least three theoretically encodable, but not yet written,
characters.
If a non-encodable character is encountered while in the encodable
state, or upon reaching the end of the input file, the state machine
reverts to the not encodable state. Before fully transitioning, any
remaining encodable characters in the rotation array backlog are
handled. If four unencoded characters remain, the first is collapsed
with a break character and written to file. If only three remain, a
double break character sequence is written to file. Any remaining
characters, including the encountered non-encodable character, are then
written to file as standard UTF-8, and the state machine returns to the
not encodable state.
Input Buffer
Buffering text is a method to enhance the speed of IO applications that heavily depend on frequent character indexing. If no buffer is used, single-character indexing would theoretically require a system call to the kernel for every character passed to the program. This call reads from disk, which is very slow compared to reading data from random access memory. A buffer of a given size reads portions of the file into memory instead of one character at a time. It then serves these portions to the program as if they were read directly from the file. This approach drastically reduces the total number of system calls, resulting in faster performance. This can be seen in figure 9. The figure shows two peaks: the one that uses the buffer takes only 0.2 $\mu$s, while the version without a buffer takes, most of the time, 1.4 $\mu$s, and not even consistently so. This shows not only that the buffered version is much faster, but also that it is much more reliable regarding speed.
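In Rust, the language the program is written in, such buffering is
available through `std::io::BufReader`. The following sketch uses an
in-memory `Cursor` in place of a file so it is self-contained; the
helper name is illustrative:

```rust
use std::io::{BufReader, Cursor, Read};

// Read a stream one byte at a time through a buffer: BufReader pulls large
// chunks into memory once, then serves single bytes from that chunk instead
// of issuing a kernel read per character.
fn read_all_buffered(data: &[u8]) -> Vec<u8> {
    let mut reader = BufReader::with_capacity(8 * 1024, Cursor::new(data));
    let mut out = Vec::new();
    let mut byte = [0u8; 1];
    while reader.read(&mut byte).unwrap() == 1 {
        out.push(byte[0]);
    }
    out
}

fn main() {
    let text = b"the quick brown fox".to_vec();
    assert_eq!(read_all_buffered(&text), text);
    println!("read {} bytes through the buffer", text.len());
}
```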
Rotation Array
While buffering improves read speeds, it also introduces challenges to
the algorithm’s architecture. These challenges occur when the buffer
switches, but the program needs to access data owned by the previous
buffer. This data is no longer accessible and would result in an error.
The problem arises if the program wants to switch to the encoding state
before twelve characters have passed into the new buffer. If it were to
switch, it would need to look back to write the first characters as
UTF-8, but they would not be accessible.
To bridge this gap, a second buffer was introduced, making it easier to
access passed characters, even if they are theoretically out of scope.
This second buffer is a deque, referred to as a rotating array in the
figure 8. With
each step, it rotates to remove the last character that has already been
written to the file and adds the new one to the front. This ensures that
every character is always easily accessible when needed.
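A minimal sketch of such a rotation array, built on
`std::collections::VecDeque` (the struct and method names are
illustrative assumptions, not the program’s actual code):

```rust
use std::collections::VecDeque;

// Sketch of the rotation array: a fixed-capacity deque that always keeps the
// last `cap` characters reachable, even across read-buffer boundaries.
struct RotationArray {
    buf: VecDeque<char>,
    cap: usize,
}

impl RotationArray {
    fn new(cap: usize) -> Self {
        Self { buf: VecDeque::with_capacity(cap), cap }
    }

    // Push the newest character; the oldest (already written out) falls off
    // and is returned to the caller.
    fn push(&mut self, c: char) -> Option<char> {
        let evicted = if self.buf.len() == self.cap {
            self.buf.pop_front()
        } else {
            None
        };
        self.buf.push_back(c);
        evicted
    }

    fn get(&self, i: usize) -> Option<char> {
        self.buf.get(i).copied()
    }
}

fn main() {
    let mut rot = RotationArray::new(4);
    for c in "hello".chars() {
        rot.push(c);
    }
    // Capacity 4: 'h' was evicted, "ello" remains accessible.
    assert_eq!(rot.get(0), Some('e'));
    assert_eq!(rot.get(3), Some('o'));
    println!("backlog: {:?}", rot.buf);
}
```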
Collapse Function
In the context of the Rust programming language, the language the
program is written in, the decision to use ‘u8’ instead of ‘u4’ is
primarily driven by the language’s inherent characteristics. Rust does
not support data types smaller than 8-bit, including the boolean type
which theoretically could be represented with a single bit, but is
stored as a full byte instead [cf. @BooleanTypeRust].
To circumvent this constraint the collapsing function is implemented.
This function combines two theoretical unsigned 4-bit integers into a
single 8-bit block, which is then written to the output file or buffer.
However, this approach introduces a new challenge: the algorithm
processes one character per cycle, and only saves a character to the
output every other cycle.
To address this, the program maintains its state using the rotation
array. Additionally, a secondary state machine is employed to ensure
that characters are combined and written to the output only every second
cycle.
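The collapse step itself reduces to simple bit operations. A sketch
follows; the inverse `expand` is added here purely for illustration:

```rust
// The collapse function: pack two 4-bit character codes into one u8,
// since Rust has no native u4 type.
fn collapse(high: u8, low: u8) -> u8 {
    debug_assert!(high < 16 && low < 16, "inputs must fit in 4 bits");
    (high << 4) | low
}

// The inverse used during decoding: split a byte back into its two nibbles.
fn expand(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0b0000_1111)
}

fn main() {
    // Using the paper's table: 't'/'r' = 12 (1100) and 'i'/'s' = 8 (1000).
    let packed = collapse(12, 8);
    assert_eq!(packed, 0b1100_1000);
    assert_eq!(expand(packed), (12, 8));
    println!("{packed:08b}");
}
```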
Character Combination Optimisation
As characters are combined in pairs, it is also important how these pairs are arranged together. For instance, if character "A" always follows character "B", and character "C" always follows character "D", combining "A" and "C", and "B" and "D" together would cause problems whenever these two pairs follow each other: the decoder would not know how to decode the characters. Therefore, the encoder must use the most robust character combinations, those with the fewest conflicts, to achieve the best results. With an alphabetical combination, the algorithm reaches an accuracy of 85.95%; the goal is to improve this percentage by changing the character combinations.
Primary Difference
The first approach creates a matrix that shows, for every character, how
different its dependencies are from those of every other character.
$$\text{matrix}[x][y] = \sum_{\substack{d = -4 \\ d \neq 0}}^{3} \sum_{i = 0}^{28} \left(p(c[x], c[i], d) - p(c[y], c[i], d)\right)^{2}$$
- $c = \text{array of decodable characters}$
- $p = p(\text{char: } \text{char}_1, \text{char: } \text{char}_2, \text{int: distance})$ is the relative probability that, given $\text{char}_1$, $\text{char}_2$ occurs at the specified distance.
This 2D matrix now indicates that the higher the number at the
intersection of two characters’ axes is, the more different their
dependencies are. After this process, to determine the best possible
combination, the solution with the highest overall value must be found.
This can be done using the Hungarian algorithm, which is capable of
matching a 2D matrix to obtain the highest overall value [cf.
@https://doi.org/10.1002/nav.3800020109 1-14].
This results in an accuracy of 91.1%.
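A sketch of how this matrix could be computed, assuming the probabilities are stored as a dense 3-D array indexed `[character][context character][distance]` (the real program's data layout may differ):

```rust
// Build the pairwise difference matrix from the formula above.
// p[x][i][d] holds the relative probability that character i occurs at
// the d-th distance (of -4..=3, excluding 0) from character x.
const CHARS: usize = 29; // decodable characters
const DISTS: usize = 7;  // distances -4..=3, excluding 0

fn difference_matrix(p: &[[[f64; DISTS]; CHARS]; CHARS]) -> [[f64; CHARS]; CHARS] {
    let mut m = [[0.0; CHARS]; CHARS];
    for x in 0..CHARS {
        for y in 0..CHARS {
            for i in 0..CHARS {
                for d in 0..DISTS {
                    let diff = p[x][i][d] - p[y][i][d];
                    m[x][y] += diff * diff;
                }
            }
        }
    }
    m
}

fn main() {
    // With identical probability profiles, every entry is zero.
    let p = [[[0.1; DISTS]; CHARS]; CHARS];
    let m = difference_matrix(&p);
    assert!(m.iter().flatten().all(|&v| v == 0.0));
}
```

The resulting matrix is then fed to the Hungarian algorithm to find the assignment with the highest overall dissimilarity.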
Gradient Descent
| Character 1 | Character 2 | BINARY | DECIMAL |
|---|---|---|---|
| f | d | 0000 | 0 |
| p | k | 0001 | 1 |
| c | l | 0010 | 2 |
| g | o | 0011 | 3 |
| n | w | 0100 | 4 |
| a | z | 0101 | 5 |
| h | x | 0110 | 6 |
| m | q | 0111 | 7 |
| i | s | 1000 | 8 |
| j | e | 1001 | 9 |
| y | v | 1010 | 10 |
| b | u | 1011 | 11 |
| t | r | 1100 | 12 |
| ’ ’ | | 1101 | 13 |
| . | | 1110 | 14 |
| break | | 1111 | 15 |
Character 1 and Character 2 are both represented by the same number, shown here in binary and decimal.
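The table above can be sketched as a lookup structure: encoding maps a character to its 4-bit code, and decoding yields the two candidates the decoder must later choose between. The representation of the break character as a newline is an assumption for illustration:

```rust
// The pairing table: each 4-bit code maps to up to two characters.
// '\n' stands in for the "break" entry here (an assumption).
const PAIRS: [(char, Option<char>); 16] = [
    ('f', Some('d')), ('p', Some('k')), ('c', Some('l')), ('g', Some('o')),
    ('n', Some('w')), ('a', Some('z')), ('h', Some('x')), ('m', Some('q')),
    ('i', Some('s')), ('j', Some('e')), ('y', Some('v')), ('b', Some('u')),
    ('t', Some('r')), (' ', None), ('.', None), ('\n', None),
];

fn encode(ch: char) -> Option<u8> {
    PAIRS.iter()
        .position(|&(a, b)| a == ch || b == Some(ch))
        .map(|i| i as u8)
}

fn decode(code: u8) -> (char, Option<char>) {
    PAIRS[code as usize]
}

fn main() {
    assert_eq!(encode('t'), Some(12));
    assert_eq!(encode('r'), Some(12)); // 't' and 'r' share code 1100
    assert_eq!(decode(12), ('t', Some('r')));
    assert_eq!(encode('Z'), None);     // uppercase letters are not encodable
}
```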
Gradient descent is another way of optimizing parameters if it is
possible to change the parameters batch-wise and analyze the results
using a single metric. The batches are character combinations that can
be changed randomly. The metric used to evaluate how well it performed
is the accuracy of the algorithm, evaluated by my evaluation program
described in section 3.4{reference-type=“ref”
reference=“validation”} [cf. @DBLP:journals/corr/Ruder16 2-9]. The
algorithm changes two pairs, called batches in this context, and tests
whether the accuracy has improved. If it has, it keeps the new
combination; if not, it reverts to the old one. By repeating this
process, the accuracy will very likely reach a plateau, corresponding to
a theoretically optimal character combination table in which the
combined characters are the least likely to overlap.
This has been done with four, three, and two character combinations at a
time; the runs plateaued not on the same combination, but at the same
accuracy percentage. A dynamic variant, in which the batch size changes,
has also been tried, but it did not change the outcome. The graphic
11{reference-type=“ref” reference=“gradienddescend”}
shows the progression with batches of size two, where two character
combinations are randomly exchanged. It plateaus at an accuracy of 92.63%.
The combination can be found in figure
10{reference-type=“ref”
reference=“charactercombinationfig”}.
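The swap-and-test step described above can be sketched as follows. The real metric is a full evaluation run of the compressor; the `accuracy` function below is a dummy stand-in so the sketch is self-contained:

```rust
// Dummy objective: reward alphabetically close partners (illustration
// only; the real metric is the reconstruction accuracy of the decoder).
fn accuracy(pairs: &[(char, char)]) -> f64 {
    pairs.iter()
        .map(|&(a, b)| -((a as i32 - b as i32).abs() as f64))
        .sum()
}

// Swap the second members of two pairs; keep the change only if the
// metric improves, otherwise revert.
fn swap_step(pairs: &mut Vec<(char, char)>, i: usize, j: usize) -> bool {
    let before = accuracy(pairs);
    let (bi, bj) = (pairs[i].1, pairs[j].1);
    pairs[i].1 = bj;
    pairs[j].1 = bi;
    if accuracy(pairs) > before {
        true // keep the new combination
    } else {
        pairs[i].1 = bi; // revert
        pairs[j].1 = bj;
        false
    }
}

fn main() {
    let mut pairs = vec![('a', 'z'), ('y', 'b')];
    // Swapping to ('a','b') and ('y','z') brings partners closer together.
    assert!(swap_step(&mut pairs, 0, 1));
    assert_eq!(pairs, vec![('a', 'b'), ('y', 'z')]);
}
```

Repeating `swap_step` with randomly chosen indices until no swap improves the metric yields the plateau behaviour described above.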
Given the validation test, the current configuration most often confuses
"r" with "t", while the reverse, "t" mistaken for "r", is comparatively
rare. With a quarter as many error occurrences, the next weak points are
"c" with "l" and "n" with "w". This might be explained by the fact that
the algorithm treats "t" and "r" equally, as the data has been
normalized, while the overall probability of "t" occurring in an English
text is significantly higher than that of an "r" [cf. @salomon2002data
2-14]. One has to note, however, that the probability gap between "n"
and "w" is significantly larger.
Compression
In theory, if all characters were encodable, the algorithm would approach a 50% compression ratio, where $C$ is a fixed amount of overhead: $$\lim_{x \to \infty} \frac{x/2 + C}{x} = 0.5$$ As not all characters are compressible, the compression ratio depends on the context the algorithm is used in. For example, the text from the Wikipedia dump has been analyzed, and only 72.7% of its characters are compressible. The Book is the novel Frankenstein by Mary Wollstonecraft Shelley and serves as an example of an average text file; 84.2% of its characters are theoretically compressible. The Text consists of short stories written by ChatGPT; it contains no non-ASCII characters and represents plain English text. The figure 12{reference-type=“ref” reference=“compressable”} shows how large the encoded version is compared to the original file.
| File | Compressed size | Compressible |
|---|---|---|
| Wikipedia | 77.4% | 72.7% |
| Book | 72.1% | 84.2% |
| Text | 65.7% | 93.2% |
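A quick numerical check of the limit above: with a fixed overhead $C$, the ratio $(x/2 + C)/x$ approaches 0.5 as the input length $x$ grows.

```rust
// Compression ratio of an input of length x with fixed overhead c,
// assuming every character is encodable (the theoretical best case).
fn ratio(x: f64, c: f64) -> f64 {
    (x / 2.0 + c) / x
}

fn main() {
    let c = 100.0; // hypothetical fixed overhead in bytes
    assert!((ratio(1_000.0, c) - 0.6).abs() < 1e-9);     // overhead still visible
    assert!((ratio(1_000_000.0, c) - 0.5).abs() < 1e-3); // approaching the limit
}
```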
Given that speed optimizations beyond input buffering have not been
implemented, direct performance comparisons offer limited definitive
value at this stage. Furthermore, a precise benchmark against other
algorithms is inherently challenging due to the novelty of this lossy
text compression approach, for which directly comparable methods are
scarce. The inclusion of ZIP 3.0, compiled with GCC 13.2.1 using the
-O3 argument, serves primarily to establish a theoretical baseline
for performance relative to a widely recognized and efficient lossless
compression algorithm. Therefore, the presented speed values in figure
13{reference-type=“ref”
reference=“compressionspeed”} and Table
13{reference-type=“ref”
reference=“compressionspeed”} should be interpreted as indicative of a
potential performance level, rather than absolute benchmarks for
optimized implementations.
| Time | This algorithm | ZIP -1 |
|---|---|---|
| Wikipedia | 266.36 secs | 259.00 secs |
| Text | 8.43 millis | 8.59 millis |
The speed tests have been run three times and the average has been
taken, but as the hardware is not representative, the values should only
be compared against each other and can differ depending on the
underlying system.
Decoder
The decoding process, illustrated in figure
14{reference-type=“ref” reference=“decoder”}, is also
governed by a state machine with two states: decoding and not decoding.
This section describes the theoretical operation, assuming
character-by-character input and indexed access to all characters. The
decoder initiates in the not decoding state, transitioning to the
decoding state upon encountering the designated break character
sequence.
In the not decoding state, the algorithm operates as a standard UTF-8
decoder, directly writing each byte to the output file or buffer without
further processing. Upon entering the decoding state, triggered by a
break character, the processing diverges. Instead of direct output, each
incoming byte, representing a compressed character pair, is passed to
the Decoder function, depicted in detail within figure
14{reference-type=“ref” reference=“decoder”}.
For each compressed byte, the Decoder function generates two potential
character possibilities, denoted as "Possibility A" and "Possibility
B" in figure 14{reference-type=“ref” reference=“decoder”},
corresponding to the two possible characters encoded within the 4-bit
representation. For each character, the algorithm proceeds through the
following steps:
- Gather Possibilities: The Gather possibilities module analyzes the surrounding characters within the defined context window, as discussed in section 3.2.2{reference-type=“ref” reference=“contextwindow”}, to collect relevant contextual information for both "Possibility A" and "Possibility B".
- Activation Function: The contextual information is then processed by the Activation function module. This module, as detailed in section 5.2.1{reference-type=“ref” reference=“activationfunktion”}, applies a non-linear activation function to the probabilities associated with each character possibility in the given context, refining their relevance for subsequent evaluation.
- Relative Multiplication: Following activation, the Relative multiplication module, described in section 5.2.2{reference-type=“ref” reference=“relativmultiplication”}, applies distance-based weighting to the probabilities. This step adjusts the influence of contextual characters based on their proximity to the character being decoded, enhancing the accuracy of the probability assessment.
- Compare: Finally, the Compare module evaluates the processed probabilities for both "Possibility A" and "Possibility B". By comparing the aggregated and weighted probabilities, the algorithm determines the most probable character of the two possibilities and outputs it to the decoded file. For example, if "Possibility A" exhibits significantly higher contextual probability, it is selected as the decoded character.
The decoder remains in the decoding state, processing subsequent compressed bytes through the Decoder function, until another break character sequence is encountered. Upon detecting a break character in the decoding state, the state machine reverts back to the not decoding state, resuming standard UTF-8 decoding for the subsequent input stream.
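The four steps above can be condensed into a sketch of the final comparison. Function names are illustrative; the weights mirror the distance-weight table given later in this paper, and tanh stands in for the activation function:

```rust
// Weights for distances -4..=3 (excluding 0), in order.
const WEIGHTS: [f64; 7] = [1.05, 1.09, 1.3, 2.0, 2.0, 1.4, 1.16];

// Score one candidate: activate each contextual probability, apply the
// distance weighting, and sum the results.
fn score(context_probs: &[f64; 7]) -> f64 {
    context_probs.iter()
        .zip(WEIGHTS.iter())
        .map(|(p, w)| p.tanh() * w) // activation, then relative multiplication
        .sum()
}

// Compare the two possibilities and emit the more probable character.
fn compare(probs_a: &[f64; 7], probs_b: &[f64; 7], a: char, b: char) -> char {
    if score(probs_a) >= score(probs_b) { a } else { b }
}

fn main() {
    let a = [0.4; 7]; // "Possibility A" is clearly better supported by context
    let b = [0.1; 7];
    assert_eq!(compare(&a, &b, 't', 'r'), 't');
}
```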
Gather Possibilities
The Gather function searches for the necessary context in the
surrounding environment and presents it to the activation function. When
the context is fully contained within the buffer, the gatherer can
easily reference it. However, as discussed in
4.1{reference-type=“ref” reference=“inputbuffer”} ,
problems arise when the buffer goes out of range. To address this, the
decoder employs a secondary buffer, similar to the encoder, to bridge
this gap.
In the decoding function, unlike the encoding function, every ‘u8’ may
represent two characters. Therefore, the Gather function must also
determine the length of each ‘u8’ and split them accordingly before
passing them to the activation function, as some might be collapsed
characters. However, the decoder may encounter situations where it is
uncertain whether a character should be handled as UTF-8 or not. In such
cases, the Gather function has to scan linearly through the surrounding
characters to determine the state before saving it to the buffer. This
step is inherently serial and therefore cannot currently be parallelized.
Optimisation
The way the algorithm handles probabilities has a significant impact on its performance. To sharpen the differences between candidate probabilities, the algorithm applies two functions to them. These functions have been shown to improve the average accuracy of the algorithm to 93.62% on text that has been compressed to 65.7% of its original size.
Activation Function
Activation functions are commonly used in machine learning applications
to handle the relevance of data, which in this case are the
probabilities. The primary objective of an activation function is to
introduce non-linearity into the model, which has been shown to produce
better results while being computationally inexpensive [cf.
@DBLP:journals/corr/abs-2109-14545 3-10].
In this algorithm, every probability is passed through an activation
function before being used in subsequent calculations. One key
difference between this algorithm and typical machine learning
applications is that the probability values used in this algorithm will
never be negative or zero. This is because every character has been
observed at least once at every distance to every other character.
Therefore, the activation function used in this algorithm must be able
to handle positive input values. Different activation functions have
been tested and evaluated on the algorithm, with some presented in
figure 15{reference-type=“ref” reference=“activation”}.
Of these, Tanh delivers the best accuracy; in the context of this
algorithm it shrinks the smaller values disproportionately, suggesting
that it filters out low-probability noise.
| Activation | Accuracy |
|---|---|
| Sigmoid | 92.632% |
| SELU | 92.60% |
| Tanh | 92.78% |
It has to be noted that many of the activation functions proposed in the cited paper have not been tested, since they would require a $\beta$ parameter that has to be determined by training, which is the subject of another paper.
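The three activations from the table can be written out as follows. The SELU constants are the standard published values; inputs in this algorithm are always strictly positive, as noted above:

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn selu(x: f64) -> f64 {
    const LAMBDA: f64 = 1.0507;
    const ALPHA: f64 = 1.67326;
    if x > 0.0 { LAMBDA * x } else { LAMBDA * ALPHA * (x.exp() - 1.0) }
}

fn main() {
    // For the strictly positive inputs used here, SELU is a linear
    // scaling, sigmoid maps into (0.5, 1.0), and tanh stays below x.
    assert!((selu(0.5) - 1.0507 * 0.5).abs() < 1e-12);
    assert!(sigmoid(0.5) > 0.5 && sigmoid(0.5) < 1.0);
    assert!((0.5f64).tanh() < 0.5);
}
```

Tanh is available directly as `f64::tanh` in the Rust standard library, which keeps this stage computationally cheap.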
Relative Multiplication
As demonstrated in section 3.2.2{reference-type=“ref” reference=“contextwindow”}, the relevance of a character to another character decreases as the distance between them increases. Previously, all seven probabilities were treated equally, but as shown, they are not equally relevant to a given character. To account for this discrepancy, the probabilities are weighted based on the data presented in figure 3{reference-type=“ref” reference=“entropyfig”}, which results in better accuracy: the individual probabilities at each distance are adjusted according to their relevance. Using this weighting scheme as a starting point and applying gradient-descent-like algorithms to optimize the accuracy of the model results in a 0.84% improvement in accuracy, indicating that this approach effectively captures the varying degrees of relevance between characters.
| Distance | Base | Plateau |
|---|---|---|
| -4 | 1.05 | 1.05 |
| -3 | 1.09 | 1.09 |
| -2 | 1.3 | 1.3 |
| -1 | 2.0 | 2.0 |
| 1 | 1.8 | 2.0 |
| 2 | 1.4 | 1.4 |
| 3 | 1.16 | 1.16 |
Adjust Weights
The weights themselves are still normalized and are prone to some error.
To improve the accuracy of the model for real-world use, a weight
optimizer has been implemented. This optimizer takes in text and trains
the model’s weights on this text. Similar to the NNCP algorithm
mentioned earlier, this optimizer could be used to train the model for
specific use cases or vocabulary. This method could prove particularly
useful in applications where the vocabulary or language usage is highly
specialized or unique. As this algorithm aims to provide a base model
for general English text, use-case-specific optimization is left to
other papers.
To achieve a model for general English, a dataset of 24,000 characters
was used, of which 4,000 were randomly selected from Wikipedia articles,
and 20,000 were from short stories written by ChatGPT. The test short
story was not included in the training dataset to avoid bias. Different
step sizes and epoch iterations were tried to find the optimal values.
After several iterations, a step size of 0.00025 yielded the best
results, with an accuracy of 93.62%.
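The weight optimizer can be sketched as a nudge-and-keep loop: each weight is shifted by the step size in both directions, and the change is kept only when the accuracy metric improves. The metric below is a dummy stand-in for the full evaluation run:

```rust
const STEP: f64 = 0.00025; // the step size that yielded the best results

// One optimization pass over all seven distance weights.
fn optimize(weights: &mut [f64; 7], accuracy: impl Fn(&[f64; 7]) -> f64) {
    for i in 0..7 {
        for delta in [STEP, -STEP] {
            let before = accuracy(weights);
            weights[i] += delta;
            if accuracy(weights) <= before {
                weights[i] -= delta; // no improvement: revert
            }
        }
    }
}

fn main() {
    // Dummy objective for illustration: a larger first weight is better.
    let mut w = [1.0; 7];
    optimize(&mut w, |w| w[0]);
    assert!(w[0] > 1.0);
}
```

Running such a pass repeatedly over many epochs on the training text corresponds to the training procedure described above.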
Output
The output is now rather accurate and mostly readable. The following is a short example of a compressed text that has been decompressed with all adjustments and optimisations presented in this paper; each decoded line is shown above the corresponding original:
Once bpon a time, in the bustling mettopolis of London,
Once upon a time, in the bustling metropolis of London,
there lived a man wamed Ben.
there lived a man named Ben.
The algorithm makes some mistakes, but all words keep their original length, and the errors are rather easy to spot: if a character was identified wrongly, there is only one other option it could have been.
Discussion
In this section, the results of the investigation will be presented and their implications discussed. Additionally, potential follow-up studies aimed at exploring alternative methods of interpreting probabilities to enhance text compression using character correlations will be outlined. Furthermore, possible optimizations that could improve both speed and accuracy, as well as potential future use cases will be discussed.
Results
This study investigated the feasibility of lossy text compression
utilizing statistical probabilities and character correlations. The
developed algorithm successfully demonstrated the viability of this
approach, achieving a character reconstruction accuracy of 93.62% on
English text. This accuracy represents an improvement from an initial
85.95%, realized through iterative optimization and refinement of
the algorithm, as detailed in section
5{reference-type=“ref” reference=“decodersection”}.
Furthermore, performance evaluations indicate that the compression and
decompression speed of the presented algorithm is comparable to that of
the fast mode of the widely adopted lossless ZIP compression algorithm.
As shown in figure 13{reference-type=“ref”
reference=“compressionspeed”} and
Table 13{reference-type=“ref”
reference=“compressionspeed”}, the processing time is within a similar
range to ZIP -1.
Analysis of the decompressed output reveals a high degree of semantic
preservation, despite the lossy nature of the compression. The observed
errors primarily manifest as predictable character substitutions,
maintaining the original character count and presenting only one
possible alternative for each potentially incorrect character. This
characteristic suggests a low noise ratio in the decompressed text, with
errors largely confined to orthographic variations.
Character Relations
| Combination | 0 | 1 | 2 | 3 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| F and D | 0.180 | 0.196 | 0.253 | 0.359 | 0.314 | 0.251 | 0.224 |
| P and K | 0.190 | 0.231 | 0.258 | 0.315 | 0.304 | 0.230 | 0.213 |
| C and L | 0.186 | 0.196 | 0.245 | 0.331 | 0.368 | 0.266 | 0.213 |
| G and O | 0.187 | 0.201 | 0.260 | 0.376 | 0.421 | 0.221 | 0.195 |
| N and W | 0.185 | 0.200 | 0.239 | 0.352 | 0.339 | 0.253 | 0.221 |
| A and Z | 0.192 | 0.211 | 0.233 | 0.386 | 0.444 | 0.215 | 0.194 |
| H and X | 0.196 | 0.210 | 0.285 | 0.488 | 0.258 | 0.218 | 0.199 |
| M and Q | 0.205 | 0.217 | 0.224 | 0.252 | 0.367 | 0.282 | 0.209 |
| I and S | 0.188 | 0.209 | 0.224 | 0.391 | 0.468 | 0.225 | 0.199 |
| J and E | 0.196 | 0.204 | 0.231 | 0.364 | 0.410 | 0.223 | 0.208 |
| Y and V | 0.190 | 0.221 | 0.240 | 0.346 | 0.362 | 0.232 | 0.212 |
| B and U | 0.185 | 0.196 | 0.236 | 0.458 | 0.430 | 0.206 | 0.200 |
| T and R | 0.182 | 0.210 | 0.262 | 0.383 | 0.315 | 0.251 | 0.209 |
| K and L | 0.202 | 0.237 | 0.259 | 0.346 | 0.277 | 0.224 | 0.206 |
As this algorithm relies heavily on character correlation, specifically
for single characters within a context window of 7, significant insights
have been gathered by training a gradient boosted tree using the Google
yggdrasil library [cf. @GBBSP23 1-10]. Some of these findings are shown
in figure 16{reference-type=“ref” reference=“effekt”}, where
index 4 has been omitted as it represents the character itself. The data
demonstrates that surrounding characters contain enough information to
predict the next character with a reasonable degree of confidence,
though not all characters contribute equally. Characters at positions -1
and 1 are the most influential, but their relative importance varies
depending on the specific character combination. Additionally, the rate
at which characters lose relevance is not consistent, but even the least
relevant character in the most extreme tested scenario retains about 40%
of the relevance of the most important one. Therefore, this information
should not be considered redundant.
Since this area has only been marginally explored, there remains
significant room for further improvement and experimentation.
Further Investigations
This paper serves as a proof of concept, demonstrating that this sort of
compression works at a small scale using minimalistic statistical
machine learning models. As even this relatively simple approach,
compared to the later presented ideas and the currently developed
generative compression algorithms, proved successful, confidence is
given to explore more advanced and challenging algorithms to further
improve accuracy towards the 99.9% mark.
Currently, the probabilities between single characters at specific
distances are analyzed. However, it could be hypothesized that
characters in English are not isolated but embedded in character groups,
syllables and words. Thus, examining them separately may result in
suboptimal performance. Future research should investigate how character
prediction models would behave if pairs of characters were analyzed
instead of single characters. These n-grams might represent the English
language better than characters analyzed separately.
Furthermore, the use of tokenizers to optimize n-gram pairs could be
explored, with the aim of grouping the most relevant pairs together. Of
particular interest would be the development of encoding methods for
such multi-character combinations, as it would no longer be as simple as
grouping two encoded pairs together. Additional studies could
investigate alternative ways of interpreting probabilities by leveraging
architectures like decision trees. From the current perspective, random
forests and classic neural networks could yield interesting results, as
the current decoder’s comparison step acts as a non-linear, binary
classifier. Finally, all these advancements could be combined and tested
together to determine the full potential of this type of compression.
Optimization
The environment in which these characters are encoded and decoded is
also important and could be optimized to yield better performance.
Compute cost is an important metric when evaluating compression
algorithms, as it often determines whether running the algorithm is
feasible compared to storing the data uncompressed. Currently,
the speed is comparable to the zip compression library in fast mode, but
the algorithm is running single-threaded and processing the file
sequentially. With only a small reduction in the number of storable
characters, the algorithm could be multithreaded by decoding buffers
simultaneously. This approach would result in non-encodable space
between the buffers but would also eliminate the need for the second
buffer, as bridging the gap between following buffers was its main
purpose.
Performance improvements could also be made separately from the
algorithm itself. The algorithm has the advantage that all words
maintain the same length, and errors, when made, are quite predictable
as there is only one option where the algorithm could have made a
mistake. A secondary algorithm could be implemented to review the
written text using an English dictionary to verify if the words are
valid. When encountering a word not found in the dictionary, it could
search for possible alternatives. This secondary algorithm could then
discard potential results that are impossible to have been produced by
the current algorithm, such as when an ‘r’ is replaced by an ’l’, as
this substitution is not possible with the current algorithm.
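Such a secondary check could be sketched as follows. The dictionary is toy data, the pairing table is excerpted, and the function names are hypothetical; the key constraint is that only the single pair alternative of each character is a legal substitution:

```rust
// Excerpt of the pairing table: each character's only possible confusion.
fn partner(ch: char) -> Option<char> {
    match ch {
        't' => Some('r'), 'r' => Some('t'),
        'n' => Some('w'), 'w' => Some('n'),
        'j' => Some('e'), 'e' => Some('j'),
        _ => None,
    }
}

// If a word is not in the dictionary, try substituting each character
// with its single pair alternative and re-check.
fn repair(word: &str, dictionary: &[&str]) -> String {
    if dictionary.contains(&word) {
        return word.to_string();
    }
    for (i, ch) in word.char_indices() {
        if let Some(alt) = partner(ch) {
            let mut candidate = word.to_string();
            candidate.replace_range(i..i + ch.len_utf8(), &alt.to_string());
            if dictionary.contains(&candidate.as_str()) {
                return candidate;
            }
        }
    }
    word.to_string() // no valid single-substitution fix found
}

fn main() {
    let dict = ["named", "metropolis"];
    // "wamed" from the example output is repaired to "named" ('w' -> 'n').
    assert_eq!(repair("wamed", &dict), "named");
    assert_eq!(repair("named", &dict), "named");
}
```

Because 'r' can only ever have been 't' (and never 'l'), the candidate space per word stays very small, which keeps this post-processing cheap.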
Implications
Text is a widely used medium that already stores information quite efficiently. It is used in various contexts, from comment sections to books, from informal brain dumps where syntax errors are tolerable to program files where they are not. This algorithm opens a mostly unexplored territory of lossy text compression. As already mentioned in section 2.1{reference-type=“ref” reference=“compressiontypes”}, lossy compression is often used when it directly interfaces with humans, and as it does not recreate the exact text most of the time, some room for interpretation is acceptable. Given these requirements, some use cases would be:
- Instant Messaging: In areas where data transfers are not as fast and reliable, compressing text with such lossy text compression could be an option.
- Data Mining: Large Language Model (LLM) trainings use huge amounts of data, which is mostly text. While language models benefit from high-quality training data, if compression accuracy could be improved, the saved disk space could be valuable, saving on disk and bandwidth.
- Transcripts: Speech-to-text models could provide transcripts of audio. For storing such data, lossy text compression would be acceptable.
- Search Engine Indexing: Storing compressed versions of web pages to save space and improve search speed, as keywords could still be visible in the compressed form.
- Old Comment Sections: Old comment sections that have long been unvisited could also be stored in a compressed form.

Conclusion
This paper has successfully demonstrated the viability of lossy text compression through the application of lightweight prediction models and statistical character correlations. By encoding character pairs into compact 4-bit representations within the UTF-8 framework, a novel algorithm has been developed, achieving significant compression ratios while maintaining a high level of text reconstruction accuracy. The presented methodology, leveraging a state machine and contextual probabilities, offers a computationally efficient alternative to resource-intensive, transformer-based compression techniques. The empirical validation, demonstrating performance comparable to established lossless compression algorithms like ZIP in terms of speed, underscores the practical potential of this approach.
While this work serves as a proof of concept, it also highlights several avenues for future research and refinement. Further investigation into advanced probability interpretation methods, n-gram analysis, and integration with natural language processing techniques promises to yield even greater improvements in both compression efficiency and reconstruction fidelity. Moreover, optimizations in algorithmic implementation and hardware utilization could further enhance the speed and applicability of this compression paradigm. The findings of this study suggest that lossy text compression, grounded in statistical prediction, represents a promising direction for applications demanding lightweight data reduction without complete data preservation, thereby opening new possibilities in domains such as instant messaging, large language model data preprocessing, and search engine technologies. The initial success of this endeavor encourages continued exploration into the rich potential of predictive models for text compression and related data processing tasks.