Final Thesis within the Framework of the Reifeprüfung
Text Compression with Prediction Models
Ben Kinigadner
8D, 2024/25
Bundesgymnasium und Bundesrealgymnasium Wien 4
Wiedner Gymnasium / Sir Karl Popper Schule
A-1040 Wien, Wiedner Gürtel 68
Supervising teacher: Alexander Hoefler
Submitted on 14 February 2025
Abstract
This paper presents a novel approach to lossy text compression that
leverages character correlations and statistical probabilities. The
approach introduces a lightweight algorithm that compresses English text
by encoding pairs of characters into 4-bit representations and embedding
them into UTF-8 encoding. This method achieves a compression ratio of up
to 65.7% while maintaining 93.62% accuracy in character reconstruction.
The algorithm employs a state machine for encoding and decoding,
utilizing surrounding character context to predict the most probable
character during decompression. A multitude of optimization techniques,
including character pair combinations, activation functions, and
relative probability weighting, have been employed to enhance accuracy
and performance. These techniques will be explored in this paper.
This study demonstrates the feasibility of lossy text compression using
minimal computational resources, with performance comparable to
established lossless compression algorithms like ZIP. Potential
applications include instant messaging, data mining for large language
models, and search engine indexing.
This work serves as a proof of concept, opening avenues for further
research into more advanced probability interpretation methods, n-gram
analysis, and potential integration with natural language processing
techniques to further improve compression efficiency and accuracy.
Introduction
Recent advancements in machine learning, especially in large language
models (LLMs), have enabled computers to create coherent paragraphs of
text [cf. @cho2019coherentcohesivelongformtext 1-8]. They can start with
just a few characters and, leveraging predictive models, generate
complete English sentences [cf. @ContextDependentRecurrent2012 1-5].
This ability can be utilized to compress data by removing unnecessary
information, information that is above the Shannon entropy and is
theoretically redundant [cf. @shannon1948mathematical 50-61], which the
model can later fill in again [cf. @bellardLosslessDataCompression 1-9].
However, the current models, which are based on the transformer
architecture, require significant computational resources due to their
reliance on complex matrix operations [cf. @vaswani2023attentionneed
3-8].
There are already numerous methods of compressing text, but there is
always a trade-off between the computational costs of compression and
decompression and the resulting size [cf. @alakuijala2015comparison 1-6].
Until now, compression algorithms that specialize in text prediction
using generative models have been rather unattractive for lightweight
workloads, such as web traffic and simple archiving: compressing the
text requires a lot of computing power on the server side, and an entire
machine learning model has to be deployed, which can cause problems on
weaker end devices. This raises the question of how it is potentially
possible to predict the next letters with higher speed and sufficient
accuracy with less compute. While transformer-based algorithms can
effectively compress data, their high computational demands limit their
practical applications [cf. @LargeTextCompression]. To leverage these
predictive capabilities, it is crucial to reduce their computational
costs and increase their speed.
Achieving this objective necessitates a shift from substantial matrix
operations to the simplification of tasks into binary classification
problems. The objective of this paper is to demonstrate a proof of
concept that smaller, more efficient models can attain high accuracy
without the necessity of a tokenizer and by relying exclusively on
statistical methodologies. The aim is to demonstrate that substantial
speed improvements and high accuracy are attainable even in the initial
stages, rendering these models suitable for data compression and
decompression tasks and making this type of compression potentially
suitable for general use.
Background
In this section, underlying standards and algorithms are explained.
Compression Overview
In order to comprehend the functionality of compression algorithms, it
is necessary to categorize them into two distinct variants. This
classification facilitates a comprehensive understanding of their
applications and potential uses.
In most cases, it is important that decompression restores all data that
was present before compression. This is particularly important for
applications where it is not possible to guess what the data to be
compressed will be used for. Losing precision in already complicated
compiled programs or libraries could mean that they cannot be executed
at all. The same applies to proprietary file formats where
it is not fully known which bytes serve which purpose. This is why such
compression algorithms are particularly important in universal data
transfer protocols such as SFTP [cf. @RFC468FTP 1-7]. These compression
algorithms are often less specific and the compressed files stay
relatively large.
On the other hand, we as humans live in environments with high
information density. We have developed effective ways of filtering out
much of this data and in turn make sufficient predictions about the
dropped information [cf. @marzenEvolutionLossyCompression2017 1-8]. This
can even go so far that humans are able to see images with only a few
data points and recognize what they must represent [cf. @BF02833890
1-5]. Compression algorithms that make use of this ability and do not
preserve all the data, but enough to semantically reconstruct the data
to its original state, fall into the category of lossy compression
algorithms. The old file is therefore no longer fully reconstructable
for the computer, but should semantically mean the same thing to the
person viewing the file. One of the best-known representatives of these
lossy compression algorithms is JPEG.
Since this paper’s algorithm cannot provide 100% correctness in the
majority of cases, but the output may sometimes require human
interpretation or correction to fully match the original semantic
intent, this algorithm falls into this category. In essence, this means
that the information discarded by the algorithm may lie below the
entropy threshold, rendering it logically impossible to reconstruct.
Consequently, the function representing the compression algorithm is not
injective.
[cf. @salomon2002data 2-14]
UTF-8
| Length | Binary Representation |
|---|---|
| 8-bit | 0xxxxxxx |
| 16-bit | 110xxxxx 10xxxxxx |
| 24-bit | 1110xxxx 10xxxxxx 10xxxxxx |
| 32-bit | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
In order for an algorithm to make a valid prediction, it must know what
data it is dealing with and what kind of prediction it is allowed to
make. Although all compressed characters are completely in the ASCII
[cf. @ASCIIFormatNetwork1969 2-6] area, to allow smooth encoding and not
too many problems with unknown characters, my system is embedded in
UTF-8.
UTF-8, or Unicode Transformation Format – 8-bit, is a universal
standard for encoding almost all the world’s languages and additional
characters [cf. @yergeauUTF8TransformationFormat2003 2-11]. It is based
on ASCII and is therefore fully interoperable. It was created because
the ASCII set is very limited: it only contains American characters and
special characters such as ‘!’, ‘?’ and ‘*’. As each character occupies
8 bits, it could hold at most 256 different characters. Additionally,
since the first bit was originally reserved as a parity bit, ASCII
effectively contains only 128 characters, far too few to represent all
languages and special characters, let alone emojis.
UTF-8, on the other hand, uses variable lengths, from only 8 bits to
encode ASCII characters up to 32 bits to represent even the most fringe
characters. The leading bits of the first byte indicate how many bytes
the character occupies. For ASCII characters, the first bit, which was
usually a 0, has been reassigned so that it now indicates that the
current character is only 8 bits long. For longer characters, the number
of 1s before the first 0 in the lead byte indicates how many bytes are
needed to fully decode the current character. Table 1 visualizes this.
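This length rule can be sketched as a small Rust function. It is an
illustration only, not part of the compressor itself, and the function
name is my own:

```rust
// Determine how many bytes a UTF-8 character occupies,
// based solely on its first (lead) byte.
fn utf8_char_len(lead: u8) -> Option<usize> {
    if lead & 0b1000_0000 == 0 {
        Some(1) // 0xxxxxxx: plain ASCII, 8 bits
    } else if lead & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 110xxxxx: 16-bit character
    } else if lead & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 1110xxxx: 24-bit character
    } else if lead & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 11110xxx: 32-bit character
    } else {
        None // 10xxxxxx continuation bytes never start a character
    }
}

fn main() {
    // 'A' is ASCII, 'ü' needs two bytes, '€' three, '😀' four.
    for (ch, expected) in [('A', 1), ('ü', 2), ('€', 3), ('😀', 4)] {
        let lead = ch.to_string().as_bytes()[0];
        assert_eq!(utf8_char_len(lead), Some(expected));
        println!("{ch:?} -> {expected} byte(s)");
    }
}
```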
Related Work
Text compression algorithms can also be further divided into two
different groups, generative and semantic. These two terms are used here
as defined in Witten: Semantic and Generative Models for Lossy Text
Compression [cf. @wittenSemanticGenerativeModels1994 1-5].
Since generative text compression algorithms are leading in current
benchmarks, I mainly drew inspiration from this to build mine. The basis
of the code is therefore based on methods similar to those used in
machine learning. Since this approach tries to be faster than current
models such as NNCP, the model presented in this paper does not perform
just-in-time training on the input; while this might increase accuracy,
it would in turn cost a lot of computational power. Instead, a
pretrained model trained on generic English text is used.
Character Entropy
The foundation on which my text prediction algorithm is based is:
Shannon: Prediction and Entropy of Printed English [cf.
@shannonPredictionEntropyPrinted 50-61]. Shannon formulated the thesis
that the order of characters following each other is not random,
but that some are more probable than others. The probabilities are even
more precisely defined than the absolute probability of one letter
following the next. So if you have a chain of characters $p$, it may be
that ‘b’ is more likely to come next than ’e’, although without context
’e’ is much more likely since it is more often used in the English
language [cf. @FoundationsComputerSecurity 2-6].
It is also suggested that more context helps to interpret the next
letter with a higher probability. However, this always assumes that one
character is identified at a time and that fewer individual gaps are to
be filled, because the experiments in that paper were conducted
primarily on humans. It was assumed that someone who has mastered the
language has a good idea of which letters occur given the surrounding
ones.
Generative Text Compression
Generative text compression programs use the text prediction and
learning capabilities of neural networks. When compressing, they train
the model on the text file that is to be compressed and store the text
and its internal relations in floating-point values, the model’s
weights. The size of the compressed text is then almost only the size of
the trained weights and is therefore much smaller compared to other
algorithms. When decompressing, it is then possible to read the
compressed text from the trained weights again. As these methods involve
training machine learning models and calculating many matrix
multiplications, these algorithms require a relatively large amount of
compute power and memory.
A concrete approach is that of NNCP, in which a transformer model is
trained, which means that the weights can be trained in parallel [cf.
@bellardLosslessDataCompression 1-9]. This results in a large model,
which can then be used again to reproduce the same text. Due to the
possible hardware acceleration and parallel computing, it is somewhat
faster than other generative text compression algorithms, in exchange
for a lower compression ratio.
In contrast, there is CMIX, which does not rely on one large model but
on up to 2,115 different smaller ones, although not all of them have to
be used for text compression [cf. @ByronKnoll]. When compressing, if
there was no preprocessing, the
models are trained similarly to NNCP, although no hardware acceleration
is specifically used in this instance. The finished models then
represent the majority of the compressed size. When decompressing, the
trained models are then retrieved again to restore the file. Here,
different models are used bit by bit, which have specialized in specific
events during training. As a result, it is even better compressed than
NNCP, but has higher compute costs.
Methodology
As this approach to text compression is novel, there are many fundamental questions to address. This section will explore how such systems should fundamentally work, which characters to encode, and how to gather and clean the necessary data for such applications. By systematically addressing these aspects, I aim to provide a comprehensive framework for developing and refining this text compression approach. This will allow for steady improvements to be systematically evaluated by comparisons against previous iterations.
Encoding Foundation
Prediction models utilize preceding characters to determine the current
one, leveraging existing information to assess the most probable next
character with high precision. However, as the diversity of potential
characters increases, the likelihood of misguessing the correct
character also rises. Moreover, the computational expense escalates with
the need to calculate numerous possibilities for each of these diverse
characters. Given the algorithm’s goals of computational and memory
efficiency, limiting the number of possibilities that require
calculation significantly aids in achieving these aims.
When determining what should be compressible, it is essential to decide
which characters must be overall accepted in the filetype. Ideally, the
encoder should be capable of handling all text files to prevent errors
when encountering characters outside the defined compressible set.
Therefore, the goal is to integrate the compression scheme within a
larger encoding like UTF-8, much like ASCII itself is a subset of UTF-8.
With this approach, characters within the compressible list can be
encoded, while the rest are represented using UTF-8. Thus, the question
arises: which characters should be encodable and how?
Encodable Characters
Since I want the encoded character to take less disk space than UTF-8’s most compact encoding, which is 8 bits, there are multiple options, given that every compressed character uses the same number of bits:

- 6-bit per character: works if the algorithm always takes three bytes at once. This would result in 64 possibilities for every character.
- 4-bit per character: every byte contains two characters. This would result in 16 possibilities for every character.
- 3-bit per character: works again if the algorithm takes three bytes and splits them into eight characters. This would result in eight possibilities for every character.
- 2-bit per character: every byte contains four characters. This would result in four possibilities for every character.
Since the characters intended for inclusion comprise the 26 English base
characters, along with the Space and the Period, this totals to 28
different characters. An additional consideration is the encoded
sequence escape character, which necessitates an entire possible
character slot, as misinterpreting it could have catastrophic
consequences, as will be explained later in chapter 3.1.2.
Consequently, it is important to consider the number of states minus one
for the various options.
Given that this investigation aims to ascertain the feasibility of such
compression, allocating 4-bit per character appears to be the most
suitable option. With 15 different states available, which is nearly
half of 28, each state of the 4-bit character could potentially
accommodate two characters, leaving two other states to be encoded
individually. Considering that the Space and sentence separation
character stand as outliers, I will treat these two as singular
entities, with the base characters grouped in pairs of two and assigned
to individual states. So, with 29 characters (26 letters + Space +
Period + Escape), the 16 states of a 4-bit encoding accommodate 15
usable slots, balancing expressiveness and efficiency.
Which characters are encoded as which bits, and how the exact
combinations were derived, will be explained in chapter 4.3.
Decoding Foundation
The encoded format operates in two distinct states within the UTF-8
stream: a standard UTF-8 state and a special encoding state for
compressed characters. To ensure correct interpretation, a mechanism is
required to explicitly signal to the decoder which state is currently
active. This is crucial because attempting to decode specially encoded
characters with a standard UTF-8 decoder will yield errors, and
conversely, applying the special decoder to standard UTF-8 data will
produce nonsensical output.
To address this state differentiation, a designated state transition
marker, henceforth referred to as the break character, is introduced. The
decoding process commences in the standard UTF-8 state. Upon
encountering the break character for the first time, the decoder
transitions into the special decoding state. Subsequent encounters of
the break character trigger a reversion back to the standard UTF-8
state, effectively toggling between decoding modes.
However, a potential ambiguity arises from the structure of UTF-8
itself. While a single byte beginning with ‘1111’ could be used as a
break character in principle, this prefix also marks the start of 32-bit
UTF-8 encoded characters. To avoid misinterpreting the break character
as the beginning of a 32-bit UTF-8 sequence, a robust disambiguation
strategy is necessary.
Therefore, the state transition into the special decoding mode is
signaled by a sequence of two consecutive break characters, represented
as a single byte ‘11111111’. This double break character sequence is
highly unlikely to occur within standard UTF-8 encoded text and thus
serves as an unambiguous delimiter for initiating the special decoding
state. The single break character, represented as ‘1111’ within a byte,
is reserved for signaling the end of a sequence of specially encoded
characters and the return to the standard UTF-8 state.
For consistency and simplified decoding logic, even when an even number
of compressed characters are encoded consecutively, necessitating a byte
boundary immediately after the last character pair, a second break
character is appended. This ensures that the state transition back to
UTF-8 mode is consistently and explicitly marked, regardless of the
length of the preceding encoded sequence.
As a consequence of introducing these break characters, it is important
to note that while the special encoding achieves a 50% reduction in size
for the encoded characters themselves, the inclusion of break characters
introduces overhead. Therefore, to ensure effective compression, the
encoder is designed to only initiate the special encoding if a minimum
of twelve consecutive compressible characters are encountered. This
threshold mitigates the overhead of break characters and ensures a net
reduction in file size in practical scenarios.
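The framing described above can be sketched as follows. This is a
simplified illustration: the constant names, the helper function, and
the exact handling of an odd leftover nibble are assumptions for
demonstration, not the program’s actual code.

```rust
// Sketch of the break-character framing: a double break byte (0b1111_1111)
// opens an encoded run, and the break nibble (0b1111) closes it.
const DOUBLE_BREAK: u8 = 0b1111_1111;
const BREAK_NIBBLE: u8 = 0b1111;

// Frame a sequence of already-encoded 4-bit values into output bytes.
fn frame_encoded_run(nibbles: &[u8]) -> Vec<u8> {
    let mut out = vec![DOUBLE_BREAK]; // enter the special decoding state
    for pair in nibbles.chunks(2) {
        if pair.len() == 2 {
            out.push((pair[0] << 4) | pair[1]);
        } else {
            // Odd leftover nibble: collapse it with the closing break nibble.
            out.push((pair[0] << 4) | BREAK_NIBBLE);
            return out;
        }
    }
    // Even count: an explicit closing break byte is still appended, so the
    // switch back to UTF-8 mode is always marked.
    out.push((BREAK_NIBBLE << 4) | BREAK_NIBBLE);
    out
}

fn main() {
    let framed = frame_encoded_run(&[0b0000, 0b1001, 0b0101]);
    // 0xFF opens, 0x09 packs the first pair, 0x5F closes with the break nibble.
    assert_eq!(framed, vec![0xFF, 0x09, 0x5F]);
    println!("{framed:02X?}");
}
```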
Data
Following this, a significant amount of data is needed to achieve accurate compression and decompression with this statistical model. Since the algorithm is primarily data-driven, the required data is highly dependent on the algorithm and vice versa. Therefore, more data than needed has been collected, but only relevant snippets will be presented in section 3.3.2.
Which Data to Collect
The overall goal is to use probabilities to convert a half-precision
character to a fully defined one. This will be represented as a function
that takes in the half-precision bit and other environmental variables
to return a satisfying output. The data that needs to be collected
should help the function evaluate the specific half-precision character
in a specific context.
The easiest context to obtain, and which will be used, is the characters
surrounding the half-precision character. Therefore, the data needed
should represent the correlation between the surrounding characters and
the one in question. For example, if there is an "I" right after the
character in question, is it more likely to be an "F" or a "D"?
For a visual depiction, refer to figure 2.
There could be countless ways of grouping different characters together
and using this information to decode the half-precision character. How
these methods compare is a topic for other papers. This paper will use
the probabilities of a specific target character occurring, given the
context characters found at various distances. The collected data also
represents how much a given character influences the neighboring
characters.
Context Window
It might be possible but not feasible, given the collected data, to look
at all the characters before and all the characters after the character
in question. The first argument against this is the sheer amount of data
that would have to be collected. Given the function
$p(\text{char1}, \text{char2}, \text{distance})$, which takes in
character 1, character 2, and a distance, p returns the probability that
when we look at character 1 in an English text, how likely is it to
encounter char2 within the specified distance.
Disregarding the possibility that the text may contain characters
outside the compressible set of 29, the amount of data that has to be
stored grows linearly as 841 (that is, 29 × 29) times the distance.
Diagram 3 also shows how the divergence between the average entropy and
the distance-specific entropy of every character falls off the farther
away it is: the higher the value, the more predictable the 0th character
is from that position. Average entropy here means the overall average
occurrence of a character in a text, so the graphic shows the decay of
predictability with distance. Following this, it seemed most efficient
to provide the function with the four preceding characters and the three
subsequent ones, since for characters further apart the influence of the
middle character is negligible.
$$\begin{aligned}
& y = \frac{1}{29 \cdot 29} \sum^{28}_{a=0} \sum^{28}_{i=0} \left(p(c[a], c[i], x) - e_{a}[c[i]]\right)^2\\
& c = \text{array of decodable characters}\\
& e_{a} = \text{array of average entropy in English}\\
& p = p(\text{char}, \text{char}, \text{int}) \rightarrow \text{64-bit float}
\end{aligned}$$
It calculates the average squared distance between the observed values and the English text entropy for every character at every distance. The graphic plots the average of these distances against the distance at which it was calculated.
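For one fixed distance, this divergence can be computed as in the
following sketch, where the probability table `p` is replaced by a toy
in-memory matrix and the function name is illustrative:

```rust
// Mean squared divergence between distance-specific probabilities and the
// distance-independent average frequencies, per the formula above.
// `p[a][i]` stands for p(c[a], c[i], x) at one fixed distance x,
// `e[i]` for the average frequency of character c[i] in English.
fn divergence_at_distance(p: &[Vec<f64>], e: &[f64]) -> f64 {
    let n = e.len() as f64;
    let sum: f64 = p
        .iter()
        .map(|row| {
            row.iter()
                .zip(e)
                .map(|(pi, ei)| (pi - ei).powi(2))
                .sum::<f64>()
        })
        .sum();
    sum / (n * n) // average over all character pairs
}

fn main() {
    // Toy 2x2 example instead of the full 29x29 table.
    let p = vec![vec![0.5, 0.5], vec![1.0, 0.0]];
    let e = vec![0.5, 0.5];
    let y = divergence_at_distance(&p, &e);
    // Contributions: 0 + 0 + 0.25 + 0.25, averaged over 4 entries.
    assert!((y - 0.125).abs() < 1e-12);
    println!("y = {y}");
}
```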
Data Collection and Preprocessing
Large amounts of English text are needed to extract the frequency of each character following or preceding different characters and then normalize the occurrences. Since language models require the same basic information, it is easy to find such datasets. For this use case, the cleaned-up English Wikipedia was used, which comprises around 20 GB of English text data without significant markup. The data was then saved into three-dimensional HDF5 dataframes and manually verified before further processing. The dimensions of this data frame were: the character that will later be compressed, the character that appears in the first one’s proximity, and the distance at which the appearance occurred.
Processing
The data extraction was done as follows. A program iterates over a given
text, which in this case was the English Wikipedia dump.
When the program finds an acceptable character, it records the distances
to other nearby acceptable characters within the context window and
increments the corresponding counts. When finished, the program writes
this three-dimensional array, with the absolute values, to the
previously described HDF5 file. The absolute values describe how often
the characters are found in the text, but their distribution is very
uneven, and models based on them tend to just pick the on-average more
likely character of a character combination. This is why the data then
gets normalized based on the frequency distribution of neighboring
characters found for each base character at each specific distance.
Multiple methods of normalization were tested, but the following formula
resulted in the best outcome.
$x = \frac{x - \mu}{\sigma}$
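The counting and normalization steps can be sketched as follows. The
function names and the in-memory `HashMap` are illustrative assumptions;
the actual program writes the counts into the HDF5 array described
above.

```rust
use std::collections::HashMap;

// Count, for each (base, neighbor, distance) triple, how often `neighbor`
// occurs at that signed distance from `base`. The window -4..=3 mirrors
// the context window chosen in the text.
fn count_pairs(text: &str) -> HashMap<(char, char, i32), u64> {
    let chars: Vec<char> = text.chars().collect();
    let mut counts = HashMap::new();
    for (i, &base) in chars.iter().enumerate() {
        for d in -4i32..=3 {
            if d == 0 {
                continue;
            }
            let j = i as i32 + d;
            if j >= 0 && (j as usize) < chars.len() {
                *counts.entry((base, chars[j as usize], d)).or_insert(0) += 1;
            }
        }
    }
    counts
}

// Z-score normalization per the formula above: x' = (x - mean) / std_dev.
fn zscore(values: &[f64]) -> Vec<f64> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let sd = var.sqrt();
    values
        .iter()
        .map(|v| if sd == 0.0 { 0.0 } else { (v - mean) / sd })
        .collect()
}

fn main() {
    let counts = count_pairs("to be");
    // 'e' occurs once at distance +1 from 'b'.
    assert_eq!(counts[&('b', 'e', 1)], 1);
    let normalized = zscore(&[2.0, 4.0, 4.0, 6.0]);
    // Mean 4, standard deviation sqrt(2).
    assert!((normalized[0] + 2.0_f64.sqrt()).abs() < 1e-12);
    println!("{normalized:?}");
}
```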
The function $p(\text{char}, \text{char}, \text{int})$ described in figure 2 is now a three-dimensional index into this HDF5 matrix, and the output is the normalized data, saved as 64-bit floating points, that was collected during this process.
Demonstration
This section is a short demonstration of how the collected data looks.
The character b is used as an example. Graphics 4, 5, 6 and 7 show, if a
"b" is encountered in an English text, how likely it is to find a
specific character when going backwards 1-4 steps.
Validation
To test and validate the performance and correctness of the program at every stage of development, I built a small testing suite. The program is tested on a text file, encoding and decoding it. After the process is completed, the program stops the timer and checks for differences between the converted file and the original. This comparison is made easier by the fact that, regardless of how incorrect the result is, it will always contain the same number of characters. The program calculates an overall accuracy and prints which characters were guessed incorrectly and how often. Then, it saves the results to compare them to the next test run.
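The core accuracy metric can be sketched as a positional character
comparison, relying on the length-preserving property just mentioned
(the function name is illustrative, not the suite’s actual code):

```rust
// Character-level accuracy between the original and the round-tripped text.
// The lossy scheme preserves character count, so a positional comparison works.
fn accuracy(original: &str, decoded: &str) -> f64 {
    let orig: Vec<char> = original.chars().collect();
    let dec: Vec<char> = decoded.chars().collect();
    assert_eq!(orig.len(), dec.len(), "lossy codec must preserve length");
    let correct = orig.iter().zip(&dec).filter(|(a, b)| a == b).count();
    correct as f64 / orig.len() as f64
}

fn main() {
    // One misguessed character out of eleven.
    let acc = accuracy("hello world", "hallo world");
    assert!((acc - 10.0 / 11.0).abs() < 1e-12);
    println!("accuracy: {:.2}%", acc * 100.0);
}
```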
Encoder
The encoding process, depicted in figure 8, is managed by a state
machine operating in two distinct states: encodable and not encodable.
For a detailed discussion on input handling, refer to section 4.1. For
this algorithmic description, it is assumed that the program does
character-by-character processing of a UTF-8 string with indexed access
to all characters.
Initially, the encoder resides in the not encodable state. Incoming
characters are temporarily stored in a rotation array buffer. As long as
twelve consecutive encodable characters are not detected, the buffer’s
content is flushed, and characters are directly written to the output
file as standard UTF-8 encoded text.
Upon identifying twelve consecutive encodable characters, the state
machine transitions to the encodable state. Prior to entering this
state, the first four characters from the rotation array are written as
UTF-8 to provide context for the decoder. Subsequently, an encoding
break sequence is inserted to signal the commencement of the special
encoding. Character pairs from the rotation array are then processed by
the collapse function (detailed in section 4.2 and
visualized in figure 8), which converts
pairs of 4-bit representations into 8-bit units for output to the
encoded file. In the encodable state, the algorithm continues to check
each incoming character for encodability. For every two encodable
characters processed, one collapsed 8-bit unit, representing a character
pair from the buffer, is written to the encoded file. This ensures a
backlog of at least three theoretically encodable, but not yet written,
characters.
If a non-encodable character is encountered while in the encodable
state, or upon reaching the end of the input file, the state machine
reverts to the not encodable state. Before fully transitioning, any
remaining encodable characters in the rotation array backlog are
handled. If four unencoded characters remain, the first is collapsed
with a break character and written to file. If only three remain, a
double break character sequence is written to file. Any remaining
characters, including the encountered non-encodable character, are then
written to file as standard UTF-8, and the state machine returns to the
not encodable state.
Input Buffer
Buffering text is a method to enhance the speed of IO applications that heavily depend on frequent character indexing. If no buffer is used, single-character indexing would theoretically require a system call to the kernel for every character passed to the program. This call reads from disk, which is very slow compared to reading data from random access memory. A buffer of a given size reads portions of the file into memory instead of one character at a time. It then serves these portions to the program as if they were read directly from the file. This approach drastically reduces the total number of system calls, resulting in faster performance. This can be seen in figure 9. The figure shows two peaks: the one that uses the buffer takes only 0.2 $\mu$s, while the version without a buffer takes, most of the time, 1.4 $\mu$s, and not even consistently so. This shows not only that the buffered version is much faster, but also that it is much more reliable regarding speed.
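In Rust, the language the program is written in, such buffering is
available through `std::io::BufReader`. The following sketch uses an
in-memory `Cursor` in place of a file so it is self-contained; the
helper name is illustrative:

```rust
use std::io::{BufReader, Cursor, Read};

// Read a stream one byte at a time through a buffer: BufReader pulls large
// chunks into memory once, then serves single bytes from that chunk instead
// of issuing a kernel read per character.
fn read_all_buffered(data: &[u8]) -> Vec<u8> {
    let mut reader = BufReader::with_capacity(8 * 1024, Cursor::new(data));
    let mut out = Vec::new();
    let mut byte = [0u8; 1];
    while reader.read(&mut byte).unwrap() == 1 {
        out.push(byte[0]);
    }
    out
}

fn main() {
    let text = b"the quick brown fox".to_vec();
    assert_eq!(read_all_buffered(&text), text);
    println!("read {} bytes through the buffer", text.len());
}
```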
Rotation Array
While buffering improves read speeds, it also introduces challenges to
the algorithm’s architecture. These challenges occur when the buffer
switches, but the program needs to access data owned by the previous
buffer. This data is no longer accessible and would result in an error.
The problem arises if the program wants to switch to the encoding state
before twelve characters have passed into the new buffer. If it were to
switch, it would need to look back to write the first characters as
UTF-8, but they would not be accessible.
To bridge this gap, a second buffer was introduced, making it easier to
access passed characters, even if they are theoretically out of scope.
This second buffer is a deque, referred to as a rotating array in the
figure 8. With
each step, it rotates to remove the last character that has already been
written to the file and adds the new one to the front. This ensures that
every character is always easily accessible when needed.
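A minimal sketch of such a rotation array, built on
`std::collections::VecDeque` (the struct and method names are
illustrative assumptions, not the program’s actual code):

```rust
use std::collections::VecDeque;

// Sketch of the rotation array: a fixed-capacity deque that always keeps the
// last `cap` characters reachable, even across read-buffer boundaries.
struct RotationArray {
    buf: VecDeque<char>,
    cap: usize,
}

impl RotationArray {
    fn new(cap: usize) -> Self {
        Self { buf: VecDeque::with_capacity(cap), cap }
    }

    // Push the newest character; the oldest (already written out) falls off
    // and is returned to the caller.
    fn push(&mut self, c: char) -> Option<char> {
        let evicted = if self.buf.len() == self.cap {
            self.buf.pop_front()
        } else {
            None
        };
        self.buf.push_back(c);
        evicted
    }

    fn get(&self, i: usize) -> Option<char> {
        self.buf.get(i).copied()
    }
}

fn main() {
    let mut rot = RotationArray::new(4);
    for c in "hello".chars() {
        rot.push(c);
    }
    // Capacity 4: 'h' was evicted, "ello" remains accessible.
    assert_eq!(rot.get(0), Some('e'));
    assert_eq!(rot.get(3), Some('o'));
    println!("backlog: {:?}", rot.buf);
}
```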
Collapse Function
In the context of the Rust programming language, the language the
program is written in, the decision to use ‘u8’ instead of ‘u4’ is
primarily driven by the language’s inherent characteristics. Rust does
not support data types smaller than 8-bit, including the boolean type
which theoretically could be represented with a single bit, but is
stored as a full byte instead [cf. @BooleanTypeRust].
To circumvent this constraint the collapsing function is implemented.
This function combines two theoretical unsigned 4-bit integers into a
single 8-bit block, which is then written to the output file or buffer.
However, this approach introduces a new challenge: the algorithm
processes one character per cycle, and only saves a character to the
output every other cycle.
To address this, the program maintains its state using the rotation
array. Additionally, a secondary state machine is employed to ensure
that characters are combined and written to the output only every second
cycle.
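The collapse step itself reduces to simple bit operations. A sketch
follows; the inverse `expand` is added here purely for illustration:

```rust
// The collapse function: pack two 4-bit character codes into one u8,
// since Rust has no native u4 type.
fn collapse(high: u8, low: u8) -> u8 {
    debug_assert!(high < 16 && low < 16, "inputs must fit in 4 bits");
    (high << 4) | low
}

// The inverse used during decoding: split a byte back into its two nibbles.
fn expand(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0b0000_1111)
}

fn main() {
    // Using the paper's table: 't'/'r' = 12 (1100) and 'i'/'s' = 8 (1000).
    let packed = collapse(12, 8);
    assert_eq!(packed, 0b1100_1000);
    assert_eq!(expand(packed), (12, 8));
    println!("{packed:08b}");
}
```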
Character Combination Optimisation
As characters are combined in pairs, it is also important how these pairs are arranged together. For instance, if character "A" always follows character "B", and character "C" always follows character "D", combining "A" and "C", and "B" and "D" together would cause problems whenever these two pairs follow each other: the decoder would not know how to decode the characters. Therefore, the encoder must use the most robust character combinations, those with the fewest conflicts, to achieve the best results. With an alphabetical combination, the algorithm reaches an accuracy of 85.95%; the goal is to improve this percentage by changing the character combinations.
Primary Difference
The first approach creates a matrix that shows, for every character, how
different its dependencies are from those of every other character.
$$\text{matrix}[x][y] = \sum_{\substack{d = -4 \\ d \neq 0}}^{3} \sum_{i = 0}^{28} \left(p(c[x], c[i], d) - p(c[y], c[i], d)\right)^{2}$$
- $c = \text{array of decodable characters}$
- $p = p(\text{char: } \text{char}_1, \text{char: } \text{char}_2, \text{int: distance})$ is the relative probability that, given $\text{char}_1$, $\text{char}_2$ occurs at the specified distance.
This 2D matrix now indicates that the higher the number at the
intersection of two characters’ axes is, the more different their
dependencies are. After this process, to determine the best possible
combination, the solution with the highest overall value must be found.
This can be done using the Hungarian algorithm, which is capable of
matching a 2D matrix to obtain the highest overall value [cf.
@https://doi.org/10.1002/nav.3800020109 1-14].
This results in an accuracy of 91.1%.
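A sketch of how this matrix could be computed, assuming the probabilities are stored as a dense 3-D array indexed `[character][context character][distance]` (the real program's data layout may differ):

```rust
// Build the pairwise difference matrix from the formula above.
// p[x][i][d] holds the relative probability that character i occurs at
// the d-th distance (of -4..=3, excluding 0) from character x.
const CHARS: usize = 29; // decodable characters
const DISTS: usize = 7;  // distances -4..=3, excluding 0

fn difference_matrix(p: &[[[f64; DISTS]; CHARS]; CHARS]) -> [[f64; CHARS]; CHARS] {
    let mut m = [[0.0; CHARS]; CHARS];
    for x in 0..CHARS {
        for y in 0..CHARS {
            for i in 0..CHARS {
                for d in 0..DISTS {
                    let diff = p[x][i][d] - p[y][i][d];
                    m[x][y] += diff * diff;
                }
            }
        }
    }
    m
}

fn main() {
    // With identical probability profiles, every entry is zero.
    let p = [[[0.1; DISTS]; CHARS]; CHARS];
    let m = difference_matrix(&p);
    assert!(m.iter().flatten().all(|&v| v == 0.0));
}
```

The resulting matrix is then fed to the Hungarian algorithm to find the assignment with the highest overall dissimilarity.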
Gradient Descent
| Character 1 | Character 2 | BINARY | DECIMAL |
|---|---|---|---|
| f | d | 0000 | 0 |
| p | k | 0001 | 1 |
| c | l | 0010 | 2 |
| g | o | 0011 | 3 |
| n | w | 0100 | 4 |
| a | z | 0101 | 5 |
| h | x | 0110 | 6 |
| m | q | 0111 | 7 |
| i | s | 1000 | 8 |
| j | e | 1001 | 9 |
| y | v | 1010 | 10 |
| b | u | 1011 | 11 |
| t | r | 1100 | 12 |
| ’ ’ | | 1101 | 13 |
| . | | 1110 | 14 |
| break | | 1111 | 15 |
Character 1 and Character 2 are both represented by the same number, shown here in binary and decimal.
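The table above can be sketched as a lookup structure: encoding maps a character to its 4-bit code, and decoding yields the two candidates the decoder must later choose between. The representation of the break character as a newline is an assumption for illustration:

```rust
// The pairing table: each 4-bit code maps to up to two characters.
// '\n' stands in for the "break" entry here (an assumption).
const PAIRS: [(char, Option<char>); 16] = [
    ('f', Some('d')), ('p', Some('k')), ('c', Some('l')), ('g', Some('o')),
    ('n', Some('w')), ('a', Some('z')), ('h', Some('x')), ('m', Some('q')),
    ('i', Some('s')), ('j', Some('e')), ('y', Some('v')), ('b', Some('u')),
    ('t', Some('r')), (' ', None), ('.', None), ('\n', None),
];

fn encode(ch: char) -> Option<u8> {
    PAIRS.iter()
        .position(|&(a, b)| a == ch || b == Some(ch))
        .map(|i| i as u8)
}

fn decode(code: u8) -> (char, Option<char>) {
    PAIRS[code as usize]
}

fn main() {
    assert_eq!(encode('t'), Some(12));
    assert_eq!(encode('r'), Some(12)); // 't' and 'r' share code 1100
    assert_eq!(decode(12), ('t', Some('r')));
    assert_eq!(encode('Z'), None);     // uppercase letters are not encodable
}
```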
Gradient descent is another way of optimizing parameters if it is
possible to change the parameters batch-wise and analyze the results
using a single metric. The batches are character combinations that can
be changed randomly. The metric used to evaluate how well it performed
is the accuracy of the algorithm, evaluated by my evaluation program
described in section 3.4{reference-type=“ref”
reference=“validation”} [cf. @DBLP:journals/corr/Ruder16 2-9]. The
algorithm changes two pairs, called batches in this context, and tests
whether the accuracy has improved. If it has, it keeps the new
combination; if not, it reverts to the old one. By repeating this
process, the accuracy will very likely reach a plateau, corresponding to
a theoretically optimal character combination table in which the
combined characters are the least likely to overlap.
This has been done with four, three, and two character combinations at a
time; the runs plateaued not on the same combination, but at the same
accuracy percentage. A dynamic variant, in which the batch size changes,
has also been tried, but it did not change the outcome. The graphic
11{reference-type=“ref” reference=“gradienddescend”}
shows the progression with batches of size two, where two character
combinations are randomly exchanged. It plateaus at an accuracy of 92.63%.
The combination can be found in figure
10{reference-type=“ref”
reference=“charactercombinationfig”}.
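The swap-and-test step described above can be sketched as follows. The real metric is a full evaluation run of the compressor; the `accuracy` function below is a dummy stand-in so the sketch is self-contained:

```rust
// Dummy objective: reward alphabetically close partners (illustration
// only; the real metric is the reconstruction accuracy of the decoder).
fn accuracy(pairs: &[(char, char)]) -> f64 {
    pairs.iter()
        .map(|&(a, b)| -((a as i32 - b as i32).abs() as f64))
        .sum()
}

// Swap the second members of two pairs; keep the change only if the
// metric improves, otherwise revert.
fn swap_step(pairs: &mut Vec<(char, char)>, i: usize, j: usize) -> bool {
    let before = accuracy(pairs);
    let (bi, bj) = (pairs[i].1, pairs[j].1);
    pairs[i].1 = bj;
    pairs[j].1 = bi;
    if accuracy(pairs) > before {
        true // keep the new combination
    } else {
        pairs[i].1 = bi; // revert
        pairs[j].1 = bj;
        false
    }
}

fn main() {
    let mut pairs = vec![('a', 'z'), ('y', 'b')];
    // Swapping to ('a','b') and ('y','z') brings partners closer together.
    assert!(swap_step(&mut pairs, 0, 1));
    assert_eq!(pairs, vec![('a', 'b'), ('y', 'z')]);
}
```

Repeating `swap_step` with randomly chosen indices until no swap improves the metric yields the plateau behaviour described above.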
Given the validation test, the current configuration most often confuses
"r" with "t", while the reverse, "t" mistaken for "r", is comparatively
rare. With a quarter as many error occurrences, the next weak points are
"c" with "l" and "n" with "w". This might be explained by the fact that
the algorithm treats "t" and "r" equally, as the data has been
normalized, while the overall probability of "t" occurring in an English
text is significantly higher than that of an "r" [cf. @salomon2002data
2-14]. One has to note, however, that the probability gap between "n"
and "w" is significantly larger.
Compression
In theory, if all characters were encodable, the algorithm would approach a 50% compression ratio, where $C$ is a fixed amount of overhead: $$\lim_{x \to \infty} \frac{x/2 + C}{x} = 0.5$$ As not all characters are compressible, the compression ratio depends on the context the algorithm is used in. For example, the text from the Wikipedia dump has been analyzed, and only 72.7% of its characters are compressible. The Book is the novel Frankenstein by Mary Wollstonecraft Shelley and serves as an example of an average text file; 84.2% of its characters are theoretically compressible. The Text consists of short stories written by ChatGPT; it contains no non-ASCII characters and represents plain English text. The figure 12{reference-type=“ref” reference=“compressable”} shows how large the encoded version is compared to the original file.
| File | Compressed size | Compressible |
|---|---|---|
| Wikipedia | 77.4% | 72.7% |
| Book | 72.1% | 84.2% |
| Text | 65.7% | 93.2% |
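A quick numerical check of the limit above: with a fixed overhead $C$, the ratio $(x/2 + C)/x$ approaches 0.5 as the input length $x$ grows.

```rust
// Compression ratio of an input of length x with fixed overhead c,
// assuming every character is encodable (the theoretical best case).
fn ratio(x: f64, c: f64) -> f64 {
    (x / 2.0 + c) / x
}

fn main() {
    let c = 100.0; // hypothetical fixed overhead in bytes
    assert!((ratio(1_000.0, c) - 0.6).abs() < 1e-9);     // overhead still visible
    assert!((ratio(1_000_000.0, c) - 0.5).abs() < 1e-3); // approaching the limit
}
```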
Given that speed optimizations beyond input buffering have not been
implemented, direct performance comparisons offer limited definitive
value at this stage. Furthermore, a precise benchmark against other
algorithms is inherently challenging due to the novelty of this lossy
text compression approach, for which directly comparable methods are
scarce. The inclusion of ZIP 3.0, compiled with GCC 13.2.1 using the
-O3 argument, serves primarily to establish a theoretical baseline
for performance relative to a widely recognized and efficient lossless
compression algorithm. Therefore, the presented speed values in figure
13{reference-type=“ref”
reference=“compressionspeed”} and Table
13{reference-type=“ref”
reference=“compressionspeed”} should be interpreted as indicative of a
potential performance level, rather than absolute benchmarks for
optimized implementations.
| Time | This algorithm | ZIP -1 |
|---|---|---|
| Wikipedia | 266.36 secs | 259.00 secs |
| Text | 8.43 millis | 8.59 millis |
The speed tests have been run three times and the average has been
taken, but as the hardware is not representative, the values should only
be compared against each other and can differ depending on the
underlying system.
Decoder
The decoding process, illustrated in figure
14{reference-type=“ref” reference=“decoder”}, is also
governed by a state machine with two states: decoding and not decoding.
This section describes the theoretical operation, assuming
character-by-character input and indexed access to all characters. The
decoder initiates in the not decoding state, transitioning to the
decoding state upon encountering the designated break character
sequence.
In the not decoding state, the algorithm operates as a standard UTF-8
decoder, directly writing each byte to the output file or buffer without
further processing. Upon entering the decoding state, triggered by a
break character, the processing diverges. Instead of direct output, each
incoming byte, representing a compressed character pair, is passed to
the Decoder function, depicted in detail within figure
14{reference-type=“ref” reference=“decoder”}.
For each compressed byte, the Decoder function generates two potential
character possibilities, denoted as "Possibility A" and "Possibility
B" in figure 14{reference-type=“ref” reference=“decoder”},
corresponding to the two possible characters encoded within the 4-bit
representation. For each character, the algorithm proceeds through the
following steps:
- Gather Possibilities: The Gather possibilities module analyzes the surrounding characters within the defined context window, as discussed in section 3.2.2{reference-type=“ref” reference=“contextwindow”}, to collect relevant contextual information for both "Possibility A" and "Possibility B".
- Activation Function: The contextual information is then processed by the Activation function module. This module, as detailed in section 5.2.1{reference-type=“ref” reference=“activationfunktion”}, applies a non-linear activation function to the probabilities associated with each character possibility in the given context, refining their relevance for subsequent evaluation.
- Relative Multiplication: Following activation, the Relative multiplication module, described in section 5.2.2{reference-type=“ref” reference=“relativmultiplication”}, applies distance-based weighting to the probabilities. This step adjusts the influence of contextual characters based on their proximity to the character being decoded, enhancing the accuracy of the probability assessment.
- Compare: Finally, the Compare module evaluates the processed probabilities for both "Possibility A" and "Possibility B". By comparing the aggregated and weighted probabilities, the algorithm determines the most probable character of the two possibilities and outputs it to the decoded file. For example, if "Possibility A" exhibits significantly higher contextual probability, it is selected as the decoded character.
The decoder remains in the decoding state, processing subsequent compressed bytes through the Decoder function, until another break character sequence is encountered. Upon detecting a break character in the decoding state, the state machine reverts back to the not decoding state, resuming standard UTF-8 decoding for the subsequent input stream.
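The four steps above can be condensed into a sketch of the final comparison. Function names are illustrative; the weights mirror the distance-weight table given later in this paper, and tanh stands in for the activation function:

```rust
// Weights for distances -4..=3 (excluding 0), in order.
const WEIGHTS: [f64; 7] = [1.05, 1.09, 1.3, 2.0, 2.0, 1.4, 1.16];

// Score one candidate: activate each contextual probability, apply the
// distance weighting, and sum the results.
fn score(context_probs: &[f64; 7]) -> f64 {
    context_probs.iter()
        .zip(WEIGHTS.iter())
        .map(|(p, w)| p.tanh() * w) // activation, then relative multiplication
        .sum()
}

// Compare the two possibilities and emit the more probable character.
fn compare(probs_a: &[f64; 7], probs_b: &[f64; 7], a: char, b: char) -> char {
    if score(probs_a) >= score(probs_b) { a } else { b }
}

fn main() {
    let a = [0.4; 7]; // "Possibility A" is clearly better supported by context
    let b = [0.1; 7];
    assert_eq!(compare(&a, &b, 't', 'r'), 't');
}
```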
Gather Possibilities
The Gather function searches for the necessary context in the
surrounding environment and presents it to the activation function. When
the context is fully contained within the buffer, the gatherer can
easily reference it. However, as discussed in
4.1{reference-type=“ref” reference=“inputbuffer”} ,
problems arise when the buffer goes out of range. To address this, the
decoder employs a secondary buffer, similar to the encoder, to bridge
this gap.
In the decoding function, unlike the encoding function, every ‘u8’ may
represent two characters. Therefore, the Gather function must also
determine the length of each ‘u8’ and split them accordingly before
passing them to the activation function, as some might be collapsed
characters. However, the decoder may encounter situations where it is
uncertain whether a character should be handled as UTF-8 or not. In such
cases, the Gather function has to scan linearly through the surrounding
characters to determine the state before saving it to the buffer. This
step is inherently serial and therefore cannot currently be parallelized.
Optimisation
The way the algorithm handles probabilities has a significant impact on its performance. To sharpen the differences between candidate probabilities, the algorithm applies two functions to them. These functions have been shown to improve the average accuracy of the algorithm to 93.62% on text that has been compressed to 65.7% of its original size.
Activation Function
Activation functions are commonly used in machine learning applications
to handle the relevance of data, which in this case are the
probabilities. The primary objective of an activation function is to
introduce non-linearity into the model, which has been shown to produce
better results while being computationally inexpensive [cf.
@DBLP:journals/corr/abs-2109-14545 3-10].
In this algorithm, every probability is passed through an activation
function before being used in subsequent calculations. One key
difference between this algorithm and typical machine learning
applications is that the probability values used in this algorithm will
never be negative or zero. This is because every character has been
observed at least once at every distance to every other character.
Therefore, the activation function used in this algorithm must be able
to handle positive input values. Different activation functions have
been tested and evaluated on the algorithm, with some presented in
figure 15{reference-type=“ref” reference=“activation”}.
Of these, Tanh delivers the best accuracy; in the context of this
algorithm it shrinks the smaller values disproportionately, suggesting
that it filters out low-probability noise.
| Activation | Accuracy |
|---|---|
| Sigmoid | 92.632% |
| SELU | 92.60% |
| Tanh | 92.78% |
It has to be noted that many of the activation functions proposed in the cited paper have not been tested, since they would require a $\beta$ parameter that has to be determined by training, which is the subject of another paper.
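The three activations from the table can be written out as follows. The SELU constants are the standard published values; inputs in this algorithm are always strictly positive, as noted above:

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn selu(x: f64) -> f64 {
    const LAMBDA: f64 = 1.0507;
    const ALPHA: f64 = 1.67326;
    if x > 0.0 { LAMBDA * x } else { LAMBDA * ALPHA * (x.exp() - 1.0) }
}

fn main() {
    // For the strictly positive inputs used here, SELU is a linear
    // scaling, sigmoid maps into (0.5, 1.0), and tanh stays below x.
    assert!((selu(0.5) - 1.0507 * 0.5).abs() < 1e-12);
    assert!(sigmoid(0.5) > 0.5 && sigmoid(0.5) < 1.0);
    assert!((0.5f64).tanh() < 0.5);
}
```

Tanh is available directly as `f64::tanh` in the Rust standard library, which keeps this stage computationally cheap.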
Relative Multiplication
As demonstrated in section 3.2.2{reference-type=“ref” reference=“contextwindow”}, the relevance of a character to another character decreases as the distance between them increases. Previously, all seven probabilities were treated equally, but as shown, they are not equally relevant to a given character. To account for this discrepancy, the probabilities are weighted based on the data presented in figure 3{reference-type=“ref” reference=“entropyfig”}, which results in better accuracy: the individual probabilities at each distance are adjusted according to their relevance. Using this weighting scheme as a starting point and applying gradient-descent-like algorithms to optimize the accuracy of the model results in a 0.84% improvement in accuracy, indicating that this approach effectively captures the varying degrees of relevance between characters.
| Distance | Base | Plateau |
|---|---|---|
| -4 | 1.05 | 1.05 |
| -3 | 1.09 | 1.09 |
| -2 | 1.3 | 1.3 |
| -1 | 2.0 | 2.0 |
| 1 | 1.8 | 2.0 |
| 2 | 1.4 | 1.4 |
| 3 | 1.16 | 1.16 |
Adjust Weights
The weights themselves are still normalized and are prone to some error.
To improve the accuracy of the model for real-world use, a weight
optimizer has been implemented. This optimizer takes in text and trains
the model’s weights on this text. Similar to the NNCP algorithm
mentioned earlier, this optimizer could be used to train the model for
specific use cases or vocabulary. This method could prove particularly
useful in applications where the vocabulary or language usage is highly
specialized or unique. As this algorithm aims to provide a base model
for general English text, use-case-specific optimization is left to
other papers.
To achieve a model for general English, a dataset of 24,000 characters
was used, of which 4,000 were randomly selected from Wikipedia articles,
and 20,000 were from short stories written by ChatGPT. The test short
story was not included in the training dataset to avoid bias. Different
step sizes and epoch iterations were tried to find the optimal values.
After several iterations, a step size of 0.00025 yielded the best
results, with an accuracy of 93.62%.
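The weight optimizer can be sketched as a nudge-and-keep loop: each weight is shifted by the step size in both directions, and the change is kept only when the accuracy metric improves. The metric below is a dummy stand-in for the full evaluation run:

```rust
const STEP: f64 = 0.00025; // the step size that yielded the best results

// One optimization pass over all seven distance weights.
fn optimize(weights: &mut [f64; 7], accuracy: impl Fn(&[f64; 7]) -> f64) {
    for i in 0..7 {
        for delta in [STEP, -STEP] {
            let before = accuracy(weights);
            weights[i] += delta;
            if accuracy(weights) <= before {
                weights[i] -= delta; // no improvement: revert
            }
        }
    }
}

fn main() {
    // Dummy objective for illustration: a larger first weight is better.
    let mut w = [1.0; 7];
    optimize(&mut w, |w| w[0]);
    assert!(w[0] > 1.0);
}
```

Running such a pass repeatedly over many epochs on the training text corresponds to the training procedure described above.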
Output
The output is now rather accurate and mostly readable. The following is a short example of a compressed text that has been decompressed with all adjustments and optimisations presented in this paper; each decoded line is shown above the corresponding original:
Once bpon a time, in the bustling mettopolis of London,
Once upon a time, in the bustling metropolis of London,
there lived a man wamed Ben.
there lived a man named Ben.
The algorithm makes some mistakes, but all words keep their original length, and the errors are rather easy to spot: if a character was identified wrongly, there is only one other option it could have been.
Discussion
In this section, the results of the investigation will be presented and their implications discussed. Additionally, potential follow-up studies aimed at exploring alternative methods of interpreting probabilities to enhance text compression using character correlations will be outlined. Furthermore, possible optimizations that could improve both speed and accuracy, as well as potential future use cases will be discussed.
Results
This study investigated the feasibility of lossy text compression
utilizing statistical probabilities and character correlations. The
developed algorithm successfully demonstrated the viability of this
approach, achieving a character reconstruction accuracy of 93.62% on
English text. This accuracy represents an improvement from an initial
85.95%, realized through iterative optimization and refinement of
the algorithm, as detailed in section
5{reference-type=“ref” reference=“decodersection”}.
Furthermore, performance evaluations indicate that the compression and
decompression speed of the presented algorithm is comparable to that of
the fast mode of the widely adopted lossless ZIP compression algorithm.
As shown in figure 13{reference-type=“ref”
reference=“compressionspeed”} and
Table 13{reference-type=“ref”
reference=“compressionspeed”}, the processing time is within a similar
range to ZIP -1.
Analysis of the decompressed output reveals a high degree of semantic
preservation, despite the lossy nature of the compression. The observed
errors primarily manifest as predictable character substitutions,
maintaining the original character count and presenting only one
possible alternative for each potentially incorrect character. This
characteristic suggests a low noise ratio in the decompressed text, with
errors largely confined to orthographic variations.
Character Relations
| Combination | 0 | 1 | 2 | 3 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| F and D | 0.180 | 0.196 | 0.253 | 0.359 | 0.314 | 0.251 | 0.224 |
| P and K | 0.190 | 0.231 | 0.258 | 0.315 | 0.304 | 0.230 | 0.213 |
| C and L | 0.186 | 0.196 | 0.245 | 0.331 | 0.368 | 0.266 | 0.213 |
| G and O | 0.187 | 0.201 | 0.260 | 0.376 | 0.421 | 0.221 | 0.195 |
| N and W | 0.185 | 0.200 | 0.239 | 0.352 | 0.339 | 0.253 | 0.221 |
| A and Z | 0.192 | 0.211 | 0.233 | 0.386 | 0.444 | 0.215 | 0.194 |
| H and X | 0.196 | 0.210 | 0.285 | 0.488 | 0.258 | 0.218 | 0.199 |
| M and Q | 0.205 | 0.217 | 0.224 | 0.252 | 0.367 | 0.282 | 0.209 |
| I and S | 0.188 | 0.209 | 0.224 | 0.391 | 0.468 | 0.225 | 0.199 |
| J and E | 0.196 | 0.204 | 0.231 | 0.364 | 0.410 | 0.223 | 0.208 |
| Y and V | 0.190 | 0.221 | 0.240 | 0.346 | 0.362 | 0.232 | 0.212 |
| B and U | 0.185 | 0.196 | 0.236 | 0.458 | 0.430 | 0.206 | 0.200 |
| T and R | 0.182 | 0.210 | 0.262 | 0.383 | 0.315 | 0.251 | 0.209 |
| K and L | 0.202 | 0.237 | 0.259 | 0.346 | 0.277 | 0.224 | 0.206 |
As this algorithm relies heavily on character correlation, specifically
for single characters within a context window of 7, significant insights
have been gathered by training a gradient boosted tree using the Google
yggdrasil library [cf. @GBBSP23 1-10]. Some of these findings are shown
in figure 16{reference-type=“ref” reference=“effekt”}, where
index 4 has been omitted as it represents the character itself. The data
demonstrates that surrounding characters contain enough information to
predict the next character with a reasonable degree of confidence,
though not all characters contribute equally. Characters at positions -1
and 1 are the most influential, but their relative importance varies
depending on the specific character combination. Additionally, the rate
at which characters lose relevance is not consistent, but even the least
relevant character in the most extreme tested scenario retains about 40%
of the relevance of the most important one. Therefore, this information
should not be considered redundant.
Since this area has only been marginally explored, there remains
significant room for further improvement and experimentation.
Further Investigations
This paper serves as a proof of concept, demonstrating that this sort of
compression works at a small scale using minimalistic statistical
machine learning models. As even this relatively simple approach,
compared to the later presented ideas and the currently developed
generative compression algorithms, proved successful, confidence is
given to explore more advanced and challenging algorithms to further
improve accuracy towards the 99.9% mark.
Currently, the probabilities between single characters at specific
distances are analyzed. However, it could be hypothesized that
characters in English are not isolated but embedded in character groups,
syllables and words. Thus, examining them separately may result in
suboptimal performance. Future research should investigate how character
prediction models would behave if pairs of characters were analyzed
instead of single characters. These n-grams might represent the English
language better than characters analyzed separately.
Furthermore, the use of tokenizers to optimize n-gram pairs could be
explored, with the aim of grouping the most relevant pairs together. Of
particular interest would be the development of encoding methods for
such multi-character combinations, as it would no longer be as simple as
grouping two encoded pairs together. Additional studies could
investigate alternative ways of interpreting probabilities by leveraging
architectures like decision trees. From the current perspective, random
forests and classic neural networks could yield interesting results, as
the current decoder’s comparison step acts as a non-linear, binary
classifier. Finally, all these advancements could be combined and tested
together to determine the full potential of this type of compression.
Optimization
The environment in which these characters are encoded and decoded is
also important and could be optimized to yield better performance.
Compute cost is an important metric when evaluating compression
algorithms, as it often determines whether running the algorithm is
feasible compared to storing the data uncompressed. Currently,
the speed is comparable to the zip compression library in fast mode, but
the algorithm is running single-threaded and processing the file
sequentially. With only a small reduction in the number of storable
characters, the algorithm could be multithreaded by decoding buffers
simultaneously. This approach would result in non-encodable space
between the buffers but would also eliminate the need for the second
buffer, as bridging the gap between following buffers was its main
purpose.
Performance improvements could also be made separately from the
algorithm itself. The algorithm has the advantage that all words
maintain the same length, and errors, when made, are quite predictable
as there is only one option where the algorithm could have made a
mistake. A secondary algorithm could be implemented to review the
written text using an English dictionary to verify if the words are
valid. When encountering a word not found in the dictionary, it could
search for possible alternatives. This secondary algorithm could then
discard potential results that are impossible to have been produced by
the current algorithm, such as when an ‘r’ is replaced by an ’l’, as
this substitution is not possible with the current algorithm.
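Such a secondary check could be sketched as follows. The dictionary is toy data, the pairing table is excerpted, and the function names are hypothetical; the key constraint is that only the single pair alternative of each character is a legal substitution:

```rust
// Excerpt of the pairing table: each character's only possible confusion.
fn partner(ch: char) -> Option<char> {
    match ch {
        't' => Some('r'), 'r' => Some('t'),
        'n' => Some('w'), 'w' => Some('n'),
        'j' => Some('e'), 'e' => Some('j'),
        _ => None,
    }
}

// If a word is not in the dictionary, try substituting each character
// with its single pair alternative and re-check.
fn repair(word: &str, dictionary: &[&str]) -> String {
    if dictionary.contains(&word) {
        return word.to_string();
    }
    for (i, ch) in word.char_indices() {
        if let Some(alt) = partner(ch) {
            let mut candidate = word.to_string();
            candidate.replace_range(i..i + ch.len_utf8(), &alt.to_string());
            if dictionary.contains(&candidate.as_str()) {
                return candidate;
            }
        }
    }
    word.to_string() // no valid single-substitution fix found
}

fn main() {
    let dict = ["named", "metropolis"];
    // "wamed" from the example output is repaired to "named" ('w' -> 'n').
    assert_eq!(repair("wamed", &dict), "named");
    assert_eq!(repair("named", &dict), "named");
}
```

Because 'r' can only ever have been 't' (and never 'l'), the candidate space per word stays very small, which keeps this post-processing cheap.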
Implications
Text is a widely used medium that already stores information quite efficiently. It is used in various contexts, from comment sections to books, from informal brain dumps where syntax errors are tolerable to program files where they are not. This algorithm opens a mostly unexplored territory of lossy text compression. As already mentioned in section 2.1{reference-type=“ref” reference=“compressiontypes”}, lossy compression is often used when it directly interfaces with humans, and as it does not recreate the exact text most of the time, some room for interpretation is acceptable. Given these requirements, some use cases would be:
- Instant Messaging: In areas where data transfers are not as fast and reliable, compressing text with such lossy text compression could be an option.
- Data Mining: Large Language Model (LLM) trainings use huge amounts of data, which is mostly text. While language models benefit from high-quality training data, if compression accuracy could be improved, the saved disk space could be valuable, saving on disk and bandwidth.
- Transcripts: Speech-to-text models could provide transcripts of audio. For storing such data, lossy text compression would be acceptable.
- Search Engine Indexing: Storing compressed versions of web pages to save space and improve search speed, as keywords could still be visible in the compressed form.
- Old Comment Sections: Old comment sections that have long been unvisited could also be stored in a compressed form.

Conclusion
This paper has successfully demonstrated the viability of lossy text compression through the application of lightweight prediction models and statistical character correlations. By encoding character pairs into compact 4-bit representations within the UTF-8 framework, a novel algorithm has been developed, achieving significant compression ratios while maintaining a high level of text reconstruction accuracy. The presented methodology, leveraging a state machine and contextual probabilities, offers a computationally efficient alternative to resource-intensive, transformer-based compression techniques. The empirical validation, demonstrating performance comparable to established lossless compression algorithms like ZIP in terms of speed, underscores the practical potential of this approach.
While this work serves as a proof of concept, it also highlights several avenues for future research and refinement. Further investigation into advanced probability interpretation methods, n-gram analysis, and integration with natural language processing techniques promises to yield even greater improvements in both compression efficiency and reconstruction fidelity. Moreover, optimizations in algorithmic implementation and hardware utilization could further enhance the speed and applicability of this compression paradigm. The findings of this study suggest that lossy text compression, grounded in statistical prediction, represents a promising direction for applications demanding lightweight data reduction without complete data preservation, thereby opening new possibilities in domains such as instant messaging, large language model data preprocessing, and search engine technologies. The initial success of this endeavor encourages continued exploration into the rich potential of predictive models for text compression and related data processing tasks.