Replacement Text compression
Who needs capital letters anyway?
This was the question I woke up with one morning, and I decided to put it to the test. Since I am already destroying any typographically beautiful text with my lossy text compression, which I hope to post here soon, I might as well test a world without capital letters. The problem is that with current UTF-8 there is absolutely no gain in removing all the capital letters: the text stays exactly the same size. So is there a way to make good use of the missing capital letters?
Replace equivalents
The program starts with a given list of equivalents, like A -> a, B -> b, but the more characters you fold together, the more slots you free up, so adding ! -> ., ? -> . and “ -> ‘ can further improve the algorithm's capabilities. The program constructs a HashMap out of this list, then goes through the file, and whenever it finds a key it replaces it with the corresponding value. Pretty simple.
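A minimal single-threaded sketch of this replacement pass (the function names and the exact equivalents list are my own illustration, not the actual program):

```rust
use std::collections::HashMap;

// Build an example equivalents map: every capital letter maps to its
// lowercase form, and ! and ? both collapse into '.'.
fn build_equivalents() -> HashMap<char, char> {
    let mut map = HashMap::new();
    for (upper, lower) in ('A'..='Z').zip('a'..='z') {
        map.insert(upper, lower);
    }
    map.insert('!', '.');
    map.insert('?', '.');
    map
}

// Replace every character that has an equivalent with its value;
// characters without an entry pass through unchanged.
fn apply_equivalents(text: &str, map: &HashMap<char, char>) -> String {
    text.chars().map(|c| *map.get(&c).unwrap_or(&c)).collect()
}

fn main() {
    let map = build_equivalents();
    let out = apply_equivalents("Who needs Capitals? Not me!", &map);
    println!("{out}"); // who needs capitals. not me.
}
```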
How to use the Space
I will just focus on the ASCII aspect of UTF-8
ASCII has space for 128 different characters, but when we look at typical written text, it might not surprise you that not all 128 appear in it. Disregarding the control characters, not every text uses all the digits or symbols like +, and if we let the capital letters be replaced, we gain another 26 empty slots. How can we use this space?
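Finding those free slots is a single scan over the text. A sketch (my own naming; it skips the control characters, as the text suggests):

```rust
// Collect the printable ASCII characters that never occur in the text;
// these are the free slots we can later reuse as pair codes.
fn unused_ascii(text: &str) -> Vec<char> {
    let mut seen = [false; 128];
    for b in text.bytes() {
        if b < 128 {
            seen[b as usize] = true;
        }
    }
    // Only consider the printable range 32..127, disregarding the
    // control characters (0..32) and DEL (127).
    (32u8..127)
        .filter(|&b| !seen[b as usize])
        .map(|b| b as char)
        .collect()
}

fn main() {
    let free = unused_ascii("hello world");
    println!("{} free printable slots", free.len());
}
```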
Gathering most used characters
Besides checking which characters are unused, I also check how often each two-character combination appears in the text that is going to be compressed. To make this reasonably fast I created a 128x128 ndarray, where each cell (y, x) counts how often character x follows character y. The text file is split into buffers, multiple threads compute the ndarray for each buffer, and the results are added up. I then only had to flatten this into a single vector and sort it in descending order to get the n most used two-character combinations. For every unused ASCII character, I took one of these combinations and created a HashMap of type <(char, char), char>. Using this HashMap while iterating over the text, I can replace every occurrence of such a character pair with the single value character.
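The whole pipeline can be sketched single-threaded, with a flat 128*128 vector standing in for the ndarray and without the buffer splitting (all names are my own; the greedy left-to-right replacement is one plausible reading of "replace all occurrences"):

```rust
use std::collections::HashMap;

// Count how often each ordered pair of ASCII characters appears,
// in a flat 128x128 table indexed as first * 128 + second.
fn pair_counts(text: &str) -> Vec<u64> {
    let mut counts = vec![0u64; 128 * 128];
    for win in text.as_bytes().windows(2) {
        let (a, b) = (win[0], win[1]);
        if a < 128 && b < 128 {
            counts[a as usize * 128 + b as usize] += 1;
        }
    }
    counts
}

// Pair the most frequent bigrams with the free characters, most
// frequent first, giving the <(char, char), char> map from the text.
fn pair_map(counts: &[u64], free: &[char]) -> HashMap<(char, char), char> {
    let mut indexed: Vec<(usize, u64)> =
        counts.iter().copied().enumerate().filter(|&(_, c)| c > 0).collect();
    indexed.sort_by(|x, y| y.1.cmp(&x.1)); // descending by count
    indexed
        .iter()
        .zip(free)
        .map(|(&(i, _), &slot)| (((i / 128) as u8 as char, (i % 128) as u8 as char), slot))
        .collect()
}

// One greedy pass over the text, replacing every mapped pair
// with its single-character slot.
fn compress_pairs(text: &str, map: &HashMap<(char, char), char>) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < chars.len() {
        if i + 1 < chars.len() {
            if let Some(&slot) = map.get(&(chars[i], chars[i + 1])) {
                out.push(slot);
                i += 2;
                continue;
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    out
}

fn main() {
    let text = "the theme of the thesis";
    // Pretend '#' and '%' were found to be unused slots.
    let map = pair_map(&pair_counts(text), &['#', '%']);
    let out = compress_pairs(text, &map);
    assert!(out.len() < text.len());
    println!("{out}");
}
```

Each replaced pair shaves one byte off the output, so the savings scale with how often the top bigrams occur.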
Performance
As of now I have no idea how to split a text file into multiple buffers, apply the transformation to each, and then concatenate them again in the correct order, so the compressor simply walks through the file sequentially. It is highly unoptimized, so the timing is not that important and I will just show the compression results; since the transformation does almost nothing per character, it could be very fast (my naive implementation on my laptop takes 26.75 s for a 524M text file).
524M of Wikipedia Text -> 355M
Further improvements
I will first try to optimize which sequences get replaced, because currently the program can only replace exactly two characters, but sometimes replacing three might be more optimal.