Introduction
Following the lossy text compression route, I need to reliably gather and save relative text data. Since this often needs to be done only once, it is acceptable for the program to take multiple hours to complete its query on the Wikipedia text dump. However, it would be problematic if the program ran for hours only to encounter an error and lose its progress.
To address this, I developed this little library. Currently, it can only extract data for very narrow use cases, but in the future, it aims to handle the entire data pipeline for the LTC compression library. This includes tackling all external data-related tasks such as gathering, analyzing, tokenizing, and normalizing text data.
If this task only needed to be done once, building such a library might seem excessive. However, when models need to be specialized for specific datasets or languages, having a working and robust data pipeline makes the training process easier, faster, and more reliable in the long run.
When finished, the whole project will be found in the project section, while updates will appear under posts.
File format
The model needs different types of data, but the main function currently working has to handle three-dimensional u64 arrays and three-dimensional f64 arrays. When the LTC project started, this was handled with a CSV file that, given the length of every column, writes one depth slice after the other. In terms of accessibility this has its limitations: the approach was neither fast nor reliable, and it was not easy to work with when more extensive analysis of the data had to be done.
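To illustrate what that old layout meant in practice (a rough sketch of how I would reproduce it, not the original code): each depth slice is written as consecutive CSV rows, so the file can only be put back together if the number of rows per slice is known.

```rust
use ndarray::Array3;

/// Sketch of the old CSV layout: the 3D array is written slice by slice,
/// one depth after the other. Without knowing the rows per slice,
/// a reader cannot tell where one depth ends and the next begins.
fn write_csv_flat(arr: &Array3<u64>) -> String {
    let mut out = String::new();
    for depth in arr.outer_iter() {
        for row in depth.outer_iter() {
            let cells: Vec<String> = row.iter().map(|v| v.to_string()).collect();
            out.push_str(&cells.join(","));
            out.push('\n');
        }
    }
    out
}
```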
HDF5
This is why this project ended up with HDF5. It is a portable file format that has, although not extensively tested by me, robust libraries in Rust and Python to interact with the data. It easily handles multidimensional data and can even group different types of data in one file. This makes it possible to keep the absolute, relative, and post-trained probabilities in the same file, but separated, so each of them can be analyzed and used on its own.
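As a minimal sketch of how such a file could be written from Rust with the hdf5 crate (the file, group, and dataset names here are made up for illustration, and I am assuming the crate's dataset-builder API as I understand it):

```rust
use hdf5::File;
use ndarray::Array3;

fn save_probabilities(absolute: &Array3<u64>, relative: &Array3<f64>) -> hdf5::Result<()> {
    // One file, separate groups, so each kind of data can be read on its own.
    let file = File::create("probabilities.h5")?; // hypothetical file name

    let abs_group = file.create_group("absolute")?;
    abs_group
        .new_dataset_builder()
        .with_data(absolute)
        .create("counts")?;

    let rel_group = file.create_group("relative")?;
    rel_group
        .new_dataset_builder()
        .with_data(relative)
        .create("probabilities")?;

    Ok(())
}
```

On the Python side, the same file can be opened with h5py and each group read independently.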
Program
Currently the program is still very simple. It takes as arguments how many characters it should look into the future, how many into the past, and which characters to observe. It iterates over the lines of the Wikipedia dump, and whenever it finds an acceptable_type it looks at the window around it and saves the acceptable characters it finds there into a big HashMap.
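Roughly, the counting step works like the following sketch (this is simplified and the key layout is only illustrative; the real acceptable_type check and argument handling live in the library):

```rust
use std::collections::HashMap;

/// Stand-in for the real acceptable_type check.
fn is_acceptable(c: u8) -> bool {
    c.is_ascii_lowercase() || c == b' '
}

/// For every acceptable character, count which acceptable characters appear
/// at each offset in a window of `past` bytes before and `future` bytes after it.
fn count_line(line: &str, past: usize, future: usize, counts: &mut HashMap<(u8, isize, u8), u64>) {
    let bytes = line.as_bytes();
    for (i, &center) in bytes.iter().enumerate() {
        if !is_acceptable(center) {
            continue;
        }
        let start = i.saturating_sub(past);
        let end = (i + future + 1).min(bytes.len());
        for (j, &neighbor) in bytes[start..end].iter().enumerate() {
            let offset = (start + j) as isize - i as isize;
            if offset != 0 && is_acceptable(neighbor) {
                *counts.entry((center, offset, neighbor)).or_insert(0) += 1;
            }
        }
    }
}
```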
The HashMap is then converted into an ndarray, which in turn gets saved to the HDF5 file.
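The conversion becomes simple once every accepted character has a fixed index; a sketch, assuming the (center, offset, neighbor) key layout from above:

```rust
use ndarray::Array3;
use std::collections::HashMap;

/// Turn the sparse count map into a dense 3D array:
/// axis 0 = center character, axis 1 = window offset, axis 2 = neighbor character.
fn to_array(
    counts: &HashMap<(u8, isize, u8), u64>,
    alphabet: &[u8],
    past: usize,
    future: usize,
) -> Array3<u64> {
    let index: HashMap<u8, usize> =
        alphabet.iter().enumerate().map(|(i, &c)| (c, i)).collect();
    let width = past + future + 1;
    let mut arr = Array3::<u64>::zeros((alphabet.len(), width, alphabet.len()));

    for (&(center, offset, neighbor), &n) in counts {
        let oi = (offset + past as isize) as usize; // shift offsets into 0..width
        arr[[index[&center], oi, index[&neighbor]]] += n;
    }
    arr
}
```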
Result
It is really slow, but it works. I think the HDF5 format is exactly what I needed, and as soon as the project is up, the gathered file will be available for download.
Ideas
- I will first try to multithread it, so that it not only runs faster but also uses my server's many cores for something other than idling (see the sketch after this list).
- When that is done, I will try to implement the normalization, which could also greatly benefit from multicore performance.
- This is further in the future than the other two points, but currently the program iterates over a Vec, which is O.K. since all characters I currently accept are ASCII. I would like to make it work with UTF-8, so that I could theoretically also start analyzing German text and then use that information to build German models.
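For the multithreading idea above, one option I am looking at is rayon (this is just my own sketch, not something the library does yet): every chunk of lines is counted into its own local map, and the partial maps are merged at the end, reusing the per-line counting from the earlier sketch.

```rust
use rayon::prelude::*;
use std::collections::HashMap;

type Counts = HashMap<(u8, isize, u8), u64>;

/// Count all lines in parallel: each rayon task folds its lines into a local
/// map, then the partial maps are reduced into one.
fn count_parallel(lines: &[String], past: usize, future: usize) -> Counts {
    lines
        .par_iter()
        .fold(Counts::new, |mut local, line| {
            count_line(line, past, future, &mut local); // per-line counting from the sketch above
            local
        })
        .reduce(Counts::new, |mut a, b| {
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        })
}
```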