File Compression
- FIXME
Terms defined: nybble
FIXME: create file compressor
- First idea: number the tokens in the file and replace each with its number
- Use blank-nonblank tokenization (which inadvertently captures punctuation)
compress.py
- Test with text of introduction to this book: makes things worse
- So store as binary
binary.py
- Better, but still larger than the original
- In fact, our binary compression is worse when tested every chapter
- Look at statistics
frequency.py
andfrequency.svg
- Lots of words only occur once (long tail)
- Single space occurs over 1600 times
- Use variable-length encoding Chapter 17
- Use nybbles (half-bytes)
- Most common seven values are 0000 through 0111
- Next 49 are (1…, 0…)
- 343 are (1…, 1…, 0…)
- Calculate compression before implementing
varlen_calc.py
tells us we can get the introduction from 13240 bytes to 10281 (approximately 1.3X)- But
gzip
can do 2.3X: how?
- Lempel-Ziv-Storer-Szymanski
- Use this tutorial by Tim Cogan
- That’s algorithm country rather than software design…