Home » Data compression, Headline

Random data compression – Is it possible? (Part 2)

20 December 2009 by Keshav Shetty 11,888 views 8 Comments

This is continuation of my previous article on Random data compression possibilities.

Compression of unique values

As I indicated in previous article if for every 256 bytes each value appears only ones, then we can achieve the compression using any of the below techniques.

1. Using insertion sort – by remembering the position.
Using this technique for every 256 byte – we can save exactly 32 bytes, i.e for first element no need to remember any position, for next value one bit sufficient to remember weather current element inserted after or before previous number, for next two numbers we need two bits for each value, for next 4 numbers we need 3 bits for each number and so on until for last 128 values we need 8 bit for each value. It is constant save – i.e for every possible combination we will have 1792 bits or 224bytes. (We save 256 – 224 = 32bytes)

2. Using factorial of 256
In this approach, all possible variations can represented using factorial of 256, that will end up with 1684bits for every possible combination. It is constant save – i.e for every possible combination we will have 1684 bits or 210.5 bytes. (We save 256 – 210.5 = 45.5bytes)

3. Using reverse merge sort – Using this approach use a merge sort and store bit information of the list from where we picked the smaller number. Advantage of this technique is, when one list gets over, we don’t require to store or remember unsorting information for remaining element in other list, we just need to append it as illustrated below.


In best case scenario(when all elements are in ascending order) we just need (1/2*8bit) x 256 = 128 bytes for total 256 input bytes, in worst case we need 1793 bits as illustrated below.


Depending on the unique numbers appear in the list we can save between 128 byte(best case) to 31.75 byte (worst case)

Similarly we can use reverse quick sort technique, however merge sort is better than quick in storing minimal unsorting information, because in merge when one list gets over, we don’t require unsorting information for remaining elements in other list, this feature missing if we use quick sort, but still reverse quick sort can be used with lesser than 2048bits.

How to use above theory when all numbers are not unique?

Well above said theory works when all numbers are unique, but in real time such scenario is rare, instead there will be some duplicates and some unique numbers missing.

In fact random digits created by the RAND group contains on an average 100bytes duplicates(repeated) in every block of 256 numbers. in such data we need to patch the dataset by removing non unique, and inserting missing unique numbers, so that above reverse merge sort can be applied. When we remove duplicates we need to retain its position and actual value. (No need to retain the introduced missing numbers, which will be discarded after unsorting. But to patch this way we need lot of byte, e.g: if 100 numbers is duplicates (i.e 100 numbers are missing), we need to have 100*2=200 bytes. (very costly).

However there is better patch as below
1. Read 256 block of data, mark the duplicates/non unique, remember these relative to current position of the number (so we don’t require 8bits for position), Now we have missing unique numbers.
2. Rearrange the input data by eliminating duplicates so that unique numbers moves upper part of the list, fill missing unique numbers at the end of list. (make sure all missing numbers are filled in ascending order so that we can avoid storing unsorting information of missing numbers – which is of nu use for us – This way we can save few bytes)
3. Perform merge sort on the list excluding introduced missing numbers, now the compressed file will have Unsorting information + duplicate number positions + duplicate values.
4. If missing numbers are less than 32 byte we can actually store position using eight bits per position, if it is more than 32, then we use a bit for each byte indicating duplicate or not.

Using above technique I developed a compressor which can compress random data up to 42 duplicates or missing unique numbers. (I used million digit of Rand with some modification to maintain maximum 42 missing numbers in every 256 input bytes. However I am yet to find way to address another 60 bytes so that million digit of Rand group can be compressed at least by a byte.

I will fine tune my compressor and announce in next article in next year. (After Xmas/new year), next article I will introduce another technique of transition or state representation for random data.

Wish you all happy new year 2010.

1 Star2 Stars3 Stars4 Stars5 Stars (4 votes, average: 5.00 out of 5)


  • Raimonds S. said:


    Regarding: “How to use above theory when all numbers are not unique?”.
    There are some more effective solution to store information about unique/notunique number positions.
    You can use combinatorics (I use combinadic – wikipedia have nice rticle about that).
    Let’s say 100 notunique and 156 unique values: it will take much less bits than 256 – but there must be specified count of unique numbers which takes 7 or 8 bits.
    The way to store information about repeated values. You must take unique value positions to represent notunique number.
    Let’s say we have sequence (0-255):
    4 6 10 32 6 85 10 2 …
    we know which number is unique – to represent this I’ll use bit sequence to show unique numbers:
    1 1 1 1 0 1 0 1 …
    all zeroes (not unique) could be stored in using much less bits than originally. In this example second number 6 could be stored using 2 bits, because there are only 4 unique values, and so further…

    The only thing I did not use is sorting for data storing, but this field is very interesting for me.
    The only one advice – try to implement (program) this, before go crazy about good results on paper 😉

    Best regards!


  • Keshav Shetty (author) said:

    Hi Raimonds,

    About storing position I mentioned
    “Read 256 block of data, mark the duplicates, remember these relative to current position of the number (so we don’t require 8bits for position)”
    It is the same thing you are referring, i.e if the third item is duplicate, then that value present in earlier two numbers, so we need just 1 bit to store the position.

    About the implementation I already have a working copy with max 42 duplicates can be compressed, I want to fine tune further and announce in this blog next week.

    Thanks & regards
    Keshav K Shetty

  • Jeff said:

    In addition to your method, you can also predict that the next byte will come from the side with more elements left, therefore, instead of storing 0 to signal “next is from left” and 1 for “next is from right” you could do 0 for “next is from predicted side” and 1 for “next is from non-predicted side”. Then, with each correct prediction, you write a zero, and therefore your un-sort bits are heavy with 0s and can therefore re-compress your un-sort bits due to the 0 redundancies. The best way I’ve found to do this is using factorial. For example, if your un-sort bits total 141 0s and 115 1s, you can represent them in 250 bits for a savings of 6 bits (not including the number of bits necessary to recall you have 141 0s which in this case could be a small number representing how far 141 is from 128, which is just 13, or 4 bits, making your total savings 2 whole bits. But hey, every bit counts!

  • games pinball said:

    There are surely lots of details like this to consider. This is a great point to bring up.

  • Lawyer said:

    This really is i’ll be attempting to find. It is exactly what I ask high quality. The information provided in the following paragraphs is usually to probably the most effective. I must say you must have invested your time within environment these types of happy with each other. They’re highly relevant to your concept. I will suggest this unique to all also to my affiliates. I shall go back here to try out the quantity of give good results. I appreciate you for producing these types of occur.

  • Random data compression – Lowerband & upper band filtering | Simpler solution for complex problem. Think different – Keshav Shetty said:

    […] Few users suggested to change the title, because title says “Random data” where as article describes about unique elements. Let me clarify that million random digit contains average 90-110 bytes of duplicate or non unique for every 256 bytes. If we could represent unique data set <=202 byte(worst case), we can use remaining 54 byte to represent duplicates. Refer previous article here […]

  • Shirani said:

    Hi Kashav,

    I have gone through your articles, I had similar idea but not as clear as of yours and in depth described. Using your statically generated symbol table I have reached to point where:

    Total Symbols : 162
    Hard Symbols : 90 //that are not duplicate
    Max position of duplicate : 227
    Max duplicates of single symbol : 4
    Out Length bits : 1610 bits without encoding duplicates
    Out Length Bytes : 201.25 bytes
    Task : 94 bytes to be encoded in 63

    Now the task is encoded only 94 duplicates that are definitely known. You mentioned “If we could represent unique data set <=202 byte(worst case), we can use remaining 54 byte to represent duplicates.". Please explain, do you propose huffman codes or among statically generated that you posted, and remember total number of symbols in duplicates are 72 that is because total symbols are 162 and 90 are hard symbols that don't repeat. As per your bitdiffmap here is proposed codes for 72 symbols:

    Pos[72] 0000000 0000001 0000010 0000011 0000100 0000101 0000110 0000111 0001000 0001001 0001010 0001011 0001100 0001101 0001110 0001111 001000 001001 001010 001011 001100 001101 001110 001111 010000 010001 010010 010011 010100 010101 010110 010111 011000 011001 011010 011011 011100 011101 011110 011111 100000 100001 100010 100011 100100 100101 100110 100111 101000 101001 101010 101011 101100 101101 101110 101111 110000 110001 110010 110011 110100 110101 110110 110111 111000 111001 111010 111011 111100 111101 111110 111111


  • Shirani said:

    So practically we have to encode 94 bytes [that have 72 symbols] in 63 bytes in total. I hope this is possible, but no claim in fact.


Leave your response!

Add your comment below, or trackback from your own site. You can also subscribe to these comments via RSS.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This is a Gravatar-enabled weblog. To get your own globally-recognized-avatar, please register at Gravatar.