
Random data compression – Is it possible? Part 1

7 December 2009 by Keshav Shetty

Lossless random data compression – Is it possible?

The answer is NO (…. and yes!!!)

This is a somewhat lengthy post that lays out both the difficulties and the possible ways forward. The article is divided into three sections.

(Before you read further, I suggest you read Mark Nelson's “The Million Random Digit Challenge“.)

1. Why it is not possible to compress random data (a brief introduction to Kolmogorov complexity)

2. Why it is possible (an introduction to quantum theory, matter and antimatter)

3. The future of compression and a possible solution using transition representation and unsorting techniques

If you want to skip this introduction and go straight to the next article, continue here – using unsorting techniques for random data compression.

1. Why it is not possible to compress random data (a brief introduction to Kolmogorov complexity)

As per the pigeonhole principle, if n pigeons are put into m pigeonholes with n > m, then at least one pigeonhole must contain more than one pigeon; and if n < m, there must be at least one empty pigeonhole. In purely mathematical terms: there does not exist an injective function on finite sets whose codomain is smaller than its domain.


The Russian mathematician Kolmogorov showed through his complexity theory that, with the uniform probability distribution on the space of bit strings of length n, the probability that a string is incompressible by c bits is at least 1 − 2^(−c+1) + 2^(−n). (For more details please refer to Wikipedia here.)
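To get a feel for what this bound means, here is a small sketch in C (my own illustration, not part of the theorem) that evaluates 1 − 2^(−c+1) + 2^(−n) for a 2048-bit block (256 bytes); already for c = 8 the bound says that more than 99% of such strings cannot be shortened by even one byte.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int n = 2048;                  /* a 256-byte block has 2048 bits */
        for (int c = 1; c <= 16; c++) {
            /* lower bound on the probability that a random n-bit string
               cannot be compressed by c bits */
            double bound = 1.0 - pow(2.0, 1.0 - c) + pow(2.0, -(double)n);
            printf("compress by %2d bits: P(incompressible) >= %.6f\n", c, bound);
        }
        return 0;
    }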

Let's take an arithmetic approach to illustrate the above result.

A byte, i.e. 8 bits, can represent 256 different values: 0, 1, 2, …, 255. As you can see, there are 256 possible values in total, and if each one appears exactly once (in a 256-byte block) then you cannot represent all of these possibilities with a codomain smaller than 256.

All existing compression algorithms can compress data only when not all possibilities are present. For example, if an input contains only 8 unique byte values, then we need just 3 bits (0, 1, 2, …, 7 – eight possibilities) to represent each byte of the input.
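To make this concrete, here is a minimal packing sketch in C (my own example; the 8-symbol alphabet and the input string are invented for illustration) that stores each byte of such an input as a 3-bit code:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const uint8_t alphabet[8] = { 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' };
        const uint8_t input[] = "ACGTacgtTTTT";
        size_t n = strlen((const char *)input);

        uint8_t packed[64] = { 0 };
        size_t bitpos = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t code = 0;
            for (uint8_t k = 0; k < 8; k++)        /* find the 3-bit code */
                if (alphabet[k] == input[i]) { code = k; break; }
            for (int b = 2; b >= 0; b--, bitpos++) /* write 3 bits, MSB first */
                if (code & (1 << b))
                    packed[bitpos / 8] |= (uint8_t)(1 << (7 - bitpos % 8));
        }
        printf("%zu input bytes -> %zu packed bytes\n", n, (bitpos + 7) / 8);
        return 0;
    }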

Note: each algorithm works differently, but all existing algorithms – dictionary based, LZW, Huffman, RLE, etc. – rely on patterns, repetitions, frequency of occurrence, or uniqueness in the data.
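As an illustration of how such pattern-based schemes exploit repetition, here is a minimal run-length encoder in C (my own example; run lengths are capped at 255 so each count fits in one byte):

    #include <stddef.h>
    #include <stdio.h>

    /* each run of equal bytes becomes a (count, value) pair */
    static size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out) {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i] && run < 255)
                run++;
            out[o++] = (unsigned char)run;   /* run length */
            out[o++] = in[i];                /* repeated byte value */
            i += run;
        }
        return o;
    }

    int main(void) {
        const unsigned char data[] = "aaaaaabbbcccccccc";  /* 17 bytes, 3 runs */
        unsigned char out[64];
        size_t m = rle_encode(data, sizeof data - 1, out);
        printf("%zu bytes in -> %zu bytes out\n", sizeof data - 1, m);
        return 0;
    }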

People often fail to digest the simplicity of the above argument and claim they have developed a magic data compressor that can compress any type of data. (Refer to comp.compression.)

(Note: I am not claiming any such magic compressor – but I do claim such possibilities exist.)

2. Why do I believe that random data compression is possible?

Don't mistake me: I strongly believe in Kolmogorov complexity and number theory. It is impossible to develop a magic compressor which can compress any type of DATA.

In my school days I learnt that our universe was in the form of bindurupi (in the form of a dot, or originated from nothing) and then expanded (and is still expanding), generating so many possibilities and varieties. As per quantum theory, matter and antimatter join together, and during this process new variations appear. A better example: each couple produces a totally different baby; the new baby is neither 100% of the father nor 100% of the mother (nor even 50% father + 50% mother). The baby may have a few features of the father, the mother and the ancestors, but it also has its own uniqueness. Genes keep mutating – but based on what?

Note: "Nothing" doesn't mean empty or zero; we are still not clear about what nothing is. Sanskrit and Sanathan mythology mention Brahma vidya, which deals with the separation of matter and antimatter.

Matter is the materialistic or physical world which we can see, feel, represent and store (random data is also a type of matter), whereas antimatter is something we cannot imagine or represent; even quantum theory could not provide any visual way to see or feel it. (Maybe human beings are not yet ready to digest this because we are still living in a materialistic world.)

As per Einstein's mass–energy relationship E = mc^2, energy "E" is generated when an object of mass "m" is converted into energy; neither mass nor energy is conserved separately. This is the principle used in nuclear fusion.

Now you might be thinking: what is the relation between Einstein's mass–energy theory and random data compression?

In one of his lectures Einstein mentioned that you can convert mass "m" into energy "E"; neither mass nor energy can be destroyed. Similarly, you can convert energy into matter!

Someone in the audience asked Einstein how it is possible to generate mass from energy – effectively, if you have enough energy, can you generate any type of object, like gold or uranium?

Einstein said it is outside the definition of physics, and that a new world opens through a spiritual way to prove it. Most likely Einstein had an idea of antimatter at that time. In my view, antimatter is a huge energy reservoir which acted on matter to generate the variety of materials we see today (imagine nuclear fission of heavier material). Please note, I believe there should be different types of antimatter, similar to the matter we see.

We have heard much such information from ancient India, and even in current times it is believed that a few people have the capability to generate objects, e.g.:

1. The Pandavas had the Akshaya Patra – to generate food
2. Kamadhenu – a holy cow that could provide anything sage Vasista asked for
3. The divine Sri Sathya Sai Baba

I will not go further down the spiritual path, which I cannot address properly and in which I do not have sufficient knowledge.

As long as we treat the material as data we cannot compress random data; the day we find a way to represent the antimatter, we can achieve this. In other words, we should deal with existing random data as a transition or state representation, because data is the static form of Brahma vidya; it is the antimatter which generates all the variations.

What do I mean by transition or state representation?

There exists a huge set of static data – the size of this static data is beyond human imagination. The so-called antimatter acts on this matter and generates new sets of static data. (Is this how the universe is expanding?)

The static data is growing – then how come it is static? (I use "static" to mean the default or initial set.) During this process the matter passes through different stages. For example, radioactive material decays over a long period of time to generate another material; during this long process we have, at intervals, a variety of data or material. If we know the state of the item, we can accurately tell the process or the time involved in it; similarly, if we knew all the processes it passed through, we could define the state of the material.

Kolmogorov's Turing machine is actually generating data from the static data described above. We call it highly random, or high entropy, because we cannot define it as a set or within a domain. We human beings have not reached the stage where we can define randomness. Note: random doesn't mean uncertain or an unguessable future; it is predefined, but beyond the imagination or understanding of a materialistic human.

Summary: as long as we treat random numbers as data, we cannot compress them, because the number theory applicable to data clearly proves that it is impossible. However, the day we find a way to treat them as either antimatter or state transitions, we can change the way we represent the data.

3. The future of compression and a possible solution using transition representation and unsorting techniques

In the section above I mentioned state or transition representation; now, if we represent random data as states or transitions, we will see a totally different scenario or concept.

I will take a small portion of the static data I mentioned above, i.e. a subset of 256 unique values. That means I need 8 bits to represent all variations. If, in every 256 bytes of data, each value appears exactly once, then none of the existing algorithms can compress the data. (Although the chance of such entropy in real data is not high, such blocks are so random that they cannot be compressed.)

[I will try to prove this in a conventional manner instead of a mathematical form. I don't want my readers to get confused with mathematical formulas, so I will avoid formulas or functions unless they are unavoidable.]

Using an unsorting algorithm we can achieve compression of such data of between 10% and 50%!

The catch here is that the data set is fixed; only the positions differ, i.e. the data is not organized. When all values are unique, we need to find a way to represent the original positions which requires fewer than 8 bits per value.

I hope I have not confused you; let me describe it another way.

I have a set of 256 values and all are unique. In this scenario I need 8 bits to represent each value. The values might be 250, 140, 0, 130, 239, … and so on, each value appearing only once.

If you use any existing compression algorithm, you cannot achieve any compression, since there are no repetitions or patterns present (the block is highly random and each value is unique).

In fact, most compressors generate compressed data in exactly this form, i.e. highly unique with no repeatable patterns (close to unique within every 256-byte block, but not exactly). A better example is the random digits created by the RAND group; please refer to Mark Nelson's The Million Random Digit Challenge Revisited. (That data comes very close to all 256 variations per block, but is not 100% unique – which follows from Kolmogorov complexity.)

But my method can compress such random data (not Kolmogorov-complexity or Turing-machine random data – because that is not 100% unique within a 256-byte block; I am yet to reach there).

What is an unsorting algorithm? In our graduation or engineering studies we learned various sorting algorithms like bubble sort, quick sort, merge sort, etc. However, we never came across anything called an unsorting algorithm! Why would we need it?

An unsorting algorithm shuffles sorted items back into their original order or positions. Such algorithms may not be that useful in real applications, but I will demonstrate compression of 256 unique values using one.

If there are 256 values, then there are 256 positions, which means I still need 8 bits to store each position. However, an unsorting algorithm is not just about placing the sorted items back into the original list; it also uses the minimal storage required to represent the original positions.

Let's take merge sort. If we have a list of two items, we need just 1 bit (both best and worst case) to unsort back to the original list; that bit tells which of the two is smaller. If we have four items, we need 4 bits in the best case and 5 bits in the worst case. If we have 8 items, we need 12 bits in the best case and 17 bits in the worst case (note: 8 unique items would otherwise need 24 bits). If we have 256 unique items, we need 1024 bits (128 bytes) in the best case and 1856 bits in the worst case, whereas 256 unique values normally require 2048 bits. As you may have observed, using a reverse merge sort we can achieve the compression mentioned above. I will not go into the details here, because it would fill another page; if you need further details, or a working implementation of such a compressor using reverse merge sort, please write to me. Similar results can be obtained using a reverse quick sort.
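To make the idea concrete, here is a minimal sketch in C (my own reconstruction of the scheme, not the working code offered above): the encoder merge-sorts a 256-value block while recording one bit per comparison, and because the sorted order 0..255 is known in advance, the decoder can rebuild the original permutation from those bits alone. The bit count depends on the input order but stays below the raw 2048 bits per block.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N 256                      /* one block of 256 distinct byte values */

    static unsigned char bits[N][N];   /* comparison bits, one row per merge    */
    static int total_bits = 0;

    /* deterministic id of the b-th merge in the pass that merges runs of width w */
    static int merge_id(int w, int b) {
        int base = 0;
        for (int v = 1; v < w; v *= 2)
            base += N / (2 * v);
        return base + b;
    }

    /* merge a[lo..lo+w) and a[lo+w..lo+2w), recording which side each output came from */
    static void enc_merge(unsigned char *a, int lo, int w, int id) {
        unsigned char out[N];
        int li = lo, ri = lo + w, o = 0, nb = 0;
        while (li < lo + w && ri < lo + 2 * w) {
            if (a[li] <= a[ri]) { bits[id][nb++] = 0; out[o++] = a[li++]; }
            else                { bits[id][nb++] = 1; out[o++] = a[ri++]; }
        }
        while (li < lo + w)     out[o++] = a[li++];   /* leftovers need no bits */
        while (ri < lo + 2 * w) out[o++] = a[ri++];
        memcpy(a + lo, out, (size_t)o);
        total_bits += nb;
    }

    /* undo one merge: split a[lo..lo+2w) back into its two pre-merge halves */
    static void dec_unmerge(unsigned char *a, int lo, int w, int id) {
        unsigned char left[N], right[N];
        int li = 0, ri = 0, i = lo, nb = 0;
        while (li < w && ri < w) {
            if (bits[id][nb++] == 0) left[li++]  = a[i++];
            else                     right[ri++] = a[i++];
        }
        while (li < w) left[li++]  = a[i++];          /* the rest came from one side */
        while (ri < w) right[ri++] = a[i++];
        memcpy(a + lo,     left,  (size_t)w);
        memcpy(a + lo + w, right, (size_t)w);
    }

    int main(void) {
        unsigned char original[N], work[N], rebuilt[N];

        for (int i = 0; i < N; i++) original[i] = (unsigned char)i;
        srand(42);                                    /* random permutation of 0..255 */
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            unsigned char t = original[i]; original[i] = original[j]; original[j] = t;
        }

        /* encode: bottom-up merge sort, one recorded bit per comparison */
        memcpy(work, original, N);
        for (int w = 1; w < N; w *= 2)
            for (int b = 0; b < N / (2 * w); b++)
                enc_merge(work, b * 2 * w, w, merge_id(w, b));

        /* decode: start from the known sorted block 0..255 and unmerge top-down */
        for (int i = 0; i < N; i++) rebuilt[i] = (unsigned char)i;
        for (int w = N / 2; w >= 1; w /= 2)
            for (int b = 0; b < N / (2 * w); b++)
                dec_unmerge(rebuilt, b * 2 * w, w, merge_id(w, b));

        printf("unsorting info: %d bits (raw block: %d bits), roundtrip %s\n",
               total_bits, N * 8,
               memcmp(original, rebuilt, N) == 0 ? "OK" : "FAILED");
        return 0;
    }

The per-merge bit lists are kept separate here only to keep the bookkeeping simple; a real compressor would concatenate them into a single bitstream.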

The above compressor works well when every value is unique within each 256-byte block. However, real data is rarely that unique, and Kolmogorov complexity doesn't guarantee such entropy within a predefined block of data.

The above point is the same as in probability theory: if we toss a coin, the chance of getting either heads or tails is 50%. However, that does not guarantee that in every two successive tosses one will be heads and the other tails. But over a large number of tosses you will end up with roughly 50% heads and 50% tails.
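A few lines of simulation illustrate the same point (my own example): no single pair of tosses is guaranteed to balance, yet the overall ratio settles near 50%.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        srand(7);
        long heads = 0, tosses = 1000000;
        for (long i = 0; i < tosses; i++)
            heads += rand() & 1;                 /* 1 = heads, 0 = tails */
        printf("heads: %ld of %ld (%.2f%%)\n", heads, tosses, 100.0 * heads / tosses);
        return 0;
    }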

Coming back to the random digits created by the RAND group – I observed that in every 256-byte block of data the number of unique values lies anywhere between 146 and 166. That means if we find a way to patch in the missing unique values within every 256-byte block, then we have reached the stage of random data compression; however, the patched unique values plus the unsorting information must come to fewer than 2048 bits per 256-byte block. I am still looking for such a possibility.
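For readers who want to reproduce this kind of measurement, here is a small sketch in C (the file name is only a placeholder for a binary copy of the random data) that counts the distinct byte values in each 256-byte block:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("random_block.bin", "rb");   /* placeholder file name */
        if (!f) { perror("fopen"); return 1; }

        unsigned char block[256];
        size_t n, blockno = 0;
        while ((n = fread(block, 1, sizeof block, f)) == sizeof block) {
            int seen[256] = { 0 }, unique = 0;
            for (size_t i = 0; i < n; i++)
                if (!seen[block[i]]++) unique++;     /* count first occurrences */
            printf("block %zu: %d unique values\n", blockno++, unique);
        }
        fclose(f);
        return 0;
    }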

Now for the other approach: as I mentioned in the previous section, instead of representing random data as data, represent it as states or transitions, and a new concept opens up – i.e. treat the random digits of the RAND group as unique transitions of another process.

Continue reading the next article – using unsorting techniques for random data compression.


6 Comments »

  • Jules Gilbert said:

    I stopped reading when you started on things that are apart from the God of the Bible.

    God gave us our world and indeed, he loves us, even sending Jesus to help us get back on track. So don’t ignore him! For he really is God come to earth. Instead honor him as God for that is exactly who he is.

    Now about compression.

    Won’t you be surprised because what you believe to be impossible, well, that’s exactly what I’ve done.

    I like Mark, but if he actually said what you claim (I haven't checked), well then, he was wrong. In point of fact, while the basic points you (or he) make are quite true, they shouldn't be extended to imply that repeated compression of random input is impossible.

    One more time. No, it’s not impossible, I do it.

  • Raimonds S. said:

    I clearly understand what you believe could be done, and I thought the same before.

    After 2 years of research in this field (90% of which looks like your work description), I consider that the only useful thing about random data value uniqueness is that there are limits between compressible and un-compressible data.
    I did not find information regarding such a uniqueness compressor, but I found/calculated some numbers.
    Let's take an 8-bit sample (all 256 values from 0-255 in random order):
    1) the best (fewest bits) method to store such unique values costs log(256!, 2) bits (base 2).
    2) the margins I found for compressible data are:
    a) fewer than about 130 unique values in a 256-value sequence (compressible using arithmetic coding and similar);
    b) 256 unique values, or more than about 180 unique values in a 256-value sequence (regarding our research).

    I did not write any paper about that, but if you go further you will face those values. What about your research so far – did you find any more facts?

    P.S. I’m not mathematician and English is not my native 🙂

  • Keshav Shetty (author) said:

    Hi Raimonds,

    Thanks for your comment; I will be posting the next part on this subject this weekend.
    We can discuss it after that post.

    Just to inform you that:
    1. If all 256 values are unique (in each 256-byte block) I can compress using reverse merge sort by approximately 10% to 50%
    (depending on the order of the items: if they are in ascending order we can achieve 50% compression, with the worst case if they are in descending order).
    2. If the number of unique values (in each 256-byte block) is between 230 and 255, then we can compress by around 1% to 10%.
    3. But if the uniqueness is between 128 and 230, I have not found any solution so far.

    In fact the Million Random Digits contain around 150 unique values per 256-byte block, the rest being duplicates.

    Thanks & regards
    Keshav K Shetty

  • Bear Jharls said:

    When we have exactly one each of the values 0-255 in a string of 256 bytes (2048 bits), then the number of possible arrangements of the string is factorial(256), which can be encoded in 1684 bits (rounded up) for all values.

    simple example using factorial(2):

    data = '10'; // length = 2.

    encoding:
    newdata = data[0]; // length = 1.

    decoding:
    data = newdata;
    if data = '0' then data = data + '1' else data = data + '0';

    Encoding and decoding factorials of (256) (or even factorial(2^31)) can be done quite easily using a 32-bit arithmetic encoder (like Mark Nelson’s for example) and a MoveToBack model lowering the high bound at each step.
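    A quick way to double-check the 1684-bit figure is to sum log2(k) for k up to 256 (a small C sketch, link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double bits = 0.0;
        for (int k = 2; k <= 256; k++)
            bits += log2((double)k);             /* log2(256!) by summing logs */
        printf("log2(256!) = %.3f, rounded up = %d bits\n", bits, (int)ceil(bits));
        return 0;
    }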

  • Arsene said:

    It is fairly easy to see that if we have exactly 256 different byte values, it's easy to compress them, simply because each time you compress an item you can exclude it from the list, so it gets easier to code every item (it requires fewer bits, and it's even easier to program than a reverse sort algorithm); a rough sketch follows at the end of this comment. Although it doesn't seem that way, random data usually contains about 128-160 different values. I investigated those possibilities too, and I'm very excited to see how far you'll go.

    As stated before, there is data that just cannot be compressed. In my opinion, though, MOST of the data we believe to be random is not. It can be compressed further. The only problem is that we are currently unable to devise adequate models to compress it.
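    A rough sketch of that shrinking-list idea, using only fixed-width codes (an arithmetic coder does better): code each value as its index among the values not yet used, so later values need fewer bits. Even this simple version costs only 1793 bits per 256-value block, already below the raw 2048.

    #include <stdio.h>

    int main(void) {
        int total = 0;
        for (int remaining = 256; remaining >= 1; remaining--) {
            int width = 0;                       /* ceil(log2(remaining)) */
            while ((1 << width) < remaining)
                width++;
            total += width;                      /* bits for this position */
        }
        printf("fixed-width cost: %d bits (raw block: 2048 bits)\n", total);
        return 0;
    }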

  • Khan said:

    Try this array Shetty,

    char cc[16] = {52, -66, -55, -98, 26, -10, 45, 5, -34, 58, -63, 15, -57, 60, 43, 21};

    I was wondering if you can reduce these 16 bytes.

    We would also be interested in seeing the unsort source you mentioned above, so we can test it at our research center.
