
The Art of Prompt Compression from Claude Shannon to LLMs

For about a year, I’ve been thinking about what compression looks like for an LLM prompt. Compression in computing is pretty well defined. But I was curious about how short a prompt can be while still having enough information to pull the right answer from the model. 

To understand compression in computing, we need to go back to the 1940s, when Claude Shannon was working out how to make the communication of information more efficient. He wanted to answer the question, “How much superfluous content can you squeeze out of data and still convey the same amount of information?”

He came up with the idea of information entropy. Entropy measures the unpredictability or randomness in a set of data. When things are predictable or repeated, you can compress them. One method for doing this is Run-Length Encoding (RLE). RLE works by identifying sequences where the same data value occurs in many consecutive data elements. Instead of storing each element individually, RLE stores a single data value and a count of how many times it occurs. For example, consider a simple text with the sequence “AAAAAABBBCCDAAA.” Using RLE, this sequence would be compressed to “6A3B2C1D3A.” This means there are six ‘A’s, followed by three ‘B’s, two ‘C’s, one ‘D’, and three ‘A’s.
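To make both ideas concrete, here is a minimal Python sketch: a character-level estimate of Shannon entropy and a simple run-length encoder applied to the example sequence above. The function names are my own, not from any standard library.

```python
import math
from itertools import groupby

def shannon_entropy(text: str) -> float:
    """Average bits of information per character, based on character frequencies."""
    total = len(text)
    counts = {ch: text.count(ch) for ch in set(text)}
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rle_encode(text: str) -> str:
    """Collapse each run of repeated characters into a <count><char> pair."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

data = "AAAAAABBBCCDAAA"
print(rle_encode(data))                  # -> 6A3B2C1D3A
print(round(shannon_entropy(data), 2))   # low entropy: few symbols, long runs
```

The more predictable the sequence (long runs, few distinct symbols), the lower its entropy and the more RLE shrinks it; a string of random characters would barely compress at all.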

But there’s only so much you can compress without losing something. If you want to regenerate a whole file, you need all of that information. But say we just want to regenerate an image or a sound. There’s a whole lot of information in the computer file that we can’t see or hear. We can compress it even more if we throw some of this information away. This method, called lossy compression, is often used for multimedia files like images, audio, and video, where a perfect reproduction is not necessary. For example, JPEG images and MP3 audio files use lossy compression to reduce file sizes significantly, sacrificing some quality in the process.
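As a toy illustration of the idea (not how JPEG or MP3 actually work, which rely on transforms and perceptual models), here is a sketch where quantizing a list of “audio” samples discards fine detail, and the rounded signal becomes repetitive enough to compress well with the run-length encoder from the previous example.

```python
# Toy lossy compression: round samples to a coarse grid, discarding detail.
samples = [0.10, 0.11, 0.12, 0.52, 0.49, 0.51, 0.88, 0.91, 0.90]

def quantize(values, step=0.25):
    """Snap each sample to the nearest multiple of `step` (information is thrown away)."""
    return [round(v / step) * step for v in values]

print(quantize(samples))  # [0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0] -- long runs appear
```

The original samples can never be recovered from the quantized ones; that is the trade lossy compression makes in exchange for a much smaller file.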

So now let’s go a step further. Think about all of the information stored in your brain. In your head is an enormous amount of data, stored in a highly efficient manner. Your brain filters out unnecessary details and retains what’s important, storing it in a biological neural network.

Large Language Models, like OpenAI’s GPT models, store information in a computational neural network loosely similar to the brain. They compress a huge amount of the world’s knowledge into a single model. These models are trained on diverse datasets, ranging from books and articles to websites and social media posts. The training process involves identifying patterns and relationships within the data, effectively compressing that vast amount of information into a smaller, more manageable form. This compressed knowledge allows the model to generate relevant and coherent responses to a wide variety of prompts, mimicking human-like understanding and communication.

Now let’s move on to LLM prompts. Prompts are instructions to pull out information from this model. If we want something “obvious” from the LLM, the prompt can be quite short. Think of a knitting pattern: a simple scarf needs few instructions, while an intricate sweater requires detailed directions. The more intricate the pattern, the more information is needed to ensure it turns out well. These complex requests have higher entropy because they contain more uncertainty and require more detailed information to achieve the desired result.

So entropy in an LLM prompt is about how much uncertainty, complexity, and variety there is in the text you give the model. It’s about how surprising the prompt is to the model and how many twists and turns there are. Imagine you’re asking a question that no one would expect, with lots of details crammed into it. That’s a high-entropy prompt. The model, or computer program, then has to dig through all its knowledge to give a complex and interesting answer.

Just like computational compression, LLM prompt compression is all about entropy. We can make prompts shorter by cutting out the extra parts and keeping only the key details, producing prompts that are short but still clear. But there’s a limit to how small they can be made. It all comes back to entropy and how novel and surprising we want the answer to be: the more the answer differs from what the model expects, the more entropy the prompt has to carry, and the less it can be compressed.
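One rough way to see this in practice is to count tokens before and after trimming a prompt down to its key details. The sketch below assumes the tiktoken tokenizer library; the prompts themselves are invented examples.

```python
import tiktoken

# cl100k_base is the tokenizer used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Hello! I was hoping that, if it isn't too much trouble, you could possibly "
    "write me a short poem, maybe four lines or so, about the ocean at sunset."
)
compressed = "Write a four-line poem about the ocean at sunset."

print(len(enc.encode(verbose)), "tokens ->", len(enc.encode(compressed)), "tokens")
```

The compressed prompt keeps the constraints that actually carry information — four lines, the ocean, sunset — and drops the polite filler the model can already predict. That surviving core is the entropy, and it is the part you can never compress away.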