posted an update Jul 23
LazyLLM - Unusual Collab (Apple & Meta) Yields Impactful Work

LLM inference typically consists of two stages: prefilling and decoding. In the prefilling stage, the model processes the entire (already tokenized) input prompt, computing and caching key-value (KV) pairs for each token, which can be time-consuming for long prompts. This is followed by the decoding stage, where the model generates tokens sequentially, reusing the cached KVs.
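To make the two stages concrete, here is a toy sketch in plain Python. It is not how any real LLM is implemented: it assumes a single attention head and uses each token's embedding directly as its query, key, and value (no learned projections). The names `attend`, `prefill`, and `decode_step` are made up for this illustration.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def prefill(prompt_embeddings):
    """Prefill: visit every prompt token once and cache its key/value pair."""
    cache = {"keys": [], "values": []}
    output = None
    for emb in prompt_embeddings:
        # Toy assumption: key == value == query == the embedding itself.
        cache["keys"].append(emb)
        cache["values"].append(emb)
        output = attend(emb, cache["keys"], cache["values"])
    return cache, output

def decode_step(cache, new_embedding):
    """Decode: add only the new token to the cache, then reuse all cached KVs."""
    cache["keys"].append(new_embedding)
    cache["values"].append(new_embedding)
    return attend(new_embedding, cache["keys"], cache["values"])

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache, first_output = prefill(prompt)         # cost grows with prompt length
next_output = decode_step(cache, [0.5, 0.5])  # one new token per step
```

The prefill loop touches every prompt token before anything is generated, which is why long prompts delay the first output token, while each decode step only adds a single entry to the cache.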

LazyLLM introduces a dynamic token pruning technique. Instead of computing KVs for all tokens during prefilling, LazyLLM selectively processes only the most important tokens based on attention scores, deferring less important ones to later steps if needed. It uses progressive token pruning across transformer layers and introduces an Aux Cache to store hidden states of pruned tokens.
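A rough sketch of one pruning step at a single layer might look like the following. It is an illustration under stated assumptions, not the paper's implementation: importance is assumed to be the attention score the last prompt token assigns to each token, the last token is assumed to always be kept, and `prune_layer`, `keep_ratio`, and the dict-based `aux_cache` are hypothetical names.

```python
import math

def prune_layer(tokens, scores, keep_ratio, aux_cache):
    """Keep the highest-scoring tokens at one layer; park the rest.

    tokens:     list of (original_position, hidden_state) still active
    scores:     attention score per active token (assumed: from the last token)
    keep_ratio: fraction of active tokens to keep at this layer
    aux_cache:  dict original_position -> hidden_state of pruned tokens
    """
    keep_count = max(1, math.ceil(len(tokens) * keep_ratio))
    last = len(tokens) - 1
    # Rank every token except the last one by its score, highest first.
    ranked = sorted(range(last), key=lambda i: scores[i], reverse=True)
    # Assumption: the last token is always kept, since it predicts the next one.
    keep = set(ranked[:keep_count - 1]) | {last}
    kept = []
    for i, (position, state) in enumerate(tokens):
        if i in keep:
            kept.append((position, state))
        else:
            # Store the hidden state so the token can be revived later
            # without recomputing the layers below this one.
            aux_cache[position] = state
    return kept

aux_cache = {}
tokens = [(0, "h0"), (1, "h1"), (2, "h2"), (3, "h3")]
scores = [0.10, 0.50, 0.05, 0.35]
kept = prune_layer(tokens, scores, keep_ratio=0.5, aux_cache=aux_cache)
# kept      == [(1, "h1"), (3, "h3")]
# aux_cache == {0: "h0", 2: "h2"}
```

Calling `prune_layer` again on `kept` at a later layer, with that layer's scores, gives the progressive behavior: fewer tokens survive the deeper you go, and anything dropped stays retrievable from `aux_cache` by its original position.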

This approach significantly reduces the time-to-first-token (TTFT) and overall generation time while maintaining accuracy across various tasks. LazyLLM outperforms baseline techniques like random token dropping and static pruning, and can be easily integrated into existing LLMs without fine-tuning, offering a practical solution for accelerating LLM inference, especially in long context scenarios.

When you prompt a large language model (LLM), it usually looks at every single word or subword (called a token) in your prompt before generating a response. This can be time-consuming, especially for very long prompts. This paper introduces a new technique that addresses this problem by being more selective. Instead of looking at every word right away, it focuses only on the most important words first. It decides which words are important based on how much attention the model gives them. If it needs other words later, it can go back and look at them then. This approach is like skimming a text for key information before reading it in detail.

Read More:

4 bit quants.. lol.