Education • September 15, 2023
How Speculative Sampling Speeds up Large Language Model Inference

Forefront Team

Artificial intelligence has taken great leaps in recent years, with models like GPT-4 and Claude showing remarkable abilities in language generation. However, these large language models come with a major drawback: they are slow to generate text, because each new token requires a full forward pass through the model. This is problematic for applications like conversational agents that require quick response times.
In a new paper, researchers from DeepMind present an innovative algorithm called speculative sampling that can accelerate the decoding of large language models by up to 2.5x, without compromising the quality of the samples. This speedup is achieved by enabling the generation of multiple tokens from each call to the model, rather than just one token at a time.
The key insight is that having the full model score a short multi-token continuation, proposed by a smaller and faster model called the draft model, takes roughly as long as sampling a single token from the full model. The paper refers to the full model as the target model, since it represents the distribution we ultimately want to sample from.
After the draft model generates a short sequence of, say, 4 tokens, the target model scores all of these tokens in a single forward pass. A modified rejection sampling scheme then accepts or rejects each draft token in turn, in a way that exactly recovers the distribution of the target model.
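To make the shape of the loop concrete, here is a minimal Python sketch of one speculative decoding step. The `draft_model`, `target_model`, and `propose_draft` functions are toy stand-ins invented for illustration (they return made-up distributions over a tiny vocabulary), not the paper's actual models; the point is only the structure: draft K tokens cheaply, then score every needed position with a single batched target call.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8  # toy vocabulary, purely for illustration
K = 4           # number of draft tokens proposed per loop

def draft_model(prefix):
    """Toy stand-in for the small, fast draft model: returns a
    next-token probability distribution given the prefix."""
    logits = rng.standard_normal(VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

def target_model(prefixes):
    """Toy stand-in for the large target model: scores a batch of
    prefixes in one call, returning one next-token distribution each."""
    uniform = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)
    return np.stack([0.5 * uniform + 0.5 * draft_model(p) for p in prefixes])

def propose_draft(prefix, k=K):
    """Sample k tokens autoregressively from the draft model, keeping
    the distribution each token was drawn from (needed later for the
    accept/reject step)."""
    tokens, dists = [], []
    for _ in range(k):
        p = draft_model(prefix + tokens)
        tokens.append(int(rng.choice(VOCAB_SIZE, p=p)))
        dists.append(p)
    return tokens, dists

prefix = [3, 1, 4]
draft_tokens, draft_dists = propose_draft(prefix)

# One target-model call scores the prefix extended by 0..K draft tokens:
# K distributions to judge the draft, plus one extra in case every draft
# token is accepted and a bonus token can be sampled.
target_dists = target_model([prefix + draft_tokens[:i] for i in range(K + 1)])
```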
Intuitively, if the two models agree strongly on a token, it is accepted and added to the final sequence. If not, it is rejected and a replacement token is sampled from the target model. Because several draft tokens can be accepted in each loop, far fewer calls to the expensive target model are needed to generate text.

An appealing aspect of this method is its flexibility in choosing a draft model. The paper discusses options such as distilling a smaller model directly from the target model, but even simply using a smaller version of the target model works surprisingly well, despite never being exposed to the target model's exact distributions during training.
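Returning to the accept/reject step, here is a rough continuation of the toy sketch above (reusing the hypothetical `draft_tokens`, `draft_dists`, and `target_dists` names). It follows the modified rejection sampling rule from the paper: accept a draft token x with probability min(1, q(x)/p(x)), where q is the target distribution and p the draft distribution, and on the first rejection resample from the normalized positive part of q - p.

```python
def accept_or_reject(draft_tokens, draft_dists, target_dists, rng):
    """Modified rejection sampling over a drafted continuation.

    Walks the draft left to right; each token x is accepted with
    probability min(1, q(x) / p(x)). At the first rejection, a
    replacement token is drawn from the normalized positive part of
    (q - p) and the loop stops. If every draft token is accepted,
    one bonus token is sampled from the final target distribution,
    so a single target call can yield up to K + 1 tokens."""
    accepted = []
    for x, p, q in zip(draft_tokens, draft_dists, target_dists):
        if rng.random() < min(1.0, q[x] / p[x]):
            accepted.append(x)              # models agree enough: keep it
        else:
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                 # stop at the first rejection
    bonus = target_dists[-1]
    accepted.append(int(rng.choice(len(bonus), p=bonus)))
    return accepted

new_tokens = accept_or_reject(draft_tokens, draft_dists, target_dists, rng)
print("tokens produced this loop:", new_tokens)
```

The resampling from the positive part of q - p is what makes the overall procedure equivalent to sampling directly from the target model, which is why quality is preserved.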
The authors evaluated speculative sampling on two tasks using a 70 billion parameter model called Chinchilla. For a natural language summarization task called XSum, it achieved speedups of 1.9-2x. Even more impressive was a 2.46x speedup on HumanEval, a challenging code generation task.
Critically, the performance on these tasks was unchanged compared to normal autoregressive sampling. This indicates the draft model was not biasing the results, thanks to the modified rejection sampling scheme preserving the target distribution.
An interesting aspect is the trade-off between longer drafts and more frequent scoring. As the number of tokens generated by the draft model per loop increases, fewer calls to the expensive target model are needed. However, longer drafts are less likely to be accepted in full, which lowers the overall efficiency and acceptance rate and increases the variance in decoding time. The idealized calculation below illustrates this trade-off.
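As a back-of-the-envelope illustration (not a calculation from the paper), suppose each draft token were accepted independently with some fixed probability. The expected number of tokens produced per target-model call is then a geometric sum, and increasing the draft length K gives diminishing returns once the acceptance rate is less than one:

```python
def expected_tokens_per_target_call(k, accept_rate):
    """Expected tokens generated per target call under the idealized
    assumption that each draft token is accepted independently with
    probability `accept_rate`. A loop yields between 1 and k + 1 tokens."""
    # 1 + a + a^2 + ... + a^k  (geometric series)
    return sum(accept_rate ** i for i in range(k + 1))

for k in (2, 4, 8):
    row = ", ".join(f"a={a}: {expected_tokens_per_target_call(k, a):.2f}"
                    for a in (0.6, 0.8, 0.9))
    print(f"K={k} -> {row} tokens per target call")
```

In practice acceptance is not independent across positions and the draft tokens themselves cost time to generate, which is why the trade-off the authors describe does not keep improving with ever-longer drafts.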
The implications of this work are exciting. The speculative sampling algorithm provides a practical way to reduce the latency of large language models without compromising quality or modifying the model itself. This could enable the use of models with over 100 billion parameters in interactive applications.
The method also complements other techniques for accelerating transformers, like quantization and caching. It will be fascinating to see how far these decoding optimizations can push the limits of foundation models in terms of scale and capabilities. If you found this glimpse into the inner workings of large language models interesting, be sure to check out the original paper.