Apple and NVIDIA Enhance LLM Text Generation Speed

Apple and NVIDIA have collaborated to accelerate large language model (LLM) inference. Apple's open-source Recurrent Drafter (ReDrafter) speculative decoding technique, now integrated with NVIDIA TensorRT-LLM, significantly speeds up text generation on NVIDIA GPUs.

How ReDrafter Works

ReDrafter is a speculative decoding approach: a small recurrent draft model proposes several candidate tokens, and the large target model verifies them, so multiple tokens can be accepted per decoding step instead of one. ReDrafter combines this drafting with beam search and dynamic tree attention to make the proposals more accurate and the verification more efficient. Integrating the technique into NVIDIA TensorRT-LLM lets LLMs served on NVIDIA GPUs benefit from it directly.
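
To make the draft-and-verify idea concrete, here is a minimal, illustrative Python sketch of a speculative decoding loop. It is not ReDrafter itself: the toy target_model and draft_model functions are placeholders, it drafts a single sequence rather than a beam, and it omits dynamic tree attention and the single-pass batched verification a real implementation would use.

```python
import random

random.seed(0)

VOCAB = list("abcdefgh")


def target_model(prefix):
    """Stand-in for the large target LLM: deterministically picks the next
    token for a prefix so the example stays self-contained."""
    return VOCAB[(len(prefix) * 3) % len(VOCAB)]


def draft_model(prefix, k):
    """Stand-in for a small, cheap draft model: proposes k candidate tokens.
    It usually agrees with the target model but occasionally guesses wrong."""
    out = []
    for _ in range(k):
        token = target_model(prefix + "".join(out))
        if random.random() < 0.2:
            token = random.choice(VOCAB)
        out.append(token)
    return out


def speculative_decode(prefix, num_tokens, draft_len=4):
    """Draft-and-verify loop: the draft model proposes draft_len tokens, the
    target model checks them, and every verified token is accepted at once.
    (A real implementation verifies all drafted tokens in one forward pass,
    which is where the speed-up comes from.)"""
    generated = ""
    while len(generated) < num_tokens:
        draft = draft_model(prefix + generated, draft_len)
        accepted = []
        for token in draft:
            expected = target_model(prefix + generated + "".join(accepted))
            if token == expected:
                accepted.append(token)
            else:
                # First mismatch: keep the target model's own token and stop.
                accepted.append(expected)
                break
        generated += "".join(accepted)
    return generated[:num_tokens]


print(speculative_decode("prompt: ", 16))
```

In the best case every drafted token is accepted, so the expensive target model effectively yields several tokens per verification step rather than one.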

Performance Improvements

Benchmarks on a production-scale model show a 2.7x increase in generated tokens per second when decoding with ReDrafter in NVIDIA TensorRT-LLM. This translates to lower latency for users, less GPU hardware needed for the same workload, and reduced power consumption.
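
For a rough sense of what a 2.7x throughput gain means in practice, the snippet below works through the arithmetic with a hypothetical baseline of 100 tokens per second; the baseline and response length are assumptions for illustration, not published figures.

```python
# Illustrative arithmetic only: the baseline throughput is a hypothetical
# figure chosen for the example, not a published benchmark result.
baseline_tps = 100.0        # assumed baseline tokens/second
speedup = 2.7               # reported ReDrafter + TensorRT-LLM speed-up
accelerated_tps = baseline_tps * speedup

response_tokens = 500       # assumed length of a long response
print(f"baseline:       {response_tokens / baseline_tps:.1f} s")      # 5.0 s
print(f"with ReDrafter: {response_tokens / accelerated_tps:.1f} s")   # ~1.9 s
```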

Impact and Availability

This collaboration makes ReDrafter's accelerated token generation readily available to ML developers using NVIDIA GPUs. The improved efficiency reduces computational costs and latency in production applications.