Optimize AI Inference Performance with NVIDIA Full-Stack Solutions

The explosion of AI-driven applications has placed unprecedented demands on AI infrastructure and on developers, who must balance delivering cutting-edge performance with managing operational complexity and cost.

NVIDIA is empowering developers with full-stack innovations—spanning chips, systems, and software—that redefine what’s possible in AI inference, making it faster, more efficient, and more scalable than ever before.

Easily deploy high-throughput, low-latency inference

Six years ago, NVIDIA set out to create an AI inference server specifically designed for developers building high-throughput, latency-critical production applications. At the time, many developers were grappling with custom, framework-specific servers that increased complexity, drove up operational costs, and struggled to meet stringent service-level agreements for latency and throughput.

To address this, NVIDIA developed the NVIDIA Triton Inference Server, an open-source platform capable of serving models from any AI framework. By consolidating framework-specific inference servers, Triton streamlined AI inference deployment and increased AI prediction capacity. This approach has made Triton one of the most widely adopted NVIDIA open-source projects, now used by hundreds of leading organizations to deploy production AI models efficiently.
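
Because Triton presents a uniform serving interface regardless of the backend framework, client code stays the same whether the model behind it runs on PyTorch, TensorFlow, ONNX Runtime, or TensorRT. The sketch below is a minimal example using the tritonclient Python package; the server address, model name, and tensor names are placeholders for whatever model repository you deploy.

```python
# Minimal sketch: querying a model served by Triton Inference Server over HTTP.
# Assumes Triton is running locally on port 8000 and serving a hypothetical
# model named "my_model" with one FP32 input "INPUT0" and one output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: a batch of 16 feature vectors of length 128 (placeholder shape).
data = np.random.rand(16, 128).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(response.as_numpy("OUTPUT0").shape)
```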

In addition to Triton, NVIDIA offers a broad ecosystem of AI inference solutions. For developers seeking powerful, customizable tools, NVIDIA TensorRT provides a high-performance deep learning inference library with APIs that enable fine-grained optimizations. NVIDIA NIM microservices provide a flexible framework for deploying AI models across the cloud, data centers, or workstations.
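
As a point of reference for the deployment workflow, NIM microservices expose an OpenAI-compatible HTTP API, so a running NIM container can be queried with standard client libraries. The sketch below assumes a NIM container already serving a Llama model locally on port 8000; the base URL and model identifier are illustrative and should be adjusted to your deployment.

```python
# Minimal sketch: calling a locally deployed NIM microservice through its
# OpenAI-compatible endpoint. The base_url, port, and model name below are
# assumptions; adjust them to match your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example model identifier
    messages=[{"role": "user", "content": "Summarize AI inference in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```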

Optimizations for AI inference workloads

Inference is a full-stack problem today, requiring high-performance infrastructure and efficient software to make effective use of that infrastructure. Inference workloads are also becoming more challenging: model sizes keep growing, latency constraints are tightening, and the number of users relying on these AI services continues to increase. And with the introduction of inference-time scaling, a new paradigm for scaling model intelligence, more compute is being applied during inference to enhance model performance.

These trends mean that it’s important to continue advancing delivered inference performance, even on the same underlying hardware platform. By combining established methods like model parallelism, mixed-precision training, pruning, quantization, and data preprocessing optimization with cutting-edge advancements in inference technologies, developers can achieve remarkable gains in speed, scalability, and cost-effectiveness.

The TensorRT-LLM library incorporates many state-of-the-art features that accelerate inference performance for large language models (LLMs), which are outlined below.
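
For orientation, the high-level TensorRT-LLM LLM API lets you load and run a model in a few lines before layering on the optimizations described below. The sketch assumes a recent TensorRT-LLM release; the Hugging Face model identifier and sampling settings are illustrative.

```python
# Minimal sketch of the TensorRT-LLM high-level LLM API (assumes a recent
# tensorrt_llm release). The model identifier is an illustrative placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["What is AI inference?"]
sampling = SamplingParams(max_tokens=64, temperature=0.2)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```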

Prefill and KV cache optimizations

  • Key-value (KV) cache early reuse: By reusing system prompts across users, the KV cache early reuse feature accelerates time-to-first-token (TTFT) by up to 5x. Flexible KV block sizing and efficient eviction protocols ensure seamless memory management, enabling faster response times even in multi-user environments. (A simplified sketch of the block-reuse idea follows this list.)
  • Chunked prefill: Chunked prefill divides the prefill phase into smaller tasks, enhancing GPU utilization and reducing latency. This innovation simplifies deployment and ensures consistent performance, even with fluctuating user demands.
  • Supercharging multiturn interactions: The NVIDIA GH200 Superchip architecture enables efficient KV cache offloading, improving TTFT by up to 2x in multiturn interactions with Llama models while maintaining high throughput.
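
To make the KV cache early reuse idea concrete, here is a simplified illustration, not the actual TensorRT-LLM implementation: prompt tokens are grouped into fixed-size blocks keyed by a hash of their prefix, so requests that share a system prompt hit the same cached blocks instead of recomputing them.

```python
# Simplified illustration of prefix-based KV cache block reuse. This is a
# conceptual sketch, not TensorRT-LLM's implementation: KV blocks are keyed by
# a hash of all tokens up to and including the block, so requests that share a
# system-prompt prefix hit the same cached blocks.
from hashlib import sha256

BLOCK_SIZE = 16          # tokens per KV block (illustrative)
kv_block_cache = {}      # prefix hash -> precomputed KV block (placeholder object)

def prefix_hash(tokens):
    return sha256(",".join(map(str, tokens)).encode()).hexdigest()

def compute_kv_block(tokens):
    return f"kv({len(tokens)} tokens)"   # stand-in for real attention K/V tensors

def prefill(tokens):
    """Return KV blocks for the prompt, reusing any cached prefix blocks."""
    blocks, reused = [], 0
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        key = prefix_hash(tokens[:end])
        if key in kv_block_cache:
            reused += 1
        else:
            kv_block_cache[key] = compute_kv_block(tokens[end - BLOCK_SIZE:end])
        blocks.append(kv_block_cache[key])
    return blocks, reused

system_prompt = list(range(64))                  # shared system prompt (64 dummy token IDs)
_, reused = prefill(system_prompt + [101, 102])
print("first request, blocks reused:", reused)   # 0: everything is computed
_, reused = prefill(system_prompt + [201, 202])
print("second request, blocks reused:", reused)  # 4: the shared system-prompt blocks
```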

Decoding optimization

  • Multiblock attention for long sequences: Addressing the challenge of long input sequences, TensorRT-LLM multiblock attention maximizes GPU utilization by distributing tasks across streaming multiprocessors (SMs). This technique improves system throughput by more than 3x, enabling support for larger context lengths without additional hardware costs.
  • Speculative decoding for accelerated throughput: Leveraging a smaller draft model alongside a larger target model, speculative decoding enables up to a 3.6x improvement in inference throughput. This approach ensures high-speed, high-accuracy generation of model outputs, streamlining workflows for large-scale AI applications. (A conceptual sketch of the draft-and-verify loop follows this list.)
  • Speculative decoding with Medusa: The Medusa speculative decoding algorithm is available as part of TensorRT-LLM optimizations. By predicting multiple subsequent tokens simultaneously, Medusa boosts throughput for Llama 3.1 models by up to 1.9x on the NVIDIA HGX H200 platform. This innovation enables faster responses for applications that rely on LLMs, such as customer support and content creation.
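
The draft-and-verify loop at the heart of speculative decoding can be shown in a few lines. The sketch below is a simplified greedy variant with toy stand-ins for the draft and target models; it is a conceptual illustration, not the TensorRT-LLM or Medusa implementation.

```python
# Simplified greedy speculative decoding: a small draft model proposes a short
# run of tokens, the larger target model scores the same positions, and the
# longest prefix the target agrees with is accepted. Toy functions stand in
# for real LLMs; this is a conceptual sketch, not TensorRT-LLM's implementation.

def draft_next(context):          # cheap draft model (toy rule)
    return (context[-1] + 1) % 5

def target_next(context):         # expensive target model (toy rule that sometimes disagrees)
    return (context[-1] + 1) % 5 if len(context) % 7 else 0

def speculative_decode(prompt, num_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target model scores the same k positions (in a real system this is
        #    one batched forward pass; here it is a loop for clarity).
        verified = [target_next(tokens + draft[:i]) for i in range(k)]
        # 3. Accept the longest prefix where draft and target agree, then take
        #    one guaranteed token from the target if they diverged.
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        tokens += draft[:accepted]
        if accepted < k:
            tokens.append(verified[accepted])
    return tokens[len(prompt):][:num_tokens]

print(speculative_decode([0, 1, 2], num_tokens=12))
```

When draft and target agree, several tokens are emitted per target-model pass, which is where the throughput gain comes from; when they disagree, the output is still exactly what the target model alone would have produced.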

Multi-GPU inference

  • MultiShot communication protocol: Traditional Ring AllReduce operations can become a bottleneck in multi-GPU scenarios. TensorRT-LLM MultiShot, powered by NVSwitch, reduces communication steps to just two, irrespective of GPU count. This innovation boosts AllReduce speeds by up to 3x, making low-latency inference scalable and efficient.
  • Pipeline parallelism for high-concurrency efficiency: Parallelism techniques require that GPUs be able to transfer data quickly and efficiently, necessitating a robust GPU-to-GPU interconnect fabric for maximum performance. Pipeline parallelism on NVIDIA H200 Tensor Core GPUs achieved a 1.5x throughput increase for Llama 3.1 405B and demonstrated its versatility with a 1.2x speedup for Llama 2 70B in MLPerf Inference benchmarks. MLPerf Inference is a suite of industry-standard inference performance benchmarks developed by the MLCommons consortium. (A configuration sketch follows this list.)
  • Large NVLink domains: The NVIDIA GH200 NVL32 system, powered by 32 NVIDIA GH200 Grace Hopper Superchips connected using the NVLink Switch system, and with TensorRT-LLM improvements, delivers up to 3x faster TTFT for Llama models. With up to 127 petaflops of AI compute, this next-generation architecture sets the stage for unprecedented real-time responsiveness in AI applications.
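
For a sense of how these parallelism strategies are expressed in code, the TensorRT-LLM LLM API accepts tensor- and pipeline-parallel degrees as constructor arguments. The sketch below assumes a recent TensorRT-LLM release and a node with at least eight GPUs; the model name and parallel degrees are illustrative.

```python
# Minimal sketch: requesting tensor and pipeline parallelism through the
# TensorRT-LLM LLM API (assumes a recent tensorrt_llm release and a node with
# at least 8 GPUs). Model name and parallel degrees are illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,     # split each layer's weights across 4 GPUs
    pipeline_parallel_size=2,   # split the layer stack across 2 pipeline stages
)

outputs = llm.generate(
    ["Explain pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```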

Quantization and lower-precision compute 

  • NVIDIA TensorRT Model Optimizer for precision and performance: The NVIDIA custom FP8 quantization recipe in the NVIDIA TensorRT Model Optimizer delivers up to 1.44x higher throughput without sacrificing accuracy. These optimizations enable more cost-effective deployment by reducing latency and hardware requirements for demanding workloads. (A quantization sketch follows this list.)
  • End-to-end full-stack optimization: NVIDIA TensorRT libraries and FP8 Tensor Core innovations ensure high performance across a wide range of devices, from data center GPUs to edge systems. NVIDIA has optimized the Llama 3.2 collection of models for great performance, demonstrating how full-stack software can adaptively unlock efficiency across diverse AI deployment environments.
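
As a reference for the workflow behind these numbers, post-training FP8 quantization with TensorRT Model Optimizer follows a quantize-then-calibrate pattern. The sketch below assumes the nvidia-modelopt package and its default FP8 configuration; the model and calibration data are placeholders, and the exact recipe used for the published results is not shown here.

```python
# Minimal sketch of post-training FP8 quantization with TensorRT Model
# Optimizer (assumes the nvidia-modelopt package and its documented
# quantize/calibrate pattern). The model and calibration data are placeholders.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["The quick brown fox jumps over the lazy dog."] * 8  # placeholder calibration set

def forward_loop(m):
    # Run a few representative batches so activation ranges can be calibrated.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Apply the FP8 recipe; the quantized model can then be exported for
# deployment with TensorRT-LLM.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```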

With these features, as well as many others within Triton and TensorRT-LLM, developers can now deploy LLMs that are not only faster and more efficient but also capable of handling a wider range of tasks and user demands. This opens new opportunities for businesses to enhance customer service, automate complex processes, and gain deeper insights from their data. 

Evaluating inference performance 

Delivering world-class inference performance takes a full technology stack—chips, systems, and software—all contributing to boosting throughput, reducing energy consumption per token, and minimizing costs. 

One key measure of inference performance is MLPerf Inference. This benchmark measures inference throughput under standardized conditions, with results subject to extensive peer review. It is regularly updated to reflect new advances in AI, ensuring that organizations can rely on these results to evaluate platform performance.

In the latest round of MLPerf Inference, NVIDIA Blackwell made its debut, delivering up to 4x more performance than the NVIDIA H100 Tensor Core GPU on the Llama 2 70B benchmark. This achievement was the result of the many architectural innovations at the heart of the Blackwell GPU, including the second-generation Transformer Engine with FP4 Tensor Cores and ultrafast HBM3e GPU memory that delivers 8 TB/s of memory bandwidth per GPU. 

In addition, many aspects of the NVIDIA software stack, including NVIDIA TensorRT-LLM, were re-engineered to make use of new capabilities in Blackwell, such as support for FP4 precision, while continuing to meet the rigorous accuracy target of the benchmark. 

The NVIDIA H200 Tensor Core GPU, available now from server makers and cloud service providers, also achieved outstanding results on every benchmark in the data center category, including the newly added Mixtral 8x7B mixture-of-experts (MoE) LLM, the Llama 2 70B LLM, and the Stable Diffusion XL text-to-image test. As a result of continued software improvements, the Hopper architecture delivered up to 27% more inference performance than in the prior round.

NVIDIA Triton Inference Server, running on a system with eight H200 GPUs, achieved virtually identical performance to the NVIDIA bare-metal submission on the Llama 2 70B benchmark in MLPerf Inference v4.1. This shows that enterprises no longer need to choose between a feature-rich, production-grade AI inference server and peak throughput performance; both can be achieved simultaneously with NVIDIA Triton.

The landscape of AI inference is rapidly evolving, driven by a series of groundbreaking advancements and emerging technologies. Models continue to get smarter, as increases in compute at data center scale enable pretraining larger models. The introduction of sparse mixture-of-experts model architectures, such as GPT-MoE 1.8T, will also help boost model intelligence while improving compute efficiency. These larger models, whether dense or sparse, will require individual GPUs to become much more capable. The NVIDIA Blackwell architecture is set to fuel next-generation generative AI inference.

Each Blackwell GPU features a second-generation Transformer Engine and fifth-generation Tensor Cores with FP4 support. Lower-precision data formats help increase computational throughput and reduce memory requirements, but delivering those performance benefits while maintaining high accuracy requires an incredible amount of software craftsmanship.

At the same time, to serve the most demanding models at brisk, real-time rates, many of the most capable GPUs will need to work in concert to generate responses.

The NVIDIA GB200 NVL72 rack-scale solution creates a 72-GPU NVLink domain that acts as a single massive GPU. For GPT-MoE 1.8T real-time inference, it provides up to a 30x improvement in throughput compared to the prior generation Hopper GPU. 

In addition, the emergence of a new scaling law—test-time compute—is providing yet another way to improve response quality and accuracy for even more complex tasks. This new paradigm, first introduced with the OpenAI o1 model, enables models to “reason” by generating many intermediate tokens before outputting the final result. Reasoning models are particularly helpful in domains such as complex mathematics and generating computer code. This new paradigm is set to fuel a new wave of breakthroughs requiring more computational performance during inference time. 

The path to artificial general intelligence will rely on continued breakthroughs in data center compute performance. Pretraining, post-training, and test-time scaling all depend on state-of-the-art infrastructure running expertly crafted software. The NVIDIA platform is evolving rapidly, with a brisk one-year innovation rhythm, to enable the ecosystem to continue pushing the frontiers of AI. 

Get started

Check out How to Get Started with AI Inference, learn more about the NVIDIA AI Inference platform, and stay informed about the latest AI inference performance updates. 

Watch a demo on how to quickly deploy NVIDIA NIM microservices or read A Simple Guide to Deploying Generative AI with NVIDIA NIM. Optimizations from TensorRT, TensorRT-LLM, and TensorRT Model Optimizer libraries are combined and available through production-ready deployments using NVIDIA NIM microservices.

Published by Nick Comly and Ashraf Eassa