NVIDIA is pushing local AI performance forward with new optimizations for Google DeepMind’s DiffusionGemma, an experimental open model designed to generate text faster by using a diffusion-based approach.
Unlike most large language models, which produce responses one token at a time, DiffusionGemma generates multiple tokens in parallel. This lets the model produce whole blocks of text more efficiently, opening the door to faster responses for developers, researchers and enthusiasts running AI workloads locally.
The model has been optimized for NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform, NVIDIA DGX Spark systems and DGX Station, giving users more ways to experiment with advanced AI without relying entirely on cloud infrastructure.
What Makes DiffusionGemma Different?
Most popular AI chatbots and language models use an autoregressive process. That means they predict the next token, then the next, and continue step by step until a response is complete.
DiffusionGemma takes a different approach. Inspired by the diffusion models used in image generation, it starts from noise and refines an entire block of text together rather than writing it left to right. NVIDIA says the model can denoise up to 256 tokens per step, instead of producing only one token at a time.
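The difference between the two approaches can be sketched by counting sequential model passes. This is a rough illustration, not a description of DiffusionGemma's actual sampler: the 256-token block size comes from the figure NVIDIA quotes, while `refine_steps` (the number of denoising passes spent on each block) is a hypothetical value chosen only for the example.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one sequential pass per generated token."""
    return num_tokens


def block_diffusion_passes(
    num_tokens: int,
    block_size: int = 256,   # tokens updated per pass, per the figure quoted above
    refine_steps: int = 8,   # hypothetical: denoising passes per block
) -> int:
    """Sequential passes when every pass refines a whole block of tokens."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    print(autoregressive_passes(n))   # 1024
    print(block_diffusion_passes(n))  # 32
```

Under these assumed numbers a 1,024-token response needs 32 sequential passes instead of 1,024. Each diffusion pass does more work than a single autoregressive step, so the real speedup is smaller than the ratio of pass counts.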
This parallel generation method could be especially useful for low-latency AI tasks, including:
- Interactive AI chat
- Local AI assistants
- Agentic AI workflows
- Developer prototyping
- Research experiments
- On-device AI applications
For users who want fast, responsive AI running on local hardware, this could be a major step forward.
Built on Google DeepMind’s Gemma Architecture
DiffusionGemma is built on Google DeepMind’s Gemma 4 architecture. According to NVIDIA, the model uses a 26-billion-parameter mixture-of-experts design that activates only a fraction of its parameters at each step.
This design helps balance performance and efficiency. By combining Google’s Gemma architecture with a diffusion-based generation method, DiffusionGemma aims to deliver high-speed text generation while remaining practical for local AI systems.
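To make the mixture-of-experts idea concrete, here is a minimal sketch of generic top-k expert routing. It is not DiffusionGemma's router: the expert functions, the router scores and the choice of `k = 2` are all hypothetical, and a real model routes vectors through neural network layers rather than scalars through lambdas.

```python
import math


def route_top_k(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts; the others are never evaluated.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}


def moe_forward(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum of the outputs of the selected experts only."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical toy experts and router scores, for illustration only.
toy_experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
toy_logits = [0.1, 2.0, -1.0, 2.0]

if __name__ == "__main__":
    print(sorted(route_top_k(toy_logits)))            # [1, 3]
    print(moe_forward(3.0, toy_experts, toy_logits))  # 7.5
```

Only two of the four toy experts run for this input, which is the sense in which a mixture-of-experts model activates a portion of its parameters per step.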
The model is also open-weight and available under the Apache 2.0 license, making it more accessible for developers and researchers who want to test, adapt or deploy it in their own workflows.
NVIDIA RTX GPUs Give DiffusionGemma a Performance Boost
NVIDIA says DiffusionGemma’s design fits well with GPU acceleration. Traditional token-by-token language models are often limited by memory bandwidth, because the GPU has to re-read the model’s weights from memory for every token it generates. Diffusion-style generation, on the other hand, relies more heavily on parallel computation, which is where NVIDIA GPUs are strongest.
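The memory-bandwidth argument can be illustrated with a back-of-the-envelope bound. The sketch below assumes that every pass must stream the active weights from memory once, and every number in the example (bandwidth, active parameter count, bytes per parameter, tokens per pass) is hypothetical rather than a published DiffusionGemma or GPU specification.

```python
def bandwidth_bound_tokens_per_s(
    mem_bandwidth_gb_s: float,      # hypothetical GPU memory bandwidth, GB/s
    active_params_billions: float,  # hypothetical active parameters per pass
    bytes_per_param: float,         # e.g. 2 for 16-bit weights
    tokens_per_pass: int = 1,       # 1 for autoregressive decoding
) -> float:
    """Upper bound on tokens/s if each pass streams the active weights once."""
    bytes_per_pass = active_params_billions * 1e9 * bytes_per_param
    passes_per_s = (mem_bandwidth_gb_s * 1e9) / bytes_per_pass
    return passes_per_s * tokens_per_pass


if __name__ == "__main__":
    # One token per pass: the weight read is paid for by a single token.
    print(bandwidth_bound_tokens_per_s(1000, 4, 2))
    # Sixteen tokens per pass: the same weight read is shared by all of them.
    print(bandwidth_bound_tokens_per_s(1000, 4, 2, tokens_per_pass=16))
```

The bound ignores compute limits and cache traffic, and in a mixture-of-experts model a larger block of tokens tends to touch more experts and therefore more weights, so it overstates what real hardware delivers. It only shows why amortizing one weight read across many tokens shifts the bottleneck from bandwidth toward compute.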
Using NVIDIA Tensor Cores and the CUDA software stack, DiffusionGemma can run efficiently across several NVIDIA platforms.
NVIDIA reported performance of up to 1,000 tokens per second on a single H100 Tensor Core GPU and up to 2,000 tokens per second on DGX Station. The company says this can be roughly 4x faster than an equivalent autoregressive model in similar single-user scenarios.
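A quick calculation shows what those figures would mean for a single response. This is arithmetic on the quoted numbers only: the 500-token response length is an arbitrary example, and because the throughput is an "up to" figure, the results are best-case values rather than measurements.

```python
def response_time_s(num_tokens: int, tokens_per_s: float) -> float:
    """Seconds to generate a response at a steady throughput."""
    return num_tokens / tokens_per_s


REPORTED_TPS = 1000.0  # "up to" single-GPU figure quoted above
SPEEDUP = 4.0          # "roughly 4x" figure quoted above

# Throughput of the equivalent autoregressive model implied by the two figures.
IMPLIED_BASELINE_TPS = REPORTED_TPS / SPEEDUP

if __name__ == "__main__":
    tokens = 500  # arbitrary example response length
    print(IMPLIED_BASELINE_TPS)                           # 250.0
    print(response_time_s(tokens, REPORTED_TPS))          # 0.5
    print(response_time_s(tokens, IMPLIED_BASELINE_TPS))  # 2.0
```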
Local AI Without the Cloud
One of the biggest advantages of DiffusionGemma is its ability to run locally. That means users can test and build AI systems without depending on cloud-based APIs or paying per-token usage fees.
Local deployment is becoming increasingly important as businesses, developers and researchers look for more control over AI workloads. Running models on local machines can help improve privacy, reduce latency and support offline experimentation.
NVIDIA says DiffusionGemma can run on several local AI platforms, including:
- NVIDIA DGX Spark
- NVIDIA RTX PRO 6000 workstations
- NVIDIA DGX Station
- GeForce RTX GPUs, with llama.cpp support coming soon
This makes the model relevant not only for AI labs and enterprise teams, but also for individual developers with powerful RTX hardware.
Developer Support Through Hugging Face, vLLM and Unsloth
NVIDIA says DiffusionGemma has day-one support across popular AI development tools. Developers can begin testing the model through Hugging Face Transformers, while vLLM provides support for higher-throughput inference.
For fine-tuning and customization, DiffusionGemma is supported through Unsloth and NVIDIA NeMo. This gives developers options to adapt the model for specialized tasks, domains or local agent workflows.
NVIDIA is also providing playbooks for systems such as DGX Spark, RTX PRO and DGX Station, helping users set up local environments more quickly.
Why This Matters for AI Developers
DiffusionGemma highlights a growing shift in artificial intelligence toward faster, more capable models that can run locally.
As AI agents, coding assistants and personal AI tools become more common, response speed matters. A model that can generate text in larger parallel blocks may help reduce delays in workflows where users need fast iteration.
For developers building local AI applications, this could improve the experience of running assistants, agents and research tools directly on personal or workstation hardware.
It also strengthens NVIDIA’s standing in the local AI ecosystem, where its RTX and DGX platforms are increasingly presented as powerful alternatives to cloud-only AI deployment.
The Future of Local AI Generation
DiffusionGemma is still experimental, but its parallel generation approach could point to a new direction for AI text models. Instead of relying only on traditional token-by-token generation, future models may use diffusion-style techniques to improve speed and responsiveness.
With NVIDIA optimization, Google DeepMind’s DiffusionGemma could become an important test case for how open models perform on local AI hardware.
For AI developers, researchers and enthusiasts, the message is clear: local AI is becoming faster, more flexible and more practical.
Key Takeaway
NVIDIA’s optimization of Google DeepMind’s DiffusionGemma shows how diffusion-based text generation could make local AI significantly faster. By generating text in parallel and running efficiently on RTX and DGX systems, DiffusionGemma offers a promising path for low-latency AI applications outside the cloud.

