NVIDIA has announced support for Google DeepMind’s latest open AI model, DiffusionGemma, across its RTX and DGX hardware lineup. The integration lets developers and researchers run fast text generation locally on NVIDIA GPUs and software, without relying on cloud services or paying per-token fees.
DiffusionGemma generates tokens in parallel, denoising up to 256 tokens at once per step, whereas traditional autoregressive models generate one token at a time. It pairs a diffusion head with DeepMind’s 26-billion-parameter Gemma 4 architecture and activates 3.8 billion of those parameters per step, for a claimed speedup of up to four times over comparable models.
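To make the difference between the two decoding styles concrete, the sketch below counts forward passes for each. It is an illustration only, not DiffusionGemma's actual scheduler: the function names are invented here, and the number of denoising steps spent on each 256-token block (`denoise_steps`) is a hypothetical value that the announcement does not give.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder runs one forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, block_size: int = 256,
                     denoise_steps: int = 16) -> int:
    """Forward passes when tokens are denoised in parallel blocks.

    block_size follows the 256-token figure quoted in the article.
    denoise_steps is an assumed value chosen for illustration.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * denoise_steps


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes:", diffusion_passes(n))
```

Fewer passes does not translate directly into wall-clock speedup, since each diffusion pass processes a whole block and may cost more than a single autoregressive step.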
NVIDIA’s implementation builds on its Tensor Core architecture and CUDA software, and the company says it runs optimized out of the box, with no additional tuning. NVIDIA reported more than 1,000 tokens per second on a single H100 Tensor Core GPU, up to 800 tokens per second on DGX Station systems, and around 150 tokens per second on DGX Spark, figures it says place these systems among the fastest local inference options available.
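As a rough guide to what these throughput figures mean in practice, the following sketch converts tokens per second into the time needed to generate a response of a given length. The rates are the reported numbers quoted above, the dictionary labels are shorthand chosen here, and the calculation ignores prompt processing and any startup latency.

```python
# Reported throughput figures from the announcement, in tokens per second.
REPORTED_TOKENS_PER_SECOND = {
    "single H100 GPU": 1000,
    "DGX Station": 800,
    "DGX Spark": 150,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    for system, rate in REPORTED_TOKENS_PER_SECOND.items():
        seconds = generation_seconds(2000, rate)
        print(f"{system}: {seconds:.2f} s for 2,000 tokens")
```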
DiffusionGemma is released as an open-weight model under the Apache 2.0 license and supports multiple precision formats, including BF16 and NVFP4. Its context window extends to 256,000 tokens, widening the range of large-scale language and image tasks it can handle. NVIDIA’s support covers GeForce RTX GPUs, RTX PRO platforms, and DGX systems ranging from the compact DGX Spark to high-end datacenter-grade hardware.
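The choice of precision format largely determines whether the weights fit on a given GPU. The back-of-the-envelope sketch below estimates weight storage for a 26-billion-parameter model in each format. It assumes every weight is stored in the chosen format (real checkpoints often keep some layers at higher precision) and it ignores activations and cache memory. The NVFP4 figure assumes 4-bit values plus one 8-bit scale per 16-value block, or about 4.5 bits per weight.

```python
TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

# Approximate storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    # Assumption: 4-bit values plus an 8-bit scale shared by 16 values.
    "NVFP4": 4.0 + 8.0 / 16.0,
}


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in BITS_PER_WEIGHT.items():
        print(f"{fmt}: ~{weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, BF16 weights need roughly 52 GB and NVFP4 weights roughly 15 GB, which is the difference between requiring a large-memory system and fitting on a single high-end workstation GPU.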
The release works out of the box with Hugging Face Transformers, vLLM, and Unsloth. NVIDIA’s DGX Spark personal AI supercomputer, powered by the Grace Blackwell Superchip with unified memory, supports local prototyping, fine-tuning, and full agent workflows.
NVIDIA RTX PRO 6000 series workstations add low-latency generation and agentic loop processing, suited to complex, real-time AI applications in professional environments. With the turnkey software stack, developers can use DiffusionGemma right after deployment, without manual optimization.

