Overview
NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server, is open-source inference serving software that streamlines AI model deployment across diverse hardware and software ecosystems. It supports major frameworks, including TensorRT, PyTorch, ONNX, and OpenVINO, and serves real-time, batched, and streaming workloads on NVIDIA GPUs, non-NVIDIA accelerators, and x86 and Arm CPUs. Dynamo-Triton improves throughput and latency through dynamic batching, concurrent model execution, and tuned model configurations. It integrates with Kubernetes for scaling and with Prometheus for monitoring, fitting naturally into DevOps and MLOps workflows. For large language model (LLM) use cases, NVIDIA Dynamo complements it with optimizations such as disaggregated serving and KV-cache offloading to storage, improving LLM inference and multi-node deployment.
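The dynamic batching and concurrent execution features mentioned above are enabled per model in Triton's `config.pbtxt`. The sketch below is illustrative only: the model name, backend choice, tensor names, shapes, and tuning values are all assumptions, not taken from this document.

```protobuf
# Hypothetical config.pbtxt for one model in a Triton model repository.
name: "resnet50_onnx"        # assumed model name
backend: "onnxruntime"       # assumed backend; could be tensorrt, pytorch, etc.
max_batch_size: 32

input [
  {
    name: "input"            # assumed tensor name and shape
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Dynamic batching: the server combines individual inference requests
# into larger batches, waiting briefly to reach a preferred batch size.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

# Concurrent execution: run two instances of this model on GPU 0,
# so multiple batches can execute in parallel.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

With a configuration like this, clients send single requests and the server transparently forms batches, trading a small queuing delay for substantially higher GPU utilization.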
