Datadog launches GPU monitoring tool for AI workload optimization
Datadog Inc. (NASDAQ: DDOG) announced the general availability of its GPU Monitoring product, designed to help organizations manage graphics processing unit costs and performance as they scale artificial intelligence projects.
The monitoring tool provides visibility across AI infrastructure, linking GPU fleet health, cost, and performance data to specific teams and workloads. According to the company, GPU instances account for 14 percent of compute costs, a share that makes AI spending harder for organizations to control.
"GPU instances account for 14 percent of compute costs—which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways," said Yanbing Li, Chief Product Officer at Datadog.
The product addresses what the company describes as a gap in existing GPU monitoring tools, which typically report device health metrics but offer no visibility into resource contention or workload failures. GPU Monitoring connects fleet telemetry to the workloads using those resources, so platform engineering and machine learning teams can investigate issues together.
Key features include capacity planning based on usage patterns, workload troubleshooting that correlates stalled processes with the underlying GPUs, proactive identification of unhealthy GPUs, and utilization tracking that flags overprovisioned or underutilized resources.
Kai Huang, Head of Product at Hyperbolic, said the tool provides visibility into utilization, memory, power, and thermals through customizable dashboards. "We can go from a model latency spike straight to the underlying GPU metrics without switching tools," Huang said.
The product is available to Datadog customers globally and integrates with the company's existing LLM Observability platform.
