CVE-2025-23320

NVIDIA · Multiple Products

A high-severity denial-of-service vulnerability in the Python backend of the NVIDIA Triton Inference Server, affecting both Windows and Linux deployments.

Executive summary

A high-severity vulnerability has been identified in the NVIDIA Triton Inference Server, affecting both Windows and Linux deployments. An unauthenticated attacker can remotely send a specially crafted large request to the server, causing it to crash and resulting in a denial-of-service condition. This can disrupt critical business functions that rely on AI and machine learning models, making the server and its hosted models unavailable.

Vulnerability

The vulnerability exists within the Python backend of the NVIDIA Triton Inference Server. The server fails to properly validate the size of incoming requests before processing them and allocating resources. An attacker can exploit this by sending a request with an exceptionally large payload, forcing the Python backend to attempt a shared-memory allocation that exceeds the pre-configured limit (shm-size, e.g. the container's --shm-size setting). The failed allocation causes the server process to terminate abruptly, resulting in a denial-of-service. Exploitation requires network access to the Triton Inference Server endpoint but does not require any authentication.
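On Linux, the shared memory the Python backend relies on is backed by /dev/shm, whose capacity is what --shm-size sets for a container. A minimal sketch for checking that capacity as part of an environment review (the path and helper name are illustrative, not part of Triton's tooling):

```python
import shutil

def shm_capacity_bytes(path: str = "/dev/shm") -> int:
    """Return the total size of the shared-memory filesystem backing `path`.

    On a containerized Triton deployment this reflects the --shm-size
    limit that an oversized request could cause the Python backend to exceed.
    """
    return shutil.disk_usage(path).total
```

Comparing this value against the largest request payloads your models legitimately accept helps gauge how much headroom the deployment has.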

Business impact

This vulnerability is rated as High severity with a CVSS score of 7.5, primarily impacting service availability. Successful exploitation will lead to a denial-of-service, rendering the Triton Inference Server and all hosted AI/ML models inoperative. For organizations that rely on these models for real-time decision-making, analytics, fraud detection, or customer-facing applications, the impact includes operational disruption, potential revenue loss, and reputational damage. The ease of exploitation (a single large request) increases the risk of targeted attacks aimed at disrupting key business processes.

Remediation

Immediate Action: Organizations must identify all instances of the NVIDIA Triton Inference Server within their environment and apply the security updates provided by the vendor immediately. System administrators should prioritize patching for internet-facing or business-critical servers to mitigate the risk of disruption. After patching, monitor server logs and performance to ensure the update has been applied successfully and the service is stable.
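After patching, the server's version and health can be confirmed through Triton's standard v2 HTTP endpoints (GET /v2 for server metadata, GET /v2/health/ready for readiness). A minimal sketch, assuming the default HTTP port 8000; adjust the base URL to your deployment:

```python
import json
import urllib.error
import urllib.request

def triton_version(base_url: str) -> str:
    """Fetch the server version from Triton's v2 metadata endpoint."""
    with urllib.request.urlopen(f"{base_url}/v2", timeout=3) as resp:
        meta = json.load(resp)
    return meta.get("version", "unknown")

def triton_ready(base_url: str) -> bool:
    """Triton returns HTTP 200 on /v2/health/ready when it can serve requests."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=3) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False
```

Running both checks against each patched instance (e.g. `triton_version("http://localhost:8000")`) confirms the expected fixed version is live and the service is stable.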

Proactive Monitoring:

  • Log Analysis: Review Triton server logs for any error messages related to shared memory allocation failures (e.g., "failed to allocate shared memory segment"), out-of-memory errors, or unexpected process terminations.
  • Network Traffic Analysis: Monitor network traffic for unusually large HTTP/gRPC requests directed at Triton server endpoints. Establish a baseline for normal request sizes and alert on significant deviations.
  • System Health: Monitor system resource utilization, specifically focusing on shared memory usage (/dev/shm on Linux) and the stability of the Triton server process. Configure alerts for unexpected crashes or service restarts.

Compensating Controls: If immediate patching is not feasible, implement the following controls to reduce risk:

  • Request Filtering: Place a reverse proxy or Web Application Firewall (WAF) in front of the Triton Inference Server. Configure rules to enforce a maximum request body size, dropping any requests that exceed a reasonable threshold before they reach the vulnerable application.
  • Access Control: Restrict network access to the Triton server to only trusted and authorized clients. Avoid exposing the server directly to the internet if possible.
  • Resource Isolation: If running in a containerized environment, ensure resource limits (CPU, memory) are properly configured to limit the impact of a crash and enable automated restarts.
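The request-filtering control can be as simple as a body-size limit at a reverse proxy. A minimal nginx sketch, assuming Triton's HTTP endpoint on its default port 8000; the 10 MB threshold is a placeholder to be tuned to the largest legitimate inputs your models accept:

```nginx
upstream triton_backend {
    server 127.0.0.1:8000;  # assumed Triton HTTP port
}

server {
    listen 80;

    location / {
        # Reject oversized bodies with 413 before they reach Triton.
        client_max_body_size 10m;
        proxy_pass http://triton_backend;
    }
}
```

Note that this covers the HTTP endpoint only; gRPC traffic terminated elsewhere needs an equivalent message-size limit at that layer.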

Exploitation status

Public Exploit Available: False

Analyst recommendation

Given the high CVSS score of 7.5 and the critical role of inference servers in modern AI-driven applications, we strongly recommend that organizations prioritize the immediate patching of CVE-2025-23320. Although there is no evidence of active exploitation at this time, the low complexity of the attack makes it a significant risk to service availability. If patching must be delayed for operational reasons, implementing compensating controls, particularly request size limiting at the network edge, is critical to prevent service disruption.