AI updates
2024-12-22 11:10:55 Pacfic

OpenAI ChatGPT Outage Caused by Telemetry Issue - 8d
Read more: techcrunch.com

OpenAI experienced a significant outage on Wednesday, impacting its ChatGPT, Sora, and developer API services for approximately three hours. The disruption was attributed to a newly deployed telemetry service designed to gather Kubernetes metrics. This service, unfortunately, triggered resource-intensive API operations across all nodes in their Kubernetes clusters. The overwhelming load on the Kubernetes API servers resulted in the control plane going down in many of the large clusters.

The cascading effect of this initial issue led to DNS resolution failures. The service disruption was further prolonged due to the reliance on DNS caching. As cached records expired, services began to fail, as they could not perform the real-time DNS lookups to contact each other. OpenAI acknowledges the incident was not due to a security breach or a new product release, but rather a confluence of system failures interacting in unexpected ways.