AI updates
2024-12-22 16:07:29 Pacfic

OpenAI ChatGPT Outage Caused by Telemetry Issue - 8d
OpenAI ChatGPT Outage Caused by Telemetry Issue

OpenAI experienced a significant three-hour outage affecting its ChatGPT, Sora, and developer API services due to a newly deployed telemetry service. This service, designed to collect Kubernetes metrics, triggered resource-intensive API operations across all nodes, overwhelming Kubernetes API servers. The resulting DNS resolution failures and expired cached records led to widespread service disruptions. The incident highlights the risks associated with telemetry services and the importance of robust system design to prevent cascading failures.