The AWS outage that took down a sizable portion of the web in October was caused by something almost mundane: automated systems that tried to update the same DNS record simultaneously, resulting in chaos that may have cost hundreds of millions of dollars.
And just a few weeks later, a similar issue arose at Cloudflare, where the vendor’s machine learning-based Bot Management system fell foul of one of its own database systems – which resulted in Cloudflare’s servers “panicking” and eventually taking out yet more of the web.
For Jeff Gray, CEO of network automation vendor Gluware, these incidents illustrate a problem the industry isn’t talking about enough. With enterprises racing to automate, and agentic AI slathered across their infrastructure stack, who’s making sure those agents don’t break things at scale?