by Michael Haugh, VP of Product Marketing
With Facebook in the news last week after an extended network outage that affected end-users of multiple applications as well as its internal users, questions are being raised as to how it could have been avoided.
Let’s start by saying Facebook is not like most other large enterprises when it comes to network infrastructure; they are in the 1% of mega-scale tech companies serving massive amounts of traffic and transactions. Facebook was quickly forthcoming about the cause of the outage. What became clear is that even the largest organizations with the most complex infrastructures can suffer serious network outages. Facebook is known for building its own hardware, software and management plane (which provides tooling, automation and orchestration). They have also spoken publicly about accepting that outages will happen; their goal is to write software that can detect, react to and recover from outage situations.
What was shared publicly indicates that “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all connections in our backbone network”. An audit tool should have caught the command, but a bug prevented it from stopping it. This change (which is implied to have affected BGP routing) appears to have caused routing issues, which in turn led Facebook’s DNS servers to withdraw their BGP advertisements as part of their failure response. With routing broken and DNS down, most applications became unreachable. The issue was exacerbated by the fact that many internal tools were also offline; to restore service, engineers had to physically access devices in the data center, through a security system that itself went offline during the outage. This unforeseen chain of events is likely prompting some interesting conversations about how to prevent this type of issue going forward and whether the “blast radius” should be limited through segmentation or other means.
A challenge this creates for all network operators is how to trust automation without experiencing the same negative impacts as Facebook and other companies that have had high-profile outages.
The approach to configuration management that Gluware provides helps to de-risk production network changes. With Gluware’s intent-based intelligent network automation, organizations can rely less on manual changes and home-grown scripts, which carry a large internal cost to develop, test and properly maintain for use on a production network. Most large enterprises have been down this path and are struggling to build their own automation. Many of the top Fortune 500 and Global 2000 enterprises are now turning to Gluware for a more intelligent approach, focusing their development resources on integrations and business differentiation instead of low-level network automation.
The top 10 ways in which intelligent network automation from Gluware can de-risk production network changes include:
- Implement an intent-based, declarative approach – Lets the user define the “intended state” of a configuration; the system then reads the current state, compares it to the intent, and renders the commands needed to bring each network device to that intended state
- Provide the ability to create custom abstraction (and guardrails) – An abstraction layer can help to simplify the end-user interface to provide focus and implement guardrails on configuration parameters, instead of purely using low-level vendor-specific commands and semantics
- Provide idempotent network changes – Each time the automation is run, you get the same, predictable results. To achieve this, network automation solutions must be able to read, compare, render required commands, write the changes and validate the changes.
- Provide the ability to “Preview” the changes – See what the automation will do and validate it before committing the change.
- Pre/post-state check – The automation system should provide the ability to define custom pre- and post-change checks, which typically verify the state and operational status of the protocols related to the change. For example, a change affecting BGP can check neighbor state and route counts, and even perform ping, traceroute or DNS lookups.
- Provide deep logging – When things go wrong it is critical to be able to dig into the details and see exactly what happened. These logs accelerate identifying and resolving issues.
- Automate ad-hoc queries – Enable NetOps to automate running commands across many devices and process the results looking for expected output. This can help to find the needle in the haystack when troubleshooting or performing routine checks.
- Config drift monitoring – Monitor the network for change and identify unexpected, unapproved changes, with the ability to see exactly what changed. The check can be triggered by Syslog to identify when a device is being configured.
- Run regular config audits – Perform regular audits to ensure the network config is in the correct state and compliant with any third-party regulatory requirements, without the risk of a rogue command creating an unintended config change like the one at Facebook.
- Custom dashboards and reporting – NetOps should know the normal baselines for the network, with visual dashboards and detailed reports of regularly monitored information related to performance, changes and security.
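To make the intent-based, idempotent pattern above concrete, here is a minimal generic sketch (not Gluware's implementation; the dict-based device model and `render_commands` function are illustrative assumptions). It shows the read–compare–render cycle: only the commands needed to converge are emitted, and re-running against a converged device produces an empty plan.

```python
# Generic sketch of intent-based, idempotent change rendering.
# Device state is modeled as a flat dict of config keys; a real system
# would parse the live device configuration instead.

def render_commands(current: dict, intended: dict) -> list:
    """Compare current vs. intended state and render only the commands
    needed to converge. Running it again after the change is applied
    yields an empty plan -- that is what makes the change idempotent."""
    commands = []
    for key, value in intended.items():
        if current.get(key) != value:
            commands.append(f"set {key} {value}")
    for key in current:
        if key not in intended:
            commands.append(f"delete {key}")
    return commands

current = {"ntp-server": "10.0.0.1", "snmp-community": "legacy"}
intended = {"ntp-server": "10.0.0.1", "snmp-community": "netops-ro"}

plan = render_commands(current, intended)
print(plan)  # only the drifted key is touched
```

Because the plan is computed rather than hand-written, it doubles as the “Preview” described above: an operator can inspect the rendered commands before anything is committed.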
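The pre/post-state check for a BGP-affecting change might look something like the following hedged sketch. The data shapes and the `check_post_state` helper are assumptions for illustration; real tooling would collect and parse "show bgp summary" output from each device before and after the change.

```python
# Illustrative pre/post-change check for a BGP-affecting change
# (generic sketch, not a specific vendor's implementation).

def check_post_state(pre: dict, post: dict, max_route_loss: int = 0) -> list:
    """Compare BGP state captured before and after a change.
    Returns a list of failure descriptions; empty means the change passed."""
    failures = []
    # Every neighbor that was up before should still be in the same state.
    for neighbor, state in pre["neighbors"].items():
        if post["neighbors"].get(neighbor) != state:
            failures.append(f"neighbor {neighbor} changed state")
    # Route count should not drop beyond the allowed threshold.
    if post["route_count"] < pre["route_count"] - max_route_loss:
        failures.append("route count dropped beyond threshold")
    return failures

pre = {"neighbors": {"10.1.1.2": "Established"}, "route_count": 820000}
post = {"neighbors": {"10.1.1.2": "Idle"}, "route_count": 0}

print(check_post_state(pre, post))
```

A failing post-check like this is exactly the signal an automation system can use to halt a rollout or trigger a rollback before the blast radius grows.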
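Config drift monitoring can be sketched with nothing more than the Python standard library: hash the approved baseline, and when a Syslog config event fires, fetch the running config, compare, and diff to show exactly what changed. The `detect_drift` function and the sample configs are hypothetical; this is one simple way to implement the idea, not how any particular product does it.

```python
import difflib
import hashlib

def detect_drift(baseline: str, running: str):
    """Return None if the running config matches the approved baseline,
    otherwise a unified diff showing exactly what changed."""
    same = (hashlib.sha256(running.encode()).hexdigest()
            == hashlib.sha256(baseline.encode()).hexdigest())
    if same:
        return None
    return list(difflib.unified_diff(
        baseline.splitlines(), running.splitlines(),
        fromfile="baseline", tofile="running", lineterm=""))

baseline = "hostname edge1\nntp server 10.0.0.1\n"
running = "hostname edge1\nntp server 192.0.2.9\n"

diff = detect_drift(baseline, running)
print("\n".join(diff) if diff else "in compliance")
```

Scheduling this comparison regularly, or triggering it from a Syslog configuration event, turns it into the audit and drift-monitoring capability described in the list above.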