Real-World Network Automation

October 24, 2018

Author: Gluware guest blogger, Terry Slattery from NetCraftsmen

This is the second blog in a series that will dive into the details around a modern approach to network automation.

Here is how a global pharmaceutical company embraced network automation.

Getting Started

I had the opportunity to talk with Sal Rannazzisi, the Principal Network Architect at a global pharmaceutical company, about their trip down the network automation path. I was especially interested in learning what motivated them to use automation and the steps that they followed. His account provides a real-world look at the transition from manual configuration to fully automated configuration updates.

Sal had been looking at automation for some time, primarily based on open source tools like Ansible, Puppet, and Chef. These approaches all required either bringing in programmers to help build systems or having the networking team learn programming. Programming was going to make the project take a long time. There had to be a better way. Sal then received an update from Gluware about their approach to network configuration that did not need programming skills to operate. This was exactly what he and his team needed.

The company IT organization then decided to deploy a major SaaS application suite (Microsoft OneDrive and Windows 10) across the entire company. The network configurations on over four hundred core routers would have to be modified. Manual processes weren’t going to allow the deployment to proceed at the desired pace. Network configuration automation was suddenly a lot more compelling.

The global network consisted of five data centers, including two main data centers in the US that each had direct connections to cloud providers. All cloud traffic was to be backhauled to the two main US data centers to get to the SaaS provider’s network. There were going to be routing changes, QoS changes, and network monitoring changes. The existing configurations were not consistent across all sites, creating opportunities where automation could fail. The automation system obviously had to avoid killing the network.

Key Steps

Sal started by identifying a relatively simple problem to solve. The idea is to not attempt too much on the first try. The “crawl, walk, run” approach works best, letting the team make any mistakes when the consequences are less severe.

The first simple problems were the deployment of QoS, SNMP, and NetFlow configurations. The QoS design gave the SaaS applications relatively high priority to ensure that they worked smoothly. The SaaS traffic transited the direct links to the provider’s network, making it very similar to applications that are hosted in a corporate data center.

The direct links to the cloud allowed QoS to work. Sal and his team worked with the SaaS vendor to identify all the flow characteristics that were needed to create the QoS policies (i.e. protocol IDs, port numbers, data rates, queue buffer sizes, and bandwidth allocations). Multiple queues were defined to handle the different traffic types. The SNMP monitoring configuration had to be consistent across all devices, so that the QoS performance could be monitored. Queue utilization and queue drops are important parameters for understanding when a queue is getting full or is oversubscribed.
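To make the queue-drop point concrete, here is a minimal Python sketch of how a drop ratio could be computed from two successive polls of a queue's cumulative counters. The `QueueSample` fields, the function names, and the 1% threshold are all assumptions made for illustration; they are not taken from Sal's deployment or from any particular SNMP MIB.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One poll of a QoS queue's cumulative counters (hypothetical fields)."""
    sent_packets: int
    dropped_packets: int


def drop_ratio(previous: QueueSample, current: QueueSample) -> float:
    """Return the fraction of offered packets dropped between two polls."""
    sent = current.sent_packets - previous.sent_packets
    dropped = current.dropped_packets - previous.dropped_packets
    # A negative delta means the counters reset or wrapped; skip the interval.
    if sent < 0 or dropped < 0:
        return 0.0
    offered = sent + dropped
    if offered == 0:
        return 0.0
    return dropped / offered


def is_oversubscribed(previous: QueueSample, current: QueueSample,
                      threshold: float = 0.01) -> bool:
    """Flag the queue when the drop ratio exceeds an assumed threshold."""
    return drop_ratio(previous, current) > threshold
```

A check like this only works if every device exposes the same counters in the same way, which is one reason the monitoring configuration has to be consistent across devices.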

At the same time, the routing configuration changes needed to be identified. The SaaS provider’s network advertisements needed to direct traffic over the dedicated links to the provider.  (They did not change routing policies with Gluware, but we are working on it for DMVPN.)

To make all this work, the configurations of 400+ routers needed to be uniform. Sal and his team started using Gluware to build configuration validation templates. These templates defined the configuration foundation upon which the new configurations would be based. As expected, this took a bit of time to implement. The benefit was that they were able to remove a lot of old configuration information that had been around a long time due to the use of paper-based configuration standards (configuration statements for Kazaa were mentioned). The perpetual copying of configurations from one device to the next, with no reference to the new standards or new platforms, exacerbated the situation. Once the new configuration templates were built, Gluware’s declarative modeling was able to remove thousands of lines of configuration that had existed for decades.

The Gluware configuration modeling abstraction makes different router models look the same, allowing the team to work more efficiently. Once the specific configuration statements of a router function are covered by the abstraction, that part of its configuration can be managed. A key advantage is that the whole configuration does not have to be managed by Gluware. Only those parts that have been modeled are touched, leaving the remainder of the configuration alone. This means that things like QoS policy map definitions don’t interfere with routing policy map definitions.
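The idea of managing only the modeled parts of a configuration can be sketched in a few lines of Python. This is not Gluware's implementation or API; it is an illustration that assumes a configuration can be represented as a dictionary of named sections, each holding a list of lines. The function name, section names, and line strings are placeholders.

```python
from typing import Dict, List

Config = Dict[str, List[str]]


def plan_changes(running: Config, desired: Config) -> Dict[str, Dict[str, List[str]]]:
    """Compare only the sections present in the desired model.

    Sections that exist on the device but are not modeled are never
    inspected, so they cannot end up in the change plan.
    """
    plan: Dict[str, Dict[str, List[str]]] = {}
    for section, wanted in desired.items():
        present = running.get(section, [])
        # Lines on the device that the model does not declare are removed.
        remove = [line for line in present if line not in wanted]
        # Lines the model declares that the device lacks are added.
        add = [line for line in wanted if line not in present]
        if add or remove:
            plan[section] = {"add": add, "remove": remove}
    return plan


running = {
    "monitoring": ["legacy-statement", "standard-statement-1"],
    "routing": ["unmodeled-routing-statement"],
}
desired = {
    "monitoring": ["standard-statement-1", "standard-statement-2"],
}
plan = plan_changes(running, desired)
```

Here `plan` covers only the `monitoring` section: the legacy line is scheduled for removal, the missing standard line is scheduled to be added, and the unmodeled `routing` section is left alone.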

The networking team then had to create new processes for handling the upcoming changes. Along with process changes come cultural changes. Device configuration changes must not be done manually. Every configuration change for the modeled parts of the configuration has to go through Gluware. That’s typically a big cultural change for organizations that are accustomed to doing manual changes via the CLI.

The team built a lab for testing updates. The lab contains examples of all hardware and operating systems within the production network. It is an essential tool for testing prior to applying a change on the production network. Because the configurations across the 400 devices are standardized, the tests validate that a proposed change will work across all the devices.

Once a production configuration change has been tested, the initial SaaS POC deployment starts with 10 sites and 13 network devices, using automation to make the changes simultaneously. Then they expand to 40 devices at once, and eventually deploy to all global devices during the same change window.
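A controlled rollout like this can be sketched as a series of waves that stops as soon as one wave fails. The Python below is an illustration only: the function names and the `apply_change` callback are hypothetical, and it assumes each number in the text (13, then 40) is the size of a separate wave, with the last wave taking every remaining device.

```python
from typing import Callable, List, Sequence


def rollout_waves(devices: Sequence[str], wave_sizes: Sequence[int]) -> List[List[str]]:
    """Split devices into waves of the given sizes; the last wave takes the rest."""
    waves: List[List[str]] = []
    start = 0
    for size in wave_sizes:
        if start >= len(devices):
            break
        waves.append(list(devices[start:start + size]))
        start += size
    if start < len(devices):
        waves.append(list(devices[start:]))
    return waves


def staged_rollout(devices: Sequence[str], wave_sizes: Sequence[int],
                   apply_change: Callable[[List[str]], bool]) -> int:
    """Apply a change wave by wave and stop at the first wave that fails.

    Returns the number of devices in the waves that completed successfully.
    """
    completed = 0
    for wave in rollout_waves(devices, wave_sizes):
        if not apply_change(wave):
            break
        completed += len(wave)
    return completed
```

Starting with a small wave limits how many devices are affected if the change misbehaves, which is the point of the "crawl, walk, run" approach.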

Transitioning to Automation

Sal told me that they are able to do configuration automation without programming. They do have to use variables for things like IP addresses, but that’s not close to coding. They have been through multiple QoS configuration revisions, with Gluware making it easy to deploy each change.

Sal pointed out a major cost saving that wasn’t originally obvious. They are able to avoid the significant cost of paying a managed service provider to make the changes. They also save the time and effort required to write a SOW for each change that affects more than 7 devices. Gluware also allowed them to significantly reduce the time to deploy changes.

Current Status

Gluware is now helping Sal and his team manage the configurations of approximately 500 network devices. They are expanding the scope of what is covered. For example, they are now experimenting with a process to follow up on PSIRT announcements and gather data about the vulnerability. Do they have vulnerable devices, and if so, can they deploy a configuration change to eliminate the vulnerability or perform code promotions at scale to remediate it?
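The first question, whether any devices are vulnerable, amounts to matching an inventory against the platforms and software versions named in an advisory. The Python sketch below shows one way to frame that check. The `Device` and `Advisory` types and their fields are hypothetical, and it assumes an advisory lists exact affected versions; real advisories often describe version ranges, which this does not handle.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Device:
    """A minimal inventory record (hypothetical fields)."""
    name: str
    platform: str
    os_version: str


@dataclass(frozen=True)
class Advisory:
    """A simplified advisory: one platform and a set of exact affected versions."""
    advisory_id: str
    platform: str
    affected_versions: FrozenSet[str]


def affected_devices(inventory: Sequence[Device], advisory: Advisory) -> List[Device]:
    """Return the devices whose platform and OS version match the advisory."""
    return [
        device for device in inventory
        if device.platform == advisory.platform
        and device.os_version in advisory.affected_versions
    ]
```

The resulting list is the candidate set for either a configuration change or a software upgrade, and it could feed a staged rollout rather than a single push.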

Sal indicated that usability is paramount. Gluware makes automation consumable. He can understand the implementation in the lab, verify how it works, preview the change, and then push the button to deploy it to groups of devices.

The biggest pushback that Sal originally received was on reliability and trustworthiness. The organization was aware that automation can break a lot of things very quickly and wanted to minimize the risk. Lab testing and controlled rollouts are critical steps. The corporate IT culture had to adopt a new methodology that minimized the risk of using automation.

The conversation ended with a compelling statement:

“Every hour of work we expend results in many hundreds of hours of effort being saved. I have my life back now.”