Monday, March 22, 2021

Design Pattern: Looking Glasses

It's probably safe to say that service provider networking is pretty unique.

One particular design pattern - Looking Glasses - is extremely useful for complex, dynamically routed networks.

I'd really like to shift the gatekeeping needle here - the threshold for networks complex enough to benefit from a looking glass should move down to:
  • More than 100 routing table entries globally
  • Some vague preference towards reliability
  • Dynamic routing (BGP preferred)

In any small to medium enterprise, I'd posit that the only thing truly preventing these benefits is the lack of dynamic routing adoption, primarily because pre-packaged offerings in this range don't have an "easy button" for implementing it. This lack of accessibility is a real problem for SMB networking, as reliability features stay out of reach.

Design Pattern: Looking Glass

A Network "Looking Glass" is a type of web server that responds to user requests, providing an externalized view of the network (without granting userspace access to network equipment) to an authenticated or unauthenticated client. This allows clients to view BGP metadata and routing tables - for example, to verify that outbound advertisements have propagated between service providers.
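
To make that interaction concrete, here's a minimal client-side sketch against a hypothetical looking glass API. The endpoint, parameters, and response shape are invented for illustration - every deployment defines its own:

    import requests

    # Hypothetical looking glass endpoint - purely illustrative.
    LG_URL = "https://lg.example.net/api/query"

    def check_prefix(prefix: str, router: str) -> dict:
        """Ask the looking glass which BGP path(s) cover a prefix."""
        response = requests.get(
            LG_URL,
            params={"router": router, "type": "bgp", "target": prefix},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()

    # Verify an outbound advertisement has propagated to a provider's edge.
    print(check_prefix("203.0.113.0/24", "edge1.pop1"))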

Here's my starting point for this design pattern.

History (non-exhaustive)

Note: I don't have everything here. It seems most Looking Glasses were stood up silently by telecommunications companies. They're searchable, but I can't find any citable data on when they started out.

Form

  • Least (Zero) Privilege Access to a network service's routing table, searchable via API and/or GUI

Forces

Of these forces, #1 is probably the biggest. Since we cannot (yet) force all of the networking industry titans to provide a permission set that will facilitate this use, I'd propose the following approach:
In this solution, I'm proposing some additional safeguards/scale-guards to make sure the approach won't harm a "host" network. In addition to implementing the looking glass itself, I'd propose deploying a series of Virtual Network Functions (VNFs), scaled out, with monitored routing tables. This is where the collectors would interact - if the physical network doesn't accept any inbound prefixes from the VNF, it's easy enough to build a solution that safely collects from it. There are tons of VNF options here, as we only need BGP capability and a collection method.
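
As a sketch of the collection side - assuming the VNFs run FRRouting, whose vtysh shell can emit the BGP table as JSON - a read-only collector might be as simple as:

    import json
    import subprocess

    def collect_bgp_table() -> dict:
        """Dump the local BGP table from an FRR-based VNF as structured data."""
        # Runs on the VNF itself. Because the physical network accepts no
        # inbound prefixes from this peer, collection stays read-only.
        output = subprocess.run(
            ["vtysh", "-c", "show ip bgp json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(output)

    table = collect_bgp_table()
    print(f"Collected {len(table.get('routes', {}))} prefixes")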

Saturday, March 13, 2021

Unearned Uptime - Present and Future Design Patterns

After all that meatspace talk, let's look at a few technical solutions and why they might not meet business needs in a specific setting.

Shared Control Planes / Shared Failure Plane

Shared Control Plane design patterns are prolific within the networking industry - and there's a continuum. Generally, a control plane between devices should be designed with reliability in mind, but most shared control plane implementations treat "ease of administration" as the intent instead of reliability. Here are some common examples.

Stacking

"Stacking" implementations represent an early industry pattern where (typically) campus deployments weren't entirely large enough to justify a chassis switch but still wanted enough lateral bandwidth to eliminate a worry point. Primary motivations for "stacking" were:

  • Single Point of Administration
  • Linear scale-out costs

Stacking was an artifact from when software like Ansible, Cisco DNA, ArubaOS-CX/NetEdit, etc. didn't yet exist within the industry. Significant downsides exist to stacking software, including:

  • Tight software coupling - upgrades often mean a total outage or a many-step ISSU process
  • Software problems take the whole stack down
  • Stacking cables are expensive and proprietary

Stacking is still a pretty good, viable technology for small to medium campus networks. One particular technology I have found interesting is Aruba's Spine and Leaf design, leveraging Aruba's mobility tunnel features to handle anything that needs to keep an IP address.

MC-LAG

Multi-Chassis LAG is a pretty contentious issue within the industry.

Note: In Service Provider applications, Layer 2 Loop Prevention is a foundational design pattern for delivering Metro Ethernet services by creating a loop-free single endpoint path. I'm not covering this design pattern, as it's a completely different subject. In this case, I'm illustrating Data Center/Private Cloud network design patterns, and then tangentially Campus from there.

MC-LAG as a design pattern isn't all that bad compared to some - however, some applications of MC-LAG in the data center turn out to be fairly problematic.

Modern Data Center Fabric Switching

Given the rise of Hyper-Converged Infrastructure, we're actually seeing data center hardware get used. Prior to this last generation (2012 onwards), just "being 10 Gig" was good enough for most use cases. Commodity server hardware wasn't powerful enough to really tax oversubscribed fabric switches.

...or was it? Anybody remember liking Cisco FEXes? TRILL? 802.3br?

Storage Area Networks (SANs) offloaded all compute storage traffic in many applications, and basically constituted an out-of-band fabric capable of 8-32 Gbit/s.

The main problem here is Ethernet. Ethernet forwarding protocols aren't really capable of non-blocking redundant forwarding, because there is no routing protocol. Fibre Channel uses FSPF (Fabric Shortest Path First), a link-state routing protocol, for this purpose, and every switch in the fabric participates in it.

The biggest difference this makes: Fibre Channel can run two completely independent fabrics, devoid of interconnection, with hosts multipathing across both. This allows an entire fabric to go completely offline with no issues.

MC-LAG goes in a completely different direction - forcing redundant Ethernet switches to share a failure plane. In the data center, the goal for this design pattern is to move to a "share-nothing" approach, eventually resulting in EGP or IGP participation by all subtending devices in a fabric.

Now - we don't have that capability in most hypervisors today. Cumulus does have a host routing implementation, but most common hypervisors have yet to adopt this approach. VMware, Amazon, Microsoft, and Cumulus all contribute to a common routing code base (FRRouting) and are using it to varying extents within their networks to prevent this "Layer 2 absenteeism" from becoming a workload problem. Of these solutions, VMware's NSX-T is probably the most prolific if you're not a hyperscaler that can develop its own hypervisor/NOS combination like Amazon or Microsoft: https://nsx.techzone.vmware.com/

Closing Notes

Like it or not, these examples are perfectly viable design patterns when used properly. Given industry trends and some crippling deficiencies with giant-scale Ethernet topologies in large-scale data center and campus networks, we as network designers must keep an eye to the future and plan accordingly. In these examples, we examined (probably very familiar to some) tightly coupled design patterns used in commodity networks, and where they commonly fail.

If you use these design patterns in production - I would strongly recommend asking yourself the following questions:

  • What's the impact of a software upgrade, worst-case?
  • What happens if a loop is introduced?
  • What's the plan for removing that solution in a way that is not business invasive?
  • What if your end-users scale beyond the intended throughput/device count you anticipated when performing that design exercise?

Hopefully, this explains some of the "why" behind existing trends. We're moving toward a common goal - an automatable, reliable, vendor-independent fabric for interconnecting network devices using common protocols - and nearly all of the weirdness around this can be placed at the networking industry's feet: we treat BGP as a "protocol of the elites" instead of teaching people how to use EGPs. We (the networking industry) need to do more work to become accessible to adjacent industries - they'll be needing us really soon, if they don't already.

Unearned Uptime: Letting Old Ideas Go

We don't always earn reliability with the systems we deploy, design, and maintain

Infrastructure reliability is a pretty prickly subject for the community - we as engineers and designers tend to anthropomorphize, attach, and associate personal convictions with what we maintain. It's a natural pattern, but it inflicts a certain level of self-harm when we fail to improve upon the platforms that serve as the backbone to those we support.

There are three major problems I perceive with regard to translating unearned uptime into reliability:

  • History
  • Ego
  • Architecture (later post)

Throughout this article, I'll cover these problems and then transition into common examples of "unearned uptime" in the industry. These are not "networking" issues - they're infrastructure issues. We have the same problems with most civil structures, interchanges, runways, etc.

The idea that we didn't earn reliability delivered to the business is one thing that we as infrastructure engineers and designers aren't particularly comfortable with.

History

It doesn't have a problem! It's been working fine for years!


(Image credit: Marc-Olivier Jodoin)

Infrastructure needs routine replacement to function correctly

Consumers rarely notice issues with infrastructure until they've gotten to be truly problematic. An easy example of this is asphalt concrete (or bitumen, depending on where you live).

The material itself is relatively simple - rock aggregate + oil - but it's pretty magical in terms of usefulness. Asphalt functions as a temporary adhesive, bonding to automotive tires and making roads really safe by shortening stopping distances. The composite material is also flexible, allowing the ground below it to shift to an extent - which means it holds up even in places with more dynamic geology.

We don't really think about wear to this surface as consumers after it's been installed. Public works / civil engineers sure do, because it's their job - but think about it: if you drive your car over a residential street three times a day, that's probably over 4 metric tons of weight the road has to withstand in a day from your household alone. This wear adds up! A typical residential (neighborhood) street will see over 15,000 metric tons of traffic per year.
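
The back-of-the-envelope math, with illustrative assumptions (a ~1.4 metric ton car and a street serving about ten households):

    # Rough road-wear arithmetic - the weights and counts here are
    # illustrative assumptions, not measured data.
    CAR_WEIGHT_T = 1.4    # average passenger car, metric tons
    PASSES_PER_DAY = 3    # trips over the same stretch of road
    HOUSEHOLDS = 10       # cars using the street daily

    daily_per_car = CAR_WEIGHT_T * PASSES_PER_DAY      # ~4.2 t/day
    yearly_street = daily_per_car * 365 * HOUSEHOLDS   # ~15,330 t/year
    print(f"{daily_per_car:.1f} t/day per car; {yearly_street:,.0f} t/year for the street")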

The sheer scale of road wear is utterly staggering. This GAO Report on Weight Enforcement illustrates how controlling usage is one way to control wear - but that approach doesn't really work all that well for us...

Practical IT Applications

When designing technology infrastructure, especially as a service provider, you want to encourage usage.

Usage drives bigger budgets and your salary! Ultimately, wear with tech infrastructure is going to be about the same regardless of load. Scarcity economics don't work particularly well in IT.

To solve the history problem, you want to convince business line owners to desire and delight in what you provide.

The antithesis to "customer delight" in this case is often this big guy:
(Image: User:MrChrome, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=33206669)

Fun fact: the Cisco 6500 is a lot older than you'd think, entering service in 1999. For more: https://en.wikipedia.org/wiki/Catalyst_6500

Cisco 6500 series switches were simply too reliable. The Toyota Camry of switches, Cisco's 6500s lived everywhere, convincing executives that it was totally okay to skip infrastructure refreshes, much to the chagrin of Infrastructure Managers worldwide.

The Solution - Messaging

We shouldn't be waiting for stuff to fail before we replace it - it's time to get uncomfortable and speak to consumers. Most humans are intelligent - let's help them understand why we care about 25/100 Gigabit connectivity, cut-through switching, and 802.11ax, in terms that are geared towards them.

Here are some pointers on where to start:

  • You're not replacing something because it was bad.
    • A pretty easy pitfall for IT professionals - if you devalue "what came before," you devalue the role its replacement fills. It may be hard to do, but most things were built for a reason - the intent behind the design is important, and this negativity will affect anything you do after that.
  • Show how they can use it
    • This might not make a lot of sense at the outset, but any trivial method for interaction will make a particular change feel more concrete. Some examples:
      • Add a Looking Glass view if it's a new network. Providing users a way to "peek inside" is a time-honored tradition with many industries.
      • Open some iPerf/Spirent servers (or other benchmarking tools) for users to interact with
      • Functional demos like blocking internetbadguys.com
  • Share how it is made
    • You never know, why not try?

Ego

This one's a bit harder - and I'm not trying to apply major negative connotations here. As engineers, we get pretty attached to our decisions, attributing significant personal effort to the products we purchase.

As an industry, IT professionals really need to re-align here. We treat vendor relationships as allegiances and fundamentally tie our own personal integrity to them. If I had my way, I'd stop hearing that someone's a "Cisco guy" or a "VMware guy" - we need to shift this focus back to consumers.

The biggest point for improvement here is also on the negativity front. Let's start by shifting from "this solution is bad" (devaluing your own work for no reason) to "This solution doesn't fit our needs, and this is why." The latter helps improve future results by getting the ball rolling on what criteria consumers value more.

After deploying quite a few solutions "cradle-to-grave," my personal approach here is to think of them like old cars, computers - stuff like that. I fondly remember riding around in my parents' '80s Suburban, but we replaced it because it wasn't reliable enough for the weather we had to face in rural Alaska, and it was too big.

Here are some examples of how I regard these older, later replaced solutions/products:

  • Cisco 6500s: Fantastically reliable, fantastic power bills, fantastic complexity to administer
  • Aruba 1xx series Access Points: Revolutionary access control, less than stellar radio performance
  • Palo Alto 2000/4000 series firewalls: Again, revolutionary approaches to network security, but not enough performance for modern businesses to function. Commit times improved greatly on later generations
  • TMOS 11.x: Incredible documentation, incredible feature depth - but if your needs are any more modern than 2015, you're going to want more features

All of these served businesses well, then needed to be replaced. I see too many engineers beat themselves up when services eventually fall apart, and it's just not necessary.

NSX Advanced Load Balancer - NSX-T Service Engine Creation Failures: `CC_SE_CREATION_FAILURE` and `Transport Node Not Found to create service engine`

TL;DR If you see either of these errors, check grep 'ERROR' /opt/avi/log/cc_agent_go_{{ cloud }} for the potential cause. In my ca...