Sunday, April 21, 2019

Spine and Leaf Networks, an Introduction

The two biggest problems a network engineer will face are changes and loops.


In our Cisco exams, an overwhelming majority of the educational content covers methods of preventing loops, and for good reason. Unfortunately, the methods we use to prevent loops are highly complex - and if we only partially understand these concepts, the risk is transferred to changes, where we modify loop prevention mechanisms for a variety of reasons in the datacenter:

  • Adding new servers
  • Adding new networks to accommodate workloads
  • Adding new hardware to accommodate the addition of new servers and workloads
  • Adding new interconnects because the previous operations were so successful that new sites are now required
I'm sure many have said, "Make it Layer 3!" as if that's some form of easy fix that will magically resolve datacenter reliability issues. The reality is that improperly designed Layer 3 networks can be just as unstable as Layer 2 ones, if not more so. To make it worse, you may not be able to accommodate your workload needs, causing the business to fail.

First, let's cover what network designers mean by "Layer 2" or "Layer 3," since the usage doesn't strictly follow the OSI model. In short, it is a reference to port configuration and whether or not Layer 2 loop prevention mechanisms are in play.

Layer 2

Layer 2 network provisioning is probably the easiest to configure, and the least scalable. Most typical systems administrators won't have any issues deploying a workable small-scale Layer 2 network on their own - and probably have experience doing so. 

Layer 2 network configuration involves the creation of a VLAN, which in turn instantiates a loop prevention process of some kind: spanning tree (in one of its many variants), FabricPath, TRILL, or MC-LAG.
Oddly enough, TRILL can actually be configured to conform to a Spine and Leaf spec. I won't discuss that here - I'll get into why later.
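To make that concrete, here's a minimal sketch (IOS-style syntax, with a hypothetical VLAN number and name) of how simply creating a VLAN puts a loop prevention process into play:

```
! Rapid-PVST runs one spanning-tree instance per VLAN
spanning-tree mode rapid-pvst

! Creating the VLAN instantiates its spanning-tree instance -
! loop prevention begins the moment this VLAN exists
vlan 100
 name app-servers

! Trunk the VLAN toward downstream switches
interface GigabitEthernet1/0/1
 switchport mode trunk
 switchport trunk allowed vlan 100
```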

Layer 3

Layer 3 network provisioning is much less flexible, but it can also be much more stable and scalable. In this case, Layer 2 loop prevention may still be in play, such as with SVIs, but it is not the primary or mandatory source of loop prevention. Instead, routing protocols (OSPF, BGP, EIGRP, even RIP) and potentially redistribution do the work, each with its own hazards.
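By contrast, here's a minimal sketch (again IOS-style, with hypothetical RFC 5737 addressing) of a routed link where a routing protocol, not spanning tree, owns loop prevention:

```
! Routed port - no VLAN, no spanning tree on this link
interface GigabitEthernet1/0/48
 no switchport
 ip address 192.0.2.1 255.255.255.254

! OSPF now handles loop prevention via its link-state database
router ospf 1
 network 192.0.2.0 0.0.0.255 area 0
```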
Again, this is all just to prevent loops. Most network designs do a good job of preventing loops in the ways listed above, but at the cost of making change riskier by tightly coupling networks to specific devices. As we know, tight coupling is a big negative when change frequency is high.

The goals

To design a highly reliable, highly mutable, and highly maintainable network, a network designer must meet the following goals:
  1. Prevent loops reliably and automatically
  2. Allow for frequent, preferably automatic additions and removals of new networks
  3. Be easy to maintain, fix and troubleshoot
  4. Do all of the above, but with a minimum number of changes, to a minimum number of devices

Enter Spine and Leaf

Introduced in the 1950s by Charles Clos (details here), Clos networking is a mathematical model for non-blocking multistage switching circuits. This is a lot to unpack:
  • Non-Blocking: Nearly all Layer 2 loop prevention mechanisms prevent loops by refusing to forward on secondary (or n-th) paths. Telecommunication companies don't really like this, as it cuts available bandwidth (and therefore revenue) in half. Non-blocking indicates that all available ports are able to forward at all available speeds.
  • Multistage: Nearly every datacenter network has more than 6 network ports. As a result, we need the ability to scale beyond a single integrated circuit or network device.
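For the curious, the classic result from Clos's paper: a 3-stage network with n inputs per ingress-stage switch and m middle-stage switches is strictly non-blocking when m ≥ 2n - 1. That inequality is what lets every port forward at full speed, regardless of traffic pattern.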
Today, we have a few more technological advances than in the 1950s. Most datacenter network switches leverage Clos topologies internally to reduce manufacturing costs and increase reliability, providing more ports on a switch than a single ASIC can supply by aggregating 4-, 6-, or 8-port ASICs onto a crossbar.
This raises the question: why not Layer 1? Since the switch itself is Clos, why not just buy a big switch and call it a day? There are some upsides here:
  • One IP to administer, making change easier
  • Layer 1 topologies are pretty reliable, and break-fix actions are typically just reseating something
  • Layer 2/3 loop prevention isn't required
But when we think about it for a second, the downsides are pretty big:
  • Unless you use technology such as VSS or VCS, you have no redundancy
  • You must perfectly assess the correct port count for your data center on the first try, leading to massive amounts of waste
  • It completely violates goal 4, because you're either changing the entire network or not changing it at all.
Layer 2 Leaf-Spine suffers from more or less the same issues, but removes the need to get the port count exactly right up front.

Layer 3 Leaf-Spine (L3LS from here on out) leverages only Layer 3 loop prevention mechanisms between network devices - while built in a non-blocking Clos pattern in a 3-stage topology:

Odd looking, isn't it? Where are the connections between spines, or between leafs?

With Clos networking, crossbars/spines should not connect to each other - it violates goal 4 and leads to blocking circuitry.

Now - this obviously removes IP portability completely and forces workloads running on-fabric to participate in routing at some level, because leafs aren't directly aware of each other. In exchange, it brings certain reliability gains:
  • Imagine if you could do ASIC-level troubleshooting internal to a switch
  • Now imagine if you, as a network engineer, could do this without having to learn how to do ASIC-level troubleshooting. Instead, routing protocols that are familiar to you are your interface into the fabric
  • Now imagine that all failure domains are constrained to the individual ASIC you're working on and won't have higher repercussions to the switch
Pretty big upsides, right?
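To ground this, here's a minimal sketch of a leaf's spine-facing uplinks (IOS-style syntax, hypothetical /31 addressing and OSPF process number). Every inter-switch link is a routed point-to-point, so spanning tree never enters the picture:

```
! Uplink to spine 1 - routed point-to-point
interface Ethernet1/49
 no switchport
 ip address 192.0.2.1 255.255.255.254
 ip ospf network point-to-point

! Uplink to spine 2 - identical treatment
interface Ethernet1/50
 no switchport
 ip address 192.0.2.5 255.255.255.254
 ip ospf network point-to-point

! The routing protocol load-balances across both spines (ECMP)
router ospf 1
 network 192.0.2.0 0.0.0.255 area 0
```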

So here's where we need to diverge a bit, due to what I mentioned here. There are quite a few ways to deploy Spine-and-Leaf networks, and nearly all are highly reliable. Some are even used as production networks!

Humor set aside - the usability problem is a big one. My recommendation and the order of this series of blog posts would be to choose whatever platform, protocols, and administration methods best suit your organizational needs, 'cause they all work. Even RIP.

Before we move on, I'd like to cover some fairly serious problems I've seen when discussing the use of L3LS in the datacenter. I apologize for the length, but there is a lot to cover here. The statements listed below are misconceptions that I've seen kill adoption of this technological principle.
  • L3LS is expensive: This is just flat-out wrong. Generation 1 Catalyst 3560s with routing licensed can run it. All you need is Layer 3 switching. While this is expensive in some cases, product selection can help a bit. Even older Layer 2-only datacenter switches cost quite a bit when compared to newer 10/25G switch options. If your department can afford new, unused 10 gigabit switches, L3LS probably won't cost you anything more.
  • L3LS is a product: While some products like Big Switch, Cisco ACI, or Juniper's QFabric provide a pre-made, self-provisioning network solution that loosely conforms to these design principles, it's not particularly difficult to build your own if a canned solution doesn't meet your needs.
  • L3LS is difficult: We'll cover this in later posts; it's mildly difficult to design, but easy to maintain and grow.
  • L3LS has to use <insert protocol here>: Pretty much anything goes.
With that out of the way, let's have a bit of fun on the next one - running L3LS with RIPv2/3 as the designated routing protocol. In all cases the goal will be to provide a dual-stack network - IPv6 went final over 7 years ago. I'll be using CSR1000v and virtual NX-OS images for these examples, but your routing platform flavor of choice will work just fine. My point is that you know your platform, and should be able to map it out. This isn't stack overflow :)

Traditional Datacenter Network, a Preamble to Spine-and-Leaf

Datacenter Network Engineers have two problems.

CHANGE.

LOOPS. (let's cover this one later!)

Change is everywhere. Emerging trends such as DevOps and CI/CD have created the need for dynamic, ephemeral allocation of data center resources - not only in large-scale deployments; medium-sized companies are starting down this path as well.
...but we still have to schedule change windows to add/remove networks from our datacenters due to the risks involved with network changes.

Current State

Today, most datacenter network deployments consist of 2 or 3 layers, with a huge variety of opinions on the "core" or topside layer. I'll start from the bottom up, as this will generally cover the areas of primary focus first.

Please keep in mind that I'm not throwing shade on this type of design. It's *highly* reliable and is the backbone of many companies. If it's working well for you, you don't have to throw it away. In many cases, the possibilities I will discuss may not even be feasible for you. Eventually, I'll have enough time to cover all the various aspects that result in successful data center networks - but for now, I am going to cover a topic that tends to attract a great deal of misinformation and markitecture that confuses many network engineers.

Core-Aggregation-Access Topologies

This reference design consists of three tiers, Core, Aggregation, and Access:

Datacenter Access

This particular layer is where the rubber hits the road. Servers connect directly to the Access layer, and the design of this tier is geared primarily toward facilitating the needs of the subtending servers. All kinds of atypical services are deployed at this point, such as MC-LAG (vPC, MEC, etc.). Generally speaking, this should be where the most change occurs, as it is the least risky place to make changes.
Most deployments at this point are Layer 2, trunking server/workload VLANs so that workloads do not have to change addressing as they traverse different switches. This is a workload problem that, for the majority of deployments, has not been solved - and must be mitigated by this design.
There are a few downsides, however:

  • New network turn-ups involve all access-tier switches, and at a minimum, the aggregation layer. You're not mitigating risk with change if you have to modify pretty much every single device in your network!
  • Loop prevention methods must all be Layer 2, e.g. spanning tree, FabricPath, TRILL, MC-LAG. These loop prevention methods are not very resilient, and any failures will cascade through the entire data center in most cases.
I do have some recommendations when facing this problem:
  • When creating new networks, always explore the possibility of routed access. Adding SVIs to your access layer mitigates a great deal of this, but you lose workload portability. Perhaps not all workloads need portability, e.g. storage over IP and host management.
  • Preconfigure all ports with a default configuration that will support new server turn-ups (see the sketch below). Server administrators love being able to just plug their equipment in and have it work. Spend a lot of time planning this default port configuration with your systems team - it'll pay off.
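Here's a sketch of what such a default might look like (IOS-style syntax; the VLAN and port range are hypothetical - work out the real values with your systems team):

```
! Hypothetical default template for server-facing ports
interface range GigabitEthernet1/0/1 - 44
 switchport mode access
 switchport access vlan 100
 spanning-tree portfast
 spanning-tree bpduguard enable
 no shutdown
```

PortFast brings the port up immediately for the server, while BPDU Guard shuts it down if someone accidentally plugs in a switch - cheap insurance on a plug-and-play port.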

Datacenter Aggregation

This is where most of the meat and potatoes are as far as data center networking, and for most deployments, this is as far as most designs go. This section of a data center network runs tons of services, aggregated into one place (thus the name). You'll typically see the following connected to / running on the aggregation layer:
  • Firewalls
  • Load Balancers
  • Datacenter Interconnects, if there's no Core
  • Loop prevention methods such as MC-LAG
  • Layer 3 gateways for the majority of VLANs
  • All your VLANs are belong to the aggregation layer
The Aggregation Layer is probably the riskiest device in a data center network to modify. I recommend doing a few things to mitigate these risks:
  • Waste TONS of address space. Create lots of new networks, and keep them relatively small if you can (sub-/24). Deliver them to all of the access layers in a scalable manner, and preconfigure it all at the outset (see the sketch after this list). Remember, no matter what capacity you allocate, customers will overrun it.
  • Don't pile too much on the aggregation devices. You can connect a separate firewall, LB, etc. to the aggregation layer; keeping these devices as simple as possible will keep administration as simple as possible.
  • Ensure you have an adequate port count. The move to adopting a data center core is an expensive one, and is typically necessitated by the third set of aggregation-layer devices.
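As a sketch of that first recommendation (IOS-style syntax, hypothetical VLANs and RFC 5737 addressing), pre-provisioning a pile of small gateway networks up front looks like this:

```
! Several small (/26) networks, created and gatewayed at the outset
vlan 110
 name web-tier
vlan 120
 name db-tier

interface Vlan110
 ip address 198.51.100.1 255.255.255.192
 no shutdown

interface Vlan120
 ip address 198.51.100.65 255.255.255.192
 no shutdown
```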

Datacenter Core

This is the one where your VAR starts seeing dollar signs. Most deployments will not need this layer, even with thousands of workloads (VMs, containers, I don't discriminate), as your port count with an average Aggregation-Access network (we'll call them pods from now on) will be:
  • 32-48 Aggregation ports
  • 32-48 Access ports
You can dual-home 1,024-2,304 servers, or quad-home 512-1,152 servers on paper with one pod. Of course, most of these ports are wasted because you can't always fit 24-48 servers into a cabinet. Real-world maximum server count per pod would be in the hundreds.

The primary point where a data center network would expand to a network core would be when interconnecting 3 or more pods or physical locations. I do have recommendations on design here as well:
  • Don't budge on IP portability here. Keep it Layer 3
  • When I say Layer 3, I mean it. No VLANs at all - use things like `no switchport` and .1q tags if necessary. Eliminate spanning tree completely (see the sketch below)
  • Carefully choose your routing protocols here. BGP is fault-tolerant but complex; OSPF recovers rapidly from failures thanks to link state, but often overreacts to changes. I won't talk about EIGRP because I don't like proprietary routing protocols. Deal with it.
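Here's a sketch of what "Layer 3 and mean it" looks like on a core-facing link (NX-OS-style syntax, hypothetical addressing). Even when a .1q tag is unavoidable, a routed subinterface keeps spanning tree out of the picture entirely:

```
! Core-facing link: routed port, no VLANs, no spanning tree
interface Ethernet1/1
  no switchport
  ip address 203.0.113.1/31
  no shutdown

! If a tag is unavoidable, a dot1q subinterface stays fully routed
interface Ethernet1/2
  no switchport
interface Ethernet1/2.100
  encapsulation dot1q 100
  ip address 203.0.113.3/31
```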
So this is where most people are at with their data centers - and when properly designed, the biggest danger to reliability is the network engineer. Once you finish building this design, network additions, configuration changes, and software upgrades will be the leading causes of network outages. Most network gear available for purchase today is highly reliable, and since everything is Layer 2, this network design will not fail unless an anomaly is introduced or a change is made.

The next section will be for those of us that suffer undue stress and pressure due to a high frequency of change - it's possible to have the level of comfort that most systems engineers have when performing their work.

Using VM Templates and NSX-T for Repeatable Virtual Network Deployments

So far, we've provided the infrastructure for continuous delivery / continuous integration, but it's been for those other guys. Is ...