So far, this is all great, but it IS missing something... A network is useless unless you do something with it. In most cases, a device (in this case, we'll use a server) needs to connect via redundant links to a common Layer 2 segment.
Why is this?
Well, most servers are incapable of dynamic routing. Instead, the server (which is a perfectly capable router as far as forwarding is concerned) simply has a static route (default gateway) that is used for all Layer 3 forwarding. This is not really a deal breaker for Clos fabrics. There are a few ways to solve this problem, and several of them combine really well.
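To make the "static default gateway" point concrete, here is a minimal sketch of how a typical server decides where to send a packet. The subnet, gateway address, and the `lookup` helper are all made up for illustration; a real host does this in the kernel, not in Python.

```python
import ipaddress

# Hypothetical routing table for a typical server: one connected subnet
# plus a static default route. Addresses are illustrative only.
ROUTES = [
    (ipaddress.ip_network("10.0.10.0/24"), "directly connected"),
    (ipaddress.ip_network("0.0.0.0/0"), "via 10.0.10.1"),
]

def lookup(destination: str) -> str:
    """Return the next hop for a destination using longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The most specific (longest) prefix wins; the default route always matches.
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

print(lookup("10.0.10.25"))  # a neighbor on the same segment
print(lookup("192.0.2.50"))  # anything else goes to the single gateway
```

Everything off-subnet lands on one gateway address, which is why that gateway has to live on a Layer 2 segment the server can reach over both of its redundant links.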
The VMware Way
This is probably the most achievable. It's not really a Clos fabric, due to some deficiencies (ESXi doesn't do BGP yet) that will probably be resolved at some point, but it is close enough to achieve our goals. Let's review those:
- Make change frictionless and low risk so network changes can be on-demand (The Change Problem)
- Ensure that the network utilizes all links, with consistent forwarding (The Loop Problem)
The primary value proposition of Layer 3 Leaf Spine (a Clos implementation) is a consistent 3-stage (leaf, spine, leaf) forwarding topology where all links have the same latency and link speed. This, along with some other features (ECMP support being the big one), lets leaf-to-leaf capacity scale with the number of spines - you can have 1, 2, ... 64 spines in a network.
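As a rough illustration of what ECMP does in a leaf-spine fabric, the sketch below hashes a flow's 5-tuple to pick one spine. The spine names, the example flow, and the `ecmp_next_hop` helper are assumptions for illustration; real switches use vendor-specific hardware hashes, and SHA-256 is only a stand-in here.

```python
import hashlib

def ecmp_next_hop(flow: tuple, spines: list) -> str:
    """Pick a spine for a flow by hashing its 5-tuple.

    SHA-256 is a stand-in for the switch's hardware hash; only the
    per-flow behaviour matters for this sketch.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(spines)
    return spines[index]

spines = [f"spine{n}" for n in range(1, 5)]          # hypothetical 4-spine fabric
flow = ("10.0.1.10", "10.0.2.20", 6, 49152, 443)     # src, dst, proto, sport, dport

# The same flow always hashes to the same spine, so its packets stay in order.
assert ecmp_next_hop(flow, spines) == ecmp_next_hop(flow, spines)

# Every leaf-to-leaf path is leaf -> spine -> leaf, so the number of
# equal-cost paths equals the number of spines.
print(len(spines), "equal-cost paths; this flow uses", ecmp_next_hop(flow, spines))
```

Adding a spine just makes the `spines` list longer: more equal-cost paths, same hop count.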
Cisco really pushed this to the limit, publishing a paper on a reference implementation where leveraging 16+ spines actually saved money versus using QSFP+ capable devices. The conclusion is somewhat dated due to QSFP28 coming out and being more affordable, but the takeaway should be the same - BGP/IS-IS are built to facilitate tens of thousands of network nodes in irregular topologies. Datacenter networks with hundreds of switches don't really hold a candle to that, but we can use this overkill to our advantage.
VMware is also now on board with this topology, because they're starting to solve the routing problem with NSX. The currently published reference architecture (VMware Validated Design 5.0 at the time of this post) features a compromise introduced with version 4.0 - linking ToR switches into pairs, more like a traditional switch deployment, with VLANs subtending the leafs to provide server reachability.
There's a problem here - how do you get virtual machines to keep their IP address when moving between ToR pairs?
This is where NSX comes in. NSX-V and NSX-T both provide overlay networking, where dynamically pinned tunnel adapters (like GRE on steroids) manage membership for virtual network segments inside an encapsulation method (VXLAN/GENEVE), providing a fully virtualized Layer 2 segment that is portable anywhere. This ensures that the only thing that isn't portable is the servers, which is good enough for now.
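To show what "inside of an encapsulation method" means in practice, here is a minimal sketch of the 8-byte VXLAN header from RFC 7348 being wrapped around an inner Ethernet frame. The function names and the VNI value are made up for illustration, the outer UDP/IP headers are left out, and this is not how NSX itself is implemented (GENEVE uses a different header).

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag defined in RFC 7348

def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix an inner Ethernet frame with a VXLAN header."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Layout: flags (1 byte), 3 reserved bytes, VNI (3 bytes), 1 reserved byte.
    header = struct.pack("!B3xI", VXLAN_FLAG_VNI_VALID, vni << 8)
    return header + inner_frame

def vxlan_decapsulate(packet: bytes) -> tuple:
    """Return (vni, inner_frame) from a VXLAN payload."""
    flags, vni_field = struct.unpack("!B3xI", packet[:8])
    if not flags & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_field >> 8, packet[8:]

# Hypothetical segment ID and a placeholder for the VM's original frame.
packet = vxlan_encapsulate(5001, b"inner-ethernet-frame")
print(vxlan_decapsulate(packet))
```

The fabric only ever sees the outer packet between two hosts, so the VM's Layer 2 segment (identified by the VNI) can follow it to any ToR pair.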
VMware's approach isn't "pure" (whatever that means), but when revisiting our goals here (to provide an ultra-stable, change-friendly datacenter network), it does meet our needs and provides a demarcation point where the changes are substantially simpler as far as the datacenter fabric is concerned. If NSX breaks on a host, you may lose part of a vn-segment or a few VMs at worst. The fabric failing is far more disastrous.
Pros:
- Change risk is low due to distributing the work
- Highly flexible
- NSX-T can run on things that aren't ESXi
Cons:
- You're going to need to blur the line between "Network" and "Systems". I've seen a pattern in many prod environments where an organization's networking team manages the vSphere Distributed/Standard Switches to ensure that switch and host are well-integrated. If this model is not organizationally feasible, you'll have a difficult time with NSX. Either way, your network and systems teams must cross-train.
The Cisco / Big Switch Way
Another option is to fully offload the responsibility of overlay networking onto the datacenter fabric, maintaining a "pure" Clos topology, and handling the ToR bonding in software with the same overlay technology. I'm keeping this at a high level because, honestly, I haven't worked anywhere large enough to benefit yet.
Pros:
- Works on just about anything
- Usually comes with an automated lifecycle management and provisioning platform
- You can keep your network and systems teams separated
Cons:
- If it's not a use case your vendor anticipated, you don't get the flexibility you need
- Vendor lock-in is basically guaranteed
- Doesn't run on generic hardware
The Mad Scientist Way
...just install a routing package on every virtual machine, Docker host, or virtualization host. Run DHCP for the initial address issuance, run OSPF or BGP with a dynamic range, and then advertise a loopback address for the service you're offering.
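Here's a rough sketch of what "advertise a loopback address" could look like if you templated it per host. It renders a small FRR-style BGP stanza; the ASNs, addresses, and the `render_bgp_config` helper are all hypothetical, and a real deployment needs more than this (the loopback interface itself, route policy, timers), so check the syntax against your routing daemon's documentation.

```python
def render_bgp_config(local_asn, router_id, service_ip, neighbors):
    """Render a minimal FRR-style BGP stanza advertising one /32 loopback.

    neighbors is a list of (peer_ip, peer_asn) tuples, e.g. the ToR pair.
    """
    lines = [
        f"router bgp {local_asn}",
        f" bgp router-id {router_id}",
    ]
    for peer_ip, peer_asn in neighbors:
        lines.append(f" neighbor {peer_ip} remote-as {peer_asn}")
    lines += [
        " address-family ipv4 unicast",
        f"  network {service_ip}/32",   # the service loopback being advertised
        " exit-address-family",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical host peering with two ToR switches (all values are made up).
print(render_bgp_config(
    local_asn=65101,
    router_id="10.255.0.11",
    service_ip="10.255.0.11",
    neighbors=[("10.0.10.2", 65001), ("10.0.10.3", 65001)],
))
```

Because the service lives on a loopback that is advertised over both uplinks, it stays reachable when either link (or ToR) goes away.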
It's not actually all that hard. If you have a Linux OS, you just need to install a software package to run a routing service. If you're a systems guy, this is easier than setting up a LEMP stack! Here are some examples of open-source, publicly available software that will perform this task:
Free Range Routing. It's under the hood of VMware's new version of NSX (NSX-T), Cumulus Networks, and a ton of other stuff. It's the most feature complete of this list, performing tasks that you'd normally pay far more for (Cisco still has BGP as an add-on license).
One neat thing this can provide is the concept of anycast network services. For a stateless service like DHCP or DNS, it is possible to use one of these daemons to advertise a common address from several servers at once. Instead of searching for the correct server or assembling a shortlist of DNS servers, clients simply send to that one address and the network routes them to the nearest instance - this is how many very large DNS implementations work.
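As a rough illustration of the anycast idea, the sketch below models two DNS instances advertising the same address and lets each leaf pick the lowest-cost one. The address, instance names, and costs are invented for the example; in a real fabric the routing protocol makes this choice, not application code.

```python
ANYCAST_DNS = "10.53.53.53"  # hypothetical shared service address

# Hypothetical routing-protocol cost from each leaf to each instance
# advertising ANYCAST_DNS.
advertisements = {
    "leaf1": {"dns-a": 10, "dns-b": 30},
    "leaf2": {"dns-a": 30, "dns-b": 10},
}

def nearest_instance(leaf: str, withdrawn=frozenset()) -> str:
    """Return the lowest-cost instance still advertising the anycast address."""
    candidates = {
        name: cost
        for name, cost in advertisements[leaf].items()
        if name not in withdrawn
    }
    if not candidates:
        raise LookupError(f"no route to {ANYCAST_DNS} from {leaf}")
    return min(candidates, key=candidates.get)

print(nearest_instance("leaf1"))                       # closest instance
print(nearest_instance("leaf1", withdrawn={"dns-a"}))  # after dns-a stops advertising
```

Clients never change their configured DNS address; when an instance withdraws its route, traffic simply shifts to the next-nearest one.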
The downside to this approach is that you have no formal support whatsoever - which is a pretty big con. The good news is that there are some commercially viable host-routing products out there, like Cumulus' host pack (white paper here). Eventually, products like this will be run as a plug-in on common hypervisors.
Why and where should I try this?
Let's keep this simple - applied Clos datacenter fabrics will require some level of solution design. They cannot simply be forklifted into the datacenter, but many products are available today that will solve the near-term issues. Few of these implementations are perfect, so design for iterative improvement, leave extra ports at the datacenter perimeter for new versions, etc.
What routing protocol or technology should I use?
Use what you know. The lab equipment behind the examples in this blog was manufactured in 2002. If it's Layer 3 and familiar, use it. We only have one hard requirement - fast Layer 3 switching.
With routing protocols, use what you know - if a protocol is unfamiliar, you'll have a difficult time supporting it. There's nothing wrong with running OSPF (or even RIP!) for these purposes. My personal favorite is actually running two - either IS-IS or OSPF combined with BGP - but this is driven by a few requirements I have for the future:
- NSX-T only supports BGP
- BGP is the way to go for highly scalable deployments
- Any carrier network engineer will feel at home using it
This concludes this part on Clos networking. Later, I might even apply it!