Sunday, May 24, 2020

Why Automate, Part 2: RESTful APIs and why they aren't as hard as you think

Let's be realistic about the API craze - it seems everything has one, and everybody is talking about API consumption in their environment as if they've invented fire.

Here are a few things to know about APIs that could have been communicated better:

  • Writing code to consume an API is easy. Most of the time, a cURL command will do what you need. To top it off, most platforms have a Swagger UI, or even better, an API Sandbox to guide you through it.
  • You don't have to write code to consume an API. Most of the time, you're simply buying a product that does this for you. For example, with ArubaOS all management plane traffic uses PAPI to communicate, and you just interact with the controller. Even better, platforms like Ansible and Hashi's Terraform make it as easy as defining what you want in a YAML file.
  • APIs need to be secured. As a security practitioner, this one is pretty scary. Think of an API as your SSH connection, but with fewer baked-in security controls, because the industry hasn't hardened m(any) of them yet. API proxies are really useful here because you can limit what permissions any given client can have.
  • APIs are useful in ways that the CLI isn't. There are features and advantages to performing work via any API - one of which is platform abstraction. You can easily write code to make changes to a Juniper switch as a Cisco guy, just by learning the automation constructs!
  • If you're sick of PuTTY/(insert SSH client here)'s bulk copy issues, the API is for you. Even if you don't want to regularly use an API for most things, bulk changes are typically authenticated and validated and will tell you where any breakage is. Next time you install a few hundred static routes or import a multi-line ACL, try it. How do you validate that those changes went in today? Have you ever had issues with just one missing line when doing those bulk imports?
Let's try and consume an API with base code - just to see how easy it really is.

First, let's try something easy, adding a few hundred static routes to an NX-OS device. The main reason why I'm using NX-OS here is that the platform includes an "API Sandbox" by default, which should be disabled in production environments:

no nxapi sandbox

That being said, we're using a lab, and it's stitched together via NSX-T. We can firewall, IDS, etc. the management and data plane of any simulated network asset, and connect them as arbitrary topologies to fit our needs really easily. These workloads (virtual routers & switches) should be ephemeral, so it should be OK for now. Later I'll go into automatically securing and loading base configurations.

Let's get started! Here's the NX-API Sandbox:
I generated a list of /32 prefixes as null routes, each with its own tag, and applied it accordingly. Then I set the format to JSON, the mode to cli_conf, and the error action to "rollback on error". This converts everything into a common language, and rolls back the change if there are problems.
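As a rough sketch of what that bulk input can look like when built in code rather than pasted into the sandbox: the snippet below generates one null route per /32 and wraps the commands in a JSON body. The starting address (from the 192.0.2.0/24 documentation range), the function names, and the exact field names in the payload are assumptions on my part - compare them against what the sandbox generates for your NX-OS version before trusting them.

```python
import ipaddress
import json


def build_null_routes(first_host, count, first_tag):
    """Return one 'ip route' command per /32, each with its own tag."""
    start = ipaddress.IPv4Address(first_host)
    return [
        f"ip route {start + offset}/32 Null0 tag {first_tag + offset}"
        for offset in range(count)
    ]


def build_payload(commands):
    """Wrap CLI commands in a cli_conf JSON body (field names are assumed)."""
    return {
        "ins_api": {
            "version": "1.0",
            "type": "cli_conf",
            "chunk": "0",
            "sid": "1",
            # Assumption: multiple commands are joined with " ;"
            "input": " ;".join(commands),
            "output_format": "json",
            "rollback": "rollback-on-error",
        }
    }


commands = build_null_routes("192.0.2.1", 3, 1111)
payload = build_payload(commands)
# Pretty-printing the payload makes it much easier to peer review.
print(json.dumps(payload, indent=2))
```

Nothing here talks to a device; it only builds and prints the request body, which is the part worth reviewing before a change.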

Generated code is here.

First, we check the routing table beforehand:
sho ip ro
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Then we run it:

And then we verify. 
show ip ro
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1111
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1112
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1113
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1114
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1115
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:15, static, tag 1116
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:14, static, tag 1117
<prefix>/32, ubest/mbest: 1/0
    *via Null0, [222/0], 00:00:14, static, tag 1118

We can also roll back (script in GitHub):

And verify:
show ip ro
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Just to be clear, this is a starting point. There is no error handling, no automatic validation, no secure storage of credentials. It's fantastic that Cisco and other vendors provide this, but there are quite a few things that should be improved with just a tiny bit of coding time:
 - User-friendly formatting of the payload. You'll want to prettify the payload blob, so that it's easier to peer review.
 - try-catch statements: You want to, at a minimum, get a 200 OK or 400 Failure of some kind, and report it to the executor of your script. This is pretty easy to capture.
 - Automatic change validation: In this example, capturing the routing table after the fact could also be generated automatically by the sandbox, and would make for the perfect validation step. Be creative!
 - Test Test Test: These API calls go by pretty quickly, and you don't have the typical MOP approach where constant validation is taking place. Get a lab, and thoroughly test your automation before using it on a live network.

I'll be adding another example that incorporates these improvements at a later date.

My examples of automation implementations are here.

Cisco's library of NX-OS examples is here.

Sunday, March 15, 2020

IPv6 Sage Certification with NSX-T, Part 2

To get past the first major test (Explorer), you simply need to access a page over IPv6, and pass a quiz. To do this, spin up a desktop VM on your dual-stack vn-segment and navigate to

To get past your next phase (Enthusiast) you do have to spend some money - purchase a domain (the cheaper, the better) and link it to's name servers. Jacob Salmela has a pretty good step-by-step on this: (

From here, you should be able to get through it via trial and error. I recommend just spinning up a Linux VM on that vn-segment and toying around with it, e.g. installing Apache, Postfix, etc.

One thing worth noting is that the last few phases (Professional on up) have automated tests that may need to be manually restarted by HE to work. If you get really stuck, you can ask them at

IPv6 Up and Running - Dual-Stack connectivity with NSX-T

The next step is to get IPv6 up and running with NSX-T!

This should be pretty short - as with existing deployments of NSX-T, most of the difficult work is already completed. Here are a few preparatory steps to be performed before getting started:
  • Ensure MP-BGP is on and that the data center fabric is running the ipv6-unicast address-family.
  • Ensure the same on NSX-T manager by navigating to Advanced Networking & Security -> Networking -> Routers -> Global Config:
Now, let's review feature support (up to date as of NSX-T 2.5), as it's not really in the NSX-T documents. More detail can be found here

  • Routing
    • IPv6 Unicast AFI
    • eBGP and iBGP
    • ECMP
    • BGP Route Aggregation, Redistribution, tuning
  • Dataplane forwarding
    • Route Advertisements
    • Neighbor Discovery
    • Duplicate Address detection
    • DHCPv6 helper
  • Security
    • Full Layer 4 firewalling
    • IP Discovery/Security, e.g. IP spoofing prevention, DHCPv6 spoofing prevention
We're pretty much covered on the data plane portion, with one notable exception - IPv6 load balancing is not supported. Other things that are not supported include:
  • IPv6 native underlay: VTEPs, Controller-to-host communication is IPv4 only. I'd expect this to be resolved relatively soon...
  • NSX Manager cannot have an IPv6 address, nor can it cluster via IPv6
  • vCenter and ESXi still do not fully support IPv6. Additionally, with the deprecation of the FLEX UI, the experimental feature that allowed you to try is no longer exposed via any GUI.
  • Versions of vRA prior to 8.0 don't appear to support IPv6 autoconfiguration, so it may be a while before you can automatically invoke these features.
Now that I've been a total buzzkill on feature support (VMware historically hasn't been great on this front), let's get to configuring!

First, let's configure an IPv6 address on our Tier-0 routers:
Add BGP Peers:
Note that you already have Tier-0 to Tier-1 automatically set up - click "View More" under router links, and you'll see it's using the prefix fcc4::, which sits in the range currently reserved by RFC 4193 for unique local addressing. Props to VMware for following spec!
There actually isn't much else to do here - you're done. You can add IPv6 subnets and profiles to segments really easily:

And that's it! Interestingly enough, you can run IPv6 only on NSX-T vn-segments as well - just create a new external interface, attach it to the VyOS VM via a vn-segment, and peer BGP.

Saturday, February 15, 2020

IPv6 Sage Certification with NSX-T, Part 1: Requesting an extended prefix

As is probably obvious from the sidebar, I'm pretty enthusiastic about IPv6 - for quite a few reasons, not least of which is implementing a new Layer 3 protocol after guys like Vint Cerf already did most of the cool stuff.

However, I didn't want to simply complete this task - most people complete all of these tasks without properly implementing IPv6, since no routing or network configuration is required if you simply install a tunnel client on your computer and work from there.

So instead, let's introduce a lot of complexity and make it easier for the testing to fail.

First things first, since we have a whole network in play instead of a single Layer 2 domain, we need to request a bigger prefix. Since you can't (shouldn't) chop up a /64 for end devices, let's start with establishing a larger prefix.'s tunnelbroker site lets us one-click request a /48:
So I'd recommend doing that - and from there we'd want to modify the tunnel created in my previous blog post, and chop it up as you see fit.

I already have a dual-stack Clos fabric in my lab, so establishing tunneled connectivity here was trivial - standing up a VyOS virtual router (config here) and peering BGP with the fabric. This is pretty much the upside to Clos fabrics - you have flexibility in spades.

Saturday, February 1, 2020

Why Automate, Part 1: Network Config Templating in Jinja2

Let's answer the big question: "What's the answer to the ultimate question of life, the universe, and everything?"

Kidding, it's easier to cover the question: "Why automate?"

So let's get started! Here I'm going to start with a few easy and quick ways to benefit from automation, with a slight networking bias...

File Templating

Have you ever deployed a single-config device (doesn't have to be a router or switch) and encountered copy-paste errors, adding old VLAN names, from some master config  (ideally) or other devices (not ideally)? 

As it turns out, so many developers ran into this issue that they created a templating language specifically for purposes like this - Jinja2.

When visiting their website, the API documentation can be a bit overwhelming. There are many features for single-file templating, but if your goal is to cookie-cutter generate device configurations, you don't need to learn all that much of it, as Ansible takes care of the vast majority of the coding required. That's right - no coding required.

The Basics

Jinja2 file templates emphasize the use of variables, and delimit them with double curly brackets, for example:
hostname {{ hostname }}

As a language, it also supports a hierarchy of variables:
hostname {{ global.hostname }}-{{ global.switchid }}
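To show what that substitution actually does, here's a tiny stand-in written with nothing but the Python standard library. To be clear, this is NOT Jinja2 - it's a few lines of regex that mimic only the dotted variable lookup, and the variable values (leaf, 1) are made up for illustration. In practice you'd let Ansible and Jinja2 do this for you.

```python
import re


def render(template, variables):
    """Replace each {{ a.b.c }} with the matching value from nested dicts."""
    def lookup(match):
        value = variables
        # Walk the hierarchy one key at a time: global -> hostname
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Hypothetical values; a real variable file would hold your own standards.
variables = {"global": {"hostname": "leaf", "switchid": 1}}
line = render("hostname {{ global.hostname }}-{{ global.switchid }}", variables)
print(line)  # prints: hostname leaf-1
```

The takeaway is that the template only names the variables; the values live in a separate, readable data structure.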

This is pretty simple, right? The first step I'd recommend here is to go through any configuration standards you have and highlight all of the variables in it.

Now to add a little bit more difficulty - it's time to define the variables in a document, to eventually combine together with the Jinja template we're creating. This is incredibly difficult to do in a vacuum, as you need a good way to name/organize the variables. So let's take that highlighted document, and start attaching names / organizing them at the same time. I'd recommend using a text editor that supports multi-document editing, putting your variable list on one side, and your Jinja template on the other. Here's how I did it in Visual Studio Code:
As you can see, on the left I have used YAML to define attributes of a leaf switch, while adding the names into the template itself. I'll keep this brief, as there's one important aspect to automation here:

YOU are automating YOUR OWN, EXISTING, expertise on a platform. This is not replacing YOU, nor is it making YOUR SKILLS IRRELEVANT. Those skills are still absolutely necessary. YOU will still have to hand-configure and explore equipment like you always have. The biggest change YOU will see is that you'll have more time to test configurations and make them more reliable, instead of performing some of the more boring tasks like editing text files.

For this reason, I'm not going to get very prescriptive on the what or the how from here, as this is an exploration exercise that will vary greatly based on the use case. Here are some quick guidelines while trying it out:
  • Keep it organized! The Jinja document's supporting YAML file is there for YOU to read. Make it easy to do so.
  • If you think you'll need it, add it. You may not have a use case for making MTU a variable currently, but it's seeing widespread adoption in the data center and campus networks - if you think you may change it someday in the future, add it into the documents.
  • Use with extreme prejudice against your configuration templates!
Now that the vast majority of the work here is done, let's focus on the no-code way to combine these files. For this, all you need is Python and Ansible, and pretty much any version works. To achieve this, Ansible has a pre-installed module called template.

- hosts: localhost
  tasks:
    - name: Import Vars...
      include_vars:
        file: example-ios-switch-dictionary.yml
    - name: Combine IOS Stackable Leaf...
      template:
        src: templates/example-ios-stackable-leaf.j2
        dest: example-ios-stackable-leaf.conf

..and that's it. Run it with the command ansible-playbook, and it will create a new file. Unfortunately, this requires one playbook per configuration, as the include_vars module doesn't unload anything from the YAML file.

Usage At Scale

This method scales extremely well - I have provided an example on GitHub, which provides a standardized framework for keeping things organized, like using roles per device configuration, so it should be pretty easy to fork and expand to encompass multiple switches and multiple configuration standards, all in one repo.

In the real world, I use several Git repositories - the sheer quantity of templates and roles just gets out of control otherwise, and collaboration like using Git Pull Requests for continuous review and improvement (It's amazing what you can do with the saved time!) is much easier with that separation.

I've also generated an entire datacenter fabric configuration in seconds this way. Once you get your repositories organized, that's not even that big of a deal.

Demystifying CI/CD and Automation in General

You're already using automation. If you use Pull Requests to improve templates, you're simply formalizing practices you already had, and you've (probably accidentally) done CI/CD and network automation along the way.

A lot of DevOps gurus tend to treat automation work like it's the technological equivalent to inventing the wheel, and a lot of that is more to advance and protect the profession, and less a play to establish dominance / a place of power. Unfortunately, this tends to create a bit of a rift between them and the people they are there to help, but I've never seen that be intentional with DevOps engineers. They're developers, just like other ones, with a fiery burning passion for reducing boring, repetitive tasks for you, and making sure that the methods to do so are well-organized, and want to share those experiences. You don't need to give them a hug, but ask how they do stuff, it's probably the quickest way for you and them to learn something.

Sunday, December 29, 2019

Securing Dual-Stack (IPv4,IPv6) Endpoints with NSX-T

I have mentioned in a previous blog post that I'm not using any ACLs on my tunnel broker VM.

This is usually pretty bad, but again, we can get those protections outside of the VM - I'm using this to prove out how NSX-T can provide utility in this situation.

Solution Overview

VyOS is a fantastic platform, with a ton of rich, extensive features that can empower any network engineer to achieve greater outcomes. There's a lot of good stuff - here I'm using it as a tunnel broker, but we also have these other features:


  • Configuration versioning: Any network platform with in-built configuration versioning (and its cousin, the wonderful "commit review" capability) gets a favorable vote in my book
  • API/CLI: The two have feature parity. It's source control friendly, as I have already shown
  • IPv6: You do not need an IPv4 management plane for this platform to work


  • All routing protocols except IS-IS
  • All VPN functionality except VPNv4 (although EdgeOS, Ubiquiti's fork, has that. It shouldn't take long). This includes WireGuard and OpenVPN, and SIT as I used in this previous example
  • Full IPv6 support, including DHCPv6, RA, SLAAC, OSPFv3, MP-BGP, etc. The only thing missing is 6to4 for completely native IPv6 deployments
It'd be fair to say that VyOS is a fantastically capable router, which like Cisco ISR or any other traditional router, does have some downsides.

What's Missing - or What Could Be Easier

Just as a caveat, I do think we'll see this a lot with virtualized routing and switching. 

VyOS has always had a bit of a problem with firewalling. I've been using it since it was simply Vyatta, prior to Brocade's acquisition, and the primary focus of the platform has always been high-quality routing and switching. Functions like NAT and firewalling are disabled by default and have an extremely obtuse, Layer-4 centric interface for creating new rules. This gets messy pretty quickly, as the rules themselves consume significant configuration space and have to be carefully stacked to apply correctly. This interface is manageable but becomes difficult at scale.

Of course, if it was my entire job to manage firewall policies, I'd automate baseline generation and change modifications, the platform is pretty friendly for that. This may not necessarily be maintainable if it's not placed in an area easily discoverable by other engineers, and definitely doesn't resemble the "single pane of glass" I'd rather have when running a network.

What I'd like to see is a way to intuitively and centrally implement a set of firewall security policies against this device, in a way that can be centrally audited, managed, and maintained. Keep in mind - the auditing aspect is critically important, as any security control that isn't periodically reviewed may not necessarily be effective.

Fortunately, VMware's NSX (or as it was previously known, vShield) has been doing this for quite some time. There are some advantages to this:
  • Distributed Firewall enforces traffic at the VM's NIC, but is not controlled by the VM. This means that you don't have to automatically trust the workload to secure it.
  • VM Guest Firewalling CPU/NIC costs don't impact the guest's allocation. This blade has two edges:
    • VM Guests don't need firewall resources factored into their workload, as it's not their problem. This allows for easy onboarding, as the application you're protecting doesn't have to be refactored.
    • VM Hosts need CPU to be over-provisioned, as this will be taken out of the host resources at a high priority. This being said, if you're going down the full VMware Cloud Foundation / Software Defined Data Center (VCF/SDDC) route, it is important to re-think host overhead, as other components such as vSAN and HA do the same thing!

Securing Workloads

First - we need to ensure that the IPv6 tunnel endpoint VM is on a machine that is eligible for Distributed Firewalling. From the NSX-T homepage, click on the VM Inventory:

Then we select the IPv6 tunnel VM:
From here, let's verify those tags, as we'll be using that in our security policies:

We also need to add some IP Sets - this is the NSX-T construct that handles non-VM or non-Container addressing for external entities. Technically, East-West Firewalling shouldn't always be used for this, but IPv6 tunnel brokering is an edge case: (IP Sets guide here)
From here, you want to add the IP Sets to a group via tag membership - a topic I will cover later as it's vitally important to get right with NSX-T:
We also want to do the same with our virtual machines:

We're all set to start applying policies to it! Navigate over to Security -> East-West Firewalling -> Distributed Firewall:
Add these policies. I have obfuscated my actual addresses under groups for privacy reasons.

That's about it! If you want to add more tunnel nodes, you'd simply apply the tag to any relevant VM with NSX Manager, and all policies are automatically inherited.

Some Recommendations

  • If you haven't deployed a micro-segmentation platform, the #1 thing to remember is that distributed firewalling, because it captures all lateral traffic, generates a TON of logs, all of which happens to be invaluable troubleshooting data. I'd recommend rolling out vRealize Log Insight + Network Insight (vRLI/vRNI) to help here, but ELK stack will probably work just fine in a pinch. 
  • Have a tag plan! Retroactive refactoring of tags is a pretty miserable task, so try and get it at least well organized the first time.
  • Have a naming convention for all of the objects listed above! I'll write a skeleton later on and place it on this blog, along with tagging strategies.
  • Make sure to set "Applied to" whenever possible, as this will prevent your changes from negatively affecting other data center tenants.
  • Try to use North-South firewalling (tier-0 and tier-1 edges ONLY) for traffic that leaves the data center. East-West wasn't really designed for that.
  • Try to use North-South firewalling, period. If a data center tenant (or their workload) is not globally trusted, assign that entity its own tier-1, making it really easy to wall off from the rest of the network. This is probably the easiest thing to do in NSX-T, and generates the most value!

Saturday, November 23, 2019

IPv6 Up and Running - Address Planning Basics and using a Tunnel Broker

First things first - let's cover some IPv6 basics.

What's Different

Many aspects of IPv6 are actually much easier than most people would expect - since there's such a large addressing space, entire fields of work with IPv6 go away.

Custom CIDR / Subnetting

Remember how you had to do binary math, and use your crystal ball to guess how many hosts will be on any given subnet? Well, if you use CIDR masks from /29 to /19 for individual subnets, that will be replaced with a /64. 

A great deal of functionality, such as RA/DHCP, breaks if you use a subnet mask longer than /64 for generic devices. When setting up any network, you need to remember only four masks:
  • /64: Use this everywhere
  • /126: Use like a /30, but ONLY when interconnecting network devices. You're not saving space by trying to use this for hosts.
  • /127: Use like a /31, but with even more flaky vendor support. This is more space efficient, but you need to verify that ALL of your equipment supports it, or deal with a really fragmented point-to-point prefix.
  • /128: Loopbacks
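If you want to see what those four masks mean in address counts, Python's built-in ipaddress module will do the math for you. This is just a quick sketch - the 2001:db8:: prefixes come from the documentation range and the helper name is my own.

```python
import ipaddress


def addresses_in(prefix):
    """Return how many addresses a prefix contains."""
    return ipaddress.ip_network(prefix).num_addresses


# Documentation-range prefixes, one per mask length from the list above.
for prefix in ("2001:db8::/64", "2001:db8:0:1::/126",
               "2001:db8:0:1::4/127", "2001:db8:0:2::1/128"):
    print(f"{prefix}: {addresses_in(prefix)} addresses")
```

The /64 comes out at 2 to the power of 64 addresses, which is why guessing at host counts per subnet stops being a planning exercise.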


NAT

You don't need it, because it's IPv4 duct tape. Prepare yourself for a simpler life without it.

Private Addressing

IPv6 does take a different approach here - there are TWO "private" allocations:
  • Link-local addressing (fe80::/10): This addressing allocation is used on a per-segment basis, and pretty much just exists so that every IPv6 speaker will always have an IP address, allowing routing protocols to work on unnumbered interfaces, for example.
  • ULA (fc00::/7): Unique local addresses are on the "should not be routed" list, and generally speaking should not be used. To be globally routable they need NAT prefix translation (NPTv6), a feature that isn't well supported. I use this in my spine-and-leaf fabric examples to avoid revealing my publicly allocated prefix, and only in my lab.
Instead, IPv6 architecture focuses on the inverse - allocating prefixes you CAN use. Right now the planet (i.e. Earth, not kidding) has the Global (hehehe) allocation of 2000::/3. All IPv6 prefixes are allocated out of this block by providers, using large allocations to ensure easy summarization.
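Here's a small sketch for sorting addresses into these buckets, using Python's ipaddress module and the link-local (fe80::/10), unique local (fc00::/7), and global unicast (2000::/3) blocks. The classify function and the sample addresses are mine, purely for illustration.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("fe80::/10")
UNIQUE_LOCAL = ipaddress.ip_network("fc00::/7")
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")


def classify(address):
    """Name the allocation an IPv6 address falls into."""
    ip = ipaddress.ip_address(address)
    if ip in LINK_LOCAL:
        return "link-local"
    if ip in UNIQUE_LOCAL:
        return "unique local"
    if ip in GLOBAL_UNICAST:
        return "global unicast"
    return "other"


for sample in ("fe80::1", "fd00::1", "2001:db8::1", "::1"):
    print(f"{sample}: {classify(sample)}")
```

Handy for a quick sanity check when you're staring at an interface with three or four addresses on it and wondering which one will actually leave the building.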


DHCPv6

DHCPv6 is not mandatory, as SLAAC/RA Configuration can provide any client device with the default gateway and DNS servers. For enterprise applications, however, it is recommended to use DHCPv6 so you don't unintentionally disclose any information encoded into your IP by SLAAC, and so that your neighbor tables (the IPv6 counterpart to ARP tables) aren't murdered by SLAAC privacy extensions. More here.


DNS

DNS actually isn't all that different anymore, but still deserves mention for a few reasons. 

The first reason why I think it deserves mention is because, as an application, its IPv6 journey was extremely well designed. 
  • IPv6 Constructs are available, regardless of which "stack" you're running: Global DNS Servers have a new (ish) record type, AAAA, that indicates that IPv6 is available for any service, and any DNS server should serve AAAA records, even if solicited over IPv4. This is useful in situations where your DNS server may have additional attack surface over IPv6, like Microsoft's Active Directory servers. It also helps make your migration strategy a bit smoother, as you implement the IPv6 stack progressively throughout your network.
Second, if you don't have AAAA resolving, IPv6 won't do much for you.

IPv6 Address Planning

IPv6 address planning is fundamentally different for the reasons listed above, but I do have some general guidelines that help establish a good starting point:
  • /48 and /56 are good site prefixes: Since each IPv6 route takes considerably more space in our FIB than an IPv4 one (the address alone is four times as long), allocate a /48 or /56 depending on size per site, but don't do anything weird like allocating a /63 or a /62 to save space. Keep your sites consistent. A /56 is the IPv6 equivalent of a /16 in IPv4 - you'll almost always be right allocating at this length.
  • Allocate the last 2 /64s in your prefix for point-to-point prefixes and loopbacks, respectively. It just keeps address fragmentation less messy, and you can summarize the /64s at your backbone to ensure that traceroute "just works".
  • You have lots of space, leave gaps between sites. If you get a /48 and hand out a /56 per site, you have 256 sites to play with. You can block out entire regions, sites, in a myriad of ways to help your routing table "make sense".
Here's how I did it (/48 allocated to me, prefix is masked):
  • ffff:ffff:ffff:ffff::/64: Loopbacks
  • ffff:ffff:ffff:fffd::/64: Point-to-point links
  • ffff:ffff:ffff:e::/49: Allocated to NSX-T, because I don't have multiple sites in my lab. Don't do this in the real world, this is for various (messy) experiments with address summarization.
  • ffff:ffff:ffff:b::/49: Allocated to the underlay fabric. See above.
  • ffff:ffff:ffff:a::/64: Home campus network. This is where Pinterest, and other meatspace activities live.
I'm actually not using much else - I'm allocating large because IPv6 Address shortening makes it easier to type (P.S. IPv4 Address shortening works too, but there are fewer opportunities. Try and ping 1.1) and allocating properly would look like:
  • ffff:ffff:ffff::/56 for Site A (Maybe a headquarters location?)
  • ffff:ffff:ffff:001::/56 for Site B (Satellite office near HQ?)
  • ffff:ffff:ffff:008::/56 for Site C (in another geographic region or state?)
  • ffff:ffff:ffff:1::/56 for Site D (HQ in another country?)
Hopefully this is helpful - when in doubt, whiteboard it out.
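If whiteboarding isn't your thing, the same plan can be sketched with Python's ipaddress module. This assumes a /48 from the documentation range (2001:db8:abcd::/48) rather than a real allocation, and plan_sites is a made-up helper name - it carves the block into /56 sites and picks the last two /64s for point-to-point links and loopbacks, as per the guidelines above.

```python
import ipaddress


def plan_sites(block, site_prefix=56):
    """Split a block into site prefixes plus the last two /64s for infra."""
    network = ipaddress.ip_network(block)
    sites = list(network.subnets(new_prefix=site_prefix))
    # Last two /64s: point-to-point links first, loopbacks last.
    # Note these sit inside the final site block, so keep that one reserved.
    point_to_point, loopbacks = list(network.subnets(new_prefix=64))[-2:]
    return sites, point_to_point, loopbacks


sites, point_to_point, loopbacks = plan_sites("2001:db8:abcd::/48")
print(f"{len(sites)} site prefixes available")
print(f"first two sites: {sites[0]}, {sites[1]}")
print(f"point-to-point:  {point_to_point}")
print(f"loopbacks:       {loopbacks}")
```

Leaving gaps is then just a matter of skipping entries in the sites list, e.g. handing out every eighth /56 so each site has room to grow into a summarizable block.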

Well that's nice, but I'd like to actually do something!

Let's go through the process of selecting a tunnel broker (this assumes you do not have native IPv6 connectivity, because this would already be done):

Step 1: Use Wikipedia's Cheat Sheet to select the best tunnel broker for you. Since I'm in the United States, I selected Hurricane Electric. I am biased by their educational outreach and certification program. I cannot recommend enough taking a crack at their Sage certification.
Step 2: Sign up using the links provided in the cheat sheet. If possible, ask for a /48 for maximum productivity.
Step 3: Establish a tunnel - I have provided a VyOS template here, but a great deal of networking equipment supports SIT tunneling, so it's not particularly difficult to set up. Keep in mind that there's no firewall enabled here, I wouldn't recommend the same approach, but I'm doing that elsewhere.
Step 4: Start experimenting!

Saturday, October 26, 2019

Anycast Stateless Services with NSX-T, Implementation

First off, let's cover what's been built so far:
To set up an anycast vIP in NSX-T after standing up your base infrastructure (already depicted and configured), all you have to do is stand up a load balanced vIP at multiple sites. NSX-T takes care of the rest. Here's how:
Create a new load balancing pool.

Create a new load balancer:
Create a new virtual server:
If your Tier-1 gateways have the following configured, you should see a new /32 in your routing table:
Repeat the process for creating a new load balancer and virtual server on your second Tier-1 interface, pinned to a completely separate Tier-0. If multipath is enabled, you should see entries like this in your routing table:

It really is that easy. This process can be repeated for load balancers, and (when eventually supported) multisite network segments.

A few caveats:

  • State isn't carried through: if you're using a stateful service, use your routing protocols (AS-PATH is an easy one) to ensure that devices consistently forward to the same load balancer
  • Anycast isn't load balancing: This is easy here, as NSX-T can do both. Anycast by itself won't protect your servers from overload unless you also use a load balancer.
  • Use the same server pool: It was (hopefully) apparent that I used the same pool everywhere. Try to keep regional configurations consistent, to ensure that new additions aren't missed for a pool. Server pools should be configured on a per region or per transport zone basis.
Some additional light reading on anycast implementations:

Saturday, October 19, 2019

Anycast Stateless Services with NSX-T, the Theory

Before getting started, let's cover what different IP message types exist in a brief summary, coupled with a "day in the life of a datagram" as it were.

Unicast

One source, one well-defined destination. Most network traffic falls into this category.

Mayfly perspective:
Source device originates packet, and fires it to whatever route (yes, hosts, VMs and containers can have a routing table) matches based on the destination.
The next-hop router, if reachable, forwards the packet, and decrements the time-to-live (TTL) field by 1. Rinse and repeat until the destination is reached. Note: the TTL field is 8 bits, so if a message needs over 255 hops, it won't make it. (we're looking at YOU, Mars!) Pretty boring, but boring is good. 

Multicast

One source, many specific destinations. This has a moderate gain in efficiency over bandwidth constrained links when routed.

In most cases, if a group pruning protocol, e.g. IGMP, MLD, is not running, multicast traffic "floods" and distributes all messages across all ports. The most common application for multicast is as a discovery or routing protocol.

Mayfly perspective:
Source device originates packet and the next layer 2 device replicates the packet to all multicast destinations (if IGMP/MLD is not doing its job, this becomes a flood, and forwards on all ports, which removes the forwarding efficiency) and then stops.
If multicast routing is enabled, traffic will forward just like it did with unicast, and have a moderate increase in efficiency. This is at the expense of traffic control. Since all multicast traffic is inherently stateless, there's no way to manage bandwidth consumption, fully eliminating the efficiency gain in many cases. If you're running routed multicast, I'd highly recommend using BGP to prune the multicast table... to help with some of this.
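The difference between flooding and pruned multicast forwarding at layer 2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `egress_ports` and its parameters are hypothetical names, and a real switch builds its membership table from IGMP/MLD messages rather than being handed one.

```python
def egress_ports(all_ports, ingress_port, member_ports, snooping_enabled):
    """Return the ports a layer 2 switch would copy a multicast frame to."""
    # A frame is never sent back out of the port it arrived on.
    candidates = set(all_ports) - {ingress_port}
    if not snooping_enabled:
        # No pruning: the frame floods like a broadcast.
        return sorted(candidates)
    # Pruning active: only ports with interested receivers get a copy.
    return sorted(candidates & set(member_ports))
```

On an 8-port switch with receivers on ports 3 and 5, flooding sends seven copies while pruning sends two, which is where the efficiency gain comes from.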

Broadcast: One source, ALL destinations. This is usually the least efficient traffic type and is part of why most networks don't have one all-encompassing VLAN, but instead use a number of subnetworks. With some exceptions, this traffic type is exclusively for when a source doesn't know how to get to a destination, e.g. ARP.

Mayfly Perspective:
Source device originates packet and the next layer 2 device floods on all ports but the origin (unless it's a hub). This traffic is subsequently dropped by all layer 3 forwarding devices unless a broadcast helper address is configured.

Anycast: Unicast with a twist. Addresses (or networks) are advertised by multiple nodes, all capable of providing a service, enabling an end device to speak to the nearest available node.

Mayfly Perspective:
Source device originates the packet and forwards it on whichever interface its routing metrics select. The next Layer 3 device forwards traffic to the available node with the most favorable routing protocol metric.
There's a lot to unpack here. Let's focus on the main points re: Anycast:
  • It DOES forward to the nearest available node, and if configured correctly, will use less reachable nodes as a backup.
  • It DOES NOT load balance traffic in any meaningful way.
  • It DOES NOT retain state
This looks like a pretty big deal-breaker, but let's keep in mind that we have more tools - these shortcomings are completely solvable. The only things you need to provide to make an anycast service are:
  • A load balancer
  • A load balancer that provides stateful services, or one that will synchronize state.
  • A load balancer
NSX-T conveniently provides the above with fully integrated routing and switching (we set up BGP, the routing protocol of the internet, before), and adds micro-segmentation firewalling to boot. I'll cover more of that in the next post.

Before we go much further, it is critically important that we understand something very fundamental.


I know it sounds dramatic, but VMWare's concept of a "transport zone" seems to imply that universal reachability via a PORTABLE SUBNET is the primary goal. In NSX-V, this was described as a Universal Distributed Logical Router (UDLR), and does not appear to be fully implemented in NSX-T. As network designers, we should plan for universal reachability leveraging the Anycast model, e.g. "Will the nearest NSX-T Edge please stand up" wherever possible.

Hopefully it is clear by now that Anycast isn't a specific IP message type, but instead a design for network reachability. It's commonly Unicast, but can be multicast if an implementation is carefully designed. The core principle for Anycast is to provide the shortest path to an asset, to the best knowledge of the network routing protocol.
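The "nearest available node" behavior can be sketched in Python as a simple lowest-metric selection. This is a toy model, not a routing protocol implementation: the `Advertisement` class, the node names and the metric values are all hypothetical, and real best-path selection has many more tie-breakers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    node: str               # hypothetical name of the node advertising the prefix
    metric: int             # lower is "closer" from this router's point of view
    reachable: bool = True  # False once the route has been withdrawn


def select_anycast_node(advertisements):
    """Pick the reachable node with the most favorable (lowest) metric."""
    live = [adv for adv in advertisements if adv.reachable]
    if not live:
        return None  # nobody is advertising the service any more
    return min(live, key=lambda adv: adv.metric).node
```

Note what the sketch does not do: every packet from this router lands on the same node until that node disappears, so there is failover but no load balancing and no shared state.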

On the more practical side, common Anycast applications include:
  • DNS
  • Application load balancers
  • Content Delivery Networks (CDNs)
Coming soon - how to do this with NSX-T!

Saturday, October 12, 2019

BGP Graceful Restart, some inter-platform oddities, what to do with it

Since most of NSX-T runs in a firewall mode of sorts, it's probably worthwhile to discuss one of the less well-known routing protocol features - Graceful Restart.

As published for BGP, IETF RFC 4724 outlines a mechanism for "preserving forwarding traffic during a BGP restart." This definition may be a little misleading, but that's mostly because of HOW the industry is leveraging Graceful Restart. Here are a few of the "normal use-cases" for BGP GR:

Cisco Non-Stop Forwarding and other similar technologies:
Cisco has developed another standard - NSF - that applies industry-generic methods for executing a BGP restart with forwarding continuity, with a twist. In many cases, multi-supervisor redundancy is a popular way of keeping your high-availability numbers up, with either a chassis switch running multiple supervisor modules or multiple devices bonded into a virtual chassis. In theory, these implementations get better availability numbers because they'll keep the main IP address reachable during software upgrades or system failures.
In my experience, this is great in campus applications, where client devices don't really have any routing/switching capability (like a cell phone) and where availability expectations are somewhat low (99%-99.99% uptime). However, in higher availability situations or ones running extensive routing protocol functionality, this appears to fall apart somewhat, where the caveats start to break the paradigm:

  • ISSU caveats: You have to CONSTANTLY upgrade your routers because ISSU is typically only supported across 1 or 2 minor releases. If you have a "cold" cutover, i.e. with a major version upgrade, you'll see a pretty extensive outage (5-30 minutes long depending on hardware)
  • Older implementations of a multi-supervisor chassis tend to have configuration sync issues, so you need to CONSTANTLY test your failover capability (I mean, you should do that anyway...)

Just my 2 cents. But here's where Graceful Restart does its job: during a supervisor failover, the IP address of the routing protocol speaker is shared between supervisors. When the routing protocol adjacency is first established, the speakers negotiate GR capability, along with tunable timers. Since the IP doesn't change, the highest-availability action is to continue forwarding to the "dead" address until the adjacency is re-established, ensuring sub-second availability for a dynamic routing protocol speaker (except in the case of updating your gear...)
Most firewall implementations are either Active-Active or Active-Standby, with shared IP addresses and session state tables. Well-designed firewall platforms use a generic method for sharing the state table, which includes (ideally) the session table, routing table, etc. ensuring that mismatched software versions do not introduce a disproportionate outage. The primary downside to this approach is that you don't have a good way to test your forwarding path (beyond Layer 2) so you should TEST OFTEN.

Now let's cover where you should NOT use Graceful Restart:
Any situation where the routing protocol speaker does not have a backup supervisor or any state mechanism. Easy, right?

NOPE. You have to enable Graceful Restart on speakers that have an adjacent firewall (or NSX-T Tier-0 gateway) to support the downstream failover.

RFC 4724 outlines two modes for Graceful Restart: Capable and Aware. Intuitively, GR Capable speakers should be stateful network devices, such as multi-supervisor chassis, firewalls, or NSX-T edges, and GR Aware devices should be stateless network devices, such as layer 3 switches.
The catch, however, is that not all devices support GR Aware mode. For example, it IS supported in IOS 12, but with caveats on which hardware has this capability.

So why does this matter? Well, Cisco illustrated it well in this NANOG presentation by stating that if an NSF-Capable advertising device fails, but there is no backup device sharing that same IP address, all traffic is dropped until the GR timers expire. Ouch. This is especially bad given some defaults:

  • RFC 8538 Recommendation: 180 seconds
  • Palo Alto: 120 seconds
  • Cisco: 240-300 seconds
  • VMWare NSX-T: 600 seconds?!?!?!?

Now that's pretty weird. If we fetch from VMWare's VVD 5.0.1, it says the following:
NSXT-VISDN-038 Do not enable Graceful Restart between BGP neighbors. Avoids loss of traffic. Graceful Restart maintains the forwarding table which in turn will forward packets to a down neighbor even after the BGP timers have expired causing loss of traffic. 
Coupled with the recommendation for Tier-0 to be active-active (remember, as I stated before, stateless devices do NOT need GR):

Oddly, it did not warn me about needing to restart the session. Let's find out why:

bgp-rrc-l0#show ip bgp summary
BGP router identifier, local AS number 65000
BGP table version is 84, main routing table version 84
7 network entries using 819 bytes of memory
11 path entries using 572 bytes of memory
14/6 BGP path/bestpath attribute entries using 1960 bytes of memory
2 BGP AS-PATH entries using 48 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 3399 total bytes of memory
BGP activity 102/93 prefixes, 264/247 paths, scan interval 60 secs

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
                4 65000  143031  142962       84    0    0 14w1d           2
                4 65000  143036  142962       84    0    0 14w1d           1
                4 64900  330104  280526       84    0    0 1d17h           1
                4 65001  178250  174230       84    0    0 1w0d            3
FD00:6::240     4 65000  310833  578924       84    0    0 14w1d           0
FD00:6::241     4 65000  301493  578924       84    0    0 14w1d           1

Note that for GR to be modified, the BGP session must restart, so if this were a production environment with equipment that supports GR (*sigh*) you would want to get into the leaf switch and perform a hard restart of the BGP peering.

VMWare's VVD recommendation here is pretty sound, as with most devices the GR checkbox is a global one, so you'd want to buffer between GR/Non-GR with a dedicated router (it's just a VM in NSX's case!), keeping in mind most leaf switches will have GR enabled by default.

Oddly enough, Cisco's Nexus 9000 platform (flagship datacenter switches) defaults to graceful restart capable. My recommendations (to pile on with the VVD) on this platform would be to:

  • Set BGP timers to 4/12
  • Set GR timers to 120/120 or lower (they're fast switches, so I chose 30/30)
  • Under BGP, configure graceful-restart-helper to make the device GR-Aware instead of GR-Capable
Obviously, the VVD will adequately protect your infrastructure from issues like this, but I think it's unlikely you'll have NSX-T as the only firewall in your entire datacenter.
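To put the timers discussed above in perspective, here is a rough Python sketch of the worst-case black-hole window when a GR-capable speaker dies and nothing takes over its IP address. It is a deliberately simplified model with loud assumptions: the failure is only detected when the BGP hold timer expires (no BFD, no link-down event), the dead speaker never comes back, and the function name `worst_case_blackhole` is made up for this example.

```python
def worst_case_blackhole(detection_seconds, gr_restart_seconds, gr_helper_enabled):
    """Seconds of dropped traffic after a speaker fails with no backup behind its IP."""
    if gr_helper_enabled:
        # The neighbor keeps forwarding to the stale routes until the GR timer runs out.
        return detection_seconds + gr_restart_seconds
    # Without GR, routes are withdrawn as soon as the failure is detected.
    return detection_seconds


if __name__ == "__main__":
    hold_time = 12  # the 4/12 keepalive/hold timers recommended above
    for label, restart in [("GR 30s", 30), ("GR 120s", 120), ("GR 600s", 600)]:
        print(label, worst_case_blackhole(hold_time, restart, True), "seconds")
    print("GR off", worst_case_blackhole(hold_time, 600, False), "seconds")
```

Under these assumptions a 600 second restart timer turns a 12 second outage into one of more than ten minutes, which is why tuning the GR timers down (or leaving GR off where there is no backup) matters.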

Saturday, October 5, 2019

NSX-T 2.5 Getting Started, Part 2 - Service Configuration!

Now that the primary infrastructure components for NSX-T are in place, we can build out the actual functions that NSX-T is designed to provide.

A friendly suggestion, make sure your Fabric is healthy before doing this:
NSX-T differs from NSX-V quite a bit here. Irregular topologies between edge routers aren't supported, and you have to design any virtual network deployments in a two-tier topology that somewhat resembles Cisco's Aggregation-Access model, but in REVERSE.

The top tier of this network, which VMWare's design guide calls Tier-0, is made up of logical routers that primarily act as route aggregation devices, performing tasks such as:
  • Firewalling
  • Dynamic Routing to Physical Network
  • Route Summarization
  • ECMP
The second logical tier, Tier-1, is automatically and dynamically connected to Tier-0 routers via /31s generated from a prefix of your choosing. This logical router will experience a much higher frequency of change, performing tasks like:
  • Layer 2 segment termination/default gateway
  • Load Balancing
  • Firewalling
  • VPN Termination
  • NAT
  • Policy-Based Forwarding
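If you're curious what carving /31 transit links out of a prefix looks like, Python's standard `ipaddress` module can sketch it. This is not how NSX-T does the allocation internally, just an illustration: the pool `100.64.0.0/16` is an example value, and the function name `transit_links` and the idea of handing out links in order are assumptions.

```python
import ipaddress
from itertools import islice


def transit_links(pool, count):
    """Return the first `count` /31 links from `pool` as (subnet, side_a, side_b)."""
    network = ipaddress.ip_network(pool)
    links = []
    for subnet in islice(network.subnets(new_prefix=31), count):
        # A /31 holds exactly two addresses, one for each end of the link.
        side_a, side_b = list(subnet)
        links.append((subnet, side_a, side_b))
    return links


if __name__ == "__main__":
    for subnet, side_a, side_b in transit_links("100.64.0.0/16", 3):
        print(subnet, side_a, side_b)
```

A single /16 yields 32,768 of these links, so the pool is unlikely to be the limiting factor in a lab.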
Before implementing said network design, I prefer to write out a network diagram.

Let's start with configuring the Tier-0 gateway:
We'll configure the Tier-0 router to redistribute pretty much everything:
Configure the uplink interface:
Oddly enough, we have spotted a new addition with 2.5 in the wild - the automatic inclusion of prefix-lists!
We also want to configure route summarization, as the switches in my lab are pretty ancient (WS-3560-24TS-E). I'd recommend doing this anyway in production, as it will limit the impact of widespread changes. To pull that off, you *should* reserve the following prefixes, even if they seem excessive:
  • A /16 for Virtual Network Services per transport zone
  • A /16 for NSX-T Internals, allocating /19s to each tier-0 cluster, as outlined in our diagram.
I did so below, and it makes route aggregation or summarization EASY.
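For anyone who wants to sanity-check an addressing plan like this before touching the UI, here is a minimal Python sketch using the standard `ipaddress` module. The `10.255.0.0/16` block, the segment prefixes and the function names are all hypothetical; substitute whatever you actually reserved.

```python
import ipaddress

# Hypothetical /16 reserved for NSX-T internals, split into one /19 per Tier-0 cluster.
INTERNALS = ipaddress.ip_network("10.255.0.0/16")
CLUSTER_BLOCKS = list(INTERNALS.subnets(new_prefix=19))


def aggregate_for(segment, blocks=CLUSTER_BLOCKS):
    """Return the reserved block that covers `segment`, or None if it is outside the plan."""
    seg = ipaddress.ip_network(segment)
    for block in blocks:
        if seg.subnet_of(block):
            return block
    return None


def advertised_aggregates(segments, blocks=CLUSTER_BLOCKS):
    """Only blocks with at least one contributing segment would be advertised."""
    covered = {aggregate_for(segment, blocks) for segment in segments}
    covered.discard(None)
    return sorted(covered)
```

A /16 splits cleanly into eight /19s, and any segment that falls outside the reserved blocks shows up immediately as `None`, which is a cheap way to catch a typo before it becomes an unsummarized route.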
Now, we configure BGP Neighbors:
At this point, we want to save and test the configuration. It'll take a while for NSX-T to provision the services listed here, but once it's up, you'll see:
Check for advertised routes. Only routes that exist are aggregated, so you should only see

As a downside, I have prefix-filtering to prevent my lab from stomping on the vital Pinterest and Netflix networks, so I had to add the new prefixes to that:
That was quite a journey! Fortunately, Tier-1 gateway configuration is MUCH simpler, initially. Most of the work performed on a Tier-1 Gateway is Day 1/Day 2, where you add/remove network entities as you need them:
Let's add a segment to test advertisements. I STRONGLY RECOMMEND WRITING A NAMING CONVENTION HERE. This is one big difference between NSX-V and NSX-T, where you don't have this massive UUID in the port group obfuscating what you have. Name this something obvious and readable, your future self will thank you.
Hey look, new routes!

As I previously mentioned, these segments, once provisioned, are just available as port-groups for consumption by other VMs on any NSX prepared host:
Next, we'll configure NSX-T to make waffles!
