Saturday, May 15, 2021

VMware NSX Advanced Load Balancer - Installation


Before beginning the Avi installer, I configured the following in my environment:
  • Management Segment (NSX-T Overlay). This is set up with DHCP for quick automatic provisioning - no ephemeral addresses required
  • Data Segments (NSX-T Overlay). Avi will build direct routes to IPs in this network for vIP processing. I built 3 - 
    • Layer 2 Cloud (attached to Tier-1)
    • NSX Integrated (attached to Tier-1)
    • Layer 3 Cloud (attached to Tier-0)

Avi also supports automatic SE deployment - which means that automatic IP configuration is important. Avi supports SLAAC (IPv6) and DHCP (IPv4) for this purpose.

NSX-T is unsurprisingly symbiotic here. I have built a dedicated Tier-1 for NSX ALB, and we're going to provide DHCP services via the Tier-1 router. If this was a production deployment or a VVD-compliant SDDC, this should be performed with a DHCP relay. I just haven't set aside time to deploy DHCP/IPAM tools for reasons that are beyond me.

The following changes are performed on the Tier-1 Logical Router. This step is not required for external DHCP servers!

The following changes are to be performed on the Logical Segment. 
If production, DHCP relay is selectable from the following screen:


 Avi Controller

VMware provides a prepackaged OVA for the Vantage controller - and it's a pretty large appliance. 24 GB of memory and 8 vCPUs is a lot of resourcing for a home lab. There are no sizing options here.

Installation is pretty easy - once the OVA is deployed, I used my CI/CD pipeline and GitHub to deploy DNS updates and logged right into the installation wizard.

AVI version 20.1.5 has changed the installer approach from the above to this. When "No cloud setup" is selected, it still insists on configuring a new cloud. This isn't too much of a problem:
Note: This passphrase is for backups - make sure to store it somewhere safe!
From here, we configure vCenter's integration:

Let's ensure that Avi is connected to vCenter and has no issues. Note: VMware recommends write-mode for vCenter clouds.

After install, it's useful to get a little oriented. Up in the top left of the Avi Vantage GUI. In the top left, you'll find the major configuration branches by expanding the triple ellipsis. Get familiar with this part of the GUI - you'll be using it a lot!


Before we build anything, I prefer to load any patches (if applicable) prior to building anything. This should help avoid any software issues on deployment, and patching is usually simpler/lower impact if you have no configuration yet. 

Avi Vantage really excels here - this upgrade process is pretty much fully automated, with extensive testing. As a result, it's probably going to be slower than "manual" upgrades, but is definitely more reliable. Our industry really needs to get over this - If you have a good way to keep an eye on things while keeping busy, you're ahead of the curve!

We'll hop on over to Administration -> Controller -> Software:

While this upgrade takes place - Avi's controller will serve up a "Sorry Page" indicating that it's not available yet - which is pretty neat.

When complete, you should see this:

Avi Clouds

Clouds are Avi's construct for deployment points - and we'll start with the more traditional one here - vCenter. Most of this has already been configured as part of the wizard above. Several things need to be finished for this to run well, however:

  • Service Engine Group - here we customize service engine settings
  • IPAM - Push IP address, get a cookie
SE Group Changes are executed under Infrastructure -> SE Groups. Here I want t to constrain the deployment to specific datastores and clusters.
IPAM is located in two places, Templates -> Profiles -> IPAM/DNS Profiles (bindable profile):

Ranges are configured under Networks. If you configure a write-access cloud, it'll scan all of the port groups and used IP ranges for you. IP ranges and Subnets will still need to be configured and/or confirmed:

Note: This IPAM profile does need to be added to the applicable cloud to leverage auto-allocate functions with vIPs.

Avi Service Engines

Now that the setup work is done, we can fire up the SE deployments by configuring a vIP. By default, Avi will conserve resources by deploying the minimum SEs required to get the job done - if there's no vIP, this means none. It takes some getting used to!
Once the vIP is applied, you should see some deployment jobs in vCenter:

Service engines take a while to deploy - don't get too obsessive if the deployment lags. There doesn't appear to be a whole lot of logging to indicate deployment stages, so the only option here is to wait it out. If a service engine doesn't deploy quite right, delete it. This is not the type of application we just hack until it works - I did notice that it occasionally will deploy with vNICs incorrectly configured.

From here, we can verify that all service engines are deployed. The health score will climb up over time if the deployment is successful.

Now we can build stuff! 

Sunday, May 9, 2021

Leveraging Hyperglass and NSX-T!

 For this example deployment, I'll be using my NSX-T Lab as the fabric, VyOS for the Overloaded Router role, and trying out hyperglass:

Installation (VyOS)

I already have a base image for VyOS with its management VRF set up - and updating the base image prior to deployment is a breeze due to the vSphere 7 VM Template Check Out Feature.

In this case, I'll deploy to an NSX-T External Port and peer up, with fully implemented ingress filtering:
Export Filters - Permit all prefixes:
Import Filters - don't trust any prefixes from this router:
Set in the correct directions:
Configure the BGP Neighbors:

From here, we build the VNF, by adding the following configuration:
protocols {
    bgp 64932 {
        address-family {
            ipv4-unicast {
                maximum-paths {
                    ebgp 4
            ipv6-unicast {
                maximum-paths {
                    ebgp 4
        neighbor {
            remote-as 64902
        neighbor {
            remote-as 64902
        neighbor x:x:x:dea::1 {
            address-family {
                ipv6-unicast {
            remote-as 64902
        neighbor x:x:x:dea::2 {
            address-family {
                ipv6-unicast {
            remote-as 64902
        timers {
            holdtime 12
            keepalive 4

Then, let's verify that BGP is working:

vyos@vyos-lg-01:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier, local AS number 64932 vrf-id 0
BGP table version 156
RIB entries 75, using 14 KiB of memory
Peers 4, using 85 KiB of memory

Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt             4      64902       278       272        0    0    0 00:11:31           40       42             4      64902        16        13        0    0    0 00:00:16           39       42
x:x:x:dea::1 		 4      64902       234       264        0    0    0 00:11:43 NoNeg
x:x:x:dea::2 		 4      64902       283       368        0    0    0 00:11:43 NoNeg

Total number of neighbors 4

The VNF is configured! Now, we'll follow the application maintainer's instructions for installation:

The documentation for install is pretty good - but some customization is still required. I built the following configuration files out - hyperglass leverages YAML as a configuration file format, examples are here. I did make some changes:

  • Some combination of VyOS 1.4, MP-BGP, and/or VRF-lite changed the syntax for the BGP views around. Setting a commands file fixes this.
  • VyOS driver is appending a host mask (/32, /128) on routes with no prefix specified.
    • NB: I reached out to the maintainer (Matt Love) and he informed me that this was configurable per-VRF using the force-cidr option.
This particular tool has been extremely useful to me, as NSX-T still lacks comprehensive BGP visibility without CLI access - and even if it didn't, this will provide consumers an easy way to validate that prefixes have propagated, and where.

Sunday, May 2, 2021

PSA: PAN-OS Drops BGP peers with an invalid NLRI / Always filter inbound prefixes from Avi Vantage

If Avi Vantage IPAM cannot allocate an address for a new vIP, it will advertise an all-zeros host address -

This will cause Palo Alto PAN-OS to restart a peer - even if it is not the immediate downstream prefix. Palo Alto uses routed as their dynamic routing engine - so this is probably default behavior inherited from there:

**** EXCEPTION   0x4103 - 57   (0000) **** I:008e7cd1 F:00000004
qbmlpar2.c 1352 :at 20:54:21, 2 May 2021 (1822572648 ms)
UPDATE message contains NLRI of

**** PROBLEM     0x4102 - 46   (0000) **** I:008e7cd1 F:00000004
qbnmmsg.c 1074 :at 20:54:21, 2 May 2021 (1822572648 ms)
NM has received an UPDATE message that failed to parse.
Entity index               = 1
Local address              =
Local port                 = 0
Remote address             =
Remote port                = 0
Scope ID                   = 0

**** EXCEPTION   0x4102 - 71   (0000) **** I:008e7cd1 F:00000020
qbnmsnd2.c 167 :at 20:54:21, 2 May 2021 (1822572648 ms)
A NOTIFICATION message is being sent to a neighbor due to an unexpected
NM entity index       = 1
Local address         =
Local port            = 0
Remote address        =
Remote port           = 0
Scope ID              = 0
Remote AS number      = 64905
Remote BGP ID         = 0X0A06400C
Error code            = UPDATE Message Error (3)
Error subcode         = Invalid Network Field (10)

This could cause a network outage for all subtending networks on this peer. Consider this a friendly reminder to always leverage route filtering between autonomous systems!

Unfortunately, strict import filters on PAN-OS did not resolve this issue.

NSX-T Edge Transport Node Packet Captures

NSX-T Edge Transport Node Packet Captures

NSX-T Edge nodes have a rudimentary packet capture tool built in to the box. It is important to have a built-in tool here, as GENEVE encapsulation will wrap just about everything coming out of a transport node.

NSX-T's CLI guide indicates the method for packet captures - from here we can break it down to a few steps:

  • Find the VRF you want to capture from
  • Find the interface in that VRF you want to capture from
  • Capture from this interface!
get logical-routers
vrf {{ desired VRF }}
get interfaces
set capture session 0 interface {{ interface-id }} direction dual
set capture session 0 file example.pcap

The result will be placed in:


I do have some notes to be aware of here:

  • Be careful with packet captures! This is on an all-CPU router - so isolating the device before capturing packets is a wise choice. We can do that with NSX-T, we just need to remember to.
  • It's possible to use tcpdump-based packet filters instead of a wholesale capture - just replace the final line with a command similar to this:
set capture session 0 file example.pcap expression port 179

Sunday, April 11, 2021

Saturday, April 3, 2021

VMware NSX Advanced Load Balancer - Overview

Load Balancing is Important

Load balancing is an important aspect of network mobility.

How is a network useful if you can't move around within it?

  • Cellular networks lose their appeal if you drop connectivity every time you roam between towers
  • Wi-Fi networks are designed to facilitate smaller-scale movements. Imagine if you had to sit still for your Wi-Fi to work
Network Movements also facilitate migrations between services - as a consumer of a network service, frequent cutovers occur without your knowledge:
  • Infrastructure upgrades: Firewalls, routers, switches constantly need to be bumped up to higher speeds, and feeds
  • Preventing outages: Network "Maintenance Mode"

As computer networks get more complex - SDN is important for the orchestration of these changes or "movements". A distributed, off-box, dedicated management and control plane is essential to tracking "customers" in a scalable fashion - but load balancing is special here.

Most of our consumed services today leverage load balancers to "symmetrify" network traffic to accommodate nodes that do not support them. This can solve a lot of problems large enterprises have:

  • Need to scale firewalls past 2?
  • Need to scale firewalls in any public cloud?
  • Imperfect link balancing with ECMP hashing?
  • Want to prefer an ISP over another, but use both?
These problems are all solvable by the right load balancer platform - but are infrastructure specific. Load balancers often solve application-specific problems, including:
  • HTTP Transforms
  • TLS Quality Enforcement / Consolidated Stack
  • "Diet" Acceleration, e.g. HTTP Compression

Stateless apps work perfectly without some form of load balancer/ingress controller but still benefit greatly from a discrete point to ingest data as well.

NSX Advanced Load Balancer Differentiating Points

N.B. I will probably revise this in a later post as I get more familiar with Avi Vantage

Avi Networks was founded in 2012 with the goal of providing a software load balancer designed from the ground up to leverage Software-Defined Networking (SDN) capabilities. Every aspect of the platform's design appears to eschew this - the company clearly wanted to perform a totally new platform without any need for maintaining legacy platforms. In 2019, VMware acquired Avi Networks and is rebranding the platform to "NSX Advanced Load Balancer".

Here are some clear differentiating points I have found with the platform so far:
  • Enterprise (Web) Oriented - Some load balancing platforms, like Kemp Technologies and focus on clear, common enterprise needs and executing as effectively as possible; instead of "boiling the ocean" with a more feature-complete platform. If this is you as a customer, you can expect significant cost and quality improvements due to this more narrow focus - but Service Providers and specialty customers may be turned off by this.
  • This product is designed for self-service, with robust management plane multi-tenancy
  • This is a VMware product, so Avi is diving head-first into providing high-quality Kubernetes support
  • Offloaded Control Plane: So far, this is a big one for me personally. I'm continually amazed as to how much rich data can be extracted simply by offloading telemetry processing to a controller. Logging and Analytics do not impact data plane performance and have minimal impact on sizing/costs due to per Service Engine licensing
  • Software-only Kitchen Sink: Few load balancing platforms can support all clouds, KVM, K8s, Cisco ACI, Mesosphere, Acropolis, and OpenStack with direct support. Usually, the best we can hope for with a KVM install is an ISO and a prayer. This is refreshing.
  • Support for dynamic routing: The vast majority of load balancers on the market don't natively support this, and specific implementations like anycast or multi-site load balancing stand to benefit from this particular feature.
  • Global Server Load Balancing (GSLB) allows an engineer to control which site traffic may route to with anycast DNS. This provides them the ability to perform application-level capacity management with multiple sites in one solution.

Design Elements


This is Avi's brain and the primary reason for using a platform like Vantage - the control and management planes are, by default, managed by an offboard controller. The following functions are available here, with no performance penalty to the data plane:
  • Central Configuration Management, all locations, all the time.
    • Configure BGP once
    • Configure routes once
    • Configure vIPs once
    • Configure hardening (logging, TLS settings, passwords) once
  • Monitoring of vIPs, if a service is down relocate it
  • Software Lifecycle Management
  • IP Address Management
  • Periodic monitoring for common issues
  • Per Virtual Service extensive Analytics (Avi Enterprise only). They are running ElasticSearch on-box to achieve this, it's pretty neat.
NB: Avi Release 20.1.4 has <900 Debian packages (based on bullseye/sid), so they are running a little lean but could do more cleanup. 20.1.5 is down to 820 - so they are working on this.

Service Engine

Generally, these components do work. Structurally, these appliances are running Debian bullseye/sid with load balancer processes as Docker images. They're running the same edition of FRRouting as NSX-T - with the same approximate OS edition.

Service Engines do:
  • Report in to the AVI controller
  • Perform actual load balancing functions
NB: Avi Release 20.1.5 is much leaner than prior releases, and SEs typically have a much more compressed install base. 515 Debian packages here - almost in line with NSX-T 3.1.2!


  • AVI Controller UI and vCenter/NSX-T Interaction have hard-coded IPv4 Constructs, 20.1.5 introduces preliminary support for IPv6, but VMware's NSBU is usually ahead of everyone else here. I'll be testing vCenter + IPv6 in a later post.
  • AVI Controllers appear to pick up an IPv6 address via SLAAC
  • This platform appears to have full data-plane support.

Deployment Methodology

Management/Control Plane

No orchestrator pre-sets will be used here - per the Avi NSX-T Integration Guide. The primary reason for my doing this is as a more thorough test of this platform - I'll be deploying 3 "Clouds":
  • Layer 2 Cloud (Typical A/P Load Balancer Deployment)
  • Layer 3 Cloud (MP-BGP Load Balancer Deployment)
  • NSX-T Cloud (NSX-T Integrated Deployment)
Avi Vantage designates any grouping of infrastructure presets as a "Cloud", which can have its own tenancy and routing table. This construct allows us to allocate multiple infrastructures to each administrative tenant or customer. This access is decoupled from "Tenant", which is the parent for RBAC.

Data Plane Topologies

The Avi Vantage VCF Design Guide 4.1 indicates that service engines should be attached to a tier-1 router as an overlay segment. The primary reason for this has to do with NSX-T and Avi's integration - in short, the Avi controller invokes the NSX-T API to add and advertise static routes to each service engine to handle dynamic advertisement.

Monday, March 22, 2021

Design Pattern: Looking Glasses

It's probably safe to say that service provider networking is pretty unique.

One particular design pattern - Looking Glasses - is extremely useful for complex dynamically routed networks

I'd really like to shift the gatekeeping needle here - networks that are complex enough to benefit from a looking glass should move to:
  • >100 Routing table entries globally
  • Some vague preference towards reliability
  • Dynamic Routing (BGP is preferred)
In any small to medium enterprise, I'd posit that the only thing truly preventing benefits, in this case, is the lack of dynamic routing adoption, primarily because pre-packaged offerings in this range don't have an "easy button" for implementing them. This lack of accessibility causes a real problem with SMB networking, as reliability features stay out of their reach.

Design Pattern: Looking Glass

A Network "Looking Glass" is a type of web server that responds to user requests, providing externalized (without userspace access to network equipment) to an authenticated or unauthenticated client. This allows clients to view BGP meta-data, routing tables to ensure outbound advertisements between Service Providers have propagated. 

Here's my starting point for this design pattern.

History (non-inclusive)

Note: I don't have everything here. It seems most Looking Glasses were stood up silently by telecommunications companies. They're searchable, but I can't find any citable data on when they started out.


  • Least (Zero) Privilege Access to a network services routing table, searchable via API and/or GUI


Of these forces, #1 is probably the biggest. Since we cannot force all of the networking industry titans (yet) to provide a permission set that will facilitate this use - I'd propose the following approach:
In this solution, I'm proposing some additional safeguards/scale-guards to make sure that the approach will not be harmful to a "host" network. In addition to implementing the looking glass, I'd propose the deployment of a series of Virtual Network Functions (VNFs) scaled out with monitored routing tables. This is where the collectors would interact - if the physical network doesn't allow any inbound prefixes from the VNF, it's easy enough to build a solution to safely collect from it. There are tons of VNF options here - as we only need BGP capability and a collection method.

Saturday, March 13, 2021

Unearned Uptime - Present and Future Design Patterns

After all that meatspace talk, let's look at a few technical solutions and why they might not meet business needs in a specific setting.

Shared Control Planes / Shared Failure Plane

Shared Control Plane design patterns are prolific within the networking industry - and there's a continuum. Generally, a control plane between devices should be designed with reliability in mind, but most shared control plane implementations tend to have "ease of administration" as intent instead of reliability. Here are some common examples.


"Stacking" implementations represent an early industry pattern where (typically) campus deployments weren't entirely large enough to justify a chassis switch but still wanted enough lateral bandwidth to eliminate a worry point. Primary motivations for "stacking" were:

  • Single Point of Administration
  • Linear scale-out costs

Stacking was an artifact from when software like Ansible, Cisco DNA, ArubaOS-CX/NetEdit, etc. didn't exist from within the industry. Significant downsides exist to stacking software, including:

  • Tight coupling with software, often a total outage or a many-step ISSU upgrade path
  • Software problems take the whole stack down
  • Stacking cables are expensive and proprietary

Stacking is still a pretty good, viable technology for small to medium campus networks. One particular technology I have found interesting is Aruba's Spine and Leaf design, leveraging Aruba's mobility tunnel features to handle anything that needs to keep an IP address.


Multi-Chassis LAG is a pretty contentious issue within the industry.

Note: In Service Provider applications, Layer 2 Loop Prevention is a foundational design pattern for delivering Metro Ethernet services by creating a loop-free single endpoint path. I'm not covering this design pattern, as it's a completely different subject. In this case, I'm illustrating Data Center/Private Cloud network design patterns, and then tangentially Campus from there.

MC-LAG as a design pattern isn't all that bad compared to some - however, some applications of MC-LAG in the data center turn out to be fairly problematic.

Modern Data Center Fabric Switching

Given the rise of Hyper-Converged Infrastructure - we're actually seeing data center hardware get used. Prior to this last generation (2012-onwards) just "being 10 Gig" was good enough for most use cases. Commodity server hardware wasn't powerful enough to really tax fabric oversubscribed switches.

...or was it? Anybody remember liking Cisco FEXes? TRILL? 802.3br?

Storage Area Networks (SAN) offloaded all compute storage traffic in many applications, and basically constituted an out-of-band fabric that was capable of 8-32Gbits/s.

The main problem here is Ethernet. Ethernet forwarding protocols aren't really capable of non-blocking redundant forwarding. This is because there is no routing protocol. Fiber Channel will use either IS-IS or SPF in most cases for this purpose, and hosts participate in this routing protocol.

The biggest change that this has - Fiber Channel can have two completely independent fabrics, devoid of interconnection. This allows an entire fabric to go completely offline with no issues.

MC-LAG goes in a completely different direction - forcing redundant Ethernet switches to share a failure plane. With Data Centers, the eventual goal for this design pattern is to move to this "share-nothing" approach, eventually resulting in EGP or IGP participation by all subtending devices in a fabric.

Now - we don't have that capability in most hypervisors today. Cumulus does have a Host Routing Implementation, but most common hypervisors have yet to adopt this approach. VMware, Amazon, Microsoft, and Cumulus all contribute to a common routing code base (FRRouting) and are using it to varying extents within their networks to prevent this "Layer 2 Absenteeism" from becoming a workload problem. Of these solutions - VMware's NSX-T is probably the most prolific solution if you're not a hyperscaler that can develop your own hypervisor / NOS combination like Amazon/Microsoft:

Closing Notes

Like it or not, these examples are perfectly viable design patterns when used properly. Given industry trends and some crippling deficiencies with Giant-Scale Ethernet Topologies in large-scale data center and campus networks, we as network designers must keep an eye to the future, and plan accordingly. In these examples, we examined (probably very for some) tightly coupled design patterns used in commodity networks, and where they commonly fail.

If you use these design patterns in production - I would strongly recommend asking yourself the following questions:

  • What's the impact of a software upgrade, worst-case?
  • What happens if a loop is introduced?
  • What's the plan for removing that solution in a way that is not business invasive?
  • What if your end-users scale beyond the intended throughput/device count you anticipated when performing that design exercise?
Hopefully, this explains some of the why behind existing trends. We're moving to a common goal - an automatable, reliable, vendor-independent fabric for interconnection of network devices using common protocols - and nearly all of the weirdness around this can be placed at the networking industry's feet - We treat BGP as this "protocol of the elites" instead of teaching people how to use EGPs. We (the networking industry) need to do more work to become more accessible to adjacent industries - They'll be needing us really soon if they don't already.

Unearned Uptime: Letting Old Ideas Go

We don't always earn reliability with the systems we deploy, design, and maintain

Infrastructure reliability is a pretty prickly subject for the community - we as engineers and designers tend to anthropomorphize, attach, and associate personal convictions with what we maintain. It's a natural pattern, but it inflicts a certain level of self-harm when we fail to improve upon the platforms that serve as the backbone to those we support.

There are two major problems I perceive with regards to translating unearned uptime to reliability

  • History
  • Ego
  • Architecture (later post)

Throughout this article, I'll cover these problems and then transition into common examples of "unearned uptime" in the industry. These are not "networking" issues - it's an infrastructure issue. We have the same problems with most civil structures, interchanges, runways, etc.

The idea that we didn't earn reliability delivered to the business is one thing that we as infrastructure engineers and designers aren't particularly comfortable with.


It doesn't have a problem! It's been working fine for years!

 Credit: Marc Olivier-Jodoin

Infrastructure needs routine replacement to function correctly

Consumers rarely notice issues with infrastructure until they've gotten to be truly problematic. An easy example of this is asphalt concrete (or bitumen, depending on where you live).

The material itself is relatively simple, rock aggregate + oil - but it's pretty magical in terms of usefulness. Asphalt itself functions as a temporary adhesive, bonding to automotive tires and making roads really safe by shortening stopping distances. The composite material is also flexible, allowing the ground below it to shift to an extent - which means that places with more dynamic geology.

We don't really think about wear to this surface as consumers after it's been installed. Public works / Civil Engineers sure do, because it's their job, but think about it - if you drive your car over a residential street three times a day, that's probably over 4 metric tons of material that the road has to withstand in a day. This wear adds up! A typical residential (neighborhood) street will see over 15,000 metric tons of weight per year.

The sheer scale of road wear is utterly staggering. This GAO Report on Weight Enforcement illustrates how controlling wear (usage) is a method of conveying importance, but that doesn't really work all that well for us...

Practical IT Applications

When designing technology infrastructure, especially as a service provider, you want to encourage usage.

Usage drives bigger budgets and your salary! Ultimately, wear with tech infrastructure is going to be about the same regardless of load. Scarcity economics don't work particularly well in IT.

To solve the history problem, you want to convince business line owners to desire and delight in what you provide.

The antithesis to "customer delight" in this case is often this big guy:  By User:MrChrome, CC BY 3.0,

Fun fact, the Cisco 6500 is a lot older than you'd think, entering service in 1999. For more:

Cisco 6500 series switches were simply too reliable. The Toyota Camry of switches, Cisco's 6500s lived everywhere, convincing executives that it was totally okay to skip infrastructure refreshes, much to the chagrin of Infrastructure Managers worldwide.

The Solution - Messaging

We shouldn't be waiting for stuff to fail to replace it - it's time to get uncomfortable and speak to consumers. Most humans are intelligent - let's help them understand why we care about 25/100 Gigabit connectivity, cut-through switching, 802.11ax in terms that are geared towards them.

Here are some pointers on where to start:

  • You're not replacing something because it was bad.
    • A pretty easy pitfall for IT professionals - if you devalue "what came before" you devalue the role a replacement fills. It may be hard to do, but most things here were built for a reason - the intent behind the design is important for other reasons, but this negativity will affect anything you do after that.
  • Show how they can use it
    • This might not make a lot of sense at the outset, but any trivial method for interaction will make a particular change feel more concrete. Some examples:
      • Add a Looking Glass view if it's a new network. Providing users a way to "peek inside" is a time-honored tradition with many industries.
      • Open some iPerf/Spirent servers for users to interact with, or other benchmarking
      • Functional demos like blocking
  • Share how it is made
    • You never know, why not try?


This one's a bit harder - and I'm not trying to apply major negative connotations here. As engineers, we get pretty attached to our decisions, attributing significant personal effort to the products we purchase.

As an industry, IT professionals really need to re-align here. We consider vendor relationships allegiances and fundamentally attribute our own personal integrity. If I had my way, I'd stop hearing that someone's a "Cisco" or a "VMware" guy - we need to shift this focus back to consumers.

The biggest point for improvement here is also on the negativity front. Let's start by shifting from "this solution is bad" (devaluing your own work for no reason) to "This solution doesn't fit our needs, and this is why." The latter helps improve future results by getting the ball rolling on what criteria consumers value more.

After deploying quite a few solutions "cradle-to-grave," my personal approach here is to think of them like old cars, computers, stuff like that. I fondly remember riding around in my parents' 80's suburban, but we replaced it because it wasn't reliable enough for the weather we had to face in Rural Alaska, and it was too big.

Here are some examples of how I regard these older, later replaced solutions/products:

  • Cisco 6500s: Fantastically reliable, fantastic power bills, fantastic complexity to administer
  • Aruba 1xx series Access Points: Revolutionary access control, less than stellar radio performance
  • Palo Alto 2000/4000 series firewalls: Again, revolutionary approaches to network security, but not enough performance for modern businesses to function. Commit times improved greatly on later generations
  • TM-OS 11.x: Incredible documentation, incredible feature depth. If it's more modern than 2015, you're going to want more features

All of these served businesses well, then needed to be replaced. I see too many engineers beat themselves up when services eventually fell apart, and it's just not necessary.

Sunday, January 17, 2021

9/10 NGINX Use Cases, URI and Host rewrites

NGINX Rewrite Directives, The 9/10 Solutions

When doing ADC/Load Balancer work, nearly all requests fit into two categories:

  • Please rewrite part of the URL/URI
  • Please change the host header for this reverse proxy

These are fairly simple to implement in NGINX, so I'm creating a couple of cheat-sheet code snippets here.

"Strip Part of the URL Out"

URI stripping is fairly common, and the primary motivation for this blog post. As enterprises move to Kubernetes, they're more likely to use proxy_pass directives (among other things) to multi-plex multiple discrete services into one endpoint.

With URI stripping, an engineer can set an arbitrary URI prefix and then remove it before the web application becomes aware. URI stripping is a useful function to stitch multiple web services together into one coherent endpoint for customer use.

NGINX comes to the rescue here, with a relatively simple solution:

  • location directive: Anchors the micro- or sub- service to an NGINX URI
  • rewrite directive: Rewrites the micro- or sub- service to a new directory, allowing for minimal backend modifications

The below example achieves this by rewriting the URI /build* to /, ensuring that the build service (Jenkins) doesn't need to be re-tooled to work behind a proxy:

  location /builds/ {
    root /var/lib/jenkins/workspace/;
    rewrite ^/builds(.*)$ $1 break;
    autoindex on;

As you can see, this example is an obvious security risk, as the autoindex directive lets clients browse through the build service without authentication and potentially access secrets, and is intended as an illustration and not a direct recommendation for production practice. Here's a little bit more production-appropriate example providing Jenkins over TLS (source:

    server {
        listen       443 ssl http2 default_server;
        listen       [::]:443 ssl http2 default_server;

        ssl_certificate "CERT;
        ssl_certificate_key "KEY";
        ssl_session_cache shared:SSL:1m;
        ssl_session_timeout  10m;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers ALL:!AES:!RC4:!SHA:!MD5;
        ssl_prefer_server_ciphers on;

        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        location ~ "^/static/[0-9a-fA-F]{8}\/(.*)$" {
            #rewrite all static files into requests to the root
            #E.g /static/12345678/css/something.css will become /css/something.css
            rewrite "^/static/[0-9a-fA-F]{8}\/(.*)" /$1 last;

        location /userContent {
            # have nginx handle all the static requests to userContent folder
            #note : This is the $JENKINS_HOME dir
            root /var/lib/jenkins/;
            if (!-f $request_filename){
            #this file does not exist, might be a directory or a /**view** url
            rewrite (.*) /$1 last;
            sendfile on;
        location / {
                    sendfile off;
                    proxy_pass http://jenkins/;
            # Required for Jenkins websocket agents
            proxy_set_header   Connection        $connection_upgrade;
            proxy_set_header   Upgrade           $http_upgrade;

            proxy_set_header   Host              $host;
            proxy_set_header   X-Real-IP         $remote_addr;
            proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Proto $scheme;
            proxy_max_temp_file_size 0;

            #this is the maximum upload size
            client_max_body_size       10m;
            client_body_buffer_size    128k;

            proxy_connect_timeout      90;
            proxy_send_timeout         90;
            proxy_read_timeout         90;
            proxy_buffering            off;
            proxy_request_buffering    off; # Required for HTTP CLI commands
            proxy_set_header Connection ""; # Clear for keepalive
        error_page 404 /404.html;
            location = /40x.html {

        error_page 500 502 503 504 /50x.html;
            location = /50x.html {

Set Host Headers

This is quite a bit easier, using the proxy_set_header directive:

  location /builds/ {
    proxy_pass http://localhost:8080;
    proxy_set_header Host
    rewrite ^/fabric-builds(.*)$ $1 break;

Sunday, January 3, 2021

NSX-T Transitive Networking

One major advantage to NSX-T is that Edge Transport Nodes (ETNs) are transitive.

Transitivity (Wikipedia) (Consortium GARR) is an extremely important concept in network science, and in computer networking. 

In simple terms, a network node (any speaker capable of transmitting or receiving on a network) can have the following transitivity patterns:
  • Transitive: Most network equipment fit in this category. The primary purpose of these devices is to allow traffic to flow through them and to occasionally offer services over-the-top. 
    • Examples:
      • Switches
      • Routers
      • Firewalls
      • Load Balancers
      • Service Meshes
      • Any Linux host with ip_forward set
      • Mobile devices with tethering
  • Non-Transitive: Most servers, client devices fit in this category. These nodes are typically either offering services over a network or consuming them (Usually both). In nearly all cases, this is a deliberate choice by the system designer for loop prevention purposes. 
    • Note: It's completely possible to participate in a routing protocol while being non-transitive. 
    • Examples:
      • VMware vSphere Standard Switch && vSphere Distributed Switch (no Spanning-Tree participation)
      • Amazon vPC
      • Azure VNet
      • Any Linux host with ip_forward disabled
      • Nearly any server, workstation, mobile device
  • Anti-Transitive: This is a bit of a special use case, where traffic is transitive but only in specific use cases. Anti-Transitive network nodes have some form of control in place to prevent transit in specific scenarios but allowing it in others. The most common scenario is when an enterprise has multiple service providers - where the enterprise doesn't want to pay for traffic going between those two carriers.
    • Examples:
      • Amazon Transit Gateway
      • Any BGP Router with import/export filters

vSphere Switch Transitive Networking Design

To fully understand VMware's approach, it is important to first understand earlier approaches to network virtualization. vSphere switches are a bit of a misnomer, as you don't actually switch at any given point. Instead, vSphere switches leverage a "Layer 2 Proxy" of sorts, where NIC-accelerated software replaces ASIC flow-based transitive switching.

This approach offers incredible flexibility, but is theoretically slower than software switching; to preserve this capability VMware noticed early on that loop prevention would become an issue. Pre-empting this problem, making the platform completely non-transitive to ensure that this flexibility will be more readily adopted.

Note: VMware's design choices here contained the direct intent to simplify the execution and management of virtualized networking. This choice made computer networking simple enough for most typical VI administrators to perform, but more of the advanced features (QoS, teaming configurations) require more direct involvement from network engineers to execute well. Generally speaking, the lack of need for direct networking intervention for a VSS/vDS to work has led to a negative trend with the VI administrator community. Co-operation between VI administration and networking teams often suffer due to this lack of synchronization, and with it systems performance as well.

NSX-T Transitive Networking Design

NSX-T is highly prescriptive in terms of topology. VMware has known for years that a highly controlled design for transitive networking will provide stability to the networks it may participate in - just look at the maturity/popularity of vDS vs Nexus 1000v.

NSX-T does depend on VDS for Layer 2 forwarding (as we've established, not really switching), but does follow the same general principles for design. 

To be stable, you have to sacrifice flexibility. This is for your own protection. These choices are artificial design limitations, intentionally placed for easy network virtualization deployment.

VMware NSX-T Tier-0 logical routers have to be transitive to perform their main goal, transporting overlay traffic to underlay network nodes. Every time a network node becomes transitive in this way, specific design decisions must be made to ensure that anti-transitive measures are appropriately used to achieve network stability. 

NSX-T Tier-1 Distributed routers are completely nontransitive, and NSX-T Tier-1 Service Routers have severely limited transitive capabilities. I have diagrammed this interaction as non-transitive because the Tier-1 services provided are technically owned by that logical router.

Applications for Transitive Tier-0 Routers

Given how tightly controlled transit is with NSX-T, the only place we can perform these tasks is via the Tier-0 Logical Router. Let's see if it'll let us transit networks originated from a foreign device, shall we?


NSX-T Tier-0 Logical Routers are capable as transit providers, and the only constructs preventing transit are open standards (BGP import/export filters)

Unit Test

Peer with vCLOS network via (transiting) NSX-T Tier-0 Logical Router:

 Let's build it, starting with the vn-segments:
Then, configuring Tier-0 External Interfaces:
Ensure that we're re-distributing External Interface Subnets:
Ensure that the additional prefixes are being advertised. Note: This is a pretty big gripe of mine with the NSX GUI - we really ought to be able to drill down further here...
Configure BGP Peering to the VyOS vCLOS Network:
We're good to go on the NSX Side. In theory, this should provide a transitive peering, as BGP learned routes are not Re-Distributed but learned.

(The other side is VyOS, configured in the pipeline method outlined in a previous post. This pipeline delivery method is really growing on me)

We can verify that prefixes are propagating transitively via the NSX-T Tier-0 in both protocol stacks by checking in on the spines that previously had no default route:$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

B>* [20/0] via, eth1, weight 1, 00:15:20
B>* [20/0] via, eth1, weight 1, 00:15:20$ show ipv6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

B>* ::/0 [20/0] via fe80::250:56ff:febc:b05, eth1, weight 1, 00:15:25
Now, to test whether or not packets actually forward$ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=53 time=49.7 ms
64 bytes from icmp_seq=2 ttl=53 time=48.10 ms
64 bytes from icmp_seq=3 ttl=53 time=45.9 ms
64 bytes from icmp_seq=4 ttl=53 time=45.0 ms

Looks like Tier-0 Logical Routers are transitive! This can have a lot of future implications - because NSX-T can become a launchpad for all sorts of virtualized networking. Some easy examples:
  • Tier-0 Aggregation: Like with aggregation-access topologies within the data center and campus, this is a way to manage BGP peer/linkage count at scale, allowing for thousands of Tier-0 Logical Routers per fabric switch.
  • Load Balancers: This shifts the peering relationship for load balancers/ADC platforms from a direct physical peering downward, making those workloads portable (if virtualized)
  • Firewalls: This provides Cloud Service Providers (CSP) the ability to provide customers a completely virtual, completely customer-owned private network, and the ability to share common services like internet connectivity.
  • NFVi: There are plenty of features that can leverage this flexibly in the NFV realm, as any given Enterprise VNF and Service Provider VNF can run BGP. Imagine running a Wireless LAN Controller and injecting a customer's WLAN prefixes into their MPLS cloud - or even better, their cellular clients.

VMware NSX Advanced Load Balancer - Installation

Pre-Requisites Before beginning the Avi installer, I configured the following in my environment: Management Segment (NSX-T Overlay). This is...