Monday, March 22, 2021

Design Pattern: Looking Glasses

It's probably safe to say that service provider networking is pretty unique.

One particular design pattern - the Looking Glass - is extremely useful for complex, dynamically routed networks.

I'd really like to shift the gatekeeping needle here - a network doesn't need to be carrier-scale to benefit from a looking glass. I'd put the bar at:
  • More than 100 routing table entries globally
  • Some preference toward reliability
  • Dynamic routing (BGP preferred)
In any small or medium enterprise, I'd posit that the only thing truly preventing these benefits is the lack of dynamic routing adoption - primarily because pre-packaged offerings in this range don't have an "easy button" for implementing it. This lack of accessibility is a real problem for SMB networking, as reliability features stay out of reach.

Design Pattern: Looking Glass

A network "Looking Glass" is a type of web server that responds to user requests, providing externalized visibility (without userspace access to the network equipment itself) to an authenticated or unauthenticated client. This allows clients to view BGP metadata and routing tables - for example, to verify that outbound advertisements have propagated between service providers.

Here's my starting point for this design pattern.

History (non-inclusive)

Note: I don't have everything here. It seems most looking glasses were stood up quietly by telecommunications companies - they're searchable, but I can't find any citable data on when they first appeared.


  • Least (Zero) Privilege Access to a network services routing table, searchable via API and/or GUI


Of these forces, #1 is probably the biggest. Since we cannot (yet) force the networking industry's titans to provide a permission set that facilitates this use case, I'd propose the following approach:
In this solution, I'm proposing some additional safeguards/scale-guards to make sure the approach cannot harm the "host" network. In addition to implementing the looking glass itself, I'd propose deploying a series of Virtual Network Functions (VNFs), scaled out, with monitored routing tables. This is where the collectors would interact - if the physical network doesn't accept any inbound prefixes from the VNF, it's easy enough to build a solution that safely collects from it. There are plenty of VNF options here, as we only need BGP capability and a collection method.

Saturday, March 13, 2021

Unearned Uptime - Present and Future Design Patterns

After all that meatspace talk, let's look at a few technical solutions and why they might not meet business needs in a specific setting.

Shared Control Planes / Shared Failure Plane

Shared Control Plane design patterns are prolific within the networking industry - and there's a continuum. Generally, a control plane between devices should be designed with reliability in mind, but most shared control plane implementations tend to have "ease of administration" as intent instead of reliability. Here are some common examples.


"Stacking" implementations represent an early industry pattern where (typically) campus deployments weren't quite large enough to justify a chassis switch but still wanted enough lateral bandwidth to eliminate a worry point. Primary motivations for stacking were:

  • Single Point of Administration
  • Linear scale-out costs

Stacking is an artifact from a time when software like Ansible, Cisco DNA, ArubaOS-CX/NetEdit, etc. didn't yet exist within the industry. Significant downsides exist with stacking software, including:

  • Tight coupling with software - upgrades often mean a total outage or a many-step ISSU path
  • Software problems take the whole stack down
  • Stacking cables are expensive and proprietary

Stacking is still a pretty good, viable technology for small to medium campus networks. One particular technology I have found interesting is Aruba's Spine and Leaf design, leveraging Aruba's mobility tunnel features to handle anything that needs to keep an IP address.


Multi-Chassis LAG is a pretty contentious issue within the industry.

Note: In Service Provider applications, Layer 2 Loop Prevention is a foundational design pattern for delivering Metro Ethernet services by creating a loop-free single endpoint path. I'm not covering this design pattern, as it's a completely different subject. In this case, I'm illustrating Data Center/Private Cloud network design patterns, and then tangentially Campus from there.

MC-LAG as a design pattern isn't all that bad compared to some - however, some applications of MC-LAG in the data center turn out to be fairly problematic.

Modern Data Center Fabric Switching

Given the rise of Hyper-Converged Infrastructure, we're actually seeing data center hardware get used. Prior to this last generation (2012 onwards), just "being 10 Gig" was good enough for most use cases - commodity server hardware wasn't powerful enough to really tax oversubscribed fabric switches.

...or was it? Anybody remember liking Cisco FEXes? TRILL? 802.3br?

Storage Area Networks (SAN) offloaded all compute storage traffic in many applications, and basically constituted an out-of-band fabric that was capable of 8-32Gbits/s.

The main problem here is Ethernet. Ethernet forwarding protocols aren't really capable of non-blocking redundant forwarding, because there is no routing protocol. Fibre Channel uses FSPF, a link-state protocol (conceptually similar to IS-IS/OSPF), for this purpose, and fabric members participate in that routing protocol.

The biggest consequence: Fibre Channel can run two completely independent fabrics, devoid of interconnection. This allows an entire fabric to go completely offline with no impact.

MC-LAG goes in a completely different direction - forcing redundant Ethernet switches to share a failure plane. With Data Centers, the eventual goal for this design pattern is to move to this "share-nothing" approach, eventually resulting in EGP or IGP participation by all subtending devices in a fabric.

Now - we don't have that capability in most hypervisors today. Cumulus does have a Host Routing Implementation, but most common hypervisors have yet to adopt this approach. VMware, Amazon, Microsoft, and Cumulus all contribute to a common routing code base (FRRouting) and are using it to varying extents within their networks to prevent this "Layer 2 Absenteeism" from becoming a workload problem. Of these solutions - VMware's NSX-T is probably the most prolific solution if you're not a hyperscaler that can develop your own hypervisor / NOS combination like Amazon/Microsoft:

Closing Notes

Like it or not, these examples are perfectly viable design patterns when used properly. Given industry trends and some crippling deficiencies with giant-scale Ethernet topologies in large data center and campus networks, we as network designers must keep an eye to the future and plan accordingly. In these examples, we examined some (probably very familiar) tightly coupled design patterns used in commodity networks, and where they commonly fail.

If you use these design patterns in production - I would strongly recommend asking yourself the following questions:

  • What's the impact of a software upgrade, worst-case?
  • What happens if a loop is introduced?
  • What's the plan for removing that solution in a way that is not business invasive?
  • What if your end-users scale beyond the intended throughput/device count you anticipated when performing that design exercise?
Hopefully, this explains some of the why behind existing trends. We're moving toward a common goal: an automatable, reliable, vendor-independent fabric for interconnecting network devices using common protocols. Nearly all of the weirdness along the way can be placed at the networking industry's feet - we treat BGP as a "protocol of the elites" instead of teaching people how to use EGPs. We (the networking industry) need to do more work to become accessible to adjacent industries - they'll be needing us really soon, if they don't already.

Unearned Uptime: Letting Old Ideas Go

We don't always earn reliability with the systems we deploy, design, and maintain

Infrastructure reliability is a pretty prickly subject for the community - we as engineers and designers tend to anthropomorphize, attach, and associate personal convictions with what we maintain. It's a natural pattern, but it inflicts a certain level of self-harm when we fail to improve upon the platforms that serve as the backbone to those we support.

There are a few major problems I perceive with regard to translating unearned uptime into reliability:

  • History
  • Ego
  • Architecture (later post)

Throughout this article, I'll cover these problems and then transition into common examples of "unearned uptime" in the industry. These aren't "networking" issues - they're infrastructure issues. We have the same problems with most civil structures: interchanges, runways, etc.

The idea that we didn't earn reliability delivered to the business is one thing that we as infrastructure engineers and designers aren't particularly comfortable with.


It doesn't have a problem! It's been working fine for years!

Credit: Marc-Olivier Jodoin

Infrastructure needs routine replacement to function correctly

Consumers rarely notice issues with infrastructure until they've gotten to be truly problematic. An easy example of this is asphalt concrete (or bitumen, depending on where you live).

The material itself is relatively simple - rock aggregate + oil - but it's pretty magical in terms of usefulness. Asphalt functions as a temporary adhesive, bonding to automotive tires and making roads safer by shortening stopping distances. The composite material is also flexible, allowing the ground below it to shift to an extent - which matters in places with more dynamic geology.

We don't really think about wear on this surface as consumers after it's installed. Public works / civil engineers sure do, because it's their job - but think about it: if you drive your car over a residential street three times a day, that's probably over 4 metric tons of material the road has to withstand in a day. This wear adds up! A typical residential (neighborhood) street will see over 15,000 metric tons of weight per year.
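The arithmetic is easy to sanity-check with round numbers (the car mass and vehicle count here are my assumptions, just to show the orders of magnitude):

```python
car_mass_t = 1.4        # assumed mass of a typical passenger car, metric tons
passes_per_day = 3      # the three daily trips from the example

daily_load_t = car_mass_t * passes_per_day    # just over 4 t/day from one car
yearly_load_t = daily_load_t * 365            # roughly 1,500 t/year from one car

# A street serving only ten such vehicles already clears 15,000 t/year:
street_yearly_t = yearly_load_t * 10
print(daily_load_t, street_yearly_t)
```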

The sheer scale of road wear is utterly staggering. This GAO Report on Weight Enforcement illustrates how controlling wear (usage) is a method of conveying importance, but that doesn't really work all that well for us...

Practical IT Applications

When designing technology infrastructure, especially as a service provider, you want to encourage usage.

Usage drives bigger budgets and your salary! Ultimately, wear with tech infrastructure is going to be about the same regardless of load. Scarcity economics don't work particularly well in IT.

To solve the history problem, you want to convince business line owners to desire and delight in what you provide.

The antithesis to "customer delight" in this case is often this big guy: the Cisco 6500 (image credit: User:MrChrome, CC BY 3.0).

Fun fact: the Cisco 6500 is a lot older than you'd think, entering service in 1999.

Cisco 6500 series switches were simply too reliable. The Toyota Camry of switches, Cisco's 6500s lived everywhere, convincing executives that it was totally okay to skip infrastructure refreshes, much to the chagrin of Infrastructure Managers worldwide.

The Solution - Messaging

We shouldn't be waiting for stuff to fail before replacing it - it's time to get uncomfortable and speak to consumers. Most humans are intelligent - let's help them understand why we care about 25/100 Gigabit connectivity, cut-through switching, and 802.11ax, in terms geared toward them.

Here are some pointers on where to start:

  • You're not replacing something because it was bad.
    • A pretty easy pitfall for IT professionals - if you devalue "what came before" you devalue the role a replacement fills. It may be hard to do, but most things here were built for a reason - the intent behind the design is important for other reasons, but this negativity will affect anything you do after that.
  • Show how they can use it
    • This might not make a lot of sense at the outset, but any trivial method for interaction will make a particular change feel more concrete. Some examples:
      • Add a Looking Glass view if it's a new network. Providing users a way to "peek inside" is a time-honored tradition with many industries.
      • Open some iPerf/Spirent servers for users to interact with, or other benchmarking
      • Functional demos like blocking
  • Share how it is made
    • You never know, why not try?


This one's a bit harder - and I'm not trying to apply major negative connotations here. As engineers, we get pretty attached to our decisions, attributing significant personal effort to the products we purchase.

As an industry, IT professionals really need to re-align here. We treat vendor relationships as allegiances and fundamentally attach our own personal integrity to them. If I had my way, I'd stop hearing that someone's a "Cisco guy" or a "VMware guy" - we need to shift this focus back to consumers.

The biggest point for improvement here is also on the negativity front. Let's start by shifting from "this solution is bad" (devaluing your own work for no reason) to "This solution doesn't fit our needs, and this is why." The latter helps improve future results by getting the ball rolling on what criteria consumers value more.

After deploying quite a few solutions "cradle-to-grave," my personal approach is to think of them like old cars and computers. I fondly remember riding around in my parents' '80s Suburban, but we replaced it because it wasn't reliable enough for the weather we faced in rural Alaska, and it was too big.

Here are some examples of how I regard these older, later replaced solutions/products:

  • Cisco 6500s: Fantastically reliable, fantastic power bills, fantastic complexity to administer
  • Aruba 1xx series Access Points: Revolutionary access control, less than stellar radio performance
  • Palo Alto 2000/4000 series firewalls: Again, revolutionary approaches to network security, but not enough performance for modern businesses to function. Commit times improved greatly on later generations
  • TMOS 11.x: Incredible documentation, incredible feature depth. If your needs are more modern than 2015, though, you're going to want more features

All of these served businesses well, then needed to be replaced. I see too many engineers beat themselves up when services eventually fall apart, and it's just not necessary.

Sunday, January 17, 2021

9/10 NGINX Use Cases, URI and Host rewrites

NGINX Rewrite Directives, The 9/10 Solutions

When doing ADC/Load Balancer work, nearly all requests fit into two categories:

  • Please rewrite part of the URL/URI
  • Please change the host header for this reverse proxy

These are fairly simple to implement in NGINX, so I'm creating a couple of cheat-sheet code snippets here.

"Strip Part of the URL Out"

URI stripping is fairly common, and the primary motivation for this blog post. As enterprises move to Kubernetes, they're more likely to use proxy_pass directives (among other things) to multiplex multiple discrete services into one endpoint.

With URI stripping, an engineer can set an arbitrary URI prefix and then remove it before the web application becomes aware. URI stripping is a useful function to stitch multiple web services together into one coherent endpoint for customer use.
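Before getting to the NGINX directives, the operation itself is easy to see in isolation. This Python snippet mirrors the same rewrite pattern (with `\1` standing in for NGINX's `$1`); it's my illustration, not NGINX's engine:

```python
import re

def strip_builds_prefix(uri: str) -> str:
    # Mirrors: rewrite ^/builds(.*)$ $1 break;
    return re.sub(r"^/builds(.*)$", r"\1", uri)

print(strip_builds_prefix("/builds/job/42/artifact.tgz"))  # -> /job/42/artifact.tgz
print(strip_builds_prefix("/other/path"))                  # no match, unchanged
```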

NGINX comes to the rescue here, with a relatively simple solution:

  • location directive: Anchors the micro- or sub- service to an NGINX URI
  • rewrite directive: Rewrites the micro- or sub- service to a new directory, allowing for minimal backend modifications

The below example achieves this by rewriting the URI /builds* to /, ensuring that the build service (Jenkins) doesn't need to be re-tooled to work behind a proxy:

  location /builds/ {
    root /var/lib/jenkins/workspace/;
    rewrite ^/builds(.*)$ $1 break;
    autoindex on;
  }

As you can see, this example is an obvious security risk - the autoindex directive lets clients browse the build service without authentication and potentially access secrets - so it's intended as an illustration, not a recommendation for production. Here's a more production-appropriate example providing Jenkins over TLS:

    server {
        listen       443 ssl http2 default_server;
        listen       [::]:443 ssl http2 default_server;

        ssl_certificate "CERT";
        ssl_certificate_key "KEY";
        ssl_session_cache shared:SSL:1m;
        ssl_session_timeout  10m;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers ALL:!AES:!RC4:!SHA:!MD5;
        ssl_prefer_server_ciphers on;

        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        location ~ "^/static/[0-9a-fA-F]{8}\/(.*)$" {
            # rewrite all static files into requests to the root
            # e.g. /static/12345678/css/something.css becomes /css/something.css
            rewrite "^/static/[0-9a-fA-F]{8}\/(.*)" /$1 last;
        }

        location /userContent {
            # have nginx handle all the static requests to userContent folder
            # note: this is the $JENKINS_HOME dir
            root /var/lib/jenkins/;
            if (!-f $request_filename) {
                # this file does not exist, might be a directory or a /**view** url
                rewrite (.*) /$1 last;
            }
            sendfile on;
        }

        location / {
            sendfile off;
            proxy_pass http://jenkins/;
            # Required for Jenkins websocket agents
            proxy_set_header   Connection        $connection_upgrade;
            proxy_set_header   Upgrade           $http_upgrade;

            proxy_set_header   Host              $host;
            proxy_set_header   X-Real-IP         $remote_addr;
            proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Proto $scheme;
            proxy_max_temp_file_size 0;

            # this is the maximum upload size
            client_max_body_size       10m;
            client_body_buffer_size    128k;

            proxy_connect_timeout      90;
            proxy_send_timeout         90;
            proxy_read_timeout         90;
            proxy_buffering            off;
            proxy_request_buffering    off; # Required for HTTP CLI commands
        }

        error_page 404 /404.html;
        location = /404.html {
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
        }
    }

Set Host Headers

This is quite a bit easier, using the proxy_set_header directive:

  location /fabric-builds/ {
    proxy_pass http://localhost:8080;
    proxy_set_header Host $host;  # or the literal hostname the backend expects
    rewrite ^/fabric-builds(.*)$ $1 break;
  }

Sunday, January 3, 2021

NSX-T Transitive Networking

One major advantage to NSX-T is that Edge Transport Nodes (ETNs) are transitive.

Transitivity (Wikipedia) (Consortium GARR) is an extremely important concept in network science, and in computer networking. 

In simple terms, a network node (any speaker capable of transmitting or receiving on a network) can have the following transitivity patterns:
  • Transitive: Most network equipment fits in this category. The primary purpose of these devices is to allow traffic to flow through them, occasionally offering services over-the-top.
    • Examples:
      • Switches
      • Routers
      • Firewalls
      • Load Balancers
      • Service Meshes
      • Any Linux host with ip_forward set
      • Mobile devices with tethering
  • Non-Transitive: Most servers, client devices fit in this category. These nodes are typically either offering services over a network or consuming them (Usually both). In nearly all cases, this is a deliberate choice by the system designer for loop prevention purposes. 
    • Note: It's completely possible to participate in a routing protocol while being non-transitive. 
    • Examples:
      • VMware vSphere Standard Switch && vSphere Distributed Switch (no Spanning-Tree participation)
      • Amazon VPC
      • Azure VNet
      • Any Linux host with ip_forward disabled
      • Nearly any server, workstation, mobile device
  • Anti-Transitive: This is a bit of a special case, where traffic is transitive only in specific scenarios. Anti-transitive network nodes have some form of control in place to prevent transit in some scenarios while allowing it in others. The most common example is an enterprise with multiple service providers, where the enterprise doesn't want to pay for traffic transiting between the two carriers.
    • Examples:
      • Amazon Transit Gateway
      • Any BGP Router with import/export filters
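On a Linux host, the transitive/non-transitive line in the examples above comes down to a single sysctl. A small sketch (the helper names are mine):

```python
from pathlib import Path

def is_transitive(ip_forward_value: str) -> bool:
    """'1' means the kernel forwards packets between interfaces."""
    return ip_forward_value.strip() == "1"

def host_is_transitive() -> bool:
    # Only meaningful on Linux, where this proc entry exists.
    return is_transitive(Path("/proc/sys/net/ipv4/ip_forward").read_text())
```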

vSphere Switch Transitive Networking Design

To fully understand VMware's approach, it is important to first understand earlier approaches to network virtualization. "vSphere switch" is a bit of a misnomer, as nothing is actually switched at any point. Instead, vSphere switches leverage a "Layer 2 proxy" of sorts, where NIC-accelerated software replaces ASIC flow-based transitive switching.

This approach offers incredible flexibility, but is theoretically slower than hardware switching. To preserve that flexibility, VMware noticed early on that loop prevention would become an issue - and pre-empted the problem by making the platform completely non-transitive, ensuring the flexibility would be more readily adopted.

Note: VMware's design choices here carried the direct intent to simplify the execution and management of virtualized networking. This choice made computer networking simple enough for most typical VI administrators to perform, but the more advanced features (QoS, teaming configurations) require more direct involvement from network engineers to execute well. Generally speaking, the lack of any need for direct networking intervention to make a VSS/vDS work has led to a negative trend in the VI administrator community: co-operation between VI administration and networking teams often suffers from this lack of synchronization, and systems performance suffers with it.

NSX-T Transitive Networking Design

NSX-T is highly prescriptive in terms of topology. VMware has known for years that a highly controlled design for transitive networking will provide stability to the networks it may participate in - just look at the maturity/popularity of vDS vs Nexus 1000v.

NSX-T does depend on VDS for Layer 2 forwarding (as we've established, not really switching), but does follow the same general principles for design. 

To be stable, you have to sacrifice flexibility. This is for your own protection. These choices are artificial design limitations, intentionally placed for easy network virtualization deployment.

VMware NSX-T Tier-0 logical routers have to be transitive to perform their main goal, transporting overlay traffic to underlay network nodes. Every time a network node becomes transitive in this way, specific design decisions must be made to ensure that anti-transitive measures are appropriately used to achieve network stability. 

NSX-T Tier-1 Distributed routers are completely nontransitive, and NSX-T Tier-1 Service Routers have severely limited transitive capabilities. I have diagrammed this interaction as non-transitive because the Tier-1 services provided are technically owned by that logical router.

Applications for Transitive Tier-0 Routers

Given how tightly controlled transit is with NSX-T, the only place we can perform these tasks is via the Tier-0 Logical Router. Let's see if it'll let us transit networks originated from a foreign device, shall we?


NSX-T Tier-0 Logical Routers are capable as transit providers, and the only constructs preventing transit are open standards (BGP import/export filters)

Unit Test

Peer with vCLOS network via (transiting) NSX-T Tier-0 Logical Router:

Let's build it. The steps on the NSX side:
  • Create the vn-segments
  • Configure Tier-0 External Interfaces
  • Ensure that we're redistributing External Interface subnets
  • Ensure that the additional prefixes are being advertised (Note: this is a pretty big gripe of mine with the NSX GUI - we really ought to be able to drill down further here...)
  • Configure BGP peering to the VyOS vCLOS network

We're good to go on the NSX side. In theory, this should provide transitive peering, as BGP-learned routes are not redistributed, but learned.

(The other side is VyOS, configured in the pipeline method outlined in a previous post. This pipeline delivery method is really growing on me)

We can verify that prefixes are propagating transitively via the NSX-T Tier-0 in both protocol stacks by checking in on the spines that previously had no default route:

$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

B>* [20/0] via, eth1, weight 1, 00:15:20
B>* [20/0] via, eth1, weight 1, 00:15:20

$ show ipv6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

B>* ::/0 [20/0] via fe80::250:56ff:febc:b05, eth1, weight 1, 00:15:25
Now, to test whether or not packets actually forward:

$ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=53 time=49.7 ms
64 bytes from icmp_seq=2 ttl=53 time=48.10 ms
64 bytes from icmp_seq=3 ttl=53 time=45.9 ms
64 bytes from icmp_seq=4 ttl=53 time=45.0 ms

Looks like Tier-0 Logical Routers are transitive! This can have a lot of future implications - because NSX-T can become a launchpad for all sorts of virtualized networking. Some easy examples:
  • Tier-0 Aggregation: Like with aggregation-access topologies within the data center and campus, this is a way to manage BGP peer/linkage count at scale, allowing for thousands of Tier-0 Logical Routers per fabric switch.
  • Load Balancers: This shifts the peering relationship for load balancers/ADC platforms from a direct physical peering downward, making those workloads portable (if virtualized)
  • Firewalls: This provides Cloud Service Providers (CSP) the ability to provide customers a completely virtual, completely customer-owned private network, and the ability to share common services like internet connectivity.
  • NFVi: There are plenty of features that can leverage this flexibly in the NFV realm, as any given Enterprise VNF and Service Provider VNF can run BGP. Imagine running a Wireless LAN Controller and injecting a customer's WLAN prefixes into their MPLS cloud - or even better, their cellular clients.

Thursday, December 31, 2020

Why Automate? Using Pipelines to Develop and Manage Network Configurations

Continuous Delivery: No Rest for the Wicked

Now that we have:

  • A method to generate desired-state configurations, by declaratively defining what the device config should be and combining it with what a device config should have
  • A method to apply configurations automatically, without PuTTY Copy-Pasting

We can now achieve Infrastructure as Code, taking a few artifacts from source control and turning them into a live, viable network device.
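As a dependency-free sketch of that generate step (the posts use Jinja; stdlib string.Template shows the same declarative idea, and the variable names here are purely illustrative):

```python
from string import Template

# Declarative intent: what this device's config *should be*...
intent = {"hostname": "leaf01", "asn": "64512", "router_id": ""}

# ...merged into a baseline template: what every device config *should have*.
baseline = Template(
    "hostname $hostname\n"
    "router bgp $asn\n"
    " bgp router-id $router_id\n"
)

config = baseline.substitute(intent)
print(config)
```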

This is handy, but what about maintaining it?

CI/CD Pipelines

In simplest terms, CI/CD tools provide an automated way to "do a thing" to make it pretty easy to perform repetitive tasks. For this example, I'll be using Jenkins CI, but the steps we'll be performing are pretty simple.

Pipelines aren't the only thing a CI tool can do, and there are some pretty big differences between a traditional pipeline and managing a network - for example, there's no code to compile. Instead, it's best to map out the steps that we want a CI tool to perform. Jenkins has a project type - Freestyle - that lends itself well to applications like this, but it can also get fairly messy/disorganized.

A more comprehensive definition of a pipeline (from Red Hat) is here:

Installing Tools

In this case, I am leveraging a purpose-built CentOS host with Ansible, Jenkins, Jinja, and Python 3 installed. Since this prerequisite list is fairly short, it should lend itself rather well to containerization.

Network infrastructure tends to have inbound access restrictions that most container platforms cannot meet in an auditable, secure method. This capability can be provided with VMware NSX-T or with Project Calico, but these capabilities are pretty advanced. I'd consider containerization an option for those willing to take it on in this case, and am keeping this guide as agnostic as possible.

Perhaps later I'll build on this and provide a dockerfile. Starring the repository will probably be the best way to keep track!

Executing Continuous Integration / Continuous Delivery

Let's start with the specifications for what we want to do. This doesn't need to be excessively convoluted.

  • The CI Tool should simply execute code, minimally. If we resort to a ton of shell scripting here, it won't be managed by source control and cannot easily be updated.
  • The CI Tool is responsible for:
    • Execution of written code
    • Logging
    • Notification
    • Testing of written code
    • Scoring of results to assess code viability / production readiness


For this example, the pipeline should:

  • Fetch code from GitHub. Execute every five minutes, if a new code commit is available.
  • Lint (syntax validate) all code.
  • Compile Network Configurations, and apply to network infrastructure
  • Test
  • Notify of build success

I have added an example CI Project file to this repository. It does not contain testing or validating steps yet, as those are considerably more complex - writing a parsable logger will take quite a bit more time than I feel an individual post is worth.

The CI Project

We're not asking much of Jenkins CI in this case, so you can easily replicate this configuration by:

  • Setting a Git repository to clone from (Under Source Code Management)
  • Setting the Build Trigger to Poll SCM (H/5 * * * *)
  • Execute the playbooks (provided in the GitHub repository). Instead of executing each individual pipeline, I elected to make a main.yml playbook that contains all steps, so that the control aspects of this remain centralized in the Git repository.
  • Automated Evaluation: I provided a yamllint example, eventually this should be tallying the results of each automated test and scoring it.
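The scoring step in that last bullet could eventually look something like this sketch - tallying each automated check into a pass ratio that gates the release (the check names are hypothetical):

```python
def score(results: dict, threshold: float = 1.0) -> tuple:
    """Return (pass_ratio, release_worthy) for a batch of automated checks."""
    if not results:
        return 0.0, False
    ratio = sum(results.values()) / len(results)
    return ratio, ratio >= threshold

checks = {"yamllint": True, "jinja-render": True, "post-change-ping": False}
ratio, ship = score(checks)
print(round(ratio, 2), ship)  # one failed check blocks the release
```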

People want new stuff, and some of it might be new networking Features

Now that we have an easy way of keeping all of our networking gear (2-N nodes) managed and in baseline with the same level of effort, it's pretty straightforward to automatically roll out Features.

Features in this case don't need to be a large, earth-shaking new capability in more traditional software development parlance. Instead, let's consider a Feature something smaller:

  • A Feature should be something a consumer wants (DevOps term would be to delight users)
  • A Feature should be a notable change to an information system
  • A Feature should be maintainable, or improve maintainability. A system's infrastructure administrator/engineer/architect is a consumer as well, and that person's needs have value, too!

Some Examples of Network Features:

  • Wireless AP-to-AP Roaming: Users like having connectivity stay as they move about. This can vary from 802.11i in Personal Mode, to 802.11i in Enterprise mode with 802.11k/r/v implemented to be truly seamless.
    • If this were a CI Project:
      • Minimum Viable Product would be defined. If the security teams are okay with WPA2-PSK, then that would be it. If not, the roaming capability would be at ~6 seconds, with lots of room for improvement.
      • Roll out 802.11k reports for better AP association decisions
      • Roll out 802.11v for better notifications around Power Saving
      • Roll out 802.11r or OKC for secure hand-off
      • No Rest for the Wicked: Do it all again with WPA3!
  • VPN Capability
    • If this were a CI Project:
      • MVP: IPSec-based VPN with RADIUS authentication
      • TLS fallback for low-MTU networks or paths with broken PMTUD
      • Improved authentication mechanisms, like PKI or SAML
      • Client Posture Assessment

In the world of continuous delivery, these can be done out of order, or to a roadmap. When you're done with a capability, deliver it instead of waiting for the next major code drop.

I'm a network guy, what's a code drop?

Honestly, infrastructure teams never really followed more traditional software development approaches - Continuous Delivery is a better fit, because of our key problems:

  • Change
  • Loops caused by changes

There's no true hand-off from development to operations, just the people who run the network and those who don't. We are afflicted by an industry of either change fear or CAB purgatory, where once something is built, it can no longer be improved. This builds up a lot of technical debt that is rarely fixed by anything short of a forklift upgrade. Ideally, we can leverage CI tools in this way:

  • Clean Slate: Delete all workspace files
  • Write Feature Code
  • Build configurations
  • Apply configurations to test nodes
  • Validate (manually or automatically, or both) that the change did what it was supposed to, and that it worked
  • If it fails, go back to step #1
  • Stage Feature release, do paperwork, etc.
  • Release Feature to all applicable managed nodes
  • Work on the next Feature
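The "validate" step can itself be an Ansible play. Here's a hedged sketch, assuming a VyOS lab and the vyos.vyos collection - the group name test_nodes and the specific assertion are my own illustration, not the repository's actual test:

```yaml
# Hypothetical validation play: collect state, then fail the build on a bad result.
- hosts: test_nodes
  gather_facts: no
  tasks:
    - name: Collect BGP summary from each lab node
      vyos.vyos.vyos_command:
        commands:
          - show ip bgp summary
      register: bgp

    - name: Fail the pipeline if any peer is still blocked by policy
      assert:
        that:
          - "'(Policy)' not in bgp.stdout[0]"
        fail_msg: "A BGP peer is missing an import/export policy"
```

A non-zero exit from ansible-playbook is all Jenkins needs to mark the build failed and send you back to step #1.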

I have attached a Jenkins Project that performs most of these tasks here. There are some caveats to this method that I'll cover below.

This should result in much higher quality work being released, and in the networking world, reliability is king. This is the key to becoming free of CAB Purgatory in large organizations.

A Day in the life of a Feature

Since the majority of the muscle work with Jenkins has already been programmed, we simply need to focus on the source code (device configuration), and work from there:

  • Create a new git branch. This can be achieved with git checkout or via your SCM GUI.
  • Write code for the git branch. Ideally, you'd create a new project for this step against that specific branch, but there is no "production environment" to speak of in my home lab.
  • Commit code. Again, small steps are still the best approach. The biggest change here is to periodically check in on your pipeline to see if anything breaks. This gives you the "fix or backpedal" opportunity at all times, and makes it easy to spot any breakage.
  • Submit a git pull request: This is an opportunity for the team to review your results, so be sure to include some form of linkage to your CI testing/execution data to better make your case.
  • Merge code. This will automatically roll to production at the next available window - this is your release lever.

Example 1: Fix an issue where BGP NLRIs are not being imported due to no policy

Pull Request #1

For this, we ran into a particularly odd behavior change - VyOS was somewhat recently rebased from Quagga to FRR, which picked up the following behavior:

Require policy on EBGP
[no] bgp ebgp-requires-policy
This command requires incoming and outgoing filters to be applied for eBGP sessions. Without the incoming filter, no routes will be accepted. Without the outgoing filter, no routes will be announced.

This is enabled by default.

When the incoming or outgoing filter is missing you will see “(Policy)” sign under show bgp summary:

exit1# show bgp summary

IPv4 Unicast Summary:
BGP router identifier, local AS number 65001 vrf-id 0
BGP table version 4
RIB entries 7, using 1344 bytes of memory
Peers 2, using 43 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
                4      65002         8        10        0    0    0 00:03:09            5 (Policy)
fe80:1::2222    4      65002         9        11        0    0    0 00:03:09     (Policy) (Policy)

This was preventing BGP route propagation, and was a result of an upstream change. In Software Development, this is called a "breaking change" because it implements major functional changes that will have potentially negative effects unless action is taken.

To mitigate this, we can develop a solution iteratively, using our lab environment: test, re-test, and then test again until we get the desired result. 24 commits later, I'm satisfied with the result. Once a solution is sound (passes automated testing), it is best practice to submit it for peer review. Git calls this action a pull request. Here's the one for this change:

Example 2: Roll out IPv6 Dynamic Routing

Pull Request #2

Like with the previous pull request, this particular implementation isn't huge.

By code volume, this was about 200 lines, but the real difference here is in the multiplier. Of those lines:

  • 100 lines DRY (Don't Repeat Yourself) highly repetitive code (template)
  • 57 are documentation
  • 38 lines DRY highly repetitive code (variables)

The history of this pull request is publicly available. I made a few mistakes, and then caught them with automated testing, as everyone can see.

About two-thirds of the way through this I realized I was rolling out IPv6 with a pull request. Neat.


This generates quite a bit of code, repeatably and reliably.

Value in Volume

All in all, we're generating 1,020 lines of configuration from 833 lines of code. The ratio becomes more favorable, in terms of sheer work saved, the more homogeneous your environment and custom configurations are. If you're only evaluating saved time:

  • 2 Devices may feel dubious
  • 3 Devices will show real value in saved time
  • 4+ the benefits become insane

Value in Consistency

The real value here is consistent configurations. Using traditional methods, I'd normally have a ton of frustration trying to configure things consistently, un-doing and re-doing copy-paste errors, and re-testing. If you configure both sides with Jinja2, they'll match exactly and peer up, every time.

Value in Documentation

This is the part where I truly value this approach. If an engineer or architect designs variable definitions well, the end result succinctly defines the device. This can be attached in-line or as meta-data to a diagram, or easily verified against a diagram to ensure things are consistent. The few issues I had were quickly resolvable by comparing YAML to a diagram. I'm probably going to use this method to generate diagrams as well.


I trivialized the network driver aspect of this work. The one I chose, vyos.vyos.vyos_config, is not idempotent and was causing serious issues as a result (BGP neighbors dropping constantly as I re-applied the configuration). Off-the-shelf network drivers are perfectly well suited for prototyping, but substantial development is required to use them in production. This would take a full team to become reality, but a middle ground is readily achievable.

  • We call it Continuous Delivery for a reason.
  • Automate when it helps you.

We can use this guidance to come up with a plan, for example:

  • Milestone 1: Jinja2-fy your golden configurations, and stop manually generating them
  • Milestone 2: You know the configuration a device should have; gather_facts the current configuration and generate a report showing whether it's compliant.
  • Milestone 3: Topical automation replaces manual remediation
  • Milestone 4: Fully mature NETCONF springs forth and saves the day!

People who are at #4 aren't better than people who have finished #1. Use what's useful.
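Milestone 2, for example, can be reached with stock modules. A minimal sketch, assuming vyos.vyos.vyos_config in check mode - the per-host file naming scheme here is hypothetical:

```yaml
# Hedged compliance-report sketch: check mode diffs intent vs. reality, changes nothing.
- hosts: all
  gather_facts: no
  tasks:
    - name: Compare intended configuration against the running configuration
      vyos.vyos.vyos_config:
        src: "{{ inventory_hostname }}-compiled.conf"   # hypothetical naming scheme
      check_mode: yes
      diff: yes
      register: drift

    - name: Report non-compliant nodes
      debug:
        msg: "{{ inventory_hostname }} does not match its golden configuration"
      when: drift.changed
```

Because check mode never applies changes, this report is safe to run on a schedule long before you trust automated remediation.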

Tuesday, December 29, 2020

Why Automate? Ansible Playbooks and Desired State for Network Operating Systems

Don't Reinvent the Wheel: Ansible Playbooks

Writing your own code isn't always the answer

Often, communities such as Python will contribute code of substantially higher quality than what you/I can create individually.

This is OK. In nearly every case, dyed-in-the-wool traditionalist programmers will consume "libraries" in their language of choice - it's only an outsider's perception that developers create everything they use.

In modern engineering, a true engineer or architect will often apply practices they studied in college to real-world situations instead of trying to create their own solutions. This doesn't discount creativity, nor does it discount those who are more pragmatically oriented. Without creativity, we have no way to improve engineering practice, and without pragmatism, we have seen some pretty serious loss of life:

...but you still have a lot of work to do

Adapting engineering practices, code from the internet, or Googled Cisco example topologies as a matter of practice does take work. Do you trust all code from Stack Overflow?

You shouldn't, and modern engineering practice doesn't either. In nearly every case, the ability to apply engineering practice to a problem comes with years of training, millennia of past examples (failures and successes) as history for individual practice, ideally with similar applications. A good example of this is the study of brittle fractures where manipulating (maximizing) material hardness is no longer an automatic victory, but more of a serious safety risk.

We live in a simpler world of abstraction and pure mathematics, and behaviors are a lot more reliable - but not perfectly so. We as designers and implementers of computer solutions (Network, Systems, don't care) can learn from our more disciplined cousins. I'll write more on this later, but for now, let's at least agree to review every action critically.

Playbook Automation

Let's use the lens of an engineer evaluating a technical control. Ansible is going to be my example, as it's probably the most straightforward.

Supporting Files

While it is possible to run a standalone, self-supporting playbook, it's not generally recommended at scale. The first step towards leveraging this automation is defining an inventory. This is typically YAML, so most of the effort goes into structuring your data as opposed to actual work.

Some recommendations:

  • Don't let names collide between production, lab, etc. We don't want to have a Wargames scenario in anybody's production network.
  • Make sure it makes sense. It's pretty easy to over/under-organize; think about the smallest elemental unit you may work on.
  • Leverage Source Control! Save a copy, keep your revision history. Even better, get peer reviews.
  • Remember, this can be edited later! This should continually improve.

Example (loosely based on

I'm using the project (virtualized Clos Topologies) as a prefix, and then organizing device types from there. Spines don't need VLANs, and will be route reflectors - which is enough to justify separation in this case.


    clos_leafs:
      hosts:
        l0:
          ansible_host: ""
        l1:
          ansible_host: ""
      vars:
        ansible_network_os: vyos.vyos.vyos
        ansible_user: vyos
        ansible_connection: ansible.netcommon.network_cli
    clos_spines:
      hosts:
        s0:
          ansible_host: ""
        s1:
          ansible_host: ""
      vars:
        ansible_network_os: vyos.vyos.vyos
        ansible_user: vyos
        ansible_connection: ansible.netcommon.network_cli

Let's explain what I've done here - there are a few deviations from the typical setup:

  • YAML Inventory: This is just me, I prefer it over the INI format as a Linux guy. It also helps a lot with structured hierarchies, which I like as a network guy.
  • Variable declarations:
    • Per Ansible's documentation on networking, we do know that there are a few things unique to network automation - namely the lack of on-board python. This means that the Ansible control node (the one EXECUTING the playbook) needs to know that it's doing all of the planning/thinking. For this to work, we need to make a few unique (but re-usable) declarations
      • ansible_network_os: More or less does exactly what it says. There's a built-in ansible interpreter for VyOS - but this is really only true for a handful of network distros. You can get more from Ansible Galaxy, but extensive testing should be applied.
      • ansible_connection: This is basically the "driver" for the CLI. You can use Paramiko or SSH as well; this is primarily governed by your Network OS.
      • ansible_user just instructs the control node on what username to attempt against the target host.

Outside of this, I have also set up SSH key authentication to all VyOS nodes. It's pretty easy: (

set system login user vyos authentication public-keys key1 key blahblahblah
set system login user vyos authentication public-keys key1 type ssh-rsa

The Playbooks


Before designing a playbook, we do need to cover some of Ansible's key design values:

  • Idempotency: Run it once or a hundred times, end with the same state every time. If an invasive change has already been made, don't repeat it unless the state doesn't match.
  • Thin Veil of Abstraction: You should be aware of what is being implemented from a technical perspective, but not have to control every last aspect of it.
  • Be Declarative: Try to design from the abstract concept you want to implement, and fill in the technical details as needed, not the other way around.

Day 0, get the system online

In this example, we want four devices to have some level of usable configuration, and we don't want lots of manual, error-prone editing to get there. We're going to adapt my base configuration for this purpose by re-tooling it to support Jinja deployments. At a high level, Jinja playbooks:

  • Load Variables: This will be a separate file, effectively designing the what of your deployment
  • Load Template, then translate variables: This will be executed by the template module

We'll keep this example pretty short - it's available in the linked repository - but we also want to leverage idempotency for future changes. It doesn't leverage inventory, because it's creating base configurations to be applied by some other method.

Fun fact - this is the first stage to any Infrastructure-as-Code implementation. The created end results (*-compiled.conf) can be directly applied, or by using a "Day 2 Method".


  hostname: ''
  domain: ''
  timezone: 'US/Alaska'

Execution (Playbook):

- hosts: localhost
  tasks:
    - name: Import Vars...
      include_vars:
        file: vyos-base.yml
    - name: Combine vyos...
      template:
        src: templates/vyos-base.j2
        dest: vyos-compiled.conf
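For completeness, the template side might look like this - a minimal, hypothetical fragment of templates/vyos-base.j2 (the real template lives in the linked repository):

```jinja
{# Hypothetical templates/vyos-base.j2 fragment - renders the vars file into VyOS set commands #}
set system host-name {{ hostname }}
set system domain-name {{ domain }}
set system time-zone {{ timezone }}
```

The template module substitutes each {{ variable }} from the imported vars file, producing a *-compiled.conf ready to load.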

Day 2, apply routine changes

In this example, we've already started the deployment and have it up and running. We have some form of routine change to make, but we want it applied consistently and idempotently. In an ideal world with this method, the configuration change playbook shouldn't contain anything about the specific change.

- hosts:
  tasks:
    - name: Apply on L0!
      vyos.vyos.vyos_config:
        src: 'vyos-l0-compiled.conf'
        save: yes
- hosts:
  tasks:
    - name: Apply on L1!
      vyos.vyos.vyos_config:
        src: 'vyos-l1-compiled.conf'
        save: yes
- hosts:
  tasks:
    - name: Apply on S0!
      vyos.vyos.vyos_config:
        src: 'vyos-s0-compiled.conf'
        save: yes
- hosts:
  tasks:
    - name: Apply on S1!
      vyos.vyos.vyos_config:
        src: 'vyos-s1-compiled.conf'
        save: yes

This will re-apply any changes that are staged via the base configuration and Jinja merge repeatedly if re-executed.

Note: This particular network driver is not idempotent. In production networks, something like NAPALM or Nornir may be more appropriate. You can verify whether a method is idempotent by repeatedly running the playbook - the expected result is changed=0.

18:55:40 PLAY [] *****************************************************
18:55:40 TASK [Gathering Facts] *********************************************************
18:55:41 [WARNING]: Ignoring timeout(20) for vyos.vyos.vyos_facts
18:55:44 ok: []
18:55:44 TASK [Apply on L0!] ************************************************************
18:55:49 changed: []
18:55:49 PLAY [] *****************************************************
18:55:49 TASK [Gathering Facts] *********************************************************
18:55:49 [WARNING]: Ignoring timeout(20) for vyos.vyos.vyos_facts
18:55:53 ok: []
18:55:53 TASK [Apply on L1!] ************************************************************
18:55:57 changed: []
18:55:57 PLAY [] *****************************************************
18:55:57 TASK [Gathering Facts] *********************************************************
18:55:58 [WARNING]: Ignoring timeout(20) for vyos.vyos.vyos_facts
18:56:02 ok: []
18:56:02 TASK [Apply on S0!] ************************************************************
18:56:06 changed: []
18:56:06 PLAY [] *****************************************************
18:56:06 TASK [Gathering Facts] *********************************************************
18:56:06 [WARNING]: Ignoring timeout(20) for vyos.vyos.vyos_facts
18:56:10 ok: []
18:56:10 TASK [Apply on S1!] ************************************************************
18:56:14 changed: []
18:56:14 PLAY RECAP *********************************************************************
18:56:14 localhost                  : ok=12   changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
18:56:14        : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
18:56:14        : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
18:56:14        : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
18:56:14        : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

The next step is important - automatically updating a network based on configuration changes! As always, my source code for executing this is here. Note that this is a moving project and will get updates with future posts.

Monday, December 21, 2020

Advantages for BGP in Virtualized Topologies

BGP Advantages in Virtualized Networking

BGP is more like an Application

BGP, by design, is a lot more capable than most typical routing protocols, and MP-BGP/BGPv4 is fundamentally unique in several ways - I'll keep this short to avoid beating a dead horse too much.

The following advantages are specific to NSX-T/V or virtualized routers:

  • Link-state adjacencies don't change when a virtual system goes down. If the hypervisor hosting a VM stays up while the VM is down, the physical link state doesn't change, so you're going to wait out the entire dead interval as an outage.
  • I'll repeat that, because it matters: with virtualized network functions, link-state adjacencies fail over at their maximum dead interval, nullifying the primary advantage of these routing protocols!
  • Interlinks from a physical network must be specifically engineered to prevent non-determinism. If you're multi-homing a virtual router via the same Layer 2 domain, LSAs will not only flow between the physical network endpoints and your desired Virtual Network Function (VNF), but also between physical network devices.
    • This can be designed around, but you lose the ability to scale multi-pathed machines easily and automatically.
    • You can get the dynamic adjacency capability with the BGP Neighbor Range feature on nearly all datacenter network equipment today.

TL;DR, So What if you don't run BGP currently?

This is where most people get hung up - if a network doesn't currently use BGP, it'll potentially introduce problems by adding a new thing for engineers to maintain, major forklifts to pick up hardware support, and so on.

These are all very valid concerns. I'd recommend that instead of shutting down the argument, try some of these solutions on for size instead:

We Inflate eBGP's complexity because we've been conditioned to

Most of the complicated stuff is iBGP loop prevention or pro-grade tuning. Cisco's education mechanisms have somewhat failed the community here, as with IS-IS (you can only test on SO MUCH in the CCNA/CCNP!). These advanced capabilities are rarely necessary for typical enterprise deployments. A typical enterprise BGP deployment's responsibilities will consist of:

This can be either really difficult and complex or really simple, depending on needs. If it's not an enterprise-wide deployment of BGP (usually it won't be), just plan it out on paper before implementing. There will be learning experiences - accept that they'll happen, and maximize the end results. You can't get this education without getting your hands dirty, so make sure it won't hurt the business, and use a lab if you can.

If you can't, contain the deployment: Set up a prefix for whatever workload is being used, and redistribute that instead of BGP until you've hit maturity. In many cases, it'll just stay there, and that's OK.

BGP to Security Appliances

This is probably where I'd start - it's got the highest value to effort ratio. Given your vendor choices, it's probably not that complicated and doesn't necessarily need to be redistributed to campus or other internet edge modules. For most enterprise deployments, this is totally cool. If you're me, you'll start getting annoyed by NLRIs not propagating across sites, which brings you to...

Run BGP on top of an IGP

This is actually how most Service Provider networks work! BGP isn't designed to provide reachability on its own - it doesn't modify next-hop addresses for advertised prefixes, and it needs another routing protocol to make those next-hops reachable. There are some applications where you can go all-BGP, but they're usually reserved for hyper-scalers or shops that are already very familiar with BGP. Physical network routes can continue to propagate in this scenario just like they always did, and you're using BGP for the virtual ones. The only redistribution required would be a zeroes/default route from your point of origin to keep things nice and intuitive.


This is pretty complicated unless you contain the use cases. In the two scenarios above, you're mostly off the hook on this one - at most, you'll be installing a default route.

Workloads that benefit from BGP competency in the enterprise

  • VMware NSX-T
  • VMware Cloud on AWS
  • Avi Networks Load Balancer
  • Amazon AWS
  • Project Calico (Kubernetes!)
  • Vyatta / VyOS
  • F5 LTM
  • Microsoft Azure
  • All SD-WAN
  • All firewalls

Needless to say, if you're a shop that consumes more than vCenter and ESXi, you probably should be dipping your toes in the water. How far is up to you, but it cannot be avoided.

Some things to remember

  • If it's providing value, you're doing well.
  • If you don't know something, that's OK. We're in an ever-changing industry.

Sunday, December 20, 2020

vCenter Upgrade Error: `Exception Occurred in precheck phase`

Error presented by VAMI Interface


VCSA 7.0 has moved the upgrade process logging to a new location - the log itself is now at /storage/log/vmware/applmgmt/update_microservice.log (actual) or /var/log/vmware/applmgmt/update_microservice.log (symlink)


This appears to be a rough order of operations with this new update process:

  • Pre-Checks: First, the upgrade tries to identify the system being upgraded:
update_microservice::          precheckEventHandler: 148 -     INFO - Precheck event happens
update_b2b::                      precheck: 709 -    DEBUG - Running update prechecks
update_b2b::               b2bRequirements: 479 -    DEBUG - Running B2B Requirements hook and processing the results
update_b2b::                _runScriptHook: 330 -    DEBUG - Running B2B script with hook CollectRequirementsHook
update_b2b::                _runScriptHook: 339 -    DEBUG - update script output to file /var/log/vmware/applmgmt/upgrade_hook_CollectRequirementsHook
extensions::                _findExtension:  83 -    DEBUG - Found script hook <module 'update_script' from '/storage/core/software-update/updates/'>:CollectRequirementsHook'
update_utils::                     isGateway:  83 -    DEBUG - Not running on a VMC Gateway appliance.
update_utils::                  isB2BUpgrade:  72 -    DEBUG - Bundle will execute upgrade: False
update_script::           collectRequirements: 492 -    DEBUG - Checking verisons
update_script::           collectRequirements: 496 -    DEBUG - Source VCSA version =
update_script::           collectRequirements: 500 -     INFO - Target VCSA version =
update_utils::               getRPMBlacklist: 185 -    DEBUG - vCSA deployment Type: embedded
update_b2b::               b2bRequirements: 493 -    DEBUG - Getting packages excluding the ones in blacklist

From there, it picks up the scope for the upgrade, and verifies against common upgrade issues:

update_b2b::               b2bRequirements: 528 -    DEBUG - Calculated packages list 
update_b2b::                     checkDisk: 423 -    DEBUG - Checking for disk utilization
update_b2b::                     checkDisk: 467 -    DEBUG - CheckDisk completed, returning with selected disk partition /storage/updatemgr
update_b2b::                      precheck: 740 -    DEBUG - Estimating time to install..
update_b2b::                 estimate_time: 679 -    DEBUG - Estimating time required for rpm-update, services start-stop and reboot time if its required
update_b2b::                 estimate_time: 682 -    DEBUG - Calculating RPM installation time
update_b2b::              rpm_install_time: 587 -    DEBUG - Reading all rpms present in rpm-manifest.json
update_b2b::              rpm_install_time: 588 -    DEBUG - Estimating installation time for installed rpms and new rpms
update_b2b::       get_installed_rpms_list: 564 -    DEBUG - Getting the list of installed RPMs along with the time of install
update_b2b::       get_installed_rpms_list: 578 -    DEBUG - Completed getting the list of rpms, returning with the list: <class 'list'>
update_b2b::              rpm_install_time: 610 -    DEBUG - Installation time estimated successfully, returning with time for installation 23
update_b2b::                 estimate_time: 684 -    DEBUG - Calculating time to start and stop services
update_b2b::        estimate_time_services: 620 -    DEBUG - Estimating time for services-start and services-stop
update_b2b::        estimate_time_services: 640 -    DEBUG - Completed estimating time for starting and stopping services, returning with the required time: 2
task_manager::                        update:  80 -    DEBUG - UpdateTask: status=SUCCEEDED, progress=100, message={'id': 'com.vmware.appliance.update.prechecks_task_ok', 'default_message': 'Prechecks completed', 'args': []}

In this case, everything looks good. I'm not really sure why it needs the SSO Administrator password, and there isn't much on-line about this. We're seeing three errors after we hit go time:

update_b2b::                   resumeStage:3431 -    DEBUG - 'download' phase is 100% completed. checkAllRpmsArePresent
rpmfunctions::        checkAllRpmsArePresent: 308 -    ERROR - Empty Stage location passed. This cannot be empty.
update_b2b::                   resumeStage:3497 -    ERROR - Exception in resume stage. Exception : {Package discrepency error, Cannot resume!}
task_manager::                        update:  80 -    DEBUG - UpdateTask: status=FAILED, progress=0, message={'id': 'com.vmware.appliance.plain_message', 'default_message': '%s', 'args': ['Package discrepency error, Cannot resume!']}
dbfunctions::                       execute:  81 -    DEBUG - Executing {SELECT CASE WHEN count(*) == 0 THEN 0 ELSE 1 END as status FROM progress WHERE _stagekey = 'patch-state' AND _message = 'Stage successful'}
functions::              get_resume_state: 340 -    DEBUG - Resume needed in Stage phase
update_b2b::           install_with_resume:2477 -    DEBUG - Installing version
update_functions::                  readJsonFile: 224 -    ERROR - Can't read JSON file /storage/core/software-update/stage/stageDir.json [Errno 2] No such file or directory: '/storage/core/software-update/stage/stageDir.json'
task_manager::                        update:  80 -    DEBUG - UpdateTask: status=FAILED, progress=0, message={'id': 'com.vmware.appliance.not_staged', 'default_message': 'The update is not staged', 'args': []}
update_b2b::              installPrechecks:2146 -    DEBUG - Exception occurred while checking for discrepancies Update not staged
task_manager::                        update:  80 -    DEBUG - UpdateTask: status=RESUMABLE, progress=0, message={'id': 'com.vmware.appliance.plain_message', 'default_message': '%s', 'args': ['Exception occurred in install precheck phase']}

This is pretty odd, because it's indicating a "resumable error" despite the fact that it cannot resume until a file lock is removed. Here are the errors I see:

  • Empty Stage Location: Unsure what this means, given the context. Odds are the upgrade script cannot find out where to stage RPMs (Red Hat Package Manager).
  • Package discrepancy error: It could be relating to the above, or it could be a failed checksum. No other logging is generated by the agent to indicate what's wrong.
  • Can't read JSON file /storage/core/software-update/stage/stageDir.json: This one's more actionable! It looks like there's no directory by this name.

Easter Egg: statsmoitor probably should be statsmonitor


Allow the update to resume

VAMI saves the installation state as a file in /etc/applmgmt/appliance/software_update_state.conf:

    "state": "INSTALL_FAILED",
    "version": "",
    "latest_query_time": "2020-12-21T00:19:32Z",
    "operation_id": "/storage/core/software-update/install_operation"

VAMI will be stuck in a loop until you remove this file as root:

rm -rf /etc/applmgmt/appliance/software_update_state.conf

This will not necessarily resolve the issue that caused the failure, however; more work still needs to be done.

Install via ISO

We're going to try a fallback method, attaching the upgrade ISO. The following snippet is from the vSphere UI, modifying vCenter's VM Hardware:

From there, simply click "Check CD-ROM" and it will immediately appear.

This time, we know what directories to search, so I'm going to watch the logs:

tail -f  /var/log/vmware/applmgmt/update_microservice.log | grep -i err

Attempt via Command-line with ISO

VMware documents the following method to update via the command line

Stage Packages

We're going to try to clear the (empty) workspace and start fresh, auto-accepting EULAs:

Command> software-packages unstage
Command> software-packages stage --iso --acceptEulas
 [2020-12-20T17:49:54.355] : ISO mounted successfully
 [2020-12-20T17:49:54.355] : UpdateInfo: Using product version and build 17004997
 [2020-12-20T17:49:55.355] : Target VCSA version =
 [2020-12-20 17:49:55,169] : Running requirements script.....
 [2020-12-20T17:50:12.355] : Evaluating packages to stage...
 [2020-12-20T17:50:12.355] : Verifying staging area
 [2020-12-20T17:50:12.355] : ISO unmounted successfully
 [2020-12-20T17:50:12.355] : Staging process completed successfully
 [2020-12-20T17:50:12.355] : Answers for following questions have to be provided to install phase:
                ID: vmdir.password
                Text: Single Sign-On administrator password
                Description: For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
                Allowed values:
                Default value:

 [2020-12-20T17:50:12.355] : Execute software-packages validate to validate your input

Let's take a look at the update:

Command> software-packages list --staged
[2020-12-20T17:52:00.355] :
    category: Bugfix
    leaf_services: ['vmware-pod', 'vsphere-ui', 'wcp']
    vendor: VMware, Inc.
    name: VC-7.0U1c
    tags: []
    version_supported: []
    size in MB: 5107
    releasedate: December 17, 2020
    updateversion: True
    allowedSourceVersions: [,]
    buildnumber: 17327517
    rebootrequired: False
    productname: VMware vCenter Server
    type: Update
    summary: {'id': 'patch.summary', 'translatable': 'In-place upgrade for vCenter appliances.', 'localized': 'In-place upgrade for vCenter appliances.'}
    severity: Critical
    TPP_ISO: False
    thirdPartyInstallation: False
    timeToInstall: 0
    requiredDiskSpace: {'/storage/core': 6.286324043273925, '/storage/seat': 228.3861328125}
    eulaAcceptTime: 2020-12-20 17:50:12 AKST
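Those requiredDiskSpace figures are worth a quick sanity check before installing. From the appliance's bash shell (run "shell" from the Command> prompt), something like this works - the paths are taken from the staged output above:

df -h /storage/core /storage/seat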

Let's run it!

Command> software-packages install --staged
 [2020-12-20T17:53:52.355] : For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
Enter Single Sign-On administrator password:

 [2020-12-20T17:54:02.355] : Validating software update payload
 [2020-12-20T17:54:02.355] : UpdateInfo: Using product version and build 17004997
 [2020-12-20 17:54:02,095] : Running validate script.....
 [2020-12-20T17:54:09.355] : Validation successful
 [2020-12-20 17:54:09,125] : Copying software packages
 [2020-12-20T17:54:09.355] : ISO mounted successfully
 [2020-12-20T17:57:31.355] : ISO unmounted successfully
 [2020-12-20 17:57:31,238] : Running system-prepare script.....
 [2020-12-20 17:57:40,289] : Running test transaction ....
 [2020-12-20 17:57:54,344] : Running prepatch script.....
 [2020-12-20 18:01:22,731] : Upgrading software packages ....
 [2020-12-20T18:07:39.355] : Setting appliance version to build 17327517
 [2020-12-20 18:07:39,538] : Running patch script.
 [2020-12-20 18:28:42,743] : Starting all services ....
 [2020-12-20T18:28:46.355] : Services started.
 [2020-12-20T18:28:46.355] : Installation process completed successfully
 [2020-12-20T18:28:46.355] : The following warnings have been found:
['\tWarning: \n\t\tsummary: Failed to start all services, will retry operation.\n']
Command> shutdown reboot -r "patch reboot"

Looks like the manual install worked for me - 7.0 U1c. For reference, here's the complete command sequence from this process:

rm -rf /etc/applmgmt/application/software_update_state
grep -i error /var/log/vmware/applmgmt/update_microservice.log
software-packages unstage
software-packages stage --iso --acceptEulas
software-packages list --staged
software-packages install --staged
shutdown reboot -r "patch reboot"

PAN-OS IPv6 Error: bgp peer local address 0:0:0:0:0:0:0:0 does not belong to interface

  When encountering this error, ensure that "Enable IPv6" is checked on the interface (Network > Interfaces > your interface > IPv6 tab). Hope this helps! Happy IPv6ing!
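If you'd rather fix this from the PAN-OS CLI, the equivalent knob lives under the interface's layer3 config. A sketch from configure mode - the interface name is an example, and it's worth verifying the exact path with tab completion on your PAN-OS version:

set network interface ethernet ethernet1/1 layer3 ipv6 enabled yes
commit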