Sunday, September 19, 2021

Get an A on ssllabs.com with VMware Avi / NSX ALB (and keep it that way with SemVer!)

Cryptographic security is an important aspect of hosting any business-critical service.

When hosting a public service secured by TLS, it is important to strike a balance between compatibility (the Availability aspect of CIA) and strong cryptography (the Integrity/Authentication and Confidentiality aspects of CIA). To illustrate, let's look at the CIA model.

In this case, we need to balance backward compatibility with using good quality cryptography -  here's a brief and probably soon-to-be-dated overview of what we ought to use and why.

Protocols

This block is fairly easy, as older protocols are worse, right? 

TLS 1.3

As a protocol, TLS 1.3 brings quite a few great improvements and is fundamentally simpler to manage, with fewer knobs and dials. There is one major concern with TLS 1.3 currently - security tooling in the large enterprise hasn't caught up with this protocol yet, as newer ciphers like ChaCha20 don't have hardware-assisted decryption paths. Here are some of the new capabilities you'll like:
  • Simplified Crypto sets: TLS 1.3 deprecates a ton of less-than-secure crypto. TLS 1.2 supports more than 300 registered cipher suites (37 of which were added by TLS 1.2 itself) - a mess. TLS 1.3 supports five. For a quick way to see what your own OpenSSL build offers, see the short Python sketch after this list.
    • Note: The designers of TLS 1.3 achieved this by removing the key-exchange (forward secrecy) methods from the cipher suite; they are negotiated separately.
  • Simplified handshake: TLS 1.3 connections require fewer round-trips, and session resumption features allow a 0-RTT handshake.
  • AEAD Support: AEAD ciphers provide both integrity and confidentiality. AES Galois/Counter Mode (GCM) and ChaCha20-Poly1305 serve this purpose.
  • Forward Secrecy: If a cipher suite doesn't have PFS (I disagree with "perfect") support, an attacker who captures your network traffic can decrypt it later if the private keys are ever acquired. PFS support is mandatory in TLS 1.3.
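Here's that sketch: a minimal example (assuming Python 3.6+ built against OpenSSL 1.1.1 or newer) that lists the suites your local OpenSSL build will offer for each protocol:

#!/usr/bin/python3
# Minimal sketch: list the cipher suites the local OpenSSL build offers.
# Assumes Python 3.6+ and OpenSSL 1.1.1+ (for TLS 1.3 support).
import ssl

ctx = ssl.create_default_context()
for suite in ctx.get_ciphers():
    # 'protocol' reports e.g. 'TLSv1.3' or 'TLSv1.2'
    print(f"{suite['protocol']:<10} {suite['name']}")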

Here are some of the things you can do to mitigate the risk if you're in a large enterprise that performs decryption:
  • Use a load balancer - since this post is about a load balancer, you can protect your customers' traffic in transit by performing SSL/TLS bridging. Set the LB-to-server (serverssl) profile to a high-efficiency cipher suite (TLS 1.2 + AES-CBC) that your inspection tooling can handle, while the client-facing side keeps stronger settings.

TLS 1.2

TLS 1.2 is like the Toyota Corolla of TLS: it runs forever, but not everyone maintains it properly.

It can still perform well if properly configured and maintained - we'll go into more detail on how in the next section. The practices outlined here are good for all versions of TLS.

Generally, TLS 1.0 and 1.1 should not be used. Some older platforms (Windows XP and Android 4 and below) were disturbingly slow to adopt TLS 1.2, so if they're part of your customer base, beware.

Ciphers

This information is much more likely to be dated. I'll try to keep this short:

Confidentiality

  • (AEAD) AES-GCM: This is usually my all-around cipher. It's decently fast and supports partial acceleration with hardware ADCs / CPUs. AES is generally pretty fast, so it's a good balance of performance and confidentiality. I don't personally think it's worth running anything but 256-bit on modern hardware.
  • (AEAD) ChaCha20: Designed by Daniel J. Bernstein and championed by Google, this newer cipher is still "being proven". Generally trusted by the public, it is fast despite a lack of hardware acceleration.
  • AES-CBC: This was the standard cipher for confidentiality before AES-GCM. Standardized in 2001, AES motivated users to move off suites like DES and RC4 by being both faster and stronger. As with AES-GCM, I prefer not to use anything but 256-bit on modern hardware.
  • Everything else: This is the "don't bother" bucket: RC4, DES, 3DES

Integrity

Generally, AEAD provides an advantage here - SHA3 isn't generally available yet but SHA2 variants should be the only thing used. The more bits the better!

Forward Secrecy

  • ECDHE (Elliptic Curve Diffie-Hellman Ephemeral): This should be mandatory with TLS 1.2 unless you have customers with old Android phones or Windows XP.
  • TLS 1.3 lets you select multiple PFS algorithms that are EC-based.

Matters of Practice

Before we move into the Avi-specific configuration, I have a recommendation that is true for all platforms:
Cryptography practices change over time - and some of these changes break compatibility. Semantic versioning provides the capability to support three scales of change:
  • Major Changes: The first number in a version. Since the SemVer specification is focused on APIs, I'll be more specific here: this is what you'd increment if you are removing cipher suites or negotiation parameters that might break existing clients.
  • Minor Changes: This category would be for tuning and adding support for something new that won't break compatibility. Examples here would be cipher order preference changes or adding new ciphers.
  • Patch Changes: This won't be used much in this case - here's where we'd document a fix that doesn't change the profile's intent, like correcting a mistake in cipher order preference.
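To make that concrete, here's a hypothetical sketch of how a change maps to the next version number - the profile name and version numbers are made up for the example:

#!/usr/bin/python3
# Hypothetical illustration of mapping cipher-profile changes to SemVer bumps.

def bump(version: str, level: str) -> str:
    """Return the next version string for a given change level."""
    major, minor, patch = (int(x) for x in version.split("."))
    if level == "major":   # removed ciphers/protocols - may break existing clients
        return f"{major + 1}.0.0"
    if level == "minor":   # added ciphers or changed order preference - backward compatible
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # patch - fixed a mistake, same intent

# e.g. a profile "tls-external-v1.2.0" that drops TLS 1.0 becomes "tls-external-v2.0.0"
print(bump("1.2.0", "major"))  # 2.0.0
print(bump("1.2.0", "minor"))  # 1.3.0
print(bump("1.2.0", "patch"))  # 1.2.1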

Let's do it!

Let's move into an example leveraging NSX ALB (Avi Vantage). Here, I'll be creating a "first version," but the practices are the same. First, navigate to Templates -> Security -> SSL/TLS Profile:


Note: I really like this about Avi Vantage, even if I'm not using it here. The security scores here are accurate, albeit capped out - VMware is probably doing this to encourage use of AEAD ciphers:
...but I'm somewhat old-school. I like using Apache-style cipher strings because they apply to anything, and everything runs TLS eventually. Here are the cipher strings I'm using - the first is for TLS 1.2, the second for TLS 1.3.
ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384
TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
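Once the profile is applied, you can confirm what a virtual service actually negotiates from any client with Python's standard ssl module. A minimal sketch - the hostname is a placeholder for your own vIP or FQDN:

#!/usr/bin/python3
# Minimal sketch: confirm the negotiated protocol and cipher after the profile
# is applied. HOST is a placeholder - substitute your own virtual service.
import socket
import ssl

HOST = "www.example.com"
PORT = 443

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        # Expect 'TLSv1.3' and one of the AEAD suites above if the profile took effect
        print(tls.version(), tls.cipher())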


One gripe I have here is that Avi doesn't offer the "What If" analysis that F5's TM-OS does (14+ only). Conversely, applying this profile is much easier. To do this, open the virtual service and navigate to the bottom right:

That's it! Later on, we'll provide examples of coverage reporting for these profiles. In a production-like deployment, these profiles should be managed with a release strategy now that versioning is applied.

Friday, September 17, 2021

Static IPv4/IPv6 Addresses - Debian 11

Here's how to set both static IPv4 and IPv6 addressing on Debian 11. The static IPv6 stanza (marked with a comment below) is the new portion.

First, edit /etc/network/interfaces

auto lo
auto ens192
iface lo inet loopback

# The primary network interface
allow-hotplug ens192
iface ens192 inet static
address {{ ipv4.address }}
gateway {{ ipv4.gateway }}
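# New portion: the static IPv6 stanza below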
iface ens192 inet6 static
address {{ ipv6.address }}
gateway {{ ipv6.gateway }}
 

Then, restart your networking stack:
systemctl restart networking
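If you'd like to confirm both address families came up without eyeballing the full ip addr output, here's a small sketch that parses iproute2's JSON output (ip -j ships with Debian 11; the interface name ens192 matches the example above):

#!/usr/bin/python3
# Small sketch: print the IPv4 and IPv6 addresses bound to ens192.
# Assumes iproute2 with JSON output support (standard on Debian 11).
import json
import subprocess

output = subprocess.run(
    ["ip", "-j", "addr", "show", "dev", "ens192"],
    capture_output=True, text=True, check=True,
).stdout

for iface in json.loads(output):
    for addr in iface.get("addr_info", []):
        # family is "inet" for IPv4 and "inet6" for IPv6
        print(addr["family"], f"{addr['local']}/{addr['prefixlen']}")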

Friday, September 10, 2021

VMware NSX ALB (Avi Networks) and NSX-T Integration, Installation

Note: I created a common baseline for pre-requisites in this previous post. We'll be following VMware's Avi + NSX-T Design guide.

This will be a complete re-install. Avi Vantage appears to develop some tight coupling issues with using the same vCenter for both Layer 2 and NSX-T deployments - which is not an issue that most people will typically have. Let's start with the OVA deployment:


Initial setup here will be very different compared to a typical vCenter standalone or read-only deployment - only the bare minimum of the setup wizard should be followed:

With a more "standard" deployment methodology, the Avi Service Engines will run on their own Tier-1 router, leveraging Source-NAT (a misnomer, since it's a TCP proxy) for "one-arm load balancing":

To perform this, we'll need to add two segments to the ALB Tier-1: one for management, and one for vIPs. I have created the following NSX-T segments, with 10.7.80.0/24 running DHCP and 10.7.81.0/24 for vIPs:
Note: I used underscores in this segment name; in my own testing, both "." and "/" are illegal characters. Avi's NSX-T Cloud Connector will report "No Transport Nodes Found" if it cannot match the segment name due to these characters.
Note: If you configure an NSX-T cloud and discover this issue, you will need to delete and re-add the cloud after fixing the names!
Note: IPv6 is being used, but I will not share my globally routable prefixes.

First off, let's create NSX-T Manager and vCenter Credentials:
There is one thing that needs to be created on vCenter as well - a content library. Just create a blank one and label it accordingly, then proceed with the following steps:
Click Save, and get ready to wait. The Avi controller has automated quite a few steps, and it will take a while to run. If you want, the way to track any issue in NSX ALB is to navigate to Operations -> Events -> Show Internal:
Once the NSX Cloud is reporting as "Complete" under Infrastructure -> Dashboard, we need to specify some additional data to ensure that the service engines will deploy. To do this, we navigate to Infrastructure -> Cloud Resources -> Service Engine Groups, and select the Cloud:
Then let's build a Service Engine Group. This will be the compute resource attached to our vIPs. Here I configured a naming convention and a compute target - and it can automatically drop SEs into a specific folder.
The next step here is to configure the built-in IPAM. Let's add an IP range under Infrastructure -> Cloud Resources -> Networks by editing the appropriate network ID. Note that you will need to select the NSX-T cloud to see the correct network:
Those of you who have been LTM admins will appreciate this: Avi SEs also perform "Auto Last Hop," so you can reach a vIP without a default route, but monitors (health checks) will fail. The spot to configure custom routes is under Infrastructure -> Cloud Resources -> Routing:


Finally, let's verify that the NSX-T Cloud is fully configured. An interesting thing I saw here is that Avi 21 shows an unconfigured or "In Progress" cloud as green now, so we'll have to mouse over the cloud status to check in on it. 
Now that everything is configured (at least in terms of infrastructure), remember that Avi will not deploy Service Engines until there's something for them to do! So let's give them something:
Let's define a pool (back-end server resources):

Let's set up an HTTP-to-HTTPS redirect as well:

Finally, let's make sure that the correct SE group is selected:
And that's it! You're up and running with Avi Vantage 21! After a few minutes, you should see deployed service engines:
The service I configured is also now up - in this case, I'm using Hyperglass, and I can use the load-balanced vIP to check what the route advertisement from Avi looks like. As you can see, it's advertising a multipath BGP host route:





Friday, September 3, 2021

vCenter - File system `/storage/log` is low on storage space

After a recent VCSA reboot, I was seeing the infamous `no healthy upstream` error from vCenter.

The first place to check for issues like this is VMware's Virtual Appliance Management Interface (VAMI), located by default via HTTPS on port 5480. An administrator can use the appliance root password for this particular interface.

When reviewing this issue with the VAMI, I saw the following error:


Now, VCSA by design automatically rotates most logs available on the appliance using the open-source tool logrotate, but nothing in this directory appears to be managed:

root@vcenter [ / ]# grep \/storage\/log /etc/logrotate.d/*

I'd say this particular log partition is going to need some manual cleanup every now and then. To open up the CLI, SSH into vCenter and execute the following command:
Command> shell
Shell access is granted to root

First, let's get an idea of how full the disks are:
Note: The -m switch displays sizes in megabytes
root@vcenter [ ~ ]# df -m
Filesystem 1M-blocks Used Available Use% Mounted on
devtmpfs 5982 0 5982 0% /dev
tmpfs 5993 1 5992 1% /dev/shm
tmpfs 5993 2 5992 1% /run
tmpfs 5993 0 5993 0% /sys/fs/cgroup
/dev/sda3 46988 7199 37374 17% /
tmpfs 5993 5 5988 1% /tmp
/dev/mapper/dblog_vg-dblog 15047 185 14080 2% /storage/dblog
/dev/mapper/vtsdb_vg-vtsdb 10008 68 9412 1% /storage/vtsdb
/dev/mapper/vtsdblog_vg-vtsdblog 4968 36 4661 1% /storage/vtsdblog
/dev/sda2 120 30 82 27% /boot
/dev/mapper/log_vg-log 10008 9475 6 100% /storage/log
/dev/mapper/core_vg-core 25063 45 23723 1% /storage/core
/dev/mapper/db_vg-db 10008 507 8974 6% /storage/db
/dev/mapper/updatemgr_vg-updatemgr 100273 1953 93185 3% /storage/updatemgr
/dev/mapper/netdump_vg-netdump 985 3 915 1% /storage/netdump
/dev/mapper/lifecycle_vg-lifecycle 100273 3364 91775 4% /storage/lifecycle
/dev/mapper/autodeploy_vg-autodeploy 10008 37 9444 1% /storage/autodeploy
/dev/mapper/imagebuilder_vg-imagebuilder 10008 37 9444 1% /storage/imagebuilder
/dev/mapper/seat_vg-seat 10008 1185 8295 13% /storage/seat
/dev/mapper/archive_vg-archive 50133 16373 31185 35% /storage/archive

The log partition is definitely full. To take an inventory of disk usage, we'll use the du utility, with the s (summarize) and m (megabytes) switches enabled, and then pass the output to sort with the n (numerical) and r (reverse) switches enabled to focus on the most important first.
root@vcenter [ / ]# du -sm /storage/log/vmware/* | sort -n -r
2578 /storage/log/vmware/eam
2286 /storage/log/vmware/lookupsvc
785 /storage/log/vmware/sso
781 /storage/log/vmware/vsphere-ui
530 /storage/log/vmware/vmware-updatemgr

Examining these folders further, quite a few of these logs are old and never rotated. VMware provides guidance on what is and isn't safe to delete. Generally, Linux has issues with files being deleted out from under a running process, so only obviously rotated (no longer open) logs can be safely removed. If this is a production system, I'd recommend calling VMware GSS instead of taking it upon yourself. The above command (du -sm * | sort -nr) can be used in any working directory to see what is filling up the logs the most. Here are a few examples of what I deleted to make room:
rm -rf /storage/log/vmware/eam/web/localhost-2020-*
rm -rf /storage/log/vmware/eam/web/localhost_access.2020*
rm -rf /storage/log/vmware/eam/web/catalina-2020*
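If you'd like to see what you're about to delete before committing, here's a rough dry-run sketch - the path and 180-day threshold are just examples from this post, and on a production system I'd still involve GSS first:

#!/usr/bin/python3
# Rough dry-run sketch: list (don't delete) files older than 180 days so they
# can be reviewed before removal. Path and threshold are examples only.
import time
from pathlib import Path

CUTOFF = time.time() - 180 * 86400
TARGET = Path("/storage/log/vmware/eam/web")

total = 0
for path in sorted(TARGET.glob("*")):
    if path.is_file() and path.stat().st_mtime < CUTOFF:
        size = path.stat().st_size
        total += size
        print(f"{size / 1048576:8.1f} MB  {path}")
print(f"Total reclaimable: {total / 1048576:.1f} MB")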

From here, I like to verify that space is cleared:
root@vcenter [ / ]# df -m | grep \/storage\/log
/dev/mapper/log_vg-log 10008 5793 3688 62% /storage/log

Catalina and Tomcat are names for the same thing. This software package proxies inbound HTTP requests to specific applications, allowing many developers to build code without having to construct a soup-to-nuts HTTP server. Other similar (but more recent) projects include Python's Flask.

With HTTP proxies and servers, it is useful to keep comprehensive records of "who did what", both for security reasons ("whodunit") and for debugging. As a result, Tomcat is a serious log hog wherever it exists, and it almost never removes old logs. This is why I evaluated the change as relatively safe.

If this was not an appliance, I would have added a logrotate spec to automatically delete old files from this directory, but it is not recommended to alter VCSA in this way.

Wednesday, August 25, 2021

VMworld 2021 is right around the corner! Here are my top 10 sessions!

VMworld 2021 is online this year

I'll really miss some of the sessions and exploration we've had in past years in person, but I think VMware made the right call this year. We can expect to see a fundamental shift with online conventions - and this will need some unique strategy compared to previous years.

The Basics

I attended my first VMworld in 2016, and to describe it as information overload would be an understatement. It's only been a few years, but here's what I have to say to new VMworld attendees:
  • Give yourself time between sessions: it's too easy to switch between video streams at home - but it's a trap. Your brain needs time to process new information, and normally stretching your legs and walking around would help with that. After a particularly heavy session, get away from your keyboard and give yourself time to think. It's like college: if you take too many classes, you'll perform worse than if you'd capped your course load.
  • Talk to people: The Orbital Jigsaw Discord server can serve as a water cooler of sorts here - remember that you always can learn more with others than on your own.
  • Be kind to your mind: I'm mentioning it twice, and I don't care. Trying to absorb everything will be stressful; the single most important thing you can do is take care of yourself. Don't skip meals, don't skip time with the kids, don't skip out on rest.
VMware has provided a lot more content in the breakout sessions this year, and it's because we can't do stuff like the fun run. Here are my sessions of interest:

Fundamentally Important Sessions

At its core - I'd like to break out sessions that would be of critical importance, aforementioned biases notwithstanding:
  • Enhance Data Center Network Design with NSX and VMware Cloud Foundation [NET1789]
    • Nimish Desai is an extremely colorful presenter. At my first VMworld, I was wandering the halls, heard yelling from one of the auditoriums, and decided to wander in and take a look. It turned out he was asking questions about OSPF; I answered one right, ended up with a trucker cap he'd glued a marketing-noncompliant NSX logo onto, and didn't leave the auditorium for about 3 hours. That session was on NSX-V fundamentals - for a director, he is an extremely capable teacher and presenter.
    • I consider this (other names before it, it's basically NSX fundamentals) session every year a foundation for just about everything VMware and SDN.
  • NSX-T Design, Performance and Sizing for Stateful Services [NET1212]
    • This one has to be good. My other favorite presenter on NSX has always been Samuel Kommu - he specializes in flaying whatever SDN platform crosses his desk within an inch of its life, and then squeezing a little bit more than that out of it. He was the first engineer to get NSX-V past 40 Gigabits/s. Nicolas Michel is a capable engineer on the newer NSX-T team (they appear to be based out of EMEA) and a total Linux and open-source guy too. NSX-T is based almost completely on open-source software, and his team is working to recreate the old NSX functionality with F/OSS.
    • In this case, we're visiting how to build out the stateful back-end (Tier-1) services, essentially the bits that make a network "smart". NSX-T has some highly unique next-gen scaling capabilities for these service types. Packet inspection devices are the bottleneck in nearly all modern enterprise networks, this will present a fresh perspective on solving this problem!
  • Extreme Performance Series: vSphere Advanced Performance Boot Camp [MCL2033]
    • This class every year is basically required for anyone interested in their VCAP (DCV) as it handles the most important subject for virtualization - getting the absolute most value out of your equipment. It is a Tech+ pass session but probably justifies it by itself. If you're having trouble putting together the in-book subjects while studying for VCAP/VCP, this is where you want to go.

Interesting Sessions

  • Apply SRE’s Golden Signals for Monitoring Toward Network Operations [NET1088]
    • The title more or less says it all, this would be step 4 after a round-trip of fundamentals. The first thing I try to do when encountering a new technology is to make it reliable, and this is a logical progression.
  • (Tech+)Future-Proof Your Network with IPv6, Platform Security and Compliance [EDG1024]
    • If you haven't guessed, IPv6 is coming and you can't avoid it. With that out of the way, VMware's Networking and Security Business Unit (NSBU) has covered significant ground getting the rest of the company IPv6-ready. This is a Tech+ session primarily focused on SD-WAN, so if you're interested in how an enterprise can become IPv6-ready, this is where to start.
  • (Tech+)NSX-T Reference Designs for vSphere with Tanzu [NET1426]
    • NSX-T's hidden superpower is actually container networking. It's designed from the ground up with two Container Plugins - Antrea and NCP - that support container networking without complex Flannel/IPTables configurations simply to get stuff to work.
  • Getting Started with NSX Infrastructure as Code [NET2272]
    • I'll be blunt here, I've made several series of blog posts on this already, but NSX-T is a complicated animal, and it's important to build it right. In my opinion, the best way to do this is to prototype your deployment repeatedly until it's as close to perfect as you can get it.
    • There are two major paths to automate NSX-T here:
      • The platform: Ansible/Terraform helps us here to maintain configured state. In a previous life I crushed concrete cylinders to see if they were strong enough - this is like that, but digital (and safer!)
      • The services: vRealize Automation / vCloud Director provides services on top of the base networking we provide, it is important to understand how people consume networks we build.
  • NSX-T and Infrastructure as Code [CODE2741]
    • Yes, this will take more than one session to absorb. VMware understands that - Nicolas Michel is front-ending this one too, he's working on a YouTube channel called vPackets to capture some of this automation knowledge.

Telecom Sessions

I'm breaking this out because "there are dozens of us!" 

Apparently, VMware thinks there are more of us than that - and is diving head-first into the breach. VMware has developed a robust hosting and automation suite of services to help accelerate telecommunication delivery.

I'm hoping this will possibly transform smaller ISPs into more of an Edge model, where the telecom provides the pipe and "stuff" on top of it as an additional revenue source. It'd be pretty exciting - even if you don't have a 4-post rack and some cooling, you could loan some cycles from a colocation space as needed. Despite most complaints, telecommunications companies have a few strengths here, namely:
  • Drive. Telecom engineers do what they do to connect people to information - however often people complain that their internet sucks, these folks are out there working nonstop to make things just that little bit better.
  • Connectivity. While this ought to be a given, do you as a customer want to deal with the stress of relocating your server farm while down-sizing offices due to COVID?
  • Connections (people). Believe it or not, running cable in every major city builds up quite the Rolodex. If anyone can find a viable physical space to fit your equipment/services, it'd be the telecom company.
Before I go too far, there is a ton of sensationalism on "The Edge!(tm)" All this really means is what I've explained here - your telecommunications provider would be empowered to deploy distributed compute stacks regionally to fit your (low latency? more like cost-effective!) workload needs. This is especially important in Alaska, where reaching out to the data center the "next town over" is a microwave relay system reaching hundreds of miles.

There's also quite a bit of misinformation on 5G, which fits into my top priority session in this category:

  • A Tour of the Heart of the 5G Network with Nokia and VMware [EDG1935]
    • You've probably heard of Nokia, but maybe not Nokia Networks. It doesn't matter - attend this session if you're interested in 5G. The architecture changes from 4G to 5G are myriad, the organization maintaining the standards (3GPP) made dramatic improvements in terms of technical design, and this will give you a bird's-eye view.
    • Nokia Networks is a name to track in the future, VMware's NSX-T platform and Nokia's new SR-Linux platform are going to take the data center by storm. Nokia's recent interest in Open Source has culminated in a telecommunication grade workload based on Linux - and they seem to have thought of everything, model-based configuration, automated testing in a container pipeline, the sky is the limit!
  • Demystifying Performance: Meeting Stringent Latency Requirements for RAN [EDG2872]
    • I still groan every time someone states that it's "impossible to virtualize x because of latency!" We wouldn't have a connected Alaska today if we felt that wasn't a good enough reason to try. These guys succeeded.
I look forward to seeing you all there! I'll try my best to be reachable via Twitter @engyak907 and in the Orbital Jigsaw server when I can.

Sunday, August 22, 2021

Managing DNS Servers with Ansible and Jenkins (Unbound, BIND)

DNS is a vital component of all computer networks. Also known as the "Internet Yellow Pages," this service is consumed by every household.

DNS services are typically deployed in several patterns to support users and systems:

  • DNS Forwarder: This deployment method is the most common. Everybody needs name resolution - caching and forwarding DNS results can save you bandwidth and improve localized performance. Most appliances can do this out of the box, and if they don't, try it out! It's really easy and will help you learn how DNS works.
    • Use case: You don't have your own domain and use computers.
  • Managed Public DNS: A significant majority of public domains are managed this way - you pay a third-party provider to manage the authoritative registration of public DNS records.
    • Use case: You have a business and own a domain, but don't have any internal resources that you need to resolve.
    • Use case: You have a business and own a domain, but don't want to manage publicly resolvable nameservers
  • Private/Internal Nameserver: This deployment method is typically enterprise-specific, but is also required for home labs and all manner of weird experiments. Since it's not on the internet, we can violate any and all manner of Internet conventions.
    • The first component here is a recursive nameserver - even with authoritative servers in place, clients still need something to perform recursive lookups for them.
    • Authoritative zones: For any given domain, keep a zone file to resolve against. This will include name-to-record (forward) objects and record-to-name (reverse) objects in separate files.
    • A method to change everything above - this has a high benefit-to-effort ratio.
For this post, we'll build the structure to manage an internal nameserver completely from source control. This is surprisingly easy to get started with - performing this work with abstraction is a welcome convenience, but not initially necessary, as zone files are typically very simple and the application (Bind 9 or Unbound) is only one service.

To perform this, we'll follow this procedure:

  • Install the service - in this case, we'll use CentOS for Bind9 (my old setup), and Debian 11 for Unbound (because Debian 11 is new).
  • Extract the configuration file, and then export it into source control.
  • Create zone files, and then export them into source control
  • Automate delivery from source control to what we'll now call the "DNS Worker Node"

Bind9

dnf install bind
find / -name 'named.conf'
cat /etc/named/named.conf
Example named configuration file (credit where it's due: the vast majority of this configuration was provided by CentOS and Bind9 - I set the forwarders, allow-query, listen-on, and zone directives):
options {
        listen-on { any; };
        listen-on-v6 { any; };
        directory       "/var/named";
        dump-file       "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        secroots-file   "/var/named/data/named.secroots";
        recursing-file  "/var/named/data/named.recursing";
        allow-query { 10.0.0.0/8; 127.0.0.1; 2000::/3; };
        forwarders { 1.1.1.1; 9.9.9.9; };
        /*
         - If you are building an AUTHORITATIVE DNS server, do NOT enable recursion.
         - If you are building a RECURSIVE (caching) DNS server, you need to enable
           recursion.
         - If your recursive DNS server has a public IP address, you MUST enable access
           control to limit queries to your legitimate users. Failing to do so will
           cause your server to become part of large scale DNS amplification
           attacks. Implementing BCP38 within your network would greatly
           reduce such attack surface
        */
        recursion yes;

        dnssec-enable yes;
        dnssec-validation yes;

        managed-keys-directory "/var/named/dynamic";

        pid-file "/run/named/named.pid";
        session-keyfile "/run/named/session.key";

        /* https://fedoraproject.org/wiki/Changes/CryptoPolicy */
        include "/etc/crypto-policies/back-ends/bind.config";
        
};

zone "engyak.net" in {
        allow-transfer { any; };
        file "/etc/named/engyak.net.zone";
        type master;
};
Then, let's build a zone file in source control. Please note that there are additional conventions that should be followed when creating new DNS zone records, this is just an example file that will run!
$TTL 2d
@               SOA             ns.engyak.net. hostmaster.engyak.net.  (
                                1      ; serial
                                3600            ; refresh
                                600             ; retry
                                608400          ; expiry
                                3600 ) ;
;
;
engyak.net.     IN NS           ns.engyak.net.
ns              IN A            10.0.0.1
johnnyfive      IN A            10.1.1.1
duncanidaho     IN A            10.2.2.2
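One optional nicety before committing zone changes: bump the SOA serial automatically so you never forget. This is just a rough sketch that assumes the serial sits on a line annotated with "; serial", as in the example above - a real pipeline might reach for dnspython instead:

#!/usr/bin/python3
# Rough sketch: increment the SOA serial in a zone file before committing.
# Assumes the serial is on a line annotated with "; serial" as shown above.
import re
import sys

path = sys.argv[1]  # e.g. zonefiles/engyak.net
with open(path) as handle:
    text = handle.read()

def bump(match):
    return f"{int(match.group(1)) + 1}{match.group(2)}"

# Match "<number> ; serial" once and increment the number
text = re.sub(r"(\d+)(\s*;\s*serial)", bump, text, count=1)

with open(path, "w") as handle:
    handle.write(text)
print(f"Serial bumped in {path}")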
Copy the named.conf contents into a new source code repository (or your existing one), preferably in an organized fashion. Ansible playbook execution is very straightforward - I'd recommend building this in source control as well; see the above note about potential process improvements.
---
- hosts: ns.engyak.net
  tasks:
    - name: "Update DNS Zones!"
      copy:
        src: zonefiles/engyak.net
        dest: /etc/named/engyak.net.zone
        mode: "0644"
    - name: "Update DNS Config!"
      copy:
        src: conf.d/ns.engyak.net/named.conf
        dest: /etc/named.conf
        mode: "0640"
    - name: "Restart Named!"
      service:
        name: "named"
        state: "restarted"

Any time you run this playbook it will download a fresh configuration and zone file, then restart Bind9.

As a cherry on top, let's make this process smart - if we want to automatically deploy changes to DNS from source control, we need a CI tool like Jenkins. Start off by creating a new Freestyle job set to poll SCM - and yes, this isn't a real repository.




That's it - add entries, live long, and prosper! Since the Ansible playbook and supporting files are fetched via source control, the only setup required on a DNS worker node is to establish a relationship between it and the CI tool, ex. SSH authentication.

Unbound

Unbound is a newer DNS server project and has quite a few interesting properties. I've been using BIND for well over a decade - and Unbound aims to change a few things.
Oddly enough, there is no feature list for this software package, but pretty much everything else is impressively documented. Let's start the installation:
apt install unbound
cat /usr/share/doc/unbound/examples/unbound.conf

Unbound can use the same zonefile format as BIND, so we only need to create a new config file to migrate things over. Note: This is not a production-ready configuration, it's just enough to get me started. 

As I learn more about Unbound, I'll be using source control to implement changes / implement a rollback - an important benefit when making lots of mistakes!


# The server clause sets the main parameters.
server:
        verbosity: 1
        num-threads: 2
        interface: 0.0.0.0
        interface: ::0
        port: 53
        prefer-ip4: no
        edns-buffer-size: 1232

        # Maximum UDP response size (not applied to TCP response).
        # Suggested values are 512 to 4096. Default is 4096. 65536 disables it.
        max-udp-size: 4096
        msg-buffer-size: 65552
        udp-connect: yes
        unknown-server-time-limit: 376

        do-ip4: yes
        do-ip6: yes
        do-udp: yes
        do-tcp: yes

        # control which clients are allowed to make (recursive) queries
        # to this server. Specify classless netblocks with /size and action.
        # By default everything is refused, except for localhost.
        access-control: 10.0.0.0/8 allow
        access-control: 127.0.0.0/8 allow

        private-domain: "engyak.net"
        caps-exempt: "engyak.net"
        domain-insecure: "engyak.net"

        private-address: 10.0.0.0/8

        # cipher setting for TLSv1.2
        tls-ciphers: "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256"
        # cipher setting for TLSv1.3
        tls-ciphersuites: "TLS_AES_128_GCM_SHA256:TLS_AES_128_CCM_8_SHA256:TLS_AES_128_CCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256"

# Python config section. To enable:
# o use --with-pythonmodule to configure before compiling.
# o list python in the module-config string (above) to enable.
#   It can be at the start, it gets validated results, or just before
#   the iterator and process before DNSSEC validation.
# o and give a python-script to run.
python:
        # Script file to load
        # python-script: "/etc/unbound/ubmodule-tst.py"

# Dynamic library config section. To enable:
# o use --with-dynlibmodule to configure before compiling.
# o list dynlib in the module-config string (above) to enable.
#   It can be placed anywhere, the dynlib module is only a very thin wrapper
#   to load modules dynamically.
# o and give a dynlib-file to run. If more than one dynlib entry is listed in
#   the module-config then you need one dynlib-file per instance.
dynlib:
        # Script file to load
        # dynlib-file: "/etc/unbound/dynlib.so"

# Remote control config section.
remote-control:
        # Enable remote control with unbound-control(8) here.
        # set up the keys and certificates with unbound-control-setup.
        control-enable: no

# Authority zones
# The data for these zones is kept locally, from a file or downloaded.
# The data can be served to downstream clients, or used instead of the
# upstream (which saves a lookup to the upstream).  The first example
# has a copy of the root for local usage.  The second serves example.org
# authoritatively.  zonefile: reads from file (and writes to it if you also
# download it), primary: fetches with AXFR and IXFR, or url to zonefile.
# With allow-notify: you can give additional (apart from primaries) sources of
# notifies.
forward-zone:
      name: "."
      forward-addr: 1.1.1.1
      forward-addr: 9.9.9.9
auth-zone:
      name: "engyak.net"
      for-downstream: yes
      for-upstream: yes
      zonefile: "engyak.net.zone"

To automate file delivery here, we'll use a (similar) playbook for Unbound. The Jenkins configuration will not need to be modified, because the playbook will automatically be re-executed.

---
- hosts: ns.engyak.net
  tasks:
    - name: "Update DNS Zones!"
      copy:
        src: zonefiles/engyak.net
        dest: /etc/unbound/engyak.net.zone
        mode: "0644"
    - name: "Update DNS Config!"
      copy:
        src: conf.d/ns.engyak.net/unbound.conf
        dest: /etc/unbound.conf
        mode: "0640"
    - name: "Restart Unbound!"
      service:
        name: "unbound"
        state: "restarted"

Some Thoughts

This method of building DNS records from a source of truth does replace the master-slave (sorry guys, BIND's terms are not my own!) relationship older name servers will typically use. Personally, I like this method of propagation.

The biggest upside here is that a DNS worker node being unavailable does not prevent an engineer from adding/modifying records as long as recursive name servers support multiple resolvers.

It is eventually consistent, as the orchestrator will update every worker node for you. This may be slower or faster, depending on TTL.

The Ansible playbook I used here will kill your DNS node if you push it into an invalid configuration, so this is probably not production-worthy without additional work.
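One way to blunt that risk is a validation gate before the playbook ever restarts anything. Here's a sketch using named-checkzone (which ships with BIND; the Unbound node would want unbound-checkconf instead), with paths matching the examples in this post:

#!/usr/bin/python3
# Sketch: validate the zone file before allowing the Ansible run to proceed.
# named-checkzone ships with BIND; paths match the examples in this post.
import subprocess
import sys

result = subprocess.run(
    ["named-checkzone", "engyak.net", "zonefiles/engyak.net"],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    sys.exit("Zone failed validation - aborting before the playbook runs")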

If you would rather purchase a platform instead of building this capability with F/OSS components, this is basically how Infoblox Grid works.

It'd be really neat to abstract software-specific constructs, which can be done with Python and Jinja2 (or just Ansible and Jinja2!)

Monday, July 5, 2021

NSX Advanced Load Balancer - NSX-T Service Engine Creation Failures: `CC_SE_CREATION_FAILURE` and `Transport Node Not Found to create service engine`

TL;DR

If you see either of these errors, check grep 'ERROR' /opt/avi/log/cc_agent_go_{{ cloud }} for the potential cause. In my case, the / character was not correctly processed by Avi's Golang client (facing vCenter).

The Problem

When trying to configure NSX ALB + NSX-T in my home lab, I was presented with nothing but the following error:

CC_SE_CREATION_FAILURE

The Process

Avi Vantage appears to be treating this as a retriable error, attempting to deploy a service engine five times, which can be re-executed with a controller restart:

Oddly enough, vCenter doesn't report any OVA deploy attempts. The next thing to check here would be the content library:
So far, so good. vCenter knows where to deploy the image from.

Now here's a problem - Avi doesn't provide any documentation on how to troubleshoot this yet - so I did a bit of digging and found that you can bump yourself to root by performing a:

sudo su

Useful note: Avi Vantage is running bullseye/sid with only 821 packages listed under dpkg -l | wc -l. They did do a pretty good job with pre-release cleanup, but there are still a few oddball packages in there. I'd give it a 9/10 - I'd like to see X11 not be installed, but am pleased to see only Python 3!

Avi's logs are located in:

/var/lib/avi/log
/opt/avi/log

Here's what I found in alert_notifications_debug.log:

summary: "Syslog for System Events occured"
event_pages: "EVENT_PAGE_VS"
event_pages: "EVENT_PAGE_CNTLR"
event_pages: "EVENT_PAGE_ALL"
obj_name: "avi_-Avi-se-rctbp"
tenant_uuid: "admin"
related uuids ['avi_-Avi-se-rctbp']
[2021-04-09 20:06:30,923] INFO [alert_engine.processAlertInstance:225] [uuid: ""
alert_config_uuid: "alertconfig-938cf267-e20d-4d8e-a50a-21f0f5a5b633"
timestamp: 1617998694.0 obj_uuid: "avi_-Avi-se-rctbp" threshold: 0 events { report_timestamp: 1617998694 obj_type: SEVM event_id: CC_SE_CREATION_FAILURE module: CLOUD_CONNECTOR internal: EVENT_EXTERNAL context: EVENT_CONTEXT_SYSTEM obj_uuid: "avi_-Avi-se-rctbp" obj_name: "avi_-Avi-se-rctbp" event_details { cc_se_vm_details { cc_id: "cloud-022c7b90-f987-4b15-91bb-1f1405715580" se_vm_uuid: "avi_-Avi-se-rctbp" error_string: "Transport node not found to create serviceengine avi_-Avi-se-rctbp" } } event_description: "Service Engine creation failure" event_pages: "EVENT_PAGE_VS" event_pages: "EVENT_PAGE_CNTLR" event_pages: "EVENT_PAGE_ALL" tenant_name: "" tenant: "admin" } reason: "threshold_exceeded" state: ALERT_STATE_ON related_uuids: "avi_-Avi-se-rctbp" level: ALERT_LOW name: "Syslog-System-Events-avi_-Avi-se-rctbp-1617998694.0-1617998694-45824571" summary: "Syslog for System Events occured" event_pages: "EVENT_PAGE_VS" event_pages: "EVENT_PAGE_CNTLR" event_pages: "EVENT_PAGE_ALL" obj_name: "avi_-Avi-se-rctbp" tenant_uuid: "admin"
From the looks of things - Avi is talking with NSX-T before vCenter to determine appropriate placement, which makes sense.

Update and Root Cause

With the Avi 20.1.6 release, VMware has made a lot of improvements to logging! We're now seeing this error in the GUI (ensure that "Internal Events" is checked):






Let's take a look at the new logging. Avi's controller system leverages a series of Go modules called "cloud connectors", each dedicated to a specific interface. Each one has its own log file in
/opt/avi/log/cc_
2021-07-04T20:20:42.801Z        ERROR   vcenterlib/vcenter_utils.go:606 [10.66.0.202][avi-mgt-vni-10.7.80.0/24] object references is empty
2021-07-04T20:20:42.819Z        ERROR   vcenterlib/vcenter_utils.go:578 [10.66.0.202][avi-mgt-vni-10.7.80.0/24] object references is empty
2021-07-04T20:20:42.822Z        ERROR   vcenterlib/vcenter_se_lifecycle.go:432  [10.66.0.202][QH] [10.66.0.202] Network 'avi-mgt-vni-10.7.80.0/24' matching not found in Vcenter
2021-07-04T20:20:42.822Z        ERROR   vcenterlib/vcenter_se_lifecycle.go:891  [10.66.0.202] [10.66.0.202] Network 'avi-mgt-vni-10.7.80.0/24' matching not found in Vcenter
Now, this vn-segment does exist in vCenter, so I applied the "unescaped shell character" instinct from years of Linux/Unix administration and reformatted it to avi-mgt-vni-10.7.80.0_24.
Since we don't get a Redeploy button (please, VMware!), I restarted the controller, and all SE deployments succeeded after that.

Sunday, June 20, 2021

World WiFi Day 2021!

World WiFi Day

We (human beings) have several weird superpowers, but the ability to communicate over vast distances has always fascinated me the most.

I've had the privilege of meeting some of the most truly capable pioneers in this field - but the reality here is that we're faced with a very unequal world.

Authors like William Gibson and Neal Stephenson had the right of it as well - and while we're not quite living in that dystopian future, technology can become a great equalizer.

So yeah - as telecommunications operators we have the responsibility to bridge this gap!

Learn More

I'm always surprised by how much there still is to learn, even in fields I feel like I already know. Here are a few learning approaches that will help you build out a good foundation for learning Wi-Fi (and more!):

  • Amateur Radio: It's cheesy, sure. I'm amazed at how much my grown-up self can now do with amateur radio - I've been licensed since the '90s (KL0NS) and the community is doing so much more now than ever. When I was very young, this was a good opportunity to learn the principles of radio outdoors.
    • In Anchorage, Alaska we have it pretty good - KL7AA runs its own test sessions, and they sell the study book I used way back when
    • Everywhere else in the US, the test costs $15 and probably takes about 2 days to study for. There's no reason not to try it out and participate. Most of the study material is free, and we even get practice tests
    • ARRL provides good ways to participate
    • If you join a radio club they'll find different ways to exercise your brain, and they're usually pretty fun. In addition, you'll be helping maintain emergency communication networks.
  • Get Certified. Also pretty cheesy - I know how people feel about IT certs, and I'd still argue for them in this case. For Wi-Fi, the CWNP organization tends to serve the same role as the Linux Professional Institute: employers don't know about them, but it's really effective in terms of education.

Do More

Let's just cover some volunteer opportunities here - because there's no point in building skills if you don't use them:

  • World Wi-Fi Day
  • ITDRC These guys are really neat. The IT Disaster Resource Center leverages oldie-but-goodie enterprise telecom/IT equipment to provide disaster relief all over the continental US. Check out their deployment map!
  • Airheads Volunteer Corp. I know this is a vendor plug, but this approach is really cool if you can travel!
  • United Way

On Volunteering

One of the passive effects of these approaches is that you'll get better as you go. Employers nearly always constrain your learning path to what they need at the moment, often to their own detriment. They may not know what they'll need you to do next year - COVID showed us that. Volunteering not only gives you an opportunity to help others but also passively improves your skills outside of the usual "corporate playbook".

Sunday, June 6, 2021

XML, JSON, YAML - Python data structures and visualization for infrastructure engineers

At some point, we can't "do it all" with one block of code. 

As developers, we need to store persistent data for a variety of reasons:

  • We want it for later execution (or to compare it to another result)
  • We're sick of storing variables in code. This matters a lot more in compiled languages than interpreted ones
  • We want the results to end up in some form of a deliverable report

Let's cover a computer science concept being used here - semaphores. Edsger Dijkstra coined this term from the Greek sema (sign) and phero (bearer) (you may remember him from OSPF) to solve Inter-Process Communication (IPC) issues.

To provide a reductionist example: process A and process B need to communicate somehow, but shouldn't access each other's memory (or, in the '60s, couldn't). To solve this problem, the developer needs a method of storing variables in a manner that is both efficient and can be consistently interpreted.

Dijkstra's example, in this case, was binary and required a schema to interpret the same data - but it was not specifically limited to single binary blocks. This specific need actually influenced one of the three data formats we're comparing here - consequently, the oldest.

But which one do I use? TL;DR?

Spoiler alert - anyone working with automation MUST learn all three to be truly effective. My general guidance would be:

  • This is a personal preference, but I would highly recommend YAML for human inputs. It's extremely easy to write, and while I generally prefer JSON it's much easier to first write a document into YAML and then convert it. If you take user input or just want to get a big JSON document started, I'd do it this way.
    • YAML User input drivers can also parse JSON, making this an extremely flexible approach.
  • JSON is good for storing machine inputs/outputs. Because all typing is pretty explicit with JSON, json.dumps(dict, indent=4) is pretty handy for previewing what your code thinks structured data looks like. Technically this is possible with YAML, but conventions on, say, a string literal can be squishy.
    • YAML with name: True could be interpreted as:
      • JSON of "name": true, indicating a Boolean value
      • JSON of "name": "True", indicating a String
    • Sure, this is oversimplified, and YAML can be explicitly typed, but generally YAML is awesome for its speed and low initial friction (see the short demo after this list). If an engineer knows YAML really well (and writes their own classes for it), going all-YAML here is completely possible - but to me that's just too much work.
    • If you use it in the way I recommend, just learn to interpret JSON and use Python's JSON library natively, and remember json.dumps(dict, indent=4) for outputs. You'll pick it up in about half an hour and just passively get better over time.
  • Use XML if that's what the other code is using. Python's Element and ElementTree constructs are more nuanced than dictionaries, so a package like defusedxml is probably the best way to get started. There are quite a few security issues with naive XML parsing (entity expansion and the like), so using the basic XML libraries by themselves is ill-advised. xmltodict is pretty handy if you just want to convert it into another format.
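Here's the ambiguity from the JSON bullet above in two lines, using the same ruamel.yaml library as the example later in this post:

#!/usr/bin/python3
# Two-line demo of the "name: True" ambiguity described above.
import json
from ruamel.yaml import YAML

yaml = YAML(typ='safe')
print(json.dumps(yaml.load("name: True")))    # {"name": true}   - a Boolean
print(json.dumps(yaml.load('name: "True"')))  # {"name": "True"} - a string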

Note: JSON and XML both support schema validation, an important aspect of semaphores. YAML doesn't have a native equivalent, but I have used the Python Cerberus module to do the same thing (example below).
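For completeness, here's a minimal Cerberus sketch - the schema and document are invented for the example, but this is the pattern for giving YAML inputs the validation that JSON and XML get from their schema languages:

#!/usr/bin/python3
# Minimal sketch: schema-validate a YAML document with Cerberus.
# The schema and document are made up for the example.
from cerberus import Validator
from ruamel.yaml import YAML

schema = {
    "hostname": {"type": "string", "required": True},
    "vlan": {"type": "integer", "min": 1, "max": 4094},
}

document = YAML(typ='safe').load("hostname: ns01\nvlan: 80\n")

validator = Validator(schema)
if validator.validate(document):
    print("Input OK:", document)
else:
    print("Input failed validation:", validator.errors)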

YAML

YAML was initially released in 2001 and has gained recent popularity with projects like Ansible. YAML 1.2 was released in 2009 and is publicly maintained by the community, so it won't have industry bias (but also won't change as quickly). YAML writes a lot like Python, consuming a ton of whitespace and being particular about tags. Users either love or hate it - I typically only use it for human inputs and objects that are frequently peer-reviewed.

NOTE: one big upside to YAML with people processes is comment support. YAML supports comments, but JSON does not.

YAML is pretty easy to start using in Python. I'm a big fan of the ruamel.YAML library, which adds on some interesting capabilities when parsing human inputs. I've found a nifty way to parse using try/except blocks - making a parser that is supremely agnostic, ingesting JSON or YAML, as a string or a file!

---
message:
  items:
    item:
      "@tag": Blue
      "#text": Hello, World!
#!/usr/bin/python3

import json

from ruamel.yaml import YAML
from ruamel.yaml import scanner

# Load Definition Classes
yaml_input = YAML(typ='safe')
yaml_dict = {}

# Input can take a file first, but will fall back to YAML processing of a string
try:
    yaml_dict = yaml_input.load(open('example.yml', 'r'))
except FileNotFoundError:
    print('Not found as file, trying as a string...')
    yaml_dict = yaml_input.load('example.yml')
finally:
    print(json.dumps(yaml_dict, indent=4))

JSON

JSON was first standardized in 2006 (RFC 4627) and is currently maintained by the IETF. Currently, Python 3 will visually represent dicts using JSON-like notation as well - making things pretty intuitive. In my experience, writing JSON by hand is pretty annoying because it's picky.

{
    "message": {
        "items": {
            "item": {
                "@tag": "Blue",
                "#text": "Hello, World!"
            }
        }
    }
}
#!/usr/bin/python3

import json

with open('example.json', 'r') as file:
    print(json.dumps(json.loads(file.read())))

Typically, I'll just use json.dumps(dict, indent=4) on a live dict when I'm done with it - dumping it to a file. JSON is a well-defined standard and software support for it is excellent.

Due to its IETF bias, JSON's future seems to focus on streaming/logging required for infrastructure management. JSON-serialized Syslog is a neat application here, as you can write it to a file as a single line, but also explode for readability, infuriating grep users everywhere.

XML

XML is the oldest data language typically used for automation/data ingestion, and it really shows. XML was originally established by the W3C in 1998 and is used for many document types like Microsoft Office.

XML's document and W3C bias read very strongly. Older Java-oriented platforms like Jenkins CI heavily leverage XML for semaphores, document reporting, and configuration management. The strict validation (documents MUST be well-formed) required by compiled languages synergizes well with the capabilities provided. XML also heavily uses HTML-style escaping and tagging approaches, making it familiar to some web developers.

XML has plenty of downsides. Crashing on invalid input is generally considered excessive or "Steve Ballmer"-esque, making the language favorable for mission-critical applications where misinterpreted data MUST NOT be processed, and miserable everywhere else. For human inputs, it's pretty wordy, which impacts readability quite a bit.

Schemas

XML has two tiers of schema - Document Type Definition (DTD) and XML Schema. DTDs are very similar to HTML DTDs and provide a method of validating that the language is correctly used. XML Schema Definitions (XSD) provide typing and structures for validation and are the more commonly used tool.

Python Example

XML leverages the Element and ElementTree constructs in Python instead of dicts. This is due to XML being capable of so much more, but it's still pretty easy to use:

XML Document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<message>
    <items>
        <item tag="Blue">Hello, World!</item>
    </items>
</message>
#!/usr/bin/python3

from defusedxml.ElementTree import parse
import xmltodict
import json

document = parse('example.xml').getroot()

print(document[0][0].text + " " + json.dumps(document[0][0].attrib))

file = open('example.xml', "r").read()
print(json.dumps(xmltodict.parse(file), indent=4))

After using both methods, I generally prefer using xmltodict for data processing - it lets me use a common language, Python lists and dicts to process all data regardless of source, allowing me to focus more on the payload. We're really fortunate to have this fantastic F/OSS community enabling that!

Saturday, May 22, 2021

Troubleshooting with VMware NSX ALB/Avi Vantage

NSX Advanced Load Balancer - Logging and Troubleshooting Cheat Sheet

Get into the OS Shell (all elements)

sudo su

Controller Log Locations

Note: Everything in /var/lib/avi/logs is managed by Elasticsearch. I wouldn't mess with it.

Events published to the GUI: /var/lib/avi/logs/ALL-EVENTS/

The primary log directory for Avi Vantage Controllers is /opt/avi/log. As this feeds into Elasticsearch, they have file outputs for every severity level. An easy way to get data on a specific object would be to build a grep statement like this:

grep {{ regex }} /opt/avi/log/{{ target }}
  • alert_notifications_*: Summarized problems log. Events are in a json format!

Troubleshooting Deployment Failures

  • avi-nsx.*: Presumably for NSX-T integration; further investigation required.
  • cloudconnectorgo.*: Avi's cloud connector is pretty important given their architecture. This is where you can troubleshoot any issues getting a cloud turned up, or any initial provisioning issues.
  • vCenter*: vCenter write mode activity logs. Look here for SE deployment failures in a traditional vSphere cloud.

Service Engines

Troubleshooting

Checking the Routing Table

NSX ALB / Avi uses FRRouting (7.0 as of release 20.1) over network namespaces to achieve management/data-plane separation and VRF-Lite. To access the data plane, you will need to change namespaces! Unlike NSX-T, this doesn't happen inside Docker containers, which means the following commands work directly as root:

  • Show all VRF+Namespaces ip netns show
  • Send a one-shot command to the namespace: ip netns exec {{ namespace }} {{ command }} Example: ip netns exec avi_ns1 ip route show
  • Start a shell in the desired namespace: ip netns exec {{ namespace }} {{ shell }} Example: ip netns exec avi_ns1 bash

Once in the bash shell, all normal commands apply as if there were no namespace/VRF.

For more information on Linux Network Namespaces, here's a pretty good guide: https://www.opencloudblog.com/?p=42

Logging

All SE logging is contained in /var/lib/avi/log. Here are the significant log directories there:

  • IMPORTANT! bgp: This is where all the routing protocol namespace logging from FRRouting lands.
  • traffic: This one's pretty tough to parse; it's better to use Avi's Elasticsearch instead.

Conclusion

Avi Vantage has a pretty solid logging schema, but is very much a growing product. These logs will eventually be exposed more fully to the GUI/API, but for now it's handy to grep away. I'll be updating this list as I find more.

Saturday, May 15, 2021

VMware NSX Advanced Load Balancer - Installation

Pre-Requisites

Before beginning the Avi installer, I configured the following in my environment:
  • Management Segment (NSX-T Overlay). This is set up with DHCP for quick automatic provisioning - no ephemeral addresses required
  • Data Segments (NSX-T Overlay). Avi will build direct routes to IPs in this network for vIP processing. I built 3 - 
    • Layer 2 Cloud (attached to Tier-1)
    • NSX Integrated (attached to Tier-1)
    • Layer 3 Cloud (attached to Tier-0)

Avi also supports automatic SE deployment - which means that automatic IP configuration is important. Avi supports SLAAC (IPv6) and DHCP (IPv4) for this purpose.

NSX-T is unsurprisingly symbiotic here. I have built a dedicated Tier-1 for NSX ALB, and we're going to provide DHCP services via the Tier-1 router. If this was a production deployment or a VVD-compliant SDDC, this should be performed with a DHCP relay. I just haven't set aside time to deploy DHCP/IPAM tools for reasons that are beyond me.

The following changes are performed on the Tier-1 Logical Router. This step is not required for external DHCP servers!

The following changes are to be performed on the Logical Segment. 
If production, DHCP relay is selectable from the following screen:


Installation

 Avi Controller

VMware provides a prepackaged OVA for the Vantage controller - and it's a pretty large appliance. 24 GB of memory and 8 vCPUs is a lot of resourcing for a home lab. There are no sizing options here.

Installation is pretty easy - once the OVA is deployed, I used my CI/CD pipeline and GitHub to deploy DNS updates and logged right into the installation wizard.

AVI version 20.1.5 has changed the installer approach from the above to this. When "No cloud setup" is selected, it still insists on configuring a new cloud. This isn't too much of a problem:
Note: This passphrase is for backups - make sure to store it somewhere safe!
From here, we configure vCenter's integration:


Let's ensure that Avi is connected to vCenter and has no issues. Note: VMware recommends write-mode for vCenter clouds.


After install, it's useful to get a little oriented. In the top left of the Avi Vantage GUI, you'll find the major configuration branches by expanding the triple ellipsis. Get familiar with this part of the GUI - you'll be using it a lot!




Patching

Before we build anything, I prefer to load any applicable patches. This should help avoid software issues on deployment, and patching is usually simpler and lower impact when you have no configuration yet.

Avi Vantage really excels here - the upgrade process is pretty much fully automated, with extensive testing. As a result, it's probably going to be slower than a "manual" upgrade, but it is definitely more reliable. Our industry really needs to get over this - if you have a good way to keep an eye on things while keeping busy, you're ahead of the curve!

We'll hop on over to Administration -> Controller -> Software:


While this upgrade takes place - Avi's controller will serve up a "Sorry Page" indicating that it's not available yet - which is pretty neat.

When complete, you should see this:



Avi Clouds

Clouds are Avi's construct for deployment points - and we'll start with the more traditional one here - vCenter. Most of this has already been configured as part of the wizard above. Several things need to be finished for this to run well, however:

  • Service Engine Group - here we customize service engine settings
  • IPAM - Push IP address, get a cookie
SE Group changes are executed under Infrastructure -> SE Groups. Here I want to constrain the deployment to specific datastores and clusters.
IPAM is located in two places, Templates -> Profiles -> IPAM/DNS Profiles (bindable profile):

Ranges are configured under Networks. If you configure a write-access cloud, it'll scan all of the port groups and used IP ranges for you. IP ranges and Subnets will still need to be configured and/or confirmed:


Note: This IPAM profile does need to be added to the applicable cloud to leverage auto-allocate functions with vIPs.

Avi Service Engines

Now that the setup work is done, we can fire up the SE deployments by configuring a vIP. By default, Avi will conserve resources by deploying the minimum SEs required to get the job done - if there's no vIP, this means none. It takes some getting used to!
Once the vIP is applied, you should see some deployment jobs in vCenter:

Service engines take a while to deploy - don't get too obsessive if the deployment lags. There doesn't appear to be a whole lot of logging to indicate deployment stages, so the only option here is to wait it out. If a service engine doesn't deploy quite right, delete it. This is not the type of application we just hack until it works - I did notice that it occasionally will deploy with vNICs incorrectly configured.

From here, we can verify that all service engines are deployed. The health score will climb up over time if the deployment is successful.

Now we can build stuff! 

