Saturday, May 22, 2021

Troubleshooting with VMware NSX ALB/Avi Vantage

NSX Advanced Load Balancer - Logging and Troubleshooting Cheat Sheet

Get into the OS Shell (all elements)

sudo su

Controller Log Locations

Note: Everything in /var/lib/avi/logs is managed by Elasticsearch. I wouldn't mess with it.

Events published to the GUI: /var/lib/avi/logs/ALL-EVENTS/

The primary log directory for Avi Vantage Controllers is /opt/avi/log. Since this feeds into Elasticsearch, there are file outputs for every severity level. An easy way to get data on a specific object is to build a grep statement like this:

grep {{ regex }} /opt/avi/log/{{ target }}
  • alert_notifications_*: Summarized problems log. Events are in JSON format!
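
For example (the virtual service name here is hypothetical), pulling recent alert events for a specific object looks like this:

grep -i "vs-web-prod" /opt/avi/log/alert_notifications_* | tail -n 5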

Troubleshooting Deployment Failures

  • avi-nsx.*: Presumably for NSX-T integration; further investigation required.
  • cloudconnectorgo.*: Avi's cloud connector is pretty important given their architecture. This is where you can troubleshoot any issues getting a cloud turned up, or any initial provisioning issues.
  • vCenter*: vCenter write mode activity logs. Look here for SE deployment failures in a traditional vSphere cloud.
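
For SE deployment failures specifically, a quick first pass is to sweep the cloud connector and vCenter logs for error-ish keywords (the keywords are just a guess at what's interesting - adjust to taste):

grep -iE "error|fail" /opt/avi/log/cloudconnectorgo.* /opt/avi/log/vCenter* | tail -n 20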

Service Engines

Troubleshooting

Checking the Routing Table

NSX ALB / Avi uses FRRouting (7.0 as of release 20.1) over network namespaces to achieve management/data plane separation and VRF-Lite. To access the data plane, you will need to change namespaces! Unlike NSX-T, this doesn't happen over Docker namespaces. This means that the following standard Linux commands work as root:

  • Show all VRFs/namespaces: ip netns show
  • Send a one-shot command to a namespace: ip netns exec {{ namespace }} {{ command }} Example: ip netns exec avi_ns1 ip route show
  • Start a shell in the desired namespace: ip netns exec {{ namespace }} {{ shell }} Example: ip netns exec avi_ns1 bash

Once in the bash shell, all normal commands apply as if there were no namespace/VRF.
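
Putting that together, a quick routing table check from the SE shell looks like this (avi_ns1 is the example namespace name from above - yours may differ):

ip netns show
ip netns exec avi_ns1 ip route show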

For more information on Linux Network Namespaces, here's a pretty good guide: https://www.opencloudblog.com/?p=42

Logging

All SE logging is contained in /var/lib/avi/log. Here are the significant log directories there:

  • IMPORTANT! bgp: This is where all the routing protocol namespace logging from FRRouting lands.
  • traffic: This one's pretty tough to parse, and it's better to use Avi's Elasticsearch instead.
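
The exact file names under the bgp directory depend on which FRR daemons are running, so a broad sweep is the easiest way to spot adjacency changes (the grep pattern is just a starting point):

ls /var/lib/avi/log/bgp/
grep -ri "adjchange" /var/lib/avi/log/bgp/ | tail -n 20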

Conclusion

Avi Vantage has a pretty solid logging schema, but is very much a growing product. These logs will eventually be exposed more fully to the GUI/API, but for now it's handy to grep away. I'll be updating this list as I find more.

Saturday, May 15, 2021

VMware NSX Advanced Load Balancer - Installation

Pre-Requisites

Before beginning the Avi installer, I configured the following in my environment:
  • Management Segment (NSX-T Overlay). This is set up with DHCP for quick automatic provisioning - no ephemeral addresses required
  • Data Segments (NSX-T Overlay). Avi will build direct routes to IPs in this network for vIP processing. I built 3 - 
    • Layer 2 Cloud (attached to Tier-1)
    • NSX Integrated (attached to Tier-1)
    • Layer 3 Cloud (attached to Tier-0)

Avi also supports automatic SE deployment - which means that automatic IP configuration is important. Avi supports SLAAC (IPv6) and DHCP (IPv4) for this purpose.

NSX-T is unsurprisingly symbiotic here. I have built a dedicated Tier-1 for NSX ALB, and we're going to provide DHCP services via the Tier-1 router. If this was a production deployment or a VVD-compliant SDDC, this should be performed with a DHCP relay. I just haven't set aside time to deploy DHCP/IPAM tools for reasons that are beyond me.

The following changes are performed on the Tier-1 Logical Router. This step is not required for external DHCP servers!

The following changes are to be performed on the Logical Segment. 
In production, DHCP relay is selectable from the following screen:


Installation

 Avi Controller

VMware provides a prepackaged OVA for the Vantage controller - and it's a pretty large appliance. 24 GB of memory and 8 vCPUs is a lot of resourcing for a home lab. There are no sizing options here.

Installation is pretty easy - once the OVA is deployed, I used my CI/CD pipeline and GitHub to deploy DNS updates and logged right into the installation wizard.
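
If you'd rather script the OVA deployment than click through vCenter, govc can do it - this is just a sketch with hypothetical file and VM names, and the controller's OVF properties can be left alone if you're relying on DHCP like I am:

govc import.spec controller.ova > controller-spec.json
# edit controller-spec.json (name, network mapping, OVF properties), then:
govc import.ova -options=controller-spec.json -name=avi-controller-01 controller.ova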

Avi version 20.1.5 has changed the installer approach from earlier releases. When "No cloud setup" is selected, it still insists on configuring a new cloud. This isn't too much of a problem:
Note: This passphrase is for backups - make sure to store it somewhere safe!
From here, we configure vCenter's integration:


Let's ensure that Avi is connected to vCenter and has no issues. Note: VMware recommends write-mode for vCenter clouds.


After install, it's useful to get a little oriented. In the top left of the Avi Vantage GUI, you'll find the major configuration branches by expanding the triple ellipsis. Get familiar with this part of the GUI - you'll be using it a lot!




Patching

Before we build anything, I prefer to load any applicable patches. This should help avoid software issues on deployment, and patching is usually simpler and lower impact when there's no configuration in place yet.

Avi Vantage really excels here - the upgrade process is pretty much fully automated, with extensive testing. As a result, it's probably going to be slower than "manual" upgrades, but it's definitely more reliable. Our industry really needs to get over its fixation on upgrade speed - if you have a good way to keep an eye on things while keeping busy, you're ahead of the curve!

We'll hop on over to Administration -> Controller -> Software:


While this upgrade takes place - Avi's controller will serve up a "Sorry Page" indicating that it's not available yet - which is pretty neat.

When complete, you should see this:



Avi Clouds

Clouds are Avi's construct for deployment points - and we'll start with the more traditional one here - vCenter. Most of this has already been configured as part of the wizard above. Several things need to be finished for this to run well, however:

  • Service Engine Group - here we customize service engine settings
  • IPAM - Push IP address, get a cookie

SE Group changes are executed under Infrastructure -> SE Groups. Here I want to constrain the deployment to specific datastores and clusters.
IPAM is configured in two places. The first is Templates -> Profiles -> IPAM/DNS Profiles (a bindable profile):

The second is Networks, where ranges are configured. If you configure a write-access cloud, it'll scan all of the port groups and in-use IP ranges for you. IP ranges and subnets will still need to be configured and/or confirmed:


Note: This IPAM profile does need to be added to the applicable cloud to leverage auto-allocate functions with vIPs.
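
If you'd rather wire this up via the API than the GUI, the cloud object has an ipam_provider_ref field pointing at the profile. A rough sketch (controller hostname, credentials, UUIDs, and the API version header are all placeholders, and the PATCH add/replace/delete convention is worth double-checking against the API guide for your release):

curl -k -u admin:'<password>' -H 'X-Avi-Version: 20.1.4' -H 'Content-Type: application/json' \
  -X PATCH https://avi-controller/api/cloud/cloud-xxxxxxxx \
  -d '{"replace": {"ipam_provider_ref": "/api/ipamdnsproviderprofile/ipamdnsproviderprofile-xxxxxxxx"}}'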

Avi Service Engines

Now that the setup work is done, we can fire up the SE deployments by configuring a vIP. By default, Avi will conserve resources by deploying the minimum SEs required to get the job done - if there's no vIP, this means none. It takes some getting used to!
Once the vIP is applied, you should see some deployment jobs in vCenter:

Service engines take a while to deploy - don't get too obsessive if the deployment lags. There doesn't appear to be a whole lot of logging to indicate deployment stages, so the only option here is to wait it out. If a service engine doesn't deploy quite right, delete it. This is not the type of application we just hack until it works - I did notice that it occasionally will deploy with vNICs incorrectly configured.

From here, we can verify that all service engines are deployed. The health score will climb up over time if the deployment is successful.
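
The same check can be scripted against the controller API - GET /api/serviceengine lists the SEs the controller knows about (field names below are from memory, so treat this as a sketch; the GUI check works just as well):

curl -k -u admin:'<password>' -H 'X-Avi-Version: 20.1.4' \
  https://avi-controller/api/serviceengine | python3 -m json.tool | grep -E '"name"|"se_connected"'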

Now we can build stuff! 


Sunday, May 9, 2021

Leveraging Hyperglass and NSX-T!

 For this example deployment, I'll be using my NSX-T Lab as the fabric, VyOS for the Overloaded Router role, and trying out hyperglass:



Installation (VyOS)

I already have a base image for VyOS with its management VRF set up - and updating the base image prior to deployment is a breeze due to the vSphere 7 VM Template Check Out Feature.

In this case, I'll deploy to an NSX-T External Port and peer up, with fully implemented ingress filtering:
Export Filters - Permit all prefixes:
Import Filters - don't trust any prefixes from this router:
Set in the correct directions:
Configure the BGP Neighbors:

From here, we build the VNF by adding the following configuration:
protocols {
    bgp 64932 {
        address-family {
            ipv4-unicast {
                maximum-paths {
                    ebgp 4
                }
            }
            ipv6-unicast {
                maximum-paths {
                    ebgp 4
                }
            }
        }
        neighbor 10.7.2.1 {
            remote-as 64902
        }
        neighbor 10.7.2.2 {
            remote-as 64902
        }
        neighbor x:x:x:dea::1 {
            address-family {
                ipv6-unicast {
                }
            }
            remote-as 64902
        }
        neighbor x:x:x:dea::2 {
            address-family {
                ipv6-unicast {
                }
            }
            remote-as 64902
        }
        timers {
            holdtime 12
            keepalive 4
        }
    }
}
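
The block above is in VyOS's show-configuration output format. If you're building it by hand, the workflow is to enter configuration mode, issue the matching set commands, then commit and save - note that VyOS 1.4 drops the ASN from the path in favor of a system-as setting, so adjust to your release:

configure
set protocols bgp 64932 neighbor 10.7.2.1 remote-as 64902
# ...remaining neighbors, address-family and timer settings as above...
commit
save
exit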

Then, let's verify that BGP is working:


vyos@vyos-lg-01:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 10.7.2.254, local AS number 64932 vrf-id 0
BGP table version 156
RIB entries 75, using 14 KiB of memory
Peers 4, using 85 KiB of memory

Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
10.7.2.1             4      64902       278       272        0    0    0 00:11:31           40       42
10.7.2.2             4      64902        16        13        0    0    0 00:00:16           39       42
x:x:x:dea::1         4      64902       234       264        0    0    0 00:11:43 NoNeg
x:x:x:dea::2         4      64902       283       368        0    0    0 00:11:43 NoNeg

Total number of neighbors 4
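
The NoNeg state on the last two peers just means the IPv4 unicast address family wasn't negotiated with them - they're IPv6-only sessions - so check them against the IPv6 table instead (on older VyOS releases the equivalent command is show ipv6 bgp summary):

vyos@vyos-lg-01:~$ show bgp ipv6 unicast summary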

The VNF is configured! Now, we'll follow the application maintainer's instructions for installation: https://hyperglass.io/docs/getting-started

The documentation for install is pretty good - but some customization is still required. I built out the following configuration files - hyperglass leverages YAML as its configuration file format. I did make some changes:

  • Some combination of VyOS 1.4, MP-BGP, and/or VRF-lite changed the syntax for the BGP views around. Setting a commands file fixes this.
  • The VyOS driver appends a host mask (/32, /128) to routes queried with no prefix length specified.
    • NB: I reached out to the maintainer (Matt Love) and he informed me that this was configurable per-VRF using the force-cidr option.

This particular tool has been extremely useful to me, as NSX-T still lacks comprehensive BGP visibility without CLI access - and even if it didn't, this will provide consumers an easy way to validate that prefixes have propagated, and where.

Sunday, May 2, 2021

PSA: PAN-OS Drops BGP peers with an invalid NLRI / Always filter inbound prefixes from Avi Vantage

If Avi Vantage IPAM cannot allocate an address for a new vIP, it will advertise an all-zeros host address - 0.0.0.0/32:


This will cause Palo Alto PAN-OS to restart the peer - even if it is not the immediate downstream for the offending prefix. Palo Alto uses routed as its dynamic routing engine - so this is probably default behavior inherited from there:

**** EXCEPTION   0x4103 - 57   (0000) **** I:008e7cd1 F:00000004
qbmlpar2.c 1352 :at 20:54:21, 2 May 2021 (1822572648 ms)
UPDATE message contains NLRI of 0.0.0.0.

**** PROBLEM     0x4102 - 46   (0000) **** I:008e7cd1 F:00000004
qbnmmsg.c 1074 :at 20:54:21, 2 May 2021 (1822572648 ms)
NM has received an UPDATE message that failed to parse.
Entity index               = 1
Local address              = 10.6.64.9
Local port                 = 0
Remote address             = 10.6.64.12
Remote port                = 0
Scope ID                   = 0

**** EXCEPTION   0x4102 - 71   (0000) **** I:008e7cd1 F:00000020
qbnmsnd2.c 167 :at 20:54:21, 2 May 2021 (1822572648 ms)
A NOTIFICATION message is being sent to a neighbor due to an unexpected
problem.
NM entity index       = 1
Local address         = 10.6.64.9
Local port            = 0
Remote address        = 10.6.64.12
Remote port           = 0
Scope ID              = 0
Remote AS number      = 64905
Remote BGP ID         = 0X0A06400C
Error code            = UPDATE Message Error (3)
Error subcode         = Invalid Network Field (10)

This could cause a network outage for all subtending networks on this peer. Consider this a friendly reminder to always leverage route filtering between autonomous systems!

Unfortunately, strict import filters on PAN-OS did not resolve this issue.
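
Presumably the update fails to parse before import policy is ever evaluated, so the filtering has to happen before the route reaches PAN-OS - at the Avi SE's BGP peering, or on an intermediate router. As a rough illustration in the VyOS syntax used elsewhere on this blog (the prefix-list name, vIP range, neighbor address, and ASN are all placeholders), an import policy on a device facing the SEs might look like:

set policy prefix-list AVI-IN rule 10 action deny
set policy prefix-list AVI-IN rule 10 prefix 0.0.0.0/32
set policy prefix-list AVI-IN rule 20 action permit
set policy prefix-list AVI-IN rule 20 prefix 10.6.66.0/24
set policy prefix-list AVI-IN rule 20 ge 32
set protocols bgp 64902 neighbor 10.6.64.12 address-family ipv4-unicast prefix-list import AVI-IN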

NSX-T Edge Transport Node Packet Captures


NSX-T Edge nodes have a rudimentary packet capture tool built into the box. It is important to have a built-in tool here, as GENEVE encapsulation will wrap just about everything coming out of a transport node.

NSX-T's CLI guide indicates the method for packet captures - from here we can break it down to a few steps:

  • Find the VRF you want to capture from
  • Find the interface in that VRF you want to capture from
  • Capture from this interface!

get logical-routers
vrf {{ desired VRF }}
get interfaces
set capture session 0 interface {{ interface-id }} direction dual
set capture session 0 file example.pcap

The result will be placed in:

/var/vmware/nsx/file-store/

I do have some notes to be aware of here:

  • Be careful with packet captures! This is on an all-CPU router - so isolating the device before capturing packets is a wise choice. We can do that with NSX-T; we just need to remember to.
  • It's possible to use tcpdump-based packet filters instead of a wholesale capture - just replace the final line with a command similar to this:
set capture session 0 file example.pcap expression port 179
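
To analyze the capture, pull it off the Edge from the file-store path above. Assuming root SSH is enabled on the node (the hostname here is a placeholder), a plain scp plus tcpdump does the trick:

scp root@nsx-edge-01:/var/vmware/nsx/file-store/example.pcap .
tcpdump -nn -r example.pcap | head

From there, Wireshark will happily open the same file for deeper inspection.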

Get an A on ssllabs.com with VMware Avi / NSX ALB (and keep it that way with SemVer!)

Cryptographic security is an important aspect of hosting any business-critical service. When hosting a public service secured by TLS, it is ...