Monday, July 5, 2021

NSX Advanced Load Balancer - NSX-T Service Engine Creation Failures: `CC_SE_CREATION_FAILURE` and `Transport Node Not Found to create service engine`

TL;DR

If you see either of these errors, check the output of grep 'ERROR' /opt/avi/log/cc_agent_go_{{ cloud }} for the potential cause. In my case, the / character in a segment name was not correctly processed by Avi's vCenter-facing Golang client.

The Problem

While trying to configure NSX ALB + NSX-T in my home lab, I was presented with nothing but the following error:

CC_SE_CREATION_FAILURE

The Process

Avi Vantage appears to treat this as a retryable error, attempting to deploy a service engine five times; the attempts can be re-executed with a controller restart:

Oddly enough, vCenter doesn't report any OVA deploy attempts. The next thing to check here would be the content library:
So far, so good. vCenter knows where to deploy the image from.

Now here's a problem - Avi doesn't provide any documentation on how to troubleshoot this yet - so I did a bit of digging and found that you can bump yourself to root by performing a:

sudo su

Useful note: Avi Vantage is running bullseye/sid with only 821 packages listed by dpkg -l | wc -l. They did a pretty good job with pre-release cleanup, but there are still a few oddball packages in there. I'd give it a 9/10 - I'd like to see X11 not be installed, but am pleased to see only Python 3!

Avi's logs are located in:

/var/lib/avi/log
/opt/avi/log

Here's what I found in alert_notifications_debug.log:

summary: "Syslog for System Events occured"
event_pages: "EVENT_PAGE_VS"
event_pages: "EVENT_PAGE_CNTLR"
event_pages: "EVENT_PAGE_ALL"
obj_name: "avi_-Avi-se-rctbp"
tenant_uuid: "admin"
related uuids ['avi_-Avi-se-rctbp']
[2021-04-09 20:06:30,923] INFO [alert_engine.processAlertInstance:225] [uuid: ""
alert_config_uuid: "alertconfig-938cf267-e20d-4d8e-a50a-21f0f5a5b633"
timestamp: 1617998694.0 obj_uuid: "avi_-Avi-se-rctbp" threshold: 0 events { report_timestamp: 1617998694 obj_type: SEVM event_id: CC_SE_CREATION_FAILURE module: CLOUD_CONNECTOR internal: EVENT_EXTERNAL context: EVENT_CONTEXT_SYSTEM obj_uuid: "avi_-Avi-se-rctbp" obj_name: "avi_-Avi-se-rctbp" event_details { cc_se_vm_details { cc_id: "cloud-022c7b90-f987-4b15-91bb-1f1405715580" se_vm_uuid: "avi_-Avi-se-rctbp" error_string: "Transport node not found to create serviceengine avi_-Avi-se-rctbp" } } event_description: "Service Engine creation failure" event_pages: "EVENT_PAGE_VS" event_pages: "EVENT_PAGE_CNTLR" event_pages: "EVENT_PAGE_ALL" tenant_name: "" tenant: "admin" } reason: "threshold_exceeded" state: ALERT_STATE_ON related_uuids: "avi_-Avi-se-rctbp" level: ALERT_LOW name: "Syslog-System-Events-avi_-Avi-se-rctbp-1617998694.0-1617998694-45824571" summary: "Syslog for System Events occured" event_pages: "EVENT_PAGE_VS" event_pages: "EVENT_PAGE_CNTLR" event_pages: "EVENT_PAGE_ALL" obj_name: "avi_-Avi-se-rctbp" tenant_uuid: "admin"
From the looks of things - Avi is talking with NSX-T before vCenter to determine appropriate placement, which makes sense.

Update and Root Cause

With the Avi 20.1.6 release, VMware has made a lot of improvements to logging! We're now seeing this error in the GUI (ensure that "Internal Events" is checked):






Let's take a look at the new logging. Avi's controller leverages a series of Go modules called "cloud connectors", each dedicated to a specific interface. Each one has its own log file under /opt/avi/log/ with a name beginning with cc_:
2021-07-04T20:20:42.801Z        ERROR   vcenterlib/vcenter_utils.go:606 [10.66.0.202][avi-mgt-vni-10.7.80.0/24] object references is empty
2021-07-04T20:20:42.819Z        ERROR   vcenterlib/vcenter_utils.go:578 [10.66.0.202][avi-mgt-vni-10.7.80.0/24] object references is empty
2021-07-04T20:20:42.822Z        ERROR   vcenterlib/vcenter_se_lifecycle.go:432  [10.66.0.202][QH] [10.66.0.202] Network 'avi-mgt-vni-10.7.80.0/24' matching not found in Vcenter
2021-07-04T20:20:42.822Z        ERROR   vcenterlib/vcenter_se_lifecycle.go:891  [10.66.0.202] [10.66.0.202] Network 'avi-mgt-vni-10.7.80.0/24' matching not found in Vcenter
Now, this vn-segment does exist in vCenter, so I applied the "non-escaped shell character" instinct from years of Linux/Unix administration and renamed it to avi-mgt-vni-10.7.80.0_24.
Since we don't get a Redeploy (please VMware!) button, I restarted the controller and all SE deployments succeeded after that.

Sunday, June 20, 2021

World WiFi Day 2021!

World WiFi Day

We (human beings) have several weird superpowers, but the ability to communicate over vast distances has always fascinated me the most.

I've had the privilege of meeting some of the most truly capable pioneers in this field - but the reality here is that we're faced with a very unequal world.

Authors like William Gibson and Neal Stephenson have the right of it as well - and while we're not quite living in that dystopian future, technology can become a great equalizer.

So yeah - as telecommunications operators we have the responsibility to bridge this gap!

Learn More

I'm always surprised by how much there still is to learn, even in fields I feel like I already know. Here are a few learning approaches that will help you build out a good foundation for learning Wi-Fi (and more!):

  • Amateur Radio: It's cheesy, sure. I'm amazed at how much my grown-up self can now do with amateur radio - I've been licensed since the '90s (KL0NS) and the community is doing so much more now than ever. When I was very young, this was a good opportunity to learn the principles of radio outdoors.
    • In Anchorage, Alaska we have it pretty good - KL7AA provides license testing itself, and they sell the study book I used way back when
    • Everywhere else in the US, the test costs $15 and probably takes about 2 days to study for. There's no reason not to try it out and participate. Most of the study material is free, and we even get practice tests
    • ARRL provides good ways to participate
    • If you join a radio club they'll find different ways to exercise your brain, and they're usually pretty fun. In addition, you'll be helping maintain emergency communication networks.
  • Get Certified. Also pretty cheesy - I know how people feel about IT certs, but I'd still argue for them in this case. For Wi-Fi, the CWNP organization tends to serve the same role as the Linux Professional Institute - employers may not know about them, but it's really effective in terms of education.

Do More

Let's just cover some volunteer opportunities here - because there's no point in building skills if you don't use them:

  • World Wi-Fi Day
  • ITDRC These guys are really neat. The IT Disaster Resource Center leverages oldie-but-goodie enterprise telecom/IT equipment to provide disaster relief all over the continental US. Check out their deployment map!
  • Airheads Volunteer Corp. I know this is a vendor plug, but this approach is really cool if you can travel!
  • United Way

On Volunteering

One of the passive effects of these approaches - you'll get better as you go. Employers nearly always constrain your learning path to what they need at the moment, often to their own detriment. They may not know what they'll need you to do next year, COVID showed us that. Volunteering not only gives you an opportunity to help others but also passively improves your skills outside of the usual "corporate playbook".

Sunday, June 6, 2021

XML, JSON, YAML - Python data structures and visualization for infrastructure engineers

At some point, we can't "do it all" with one block of code. 

As developers, we need to store persistent data for a variety of reasons:

  • We want it for later execution (or to compare it to another result)
  • We're sick of storing variables in code. This matters a lot more in compiled languages than interpreted ones
  • We want the results to end up in some form of a deliverable report

Let's cover a computer science concept being used here - semaphores. Edsger Dijkstra coined this term from the Greek sema (sign) and phero (bearer) (you may remember him from OSPF's shortest-path algorithm) to solve Inter-Process Communication (IPC) issues.

To provide a reductionist example: process A and process B need to communicate somehow, but shouldn't access each other's memory (or, in the '60s, simply couldn't). To solve this problem, the developer needs a method of storing variables in a manner that is both efficient and consistently interpretable.

Dijkstra's example, in this case, was binary, and required a schema to interpret the same data - but was not specifically limited to single binary blocks. This specific need actually influenced one of the three data formats we're comparing here - which happens to be the oldest.
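
To make that idea concrete, here's a minimal illustrative sketch (not Dijkstra's binary construct) of two cooperating processes exchanging state through serialized data - the file name, keys, and values are all made up for illustration:

#!/usr/bin/python3

import json

def process_a(path):
    # "Process A": serialize some state to a location outside its own memory
    state = {"job": "deploy-se", "attempts": 5, "success": False}
    with open(path, "w") as handle:
        json.dump(state, handle)

def process_b(path):
    # "Process B": read and interpret that state without touching A's memory
    with open(path) as handle:
        state = json.load(handle)
    print(f"Job {state['job']} succeeded: {state['success']}")

if __name__ == "__main__":
    process_a("shared_state.json")
    process_b("shared_state.json")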

But which one do I use? TL;DR?

Spoiler alert - anyone working with automation MUST learn all three to be truly effective. My general guidance would be:

  • This is a personal preference, but I would highly recommend YAML for human inputs. It's extremely easy to write, and while I generally prefer JSON it's much easier to first write a document into YAML and then convert it. If you take user input or just want to get a big JSON document started, I'd do it this way.
    • YAML User input drivers can also parse JSON, making this an extremely flexible approach.
  • JSON is good for storing machine inputs/outputs. Because all typing is pretty explicit with JSON, json.dumps(dict, indent=4) is pretty handy for previewing what your code thinks structured data looks like. Technically this is possible with YAML, but conventions on, say, a string literal can be squishy.
    • YAML with name: True could be interpreted as (see the sketch after this list):
      • JSON of "name": true, indicating a Boolean value
      • JSON of "name": "True", indicating a String
    • Sure, this is oversimplified, and YAML can be explicitly typed, but generally YAML is awesome for its speed and low initial friction. If an engineer knows YAML really well (and writes their own classes for it), going all-YAML here is completely possible - but to me that's just too much work.
    • If you use it in the way I recommend, just learn to interpret JSON and use Python's JSON library natively, and remember json.dumps(dict, indent=4) for outputs. You'll pick it up in about half an hour and just passively get better over time.
  • Use XML if that's what the other code is using. Python's Element and ElementTree constructs are more nuanced than dictionaries, so a package like defusedxml is probably the best way to get started. There are a number of known security issues with XML parsing (which is why defusedxml exists), so using the basic XML libraries by themselves is ill-advised. xmltodict is pretty handy if you just want to convert it into another format.
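
Here's a quick sketch of the True/string ambiguity mentioned above, using the same ruamel.yaml library as the parser example later in this post - expected outputs are shown as comments:

#!/usr/bin/python3

import json

from ruamel.yaml import YAML

yaml = YAML(typ='safe')

# An unquoted True is parsed as a Boolean; a quoted "True" stays a string
print(json.dumps(yaml.load('name: True')))    # {"name": true}
print(json.dumps(yaml.load('name: "True"')))  # {"name": "True"}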

Note: JSON and XML both support Schema Validation, an important aspect of semaphores. YAML doesn't have a native function like this, but I have used Python's Cerberus module to accomplish the same thing.
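
As a hedged sketch of that approach, here's what schema-validating a YAML document with Cerberus can look like - the schema fields and sample document are made up for illustration:

#!/usr/bin/python3

from cerberus import Validator
from ruamel.yaml import YAML

# A made-up schema: 'name' must be a string, 'vlan' an integer in VLAN range
schema = {
    'name': {'type': 'string', 'required': True},
    'vlan': {'type': 'integer', 'min': 1, 'max': 4094},
}

document = YAML(typ='safe').load('name: uplink-a\nvlan: 300')

validator = Validator(schema)
print(validator.validate(document))  # True when the document conforms
print(validator.errors)              # {} - populated on validation failure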

YAML

YAML was initially released in 2001 and has gained recent popularity with projects like Ansible. YAML 1.2 was released in 2009 and is publicly maintained by the community, so it won't have industry bias (but also won't change as quickly). YAML writes a lot like Python, consuming a ton of whitespace and being particular about tags. Users either love or hate it - I typically only use it for human inputs and objects that are frequently peer-reviewed.

NOTE: one big upside to YAML with people processes is comment support. YAML supports comments, but JSON does not.

YAML is pretty easy to start using in Python. I'm a big fan of the ruamel.YAML library, which adds on some interesting capabilities when parsing human inputs. I've found a nifty way to parse using try/except blocks - making a parser that is supremely agnostic, ingesting JSON or YAML, as a string or a file!

---
message:
  items:
    item:
      "@tag": Blue
      "#text": Hello, World!
#!/usr/bin/python3

import json

from ruamel.yaml import YAML
from ruamel.yaml import scanner

# Load Definition Classes
yaml_input = YAML(typ='safe')
yaml_dict = {}

# Input can take a file first, but will fall back to YAML processing of a string
try:
    yaml_dict = yaml_input.load(open('example.yml', 'r'))
except FileNotFoundError:
    print('Not found as file, trying as a string...')
    yaml_dict = yaml_input.load('example.yml')
finally:
    print(json.dumps(yaml_dict, indent=4))

JSON

JSON was first standardized in 2006 (RFC 4627) and is currently maintained by the IETF. Python 3's printed dicts look a lot like JSON as well - making things pretty intuitive. In my experience, writing JSON by hand is pretty annoying because it's picky.

{
    "message": {
        "items": {
            "item": {
                "@tag": "Blue",
                "#text": "Hello, World!"
            }
        }
    }
}
#!/usr/bin/python3

import json

# Round-trip example.json through Python's json module and print it back out
with open('example.json', 'r') as file:
    print(json.dumps(json.loads(file.read())))

Typically, I'll just use json.dumps(dict, indent=4) on a live dict when I'm done with it - dumping it to a file. JSON is a well-defined standard and software support for it is excellent.
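
Here's a minimal sketch of that workflow - preview a live dict, then persist it to a file (the dict contents here are made up):

#!/usr/bin/python3

import json

result = {"hostname": "se-01", "healthy": True}

# Preview what the structured data looks like
print(json.dumps(result, indent=4))

# Dump it to a file as the deliverable
with open('result.json', 'w') as handle:
    json.dump(result, handle, indent=4)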

Due to its IETF bias, JSON's future seems to focus on the streaming/logging required for infrastructure management. JSON-serialized Syslog is a neat application here, as you can write it to a file as a single line, but also explode it for readability, infuriating grep users everywhere.

XML

XML is the oldest data language typically used for automation/data ingestion, and it really shows. XML was originally established by the W3C in 1998 and is used for many document types like Microsoft Office.

XML's document and W3C bias read very strongly. Older Java-oriented platforms like Jenkins CI heavily leverage XML for semaphores, document reporting, and configuration management. Strict validation (a document MUST be well-formed) is required for compiled languages to synergize well with the capabilities provided. XML also heavily uses HTML-style escaping and tagging approaches, making it familiar to some web developers.

XML has plenty of downsides. Crashing on invalid input is generally considered excessive or "Steve-Ballmer"-esque, making the language favorable for mission-critical applications where misinterpreted data MUST NOT be processed, and miserable everywhere else. For human inputs, it's pretty wordy, which impacts readability quite a bit.

Schemas

XML has two tiers of schema - Document Type Definition (DTD) and XML Schema. DTD is very similar to HTML DTDs and provides a method of validating that the language is correctly used. XML Schema definitions (XSD) provide typing and structures for validation and are the more commonly used tool.
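
For completeness, here's a hedged sketch of XSD validation against the example.xml document from the next section, assuming the third-party xmlschema package (pip install xmlschema) - this isn't part of this post's original toolchain, and the schema itself is written for illustration:

#!/usr/bin/python3

# Hedged sketch: validate example.xml (from the next section) against an
# inline XSD using the third-party xmlschema package.
import xmlschema

XSD = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="message">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="items">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="item">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:string">
                      <xs:attribute name="tag" type="xs:string"/>
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = xmlschema.XMLSchema(XSD)
print(schema.is_valid('example.xml'))  # True if the document conforms to the schema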

Python Example

XML leverages the Element and ElementTree constructs in Python instead of dicts. This is due to XML being capable of so much more, but it's still pretty easy to use:

XML Document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<message>
    <items>
        <item tag="Blue">Hello, World!</item>
    </items>
</message>
#!/usr/bin/python3

from defusedxml.ElementTree import parse
import xmltodict
import json

# ElementTree approach: walk the tree by index (message -> items -> item)
document = parse('example.xml').getroot()

print(document[0][0].text + " " + json.dumps(document[0][0].attrib))

# xmltodict approach: convert the whole document to a dict and dump it as JSON
file = open('example.xml', "r").read()
print(json.dumps(xmltodict.parse(file), indent=4))

After using both methods, I generally prefer using xmltodict for data processing - it lets me use a common language - Python lists and dicts - to process all data regardless of source, allowing me to focus more on the payload. We're really fortunate to have this fantastic F/OSS community enabling that!

Saturday, May 22, 2021

Troubleshooting with VMware NSX ALB/Avi Vantage

NSX Advanced Load Balancer - Logging and Troubleshooting Cheat Sheet

Get into the OS Shell (all elements)

sudo su

Controller Log Locations

Note: Everything in /var/lib/avi/logs is managed by Elasticsearch. I wouldn't mess with it.

Events published to the GUI: /var/lib/avi/logs/ALL-EVENTS/

The primary log directory for Avi Vantage Controllers is /opt/avi/log. As this feeds into Elasticsearch, they have file outputs for every severity level. An easy way to get data on a specific object would be to build a grep statement like this:

grep {{ regex }} /opt/avi/log/{{ target }}
  • alert_notifications_*: Summarized problems log. Events are in a json format!

Troubleshooting Deployment Failures

  • avi-nsx.*: Presumably for NSX-T integration; further investigation required.
  • cloudconnectorgo.*: Avi's cloud connector is pretty important given their architecture. This is where you can troubleshoot any issues getting a cloud turned up, or any initial provisioning issues.
  • vCenter*: vCenter write mode activity logs. Look here for SE deployment failures in a traditional vSphere cloud.

Service Engines

Troubleshooting

Checking the Routing Table

NSX ALB / Avi uses FRRouting (7.0 as of release 20.1) over network namespaces to achieve management/data plane separation and VRF-Lite. To access the data plane, you will need to change namespaces! Unlike NSX-T, this doesn't happen inside Docker namespaces. This means the following commands work on both platforms as root:

  • Show all VRF+Namespaces ip netns show
  • Send a one-shot command to the namespace: ip netns exec {{ namespace }} {{ command }} Example: ip netns exec avi_ns1 ip route show
  • Start a shell in the desired namespace: ip netns exec {{ namespace }} {{ shell }} Example: ip netns exec avi_ns1 bash

Once in the bash shell, all normal commands apply as if there were no namespace/VRF.

For more information on Linux Network Namespaces, here's a pretty good guide: https://www.opencloudblog.com/?p=42

Logging

All SE logging is contained in /var/lib/avi/log. Here are the significant log directories there:

  • IMPORTANT! bgp: This is where all the routing protocol namespace logging from FRRouting lands.
  • traffic: This one's pretty tough to parse, and it's better to use Avi's Elasticsearch instead.

Conclusion

Avi Vantage has a pretty solid logging schema, but is very much a growing product. These logs will eventually be exposed more fully to the GUI/API, but for now it's handy to grep away. I'll be updating this list as I find more.

Saturday, May 15, 2021

VMware NSX Advanced Load Balancer - Installation

Pre-Requisites

Before beginning the Avi installer, I configured the following in my environment:
  • Management Segment (NSX-T Overlay). This is set up with DHCP for quick automatic provisioning - no ephemeral addresses required
  • Data Segments (NSX-T Overlay). Avi will build direct routes to IPs in this network for vIP processing. I built 3 - 
    • Layer 2 Cloud (attached to Tier-1)
    • NSX Integrated (attached to Tier-1)
    • Layer 3 Cloud (attached to Tier-0)

Avi also supports automatic SE deployment - which means that automatic IP configuration is important. Avi supports SLAAC (IPv6) and DHCP (IPv4) for this purpose.

NSX-T is unsurprisingly symbiotic here. I have built a dedicated Tier-1 for NSX ALB, and we're going to provide DHCP services via the Tier-1 router. If this was a production deployment or a VVD-compliant SDDC, this should be performed with a DHCP relay. I just haven't set aside time to deploy DHCP/IPAM tools for reasons that are beyond me.

The following changes are performed on the Tier-1 Logical Router. This step is not required for external DHCP servers!

The following changes are to be performed on the Logical Segment. 
If production, DHCP relay is selectable from the following screen:


Installation

 Avi Controller

VMware provides a prepackaged OVA for the Vantage controller - and it's a pretty large appliance. 24 GB of memory and 8 vCPUs is a lot of resourcing for a home lab. There are no sizing options here.

Installation is pretty easy - once the OVA is deployed, I used my CI/CD pipeline and GitHub to deploy DNS updates and logged right into the installation wizard.

AVI version 20.1.5 has changed the installer approach from the above to this. When "No cloud setup" is selected, it still insists on configuring a new cloud. This isn't too much of a problem:
Note: This passphrase is for backups - make sure to store it somewhere safe!
From here, we configure vCenter's integration:


Let's ensure that Avi is connected to vCenter and has no issues. Note: VMware recommends write-mode for vCenter clouds.


After install, it's useful to get a little oriented. In the top left of the Avi Vantage GUI, you'll find the major configuration branches by expanding the triple-ellipsis menu. Get familiar with this part of the GUI - you'll be using it a lot!




Patching

Before we build anything, I prefer to load any applicable patches. This should help avoid software issues on deployment, and patching is usually simpler and lower impact if you have no configuration yet.

Avi Vantage really excels here - this upgrade process is pretty much fully automated, with extensive testing. As a result, it's probably going to be slower than "manual" upgrades, but is definitely more reliable. Our industry really needs to get over this - If you have a good way to keep an eye on things while keeping busy, you're ahead of the curve!

We'll hop on over to Administration -> Controller -> Software:


While this upgrade takes place - Avi's controller will serve up a "Sorry Page" indicating that it's not available yet - which is pretty neat.

When complete, you should see this:



Avi Clouds

Clouds are Avi's construct for deployment points - and we'll start with the more traditional one here - vCenter. Most of this has already been configured as part of the wizard above. Several things need to be finished for this to run well, however:

  • Service Engine Group - here we customize service engine settings
  • IPAM - Push IP address, get a cookie
SE Group changes are executed under Infrastructure -> SE Groups. Here I want to constrain the deployment to specific datastores and clusters.
IPAM is configured in two places. The first is Templates -> Profiles -> IPAM/DNS Profiles (a bindable profile):

The second is under Networks, where ranges are configured. If you configure a write-access cloud, it'll scan all of the port groups and in-use IP ranges for you. IP ranges and subnets will still need to be configured and/or confirmed:


Note: This IPAM profile does need to be added to the applicable cloud to leverage auto-allocate functions with vIPs.

Avi Service Engines

Now that the setup work is done, we can fire up the SE deployments by configuring a vIP. By default, Avi will conserve resources by deploying the minimum SEs required to get the job done - if there's no vIP, this means none. It takes some getting used to!
Once the vIP is applied, you should see some deployment jobs in vCenter:

Service engines take a while to deploy - don't get too obsessive if the deployment lags. There doesn't appear to be a whole lot of logging to indicate deployment stages, so the only option here is to wait it out. If a service engine doesn't deploy quite right, delete it. This is not the type of application we just hack until it works - I did notice that it occasionally will deploy with vNICs incorrectly configured.

From here, we can verify that all service engines are deployed. The health score will climb up over time if the deployment is successful.

Now we can build stuff! 


Sunday, May 9, 2021

Leveraging Hyperglass and NSX-T!

 For this example deployment, I'll be using my NSX-T Lab as the fabric, VyOS for the Overloaded Router role, and trying out hyperglass:



Installation (VyOS)

I already have a base image for VyOS with its management VRF set up - and updating the base image prior to deployment is a breeze due to the vSphere 7 VM Template Check Out Feature.

In this case, I'll deploy to an NSX-T External Port and peer up, with fully implemented ingress filtering:
Export Filters - Permit all prefixes:
Import Filters - don't trust any prefixes from this router:
Set in the correct directions:
Configure the BGP Neighbors:

From here, we build the VNF by adding the following configuration:
protocols {
    bgp 64932 {
        address-family {
            ipv4-unicast {
                maximum-paths {
                    ebgp 4
                }
            }
            ipv6-unicast {
                maximum-paths {
                    ebgp 4
                }
            }
        }
        neighbor 10.7.2.1 {
            remote-as 64902
        }
        neighbor 10.7.2.2 {
            remote-as 64902
        }
        neighbor x:x:x:dea::1 {
            address-family {
                ipv6-unicast {
                }
            }
            remote-as 64902
        }
        neighbor x:x:x:dea::2 {
            address-family {
                ipv6-unicast {
                }
            }
            remote-as 64902
        }
        timers {
            holdtime 12
            keepalive 4
        }
    }
}

Then, let's verify that BGP is working:


vyos@vyos-lg-01:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 10.7.2.254, local AS number 64932 vrf-id 0
BGP table version 156
RIB entries 75, using 14 KiB of memory
Peers 4, using 85 KiB of memory

Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
10.7.2.1             4      64902       278       272        0    0    0 00:11:31           40       42
10.7.2.2             4      64902        16        13        0    0    0 00:00:16           39       42
x:x:x:dea::1         4      64902       234       264        0    0    0 00:11:43        NoNeg
x:x:x:dea::2         4      64902       283       368        0    0    0 00:11:43        NoNeg

Total number of neighbors 4

The VNF is configured! Now, we'll follow the application maintainer's instructions for installation: https://hyperglass.io/docs/getting-started

The documentation for install is pretty good - but some customization is still required. I built out the following configuration files - hyperglass leverages YAML as its configuration file format (examples are here). I did make some changes:

  • Some combination of VyOS 1.4, MP-BGP, and/or VRF-lite changed the syntax for the BGP views around. Setting a commands file fixes this.
  • VyOS driver is appending a host mask (/32, /128) on routes with no prefix specified.
    • NB: I reached out to the maintainer (Matt Love) and he informed me that this was configurable per-VRF using the force-cidr option.
This particular tool has been extremely useful to me, as NSX-T still lacks comprehensive BGP visibility without CLI access - and even if it didn't, this will provide consumers an easy way to validate that prefixes have propagated, and where.

Sunday, May 2, 2021

PSA: PAN-OS Drops BGP peers with an invalid NLRI / Always filter inbound prefixes from Avi Vantage

If Avi Vantage IPAM cannot allocate an address for a new vIP, it will advertise an all-zeros host address - 0.0.0.0/32:


This will cause Palo Alto PAN-OS to restart the peer - even if the PAN-OS device is not the immediate downstream of the offending prefix. Palo Alto uses routed as their dynamic routing engine - so this is probably default behavior inherited from there:

**** EXCEPTION   0x4103 - 57   (0000) **** I:008e7cd1 F:00000004
qbmlpar2.c 1352 :at 20:54:21, 2 May 2021 (1822572648 ms)
UPDATE message contains NLRI of 0.0.0.0.

**** PROBLEM     0x4102 - 46   (0000) **** I:008e7cd1 F:00000004
qbnmmsg.c 1074 :at 20:54:21, 2 May 2021 (1822572648 ms)
NM has received an UPDATE message that failed to parse.
Entity index               = 1
Local address              = 10.6.64.9
Local port                 = 0
Remote address             = 10.6.64.12
Remote port                = 0
Scope ID                   = 0

**** EXCEPTION   0x4102 - 71   (0000) **** I:008e7cd1 F:00000020
qbnmsnd2.c 167 :at 20:54:21, 2 May 2021 (1822572648 ms)
A NOTIFICATION message is being sent to a neighbor due to an unexpected
problem.
NM entity index       = 1
Local address         = 10.6.64.9
Local port            = 0
Remote address        = 10.6.64.12
Remote port           = 0
Scope ID              = 0
Remote AS number      = 64905
Remote BGP ID         = 0X0A06400C
Error code            = UPDATE Message Error (3)
Error subcode         = Invalid Network Field (10)

This could cause a network outage for all subtending networks on this peer. Consider this a friendly reminder to always leverage route filtering between autonomous systems!

Unfortunately, strict import filters on PAN-OS did not resolve this issue.

NSX-T Edge Transport Node Packet Captures


NSX-T Edge nodes have a rudimentary packet capture tool built in to the box. It is important to have a built-in tool here, as GENEVE encapsulation will wrap just about everything coming out of a transport node.

NSX-T's CLI guide indicates the method for packet captures - from here we can break it down to a few steps:

  • Find the VRF you want to capture from
  • Find the interface in that VRF you want to capture from
  • Capture from this interface!
get logical-routers
vrf {{ desired VRF }}
get interfaces
set capture session 0 interface {{ interface-id }} direction dual
set capture session 0 file example.pcap

The result will be placed in:

/var/vmware/nsx/file-store/

I do have some notes to be aware of here:

  • Be careful with packet captures! This is on an all-CPU router - so isolating the device before capturing packets is a wise choice. We can do that with NSX-T, we just need to remember to.
  • It's possible to use tcpdump-based packet filters instead of a wholesale capture - just replace the final line with a command similar to this:
set capture session 0 file example.pcap expression port 179


Saturday, April 3, 2021

VMware NSX Advanced Load Balancer - Overview

Load Balancing is Important

Load balancing is an important aspect of network mobility.

How is a network useful if you can't move around within it?

  • Cellular networks lose their appeal if you drop connectivity every time you roam between towers
  • Wi-Fi networks are designed to facilitate smaller-scale movements. Imagine if you had to sit still for your Wi-Fi to work
Network movements also facilitate migrations between services - as a consumer of a network service, you experience frequent cutovers without your knowledge:
  • Infrastructure upgrades: Firewalls, routers, and switches constantly need to be bumped up to higher speeds and feeds
  • Preventing outages: Network "Maintenance Mode"

As computer networks get more complex - SDN is important for the orchestration of these changes or "movements". A distributed, off-box, dedicated management and control plane is essential to tracking "customers" in a scalable fashion - but load balancing is special here.

Most of our consumed services today leverage load balancers to "symmetrify" network traffic to accommodate nodes that cannot handle asymmetric flows. This can solve a lot of problems large enterprises have:

  • Need to scale firewalls past 2?
  • Need to scale firewalls in any public cloud?
  • Imperfect link balancing with ECMP hashing?
  • Want to prefer an ISP over another, but use both?
These problems are all solvable by the right load balancer platform - but are infrastructure specific. Load balancers often solve application-specific problems, including:
  • HTTP Transforms
  • TLS Quality Enforcement / Consolidated Stack
  • "Diet" Acceleration, e.g. HTTP Compression

Stateless apps work perfectly without some form of load balancer/ingress controller but still benefit greatly from a discrete point to ingest data as well.

NSX Advanced Load Balancer Differentiating Points

N.B. I will probably revise this in a later post as I get more familiar with Avi Vantage

Avi Networks was founded in 2012 with the goal of providing a software load balancer designed from the ground up to leverage Software-Defined Networking (SDN) capabilities. Every aspect of the platform's design appears to reflect this - the company clearly wanted to build a totally new platform without any need to maintain legacy products. In 2019, VMware acquired Avi Networks and is rebranding the platform as "NSX Advanced Load Balancer".

Here are some clear differentiating points I have found with the platform so far:
  • Enterprise (Web) Oriented - Some load balancing platforms, like Kemp Technologies and Loadbalancer.org focus on clear, common enterprise needs and executing as effectively as possible; instead of "boiling the ocean" with a more feature-complete platform. If this is you as a customer, you can expect significant cost and quality improvements due to this more narrow focus - but Service Providers and specialty customers may be turned off by this.
  • This product is designed for self-service, with robust management plane multi-tenancy
  • This is a VMware product, so Avi is diving head-first into providing high-quality Kubernetes support
  • Offloaded Control Plane: So far, this is a big one for me personally. I'm continually amazed as to how much rich data can be extracted simply by offloading telemetry processing to a controller. Logging and Analytics do not impact data plane performance and have minimal impact on sizing/costs due to per Service Engine licensing
  • Software-only Kitchen Sink: Few load balancing platforms can support all clouds, KVM, K8s, Cisco ACI, Mesosphere, Acropolis, and OpenStack with direct support. Usually, the best we can hope for with a KVM install is an ISO and a prayer. This is refreshing.
  • Support for dynamic routing: The vast majority of load balancers on the market don't natively support this, and specific implementations like anycast or multi-site load balancing stand to benefit from this particular feature.
  • Global Server Load Balancing (GSLB) allows an engineer to control which site traffic may route to with anycast DNS. This provides them the ability to perform application-level capacity management with multiple sites in one solution.

Design Elements

Controller

This is Avi's brain and the primary reason for using a platform like Vantage - the control and management planes are, by default, managed by an offboard controller. The following functions are available here, with no performance penalty to the data plane:
  • Central Configuration Management, all locations, all the time.
    • Configure BGP once
    • Configure routes once
    • Configure vIPs once
    • Configure hardening (logging, TLS settings, passwords) once
  • Monitoring of vIPs - if a service is down, relocate it
  • Software Lifecycle Management
  • IP Address Management
  • Periodic monitoring for common issues
  • Per Virtual Service extensive Analytics (Avi Enterprise only). They are running ElasticSearch on-box to achieve this, it's pretty neat.
NB: Avi Release 20.1.4 has <900 Debian packages (based on bullseye/sid), so they are running a little lean but could do more cleanup. 20.1.5 is down to 820 - so they are working on this.

Service Engine

Generally, these components do the actual work. Structurally, these appliances run Debian bullseye/sid with the load balancer processes packaged as Docker images. They're running the same edition of FRRouting as NSX-T - and approximately the same OS build.

Service Engines do:
  • Report in to the AVI controller
  • Perform actual load balancing functions
NB: Avi Release 20.1.5 is much leaner than prior releases, and SEs typically have a much more compressed install base. 515 Debian packages here - almost in line with NSX-T 3.1.2!

IPv6

  • The AVI Controller UI and vCenter/NSX-T interactions have hard-coded IPv4 constructs. 20.1.5 introduces preliminary support for IPv6, and VMware's NSBU is usually ahead of everyone else here. I'll be testing vCenter + IPv6 in a later post.
  • AVI Controllers appear to pick up an IPv6 address via SLAAC
  • This platform appears to have full data-plane support.

Deployment Methodology

Management/Control Plane

No orchestrator pre-sets will be used here - per the Avi NSX-T Integration Guide. The primary reason for my doing this is as a more thorough test of this platform - I'll be deploying 3 "Clouds":
  • Layer 2 Cloud (Typical A/P Load Balancer Deployment)
  • Layer 3 Cloud (MP-BGP Load Balancer Deployment)
  • NSX-T Cloud (NSX-T Integrated Deployment)
Avi Vantage designates any grouping of infrastructure presets as a "Cloud", which can have its own tenancy and routing table. This construct allows us to allocate multiple infrastructures to each administrative tenant or customer. This access is decoupled from "Tenant", which is the parent for RBAC.

Data Plane Topologies

The Avi Vantage VCF Design Guide 4.1 indicates that service engines should be attached to a tier-1 router as an overlay segment. The primary reason for this has to do with NSX-T and Avi's integration - in short, the Avi controller invokes the NSX-T API to add and advertise static routes to each service engine to handle dynamic advertisement.
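
To make that mechanism concrete, here's a hedged, illustrative sketch (this is not Avi's actual integration code) of the kind of NSX-T Policy API call involved - pinning a host route for a vIP toward a Service Engine's interface on the Tier-1. The manager hostname, credentials, IDs, and addresses are all made up:

#!/usr/bin/python3

# Illustration only: add a static route on a Tier-1 via the NSX-T Policy API,
# the same general mechanism the Avi/NSX-T integration uses for vIP reachability.
import requests

NSX_MANAGER = "https://nsx.lab.local"   # made-up manager address
TIER1_ID = "avi-tier-1"                 # made-up Tier-1 ID

route = {
    "network": "10.7.80.100/32",                   # the vIP being advertised
    "next_hops": [{"ip_address": "10.7.80.11"}],   # the Service Engine data NIC
}

resp = requests.patch(
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/{TIER1_ID}/static-routes/vip-10-7-80-100",
    json=route,
    auth=("admin", "password"),
    verify=False,  # lab only - validate the NSX certificate in production
)
resp.raise_for_status()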






Monday, March 22, 2021

Design Pattern: Looking Glasses

It's probably safe to say that service provider networking is pretty unique.

One particular design pattern - Looking Glasses - is extremely useful for complex, dynamically routed networks.

I'd really like to shift the gatekeeping needle here - any network that meets the following criteria is complex enough to benefit from a looking glass:
  • >100 Routing table entries globally
  • Some vague preference towards reliability
  • Dynamic Routing (BGP is preferred)
In any small to medium enterprise, I'd posit that the only thing truly preventing benefits, in this case, is the lack of dynamic routing adoption, primarily because pre-packaged offerings in this range don't have an "easy button" for implementing them. This lack of accessibility causes a real problem with SMB networking, as reliability features stay out of their reach.

Design Pattern: Looking Glass

A Network "Looking Glass" is a type of web server that responds to user requests, providing externalized (without userspace access to network equipment) to an authenticated or unauthenticated client. This allows clients to view BGP meta-data, routing tables to ensure outbound advertisements between Service Providers have propagated. 

Here's my starting point for this design pattern.

History (non-inclusive)

Note: I don't have everything here. It seems most Looking Glasses were stood up silently by telecommunications companies. They're searchable, but I can't find any citable data on when they started out.

Form

  • Least (Zero) Privilege Access to a network services routing table, searchable via API and/or GUI

Forces

Of these forces, #1 is probably the biggest. Since we cannot force all of the networking industry titans (yet) to provide a permission set that will facilitate this use - I'd propose the following approach:
In this solution, I'm proposing some additional safeguards/scale-guards to make sure that the approach will not be harmful to a "host" network. In addition to implementing the looking glass, I'd propose the deployment of a series of Virtual Network Functions (VNFs) scaled out with monitored routing tables. This is where the collectors would interact - if the physical network doesn't allow any inbound prefixes from the VNF, it's easy enough to build a solution to safely collect from it. There are tons of VNF options here - as we only need BGP capability and a collection method.
