NSX-T Edge Multi-TEP Flow-Cache Bug

Steven Schramm

23 September 2020

Reading time: 4 min

Like NSX-V, NSX-T supports Multi-TEP. Initially, however, this feature was limited to ESXi hosts and was not supported for Edge Nodes. Since NSX-T 2.4, Multi-TEP is supported for Edge Nodes as well, which improves North-South network performance.
In NSX-T 2.4, Multi-TEP on Edge Nodes was supported but not yet recommended by the reference design guide. The recommendation was added when the reference design guide was updated for NSX-T 2.5. Since then, the recommended deployment is a single NVDS for both VLAN and overlay transport zones, combined with Multi-TEP for ESXi hosts and Edge Nodes.

All NSX-T versions released at the time of writing contain a known bug that affects exactly this deployment model, and under certain circumstances it can reduce the availability of your environment. At evoila GmbH we have carried out many NSX-T deployments and have been affected by this bug only once. This article provides more information about it.

Single NVDS and Multi-TEP

To understand the deployment model described here, it is important to know how Edge Nodes are connected to the network. Each Edge Node has four network adapters, which are assigned as follows:

Network adapter 1: Management (eth0)
Network adapter 2: Overlay/VLAN Uplink (fp-eth0)
Network adapter 3: Overlay/VLAN Uplink (fp-eth1)
Network adapter 4: Optional (fp-eth2). Necessary only if overlay and VLAN uplinks are separated and more than one VLAN uplink is required, for example for BGP.

In the single NVDS and Multi-TEP deployment model, the first network adapter (eth0) is used for the management network of the Edges, while the second (fp-eth0) and third (fp-eth1) network adapters are shared by the overlay and the VLAN uplinks. The adapters fp-eth0 and fp-eth1 can be used in an active/active or an active/standby configuration. For active/active, the teaming policy “LoadBalance Source” has to be configured; if you prefer active/standby, the right teaming policy is “Failover Order”.
With Multi-TEP you can configure different teaming policies for the overlay traffic and for the VLAN uplinks. All overlay segments use the default teaming policy, which is typically set to “LoadBalance Source”. In addition, you can create named teaming policies that define which network adapter each VLAN uplink should prefer. Each named teaming policy can be configured for either “LoadBalance Source” or “Failover Order”. Named teaming policies come into play when BGP is used for dynamic routing.
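
As a rough illustration of how a default teaming policy and named teaming policies fit together, the following Python sketch builds an uplink profile as a plain dictionary. The field names are modelled on the uplink profile object of the NSX-T Manager API as I understand it, and the uplink names, policy names and VLAN ID are hypothetical. Treat it as a sketch and check it against the API reference for your NSX-T version before using it.

```python
import json


def as_uplink_list(names):
    """Turn uplink names into the list-of-objects shape used in the profile."""
    return [{"uplink_name": name, "uplink_type": "PNIC"} for name in names]


def build_uplink_profile(display_name, transport_vlan):
    """Build an illustrative uplink profile for a Multi-TEP Edge Node.

    All names and values are assumptions for illustration only.
    """
    return {
        "resource_type": "UplinkHostSwitchProfile",
        "display_name": display_name,
        "transport_vlan": transport_vlan,
        # Default teaming: used by all overlay segments (active/active).
        "teaming": {
            "policy": "LOADBALANCE_SRCID",
            "active_list": as_uplink_list(["uplink-1", "uplink-2"]),
        },
        # Named teamings: pin each VLAN uplink (e.g. for BGP) to one adapter.
        "named_teamings": [
            {
                "name": "vlan-uplink-1-only",
                "policy": "FAILOVER_ORDER",
                "active_list": as_uplink_list(["uplink-1"]),
            },
            {
                "name": "vlan-uplink-2-only",
                "policy": "FAILOVER_ORDER",
                "active_list": as_uplink_list(["uplink-2"]),
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_uplink_profile("edge-multi-tep", 100), indent=2))
```

The default teaming carries both uplinks as active, while each named teaming lists exactly one active uplink so that a VLAN uplink segment referencing it always prefers the same adapter.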

Each overlay segment/logical switch is pinned to a specific tunnel endpoint (TEP) IP, either TEP IP1 or TEP IP2. Each TEP uses a different uplink: for instance, TEP IP1 uses Uplink1 (network adapter 2), which is mapped to pNIC P1, and TEP IP2 uses Uplink2 (network adapter 3), which is mapped to pNIC P2.
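
The intended mapping can be pictured with a small Python model. The TEP addresses are documentation-range placeholders, and the modulo-based selection is only a stand-in for whatever pinning logic NSX-T really uses. The point is that a given segment always resolves to the same TEP, uplink and pNIC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tep:
    ip: str
    uplink: str
    pnic: str


# Hypothetical TEPs of one Edge Node; the addresses are placeholders.
TEPS = (
    Tep(ip="192.0.2.11", uplink="Uplink1", pnic="P1"),
    Tep(ip="192.0.2.12", uplink="Uplink2", pnic="P2"),
)


def pin_segment(segment_id, teps=TEPS):
    """Pin a segment to one TEP.

    The modulo is an illustrative assumption, not the real NSX-T algorithm.
    """
    return teps[segment_id % len(teps)]


if __name__ == "__main__":
    for segment_id in (5000, 5001, 5002):
        tep = pin_segment(segment_id)
        print(segment_id, tep.ip, tep.uplink, tep.pnic)
```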

Description of the bug

Because of a bug in the flow-cache module inside the Edge Nodes, it is not guaranteed that TEP IP1 uses Uplink1 and TEP IP2 uses Uplink2. Instead, traffic from any TEP IP may leave over any of the available uplinks. For example, a packet from TEP IP1 carries the MAC address of Uplink1 but leaves the Edge over Uplink2, and therefore over pNIC P2 instead of pNIC P1. As a result, some network components in the underlay drop these packets because they use security features such as “MAC Spoofing Detection”. I have already encountered this issue with NSX-T in combination with Cisco ACI.
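
To make the failure mode concrete, here is a deliberately simplified Python model of an underlay switch that binds each source MAC address to the first port it was seen on and drops frames that show up with that MAC on another port. Real implementations are considerably more involved; the class, the MAC address and the port names are hypothetical.

```python
class UnderlaySwitch:
    """Toy switch with a naive MAC spoofing check (illustrative only)."""

    def __init__(self):
        self.mac_to_port = {}

    def receive(self, src_mac, port):
        # Bind the MAC to the first port it appears on.
        known_port = self.mac_to_port.setdefault(src_mac, port)
        # The same MAC on a different port is treated as spoofed.
        return "forward" if known_port == port else "drop"


if __name__ == "__main__":
    switch = UnderlaySwitch()
    uplink1_mac = "00:50:56:00:00:01"  # placeholder MAC of Uplink1
    # Expected behaviour: TEP IP1 traffic leaves via pNIC P1.
    print(switch.receive(uplink1_mac, "P1"))
    # Buggy behaviour: same source MAC, but the frame leaves via pNIC P2.
    print(switch.receive(uplink1_mac, "P2"))
```

In this model the first frame is forwarded and the second is dropped, which mirrors what happens when the Edge sends a frame with the MAC address of Uplink1 out of pNIC P2.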

Workaround

A workaround is to disable the flow-cache module on the Edge Node. Before doing so, keep in mind that this affects network performance for every packet that passes through the Edge Nodes. The flow-cache module keeps track of every forwarding decision and stores these decisions in a table. If the flow-cache module is disabled, every forwarding decision consumes CPU instead of reusing a previously computed decision.
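
The performance trade-off can be sketched with a toy forwarder in Python. It is a conceptual model rather than the Edge datapath itself: the class, the route table and the flow tuple are assumptions, and the counter simply records how often the expensive lookup runs.

```python
class EdgeForwarder:
    """Toy forwarder with an optional flow cache (hypothetical model)."""

    def __init__(self, routes, cache_enabled=True):
        self.routes = routes
        self.cache_enabled = cache_enabled
        self.flow_cache = {}
        self.slow_path_lookups = 0

    def _slow_path(self, flow):
        # Stands in for the full, CPU-intensive forwarding decision.
        self.slow_path_lookups += 1
        _src, dst = flow
        return self.routes.get(dst, "drop")

    def forward(self, flow):
        if not self.cache_enabled:
            return self._slow_path(flow)
        if flow not in self.flow_cache:
            self.flow_cache[flow] = self._slow_path(flow)
        return self.flow_cache[flow]


def count_lookups(cache_enabled, packets=1000):
    """Send the same flow repeatedly and count the slow-path lookups."""
    forwarder = EdgeForwarder({"10.0.0.5": "Uplink1"}, cache_enabled)
    for _ in range(packets):
        forwarder.forward(("172.16.0.9", "10.0.0.5"))
    return forwarder.slow_path_lookups


if __name__ == "__main__":
    print("cache enabled: ", count_lookups(True))
    print("cache disabled:", count_lookups(False))
```

With the cache enabled, the slow path runs once per flow; with it disabled, it runs once per packet, which is where the additional CPU load comes from.
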
After evaluating these consequences, you can disable the flow-cache module as follows.