Achieving Awesome Single-Stream Performance Over Bonded Ethernet

Ethernet Link Aggregation (aka PortChannel, EtherChannel, Ethernet bonding, NIC teaming, trunking, link bundling, Smartgroup, Ethertrunk, etc.) is a way to combine multiple Ethernet links into a single logical link. This improves redundancy and increases aggregate performance. It is good stuff, especially if you are into High Availability.

Here is a typical setup:

A good start. If you can spare the ports and cables to make this work, you should do it. You get the benefit of very fast fail-over between links if one dies, and you get the aggregated bandwidth. However, you probably will not get the combined bandwidth over a **single** TCP stream. The reason is that the switch almost certainly does not round-robin Ethernet frames across the bundle; it hashes each flow onto one member link. Switches usually have a configurable algorithm for distributing frames over bonded links. Here are some options on my switch:

10g-nexus(config)# port-channel load-balance ethernet ?
destination-ip Destination IP address
destination-mac Destination MAC address
destination-port Destination TCP/UDP port
source-dest-ip Source & Destination IP address
source-dest-mac Source & Destination MAC address
source-dest-port Source & Destination TCP/UDP port
source-ip Source IP address
source-mac Source MAC address
source-port Source TCP/UDP port

Yeah, no round robin. As noted in http://www.kernel.org/doc/Documentation/networking/bonding.txt:

Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic for a particular connection flowing through the switch to a balance-rr bond will not utilize greater than one interface’s worth of bandwidth.

If you do know of a switch that supports striping frames across a bundle, please contact me. But! There is a way to work around this limitation and achieve Awesome performance! It costs an extra switch, but the benefits are doubled single-stream performance and the ability to withstand a switch failure. Here is a crappy diagram:

This is what is suggested in Section 12 of http://www.kernel.org/doc/Documentation/networking/bonding.txt.

With this setup, each switch makes its own decision about where to send a frame, but it only has one choice: the single link it has toward the destination server. Whatever aggregation algorithm the switches use is effectively irrelevant; what matters is the algorithm on the sending server (round robin, in my case).
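On the Linux side, that just means a balance-rr bond. As a minimal sketch (not a drop-in config: the eth2/eth3 interface names and the 192.168.0.2 address are placeholders for whatever your servers actually have), the sysfs method from section 3.4 of bonding.txt looks roughly like this:

modprobe bonding
echo balance-rr > /sys/class/net/bond0/bonding/mode     # the sender-side algorithm is what matters
echo 100 > /sys/class/net/bond0/bonding/miimon          # check link state every 100 ms
ip addr add 192.168.0.2/24 dev bond0
ip link set bond0 up
ip link set eth2 down                                   # slaves must be down before enslaving
ip link set eth3 down
echo +eth2 > /sys/class/net/bond0/bonding/slaves        # one NIC cabled to each switch
echo +eth3 > /sys/class/net/bond0/bonding/slaves

Your distro will have its own persistent way of doing this (ifenslave, ifcfg files, etc.), but the key setting is balance-rr mode on the servers.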

The end result gives you some key benefits:

  • Resiliency for individual links (failure of cables / optics / NICs)
  • Resiliency for switches (failure of a switch, power supply, etc.)
  • Better than single-link performance on single streams
  • iperf bragging rights

Proof is in the pudding:

root@server1:~# iperf -c server2
------------------------------------------------------------
Client connecting to server2, TCP port 5001
TCP window size: 27.9 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.2 port 38576 connected with 192.168.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 21.1 GBytes 18.1 Gbits/sec
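If you want to verify that both links are really carrying the stream (and that the resiliency claims above hold up), a quick check along these lines should do it; eth2/eth3 are again placeholder slave names:

cat /proc/net/bonding/bond0    # shows the bonding mode and the state of each slave
ip -s link show eth2           # per-NIC counters; both slaves should climb during an iperf run
ip -s link show eth3
ip link set eth2 down          # simulate a link (or switch) failure mid-transfer;
                               # the stream should keep going at single-link speed
ip link set eth2 up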

Exact Components in this setup (if you want to reproduce my work)

In the end it breaks down to about $2000 per 20G bond (taking into account switch ports, NICs, cables, etc.). If your speed and availability requirements can justify that cost, then it is a cool setup.