Figures 3 and 4 show the average CPU utilizations on the sender and receiver, respectively, for the bandwidth tests reported in Section 3.1. Some of the results vary noticeably, for several reasons. On the receiver, the variance may stem in part from incomplete support for page coloring in the Alpha FreeBSD port; the runs show a bimodal distribution on the Alpha receiver configurations when copying is used. In the zero-copy sender results, irregularities arise when netperf reuses a send buffer page before the driver has determined that the previous transmit from that buffer is complete. The Trapeze driver detects this case and conservatively copies the page, even though netperf never actually stores to the buffer in this experiment; if the process did store to the page, a copy-on-write would result. These sender-side zero-copy optimizations trigger with varying probabilities on different configurations, affected by CPU speed, the process send buffer size, and the Trapeze ring size, since Trapeze suppresses transmit-complete notices until a send ring entry is reused. Some step behavior results from the TCP implementation selecting packet sizes that are integral multiples of the page size for odd MTUs; these effects are less pronounced on the Intel platforms, which use 4KB rather than 8KB pages. The numbers presented here are averages of 20 runs.
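The conservative-copy decision described above can be sketched as follows. This is a minimal illustration of the logic, not the Trapeze driver itself (which is in-kernel C); all names here (`SendRingEntry`, `queue_transmit`, `tx_complete`) are hypothetical.

```python
# Hedged sketch of the sender-side zero-copy decision described above.
# All names are hypothetical; the real Trapeze driver is in-kernel C.

class SendRingEntry:
    def __init__(self):
        self.tx_complete = True   # no transmit outstanding yet
        self.page = None

def queue_transmit(ring, slot, page):
    """Queue `page` for zero-copy transmit on ring entry `slot`.

    Transmit-complete notices are suppressed until a ring entry is
    reused, so the driver learns that the previous send finished only
    when the entry comes around again. Returns True if it had to copy."""
    entry = ring[slot]
    if not entry.tx_complete:
        # The previous transmit may still reference the page:
        # conservatively copy it so the application can reuse it safely.
        page = bytes(page)        # stand-in for a page copy
        copied = True
    else:
        copied = False
    entry.page = page
    entry.tx_complete = False     # new transmit now outstanding
    return copied
```

Under this sketch, reusing a ring slot whose transmit has not been confirmed complete forces the copy, which is why the optimization triggers with probabilities that depend on ring size and buffer reuse patterns.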
On the receiving side, all graphs show a trend of declining CPU utilization with large MTUs, and much lower CPU utilization when data-movement costs such as copying and checksumming are eliminated. The downward trend with larger MTUs is most pronounced on the faster platforms, since bandwidth on older platforms such as 440LX is initially limited by the CPU; there, reduced overheads yield higher bandwidth rather than lower CPU utilization. Similarly, CPU utilizations initially increase with larger MTUs on the sending side, because larger MTUs allow higher bandwidth at the receiver, driving the sender to transmit faster. Once peak bandwidth is attained, CPU utilization begins to drop with increasing MTU. The graphs also show that CPU costs are higher on the receiving side than on the sending side, primarily because interrupt overheads are lower at the sender.
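These trends follow from a simple first-order cost model (our own simplification, not a model from this work): host CPU load is roughly a per-byte term plus a per-packet term, so at a fixed bandwidth the per-packet contribution shrinks as the MTU grows. The cost constants below are made-up placeholders, not measured values.

```python
# Illustrative first-order model of host CPU load (an assumption, not
# measured data): busy time per second is a per-byte cost on the data
# stream plus a per-packet cost on the packet arrival rate.

def cpu_utilization(bandwidth_bps, mtu_bytes,
                    per_byte_us=0.005, per_packet_us=10.0):
    """Estimated CPU utilization (fraction of one CPU) at the given
    bandwidth and MTU. Cost constants are hypothetical placeholders."""
    bytes_per_sec = bandwidth_bps / 8
    packets_per_sec = bytes_per_sec / mtu_bytes
    busy_us = bytes_per_sec * per_byte_us + packets_per_sec * per_packet_us
    return busy_us / 1e6

# At a fixed 400 Mb/s, larger MTUs cut only the per-packet term:
for mtu in (1500, 9000, 16384):
    print(mtu, round(cpu_utilization(400e6, mtu), 2))
```

The model also captures the CPU-limited case: when the predicted utilization exceeds 1.0, the host cannot drive the offered bandwidth, and overhead reductions show up as higher bandwidth rather than lower utilization, as on the 440LX.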
These graphs show that copying and checksumming optimizations are extremely important even on the platforms that are capable of achieving peak bandwidth without them. Any reduction in overhead translates directly into lower CPU utilization, leaving more cycles available for application processing at a given bandwidth. Note also that disabling checksumming yields little benefit on the Monet receiver because of checksum offloading: the small incremental CPU cost is due to checksumming the headers in the driver.
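The residual header cost on the Monet receiver is a computation over only a few tens of bytes. As a concrete illustration (our sketch, not the Trapeze driver's code), the standard 16-bit one's-complement Internet checksum that remains in software under NIC payload offload looks like this:

```python
# Standard Internet checksum (RFC 1071) over a byte string. With payload
# checksumming offloaded to the NIC, the host runs this only over the
# TCP/IP headers (roughly 40 bytes), which is the small incremental
# CPU cost noted above. This is an illustrative sketch, not driver code.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

Running this over 40 header bytes instead of an 8KB payload reduces the software checksum work by roughly two orders of magnitude per packet, consistent with the small incremental cost visible on the Monet receiver.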
The receiver utilizations in Figure 3 again reinforce the importance of the Jumbo Frames standard promulgated by Alteon and Microsoft, which would increase the Gigabit Ethernet MTU to 9000 bytes. The Intel receiver CPUs are saturated at 1500-byte MTUs, showing that the bandwidth limitation near 300 Mb/s for standard Ethernet MTUs is due to receiver CPU saturation from the overhead of handling a larger number of packets. Slightly higher 1500-byte bandwidths are achieved on the Alphas due to the faster host CPUs: on Miata the 1500-byte bandwidth is limited at 313 Mb/s by saturation of the CPU on the LANai-4 NIC, while the faster LANai-5 NIC delivers bandwidths closer to 410 Mb/s before saturating. However, at this speed packet handling overheads push the Monet's Alpha 21264 host CPU to 90% utilization, even while it is driving less than half of the link speed. The Monet's receiver utilization drops below 30% with 8KB packet sizes when zero-copy and checksum offloading are enabled, even as the delivered bandwidth more than doubles to 956 Mb/s.
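The per-packet pressure behind these numbers is simple arithmetic (our own back-of-the-envelope calculation, ignoring framing overhead): at about 300 Mb/s, a receiver must handle roughly 25,000 frames per second at the standard 1500-byte MTU, while 9000-byte Jumbo Frames cut the rate sixfold at the same bandwidth.

```python
# Packet arrival rates implied by the bandwidths discussed above.
# Approximation: treats the MTU as the full bytes-on-the-wire per frame,
# ignoring link-layer framing overhead.

def packets_per_second(bandwidth_mbps: float, mtu_bytes: int) -> float:
    return bandwidth_mbps * 1e6 / 8 / mtu_bytes

print(round(packets_per_second(300, 1500)))   # standard Ethernet MTU
print(round(packets_per_second(300, 9000)))   # Jumbo Frames, same bandwidth
```

Since interrupt and protocol costs are paid largely per packet, a sixfold drop in packet rate is what lets the same hosts reach far higher bandwidths at far lower utilization.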
Looking further, the results show that the 9000-byte MTU of the Jumbo Frames standard is sufficient to achieve near-peak bandwidth on all platforms. However, Figure 3 shows that where other host overheads are present, per-packet overheads can constrain peak bandwidth even with Jumbo Frames. The 440BX platform does not attain its peak bandwidth until the MTU reaches 16KB, and does not reach its minimal CPU utilization until the MTU reaches 57KB. Receiver CPU utilization on this platform drops from 88% to 48% as the MTU grows from 8KB to 16KB.