Next: Performance Up: Network I/O with Trapeze Previous: Trapeze Messaging and NetRPC

Balancing Latency and Bandwidth


From a network perspective, storage access presents challenges that are different from other driving applications of high-speed networks, such as parallel computing or streaming media. While small-message latency is important, server throughput and client I/O stall times are determined primarily by the latency and bandwidth of messages carrying file blocks or memory pages in the 4 KB to 16 KB range. The relative importance of latency and bandwidth varies with workload. A client issuing unpredicted fetch requests requires low latency; other clients may be bandwidth-limited due to multithreading, prefetching, or write-behind.

Reconciling these conflicting demands requires careful attention to data movement through the messaging system and network interface. One way to achieve high bandwidth is to use large transfers, reducing per-transfer overheads. On the other hand, a key technique for achieving low latency for large packets is to fragment each message and pipeline the fragments through the network, overlapping transfers on the network links and I/O buses [16,14]. Since it is not possible to do both at once, systems must select which strategy to use. Table 1 shows the effect of this choice on Trapeze latency and bandwidth for 8KB payloads, which are typical of block I/O traffic. The first two columns show measured one-way latency and bandwidth using fixed-size 1408-byte DMA transfers and 8KB store-and-forward transfers. These experiments use raw Trapeze messaging over LANai-4 Myrinet NICs with firmware configured for each DMA policy. Fixed pipelining reduces latency by up to 45% relative to store-and-forward DMA through the NIC, but the resulting per-transfer overheads on the NIC and I/O bus reduce delivered bandwidth by up to 30%.
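The tradeoff described above can be illustrated with a minimal sketch of a three-stage path (sender I/O bus, network link, receiver I/O bus). The per-transfer overhead and per-stage rates used here are hypothetical illustration values, not the measured Trapeze figures from Table 1; only the 8KB payload and the 1408-byte fragment size come from the text. The function names are likewise assumptions made for this example.

```python
def pipeline_latency(payload, frag, stages):
    """One-way latency (in microseconds) of one message split into
    fragments of at most `frag` bytes and pipelined through `stages`.

    stages: list of (overhead_us, bytes_per_us) tuples, one per stage.
    All stage parameters are hypothetical, for illustration only.
    """
    sizes = [frag] * (payload // frag)
    if payload % frag:
        sizes.append(payload % frag)
    # done[i] holds the time fragment i finished at the previous stage.
    done = [0.0] * len(sizes)
    for overhead, rate in stages:
        free_at = 0.0  # time at which this stage becomes idle
        for i, size in enumerate(sizes):
            start = max(done[i], free_at)
            free_at = start + overhead + size / rate
            done[i] = free_at
    return done[-1]


def stream_bandwidth(payload, frag, stages):
    """Steady-state bandwidth (bytes per microsecond) for a stream of
    messages, limited by the stage that is busy longest per message."""
    n = -(-payload // frag)  # number of fragments (ceiling division)
    busiest = max(n * overhead + payload / rate for overhead, rate in stages)
    return payload / busiest


if __name__ == "__main__":
    # Assumed: 5 us overhead per transfer, 100 bytes/us at every stage.
    stages = [(5.0, 100.0)] * 3
    for label, frag in (("store-and-forward", 8192), ("fixed 1408-byte", 1408)):
        print(label,
              round(pipeline_latency(8192, frag, stages), 2), "us,",
              round(stream_bandwidth(8192, frag, stages), 2), "bytes/us")
```

Under these assumed parameters the fragmented transfer finishes sooner because the three stages overlap, while the stream bandwidth drops because every fragment pays the per-transfer overhead at each stage. The model ignores contention and any cost not captured by a fixed per-transfer overhead.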

To balance latency and bandwidth, Trapeze uses an adaptive strategy that pipelines individual messages automatically for lowest latency, while dynamically adjusting the degree of pipelining to traffic patterns and congestion. The third column in Table 1 shows that this yields both low latency and high bandwidth. Adaptive message pipelining in Trapeze is implemented in the NIC firmware, eliminating host overheads for message fragmentation and reassembly.

Figure 2 outlines the message pipelining policy and the resulting overlapped transfers of a single 8KB packet across the sender's I/O bus, network link, and receiver's I/O bus. The basic function of the firmware running in each NIC is to move packet data from a source to a sink, in both sending and receiving directions. Data flows into the NIC from the source and accumulates in NIC buffers; the firmware ultimately moves the data to the sink by scheduling a transfer on the NIC DMA engine for the sink. When sending, the NIC's source is the host I/O bus (hostTx) and the sink is the network link (netTx). When receiving, the source is the network link (netRcv) and the sink is the I/O bus (hostRcv). The Trapeze firmware issues large transfers from each source as soon as data is available and there is sufficient buffer space to accept it. Each NIC makes independent choices about when to move data from its local buffers to its sinks.

Figure 3: Adaptive message pipelining reverts to larger transfer sizes for a stream of 8KB payloads. The fixed-size transfer at the start of each packet on NetTx and HostRcv is the control message data, which is always handled as a separate transfer. Control messages do not appear on HostTx because they are sent using programmed I/O rather than DMA on this platform (300 MHz Pentium-II/440LX).

The policy behind the Trapeze pipelining strategy is simple: if a sink is idle, initiate a transfer of all buffered data to the sink if and only if the amount of data exceeds a configurable threshold (minpulse). This policy produces near-optimal pipeline schedules automatically because it naturally adapts to speed variations between the source and the sink. For example, if a fast source feeds a slow sink, data builds up in the NIC buffers behind the sink, triggering larger transfers through the bottleneck to reduce the total per-transfer overhead. Similarly, if a slow source feeds a fast sink, the policy produces a sequence of small transfers that use the idle sink bandwidth to reduce latency.
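The rule can be sketched in a few lines, together with a toy single-stage simulation of the two speed-mismatch cases. This is not the Trapeze firmware: the function names, the integer tick model, and all rates and overheads are hypothetical. Flushing the tail of a message that is smaller than minpulse once the source has finished is also an assumption of this sketch; the text does not say how the firmware handles that case.

```python
def next_transfer(buffered, sink_idle, minpulse, source_done=False):
    """Bytes to send to the sink now, or 0 to keep waiting.

    Sends all buffered data when the sink is idle and the amount
    exceeds minpulse. Assumption of this sketch: a short tail is
    flushed once the source has delivered everything.
    """
    if not sink_idle or buffered == 0:
        return 0
    if buffered > minpulse or source_done:
        return buffered
    return 0


def simulate_stage(total, src_rate, sink_rate, overhead, minpulse):
    """Toy model of one NIC stage; returns the list of transfer sizes.

    The source delivers src_rate bytes per tick. A transfer of n bytes
    keeps the sink busy for overhead + ceil(n / sink_rate) ticks.
    All parameters are hypothetical.
    """
    arrived = sent = 0
    busy_until = 0
    tick = 0
    transfers = []
    while sent < total:
        arrived = min(total, arrived + src_rate)
        size = next_transfer(arrived - sent, tick >= busy_until,
                             minpulse, source_done=(arrived == total))
        if size:
            transfers.append(size)
            sent += size
            busy_until = tick + overhead + -(-size // sink_rate)
        tick += 1
    return transfers


if __name__ == "__main__":
    # Fast source, slow sink: data accumulates, so transfers grow.
    print(simulate_stage(8192, src_rate=200, sink_rate=50, overhead=5, minpulse=512))
    # Slow source, fast sink: many small transfers just above minpulse.
    print(simulate_stage(8192, src_rate=50, sink_rate=200, overhead=5, minpulse=512))
```

In this model the transfer size is never chosen explicitly. It falls out of how much data has accumulated while the sink was busy, which is the self-tuning property the policy relies on.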

The adaptive message pipelining strategy falls back to larger transfers during bursts or network congestion, because buffer queues on the NICs allow the adaptive behavior to carry over to multiple packets headed for the same sink. Even if the speeds and overheads at each pipeline stage are evenly matched, the higher overhead of initial small transfers on the downstream links quickly causes data to build up in the buffers of the sending and receiving NICs, triggering larger transfers.
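The carry-over effect for a burst can be sketched with a toy model of one sink fed by a back-to-back stream of packets. The assumptions are loud ones: the source is an idealized stream with no overhead of its own, a transfer never spans a packet boundary, the tail of a fully arrived packet is flushed even when it is below minpulse, and every rate, overhead, and threshold value is hypothetical. The function name is invented for this example.

```python
def simulate_burst(num_packets, packet_size, src_rate, sink_rate,
                   overhead, minpulse):
    """Return, for each packet, the list of transfer sizes sent to the sink.

    Time advances in integer ticks. The source delivers src_rate bytes
    per tick; a transfer of n bytes occupies the sink for
    overhead + ceil(n / sink_rate) ticks. Hypothetical parameters only.
    """
    total = num_packets * packet_size
    arrived = sent = 0
    busy_until = 0
    tick = 0
    per_packet = [[] for _ in range(num_packets)]
    while sent < total:
        arrived = min(total, arrived + src_rate)
        head = sent // packet_size              # packet at the queue head
        packet_end = (head + 1) * packet_size
        buffered = min(arrived, packet_end) - sent
        packet_complete = arrived >= packet_end
        sink_idle = tick >= busy_until
        if sink_idle and (buffered > minpulse or packet_complete):
            per_packet[head].append(buffered)
            sent += buffered
            busy_until = tick + overhead + -(-buffered // sink_rate)
        tick += 1
    return per_packet


if __name__ == "__main__":
    # Source and sink rates are matched; only the per-transfer
    # overhead makes the sink fall behind the incoming stream.
    burst = simulate_burst(num_packets=8, packet_size=8192, src_rate=100,
                           sink_rate=100, overhead=5, minpulse=512)
    for i, sizes in enumerate(burst):
        print("packet", i, sizes)
```

With matched rates, the first packet is split into several transfers, and the overhead of those transfers lets later packets queue up, so the trailing packets in the burst go out as single full-size transfers. That is the qualitative behavior described above; the model makes no claim about the timing the real firmware achieves.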

Figure 3 illustrates the adaptive pipelining behavior for a one-way burst of packets with 8KB payloads. This packet flow graph was generated from logs of DMA activity taken by an instrumented version of the Trapeze firmware on the sending and receiving NICs. The transfers for successive packets are shown in alternating shadings; all consecutive stripes with the same shading are from the same packet. The width of each stripe shows the duration of the transfer, measured by a cycle counter on the NIC. This duration is proportional to the transfer size in the absence of contention. Contention effects can be seen in the long first transfer on the sender's I/O bus, which results from the host CPU contending for the bus as it initiates send requests for the remaining packets.

Figure 3 shows that both the sending and receiving NICs automatically drop out of pipelining and fall back to full 8KB transfers about one millisecond into the packet burst. While the pipelining yields low latency for individual packets at low utilization, the adaptive behavior yields peak bandwidth for streams of packets. The policy is automatic and self-tuning, and requires no direction from the host software. Experiments have shown that the policy is robust, and responds well to a range of congestion conditions [15].

Jeff Chase