The Trapeze messaging system consists of two components: a messaging library that is linked into the kernel or user programs, and a firmware program that runs on the Myrinet network interface card (NIC). The firmware defines the interface between the host CPU and the network device; it interprets commands issued by the host and masters DMA transactions to move data between host memory and the network link. The host accesses the network using the Trapeze library, which defines the lowest-level API for network communication. Since Myrinet firmware is customer-loadable, any Myrinet site can use Trapeze.
Trapeze was designed primarily to support fast kernel-to-kernel messaging alongside conventional TCP/IP networking. Figure 1 depicts the structure of our current prototype client based on FreeBSD 4.0. The Trapeze library is linked into the kernel along with a network device driver that interfaces to the TCP/IP protocol stack. Network storage access bypasses the TCP/IP stack, instead using NetRPC, a lightweight communication layer that supports an extended Remote Procedure Call (RPC) model optimized for block I/O traffic. Since copying overhead can consume a large share of CPU cycles at gigabit-per-second bandwidths, Trapeze is designed to allow copy-free data movement, which is supported by page-oriented buffering strategies in the socket layer, network device driver, and NetRPC.
We are experimenting with Slice, a new scalable network I/O service based on Trapeze. The current Slice prototype is implemented as a set of loadable kernel modules for FreeBSD. The client side consists of 3000 lines of code interposed as a stackable file system layer above the Network File System (NFS) protocol stack. This module intercepts read and write operations on file vnodes and redirects them to an array of block I/O servers using NetRPC. It incorporates a simple striping layer and cacheable block maps that track the location of blocks in the storage system. Name space operations are handled by a file manager using the NFS protocol, decoupling name space management (and access control) from block management. This structure is similar to other systems that use independent file managers, including Swift, Zebra, and Cheops/NASD. To scale the file manager service, Slice uses a hashing scheme that partitions the name space across an array of file managers; the scheme is implemented in a packet filter that redirects NFS requests to the appropriate server.
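The two routing decisions described above, striping blocks across I/O servers and hashing name space requests across file managers, can be illustrated with a minimal Python sketch. It is not the Slice implementation: the round-robin layout, the 8 KB block size, the CRC-32 hash, the choice of hashing the parent directory handle together with the name, and the function names `stripe_target` and `file_manager_for` are all assumptions made for illustration.

```python
import zlib

# Assumed block size in bytes; the text does not state the value Slice uses.
BLOCK_SIZE = 8192


def stripe_target(offset, num_servers, block_size=BLOCK_SIZE):
    """Map a byte offset in a file to (server index, block index on that server).

    Assumes simple round-robin striping: logical block b is stored on
    server b mod N as that server's (b div N)-th block of the file.
    """
    block = offset // block_size
    return block % num_servers, block // num_servers


def file_manager_for(parent_handle, name, num_managers):
    """Pick a file manager for a name space operation.

    Assumes the hash key is the parent directory's file handle plus the
    entry name, hashed with CRC-32. A packet filter could apply the same
    function to each NFS request to choose its destination server.
    """
    key = parent_handle + b"/" + name.encode("utf-8")
    return zlib.crc32(key) % num_managers
```

Because both functions are pure functions of the request, any client or filter that knows the server count reaches the same routing decision without consulting shared state. A plain modulus, as used here, would remap most names whenever the number of file managers changes, so it is only suitable as an illustration of the partitioning idea.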