Scheduled Transfers OS Bypass API
11/17/98

Eric Salo (salo@sgi.com)
Jim Pinkerton (jimp@sgi.com)

-------------------------------------------------------------------------------
Intro
-------------------------------------------------------------------------------

The ST OS Bypass is implemented in several modules which span both kernel and
user space. To make the problem tractable, the approach we've chosen is to
create multiple API layers. This paper focuses specifically on the user space
interfaces needed to provide an OS bypass API.

It is assumed that a normal (BSD) socket-based API is used to create and close
connections. Before and after a connection is established, the ST OS Bypass
queries the kernel ST protocol stack to gain enough state to allow it to
perform data transfer without going through the kernel. So after connection
setup, no OS Bypass ST messages will be transferred through the kernel until
the connection is torn down.

To ensure that multiple OS Bypass jobs can coexist safely, several functions
are performed exclusively by the kernel. These include setup/teardown of
connections, setup/teardown of pinned buffers for data transfer, and
setup/teardown of descriptor queues. The OS Bypass process is also not allowed
to manipulate physical addresses or source MAC addresses.

-------------------------------------------------------------------------------
ST OS Bypass Layered APIs
-------------------------------------------------------------------------------

The Bypass API currently consists of a 'sockopt' layer and three other layers
numbered 0-2. It is our intention that the numbered layers of the API be
implementable on top of each other opaquely. In other words, it should be
possible to implement the complete level 1 API on top of the level 0 API with
no knowledge of the level 0 internals. It should be similarly possible to
implement level 2 on top of level 1, and so on.
Note that we do not *require* implementations to be done this way, and in fact
for performance reasons it is highly likely that we will ultimately implement
the higher-numbered layers directly.

The layers are defined as follows:

THE SOCKOPT LEVEL
-----------------

This is the layer which initializes the bypass via calls to setsockopt() and
getsockopt(). Many of the routines in this layer are intended solely for use
by our own development team and exist only to support the higher layers. But
some will also be part of the supported interface; this is discussed below.

LEVEL 0
-------

The lowest layer API is meant to be as thin as possible, and allows full
access to all ST functionality by providing the ability to transmit and
receive ST control and/or data packets directly. It introduces as few new
end-to-end concepts as possible and conforms to the ST connection-oriented
protocol. As a result, setting up transmit and receive operations is often
rather complex.

LEVEL 1
-------

This layer implements a more restricted data transfer model. Its purpose is
to abstract away most of the complexity involved with buffer and descriptor
management in order to provide a greatly simplified interface.

LEVEL 2
-------

Note that both of the above layers provide unreliable data transfer. A
reliable data transfer layer is definitely in our plans, and this is a
placeholder for that functionality. At present we still do not have enough
experience with the unreliable layers to know what a reliable layer should
look like.

All of the numbered layers will be implemented in a single dynamic library
named libst.so.

-------------------------------------------------------------------------------
ST OS Bypass Modules
-------------------------------------------------------------------------------

In general, the numbered API layers must interact (directly or indirectly)
with several additional modules.
These modules include the User Device Driver, the Kernel ST Protocol Stack,
the Kernel Device Driver, and the Network Adapter. Of these, only the first
two interact directly with the API. Visually, the relationship is as follows:

    +-------------------------------------------+
    |    ST Middleware User Library, Level 2    |
    +-------------------------------------------+
    |    ST Middleware User Library, Level 1    |
    +-------------------------------------------+
    |    ST Middleware User Library, Level 0    |
    |                     +---------------------+
    |                     |   Kernel ST Stack   |
    +---------------------+---------------------+
    | User Device Driver  | Kernel Device Driver|
    +---------------------+---------------------+
    |              Network Adapter              |
    +-------------------------------------------+

This document defines the interfaces between all of the above modules.

The User Device Driver is similar to the Kernel Device Driver - it provides a
device-independent abstraction to the Level 0 API. So, the gory details of the
transmit descriptor format, the receive descriptor format, and how to flow
control the transmit descriptor queue are hidden inside the User Device
Driver. The User Device Driver makes several calls into the kernel to
configure itself for data transfer, and talks directly to the adapter in order
to move data.

-------------------------------------------------------------------------------
Establishing a Bypass Connection
-------------------------------------------------------------------------------

The calling sequence when establishing a connection will normally be:

    1. Create a socket.
    2. Enable the bypass on the socket.
    3. Configure the bypass by getting and setting various socket options.
    4. Establish a socket connection to a peer process on another host.
    5. Initialize API state for the connection.

Once the bypass connection setup is complete, all ST messages across the
socket completely bypass the operating system; no further system calls are
needed to move data across the network until the connection is closed.
-------------------------------------------------------------------------------
STEP 1 - Create a Socket
-------------------------------------------------------------------------------

    int s;

    s = socket(PF_INET, SOCK_SEQPACKET, IPPROTO_STP);

Note: SOCK_SEQPACKET is currently the only supported socket type for
IPPROTO_STP.

-------------------------------------------------------------------------------
STEP 2 - Enable the Bypass
-------------------------------------------------------------------------------

    int flag, size;

    flag = 1;
    size = sizeof(int);
    setsockopt(s, IPPROTO_STP, ST_BYPASS, &flag, size);

Note: This step must occur before the socket is bound to a controller and
before a connection is established.

-------------------------------------------------------------------------------
STEP 3 - Configure the Bypass
-------------------------------------------------------------------------------

This step involves making system calls to get/set various parameters for the
bypass connection. These options are enumerated and described in the SOCKOPT
section of this document.

-------------------------------------------------------------------------------
STEP 4 - Establish a Socket Connection
-------------------------------------------------------------------------------

This is exactly the same listen() / accept() / connect() dance that we've all
come to know and love. Describing the full BSD sockets API is beyond the scope
of this document.

Note: If any of the configuration parameters set during step 3 were invalid,
the connect() or accept() call is expected to fail, and in general this
failure is the only way to detect such invalid settings. This is because we
don't mandate binding to a device before configuring the bypass.
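
Steps 1, 2, and 4 can be combined into one client-side helper. The following
is a minimal sketch, not part of the bypass API: the function name
st_bypass_connect() and its negative return codes are hypothetical, and the
fallback values for IPPROTO_STP and ST_BYPASS are placeholders so that the
code compiles on hosts without the ST headers (the real values would come from
those headers). Step 3 is omitted here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Placeholder values (assumptions) for hosts without the ST headers. */
#ifndef IPPROTO_STP
#define IPPROTO_STP 118
#endif
#ifndef ST_BYPASS
#define ST_BYPASS 1
#endif

/*
 * Hypothetical helper. Returns a connected, bypass-enabled socket, or
 *   -1 if socket() fails,
 *   -2 if the bypass cannot be enabled,
 *   -3 if connect() fails.
 */
int st_bypass_connect(const struct sockaddr_in *peer)
{
    int s, flag = 1;

    assert(peer != NULL);

    /* Step 1: create the socket. */
    s = socket(PF_INET, SOCK_SEQPACKET, IPPROTO_STP);
    if (s < 0)
        return -1;

    /* Step 2: enable the bypass before any bind() or connect(). */
    if (setsockopt(s, IPPROTO_STP, ST_BYPASS, &flag, sizeof(flag)) < 0) {
        close(s);
        return -2;
    }

    /* Step 4: invalid step 3 settings, if any, surface here as a failure. */
    if (connect(s, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
        close(s);
        return -3;
    }
    return s;
}
```

On a host without ST support the helper is expected to fail at step 1, which
is why each step reports a distinct code.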
-------------------------------------------------------------------------------
STEP 5 - Initialize API State
-------------------------------------------------------------------------------

The bypass API is initialized for a connection by calling st_attach() on the
open socket. This routine is described more fully below.

-------------------------------------------------------------------------------
Summary of API Routines (INCOMPLETE)
-------------------------------------------------------------------------------

    Sockopt(s):  ST_BYPASS

    General:     st_attach()
                 st_detach()
                 st_flush()
                 st_map()
                 st_unmap()

    Level 0:     st_tx()
                 st_rx()

    Level 1:     st1_push()
                 st1_pull()

-------------------------------------------------------------------------------
SOCKOPT: Enable the Bypass
-------------------------------------------------------------------------------

    int s, flag, size;

    size = sizeof(int);
    setsockopt(s, IPPROTO_STP, ST_BYPASS, &flag, size);
    getsockopt(s, IPPROTO_STP, ST_BYPASS, &flag, &size);

The 'set' option enables the bypass for the socket if flag=1 and disables the
bypass if flag=0. It is an error to enable or disable the bypass after the
socket has already been bound to a controller.

-------------------------------------------------------------------------------
GENERAL: Attach to a Peer
-------------------------------------------------------------------------------

    int st_attach(int fd, int channel, int flags, st_peer_t **handle);

Takes as input a file descriptor which corresponds to an ST connection (as
described above) and a desired channel number. Returns a pointer to an
st_peer_t, which encapsulates all of the hidden bypass-level state for that
connection.

By having this call, we avoid the extra indirection into a table that we would
otherwise have if the API routines took fds as arguments. We also completely
sidestep the problem of mirroring the kernel's fd table in user space.

NOTE: No flags are currently defined.
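
The handle-based design can be illustrated with a toy stand-in. The sketch
below is not the libst implementation: demo_attach(), demo_detach(), and
demo_roundtrip() are hypothetical names, the contents of the peer structure
are invented, and the return convention (0 on success, -1 on failure) is an
assumption, since this document does not define return values. It shows only
how a handle returned through a pointer-to-pointer argument lets later calls
reach per-connection state directly instead of looking up an fd in a table.

```c
#include <assert.h>
#include <stdlib.h>

/* Invented contents; the real st_peer_t is opaque to callers. */
typedef struct demo_peer {
    int fd;         /* socket the handle was attached to */
    int channel;    /* channel requested at attach time */
} demo_peer_t;

/* Toy analogue of st_attach(): allocate state, return it via 'handle'. */
int demo_attach(int fd, int channel, int flags, demo_peer_t **handle)
{
    demo_peer_t *p;

    if (fd < 0 || handle == NULL || flags != 0)  /* no flags are defined */
        return -1;
    p = malloc(sizeof(*p));
    if (p == NULL)
        return -1;
    p->fd = fd;
    p->channel = channel;
    *handle = p;
    return 0;
}

/* Toy analogue of st_detach(): free the state, leave the fd open. */
int demo_detach(demo_peer_t **handle)
{
    if (handle == NULL || *handle == NULL)
        return -1;
    free(*handle);
    *handle = NULL;     /* assumed: caller's pointer is cleared */
    return 0;
}

/* Usage: attach, reach the state directly through the handle, detach. */
int demo_roundtrip(int fd)
{
    demo_peer_t *peer = NULL;

    if (demo_attach(fd, 0, 0, &peer) != 0)
        return -1;
    assert(peer != NULL && peer->fd == fd);
    return demo_detach(&peer);
}
```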
-------------------------------------------------------------------------------
GENERAL: Detach from a Peer
-------------------------------------------------------------------------------

    int st_detach(st_peer_t **handle);

Cleans up all API state associated with a connection and frees the handle.
Note that this call does not also close the original file descriptor.

-------------------------------------------------------------------------------
GENERAL: Flush the Output Queue
-------------------------------------------------------------------------------

    int st_fence(st_peer_t *handle);

Blocks until all transmit descriptors that have previously been sent to the
appropriate SDQ have been completely processed and all of the corresponding
bits have been completely DMA'ed out of host memory.

Discussion: We may want to add a 'count' argument to this call, indicating the
number of transactions that we should wait for. A value of -1 would mean to
wait for all of them.

-------------------------------------------------------------------------------
GENERAL: Prepare a Buffer
-------------------------------------------------------------------------------

    int st_map(st_peer_t *handle, void **ptr, int len);

Allows a range of memory to be used by the data-moving routines. Note that the
second argument ('ptr') is the address of a pointer; 'ptr' cannot be NULL, but
it may point to a NULL pointer. If 'ptr' points to a non-NULL pointer then the
library will attempt to mark the supplied buffer as bypass-ready. If 'ptr'
points to a NULL pointer then the library will attempt to allocate such a
buffer and return its address back to the caller. Only one range is permitted
at a time for any given 'peer'.

Discussion: What should the semantics be for multiple calls?
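
The pointer-to-pointer convention of st_map() can be sketched with a toy
stand-in. Everything here is hypothetical: demo_map() is not the library
routine, demo_range_t is an invented record of the one range a peer may hold,
plain malloc() stands in for whatever pinning or allocation the real call
performs, the 0 / -1 return convention is assumed, and rejecting a second
range is only one possible answer to the open question about multiple calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented per-peer record of the single mapped range. */
typedef struct demo_range {
    void *base;     /* start of the mapped range, or NULL if none */
    int   len;      /* length of the mapped range in bytes */
} demo_range_t;

/* Toy analogue of st_map(); returns 0 on success, -1 on failure (assumed). */
int demo_map(demo_range_t *peer, void **ptr, int len)
{
    if (peer == NULL || ptr == NULL || len <= 0)
        return -1;              /* 'ptr' itself cannot be NULL */
    if (peer->base != NULL)
        return -1;              /* only one range at a time per peer */

    if (*ptr == NULL) {
        /* Caller asked the library to allocate the buffer. */
        *ptr = malloc((size_t)len);
        if (*ptr == NULL)
            return -1;
    }
    /*
     * Otherwise the caller's own buffer is used; a real library would
     * mark it bypass-ready at this point.
     */
    peer->base = *ptr;
    peer->len = len;
    return 0;
}

/* Usage: let the library allocate a 4096-byte buffer for this peer. */
void *demo_map_alloc(demo_range_t *peer)
{
    void *buf = NULL;
    int rc = demo_map(peer, &buf, 4096);

    assert(rc != 0 || buf != NULL);  /* success implies a usable address */
    return rc == 0 ? buf : NULL;
}
```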
-------------------------------------------------------------------------------
GENERAL: Release a Buffer
-------------------------------------------------------------------------------

    int st_unmap(st_peer_t *handle, void **ptr, int len);

TBD; more analysis is needed here regarding the desired semantics.

-------------------------------------------------------------------------------
LEVEL 0: Transmit an ST Header
-------------------------------------------------------------------------------

    int st_tx(st_peer_t *handle, st_header_t *hdr, void *opt_pay,
              int len, int bufx, int off);

Blocks until a transmit slot appears for the appropriate channel. Writes the
descriptor and returns. If 'opt_pay' is non-NULL, a receive descriptor will be
sent to the receiver. If 'len' is non-zero, the buffer starting at will be
delivered into the receiver's memory.

NOTE: st_header_t is a complete ST header, as defined in st.h.

-------------------------------------------------------------------------------
LEVEL 0: Receive an ST Header
-------------------------------------------------------------------------------

    int st_rx(st_peer_t *handle, st_header_t *hdr, void *opt_pay, int *len);

Blocks until a descriptor appears on the appropriate channel. Copies the ST
header and length into the user-supplied buffer and returns.

-------------------------------------------------------------------------------
LEVEL 1: Data Transfer
-------------------------------------------------------------------------------

Level 1 abstracts away both the ST header and bufx table. It supports both
'get' and 'put' operations, with optional synchronization for each. All data
transfer is performed by the following two routines:

    int st1_push(void *peer, void *hdr, void *loc, void *rem, int len);
    int st1_pull(void *peer, void *hdr, void *loc, void *rem, int len);

'hdr' points to a fixed-size buffer which serves as a sync point.
'loc' and 'rem' are local and remote addresses, respectively.
'len' is the number of bytes to transfer between the local and remote buffers.

There are three possible ways to call either st1_push() or st1_pull():

    hdr         loc         rem
    ---         ---         ---
    non-NULL    NULL        NULL
    NULL        non-NULL    non-NULL
    non-NULL    non-NULL    non-NULL

That is, either a header may be specified, or local/remote pointers may be
specified, or both. The semantics of these calls can be summarized completely
with three rules:

    1) The 'hdr' arguments to the push and pull calls act as traditional
       send/recv operations. That is, the hdr data passed in to a push will
       appear in the hdr buffer of the next pull called by the peer process.

    2) If 'loc' and 'rem' arguments are supplied to st1_push(), a PUT
       operation will be performed before the hdr (if any) is delivered
       remotely.

    3) If 'loc' and 'rem' arguments are supplied to st1_pull(), a GET
       operation will be performed after the hdr (if any) is delivered
       locally.

-------------------------------------------------------------------------------
LEVEL 1: Examples
-------------------------------------------------------------------------------

There are exactly six different combinations of the push/pull calls which make
semantic sense:

1) SEND/RECV of a header only

    Process A calls st1_push(peer, hdr, NULL, NULL, 0)
    Process B calls st1_pull(peer, hdr, NULL, NULL, 0)

   In this example, a small message is copied from Process A's 'hdr' buffer
   into Process B's 'hdr' buffer. This is exactly the same behavior as we
   might see with calls to send() and recv() or write() and read().

2) PUT

    Process A calls st1_push(peer, NULL, loc, rem, len)
    Process B does nothing

   In this example, data from Process A's address space is silently copied
   into Process B's address space. There is no direct way for Process B to
   detect that this has happened, other than to periodically inspect its own
   buffer(s).

3) GET

    Process A does nothing
    Process B calls st1_pull(peer, NULL, loc, rem, len)

   In this example, data from Process A's address space is silently copied
   into Process B's address space. There is no direct way for Process A to
   detect that this has happened.

4) PUT with sync

    Process A calls st1_push(peer, hdr, loc, rem, len)
    Process B calls st1_pull(peer, hdr, NULL, NULL, 0)

   In this example, data from Process A's address space is copied into
   Process B's address space. Process B will block in st1_pull() until this
   has happened, at which point st1_pull() will return and Process B's 'hdr'
   buffer will contain the header sent from Process A. In this way, Process B
   can determine that the PUT operation has completed and that it is now safe
   for it to inspect its local buffers.

5) GET with sync

    Process A calls st1_push(peer, hdr, NULL, NULL, 0)
    Process B calls st1_pull(peer, hdr, loc, rem, len)

   In this example, data from Process A's address space is copied into
   Process B's address space, but only after Process B receives the 'hdr'
   from Process A. In this way, Process A can postpone the GET operation
   until it has safely updated its local buffers with new data for use by
   Process B.

6) Combined PUT and GET with sync

    Process A calls st1_push(peer, hdr, loc1, rem1, len1)
    Process B calls st1_pull(peer, hdr, loc2, rem2, len2)

   In this example, data from Process A's address space (from loc1) is PUT
   into Process B's address space, and then a message is sent to Process B.
   After this message arrives, a GET operation is performed into Process B's
   address space.

--
Eric Salo
Silicon Graphics
salo@sgi.com
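
The three legal argument combinations for st1_push() and st1_pull() can be
captured in a small checker. This is an illustrative sketch only: the enum,
demo_classify(), and demo_check_args() are hypothetical helpers that do not
appear in the API. Treating a lone 'loc' or 'rem' as invalid, and expecting
len == 0 for header-only calls, are inferences from the table and the examples
rather than stated requirements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three legal call shapes. */
typedef enum {
    DEMO_INVALID,       /* not one of the three combinations */
    DEMO_HDR_ONLY,      /* hdr set, loc/rem NULL: plain send/recv */
    DEMO_DATA_ONLY,     /* hdr NULL, loc/rem set: unsynchronized PUT/GET */
    DEMO_HDR_AND_DATA   /* all three set: PUT/GET with sync */
} demo_shape_t;

/* Map a (hdr, loc, rem) triple onto one row of the table, if any. */
demo_shape_t demo_classify(const void *hdr, const void *loc, const void *rem)
{
    int have_data = (loc != NULL && rem != NULL);
    int no_data   = (loc == NULL && rem == NULL);

    if (hdr != NULL && no_data)
        return DEMO_HDR_ONLY;
    if (hdr == NULL && have_data)
        return DEMO_DATA_ONLY;
    if (hdr != NULL && have_data)
        return DEMO_HDR_AND_DATA;
    return DEMO_INVALID;
}

/* Guard a wrapper might run before calling st1_push() or st1_pull(). */
void demo_check_args(const void *hdr, const void *loc, const void *rem,
                     int len)
{
    demo_shape_t shape = demo_classify(hdr, loc, rem);

    assert(shape != DEMO_INVALID);
    /* The examples pass len 0 whenever no data buffers are given. */
    assert(shape != DEMO_HDR_ONLY || len == 0);
}
```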