libst API
(STP OS Bypass API Level 0)

12/10/99 draft
Revised 2/15/00 v04

by Eric Salo, Jim Pinkerton, John Gregor and Monika ten Bruggencate

Revision History
01 - initial version
02 - fixed tabs, justification
03 - changed return codes to be negative, added st_rx and st_urx
     return value of 1 if prepend payload is present.
04 - decouple mx alloc/free from mapping of a buffer. added
     ST_OPT_TX_CREDITS
05 - minor mods to introduction.

==========================================================
Table of Contents
==========================================================
1. Scope
2. Normative references
3. Definitions and conventions
4. System Overview
5. Constructors and Destructors
6. Configuration
7. Connection-oriented Model
8. Connectionless Mode Initialization
9. Memory Management
10. Data Transfer
11. Miscellaneous
Appendix A. Unresolved Issues

==========================================================
1. Scope
==========================================================
The primary motivator for an OS Bypass is to avoid the overhead of
sending messages through the Operating System (OS). This overhead adds
latency and unnecessarily consumes CPU cycles due to context switches
required to switch in to and out of the kernel. Additionally, an OS
Bypass exposes the ability to DMA directly into and out of a user
buffer, allowing a zero copy transfer. This becomes critical as
networks exceed the data copy rate of a single CPU.

Libst, an OS Bypass library, provides an OS Bypass using the ANSI draft
standard Scheduled Transfers Protocol (STP). The library is designed to
provide MAC layer independence while exposing the full STP tool suite.
Because STP takes a toolkit approach to message transfer, many transfer
models can be layered on top of STP. This includes stream, datagram,
and message oriented push, pull and fetchop semantics.

This document defines the libst Applications Programmer Interface (API).
The libst API goals are
- provide the simplest software layer possible to optimize performance
- maintain MAC layer independence
- provide hardware implementation independence
- expose the full STP tool suite.

It is envisioned that multiple libraries can be targeted to this API,
potentially exposing new, higher level APIs. As such, the API defined
in this document is referred to as level zero.

The libst API provides an STP block oriented API. It does not provide
an API for transmission of individual STP STUs. This is because the
choice of whether STU tiling is done in hardware or software should be
hidden to allow different implementations to optimize for trade-offs
between adapter functionality versus cost.

Libst supports two data transfer models:
- a connection-oriented model that is similar to the send() sockets
  API semantics of a connection-oriented API, and
- a connectionless model similar to the sendto() sockets API
  semantics, which allows end-points to be defined without
  connections.

A fully connected cluster of computers using a connection-oriented
model requires n-squared connections, where n is the number of
processes in the cluster. The connectionless model does not require
connections, and thus avoids connection-oriented state scaling
problems. Instead, n end-points are defined which can send to and
receive from any process in the cluster.

If a connection-oriented model is used, libst requires that parameters
exchanged during STP connection setup be set before a connection is
created. For the connectionless model libst requires an endpoint to be
set up. Once a connection or endpoint is created, libst queries the OS
to acquire several STP parameters and enable the adapter to send
packets directly to and transmit packets directly from the user
process.

System integrity dictates that privileged operations needed by libst be
performed exclusively by the kernel.
These include:
- setup and teardown of connections and endpoints
- setup and teardown of pinned buffers
- mapping of a user address to a physical address that can be used by
  the adapter
- mapping of transmit and receive descriptor queues between the user
  process and the adapter
- setting of the source MAC address in outgoing packets
- setting of the source STP port in outgoing packets
- interface selection (aka routing)

==========================================================
2. Normative references
==========================================================

==========================================================
3. Definitions and conventions
==========================================================

3.1 Definitions
--------------------------
ptr = pointer
input parameter
output parameter
endpoint
connection

3.2 Datatypes
--------------------------
The following datatypes are defined:

    struct st_ehandle { };                    // opaque structure for endpoint
    struct st_chandle { };                    // opaque structure for connection
    typedef struct st_ehandle *st_ehandle_t;  // opaque ptr to an endpoint
    typedef struct st_chandle *st_chandle_t;  // opaque ptr to a connection
    typedef void *st_mhandle_t;               // opaque ptr to a region of memory
    typedef unsigned int uint;                // unsigned integer
    typedef long long int64_t;                // 64-bit signed integer
    typedef __uint64_t st_macaddr_t;          // MAC address

3.2 Editorial Conventions
--------------------------

3.3
--------------------------

==========================================================
4. System Overview
==========================================================
The following diagram outlines the functional components of an OS
Bypass library. This document specifies the libst API boundary. The
design is optimized for, but does not require, a libst implementation
to have an internal hardware independent interface to multiple libst
device drivers.
+--------------------------------------+
|                                      |
|        Upper Layer Protocols         |  Multiple application libraries
|                                      |
+--------------------------------------+  libst API boundary
|   libst level 0                      |
|                   +------------------+
|                   |+-----------------|  Kernel boundary
|                   || STP Protocol    |
|                   ||    Stack        |
+-------------------||-----------------+
|   Libst Device    || Kernel Device   |
|      Driver       ||    Driver       |
+-------------------++-----------------+
|                                      |
|               Adapter                |
+--------------------------------------+

A libst Level Zero library implementation should consist of:
- a device independent layer
- one or more device dependent drivers
- kernel support to allow configuration of an STP endpoint or
  connection
- kernel support to allow mapping the adapter directly into the user's
  address space
- kernel support to allow mapping of a user's buffers to the adapter

The Libst Device Driver functions are similar to the Kernel Device
Driver - they provide a device independent abstraction for the device
API. Thus details that are generally implementation dependent such as
transmit descriptor format, receive descriptor format, and flow control
mechanisms for the transmit descriptor queue are hidden.

The adapter must be designed to enforce the various checks on incoming
packets defined in the STP specification and not allow the user to
specify the source port in transmitted packets. Note that the API does
not expose the source MAC address.

4.1. List of All Routines
---------------------------------------------
Constructors and Destructors:
    st_create()
    st_endpoint()
    st_delete()
Configuration:
    st_getopt()
    st_setopt()
Connection Setup:
    st_listen()
    st_accept()
    st_connect()
    st_close()
End-Point Setup:
    st_macaddr()
Memory Management:
    st_map()
    st_unmap()
    st_mx_alloc()
    st_mx_free()
Data Transfer:
    st_tx()
    st_rx()
    st_flush()
    st_utx()
    st_urx()
Miscellaneous:
    st_time()
    st_version()

==========================================================
5. Constructors and Destructors
==========================================================

5.1 st_create()
-----------------------------------------
To create an endpoint which is initialized for a connection-oriented
STP transfer model, use:

    int st_create(const char *str, st_ehandle_t *ehandle)

This routine allocates (and initializes) storage in libst for all state
corresponding to a point-to-point STP connection.

Input parameters:
    str: a string that takes one of two forms.
        "hostname:port" - hostname is either a valid internet hostname
            (e.g. foobar.sgi.com), or an IP address. Port is a valid
            IANA port.
        NULL - used by the client to allow automatic port and interface
            selection during st_connect().
    ehandle: address of a pointer to the opaque structure st_ehandle.
Output parameters:
    ehandle: address of a pointer to the memory allocated for the
        opaque data structure st_ehandle.
Return errors:
    Return value of zero signifies success. Negative value means an
    error (negate value and see sys/errno.h for decode).

5.2 st_endpoint()
-----------------------------------------
To create an endpoint for the connectionless model, use:

    int st_endpoint(const char *str, st_ehandle_t *ehandle)

This routine creates an end-point that can be used to send messages to
and receive messages from arbitrary hosts and ports. See the STP
specification for details of checks performed by the adapter.

Input parameters:
    same as st_create()
Output parameters:
    same as st_create()
Return error codes:
    same as st_create()

5.3 st_delete()
-----------------------------------------
To deallocate the storage and state set up with either st_endpoint() or
st_create(), use:

    int st_delete(st_ehandle_t ehandle)

Input parameters:
    an st_ehandle_t that was set up previously with a call to
    st_create() or st_endpoint().
Output parameters:
    None
Return error codes:
    Return value of zero signifies success. Negative value means an
    error (negate value and see sys/errno.h for decode).
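
The return-code convention used by the routines above (zero for
success, a negated errno value for failure) can be decoded with
standard C alone. The following is a minimal sketch; the helper names
st_rc_message() and st_check() are hypothetical and are not part of
libst.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libst): map a libst-style return
 * code to a printable message. Values >= 0 are successes; a negative
 * value is a negated errno, so it is negated again before lookup. */
static const char *st_rc_message(int rc)
{
    if (rc >= 0)
        return "success";
    return strerror(-rc);
}

/* Hypothetical helper: report a failed call on stderr and pass the
 * return code through unchanged so it can wrap a call inline. */
static int st_check(const char *what, int rc)
{
    assert(what != NULL);
    if (rc < 0)
        fprintf(stderr, "%s failed: %s (errno %d)\n",
                what, st_rc_message(rc), -rc);
    return rc;
}
```

A caller could then write, for example,
st_check("st_delete", st_delete(ehandle)) and test the result against
zero.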
If passed a handle to a still-open connection, st_delete() will
silently close that connection before deleting any state, just as if
the application had issued a call to st_close(). In either connection
mode or connectionless mode, st_delete() may involve waiting for any
pending transmissions to flush.

==========================================================
6. Configuration
==========================================================
The configuration parameters for an STP connection may be written or
read using st_setopt() or st_getopt(). To read the current value of an
STP parameter, use:

    int st_getopt(st_ehandle_t ehandle, int opt, void *val)

Similarly, to set the value of an STP parameter, use:

    int st_setopt(st_ehandle_t ehandle, int opt, void *val)

The following values are legal options for the 'opt' argument. The val
pointer is dependent on the option used. For st_setopt() val is an
input parameter. For st_getopt() val is an output parameter. The
return value is zero for success and negative for an error. To decode
the error, negate the value and see sys/errno.h.

ST_OPT_BUFSIZE_LOCAL
    The bufsize used by the local controller. val = pointer to uint.
    Defined as a power of two. Default is 14 (i.e. the bufsize is
    16 KB).

ST_OPT_BUFSIZE_REMOTE
    The bufsize used by the remote controller. This value is exchanged
    as part of STP connection setup and thus cannot be set. It is not
    a valid command when the connectionless model is used. val =
    pointer to uint. Return value is a power of two (e.g. 14 = 16 KB).

ST_OPT_MAXSTU_LOCAL
    The maximum STU size supported by the local controller. val =
    pointer to uint. Defined as a power of two. Default is 14 (i.e.
    the maxstu size is 16 KB).

ST_OPT_MAXSTU_REMOTE
    The maximum STU size supported by the remote controller. This
    value is exchanged as part of STP connection setup and thus cannot
    be set. It is not a valid command when the connectionless model is
    used. val = pointer to uint. Return value is a power of two (e.g.
    14 = 16 KB).

ST_OPT_KEY_LOCAL
    The key value used by the local controller when processing
    incoming messages. Can only be set by root for security reasons.
    val = pointer to uint.

ST_OPT_KEY_REMOTE
    The key value used by the remote controller when processing
    incoming messages. This value is exchanged as part of STP
    connection setup and thus cannot be set. It is not a valid command
    when the connectionless model is used. val = pointer to uint.

ST_OPT_PORT_LOCAL
    The local STP port for this connection or endpoint. val = pointer
    to uint.

ST_OPT_PORT_REMOTE
    The remote STP port for the connection. This value is exchanged as
    part of STP connection setup and thus cannot be set. It is not a
    valid command when the connectionless model is used. val = pointer
    to uint.

ST_OPT_RX_BUFX_BASE
    The base bufx used for receive buffers. The combination of base
    bufx and bufx count defines the valid set of bufx values for
    receive buffers. val = pointer to uint.

ST_OPT_RX_BUFX_COUNT
    The number of receive bufx's that have been allocated to this
    endpoint or connection. The combination of base bufx and bufx
    count defines the valid set of bufx values for receive buffers.
    val = pointer to uint.

ST_OPT_RX_SLOTS
    The total number of STP Schedule Header Queue slots available
    locally. val = pointer to uint.

ST_OPT_TX_BUFX_BASE
    The base bufx used for transmit buffers. The combination of base
    bufx and bufx count defines the valid set of bufx values for
    transmit buffers. val = pointer to uint.

ST_OPT_TX_BUFX_COUNT
    The number of transmit bufx's that have been allocated to this
    endpoint or connection. The combination of base bufx and bufx
    count defines the valid set of bufx values for transmit buffers.
    val = pointer to uint.

ST_OPT_TX_CREDITS
    The number of transmit credits in the transmit queue. This is the
    number of sends that may be outstanding before transmit flow
    control starts and the pipeline blocks until the queue is flushed.

==========================================================
7. Connection-oriented Model
==========================================================
Libst provides the following routines to manage connection setup and
teardown when the connection-oriented model is used.

7.1 st_listen()
-----------------------------------------
To prepare a server to accept incoming connection requests from a
client, use:

    int st_listen(st_ehandle_t ehandle)

Input parameters:
    ehandle: created with st_create()
Output parameters:
    None.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

7.2 st_accept()
-----------------------------------------
To accept an incoming connection:

    int st_accept(st_ehandle_t ehandle, st_chandle_t *chandle)

Input parameters:
    ehandle: created with st_create()
    chandle: a pointer to memory of size st_chandle_t
Output parameters:
    chandle: an opaque pointer to the connection handle.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

st_accept() blocks the server until an incoming connection from a
client has been accepted.

7.3 st_connect()
-----------------------------------------
To establish a connection with the remote server:

    int st_connect(st_ehandle_t ehandle, const char *peer,
                   st_chandle_t *chandle)

Input parameters:
    ehandle: created with st_create()
    chandle: a pointer to memory of size st_chandle_t
    peer: a string of the form:
        "hostname:port" - hostname is either a valid internet hostname
            (e.g. foobar.sgi.com), or an IP address. Port is a valid
            IANA port.
Output parameters:
    chandle: an opaque pointer to the connection handle.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

7.4 st_close()
-----------------------------------------
To close the connection:

    int st_close(st_chandle_t chandle)

Input parameters:
    chandle: created with either st_connect() or st_accept()
Output parameters:
    None.
Return values:
    Zero means success.
    Negative value means an error (negate value and see sys/errno.h
    for decode).

If there are still pending transmit operations on this connection, the
call will block until they have been cleared, exactly as if the
application had called st_flush().

7.5 Example C Code
-----------------------------------------
The libst connection setup routines are similar in functionality to
their BSD namesakes. Programmers already familiar with the sockets API
should have little difficulty in learning how to use them.
Pseudo-code for a simple client-server example follows:

    st_ehandle_t ehandle;
    st_chandle_t chandle;
    st_header_t st_header;
    char *hostname, *service;
    int rc, i_am_client;
    uint port;
    char peer[128];

    hostname = "server_host";    // could also be an IP addr
    service = "3000";            // could also be a name

    // Create string containing server hostname and port number
    strcpy(peer, hostname);
    strcat(peer, ":");
    strcat(peer, service);

    // Create an empty handle
    if (i_am_client) {
        rc = st_create(NULL, &ehandle);
    } else {
        rc = st_create(peer, &ehandle);
    }

    // Set various options for the connection
    st_setopt(ehandle, ST_OPT_FOO, ...);
    st_setopt(ehandle, ST_OPT_BAR, ...);
    st_setopt(ehandle, ST_OPT_BAZ, ...);

    // Establish the connection
    if (i_am_client) {
        st_connect(ehandle, peer, &chandle);
    } else {
        st_listen(ehandle);
        st_accept(ehandle, &chandle);
    }

    // Get various fields needed for the STP header
    st_getopt(ehandle, ST_OPT_PORT_LOCAL, &port);
    ...

    // Send a simple STP header to the remote host
    st_tx(chandle, &st_header, ...);

==========================================================
8. Connectionless Mode Initialization
==========================================================
As mentioned previously, the connection-oriented model breaks down if
the application is an extremely large cluster of fully interconnected
processes. The number of connections scales as n-squared, where n is
the number of processes. The connectionless model provides endpoints
that can transfer and receive messages to and from multiple peer
endpoints.
Through a mechanism outside the scope of this document an application
can exchange the necessary STP endpoint information to allow
communication to occur. Thus the data transfer model uses an st_utx()
call which includes the remote MAC (Media Access Control) address as
well as the STP header.

8.1 st_macaddr()
-----------------------------------------
To get the MAC address of the local or remote interface use:

    int st_macaddr(st_ehandle_t ehandle, const char *hostname,
                   st_macaddr_t *macaddr)

Input parameters:
    ehandle: created with st_create()
    hostname: the IP address or hostname to be resolved to a MAC
        address. If NULL then resolve the local interface.
    macaddr: pointer to memory of size st_macaddr_t.
Output parameters:
    macaddr: The MAC address for the hostname. This is typically
        resolved by using ARP.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

The returned MAC address should be viewed as an opaque cookie to be
passed to the st_urx() and st_utx() routines.

8.2 Example C Code
-----------------------------------------
Pseudo-code for a simple peer-to-peer example follows:

    st_ehandle_t ehandle;
    st_macaddr_t macaddr;
    st_header_t st_header;
    char *hostname, *service, *peer;
    int rc;
    uint key;
    char my_name[128];

    hostname = "my_name";        // could also be an IP addr
    service = "3000";            // could also be a name
    peer = "your_name";          // could also be an IP addr

    // Create string containing local hostname and port number
    strcpy(my_name, hostname);
    strcat(my_name, ":");
    strcat(my_name, service);

    // Create an endpoint
    rc = st_endpoint(my_name, &ehandle);

    // Set various options for the endpoint
    st_setopt(ehandle, ST_OPT_FOO, ...);
    st_setopt(ehandle, ST_OPT_BAR, ...);
    st_setopt(ehandle, ST_OPT_BAZ, ...);

    // Resolve the peer's IP address to a MAC address
    st_macaddr(ehandle, peer, &macaddr);

    // Get various fields needed for the STP header
    st_getopt(ehandle, ST_OPT_KEY_LOCAL, &key);
    ...

    // Send a simple STP header to the remote host
    st_utx(ehandle, macaddr, &st_header, ...);

==========================================================
9. Memory Management
==========================================================
Before an STP DATA message can be sent, memory regions on both sides of
the connection must be prepared for use by the API. This will
typically (but not always) involve pinning physical pages of memory so
that they are safe for DMA, as well as updating the bufx table(s) on
the network controller to point to the appropriate buffers. In the
most general case, there are at least ten separate actions that could
be required:

    1) Allocate a bufx range for a connection
    2) Pin a buffer
    3) Associate a pinned buffer with a bufx sub-range
    4) Allocate a Mx
    5) Associate a Mx with a bufx sub-range
    6) Disassociate a Mx from a bufx sub-range
    7) Deallocate a Mx
    8) Disassociate a pinned buffer from a bufx sub-range
    9) Unpin a buffer
    10) Deallocate a bufx range for a connection

Note that potentially every one of the above could involve making a
call into the OS. As we are interested in minimizing overhead, this
would clearly not be an optimal design. We therefore define only two
routines - st_map(), which performs actions 2-5 above, and st_unmap(),
which performs actions 6-9 above. Note that the st_setopt() options
ST_OPT_RX_BUFX_COUNT and ST_OPT_TX_BUFX_COUNT are used for allocation
and deallocation of a bufx range (i.e. 1) and 10) above).

9.1 st_map()
-----------------------------------------

    int st_map(st_ehandle_t ehandle, int flags, void *ptr, int len,
               int base, int count, st_mhandle_t *mhandle, int *mx)

st_map() takes a user-provided buffer and prepares it for use.

Input parameters:
    ehandle
        endpoint handle created with st_create()
    flags
        Valid options are ST_MAP_TX or ST_MAP_RX to mark the buffer as
        send-only or receive-only. A buffer cannot be simultaneously
        used for sending and receiving.
    ptr
        Pointer to the beginning of the user-supplied buffer.
        The buffer's address must be lbuf_size aligned.
    len
        Length of the buffer in bytes. The length must be lbuf_size
        aligned.
    base
        Value of the bufx entry which will correspond to the beginning
        of the buffer.
    count
        The number of entries that the buffer will occupy in the bufx
        table. The maximum value is 64k.
    mhandle
        Pointer to memory of size st_mhandle_t.
    mx
        NULL if ST_MAP_TX is used. Pointer to a uint if ST_MAP_RX is
        used.
Output parameters:
    mhandle
        Opaque pointer which describes the memory region to the
        library.
    mx
        Valid only if ST_MAP_RX is used. It is the STP Mx value
        associated with the buffer.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

The len and count fields are redundant. Given one, the other can be
derived as len = count * bufsize. It is therefore legal to set one of
them to a value of -1, indicating that the other value should be used
to determine the physical extent of the memory region. If neither is
set to -1, then they must exactly agree with each other.

It is legal for overlapping (fully or partially) memory ranges to be
passed in to separate calls to st_map(). It is also legal for a region
to be mapped for transmit by one call and for receive by another call.

The st_map() call is intended to provide TLB-like protection to a range
of bufx entries, so it is not possible to specify granularity less than
a full bufsize entry.

9.2 st_unmap()
-----------------------------------------
To unmap the bufxs, unpin the memory, and free the Mx entry, use:

    int st_unmap(st_mhandle_t mhandle)

Input parameters:
    mhandle: initialized with st_map()
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).
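
The alignment and len/count rules for st_map() can be sketched as a
caller-side pre-check. This is an illustration only, under stated
assumptions: lbuf_size is taken to be the local bufsize in bytes (2
raised to the ST_OPT_BUFSIZE_LOCAL value), passing -1 for both len and
count is treated as an error, and the helper names bufsize_bytes() and
check_map_args() are hypothetical and not part of libst. The helper
uses long long rather than int to avoid overflow in the multiplication.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a power-of-two exponent, as used by ST_OPT_BUFSIZE_LOCAL,
 * into a size in bytes (e.g. 14 -> 16384). */
static size_t bufsize_bytes(unsigned int log2_bufsize)
{
    assert(log2_bufsize < 31);
    return (size_t)1 << log2_bufsize;
}

/* Hypothetical pre-check mirroring the documented st_map() rules.
 * Either *len or *count may be -1; on success both hold consistent
 * values. Returns 0, or a negated errno in the libst style. On
 * failure the values may already have been partially filled in. */
static int check_map_args(const void *ptr, long long *len,
                          long long *count, unsigned int log2_bufsize)
{
    long long bufsize = (long long)bufsize_bytes(log2_bufsize);

    if ((uintptr_t)ptr % (uintptr_t)bufsize != 0)
        return -EINVAL;             /* address not bufsize aligned */
    if (*len == -1 && *count == -1)
        return -EINVAL;             /* assumption: one must be given */
    if (*len == -1)
        *len = *count * bufsize;    /* derive len from count */
    else if (*len % bufsize != 0)
        return -EINVAL;             /* length not bufsize aligned */
    if (*count == -1)
        *count = *len / bufsize;    /* derive count from len */
    if (*len <= 0 || *len != *count * bufsize)
        return -EINVAL;             /* values must agree exactly */
    return 0;
}
```

With the default bufsize exponent of 14, a count of 4 corresponds to a
len of 65536 bytes, and a len of 32768 bytes corresponds to a count
of 2.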

9.3 st_mx_alloc()
-----------------------------------------
To allocate and map a buffer to an mx, use:

    int st_mx_alloc(st_chandle_t chandle, st_mhandle_t mhandle, int *mx);

Input parameters:
    chandle: initialized with st_connect() or st_accept()
    mhandle: initialized with st_map()
    mx: pointer to memory to return the allocated mx
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

9.4 st_mx_free()
-----------------------------------------
To free a previously allocated mx, use:

    int st_mx_free(st_chandle_t chandle, int mx);

Input parameters:
    chandle: initialized with st_connect() or st_accept()
    mx: Mx to be freed. Previously allocated with st_mx_alloc()
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

==========================================================
10. Data Transfer
==========================================================
Data transfer can be done using a connection-oriented model with
st_tx() and st_rx(), or with a connectionless model with st_utx() and
st_urx(). In either case, re-use of a buffer cannot be assumed to be
safe unless either an end-to-end acknowledgement is used or st_flush()
is used to flush the messages through the transmitter's network
interface.

10.1 Connection-oriented Data Transfer
-----------------------------------------

10.1.1 st_tx()
-----------------------------------------
To send STP control or data messages, excluding connection messages,
use the routine:

    int st_tx(void *chandle, st_header_t *hdr, void *opt_pay, int len,
              int bufx, int off)

Input parameters:
    chandle: set up with st_connect() or st_accept()
    hdr: an STP header
    opt_pay: a pointer that is either:
        NULL - no optional payload is specified
        non-NULL, STP control message - points to a 32 byte STP
            optional payload
        non-NULL, STP data message - points to 64 bytes of append data
            (defined below)
    len: length of the payload to be sent from the bufx mapped buffer.
        Only valid for Data messages. This does not include the length
        of the append data.
    bufx: the STP base bufx for the start of the transfer.
    off: the offset within the base bufx to start the transfer.
Output parameters:
    None.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

If hdr contains a data message and len is non-zero, the buffer starting
at bufx:off will be delivered into the receiver's memory. If hdr
contains a control message, the control message will be delivered into
the receiver's Schedule Header Queue.

st_tx() blocks until a transmit request is successfully delivered to
another asynchronous agent. This agent might be implemented as another
thread in libst, as a kernel-level daemon, or on the network controller
itself; the specifics are unimportant as long as progress is guaranteed
to be made in the background after this routine returns. When st_tx()
returns this does NOT mean that the user's data has necessarily been
copied out of the transmit buffer - that is only guaranteed through the
use of st_flush(), described below.

For security reasons only the kernel is allowed to send and/or receive
STP connection messages.

Source bufx:offset:length combinations can cross source bufx
boundaries. As required by the STP specification, if a destination
bufx boundary is crossed, tiling rules apply and the Block will be
subdivided into multiple STUs.

A Note on Tiling: Libst is an STP block oriented API.
It is assumed that libst will handle any tiling that may be required if
the off and len arguments result in a buffer that crosses a bufx
boundary, requiring multiple STUs. If applications need to assume that
all messages are sent as single STUs, then they must slice st_tx()
calls according to the tiling algorithm to ensure this, or use large
pages which will allow a large bufsize and consequently effectively
turn off STU tiling.

10.1.2 st_rx()
-----------------------------------------
To receive STP control or data messages, excluding connection
management messages, use:

    int st_rx(void *chandle, st_header_t *hdr, void *opt_pay,
              struct timeval *timeout)

Input parameters:
    chandle: set up with st_connect() or st_accept()
    hdr: pointer to memory of size st_header_t
    opt_pay: a pointer to memory of size 64 bytes. This allows option
        payload for control messages to be received.
    timeout: If NULL, block until a receive descriptor arrives. If
        non-NULL, it is a timeval structure specifying a timeout
        interval.
Output parameters:
    hdr: the STP header received on the Schedule Header Queue
    opt_pay: if the STP header is a control message, zero or 32 bytes
        of option payload are returned.
Return values:
    Zero means success, no option payload present. One means success,
    option payload is present. Negative value means an error (negate
    value and see sys/errno.h for decode).

Implementations are not required to implement timeouts. At a minimum,
if the timeout argument is non-null and the call would block, it should
return immediately with -EWOULDBLOCK.

st_rx() attempts to receive an incoming descriptor on the appropriate
connection, then copies the STP header and optional payload (if any)
into the user-supplied buffers and returns. The semantics of the
timeout argument are exactly the same as in the BSD select() call; a
NULL pointer will cause st_rx() to block indefinitely while waiting for
a descriptor, otherwise timeout points to a struct timeval which
defines the amount of time that the call should block.
If the timeout expires before a descriptor appears, -EWOULDBLOCK is
returned.

10.2 Connectionless Data Transfer
-----------------------------------------

10.2.1 st_utx()
-----------------------------------------
To send STP control or data messages, excluding connection messages,
use the routine:

    int st_utx(void *ehandle, st_macaddr_t macaddr, st_header_t *hdr,
               void *opt_pay, int len, int bufx, int off)

Input parameters:
    ehandle: endpoint handle created with st_endpoint()
    hdr: see st_tx()
    opt_pay: see st_tx()
    len: see st_tx()
    bufx: see st_tx()
    off: see st_tx()
    macaddr: the destination MAC address. Found through a call to
        st_macaddr().
Output parameters:
    None.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

10.2.2 st_urx()
-----------------------------------------
To receive STP control or data messages, excluding connection
management messages, use:

    int st_urx(void *ehandle, st_macaddr_t *macaddr, st_header_t *hdr,
               void *opt_pay, struct timeval *timeout)

Input parameters:
    ehandle: endpoint handle created with st_endpoint()
    hdr: see st_rx()
    opt_pay: see st_rx()
    macaddr: the destination MAC address. Found through a call to
        st_macaddr(). If NULL, receive from anyone.
    timeout: If NULL, block until a receive descriptor arrives. If
        non-NULL, it is a timeval structure specifying a timeout
        interval.
Output parameters:
    None.
Return values:
    Zero means success, no option payload present. One means success,
    option payload is present. Negative value means an error (negate
    value and see sys/errno.h for decode).

Implementations are not required to implement timeouts. At a minimum,
if the timeout argument is non-null and the call would block, it should
return immediately with -EWOULDBLOCK. Implementations are not required
to support non-NULL macaddr's.

st_urx() attempts to receive an incoming descriptor, then copies the
STP header and optional payload (if any) into the user-supplied buffers
and returns.
The semantics of the timeout argument are exactly the same as in the
BSD select() call; a NULL pointer will cause st_urx() to block
indefinitely while waiting for a descriptor, otherwise timeout points
to a struct timeval which defines the amount of time that the call
should block. If the timeout expires before a descriptor appears,
-EWOULDBLOCK is returned.

10.3 Waiting for Transmissions to Complete
-----------------------------------------
The st_tx() and st_utx() routines return when the transmit request has
been submitted. This does not provide the application with sufficient
information to determine when it is safe to reuse its transmit buffer.

10.3.1 st_flush()
-----------------------------------------
To ensure that all data has been transmitted (but not necessarily
received), use:

    int st_flush(void *ehandle)

Input parameters:
    ehandle: endpoint handle created with st_create()
Output parameters:
    None.
Return values:
    Zero means success. Negative value means an error (negate value
    and see sys/errno.h for decode).

st_flush() blocks until all pending transmits have completed.

==========================================================
11. Miscellaneous
==========================================================

11.1 st_time()
-----------------------------------------
We define the following routine to provide wall-clock measurements:

    double st_time(void);

Each call to st_time() returns the number of seconds that have elapsed
since an undefined start time. Thus the correct way to use this
routine is to call it twice and take the difference.

11.2 st_version()
-----------------------------------------

    const char *st_version(void);

This routine returns an implementation-dependent, human-readable string
that describes which version of the library is in use. It is not
intended to be read directly by applications; it is more of a debugging
aid for users.

==========================================================
Appendix A. Unresolved Issues
==========================================================
These are arranged roughly in priority order.

- Need a way to detect whether an optional payload has been sent to
  st_urx().

- Need to add text for I-bit semantics. Briefly, we currently believe
  that the best idea is to require producer/consumer counters between
  libst and the OS. That way, once libst blocks on a receive, the OS
  will be able to look at an incoming descriptor with the I-bit set and
  reliably determine whether to wake the user process. There are some
  nasty issues related to flow-control on the Rx queue, but these may
  be tractable if we define the semantics of the I-bit with sufficient
  precision.

- Chicken-and-egg problem: how to set the number of desired bufx
  entries before establishing the connection when the connection
  determines which controller will be used (and therefore how much
  memory is associated with each bufx entry)?

- Should there be an option to support disjoint vaddrs in st_map()?

- Different implementations may want/need to restrict the user from
  providing certain fields in the STP header, such as the Key, or the
  Mx in a CTS/MRA. Should such restrictions be
  allowed/required/forbidden, and if so, how?

- We have not yet addressed issues specific to Fetch&Op messages (such
  as data alignment).

- It would probably be nice to define some default error handlers and
  hooks for providing user-level wrappers to the above routines;
  experience with MPI has shown these to be useful features of an API.

- We need more code examples.

- Currently, the user creates all message buffers externally to the API
  and just passes them to st_map(). We may want to define some sort of
  st_malloc() routine which (for example) takes the desired bufsize as
  an input parameter, in order to assist portability.

- Related to the above, what if anything should we do about NUMA
  issues?

- Need to add text on how to handle endianness - should libst magically
  convert all fields, or should we define hton* and ntoh* macros so the
  user can do it all explicitly?

- Need to define error codes for common situations.

- Need to clean up the Configuration section by noting which options
  can only be changed before a connection, after a connection, or
  either.

- It is not currently possible to pass optional payloads as part of the
  accept/connect sequence. How best to fix this? Is it sufficient to
  send them "blindly", or do we want a server to be able to process an
  optional payload before deciding what to send when accepting the
  connection? We could handle this either with some new options or by
  defining a new kind of st_accept() call.

- Do we need a way to tell a potential receiver that the sender has
  disconnected? If so, what should the interface for this look like?
  Perhaps we should just define a new error code for st_urx()? Is this
  something that we can cleanly implement in general?