------
ST Bypass API Level 0
5/10/99 draft by Eric Salo (salo@sgi.com)

I.    List of All Routines
II.   Datatypes
III.  Constructors and Destructors
IV.   Configuration
V.    Connection Setup
VI.   Memory Management
VII.  Data Transfer
VIII. Miscellaneous
IX.   Unresolved Issues


I. List of All Routines
=======================

Constructors and Destructors:

    st_create()
    st_delete()

Configuration:

    st_getopt()
    st_setopt()

Connection Setup:

    st_listen()
    st_accept()
    st_connect()
    st_close()

Memory Management:

    st_map()
    st_unmap()

Data Transfer:

    st_tx()
    st_rx()
    st_flush()

Miscellaneous:

    st_time()
    st_version()


II. Datatypes
=============

We define the following datatypes for our API:

    typedef void *st_chandle_t;     // Handle to a ST connection
    typedef void *st_mhandle_t;     // Handle to a region of memory
    typedef unsigned int uint;      // unsigned integer
    typedef long long int64_t;      // 64-bit signed integer

*** NEED TO ADD text for other "standard" integer sizes here


III. Constructors and Destructors
=================================

To create an endpoint for a ST connection, we use the following:

    typedef void *st_chandle_t;     // handle to a connection

    int st_create(st_chandle_t *chandle);

This routine allocates (and initializes) storage in libst for all state
corresponding to a ST connection. To deallocate this storage, we use:

    int st_delete(st_chandle_t chandle);

If passed a handle to a still-open connection, st_delete() will silently
close that connection before deleting any state, just as if the
application had issued a call to st_close(). Note that this may involve
waiting for any pending transmissions to flush.


IV. Configuration
=================

There are a great many parameters for a ST bypass connection, which in
general may be read or written either before or after the connection
itself has been established.

To read the current value of a ST parameter, we define the following:

    int st_getopt(st_chandle_t chandle, int optname, void *optval);

Similarly, to set the value of a ST parameter we define the following:

    int st_setopt(st_chandle_t chandle, int optname, void *optval);

We also define the following values as legal options for the ’optname’
argument:

    ST_OPT_BUFSIZE_LOCAL
        The bufsize used by the local controller.

    ST_OPT_BUFSIZE_REMOTE
        The bufsize used by the remote controller.

    ST_OPT_BUFX_BASE
        The smallest bufx in the contiguous range reserved for this
        connection.

    ST_OPT_BUFX_COUNT
        The number of contiguous bufx entries reserved for this
        connection.

    ST_OPT_CHANNELS
        Which of the four channels defined by the ST protocol are
        supported for this connection. (Channel 0 is denoted by bit 0,
        channel 1 by bit 1, etc.)

    ST_OPT_IANA_LOCAL
        The local IANA port for the connection.

    ST_OPT_IANA_REMOTE
        The remote IANA port for the connection.

    ST_OPT_KEY_LOCAL
        The key value used by the local controller when processing
        incoming messages.

    ST_OPT_KEY_REMOTE
        The key value used by the remote controller when processing
        incoming messages.

    ST_OPT_MAXSTU_LOCAL
        The maximum STU size supported by the local controller.

    ST_OPT_MAXSTU_REMOTE
        The maximum STU size supported by the remote controller.

    ST_OPT_PORT_LOCAL
        The local ST port for the connection.

    ST_OPT_PORT_REMOTE
        The remote ST port for the connection.

    ST_OPT_PORTLEN_LOCAL
        The "vector" size for the local ST port.

    ST_OPT_PORTLEN_REMOTE
        The "vector" size for the remote ST port.

    ST_OPT_RX_SLOTS
        The total number of Rx slots available locally.

    ST_OPT_RX_WINDOW_LOCAL
        The number of local Rx slots that we wish to advertise to the
        remote host during connection setup.

    ST_OPT_RX_WINDOW_REMOTE
        The number of remote Rx slots advertised to us by the remote
        host during connection setup.

    ST_OPT_THREAD_SAFETY
        The degree of thread-safety that is either desired from or
        supported by the local implementation per connection.


V. Connection Setup
===================

The bypass library provides the following routines which manage all
connection setup and teardown:

    int st_listen(st_chandle_t chandle, const char *hostname,
                  const char *service);

        Prepares a server to accept incoming connection requests from a
        client.

    int st_accept(st_chandle_t chandle);

        Blocks until an incoming connection has arrived and been
        completed.

    int st_connect(st_chandle_t chandle, const char *hostname,
                   const char *service);

        Establishes a connection with a remote server.

    int st_close(st_chandle_t chandle);

        Closes a connection. If there are still pending transmit
        operations on this connection, the call will block until they
        have been cleared, exactly as if the application had called
        st_flush(chandle, -1, NULL).

As these wrappers are similar in functionality to their BSD namesakes,
programmers already familiar with the sockets API should have little
difficulty in learning how to use them. Pseudo-code for a simple
client-server example follows:

    st_chandle_t chandle;           // filled in by st_create()
    st_header_t st_header;          // header to send once connected
    const char *hostname, *service;
    int key;

    // Create an empty handle
    st_create(&chandle);

    // Set various options for the connection
    st_setopt(chandle, ST_OPT_FOO, ...);
    st_setopt(chandle, ST_OPT_BAR, ...);
    st_setopt(chandle, ST_OPT_BAZ, ...);

    // Establish the connection
    if (i_am_client) {
        hostname = "server_host";   // could also be an IP addr
        service = "3000";           // could also be a name
        sleep(3);                   // ugly hack - give the server time to
                                    // call st_listen()
        st_connect(chandle, hostname, service);
    } else {
        hostname = NULL;            // use default local address
        service = "3000";           // use any IANA port
        st_listen(chandle, hostname, service);
        st_accept(chandle);
    }

    // Get various fields needed for the ST header
    st_getopt(chandle, ST_OPT_KEY_REMOTE, &key);
    ...

    // Send a simple ST header to the remote host
    st_tx(chandle, &st_header, ...);


VI. Memory Management
=====================

Before a ST DATA message can be sent, memory regions on both sides of
the connection must be prepared for use by the API. This will typically
(but not always) involve pinning physical pages of memory so that they
are safe for DMA, as well as updating the bufx table(s) on the network
controller to point to the appropriate buffers. In the most general
case, there are at least ten separate actions that could be required:

     1) Allocate a bufx range for a connection
     2) Pin a buffer
     3) Associate a pinned buffer with a bufx sub-range
     4) Allocate a Mx
     5) Associate a Mx with a bufx sub-range
     6) Disassociate a Mx from a bufx sub-range
     7) Deallocate a Mx
     8) Disassociate a pinned buffer from a bufx sub-range
     9) Unpin a buffer
    10) Deallocate a bufx range for a connection

Note that potentially every one of the above could involve making a
call into the OS. As we are interested in minimizing overhead, this
would clearly not be an optimal design. We therefore propose defining
only two new routines - one which sets up a buffer and another which
tears it down again - which between them cover actions 2-9 above:

    st_map(st_chandle_t *chandle, int flags, void *ptr, size_t len,
           int bufx_base, int bufx_count, st_mhandle_t *mhandle,
           int *Mx);

    st_unmap(st_mhandle_t *mhandle);

st_map() takes a user-provided buffer and prepares it for use. The
arguments it takes are as follows:

    chandle       Opaque pointer which describes a ST bypass connection
                  to the library.

    flags         Any combination of { ST_PIN_TX, ST_PIN_RX } to mark
                  the buffer as send-only, receive-only, or
                  send-receive.

    ptr           Pointer to the beginning of the user-supplied buffer.

    len           Length of the buffer in bytes.

    bufx_base     Value of the bufx entry which will correspond to the
                  beginning of the buffer.

    bufx_count    The number of entries that the buffer will occupy in
                  the bufx table.

    mhandle       Opaque pointer which describes the memory region to
                  the library. (Returned to caller.)

    Mx            Mx value associated with buffer if ST_PIN_RX is set.
                  (Returned to caller.)

The ’len’ and ’bufx_count’ fields are slightly redundant; given one,
the other can be derived. It is therefore legal to set one of them to a
value of -1, indicating that the other value should be used to
determine the physical extent of the memory region. If neither is set
to -1, then they must exactly agree with each other.

The bufx range for the connection (which is allocated and deallocated
in actions 1 and 10 above, respectively) will be handled by the
st_setopt() call as described in the Configuration section, so no
explicit routines are needed for that functionality.

It is legal for overlapping (fully or partially) memory ranges to be
passed in to separate calls to st_map(). It is also legal for a region
to be mapped for transmit by one call and for receive by another call.

Note that the st_map() call is intended to provide TLB-like protection
to a range of bufx entries, so it is not possible to specify
granularity less than a full bufx entry.


VII. Data Transfer
==================

The three routines which control message traffic are the following:

    st_tx()
    st_rx()
    st_flush()

Send a ST Header
----------------

To send any non-connection ST headers, we use the following routine:

    int st_tx(st_chandle_t *chandle, st_header_t *hdr, void *opt_pay,
              uint len, uint bufx, uint off);

Blocks until a transmit request is successfully delivered to another
asynchronous agent. This agent might be implemented as another thread
in libst, or as a kernel-level daemon, or on the network controller
itself; the specifics are unimportant as long as progress is guaranteed
to be made in the background after this routine returns. Note that this
does NOT mean that the user’s data has necessarily been copied out of
the transmit buffer - that information is only available through the
st_flush() call, described below.

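The difference between handing a request to the asynchronous agent and
actually completing it can be modeled in a few lines of C. This is only
a sketch under stated assumptions: tx_model_t, model_tx(),
model_agent_progress() and model_buffer_reusable() are hypothetical
names that are not part of libst, and the "agent" here is simply a
function that the caller steps by hand rather than a real thread,
daemon, or controller.

```c
#include <stdint.h>

/* Hypothetical model of one connection's transmit state (not libst). */
typedef struct {
    int64_t submitted;  /* requests handed to the asynchronous agent */
    int64_t completed;  /* requests the agent has finished sending   */
} tx_model_t;

/* Models the hand-off: returns as soon as the request is queued.
 * The return value is the sequence number of this request. */
static int64_t model_tx(tx_model_t *m)
{
    return ++m->submitted;
}

/* Models background progress: the agent finishes up to n requests,
 * never more than have been submitted. */
static void model_agent_progress(tx_model_t *m, int64_t n)
{
    while (n-- > 0 && m->completed < m->submitted)
        m->completed++;
}

/* The buffer used by request 'seq' may be reused only after the agent
 * has completed that request. */
static int model_buffer_reusable(const tx_model_t *m, int64_t seq)
{
    return m->completed >= seq;
}
```

In this model, a buffer passed to model_tx() is not reusable when the
call returns; it becomes reusable only after model_agent_progress() has
advanced the completed count up to that request's sequence number.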

If ’hdr’ contains a control message and ’opt_pay’ is non-NULL, a
32-byte optional payload is delivered to the receiver along with the ST
header. If ’hdr’ contains a data message and ’len’ is non-zero, the
buffer starting at offset ’off’ within bufx entry ’bufx’ will be
delivered into the receiver’s memory.

Note: st_header_t is a complete ST header, as defined in st.h.

A Note on Tiling: It is assumed that libst will handle any tiling that
may be required if the ’len’ and ’off’ arguments result in a buffer
that crosses a bufx boundary, requiring multiple STUs. (** This has
implications for Mx values - need to add more text here.) We also need
a way to disable this feature, which can be done either via a new
’literal-mode’ option or by setting the remote STU size to be something
enormous.

Receive a ST Header
-------------------

To receive any non-connection ST headers, we use the following routine:

    int st_rx(st_chandle_t *chandle, st_header_t *hdr, void *opt_pay,
              struct timeval *timeout);

Attempts to receive an incoming descriptor on the appropriate
connection. Copies the ST header and optional payload (if any) into the
user-supplied buffers and returns. The semantics of the ’timeout’
argument are exactly the same as in the BSD select() call; a NULL
pointer will cause st_rx() to block indefinitely while waiting for a
descriptor, otherwise ’timeout’ points to a struct which defines the
amount of time that the call should block. If the timeout expires
before a descriptor appears, EWOULDBLOCK is returned.

Wait for Transmissions to Complete
----------------------------------

The st_tx() routine returns only when some asynchronous agent has
received the request to transmit; it does not provide the application
with sufficient information to determine when it is safe to reuse its
transmit buffer. Thus, the following routine:

    st_flush(st_chandle_t *chandle, int64_t threshold, int64_t *count);

This routine blocks until the total number of ST headers sent on the
given connection meets or exceeds ’threshold’. If ’threshold’ equals
-1, then it blocks until all pending transmits have completed.
The total number of ST headers sent at the moment this routine returns
is copied into ’count’, which may also be passed as a NULL pointer to
indicate "don’t care". Note that by setting ’threshold’ to zero, this
routine can be used to obtain an instantaneous (read: non-blocking)
snapshot of the current count.


VIII. Miscellaneous
===================

Timer
-----

We define the following routine to provide wall-clock measurements:

    double st_time(void);

Each call to st_time() returns the number of seconds that have elapsed
since some (undefined) epoch. The correct way to use this routine,
then, is to call it twice and take the difference.

Library Version
---------------

    const char *st_version(void);

This routine returns an implementation-dependent, human-readable string
that describes which version of the library is in use. It is not really
intended to be read directly by applications; it is more of a debugging
aid for users.


IX. Unresolved Issues
=====================

- We have not yet addressed issues specific to Fetch&Op messages (such
  as data alignment).

- It would probably be nice to define some default error handlers and
  hooks for providing user-level wrappers to the above routines;
  experience with MPI has shown these to be useful features of an API.

- We need more code examples, especially for the "vectorized"
  connections.

- Currently, the user creates all message buffers externally to the API
  and just passes them to st_map(). We may want to define some sort of
  st_malloc() routine which (for example) takes the desired bufsize as
  an input parameter, in order to assist portability.

- Related to the above, what if anything should we do about NUMA
  issues?

- Different implementations may want/need to restrict the user from
  providing certain fields in the ST header, such as the Key, or the Mx
  in a CTS/MRA. Should such restrictions be allowed/required/forbidden,
  and if so, how?

- Should there be an option to support disjoint vaddrs in st_map()?

- Need to add text on how to handle endianness - should libst magically
  convert all fields, or should we define hton* and ntoh* macros so the
  user can do it all explicitly?

- Need to define error codes for common situations.

- Need a way to detect whether an optional payload has been sent to
  st_rx().

- Need to clean up the Configuration section by noting which options
  can only be changed before a connection, after a connection, or
  either.

- Chicken-and-egg problem: how to set the number of desired bufx
  entries before establishing the connection when the connection
  determines which controller will be used (and therefore how much
  memory is associated with each bufx entry)?

- It is not currently possible to pass optional payloads as part of the
  accept/connect sequence. How best to fix this? Is it sufficient to
  send them "blindly", or do we want a server to be able to process an
  optional payload before deciding what to send when accepting the
  connection? We could handle this either with some new options or by
  defining a new kind of st_accept() call.

- Do we need a way to tell a potential receiver that the sender has
  disconnected? If so, what should the interface for this look like?
  Perhaps we should just define a new error code for st_rx()? Is this
  something that we can cleanly implement in general?

- Is the ’hostname’/’service’ addressing scheme sufficient for all
  potential ST endpoints (such as SCSI)? Would it be cleaner to combine
  these two arguments into a single string instead?

- Need to add text for I-bit semantics. Briefly, we currently believe
  that the best idea is to require producer/consumer counters between
  libst and the OS. That way, once libst blocks on a receive, the OS
  will be able to look at an incoming descriptor with the I-bit set and
  reliably determine whether to wake the user process. There are some
  nasty issues related to flow-control on the Rx queue, but these may
  be tractable if we define the semantics of the I-bit with sufficient
  precision.

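One possible shape for the producer/consumer counters mentioned in the
last item is sketched below in C. Everything in it is an assumption
made for illustration: the rx_counters_t layout, the function names,
and the wake rule itself are hypothetical and are not semantics defined
by this draft. A real implementation would also need atomic updates and
memory barriers, which are omitted here.

```c
#include <stdint.h>

/* Hypothetical counters shared by the OS and libst (assumed layout). */
typedef struct {
    uint64_t produced;  /* bumped by the OS for each descriptor queued     */
    uint64_t consumed;  /* bumped by libst for each descriptor dequeued    */
    int      blocked;   /* nonzero while the process sleeps in a receive   */
} rx_counters_t;

/* libst side: ask to sleep. Returns 1 if the process should sleep, or
 * 0 if a descriptor is already pending and the receive should simply be
 * retried. Comparing the counters here is what avoids a lost wakeup. */
static int rx_try_block(rx_counters_t *c)
{
    if (c->produced != c->consumed)
        return 0;
    c->blocked = 1;
    return 1;
}

/* OS side: a descriptor has arrived. Returns 1 if the user process
 * should be woken (it is blocked and the descriptor has the I-bit set). */
static int rx_on_arrival(rx_counters_t *c, int i_bit)
{
    c->produced++;
    if (c->blocked && i_bit) {
        c->blocked = 0;
        return 1;
    }
    return 0;
}

/* libst side: take one descriptor off the queue. Returns 1 on success,
 * 0 if the queue is empty. */
static int rx_consume(rx_counters_t *c)
{
    if (c->consumed == c->produced)
        return 0;
    c->consumed++;
    return 1;
}
```

Under these assumptions, a descriptor that arrives without the I-bit
leaves a blocked process asleep, and a process that tries to block
while descriptors are still pending is told to retry instead. The
flow-control issues on the Rx queue are not modeled.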
--
Eric Salo
Silicon Graphics
salo@sgi.com