*** iSCSI Extensions for RDMA (iSER) in tgt ***

This is a detailed description of the iSER tgtd target. It covers
everything from the design to how to set it up manually.

NOTE:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
To run this iSER target you must have the libibverbs and librdmacm
rpms installed on your system. They will not get brought in
automatically when installing this rpm.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

See man tgt-admin and the example /etc/tgt/targets.conf file for how to
set up a persistent configuration that is started when the tgtd service
is started (when "service tgtd start" is run).

Copyright (C) 2007 Pete Wyckoff <pw@osc.edu>
Copyright (C) 2011 Alexander Nezhinsky <alexandern@voltaire.com>

1. Background

1.1. Standards (iSCSI, iSER)

The IETF standards track RFC 5046 extends the iSCSI protocol to work on
RDMA-capable networks as well as on traditional TCP/IP:

    Internet Small Computer System Interface (iSCSI) Extensions for
    Remote Direct Memory Access (RDMA), Mike Ko, October 2007.

It is available online:

    http://tools.ietf.org/html/rfc5046

RDMA stands for Remote Direct Memory Access, a way of accessing the
memory of a remote node directly through the network without involving
the processor of that remote node. Many network devices implement some
form of RDMA. Two of the more popular RDMA-capable network technologies
are InfiniBand (IB) and iWARP. IB uses its own physical and network
layers, while iWARP sits on top of TCP/IP (or SCTP). Using these
devices requires a new application programming interface (API). The
Linux kernel has many components of the OpenFabrics software stack,
including APIs for access from user space and drivers for some popular
RDMA-capable NICs, including IB cards with chipsets from Mellanox and
QLogic, and iWARP cards from NetEffect, Chelsio, and Ammasso. Most
Linux distributions ship the user space libraries for device access and
RDMA connection management.

There is an ongoing effort, still in progress, intended to improve upon
RFC 5046 and address some existing issues. The text of the latest
proposal is available online (note, though, that it may become outdated
quickly):

    http://tools.ietf.org/html/draft-ietf-storm-iser-01

1.2. iSER in tgtd

tgtd is a user space target that supports multiple transports,
including iSCSI/TCP and iSER on RDMA devices.

The original iSER code was written in early 2007 by researchers at the
Ohio Supercomputer Center:

    Dennis Dalessandro <dennis@osc.edu>
    Ananth Devulapalli <ananth@osc.edu>
    Pete Wyckoff <pw@osc.edu>

The authors wanted to use a faster transport to test the capabilities
of an object-based storage device (OSD) emulator. A report describing
this implementation and some performance results appears in IEEE
conference proceedings as:

    Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff, "iSER
    Storage Target for Object-based Storage Devices", Proceedings of
    MSST'07, SNAPI Workshop, San Diego, CA, September 2007.

and is available at:

    http://www.osc.edu/~pw/papers/iser-snapi07.pdf

Slides of the talk with more results and analysis are available at:

    http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf

The original code lived in iscsi/iscsi_rdma.c, with hooks in a few
places in iscsi/iscsid.c: an RDMA transport was added, and some more
functions were virtualized where TCP and RDMA behaved differently.
There was a bug that resulted in occasional data corruption.

The new implementation was written by
Alexander Nezhinsky <alexandern@voltaire.com>.
It defines iSER as a separate transport (and not as a sub-transport of
iSCSI/TCP). One of the main differences between iSCSI/TCP and iSER is
that the former enjoys the stream semantics of TCP and may work in a
synchronous manner, while the latter's flow is intrinsically
asynchronous and message based. Implementing a synchronous flow within
an asynchronous framework is relatively natural, while fitting an
asynchronous flow into a synchronous framework usually meets a few
obstacles, resulting in a sub-optimal design.

The main reason to define iser as a separate transport (an example of
such an obstacle) was to decouple the rx/tx flow from the
EPOLLIN/EPOLLOUT events originally used to poll TCP sockets. See the
"Event management" section below for details. Although one day we may
return to a common tcp/rdma transport, for now a separate transport LLD
(named "iser") is defined.

Other changes include avoiding memory copies, using a memory pool
shared between connections with a "patient" memory allocation
mechanism, etc.

Source-wise, a new header "iser.h" is created, and "iscsi_rdma.c" is
replaced by "iser.c". The file iser_text.c contains the iscsi-text
processing code replicated from iscsid.c. This is done because the
functions there are not general enough and rely on specifics of the
iscsi/tcp structs. This file will hopefully be removed in the future.

2. Design

2.1. General Notes

In general, a SCSI system includes two components, an initiator and a
target. The initiator submits commands and awaits responses. The
target services commands from initiators and returns responses. Data
may flow from the initiator, to the initiator, or in both directions
(bidirectional).

The iSER specification requires all data transfers to be started by the
target, regardless of direction. In a read operation, the target uses
RDMA Write to move data to the initiator, while a write operation uses
RDMA Read to fetch data from the initiator.

2.2. Memory registration

One of the most severe stumbling blocks in moving any application to
take advantage of RDMA features is memory registration. Before being
used for RDMA, both the sending and receiving buffers must be
registered with the operating system. This operation ensures that the
underlying physical pages will not be paged out or moved during the
transfer, and provides the physical addresses of the buffers to the
network card. However, the process itself is time consuming and CPU
intensive. Previous investigations have shown that for InfiniBand, the
throughput drops by up to 40% when memory registration and
deregistration are included in the critical path.

This iSER implementation uses pre-registered buffers for RDMA
operations. In general such a scheme is difficult to justify due to
the large per-connection resource requirements. However, in this
application it may be appropriate. Since the target always initiates
RDMA operations and never advertises RDMA buffers, it can securely use
one pool of buffers for multiple clients and can manage its memory
resources explicitly. Also, the architecture of the code is such that
the iSCSI layer dictates incoming and outgoing buffer locations to the
storage device layer, so supplying a registered buffer is relatively
easy.
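As an illustration of the pre-registered pool idea, the sketch below
registers one large region with libibverbs and carves it into
fixed-size buffers, so no registration calls appear on the I/O path.
This is a sketch only, not the actual tgtd/iser.c code; names such as
buf_pool, buf_pool_init and buf_pool_get are invented for the example.

    /*
     * Sketch only: register one large region up front and carve it into
     * fixed-size buffers, so that no ibv_reg_mr()/ibv_dereg_mr() calls
     * appear on the I/O path.  Names are illustrative.
     */
    #include <stdlib.h>
    #include <unistd.h>
    #include <infiniband/verbs.h>

    struct buf_pool {
        void          *base;      /* start of the registered region */
        struct ibv_mr *mr;        /* one MR covering all buffers */
        size_t         buf_size;
        int            nbufs;
    };

    static int buf_pool_init(struct buf_pool *pool, struct ibv_pd *pd,
                             size_t buf_size, int nbufs)
    {
        size_t len = buf_size * nbufs;

        if (posix_memalign(&pool->base, sysconf(_SC_PAGESIZE), len))
            return -1;

        /*
         * Local write access is enough: the target never advertises these
         * buffers remotely, it only uses them as the local side of its own
         * RDMA Read/Write operations and for receives.
         */
        pool->mr = ibv_reg_mr(pd, pool->base, len, IBV_ACCESS_LOCAL_WRITE);
        if (!pool->mr) {
            free(pool->base);
            return -1;
        }
        pool->buf_size = buf_size;
        pool->nbufs = nbufs;
        return 0;
    }

    /* The i-th pre-registered buffer; its lkey is simply pool->mr->lkey. */
    static void *buf_pool_get(struct buf_pool *pool, int i)
    {
        return (char *)pool->base + (size_t)i * pool->buf_size;
    }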
2.3. Event management

As mentioned above, there is a mismatch between what the iscsid
framework assumes and what the RDMA notification interface provides.
The existing TCP-based iSCSI target code has one file descriptor per
connection, and it is driven by readability or writability of the
socket. A single poll system call returns which sockets can be
serviced, driving the TCP code to read or write as appropriate.

The RDMA interface is also represented by a single file descriptor,
created by the driver responsible for the hardware. This file
descriptor becomes readable when interrupts are requested from the
network card on work request completions, which is done only after a
sufficiently long period of quiescence. Further completions can be
polled and retrieved without re-arming the interrupts. Besides this
first difference, the RDMA device file descriptor cannot and should not
be polled for writability, as messages or RDMA transfer requests may be
issued asynchronously at any time.

Moreover, the existing sockets-based code goes beyond this and changes
the bitmask of requested events to control its code flow. For
instance, after it finishes sending a response, it will modify the
bitmask to look only for readability. Even if the socket is writable,
there is no data to write, hence polling for that status is not useful.
The code also disables new message arrival during command execution as
a sort of exclusion facility, again by modifying the bitmask. As this
cannot be done with the RDMA interface, the original code had to
maintain an active list of tasks having data to write and to drive a
progress engine to service them. The progress was tracked by a
counter, and the tgtd event loop checked this counter and called into
the iSER-specific code while the counter was still non-zero. This
scheme was quite unnatural and error-prone.

The new implementation issues all SEND requests asynchronously.
Besides, it relies heavily upon scheduled events that are injected into
the event loop with no dependence on file descriptors. It schedules
such events to poll for new RDMA completions, in the hope that new ones
are ready. If no completion arrives after a certain number of polls,
then interrupts are requested and further progress is driven through
the file-based event mechanism. Note that only the first completion is
signaled in this manner; as long as new completions keep arriving, they
are retrieved by polling only.

Other internal events of the same kind (such as tasks requesting a
send, commands that are ready for submission, etc.) are grouped on
appropriate lists, and special events are scheduled for them. This
allows a few tasks to be processed in a batched manner, in order to
optimize RDMA and other operations where possible.
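The following is a rough sketch of this poll-then-arm scheme, again
illustrative rather than the actual iser.c code (POLL_BUDGET,
IDLE_POLLS and the function names are invented): completions are
drained by polling from scheduled events, and the completion channel
interrupt is re-armed only after the queue has stayed empty for several
polls in a row.

    /* Sketch only; constants and names are illustrative. */
    #include <infiniband/verbs.h>

    #define POLL_BUDGET 16   /* completions drained per scheduled call */
    #define IDLE_POLLS   4   /* empty polls before falling back to interrupts */

    static int cq_service(struct ibv_cq *cq, int *idle_count)
    {
        struct ibv_wc wc[POLL_BUDGET];
        int n, i;

    again:
        n = ibv_poll_cq(cq, POLL_BUDGET, wc);
        if (n < 0)
            return n;

        for (i = 0; i < n; i++) {
            /* dispatch wc[i]: recv, send, RDMA Read or RDMA Write done */
        }

        if (n > 0) {
            *idle_count = 0;          /* stay in polling mode */
            return n;
        }

        if (++(*idle_count) == IDLE_POLLS) {
            /*
             * Quiescent for a while: ask for an interrupt on the next
             * completion, then poll once more to close the race in which
             * a completion slipped in just before the CQ was re-armed.
             */
            if (ibv_req_notify_cq(cq, 0))
                return -1;
            goto again;
        }
        return 0;
    }

    /* Called from the event loop when the completion channel fd is readable. */
    static void cq_fd_handler(struct ibv_comp_channel *ch)
    {
        struct ibv_cq *cq;
        void *cq_ctx;
        int idle = 0;

        if (ibv_get_cq_event(ch, &cq, &cq_ctx))
            return;
        ibv_ack_cq_events(cq, 1);
        cq_service(cq, &idle);        /* switch back to polling mode */
    }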
2.4. RDMA-only mode

The code assumes an RDMA-only mode of operation. This means that the
"first burst", including immediate data, should be disabled, so that
the entire data transfer is performed using RDMA. This mode is perhaps
the most suitable one for iser in the majority of scenarios. The only
concern is about relatively small WRITE I/Os, which could theoretically
enjoy lower latencies using IB SEND instead of RDMA-RD. Implementing
that is currently precluded because it would lead to multiple buffers
per iSER task (e.g. an ImmediateData buffer received with the command
PDU, and the rest of the data retrieved using RDMA-RD), which is not
supported by the existing tgt backing stores.

The RDMA-only mode is achieved by setting:

    target->session_param[ISCSI_PARAM_INITIAL_R2T_EN].val = 1;
    target->session_param[ISCSI_PARAM_IMM_DATA_EN].val = 0;

which is hardcoded in iser_target_create().

2.5. Padding

The iSCSI specification clearly states that all segments in the
protocol data unit (PDU) must be individually padded to four-byte
boundaries. However, the iSER specification remains mute on the
subject of padding. It is clear from an implementation perspective
that padding data segments is both unnecessary and would add
considerable overhead to implement (possibly a memory copy or an extra
SG entry on the initiator when sending directly from user memory).
RDMA is used to move all data, with byte granularity provided by the
network. The need for padding in the TCP case was motivated by the
optional marker support, which works around the limitations of the
streaming mode of TCP. IB and iWARP are message-based networks and
never need markers. And finally, the Linux initiator does not add
padding either.

3. Using iSER

3.1. Running tgtd

Start the daemon (as root):

    ./tgtd

It will send messages to syslog. You can add "-d 1" to turn on debug
messages. Debug messages can also be turned on and off at run time
using the following commands:

    ./tgtadm --mode system --op update --name debug --value on
    ./tgtadm --mode system --op update --name debug --value off

The target will listen on all TCP interfaces (as usual), as well as on
all RDMA devices. Both use the same default iSCSI port, 3260. Clients
on TCP or RDMA will connect to the same tgtd instance.

3.2. Configuring tgtd

Configure the running target with one or more devices, using the
tgtadm program you just built (also as root). Full information is
available in doc/README.iscsi. The only difference is the name of the
LLD, which should be "iser". Here is a quick-start example:

    ./tgtadm --lld iser --mode target \
        --op new --tid 1 --targetname "iqn.$(hostname).t1"
    ./tgtadm --lld iser --mode target \
        --op bind --tid 1 --initiator-address ALL
    ./tgtadm --lld iser --mode logicalunit \
        --op new --tid 1 --lun 1 \
        --backing-store /dev/sde --bstype rdwr
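The same setup can be made persistent across restarts of the tgtd
service via /etc/tgt/targets.conf, as mentioned at the top of this
document. A minimal entry might look roughly like the sketch below;
the IQN and backing device are placeholders, and the default-driver
directive is assumed to be supported by your tgt-admin version. See
tgt-admin(8) and the shipped example targets.conf for the
authoritative syntax.

    # Illustrative only; adjust the IQN and backing device to your setup.
    default-driver iser

    <target iqn.2001-04.com.example:t1>
        backing-store /dev/sde
        initiator-address ALL
    </target>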
3.3. Initiator side

To make your initiator use RDMA, make sure the "ib_iser" module is
loaded in your kernel. Then do discovery as usual, over TCP:

    iscsiadm -m discovery -t sendtargets -p $targetip

where $targetip is the IP address of your IPoIB interface. Discovery
traffic will use IPoIB, but login and the full feature phase will use
RDMA natively. Then do something like the following to change the
transport type:

    iscsiadm -m node -p $targetip -T $targetname --op update \
        -n node.transport_name -v iser

Next, login as usual:

    iscsiadm -m node -p $targetip -T $targetname --login

And access the new block device, e.g. /dev/sdb.

Note that separate iscsi and iser transports mean that you should know
which targets are configured as iser and which as iscsi/tcp. If you
try to login over tcp to a target configured as iser, the login will
fail. And vice versa, trying to login over iser to a target configured
as iscsi/tcp will not succeed either. Because an iscsi target has no
means of reporting its RDMA capabilities, you have to try to login over
iser to every target reported by SendTargets. If that fails and you
still want to access the target over tcp, then change the transport
name back to "tcp" and try to login again.

Some distributions include a script named "iscsi_discovery", which
accomplishes just this. If you wish to login either over iser or over
tcp:

    iscsi_discovery $targetip -t iser

This will login, changing transports if necessary. Then, if
successful, it will logout, leaving the target's record with the
appropriate transport setting. If you are interested only in iser
targets, then add "-f", forcing the transport to be iser. Note also
that the port can be specified explicitly:

    iscsi_discovery $targetip -p $targetport -t iser -f

This will cancel the login retry over tcp in case of an initial
failure.

4. Errata

4.1. Pre-2.6.21 mthca driver bug

There is a major bug in the mthca driver in Linux kernels before
2.6.21. This includes the popular RHEL5 kernels, such as
2.6.18-8.1.6.el5 and possibly later. The critical commit is:

    608d8268be392444f825b4fc8fc7c8b509627129
    IB/mthca: Fix data corruption after FMR unmap on Sinai

If you use single-port memfree cards, SCSI read operations will
frequently result in randomly corrupted memory, leading to bad
application data or unexplainable kernel crashes. Older kernels are
also missing some nice iSCSI changes that avoid crashes in some
situations where the target goes away. Stock kernel.org kernels after
2.6.21 have been tested and are known to work.

4.2. Bidirectional commands

The Linux kernel iSER initiator currently lacks support for
bidirectional transfers and for extended command descriptor blocks
(CDBs). Progress toward adding this is being made, with patches
frequently appearing on the relevant mailing lists.

4.3. ZBVA

The Linux kernel iSER initiator uses a different header structure on
its packets than the one in the iSER specification. This is described
in an InfiniBand document and is required for that network, which
supports only Zero-Based Virtual Addressing (ZBVA). If you are using a
non-IB initiator that does not need this header extension, it will not
work with tgtd. There may be some way to negotiate the header format.
Using iWARP hardware devices with the Linux kernel iSER initiator also
will not work, due to its reliance on fast memory registration (FMR),
an InfiniBand-only feature.

4.4. MaxOutstandingUnexpectedPDUs

The current code sizes its per-connection resource consumption based on
negotiated parameters. However, the Linux iSER initiator does not
support negotiation of MaxOutstandingUnexpectedPDUs, so that value is
hard-coded in the target.

4.5. TargetRecvDataSegmentLength

Also, open-iscsi is hard-coded with a very small value of
TargetRecvDataSegmentLength, so even though the target would be willing
to accept a larger size, it cannot. This may limit performance of
small transfers on high-speed networks: transfers bigger than 8 kB, but
not large enough to amortize a round-trip for RDMA setup.

4.6. Multiple devices

The iser code has been successfully tested with multiple InfiniBand
devices.

4.7. SCSI command size

The single-buffer-per-SCSI-command limitation has another implication
(besides the "RDMA-only mode" issue, see above). The RDMA buffer pool
is currently created with buffers of 512KB each (see the "Memory
registration" section above). This is suitable for most Linux
initiators, which split all transfers into SCSI commands of up to
128KB, 256KB or 512KB (depending on the system). Initiators that issue
explicit SCSI commands with sizes greater than 512KB will be unable to
work with the current iser implementation. Once multiple buffers are
supported by the backing stores, this limitation can be eliminated in a
relatively simple manner.