I/O Barriers ============ Tejun Heo <htejun@gmail.com>, July 22 2005 I/O barrier requests are used to guarantee ordering around the barrier requests. Unless you're crazy enough to use disk drives for implementing synchronization constructs (wow, sounds interesting...), the ordering is meaningful only for write requests for things like journal checkpoints. All requests queued before a barrier request must be finished (made it to the physical medium) before the barrier request is started, and all requests queued after the barrier request must be started only after the barrier request is finished (again, made it to the physical medium). In other words, I/O barrier requests have the following two properties. 1. Request ordering Requests cannot pass the barrier request. Preceding requests are processed before the barrier and following requests after. Depending on what features a drive supports, this can be done in one of the following three ways. i. For devices which have queue depth greater than 1 (TCQ devices) and support ordered tags, block layer can just issue the barrier as an ordered request and the lower level driver, controller and drive itself are responsible for making sure that the ordering constraint is met. Most modern SCSI controllers/drives should support this. NOTE: SCSI ordered tag isn't currently used due to limitation in the SCSI midlayer, see the following random notes section. ii. For devices which have queue depth greater than 1 but don't support ordered tags, block layer ensures that the requests preceding a barrier request finishes before issuing the barrier request. Also, it defers requests following the barrier until the barrier request is finished. Older SCSI controllers/drives and SATA drives fall in this category. iii. Devices which have queue depth of 1. This is a degenerate case of ii. Just keeping issue order suffices. Ancient SCSI controllers/drives and IDE drives are in this category. 2. Forced flushing to physical medium Again, if you're not gonna do synchronization with disk drives (dang, it sounds even more appealing now!), the reason you use I/O barriers is mainly to protect filesystem integrity when power failure or some other events abruptly stop the drive from operating and possibly make the drive lose data in its cache. So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order. There are four cases, i. No write-back cache. Keeping requests ordered is enough. ii. Write-back cache but no flush operation. There's no way to guarantee physical-medium commit order. This kind of devices can't to I/O barriers. iii. Write-back cache and flush operation but no FUA (forced unit access). We need two cache flushes - before and after the barrier request. iv. Write-back cache, flush operation and FUA. We still need one flush to make sure requests preceding a barrier are written to medium, but post-barrier flush can be avoided by using FUA write on the barrier itself. How to support barrier requests in drivers ------------------------------------------ All barrier handling is done inside block layer proper. All low level drivers have to are implementing its prepare_flush_fn and using one the following two functions to indicate what barrier type it supports and how to prepare flush requests. Note that the term 'ordered' is used to indicate the whole sequence of performing barrier requests including draining and flushing. typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq); int blk_queue_ordered(request_queue_t *q, unsigned ordered, prepare_flush_fn *prepare_flush_fn); @q : the queue in question @ordered : the ordered mode the driver/device supports @prepare_flush_fn : this function should prepare @rq such that it flushes cache to physical medium when executed For example, SCSI disk driver's prepare_flush_fn looks like the following. static void sd_prepare_flush(request_queue_t *q, struct request *rq) { memset(rq->cmd, 0, sizeof(rq->cmd)); rq->cmd_type = REQ_TYPE_BLOCK_PC; rq->timeout = SD_TIMEOUT; rq->cmd[0] = SYNCHRONIZE_CACHE; rq->cmd_len = 10; } The following seven ordered modes are supported. The following table shows which mode should be used depending on what features a device/driver supports. In the leftmost column of table, QUEUE_ORDERED_ prefix is omitted from the mode names to save space. The table is followed by description of each mode. Note that in the descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the preceding step must be complete before proceeding to the next step. '->' indicates that the next step can start as soon as the previous step is issued. write-back cache ordered tag flush FUA ----------------------------------------------------------------------- NONE yes/no N/A no N/A DRAIN no no N/A N/A DRAIN_FLUSH yes no yes no DRAIN_FUA yes no yes yes TAG no yes N/A N/A TAG_FLUSH yes yes yes no TAG_FUA yes yes yes yes QUEUE_ORDERED_NONE I/O barriers are not needed and/or supported. Sequence: N/A QUEUE_ORDERED_DRAIN Requests are ordered by draining the request queue and cache flushing isn't needed. Sequence: drain => barrier QUEUE_ORDERED_DRAIN_FLUSH Requests are ordered by draining the request queue and both pre-barrier and post-barrier cache flushings are needed. Sequence: drain => preflush => barrier => postflush QUEUE_ORDERED_DRAIN_FUA Requests are ordered by draining the request queue and pre-barrier cache flushing is needed. By using FUA on barrier request, post-barrier flushing can be skipped. Sequence: drain => preflush => barrier QUEUE_ORDERED_TAG Requests are ordered by ordered tag and cache flushing isn't needed. Sequence: barrier QUEUE_ORDERED_TAG_FLUSH Requests are ordered by ordered tag and both pre-barrier and post-barrier cache flushings are needed. Sequence: preflush -> barrier -> postflush QUEUE_ORDERED_TAG_FUA Requests are ordered by ordered tag and pre-barrier cache flushing is needed. By using FUA on barrier request, post-barrier flushing can be skipped. Sequence: preflush -> barrier Random notes/caveats -------------------- * SCSI layer currently can't use TAG ordering even if the drive, controller and driver support it. The problem is that SCSI midlayer request dispatch function is not atomic. It releases queue lock and switch to SCSI host lock during issue and it's possible and likely to happen in time that requests change their relative positions. Once this problem is solved, TAG ordering can be enabled. * Currently, no matter which ordered mode is used, there can be only one barrier request in progress. All I/O barriers are held off by block layer until the previous I/O barrier is complete. This doesn't make any difference for DRAIN ordered devices, but, for TAG ordered devices with very high command latency, passing multiple I/O barriers to low level *might* be helpful if they are very frequent. Well, this certainly is a non-issue. I'm writing this just to make clear that no two I/O barrier is ever passed to low-level driver. * Completion order. Requests in ordered sequence are issued in order but not required to finish in order. Barrier implementation can handle out-of-order completion of ordered sequence. IOW, the requests MUST be processed in order but the hardware/software completion paths are allowed to reorder completion notifications - eg. current SCSI midlayer doesn't preserve completion order during error handling. * Requeueing order. Low-level drivers are free to requeue any request after they removed it from the request queue with blkdev_dequeue_request(). As barrier sequence should be kept in order when requeued, generic elevator code takes care of putting requests in order around barrier. See blk_ordered_req_seq() and ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details. Note that block drivers must not requeue preceding requests while completing latter requests in an ordered sequence. Currently, no error checking is done against this. * Error handling. Currently, block layer will report error to upper layer if any of requests in an ordered sequence fails. Unfortunately, this doesn't seem to be enough. Look at the following request flow. QUEUE_ORDERED_TAG_FLUSH is in use. [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... > still in elevator Let's say request [2], [3] are write requests to update file system metadata (journal or whatever) and [barrier] is used to mark that those updates are valid. Consider the following sequence. i. Requests [0] ~ [post] leaves the request queue and enters low-level driver. ii. After a while, unfortunately, something goes wrong and the drive fails [2]. Note that any of [0], [1] and [3] could have completed by this time, but [pre] couldn't have been finished as the drive must process it in order and it failed before processing that command. iii. Error handling kicks in and determines that the error is unrecoverable and fails [2], and resumes operation. iv. [pre] [barrier] [post] gets processed. v. *BOOM* power fails The problem here is that the barrier request is *supposed* to indicate that filesystem update requests [2] and [3] made it safely to the physical medium and, if the machine crashes after the barrier is written, filesystem recovery code can depend on that. Sadly, that isn't true in this case anymore. IOW, the success of a I/O barrier should also be dependent on success of some of the preceding requests, where only upper layer (filesystem) knows what 'some' is. This can be solved by implementing a way to tell the block layer which requests affect the success of the following barrier request and making lower lever drivers to resume operation on error only after block layer tells it to do so. As the probability of this happening is very low and the drive should be faulty, implementing the fix is probably an overkill. But, still, it's there. * In previous drafts of barrier implementation, there was fallback mechanism such that, if FUA or ordered TAG fails, less fancy ordered mode can be selected and the failed barrier request is retried automatically. The rationale for this feature was that as FUA is pretty new in ATA world and ordered tag was never used widely, there could be devices which report to support those features but choke when actually given such requests. This was removed for two reasons 1. it's an overkill 2. it's impossible to implement properly when TAG ordering is used as low level drivers resume after an error automatically. If it's ever needed adding it back and modifying low level drivers accordingly shouldn't be difficult.