aboutsummaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2009-06-22dm: rename suspended_bdev to bdevMikulas Patocka
Rename suspended_bdev to bdev. This patch doesn't change any functionality, just renames the variable. In the next patch, the variable will be used even for non-suspended device. (Pre-requisite for the per-target barrier support patches.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm exception store: fix exstore lookup to be case insensitiveJonathan Brassow
When snapshots are created using 'p' instead of 'P' as the exception store type, the device-mapper table loading fails. This patch makes the code case insensitive as intended and fixes some regressions reported with device-mapper snapshots. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm: use i_size_readMikulas Patocka
Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm: avoid unsupported spanning of md stripe boundariesMikulas Patocka
A bio that has two or more vector entries, size less than or equal to page size, that crosses a stripe boundary of an underlying md device is accepted by device mapper (it conforms to all its limits) but not by the underlying device. The fix is: If device mapper selects the one-page maximum request size, it also needs to set its own q->merge_bvec_fn to reject any bios with multiple vector entries that span more pages. The problem was discovered in the following scenario: * MD - RAID-0 * LV on the top of it (raid1, snapshot or striped with chunk size/stripe larger than RAID-0 stripe) * one of the logical volumes is exported to xen domU * inside xen domU it is partitioned, the key point is that the partition must be unaligned on page boundary (fdisk normally aligns the partition to 63 sectors which will trigger it) * install the system on the partitioned disk in domU This causes I/O failures in dom0. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm mpath: flush keventd queue in destructorMikulas Patocka
The commit fe9cf30eb8186ef267d1868dc9f12f2d0f40835a moves dm table event submission from kmultipath queue to kernel kevent queue to avoid a deadlock. There is a possibility of race condition because kevent queue is not flushed in the multipath destructor. The scenario is: - some event happens and is queued to keventd - keventd thread is delayed due to scheuling latency or some other work - multipath device is destroyed - keventd now attempts to process work_struct that is residing in already released memory. The patch flushes the keventd queue in multipath constructor. I've already fixed similar bug in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
2009-06-22dm raid1: keep retrying alloc if mempool_alloc failedMikulas Patocka
If the code can't handle allocation failures, use __GFP_NOFAIL so that in case of memory pressure the allocator will retry indefinitely and won't return NULL which would cause a crash in the function. This is still not a correct fix, it may cause a classic deadlock when memory manager waits for I/O being done and I/O waits for some free memory. I/O code shouldn't allocate any memory. But in this case it probably doesn't matter much in practice, people usually do not swap on RAID. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm mpath: call activate fn for each path in pg_initChandra Seetharaman
Fixed a problem affecting reinstatement of passive paths. Before we moved the hardware handler from dm to SCSI, it performed a pg_init for a path group and didn't maintain any state about each path in hardware handler code. But in SCSI dh, such state is now maintained, as we want to fail I/O early on a path if it is not the active path. All the hardware handlers have a state now and set to active or some form of inactive. They have prep_fn() which uses this state to fail the I/O without it ever being sent to the device. So in effect when dm-multipath calls scsi_dh_activate(), activate is sent to only one path and the "state" of that path is changed appropriately to "active" while other paths in the same path group are never changed as they never got an "activate". In order make sure all the paths in a path group gets their state set properly when a pg_init happens, we need to call scsi_dh_activate() on all paths in a path group. Doing this at the hardware handler layer is not a good option as we want the multipath layer to define the relationship between path and path groups and not the hardware handler. Attached patch sends an "activate" on each path in a path group when a path group is switched. It also sends an activate when a path is reinstated. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm mpath: change attached scsi_dhHannes Reinecke
When specifying a different hardware handler via multipath features we should be able to override the built-in defaults. The problem here is the hardware table from scsi_dh is compiled in and cannot be changed from userland. The multipath.conf OTOH is purely user-defined and, what's more, the user might have a valid reason for modifying it. (EG EMC Clariion can well be run in PNR mode even though ALUA is active, or the user might want to try ALUA on any as-of-yet unknown devices) So _not_ allowing multipath to override the device handler setting will just add to the confusion and makes error tracking even more difficult. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm: sysfs skip output when device is being destroyedMilan Broz
Do not process sysfs attributes when device is being destroyed. Otherwise code can cause BUG_ON(test_bit(DMF_FREEING, &md->flags)); in dm_put() call. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm mpath: validate hw_handler argument countMikulas Patocka
Fix arg count parsing error in hw handlers. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22dm mpath: validate table argument countMikulas Patocka
The parser reads the argument count as a number but doesn't check that sufficient arguments are supplied. This command triggers the bug: dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0` multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0 round-robin 0 1 1 /dev/mapper/cr1 1000" kernel BUG at drivers/md/dm-mpath.c:530! Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-19Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix kernel-doc parameter name typo in blk-settings.c: block: rename CONFIG_LBD to CONFIG_LBDAF block: Fix bounce_pfn setting hd: stop defining MAJOR_NR
2009-06-19block: rename CONFIG_LBD to CONFIG_LBDAFBartlomiej Zolnierkiewicz
Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-18Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds
* 'for-linus' of git://neil.brown.name/md: (39 commits) md/raid5: correctly update sync_completed when we reach max_resync md/raid5: add missing call to schedule() after prepare_to_wait() md/linear: use call_rcu to free obsolete 'conf' structures. md linear: Protecting mddev with rcu locks to avoid races md: Move check for bitmap presence to personality code. md: remove chunksize rounding from common code. md: raid0/linear: ensure device sizes are rounded to chunk size. md: move assignment of ->utime so that it never gets skipped. md: Push down reconstruction log message to personality code. md: merge reconfig and check_reshape methods. md: remove unnecessary arguments from ->reconfig method. md: raid5: check stripe cache is large enough in start_reshape md: raid0: chunk_sectors cleanups. md: fix some comments. md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig(). md: convert conf->chunk_size and conf->prev_chunk to sectors. md: Convert mddev->new_chunk to sectors. md: Make mddev->chunk_size sector-based. md: raid0 :Enables chunk size other than powers of 2. md: prepare for non-power-of-two chunk sizes ...
2009-06-18md/raid5: correctly update sync_completed when we reach max_resyncNeilBrown
At the end of reshape_request we update cyrr_resync_completed if we are about to pause due to reaching resync_max. However we update it to the wrong value. We need to add the "reshape_sectors" that have just been reshaped. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md/raid5: add missing call to schedule() after prepare_to_wait()Dan Williams
In the unlikely event that reshape progresses past the current request while it is waiting for a stripe we need to schedule() before retrying for 2 reasons: 1/ Prevent list corruption from duplicated list_add() calls without intervening list_del(). 2/ Give the reshape code a chance to make some progress to resolve the conflict. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md/linear: use call_rcu to free obsolete 'conf' structures.NeilBrown
Current, when we update the 'conf' structure, when adding a drive to a linear array, we keep the old version around until the array is finally stopped, as it is not safe to free it immediately. Now that we have rcu protection on all accesses to 'conf', we can use call_rcu to free it more promptly. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md linear: Protecting mddev with rcu locks to avoid racesSandeepKsinha
Due to the lack of memory ordering guarantees, we may have races around mddev->conf. In particular, the correct contents of the structure we get from dereferencing ->private might not be visible to this CPU yet, and they might not be correct w.r.t mddev->raid_disks. This patch addresses the problem using rcu protection to avoid such race conditions. Signed-off-by: SandeepKsinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: Move check for bitmap presence to personality code.Andre Noll
If the superblock of a component device indicates the presence of a bitmap but the corresponding raid personality does not support bitmaps (raid0, linear, multipath, faulty), then something is seriously wrong and we'd better refuse to run such an array. Currently, this check is performed while the superblocks are examined, i.e. before entering personality code. Therefore the generic md layer must know which raid levels support bitmaps and which do not. This patch avoids this layer violation without adding identical code to various personalities. This is accomplished by introducing a new public function to md.c, md_check_no_bitmap(), which replaces the hard-coded checks in the superblock loading functions. A call to md_check_no_bitmap() is added to the ->run method of each personality which does not support bitmaps and assembly is aborted if at least one component device contains a bitmap. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: remove chunksize rounding from common code.NeilBrown
It is easiest to round sizes to multiples of chunk size in the personality code for those personalities which care. Those personalities now do the rounding, so we can remove that function from common code. Also remove the upper bound on the size of a chunk, and the lower bound on the size of a device (1 chunk), neither of which really buy us anything. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: raid0/linear: ensure device sizes are rounded to chunk size.NeilBrown
This is currently ensured by common code, but it is more reliable to ensure it where it is needed in personality code. All the other personalities that care already round the size to the chunk_size. raid0 and linear are the only hold-outs. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: move assignment of ->utime so that it never gets skipped.NeilBrown
Currently the assignment to utime gets skipped for 'external' metadata. So move it to the top of the function so that it always gets effected. This is of largely cosmetic interest. Nothing actually depends on ->utime being right for external arrays. "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with mdadm-3.0, this is not important for external metadata. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: Push down reconstruction log message to personality code.Andre Noll
Currently, the md layer checks in analyze_sbs() if the raid level supports reconstruction (mddev->level >= 1) and if reconstruction is in progress (mddev->recovery_cp != MaxSector). Move that printk into the personality code of those raid levels that care (levels 1, 4, 5, 6, 10). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: merge reconfig and check_reshape methods.NeilBrown
The difference between these two methods is artificial. Both check that a pending reshape is valid, and perform any aspect of it that can be done immediately. 'reconfig' handles chunk size and layout. 'check_reshape' handles raid_disks. So make them just one method. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: remove unnecessary arguments from ->reconfig method.NeilBrown
Passing the new layout and chunksize as args is not necessary as the mddev has fields for new_check and new_layout. This is preparation for combining the check_reshape and reconfig methods Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: raid5: check stripe cache is large enough in start_reshapeNeilBrown
In reshape cases that do not change the number of devices, start_reshape is called without first calling check_reshape. Currently, the check that the stripe_cache is large enough is only done in check_reshape. It should be in start_reshape too. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: raid0: chunk_sectors cleanups.NeilBrown
following the conversion to chunk_sectors, there is room for cleaning up a little. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: fix some comments.Andre Noll
1/ Raid5 has learned to take over also raid4 and raid6 arrays. 2/ new_chunk in mdp_superblock_1 is in sectors, not bytes. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig().Andre Noll
Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: convert conf->chunk_size and conf->prev_chunk to sectors.Andre Noll
This kills some more shifts. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: Convert mddev->new_chunk to sectors.Andre Noll
A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18md: Make mddev->chunk_size sector-based.Andre Noll
This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits) debugfs: use specified mode to possibly mark files read/write only debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. xen: remove driver_data direct access of struct device from more drivers usb: gadget: at91_udc: remove driver_data direct access of struct device uml: remove driver_data direct access of struct device block/ps3: remove driver_data direct access of struct device s390: remove driver_data direct access of struct device parport: remove driver_data direct access of struct device parisc: remove driver_data direct access of struct device of_serial: remove driver_data direct access of struct device mips: remove driver_data direct access of struct device ipmi: remove driver_data direct access of struct device infiniband: ehca: remove driver_data direct access of struct device ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device hvcs: remove driver_data direct access of struct device xen block: remove driver_data direct access of struct device thermal: remove driver_data direct access of struct device scsi: remove driver_data direct access of struct device pcmcia: remove driver_data direct access of struct device PCIE: remove driver_data direct access of struct device ... Manually fix up trivial conflicts due to different direct driver_data direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
2009-06-16block: remove some includings of blktrace_api.hLi Zefan
When porting blktrace to tracepoints, we changed to trace/block.h for trace prober declarations. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16md: raid0 :Enables chunk size other than powers of 2.raz ben yehuda
Maintain two flows, one for pow2 chunk sizes (which uses masks and shift), and a flow for the general case (which uses sector_div). This is for the sake of performance. - introduce map_sector and is_io_in_chunk_boundary to encapsulate those two flows better for raid0_make_request - fix blk_mergeable to support the two flows. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: prepare for non-power-of-two chunk sizesraz ben yehuda
Remove chunk size check from md as this is now performed in the run function in each personality. Replace chunk size power 2 code calculations by a regular division. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid5: chunk size check in setup_confraz ben yehuda
have raid5 check chunk size in run/reshape method instead of in md Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid10: chunk size check in runraz ben yehuda
have raid10 check chunk size in run method instead of in md Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: chunk size check in raid0_runraz ben yehuda
have raid0 check chunk size in run method instead of in md. This is part of a series moving the checks from common code to the personalities where they belong. hardsect is short and chunksize is an int, so it is safe to use %. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: have raid0 report its formationraz ben yehuda
Report to the user what are the raid zones Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: have raid0 compile with MD_DEBUG onraz ben yehuda
Because of the removal of the device list from the strips raid0 did not compile with MD_DEBUG flag on Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: Binary search in linear raidSandeep K Sinha
Replace the linear search with binary search in which_dev. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: Removing num_sector and replacing start_sector with end_sectorSandeep K Sinha
Remove num_sectors from dev_info and replace start_sector with end_sector. This makes a lot of comparisons much simpler. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: Removal of hash table in linear raidSandeep K Sinha
Get rid of sector_div and hash table for linear raid and replace with a linear search in which_dev. The hash table adds a lot of complexity for little if any gain. Ultimately a binary search will be used which will have smaller cache foot print, a similar number of memory access, and no divisions. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: remove mddev_to_conf "helper" macroNeilBrown
Having a macro just to cast a void* isn't really helpful. I would must rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: remove setting of segment boundary.NeilBrown
This setting doesn't seem to make sense (half the chunk size??) and shouldn't be needed. The segment boundary exported by raid0 should simply be the minimum of the segment boundary of all component devices. And we already get that right. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: remove ->dev pointer from strip_zone structureNeilBrown
If we treat conf->devlist more like a 2 dimensional array, we can get the devlist for a particular zone simply by indexing that array, so we don't need to store the pointers to subarrays in strip_zone. This makes strip_zone smaller and so (hopefully) searches faster. Signed-of-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: remove ->sectors from the strip_zone structure.NeilBrown
storing ->sectors is redundant as is can be computed from the difference z->zone_end - (z-1)->zone_end The one place where it is used, it is just as efficient to use a zone_end value instead. And removing it makes strip_zone smaller, so they array of these that is searched on every request has a better chance to say in cache. So discard the field and get the value from elsewhere. Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: Fix a memory leak when stopping a raid0 array.Andre Noll
raid0_stop() removes all references to the raid0 configuration but misses to free the ->devlist buffer. This patch closes this leak, removes a pointless initialization and fixes a coding style issue in raid0_stop(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16md: raid0: Allocate all buffers for the raid0 configuration in one function.Andre Noll
Currently the raid0 configuration is allocated in raid0_run() while the buffers for the strip_zone and the dev_list arrays are allocated in create_strip_zones(). On errors, all three buffers are freed in raid0_run(). It's easier and more readable to do the allocation and cleanup within a single function. So move that code into create_strip_zones(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>