Age | Commit message (Collapse) | Author |
|
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.
Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail. (Even for stripes it seems rather
unlikely.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
We must zero the next chunk on disk *before* writing out the current chunk, not
after. Otherwise if the machine crashes at the wrong time, the "end of metadata"
marker may be missing.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
|
|
Use a separate buffer for writing zeroes to the on-disk snapshot
exception store, make the updating of ps->current_area explicit and
refactor the code in preparation for the fix in the next patch.
No functional change.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
|
|
The last_percent field is unused - remove it.
(It dates from when events were triggered as each X% filled up.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Fix a race condition with primary_pe ref_count handling.
put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test
on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count.
__origin_write does atomic_dec_and_test on primary_pe->ref_count without holding
dm_snapshot->lock.
This opens the following race condition:
Assume two CPUs, CPU1 is executing put_pending_exception (and holding
dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
primary_pe->ref_count == 2.
CPU1:
if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
origin_bios = bio_list_get(&primary_pe->origin_bios);
... decrements primary_pe->ref_count to 1. Doesn't load origin_bios
CPU2:
if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
flush_bios(bio_list_get(&primary_pe->origin_bios));
free_pending_exception(primary_pe);
/* If we got here, pe_queue is necessarily empty. */
return r;
}
... decrements primary_pe->ref_count to 0, submits pending bios, frees
primary_pe.
CPU1:
if (!primary_pe || primary_pe != pe)
free_pending_exception(pe);
... this has no effect.
if (primary_pe && !atomic_read(&primary_pe->ref_count))
free_pending_exception(primary_pe);
... sees ref_count == 0 (written by CPU 2), does double free !!
This bug can happen only if someone is simultaneously writing to both the
origin and the snapshot.
If someone is writing only to the origin, __origin_write will submit kcopyd
request after it decrements primary_pe->ref_count (so it can't happen that the
finished copy races with primary_pe->ref_count decrementation).
If someone is writing only to the snapshot, __origin_write isn't invoked at all
and the race can't happen.
The race happens when someone writes to the snapshot --- this creates
pending_exception with primary_pe == NULL and starts copying. Then, someone
writes to the same chunk in the snapshot, and __origin_write races with
termination of already submitted request in pending_complete (that calls
put_pending_exception).
This race may be reason for bugs:
http://bugzilla.kernel.org/show_bug.cgi?id=11636
https://bugzilla.redhat.com/show_bug.cgi?id=465825
The patch fixes the code to make sure that:
1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
must no longer dereference primary_pe (because someone else may free it under
us).
2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
is responsible for freeing primary_pe.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
|
|
Write throughput to LVM snapshot origin volume is an order
of magnitude slower than those to LV without snapshots or
snapshot target volumes, especially in the case of sequential
writes with O_SYNC on.
The following patch originally written by Kevin Jamieson and
Jan Blunck and slightly modified for the current RCs by myself
tries to improve the performance by modifying the behaviour
of kcopyd, so that it pushes back an I/O job to the head of
the job queue instead of the tail as process_jobs() currently
does when it has to wait for free pages. This way, write
requests aren't shuffled to cause extra seeks.
I tested the patch against 2.6.27-rc5 and got the following results.
The test is a dd command writing to snapshot origin followed by fsync
to the file just created/updated. A couple of filesystem benchmarks
gave me similar results in case of sequential writes, while random
writes didn't suffer much.
dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
[conv=notrunc when updating]
1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
average throughput (MB/s)
10M 100M 1000M
create,dd 511.46 610.72 11.81
create,dd+fsync 7.10 6.77 8.13
update,dd 431.63 917.41 12.75
update,dd+fsync 7.79 7.43 8.12
compared with write throughput to LV without any snapshots,
all dd+fsync and 1000 MiB writes perform very poorly.
10M 100M 1000M
create,dd 555.03 608.98 123.29
create,dd+fsync 114.27 72.78 76.65
update,dd 152.34 1267.27 124.04
update,dd+fsync 130.56 77.81 77.84
2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
average throughput (MB/s)
10M 100M 1000M
create,dd 537.06 589.44 46.21
create,dd+fsync 31.63 29.19 29.23
update,dd 487.59 897.65 37.76
update,dd+fsync 34.12 30.07 26.85
Although still not on par with plain LV performance -
cannot be avoided because it's copy on write anyway -
this simple patch successfully improves throughtput
of dd+fsync while not affecting the rest.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
ioctl() doesn't need BKL here
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Analog of blkdev_driver_ioctl() with sane arguments. For
now uses fake struct file, by the end of the series it won't
and blkdev_driver_ioctl() will become a wrapper around it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The new extended partition support provides a much nicer was
to have partitions on md devices that the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (39 commits)
[SCSI] sd: fix compile failure with CONFIG_BLK_DEV_INTEGRITY=n
libiscsi: fix locking in iscsi_eh_device_reset
libiscsi: check reason why we are stopping iscsi session to determine error value
[SCSI] iscsi_tcp: return a descriptive error value during connection errors
[SCSI] libiscsi: rename host reset to target reset
[SCSI] iscsi class: fix endpoint id handling
[SCSI] libiscsi: Support drivers initiating session removal
[SCSI] libiscsi: fix data corruption when target has to resend data-in packets
[SCSI] sd: Switch kernel printing level for DIF messages
[SCSI] sd: Correctly handle all combinations of DIF and DIX
[SCSI] sd: Always print actual protection_type
[SCSI] sd: Issue correct protection operation
[SCSI] scsi_error: fix target reset handling
[SCSI] lpfc 8.2.8 v2 : Add statistical reporting control and additional fc vendor events
[SCSI] lpfc 8.2.8 v2 : Add sysfs control of target queue depth handling
[SCSI] lpfc 8.2.8 v2 : Revert target busy in favor of transport disrupted
[SCSI] scsi_dh_alua: remove REQ_NOMERGE
[SCSI] lpfc 8.2.8 : update driver version to 8.2.8
[SCSI] lpfc 8.2.8 : Add MSI-X support
[SCSI] lpfc 8.2.8 : Update driver to use new Host byte error code DID_TRANSPORT_DISRUPTED
...
|
|
* 'for-linus' of git://neil.brown.name/md:
md: fix input truncation in safe_delay_store()
md: check for memory allocation failure in faulty personality
md: build failure due to missing delay.h
md: Relax minimum size restrictions on chunk_size.
md: remove space after function name in declaration and call.
md: Remove unnecessary #includes, #defines, and function declarations.
md: Convert remaining 1k representations in linear.c to sectors.
md: linear.c: Make two local variables sector-based.
md: linear: Represent dev_info->size and dev_info->offset in sectors.
md: linear.c: Remove broken debug code.
md: linear.c: Remove pointless initialization of curr_offset.
md: linear.c: Fix typo in comment.
md: Don't try to set an array to 'read-auto' if it is already in that state.
md: Allow metadata_version to be updated for externally managed metadata.
md: Fix rdev_size_store with size == 0
|
|
safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one. sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
It's a fault injection module, but I don't think we should oops here.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
|
|
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Multipath is best at handling transport errors. If it gets a device
error then there is not much the multipath layer can do. It will just
access the same device but from a different path.
This patch breaks up failfast into device, transport and driver errors.
The multipath layers (md and dm mutlipath) only ask the lower levels to
fast fail transport errors. The user of failfast, read ahead, will ask
to fast fail on all errors.
Note that blk_noretry_request will return true if any failfast bit
is set. This allows drivers that do not support the multipath failfast
bits to continue to fail on any failfast error like before. Drivers
like scsi that are able to fail fast specific errors can check
for the specific fail fast type. In the next patch I will convert
scsi.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
|
|
Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.
This makes moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.
For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.
So remove the test on chunk_size from common can, and add it in just
the places where it is needed: raid10 and raid4/5/6.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Having
function (args)
instead of
function(args)
make is harder to search for calls of particular functions.
So remove all those spaces.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
A lot of cruft has gathered over the years. Time to remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This patch renames hash_spacing and preshift to spacing and
sector_shift respectively with the following change of semantics:
Case 1: (sizeof(sector_t) <= sizeof(u32)).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this case, we have sector_shift = preshift = 0 and spacing =
2 * hash_spacing.
Hence, the index for the hash table which is computed by the new code
in which_dev() as sector / spacing equals the old value which was
(sector/2) / hash_spacing.
Note also that the value of nb_zone stays the same because both sz
and base double.
Case 2: (sizeof(sector_t) > sizeof(u32)).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(aka the shifting dance case). Here we have sector_shift = preshift +
1 and
spacing = 2 * hash_spacing
during the computation of nb_zone and curr_sector, but
spacing = hash_spacing
in which_dev() because in the last hunk of the patch for linear.c we
shift down conf->spacing (= 2 * hash_spacing) by one more bit than
in the old code.
Hence in the computation of nb_zone, sz and base have the same value
as before, so nb_zone is not affected. Also curr_sector in the next
hunk stays the same.
In which_dev() the hash table index is computed as
(sector >> sector_shift) / spacing
In view of sector_shift = preshift + 1 and spacing = hash_spacing,
this equals
((sector/2) >> preshift) / hash_spacing
which is the value computed by the old code.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This is a preparation for representing also the remaining fields of struct
linear_private_data as sectors.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Rename them to num_sectors and start_sector which is more descriptive.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
conf->smallest_size is undefined since day one of the git repo..
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.
Calling do_md_stop to set the array readonly when it is already readonly
returns an error. So make sure not to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.
Normally changes to metadata_version are not permitted while the array
is active. Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.
This fixes a regression introduced by
commit d7027458d68b2f1752a28016dcf2ffd0a7e8f567
Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
RAID autodetect has the side effect of requiring synchronisation
of all device drivers, which can make the boot several seconds longer
(I've measured 7 on one of my laptops).... even for systems that don't
have RAID setup for the root filesystem (the only FS where this matters).
This patch makes the default for autodetect a config option; either way
the user can always override via the kernel command line.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: NeilBrown <neilb@suse.de>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
dm: detect lost queue
dm: publish dm_vcalloc
dm: publish dm_table_unplug_all
dm: publish dm_get_mapinfo
dm: export struct dm_dev
dm crypt: avoid unnecessary wait when splitting bio
dm crypt: tidy ctx pending
dm crypt: fix async inc_pending
dm crypt: move dec_pending on error into write_io_submit
dm crypt: remove inc_pending from write_io_submit
dm crypt: tidy write loop pending
dm crypt: tidy crypt alloc
dm crypt: tidy inc pending
dm exception store: use chunk_t for_areas
dm exception store: introduce area_location function
dm raid1: kcopyd should stop on error if errors handled
dm mpath: remove is_active from struct dm_path
dm mpath: use more error codes
Fixed up trivial conflict in drivers/md/dm-mpath.c manually.
|
|
Detect and report buggy drivers that destroy their request_queue.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Stefan Raspl <raspl@linux.vnet.ibm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
|
|
Publish dm_vcalloc in include/linux/device-mapper.h because this function is
used by targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Publish dm_table_unplug_all in include/linux/device-mapper.h because this
function is used by targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Publish dm_get_mapinfo in include/linux/device-mapper.h because this function
is used by targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Split struct dm_dev in two and publish the part that other targets need in
include/linux/device-mapper.h.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Don't wait between submitting crypt requests for a bio unless
we are short of memory.
There are two situations when we must split an encrypted bio:
1) there are no free pages;
2) the new bio would violate underlying device restrictions
(e.g. max hw segments).
In case (2) we do not need to wait.
Add output variable to crypt_alloc_buffer() to distinguish between
these cases.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Move the initialisation of ctx->pending into one place, at the
start of crypt_convert().
Introduce crypt_finished to indicate whether or not the encryption
is finished, for use in a later patch.
No functional change.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The pending reference count must be incremented *before* the async work is
queued to another thread, not after. Otherwise there's a race if the
work completes and decrements the reference count before it gets incremented.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Make kcryptd_crypt_write_io_submit() responsible for decrementing
the pending count after an error.
Also fixes a bug in the async path that forgot to decrement it.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Make the caller reponsible for incrementing the pending count before calling
kcryptd_crypt_write_io_submit() in the non-async case to bring it into line
with the async case.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Move kcryptd_crypt_write_convert_loop inside kcryptd_crypt_write_convert.
This change is needed for a later patch.
No functional change.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Factor out crypt io allocation code.
Later patches will call it from another place.
No functional change.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|