aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-01-02reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversionFrederic Weisbecker
i_xattr_sem depends on the reiserfs lock. But after we grab i_xattr_sem, we may relax/relock the reiserfs lock while waiting on a freezed filesystem, creating a dependency inversion between the two locks. In order to avoid the i_xattr_sem -> reiserfs lock dependency, let's create a reiserfs_down_read_safe() that acts like reiserfs_mutex_lock_safe(): relax the reiserfs lock while grabbing another lock to avoid undesired dependencies induced by the heivyweight reiserfs lock. This fixes the following warning: [ 990.005931] ======================================================= [ 990.012373] [ INFO: possible circular locking dependency detected ] [ 990.013233] 2.6.33-rc1 #1 [ 990.013233] ------------------------------------------------------- [ 990.013233] dbench/1891 is trying to acquire lock: [ 990.013233] (&REISERFS_SB(s)->lock){+.+.+.}, at: [<ffffffff81159505>] reiserfs_write_lock+0x35/0x50 [ 990.013233] [ 990.013233] but task is already holding lock: [ 990.013233] (&REISERFS_I(inode)->i_xattr_sem){+.+.+.}, at: [<ffffffff8115899a>] reiserfs_xattr_set_handle+0x8a/0x470 [ 990.013233] [ 990.013233] which lock already depends on the new lock. [ 990.013233] [ 990.013233] [ 990.013233] the existing dependency chain (in reverse order) is: [ 990.013233] [ 990.013233] -> #1 (&REISERFS_I(inode)->i_xattr_sem){+.+.+.}: [ 990.013233] [<ffffffff81063afc>] __lock_acquire+0xf9c/0x1560 [ 990.013233] [<ffffffff8106414f>] lock_acquire+0x8f/0xb0 [ 990.013233] [<ffffffff814ac194>] down_write+0x44/0x80 [ 990.013233] [<ffffffff8115899a>] reiserfs_xattr_set_handle+0x8a/0x470 [ 990.013233] [<ffffffff81158e30>] reiserfs_xattr_set+0xb0/0x150 [ 990.013233] [<ffffffff8115a6aa>] user_set+0x8a/0x90 [ 990.013233] [<ffffffff8115901a>] reiserfs_setxattr+0xaa/0xb0 [ 990.013233] [<ffffffff810e2596>] __vfs_setxattr_noperm+0x36/0xa0 [ 990.013233] [<ffffffff810e26bc>] vfs_setxattr+0xbc/0xc0 [ 990.013233] [<ffffffff810e2780>] setxattr+0xc0/0x150 [ 990.013233] [<ffffffff810e289d>] sys_fsetxattr+0x8d/0xa0 [ 990.013233] [<ffffffff81002dab>] system_call_fastpath+0x16/0x1b [ 990.013233] [ 990.013233] -> #0 (&REISERFS_SB(s)->lock){+.+.+.}: [ 990.013233] [<ffffffff81063e30>] __lock_acquire+0x12d0/0x1560 [ 990.013233] [<ffffffff8106414f>] lock_acquire+0x8f/0xb0 [ 990.013233] [<ffffffff814aba77>] __mutex_lock_common+0x47/0x3b0 [ 990.013233] [<ffffffff814abebe>] mutex_lock_nested+0x3e/0x50 [ 990.013233] [<ffffffff81159505>] reiserfs_write_lock+0x35/0x50 [ 990.013233] [<ffffffff811340e5>] reiserfs_prepare_write+0x45/0x180 [ 990.013233] [<ffffffff81158bb6>] reiserfs_xattr_set_handle+0x2a6/0x470 [ 990.013233] [<ffffffff81158e30>] reiserfs_xattr_set+0xb0/0x150 [ 990.013233] [<ffffffff8115a6aa>] user_set+0x8a/0x90 [ 990.013233] [<ffffffff8115901a>] reiserfs_setxattr+0xaa/0xb0 [ 990.013233] [<ffffffff810e2596>] __vfs_setxattr_noperm+0x36/0xa0 [ 990.013233] [<ffffffff810e26bc>] vfs_setxattr+0xbc/0xc0 [ 990.013233] [<ffffffff810e2780>] setxattr+0xc0/0x150 [ 990.013233] [<ffffffff810e289d>] sys_fsetxattr+0x8d/0xa0 [ 990.013233] [<ffffffff81002dab>] system_call_fastpath+0x16/0x1b [ 990.013233] [ 990.013233] other info that might help us debug this: [ 990.013233] [ 990.013233] 2 locks held by dbench/1891: [ 990.013233] #0: (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff810e2678>] vfs_setxattr+0x78/0xc0 [ 990.013233] #1: (&REISERFS_I(inode)->i_xattr_sem){+.+.+.}, at: [<ffffffff8115899a>] reiserfs_xattr_set_handle+0x8a/0x470 [ 990.013233] [ 990.013233] stack backtrace: [ 990.013233] Pid: 1891, comm: dbench Not tainted 2.6.33-rc1 #1 [ 990.013233] Call Trace: [ 990.013233] [<ffffffff81061639>] print_circular_bug+0xe9/0xf0 [ 990.013233] [<ffffffff81063e30>] __lock_acquire+0x12d0/0x1560 [ 990.013233] [<ffffffff8115899a>] ? reiserfs_xattr_set_handle+0x8a/0x470 [ 990.013233] [<ffffffff8106414f>] lock_acquire+0x8f/0xb0 [ 990.013233] [<ffffffff81159505>] ? reiserfs_write_lock+0x35/0x50 [ 990.013233] [<ffffffff8115899a>] ? reiserfs_xattr_set_handle+0x8a/0x470 [ 990.013233] [<ffffffff814aba77>] __mutex_lock_common+0x47/0x3b0 [ 990.013233] [<ffffffff81159505>] ? reiserfs_write_lock+0x35/0x50 [ 990.013233] [<ffffffff81159505>] ? reiserfs_write_lock+0x35/0x50 [ 990.013233] [<ffffffff81062592>] ? mark_held_locks+0x72/0xa0 [ 990.013233] [<ffffffff814ab81d>] ? __mutex_unlock_slowpath+0xbd/0x140 [ 990.013233] [<ffffffff810628ad>] ? trace_hardirqs_on_caller+0x14d/0x1a0 [ 990.013233] [<ffffffff814abebe>] mutex_lock_nested+0x3e/0x50 [ 990.013233] [<ffffffff81159505>] reiserfs_write_lock+0x35/0x50 [ 990.013233] [<ffffffff811340e5>] reiserfs_prepare_write+0x45/0x180 [ 990.013233] [<ffffffff81158bb6>] reiserfs_xattr_set_handle+0x2a6/0x470 [ 990.013233] [<ffffffff81158e30>] reiserfs_xattr_set+0xb0/0x150 [ 990.013233] [<ffffffff814abcb4>] ? __mutex_lock_common+0x284/0x3b0 [ 990.013233] [<ffffffff8115a6aa>] user_set+0x8a/0x90 [ 990.013233] [<ffffffff8115901a>] reiserfs_setxattr+0xaa/0xb0 [ 990.013233] [<ffffffff810e2596>] __vfs_setxattr_noperm+0x36/0xa0 [ 990.013233] [<ffffffff810e26bc>] vfs_setxattr+0xbc/0xc0 [ 990.013233] [<ffffffff810e2780>] setxattr+0xc0/0x150 [ 990.013233] [<ffffffff81056018>] ? sched_clock_cpu+0xb8/0x100 [ 990.013233] [<ffffffff8105eded>] ? trace_hardirqs_off+0xd/0x10 [ 990.013233] [<ffffffff810560a3>] ? cpu_clock+0x43/0x50 [ 990.013233] [<ffffffff810c6820>] ? fget+0xb0/0x110 [ 990.013233] [<ffffffff810c6770>] ? fget+0x0/0x110 [ 990.013233] [<ffffffff81002ddc>] ? sysret_check+0x27/0x62 [ 990.013233] [<ffffffff810e289d>] sys_fsetxattr+0x8d/0xa0 [ 990.013233] [<ffffffff81002dab>] system_call_fastpath+0x16/0x1b Reported-and-tested-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-01ext4: Calculate metadata requirements more accuratelyTheodore Ts'o
In the past, ext4_calc_metadata_amount(), and its sub-functions ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount() badly over-estimated the number of metadata blocks that might be required for delayed allocation blocks. This didn't matter as much when functions which managed the reserved metadata blocks were more aggressive about dropping reserved metadata blocks as delayed allocation blocks were written, but unfortunately they were too aggressive. This was fixed in commit 0637c6f, but as a result the over-estimation by ext4_calc_metadata_amount() would lead to reserving 2-3 times the number of pending delayed allocation blocks as potentially required metadata blocks. So if there are 1 megabytes of blocks which have been not yet been allocation, up to 3 megabytes of space would get reserved out of the user's quota and from the file system free space pool until all of the inode's data blocks have been allocated. This commit addresses this problem by much more accurately estimating the number of metadata blocks that will be required. It will still somewhat over-estimate the number of blocks needed, since it must make a worst case estimate not knowing which physical blocks will be needed, but it is much more accurate than before. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01ext4: Fix accounting of reserved metadata blocksTheodore Ts'o
Commit 0637c6f had a typo which caused the reserved metadata blocks to not be released correctly. Fix this. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Enable mmap on forcedirectio mounts cifs: NULL out tcon, pSesInfo, and srvTcp pointers when chasing DFS referrals
2009-12-30ocfs2: Handle O_DIRECT when writing to a refcounted cluster.Tao Ma
In case of writing to a refcounted cluster with O_DIRECT, we need to fall back to buffer write. And when it is finished, we need to flush the page and the journal as we did for other O_DIRECT writes. This patch fix oss bug 1191. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1191 Signed-off-by: Tao Ma <tao.ma@oracle.com> Tested-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-12-30Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Patch up how we claim metadata blocks for quota purposes ext4: Ensure zeroout blocks have no dirty metadata ext4: return correct wbc.nr_to_write in ext4_da_writepages ext4: Update documentation to correct the inode_readahead_blks option name jbd2: don't use __GFP_NOFAIL in journal_init_common() ext4: flush delalloc blocks when space is low fs-writeback: Add helper function to start writeback if idle ext4: Eliminate potential double free on error path ext4: fix unsigned long long printk warning in super.c ext4, jbd2: Add barriers for file systems with exernal journals ext4: replace BUG() with return -EIO in ext4_ext_get_blocks ext4: add module aliases for ext2 and ext3 ext4: Don't ask about supporting ext2/3 in ext4 if ext4 is not configured ext4: remove unused #include <linux/version.h>
2009-12-30generic_permission: MAY_OPEN is not write accessSerge E. Hallyn
generic_permission was refusing CAP_DAC_READ_SEARCH-enabled processes from opening DAC-protected files read-only, because do_filp_open adds MAY_OPEN to the open mask. Ignore MAY_OPEN. After this patch, CAP_DAC_READ_SEARCH is again sufficient to open(fname, O_RDONLY) on a file to which DAC otherwise refuses us read permission. Reported-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-30ext4: Patch up how we claim metadata blocks for quota purposesTheodore Ts'o
As reported in Kernel Bugzilla #14936, commit d21cd8f triggered a BUG in the function ext4_da_update_reserve_space() found in fs/ext4/inode.c. The root cause of this BUG() was caused by the fact that ext4_calc_metadata_amount() can severely over-estimate how many metadata blocks will be needed, especially when using direct block-mapped files. In addition, it can also badly *under* estimate how much space is needed, since ext4_calc_metadata_amount() assumes that the blocks are contiguous, and this is not always true. If the application is writing blocks to a sparse file, the number of metadata blocks necessary can be severly underestimated by the functions ext4_da_reserve_space(), ext4_da_update_reserve_space() and ext4_da_release_space(). This was the cause of the dq_claim_space reports found on kerneloops.org. Unfortunately, doing this right means that we need to massively over-estimate the amount of free space needed. So in some cases we may need to force the inode to be written to disk asynchronously in to avoid spurious quota failures. http://bugzilla.kernel.org/show_bug.cgi?id=14936 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-29ext4: Ensure zeroout blocks have no dirty metadataAneesh Kumar K.V
This fixes a bug (found by Curt Wohlgemuth) in which new blocks returned from an extent created with ext4_ext_zeroout() can have dirty metadata still associated with them. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-29reiserfs: Fix remaining in-reclaim-fs <-> reclaim-fs-on locking inversionFrederic Weisbecker
Commit 500f5a0bf5f0624dae34307010e240ec090e4cde (reiserfs: Fix possible recursive lock) fixed a vmalloc under reiserfs lock that triggered a lockdep warning because of a IN-FS-RECLAIM <-> RECLAIM-FS-ON locking dependency inversion. But this patch has ommitted another vmalloc call in the same path that allocates the journal. Relax the lock for this one too. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2009-12-25ext4: return correct wbc.nr_to_write in ext4_da_writepagesRichard Kennedy
When ext4_da_writepages increases the nr_to_write in writeback_control then it must always re-base the return value. Originally there was a (misguided) attempt prevent wbc.nr_to_write from going negative. In fact, it's necessary to allow nr_to_write to be negative so that wb_writeback() can correctly calculate how many pages were actually written. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-25nilfs2: Storage class should be before const qualifierTobias Klauser
The C99 specification states in section 6.11.5: The placement of a storage-class specifier other than at the beginning of the declaration specifiers in a declaration is an obsolescent feature. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-12-25nilfs2: trivial coding style fixJiro SEKIBA
This is a trivial style fix patch to mend errors/warnings reported by "checkpatch.pl --file". Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-12-24Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c ocfs2/trivial: Use proper mask for 2 places in hearbeat.c Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink. Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode? ocfs2: Use FIEMAP_EXTENT_SHARED fiemap: Add new extent flag FIEMAP_EXTENT_SHARED ocfs2: replace u8 by __u8 in ocfs2_fs.h ocfs2: explicit declare uninitialized var in user_cluster_connect() ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock() ocfs2: return -EAGAIN instead of EAGAIN in dlm ocfs2/cluster: Make fence method configurable - v2 ocfs2: Set MS_POSIXACL on remount ocfs2: Make acl use the default ocfs2: Always include ACL support
2009-12-23ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.cTao Ma
In ocfs2_value_metas_in_xattr_header, we should Use le16_to_cpu for ocfs2_extent_list.l_next_free_rec. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-12-23ocfs2/trivial: Use proper mask for 2 places in hearbeat.cTao Ma
I just noticed today that there are 2 places of "mlog(0,...)" in fs/ocfs2/cluster/heartbeat.c, but actually have no default mask prefix in that file. So change them to mlog(ML_HEARTBEAT,...). Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-12-23Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink.Tristan Ye
For fast symlink, it can be treated the same as inlined files since the data extent we want to return of both case all were stored in metadata block. For symlink, it can be simply treated the same as we did for regular files. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-12-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: devtmpfs: unlock mutex in case of string allocation error Driver core: export platform_device_register_data as a GPL symbol driver core: Prevent reference to freed memory on error path Driver-core: Fix bogus 0 error return in device_add() Driver core: driver_attribute parameters can often be const* Driver core: bin_attribute parameters can often be const* Driver core: device_attribute parameters can often be const* Doc/stable rules: add new cherry-pick logic vfs: get_sb_single() - do not pass options twice devtmpfs: Convert dirlock to a mutex
2009-12-23Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Set i_nlink properly during reflink. ocfs2: Add reflinked file's inode to inode hash eariler. ocfs2: refcounttree.c cleanup. ocfs2: Find proper end cpos for a leaf refcount block.
2009-12-23Driver core: bin_attribute parameters can often be const*Phil Carmody
Many struct bin_attribute descriptors are purely read-only structures, and there's no need to change them. Therefore make the promise not to, which will let those descriptors be put in a ro section. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-23vfs: get_sb_single() - do not pass options twiceKay Sievers
Filesystem code usually destroys the option buffer while parsing it. This leads to errors when the same buffer is passed twice. In case we fill a new superblock do not call remount. This is needed to quite a warning that the debugfs code causes every boot. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-23jbd2: don't use __GFP_NOFAIL in journal_init_common()Andrew Morton
It triggers the warning in get_page_from_freelist(), and it isn't appropriate to use __GFP_NOFAIL here anyway. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14843 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23ext4: flush delalloc blocks when space is lowEric Sandeen
Creating many small files in rapid succession on a small filesystem can lead to spurious ENOSPC; on a 104MB filesystem: for i in `seq 1 22500`; do echo -n > $SCRATCH_MNT/$i echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i done leads to ENOSPC even though after a sync, 40% of the fs is free again. This is because we reserve worst-case metadata for delalloc writes, and when data is allocated that worst-case reservation is not usually needed. When freespace is low, kicking off an async writeback will start converting that worst-case space usage into something more realistic, almost always freeing up space to continue. This resolves the testcase for me, and survives all 4 generic ENOSPC tests in xfstests. We'll still need a hard synchronous sync to squeeze out the last bit, but this fixes things up to a large degree. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23fs-writeback: Add helper function to start writeback if idleEric Sandeen
ext4, at least, would like to start pushing on writeback if it starts to get close to ENOSPC when reserving worst-case blocks for delalloc writes. Writing out delalloc data will convert those worst-case predictions into usually smaller actual usage, freeing up space before we hit ENOSPC based on this speculation. Thanks to Jens for the suggestion for the helper function, & the naming help. I've made the helper return status on whether writeback was started even though I don't plan to use it in the ext4 patch; it seems like it would be potentially useful to test this in some cases. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Jan Kara <jack@suse.cz>
2009-12-23ext4: Eliminate potential double free on error pathJulia Lawall
b_entry_name and buffer are initially NULL, are initialized within a loop to the result of calling kmalloc, and are freed at the bottom of this loop. The loop contains gotos to cleanup, which also frees b_entry_name and buffer. Some of these gotos are before the reinitializations of b_entry_name and buffer. To maintain the invariant that b_entry_name and buffer are NULL at the top of the loop, and thus acceptable arguments to kfree, these variables are now set to NULL after the kfrees. This seems to be the simplest solution. A more complicated solution would be to introduce more labels in the error handling code at the end of the function. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ identifier E; expression E1; iterator I; statement S; @@ *kfree(E); ... when != E = E1 when != I(E,...) S when != &E *kfree(E); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23ext4: fix unsigned long long printk warning in super.cAndrew Morton
sparc64 allmodconfig: fs/ext4/super.c: In function `lifetime_write_kbytes_show': fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4) fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4) Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23quota: Improve checking of quota file headerJan Kara
When we are asked for vfsv0 quota format and the file is in vfsv1 format (or vice versa), refuse to use the quota file. Also return with error when we don't like the header of quota file. Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23jbd: jbd-debug and jbd2-debug should be writableYin Kangkai
jbd-debug and jbd2-debug is currently read-only (S_IRUGO), which is not correct. Make it writable so that we can start debuging. Signed-off-by: Yin Kangkai <kangkai.yin@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext4: fix sleep inside spinlock issue with quota and dealloc (#14739)Dmitry Monakhov
Unlock i_block_reservation_lock before vfs_dq_reserve_block(). This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=14739 CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext4: Fix potential quota deadlockDmitry Monakhov
We have to delay vfs_dq_claim_space() until allocation context destruction. Currently we have following call-trace: ext4_mb_new_blocks() /* task is already holding ac->alloc_semp */ ->ext4_mb_mark_diskspace_used ->vfs_dq_claim_space() /* acquire dqptr_sem here. Possible deadlock */ ->ext4_mb_release_context() /* drop ac->alloc_semp here */ Let's move quota claiming to ext4_da_update_reserve_space() ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.32-rc7 #18 ------------------------------------------------------- write-truncate-/3465 is trying to acquire lock: (&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0 but task is already holding lock: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&meta_group_info[i]->alloc_sem){++++..}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 [<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870 [<c029c9d3>] ext4_free_blocks+0x73/0x130 [<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0 [<c02a8087>] ext4_truncate+0x187/0x5e0 [<c01e0f7b>] vmtruncate+0x6b/0x70 [<c022ec02>] inode_setattr+0x62/0x190 [<c02a2d7a>] ext4_setattr+0x25a/0x370 [<c022ee81>] notify_change+0x151/0x340 [<c021349d>] do_truncate+0x6d/0xa0 [<c0221034>] may_open+0x1d4/0x200 [<c022412b>] do_filp_open+0x1eb/0x910 [<c021244d>] do_sys_open+0x6d/0x140 [<c021258e>] sys_open+0x2e/0x40 [<c0103100>] sysenter_do_call+0x12/0x32 -> #2 (&ei->i_data_sem){++++..}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c02a5787>] ext4_get_blocks+0x47/0x450 [<c02a74c1>] ext4_getblk+0x61/0x1d0 [<c02a7a7f>] ext4_bread+0x1f/0xa0 [<c02bcddc>] ext4_quota_write+0x12c/0x310 [<c0262d23>] qtree_write_dquot+0x93/0x120 [<c0261708>] v2_write_dquot+0x28/0x30 [<c025d3fb>] dquot_commit+0xab/0xf0 [<c02be977>] ext4_write_dquot+0x77/0x90 [<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50 [<c025e321>] dquot_alloc_inode+0x101/0x180 [<c029fec2>] ext4_new_inode+0x602/0xf00 [<c02ad789>] ext4_create+0x89/0x150 [<c0221ff2>] vfs_create+0xa2/0xc0 [<c02246e7>] do_filp_open+0x7a7/0x910 [<c021244d>] do_sys_open+0x6d/0x140 [<c021258e>] sys_open+0x2e/0x40 [<c0103100>] sysenter_do_call+0x12/0x32 -> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0526505>] mutex_lock_nested+0x65/0x2d0 [<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0 [<c02610af>] vfs_quota_on_path+0x5f/0x70 [<c02bc812>] ext4_quota_on+0x112/0x190 [<c026345a>] sys_quotactl+0x44a/0x8a0 [<c0103100>] sysenter_do_call+0x12/0x32 -> #0 (&s->s_dquot.dqptr_sem){++++..}: [<c017d361>] __lock_acquire+0x1091/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c025e73b>] dquot_claim_space+0x3b/0x1b0 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0 [<c02a5966>] ext4_get_blocks+0x226/0x450 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0 [<c02a6ed6>] ext4_da_writepages+0x506/0x790 [<c01de272>] do_writepages+0x22/0x50 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80 [<c01d7b9b>] filemap_flush+0x2b/0x30 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60 [<c029e595>] ext4_release_file+0x75/0xb0 [<c0216b59>] __fput+0xf9/0x210 [<c0216c97>] fput+0x27/0x30 [<c02122dc>] filp_close+0x4c/0x80 [<c014510e>] put_files_struct+0x6e/0xd0 [<c01451b7>] exit_files+0x47/0x60 [<c0146a24>] do_exit+0x144/0x710 [<c0147028>] do_group_exit+0x38/0xa0 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410 [<c0102849>] do_notify_resume+0xb9/0x890 [<c01032d2>] work_notifysig+0x13/0x21 other info that might help us debug this: 3 locks held by write-truncate-/3465: #0: (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0 #1: (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450 #2: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 stack backtrace: Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18 Call Trace: [<c0524cb3>] ? printk+0x1d/0x22 [<c017ac9a>] print_circular_bug+0xca/0xd0 [<c017d361>] __lock_acquire+0x1091/0x1260 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0 [<c0527191>] down_read+0x51/0x90 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0 [<c025e73b>] dquot_claim_space+0x3b/0x1b0 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530 [<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c052712c>] ? down_write+0x8c/0xa0 [<c02a5966>] ext4_get_blocks+0x226/0x450 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c017908b>] ? trace_hardirqs_off+0xb/0x10 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0 [<c01d69cc>] ? find_get_pages_tag+0x16c/0x180 [<c01d6860>] ? find_get_pages_tag+0x0/0x180 [<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0 [<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40 [<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0 [<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0 [<c02a6ed6>] ext4_da_writepages+0x506/0x790 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c02a69d0>] ? ext4_da_writepages+0x0/0x790 [<c01de272>] do_writepages+0x22/0x50 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80 [<c01d7b9b>] filemap_flush+0x2b/0x30 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60 [<c029e595>] ext4_release_file+0x75/0xb0 [<c0216b59>] __fput+0xf9/0x210 [<c0216c97>] fput+0x27/0x30 [<c02122dc>] filp_close+0x4c/0x80 [<c014510e>] put_files_struct+0x6e/0xd0 [<c01451b7>] exit_files+0x47/0x60 [<c0146a24>] do_exit+0x144/0x710 [<c017b163>] ? lock_release_holdtime+0x33/0x210 [<c0528137>] ? _spin_unlock_irq+0x27/0x30 [<c0147028>] do_group_exit+0x38/0xa0 [<c017babb>] ? trace_hardirqs_on+0xb/0x10 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410 [<c0102849>] do_notify_resume+0xb9/0x890 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c017b163>] ? lock_release_holdtime+0x33/0x210 [<c0165b50>] ? autoremove_wake_function+0x0/0x50 [<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190 [<c017babb>] ? trace_hardirqs_on+0xb/0x10 [<c0300ba4>] ? security_file_permission+0x14/0x20 [<c0215761>] ? vfs_write+0x131/0x190 [<c0214f50>] ? do_sync_write+0x0/0x120 [<c0103115>] ? sysenter_do_call+0x27/0x32 [<c01032d2>] work_notifysig+0x13/0x21 CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23quota: Fix 64-bit limits setting on 32-bit archsJan Kara
Fix warnings: fs/quota/quota_v2.c: In function ‘v2_read_file_info’: fs/quota/quota_v2.c:123: warning: integer constant is too large for ‘long’ type fs/quota/quota_v2.c:124: warning: integer constant is too large for ‘long’ type Reported-by: Jerry Leo <jerryleo860202@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext3: Replace lock/unlock_super() with an explicit lock for resizingEric Sandeen
Use a separate lock to protect s_groups_count and the other block group descriptors which get changed via an on-line resize operation, so we can stop overloading the use of lock_super(). Port of ext4 commit 32ed5058ce90024efcd811254b4b1de0468099df by Theodore Ts'o <tytso@mit.edu>. CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext3: Replace lock/unlock_super() with an explicit lock for the orphan listEric Sandeen
Use a separate lock to protect the orphan list, so we can stop overloading the use of lock_super(). Port of ext4 commit 3b9d4ed26680771295d904a6b83e88e620780893 by Theodore Ts'o <tytso@mit.edu>. CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext3: ext3_mark_recovery_complete() doesn't need to use lock_superEric Sandeen
The function ext3_mark_recovery_complete() is called from two call paths: either (a) while mounting the filesystem, in which case there's no danger of any other CPU calling write_super() until the mount is completed, and (b) while remounting the filesystem read-write, in which case the fs core has already locked the superblock. This also allows us to take out a very vile unlock_super()/lock_super() pair in ext3_remount(). Port of ext4 commit a63c9eb2ce6f5028da90f282798232c4f398ceb8 by Theodore Ts'o <tytso@mit.edu>. CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext3: Remove outdated comment about lock_super()Eric Sandeen
ext3_fill_super() is no longer called by read_super(), and it is no longer called with the superblock locked. The unlock_super()/lock_super() is no longer present, so this comment is entirely superfluous. Port of ext4 commit 32ed5058ce90024efcd811254b4b1de0468099df by Theodore Ts'o <tytso@mit.edu>. CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23quota: Move duplicated code to separate functionsDmitry Monakhov
- for(..) { mark_dquot_dirty(); } -> mark_all_dquot_dirty() - for(..) { dput(); } -> dqput_all() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext4: Convert to generic reserved quota's space management.Dmitry Monakhov
This patch also fixes write vs chown race condition. Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23quota: decouple fs reserved space from quota reservationDmitry Monakhov
Currently inode_reservation is managed by fs itself and this reservation is transfered on dquot_transfer(). This means what inode_reservation must always be in sync with dquot->dq_dqb.dqb_rsvspace. Otherwise dquot_transfer() will result in incorrect quota(WARN_ON in dquot_claim_reserved_space() will be triggered) This is not easy because of complex locking order issues for example http://bugzilla.kernel.org/show_bug.cgi?id=14739 The patch introduce quota reservation field for each fs-inode (fs specific inode is used in order to prevent bloating generic vfs inode). This reservation is managed by quota code internally similar to i_blocks/i_bytes and may not be always in sync with internal fs reservation. Also perform some code rearrangement: - Unify dquot_reserve_space() and dquot_reserve_space() - Unify dquot_release_reserved_space() and dquot_free_space() - Also this patch add missing warning update to release_rsv() dquot_release_reserved_space() must call flush_warnings() as dquot_free_space() does. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23Add unlocked version of inode_add_bytes() functionDmitry Monakhov
Quota code requires unlocked version of this function. Off course we can just copy-paste the code, but copy-pasting is always an evil. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext3: quota macros cleanup [V2]Dmitry Monakhov
Currently all quota block reservation macros contains hardcoded "2" aka MAXQUOTAS value. This is no good because in some places it is not obvious to understand what does this digit represent. Let's introduce new macro with self descriptive name. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23ext4, jbd2: Add barriers for file systems with exernal journalsTheodore Ts'o
This is a bit complicated because we are trying to optimize when we send barriers to the fs data disk. We could just throw in an extra barrier to the data disk whenever we send a barrier to the journal disk, but that's not always strictly necessary. We only need to send a barrier during a commit when there are data blocks which are must be written out due to an inode written in ordered mode, or if fsync() depends on the commit to force data blocks to disk. Finally, before we drop transactions from the beginning of the journal during a checkpoint operation, we need to guarantee that any blocks that were flushed out to the data disk are firmly on the rust platter before we drop the transaction from the journal. Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-22jfs: Fix 32bit build warningAlan Cox
loff_t is a type that isn't entirely dependant upon 32 v 64bit choice Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22Sanitize f_flags helpersAl Viro
* pull ACC_MODE to fs.h; we have several copies all over the place * nightmarish expression calculating f_mode by f_flags deserves a helper too (OPEN_FMODE(flags)) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22Fix f_flags/f_mode in case of lookup_instantiate_filp() from open(pathname, 3)Al Viro
Just set f_flags when shoving struct file into nameidata; don't postpone that until __dentry_open(). do_filp_open() has correct value; lookup_instantiate_filp() doesn't - we lose the difference between O_RDWR and 3 by that point. We still set .intent.open.flags, so no fs code needs to be changed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22anonfd: Allow making anon files read-onlyRoland Dreier
It seems a couple places such as arch/ia64/kernel/perfmon.c and drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile() instead of a private pseudo-fs + alloc_file(), if only there were a way to get a read-only file. So provide this by having anon_inode_getfile() create a read-only file if we pass O_RDONLY in flags. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22fs/compat_ioctl.c: fix build error when !BLOCKArnd Bergmann
No driver uses SG_SET_TRANSFORM any more in Linux, since the ide-scsi driver was removed in 2.6.29. The compat-ioctl cleanup series moved the handling for this around, which broke building without CONFIG_BLOCK. Just remove the code handling it for compat mode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22alloc_file(): simplify handling of mnt_clone_write() errorsRoland Dreier
When alloc_file() and init_file() were combined, the error handling of mnt_clone_write() was taken into alloc_file() in a somewhat obfuscated way. Since we don't use the error code for anything except warning, we might as well warn directly without an extra variable. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-20nfsd: fix "insecure" export optionJ. Bruce Fields
A typo in 12045a6ee9908b "nfsd: let "insecure" flag vary by pseudoflavor" reversed the sense of the "insecure" flag. Reported-by: Michael Guntsche <mike@it-loops.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-20nfsd: fix "insecure" export optionJ. Bruce Fields
A typo in 12045a6ee9908b "nfsd: let "insecure" flag vary by pseudoflavor" reversed the sense of the "insecure" flag. Reported-by: Michael Guntsche <mike@it-loops.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-12-19Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) sched: Fix broken assertion sched: Assert task state bits at build time sched: Update task_state_arraypwith new states sched: Add missing state chars to TASK_STATE_TO_CHAR_STR sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits sched: Teach might_sleep() about preemptible RCU sched: Make warning less noisy sched: Simplify set_task_cpu() sched: Remove the cfs_rq dependency from set_task_cpu() sched: Add pre and post wakeup hooks sched: Move kthread_bind() back to kthread.c sched: Fix select_task_rq() vs hotplug issues sched: Fix sched_exec() balancing sched: Ensure set_task_cpu() is never called on blocked tasks sched: Use TASK_WAKING for fork wakups sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE sched: Fix task_hot() test order sched: Fix set_cpu_active() in cpu_down() sched: Mark boot-cpu active before smp_init() sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK ...