Kernel - My Linux kernel repository

Age	Commit message (Collapse)	Author
2009-07-21	inotify: fix error paths in inotify_update_watch	Eric Paris
	inotify_update_watch could leave things in a horrid state on a number of error paths. We could try to remove idr entries that didn't exist, we could send an IN_IGNORED to userspace for watches that don't exist, and a bit of other stupidity. Clean these up by doing the idr addition before we put the mark on the inode since we can clean that up on error and getting off the inode's mark list is hard. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21	inotify: do not leak inode marks in inotify_add_watch	Eric Paris
	inotify_add_watch had a couple of problems. The biggest being that if inotify_add_watch was called on the same inode twice (to update or change the event mask) a refence was taken on the original inode mark by fsnotify_find_mark_entry but was not being dropped at the end of the inotify_add_watch call. Thus if inotify_rm_watch was called although the mark was removed from the inode, the refcnt wouldn't hit zero and we would leak memory. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21	inotify: drop user watch count when a watch is removed	Eric Paris
	The inotify rewrite forgot to drop the inotify watch use cound when a watch was removed. This means that a single inotify fd can only ever register a maximum of /proc/sys/fs/max_user_watches even if some of those had been freed. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-02	fs/notify/inotify: decrement user inotify count on close	Keith Packard
	The per-user inotify_devs value is incremented each time a new file is allocated, but never decremented. This led to inotify_init failing after a limited number of calls. Signed-off-by: Keith Packard <keithp@keithp.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-19	inotify: inotify_destroy_mark_entry could get called twice	Eric Paris
	inotify_destroy_mark_entry could get called twice for the same mark since it is called directly in inotify_rm_watch and when the mark is being destroyed for another reason. As an example assume that the file being watched was just deleted so inotify_destroy_mark_entry would get called from the path fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() -> fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry(). If this happened at the same time as userspace tried to remove a watch via inotify_rm_watch we could attempt to remove the mark from the idr twice and could thus double dec the ref cnt and potentially could be in a use after free/double free situation. The fix is to have inotify_rm_watch use the generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the inotify_destroy_mark_entry() function can only be called one. This patch also renames the function to inotify_ingored_remove_idr() so it is clear what is actually going on in the function. Hopefully this fixes: [ 20.342058] idr_remove called for id=20 which is not allocated. [ 20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077 [ 20.353933] Call Trace: [ 20.356410] [<ffffffff811a82b7>] idr_remove+0x115/0x18f [ 20.361737] [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75 [ 20.367061] [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf [ 20.373771] [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf [ 20.380306] [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10 [ 20.386238] [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170 [ 20.393293] [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf [ 20.399829] [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6 [ 20.405850] [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Eric Paris <eparis@redhat.com> Tested-by: Peter Ziljlstra <peterz@infradead.org>
2009-06-11	inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD	Eric Paris
	inotify and dnotify will both indicate that they want any event which came from a child inode. The fix is to mask off FS_EVENT_ON_CHILD when deciding if inotify or dnotify is interested in a given event. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11	inotify: reimplement inotify using fsnotify	Eric Paris
	Reimplement inotify_user using fsnotify. This should be feature for feature exactly the same as the original inotify_user. This does not make any changes to the in kernel inotify feature used by audit. Those patches (and the eventual removal of in kernel inotify) will come after the new inotify_user proves to be working correctly. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de>
2009-06-11	fsnotify: unified filesystem notification backend	Eric Paris
	fsnotify is a backend for filesystem notification. fsnotify does not provide any userspace interface but does provide the basis needed for other notification schemes such as dnotify. fsnotify can be extended to be the backend for inotify or the upcoming fanotify. fsnotify provides a mechanism for "groups" to register for some set of filesystem events and to then deliver those events to those groups for processing. fsnotify has a number of benefits, the first being actually shrinking the size of an inode. Before fsnotify to support both dnotify and inotify an inode had unsigned long i_dnotify_mask; /* Directory notify events / struct dnotify_struct i_dnotify; /* for directory notifications / struct list_head inotify_watches; / watches on this inode / struct mutex inotify_mutex; / protects the watches list But with fsnotify this same functionallity (and more) is done with just __u32 i_fsnotify_mask; /* all events for this inode / struct hlist_head i_fsnotify_mark_entries; / marks on this inode */ That's right, inotify, dnotify, and fanotify all in 64 bits. We used that much space just in inotify_watches alone, before this patch set. fsnotify object lifetime and locking is MUCH better than what we have today. inotify locking is incredibly complex. See 8f7b0ba1c8539 as an example of what's been busted since inception. inotify needs to know internal semantics of superblock destruction and unmounting to function. The inode pinning and vfs contortions are horrible. no fsnotify implementers do allocation under locks. This means things like f04b30de3 which (due to an overabundance of caution) changes GFP_KERNEL to GFP_NOFS can be reverted. There are no longer any allocation rules when using or implementing your own fsnotify listener. fsnotify paves the way for fanotify. In brief fanotify is a notification mechanism that delivers the lisener both an 'event' and an open file descriptor to the object in question. This means that fanotify is pathname agnostic. Some on lkml may not care for the original companies or users that pushed for TALPA, but fanotify was designed with flexibility and input for other users in mind. The readahead group expressed interest in fanotify as it could be used to profile disk access on boot without breaking the audit system. The desktop search groups have also expressed interest in fanotify as it solves a number of the race conditions and problems present with managing inotify when more than a limited number of specific files are of interest. fanotify can provide for a userspace access control system which makes it a clean interface for AV vendors to hook without trying to do binary patching on the syscall table, LSM, and everywhere else they do their things today. With this patch series fanotify can be implemented in less than 1200 lines of easy to review code. Almost all of which is the socket based user interface. This patch series builds fsnotify to the point that it can implement dnotify and inotify_user. Patches exist and will be sent soon after acceptance to finish the in kernel inotify conversion (audit) and implement fanotify. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de>
2009-05-06	inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive	Wu Fengguang
	There is what we believe to be a false positive reported by lockdep. inotify_inode_queue_event() => take inotify_mutex => kernel_event() => kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim => dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock The plan is to fix this via lockdep annotation, but that is proving to be quite involved. The patch flips the allocation over to GFP_NFS to shut the warning up, for the 2.6.30 release. Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm to switch it back to GFP_KERNEL so we don't forget. ================================= [ INFO: inconsistent lock state ] 2.6.30-rc2-next-20090417 #203 --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes: (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0 {RECLAIM_FS-ON-W} state was registered at: [<ffffffff81079188>] mark_held_locks+0x68/0x90 [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100 [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0 [<ffffffff81130652>] kernel_event+0xe2/0x190 [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230 [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110 [<ffffffff8110444d>] vfs_create+0xcd/0x140 [<ffffffff8110825d>] do_filp_open+0x88d/0xa20 [<ffffffff810f6b68>] do_sys_open+0x98/0x140 [<ffffffff810f6c50>] sys_open+0x20/0x30 [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 690455 hardirqs last enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80 hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0 softirqs last enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220 softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50 other info that might help us debug this: 2 locks held by kswapd0/380: #0: (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180 #1: (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0 stack backtrace: Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203 Call Trace: [<ffffffff810789ef>] print_usage_bug+0x19f/0x200 [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50 [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0 [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0 [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0 [<ffffffff810f478c>] ? slob_free+0x10c/0x370 [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff81012fe9>] ? sched_clock+0x9/0x10 [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0 [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0 [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0 [<ffffffff8110cb23>] d_kill+0x33/0x60 [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350 [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0 [<ffffffff810d0cc5>] shrink_slab+0x125/0x180 [<ffffffff810d1540>] kswapd+0x560/0x7a0 [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0 [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0 [<ffffffff8106555b>] kthread+0x5b/0xa0 [<ffffffff8100d40a>] child_rip+0xa/0x20 [<ffffffff8100cdd0>] ? restore_args+0x0/0x30 [<ffffffff81065500>] ? kthread+0x0/0xa0 [<ffffffff8100d400>] ? child_rip+0x0/0x20 [eparis@redhat.com: fix audit too] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matt Mackall <mpm@selenic.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Eric Paris <eparis@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-27	fs: avoid I_NEW inodes	Nick Piggin
	To be on the safe side, it should be less fragile to exclude I_NEW inodes from inode list scans by default (unless there is an important reason to have them). Normally they will get excluded (eg. by zero refcount or writecount etc), however it is a bit fragile for list walkers to know exactly what parts of the inode state is set up and valid to test when in I_NEW. So along these lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc checks with them too -- this shouldn't be a problem should it?) Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-02-18	inotify: fix GFP_KERNEL related deadlock	Ingo Molnar
	Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep assert: [ 1093.677775] [ 1093.677781] ================================= [ 1093.680031] [ INFO: inconsistent lock state ] [ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1 [ 1093.680031] --------------------------------- [ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1093.680031] (&inode->inotify_mutex){+.+.?.}, at: [<c0205942>] inotify_inode_is_dead+0x20/0x80 [ 1093.680031] {RECLAIM_FS-ON-W} state was registered at: [ 1093.680031] [<c01696b9>] mark_held_locks+0x43/0x5b [ 1093.680031] [<c016baa4>] lockdep_trace_alloc+0x6c/0x6e [ 1093.680031] [<c01cf8b0>] kmem_cache_alloc+0x20/0x150 [ 1093.680031] [<c040d0ec>] idr_pre_get+0x27/0x6c [ 1093.680031] [<c02056e3>] inotify_handle_get_wd+0x25/0xad [ 1093.680031] [<c0205f43>] inotify_add_watch+0x7a/0x129 [ 1093.680031] [<c020679e>] sys_inotify_add_watch+0x20f/0x250 [ 1093.680031] [<c010389e>] sysenter_do_call+0x12/0x35 [ 1093.680031] [<ffffffff>] 0xffffffff [ 1093.680031] irq event stamp: 60417 [ 1093.680031] hardirqs last enabled at (60417): [<c018d5f5>] call_rcu+0x53/0x59 [ 1093.680031] hardirqs last disabled at (60416): [<c018d5b9>] call_rcu+0x17/0x59 [ 1093.680031] softirqs last enabled at (59656): [<c0146229>] __do_softirq+0x157/0x16b [ 1093.680031] softirqs last disabled at (59651): [<c0106293>] do_softirq+0x74/0x15d [ 1093.680031] [ 1093.680031] other info that might help us debug this: [ 1093.680031] 2 locks held by kswapd0/308: [ 1093.680031] #0: (shrinker_rwsem){++++..}, at: [<c01b0502>] shrink_slab+0x36/0x189 [ 1093.680031] #1: (&type->s_umount_key#4){+++++.}, at: [<c01e6d77>] shrink_dcache_memory+0x110/0x1fb [ 1093.680031] [ 1093.680031] stack backtrace: [ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1 [ 1093.680031] Call Trace: [ 1093.680031] [<c016947a>] valid_state+0x12a/0x13d [ 1093.680031] [<c016954e>] mark_lock+0xc1/0x1e9 [ 1093.680031] [<c016a5b4>] ? check_usage_forwards+0x0/0x3f [ 1093.680031] [<c016ab74>] __lock_acquire+0x2c6/0xac8 [ 1093.680031] [<c01688d9>] ? register_lock_class+0x17/0x228 [ 1093.680031] [<c016b3d3>] lock_acquire+0x5d/0x7a [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c08824c4>] __mutex_lock_common+0x3a/0x4cb [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c08829ed>] mutex_lock_nested+0x2e/0x36 [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c0205942>] inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c01e6672>] dentry_iput+0x90/0xc2 [ 1093.680031] [<c01e67a3>] d_kill+0x21/0x45 [ 1093.680031] [<c01e6a46>] __shrink_dcache_sb+0x27f/0x355 [ 1093.680031] [<c01e6dc5>] shrink_dcache_memory+0x15e/0x1fb [ 1093.680031] [<c01b05ed>] shrink_slab+0x121/0x189 [ 1093.680031] [<c01b0d12>] kswapd+0x39f/0x561 [ 1093.680031] [<c01ae499>] ? isolate_pages_global+0x0/0x233 [ 1093.680031] [<c0157eae>] ? autoremove_wake_function+0x0/0x43 [ 1093.680031] [<c01b0973>] ? kswapd+0x0/0x561 [ 1093.680031] [<c0157daf>] kthread+0x41/0x82 [ 1093.680031] [<c0157d6e>] ? kthread+0x0/0x82 [ 1093.680031] [<c01043ab>] kernel_thread_helper+0x7/0x10 inotify_handle_get_wd() does idr_pre_get() which does a kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under extreme MM pressure. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: MinChan Kim <minchan.kim@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-26	inotify: clean up inotify_read and fix locking problems	Vegard Nossum
	If userspace supplies an invalid pointer to a read() of an inotify instance, the inotify device's event list mutex is unlocked twice. This causes an unbalance which effectively leaves the data structure unprotected, and we can trigger oopses by accessing the inotify instance from different tasks concurrently. The best fix (contributed largely by Linus) is a total rewrite of the function in question: On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote: > The thing to notice is that: > > - locking is done in just one place, and there is no question about it > not having an unlock. > > - that whole double-while(1)-loop thing is gone. > > - use multiple functions to make nesting and error handling sane > > - do error testing after doing the things you always need to do, ie do > this: > > mutex_lock(..) > ret = function_call(); > mutex_unlock(..) > > .. test ret here .. > > instead of doing conditional exits with unlocking or freeing. > > So if the code is written in this way, it may still be buggy, but at least > it's not buggy because of subtle "forgot to unlock" or "forgot to free" > issues. > > This _always_ unlocks if it locked, and it always frees if it got a > non-error kevent. Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Robert Love <rlove@google.com> Cc: <stable@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-14	[CVE-2009-0029] System call wrappers part 29	Heiko Carstens
	Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14	[CVE-2009-0029] System call wrappers part 28	Heiko Carstens
	Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-05	inotify: fix type errors in interfaces	Michael Kerrisk
	The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32' . That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor / __u32 mask; / watch mask / __u32 cookie; / cookie to synchronize two events / __u32 len; / length (including nulls) of name / char name[0]; / stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-31	filesystem notification: create fs/notify to contain all fs notification	Eric Paris
	Creating a generic filesystem notification interface, fsnotify, which will be used by inotify, dnotify, and eventually fanotify is really starting to clutter the fs directory. This patch simply moves inotify and dnotify into fs/notify/inotify and fs/notify/dnotify respectively to make both current fs/ and future notification tidier. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>