aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2010-03-06mm: change anon_vma linking to fix multi-process server scalability issueRik van Riel
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06include/linux/fs.h: convert FMODE_* constants to hexAndrew Morton
It was tolerable until Eric went and added 8388608. Cc: Eric Paris <eparis@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOMWu Fengguang
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM. POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance: a 16K read will be carried out in 4 _sync_ 1-page reads. In other places, ra_pages==0 means - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs - some IO error happened where multi-page read IO won't help or should be avoided. POSIX_FADV_RANDOM actually want a different semantics: to disable the *heuristic* readahead algorithm, and to use a dumb one which faithfully submit read IO for whatever application requests. So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM. Note that the random hint is not likely to help random reads performance noticeably. And it may be too permissive on huge request size (its IO size is not limited by read_ahead_kb). In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall (NFS read) performance of the application increased by 313%! Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.org> [2.6.33.x] Cc: <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06memory-hotplug: create /sys/firmware/memmap entry for new memoryakpm@linux-foundation.org
A memmap is a directory in sysfs which includes 3 text files: start, end and type. For example: start: 0x100000 end: 0x7e7b1cff type: System RAM Interface firmware_map_add was not called explicitly. Remove it and add function firmware_map_add_hotplug as hotplug interface of memmap. Each memory entry has a memmap in sysfs, When we hot-add new memory, sysfs does not export memmap entry for it. We add a call in function add_memory to function firmware_map_add_hotplug. Add a new function add_sysfs_fw_map_entry() to create memmap entry, it will be called when initialize memmap and hot-add memory. [akpm@linux-foundation.org: un-kernedoc a no longer kerneldoc comment] Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: restore zone->all_unreclaimable to independence wordKOSAKI Motohiro
commit e815af95 ("change all_unreclaimable zone member to flags") changed all_unreclaimable member to bit flag. But it had an undesireble side effect. free_one_page() is one of most hot path in linux kernel and increasing atomic ops in it can reduce kernel performance a bit. Thus, this patch revert such commit partially. at least all_unreclaimable shouldn't share memory word with other zone flags. [akpm@linux-foundation.org: fix patch interaction] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: remove free_hot_page()Li Hong
free_hot_page() is just a wrapper around free_hot_cold_page() with parameter 'cold = 0'. After adding a clear comment for free_hot_cold_page(), it is reasonable to remove a level of call. [akpm@linux-foundation.org: fix build] Signed-off-by: Li Hong <lihong.hi@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: count swap usageKAMEZAWA Hiroyuki
A frequent questions from users about memory management is what numbers of swap ents are user for processes. And this information will give some hints to oom-killer. Besides we can count the number of swapents per a process by scanning /proc/<pid>/smaps, this is very slow and not good for usual process information handler which works like 'ps' or 'top'. (ps or top is now enough slow..) This patch adds a counter of swapents to mm_counter and update is at each swap events. Information is exported via /proc/<pid>/status file as [kamezawa@bluextal memory]$ cat /proc/self/status Name: cat State: R (running) Tgid: 2910 Pid: 2910 PPid: 2823 TracerPid: 0 Uid: 500 500 500 500 Gid: 500 500 500 500 FDSize: 256 Groups: 500 VmPeak: 82696 kB VmSize: 82696 kB VmLck: 0 kB VmHWM: 432 kB VmRSS: 432 kB VmData: 172 kB VmStk: 84 kB VmExe: 48 kB VmLib: 1568 kB VmPTE: 40 kB VmSwap: 0 kB <=============== this. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: avoid false sharing of mm_counterKAMEZAWA Hiroyuki
Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: clean up mm_counterKAMEZAWA Hiroyuki
Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06bitops: rename for_each_bit() to for_each_set_bit()Akinobu Mita
Rename for_each_bit to for_each_set_bit in the kernel source tree. To permit for_each_clear_bit(), should that ever be added. The patch includes a macro to map the old for_each_bit() onto the new for_each_set_bit(). This is a (very) temporary thing to ease the migration. [akpm@linux-foundation.org: add temporary for_each_bit()] Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05Merge branch 'slab-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLUB: Fix per-cpu merge conflict failslab: add ability to filter slab caches slab: fix regression in touched logic dma kmalloc handling fixes slub: remove impossible condition slab: initialize unused alien cache entry as NULL at alloc_alien_cache(). SLUB: Make slub statistics use this_cpu_inc SLUB: this_cpu: Remove slub kmem_cache fields SLUB: Get rid of dynamic DMA kmalloc cache allocation SLUB: Use this_cpu operations in slub
2010-03-05Merge branch 'nfs-for-2.6.34' of ↵Linus Torvalds
git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (44 commits) NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping NFS: Clean up nfs_sync_mapping NFS: Simplify nfs_wb_page() NFS: Replace __nfs_write_mapping with sync_inode() NFS: Simplify nfs_wb_page_cancel() NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set NFS: Reduce the number of unnecessary COMMIT calls NFS: Add a count of the number of unstable writes carried by an inode NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c nfs41 fix NFS4ERR_CLID_INUSE for exchange id NFS: Fix an allocation-under-spinlock bug SUNRPC: Handle EINVAL error returns from the TCP connect operation NFSv4.1: Various fixes to the sequence flag error handling nfs4: renewd renew operations should take/put a client reference nfs41: renewd sequence operations should take/put client reference nfs: prevent backlogging of renewd requests nfs: kill renewd before clearing client minor version NFS: Make close(2) asynchronous when closing NFS O_DIRECT files NFS: Improve NFS iostat byte count accuracy for writes ...
2010-03-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: Add hardlink support to .u extension 9P2010.L handshake: .L protocol negotiation 9P2010.L handshake: Remove "dotu" variable 9P2010.L handshake: Add mount option 9P2010.L handshake: Add VFS flags net/9p: Handle mount errors correctly. net/9p: Remove MAX_9P_CHAN limit net/9p: Add multi channel support.
2010-03-05Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits) quota: stop using QUOTA_OK / NO_QUOTA dquot: cleanup dquot initialize routine dquot: move dquot initialization responsibility into the filesystem dquot: cleanup dquot drop routine dquot: move dquot drop responsibility into the filesystem dquot: cleanup dquot transfer routine dquot: move dquot transfer responsibility into the filesystem dquot: cleanup inode allocation / freeing routines dquot: cleanup space allocation / freeing routines ext3: add writepage sanity checks ext3: Truncate allocated blocks if direct IO write fails to update i_size quota: Properly invalidate caches even for filesystems with blocksize < pagesize quota: generalize quota transfer interface quota: sb_quota state flags cleanup jbd: Delay discarding buffers in journal_unmap_buffer ext3: quota_write cross block boundary behaviour quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota quota: split out compat_sys_quotactl support from quota.c quota: split out netlink notification support from quota.c quota: remove invalid optimization from quota_sync_all ... Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
2010-03-05Merge branch 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
* 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (145 commits) KVM: x86: Add KVM_CAP_X86_ROBUST_SINGLESTEP KVM: VMX: Update instruction length on intercepted BP KVM: Fix emulate_sys[call, enter, exit]()'s fault handling KVM: Fix segment descriptor loading KVM: Fix load_guest_segment_descriptor() to inject page fault KVM: x86 emulator: Forbid modifying CS segment register by mov instruction KVM: Convert kvm->requests_lock to raw_spinlock_t KVM: Convert i8254/i8259 locks to raw_spinlocks KVM: x86 emulator: disallow opcode 82 in 64-bit mode KVM: x86 emulator: code style cleanup KVM: Plan obsolescence of kernel allocated slots, paravirt mmu KVM: x86 emulator: Add LOCK prefix validity checking KVM: x86 emulator: Check CPL level during privilege instruction emulation KVM: x86 emulator: Fix popf emulation KVM: x86 emulator: Check IOPL level during io instruction emulation KVM: x86 emulator: fix memory access during x86 emulation KVM: x86 emulator: Add Virtual-8086 mode of emulation KVM: x86 emulator: Add group9 instruction decoding KVM: x86 emulator: Add group8 instruction decoding KVM: do not store wqh in irqfd ... Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-05net/9p: Remove MAX_9P_CHAN limitAneesh Kumar K.V
Use a list to track the channel instead of statically allocated array Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-03-05net/9p: Add multi channel support.Aneesh Kumar K.V
This is needed for supporting multiple mount points. We can find out the device names to be used with mount by checking /sys/devices/virtio-pci/virtio*/device file if the device file have value 9 then the specific virtio device can be used for mounting. ex: #cat /sys/devices/virtio-pci/virtio1/device 9 now we can mount using # mount -t 9p -o trans=virtio virtio1 /mnt/ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-03-05Merge branch 'writeback-for-2.6.34' into nfs-for-2.6.34Trond Myklebust
2010-03-05NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mappingTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05NFS: Simplify nfs_wb_page()Trond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05NFS: Replace __nfs_write_mapping with sync_inode()Trond Myklebust
Now that we have correct COMMIT semantics in writeback_single_inode, we can reduce and simplify nfs_wb_all(). Also replace nfs_wb_nocommit() with a call to filemap_write_and_wait(), which doesn't need to hold the inode->i_mutex. With that done, we can eliminate nfs_write_mapping() altogether. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05NFS: Simplify nfs_wb_page_cancel()Trond Myklebust
In all cases we should be able to just remove the request and call cancel_dirty_page(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05NFS: Add a count of the number of unstable writes carried by an inodeTrond Myklebust
In order to know when we should do opportunistic commits of the unstable writes, when the VM is doing a background flush, we add a field to count the number of unstable writes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.cTrond Myklebust
The sole purpose of nfs_write_inode is to commit unstable writes, so move it into fs/nfs/write.c, and make nfs_commit_inode static. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-05Merge branch 'write_inode2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: pass writeback_control to ->write_inode make sure data is on disk before calling ->write_inode
2010-03-05Merge branch 'perf-probes-for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Issue at least one memory barrier in stop_machine_text_poke() perf probe: Correct probe syntax on command line help perf probe: Add lazy line matching support perf probe: Show more lines after last line perf probe: Check function address range strictly in line finder perf probe: Use libdw callback routines perf probe: Use elfutils-libdw for analyzing debuginfo perf probe: Rename probe finder functions perf probe: Fix bugs in line range finder perf probe: Update perf probe document perf probe: Do not show --line option without dwarf support kprobes: Add documents of jump optimization kprobes/x86: Support kprobes jump optimization on x86 x86: Add text_poke_smp for SMP cross modifying code kprobes/x86: Cleanup save/restore registers kprobes/x86: Boost probes when reentering kprobes: Jump optimization sysctl interface kprobes: Introduce kprobes jump optimization kprobes: Introduce generic insn_slot framework kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
2010-03-05Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits) ext4: fix up rb_root initializations to use RB_ROOT ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl ext4: Fix the NULL reference in double_down_write_data_sem() ext4: Fix insertion point of extent in mext_insert_across_blocks() ext4: consolidate in_range() definitions ext4: cleanup to use ext4_grp_offs_to_block() ext4: cleanup to use ext4_group_first_block_no() ext4: Release page references acquired in ext4_da_block_invalidatepages ext4: Fix ext4_quota_write cross block boundary behaviour ext4: Convert BUG_ON checks to use ext4_error() instead ext4: Use direct_IO_no_locking in ext4 dio read ext4: use ext4_get_block_write in buffer write ext4: mechanical rename some of the direct I/O get_block's identifiers ext4: make "offset" consistent in ext4_check_dir_entry() ext4: Handle non empty on-disk orphan link ext4: explicitly remove inode from orphan list after failed direct io ext4: fix error handling in migrate ext4: deprecate obsoleted mount options ext4: Fix fencepost error in chosing choosing group vs file preallocation. jbd2: clean up an assertion in jbd2_journal_commit_transaction() ...
2010-03-05pass writeback_control to ->write_inodeChristoph Hellwig
This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by beeing able to distinguish between the different callers in more detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05quota: stop using QUOTA_OK / NO_QUOTAChristoph Hellwig
Just use 0 / -EDQUOT directly - that's what it translates to anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: cleanup dquot initialize routineChristoph Hellwig
Get rid of the initialize dquot operation - it is now always called from the filesystem and if a filesystem really needs it's own (which none currently does) it can just call into it's own routine directly. Rename the now static low-level dquot_initialize helper to __dquot_initialize and vfs_dq_init to dquot_initialize to have a consistent namespace. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: move dquot initialization responsibility into the filesystemChristoph Hellwig
Currently various places in the VFS call vfs_dq_init directly. This means we tie the quota code into the VFS. Get rid of that and make the filesystem responsible for the initialization. For most metadata operations this is a straight forward move into the methods, but for truncate and open it's a bit more complicated. For truncate we currently only call vfs_dq_init for the sys_truncate case because open already takes care of it for ftruncate and open(O_TRUNC) - the new code causes an additional vfs_dq_init for those which is harmless. For open the initialization is moved from do_filp_open into the open method, which means it happens slightly earlier now, and only for regular files. The latter is fine because we don't need to initialize it for operations on special files, and we already do it as part of the namespace operations for directories. Add a dquot_file_open helper that filesystems that support generic quotas can use to fill in ->open. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: cleanup dquot drop routineChristoph Hellwig
Get rid of the drop dquot operation - it is now always called from the filesystem and if a filesystem really needs it's own (which none currently does) it can just call into it's own routine directly. Rename the now static low-level dquot_drop helper to __dquot_drop and vfs_dq_drop to dquot_drop to have a consistent namespace. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: cleanup dquot transfer routineChristoph Hellwig
Get rid of the transfer dquot operation - it is now always called from the filesystem and if a filesystem really needs it's own (which none currently does) it can just call into it's own routine directly. Rename the now static low-level dquot_transfer helper to __dquot_transfer and vfs_dq_transfer to dquot_transfer to have a consistent namespace, and make the new dquot_transfer return a normal negative errno value which all callers expect. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: cleanup inode allocation / freeing routinesChristoph Hellwig
Get rid of the alloc_inode and free_inode dquot operations - they are always called from the filesystem and if a filesystem really needs their own (which none currently does) it can just call into it's own routine directly. Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always call the lowlevel dquot_alloc_inode / dqout_free_inode routines directly, which now lose the number argument which is always 1. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05dquot: cleanup space allocation / freeing routinesChristoph Hellwig
Get rid of the alloc_space, free_space, reserve_space, claim_space and release_rsv dquot operations - they are always called from the filesystem and if a filesystem really needs their own (which none currently does) it can just call into it's own routine directly. Move shared logic into the common __dquot_alloc_space, dquot_claim_space_nodirty and __dquot_free_space low-level methods, and rationalize the wrappers around it to move as much as possible code into the common block for CONFIG_QUOTA vs not. Also rename all these helpers to be named dquot_* instead of vfs_dq_*. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: generalize quota transfer interfaceDmitry Monakhov
Current quota transfer interface support only uid/gid. This patch extend interface in order to support various quotas types The goal is accomplished without changes in most frequently used vfs_dq_transfer() func. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: sb_quota state flags cleanupDmitry Monakhov
- remove hardcoded USRQUOTA/GRPQUOTA flags - convert int to bool for appropriate functions Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: move code from sync_quota_sb into vfs_quota_syncChristoph Hellwig
Currenly sync_quota_sb does a lot of sync and truncate action that only applies to "VFS" style quotas and is actively harmful for the sync performance in XFS. Move it into vfs_quota_sync and add a wait parameter to ->quota_sync to tell if we need it or not. My audit of the GFS2 code says it's also not needed given the way GFS2 implements quotas, but I'd be happy if this can get a detailed review. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: clean up Q_XQUOTASYNCChristoph Hellwig
Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something entirely different in it than the rest of the filesystems. xfs_quota which calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed allocations, while the "VFS" quota support wants to flush changes to the quota file. So make Q_XQUOTASYNC call into the writeback code directly and make the quota_sync method optional as XFS doesn't need in the sense expected by the rest of the quota code. GFS2 was using limited XFS-style quota and has a quota_sync method fitting neither the style used by vfs_quota_sync nor xfs_fs_quota_sync. I left it in for now as per discussion with Steve it expects to be called from the sync path this way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: manage reserved space when quota is not active [v2]Dmitry Monakhov
Since we implemented generic reserved space management interface, then it is possible to account reserved space even when quota is not active (similar to i_blocks/i_bytes). Without this patch following testcase result in massive comlain from WARN_ON in dquot_claim_space() TEST_CASE: mount /dev/sdb /mnt -oquota dd if=/dev/zero of=/mnt/test bs=1M count=1 quotaon /mnt # fs_reserved_spave == 1Mb # quota_reserved_space == 0, because quota was disabled dd if=/dev/zero of=/mnt/test seek=1 bs=1M count=1 # fs_reserved_spave == 2Mb # quota_reserved_space == 1Mb sync # ->dquot_claim_space() -> WARN_ON Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05jbd[2]: remove references to BUFFER_DEBUGChristoph Egger
CONFIG_BUFFER_DEBUG seems to have been removed from the documentation somewhere around 2.4.15 and seemingly hasn't been available even longer. It is, however, still referenced at one place from the jbd code (one is a copy of the other header). Time to clean it up Signed-off-by: Christoph Egger <siccegge@stud.informatik.uni-erlangen.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05ext3: Use bitops to read/modify EXT3_I(inode)->i_stateJan Kara
At several places we modify EXT3_I(inode)->i_state without holding i_mutex (ext3_release_file, ext3_bmap, ext3_journalled_writepage, ext3_do_update_inode, ...). These modifications are racy and we can lose updates to i_state. So convert handling of i_state to use bitops which are atomic. Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-next-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-next-2.6: (49 commits) drivers/ide: Fix continuation line formats ide: fixed section mismatch warning in cmd640.c ide: ide_timing_compute() fixup ide: make ide_get_best_pio_mode() static via82cxxx: use ->pio_mode value to determine pair device speed tx493xide: use ->pio_mode value to determine pair device speed siimage: use ->pio_mode value to determine pair device speed palm_bk3710: use ->pio_mode value to determine pair device speed it821x: use ->pio_mode value to determine pair device speed cs5536: use ->pio_mode value to determine pair device speed cs5535: use ->pio_mode value to determine pair device speed cmd64x: fix handling of address setup timings amd74xx: use ->pio_mode value to determine pair device speed alim15x3: fix handling of UDMA enable bit alim15x3: fix handling of DMA timings alim15x3: fix handling of command timings alim15x3: fix handling of address setup timings ide-timings: use ->pio_mode value to determine fastest PIO speed ide: change ->set_dma_mode method parameters ide: change ->set_pio_mode method parameters ...
2010-03-04Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (28 commits) ioat: cleanup ->timer_fn() and ->cleanup_fn() prototypes ioat3: interrupt coalescing ioat: close potential BUG_ON race in the descriptor cleanup path ioat2: kill pending flag ioat3: use ioat2_quiesce() ioat3: cleanup, don't enable DCA completion writes DMAENGINE: COH 901 318 lli sg offset fix DMAENGINE: COH 901 318 configure channel direction DMAENGINE: COH 901 318 remove irq counting DMAENGINE: COH 901 318 descriptor pool refactoring DMAENGINE: COH 901 318 cleanups dma: Add MPC512x DMA driver Debugging options for the DMA engine subsystem iop-adma: redundant/wrong tests in iop_*_count()? dmatest: fix handling of an even number of xor_sources dmatest: correct raid6 PQ test fsldma: Fix cookie issues fsldma: Fix cookie issues dma: cases IPU_PIX_FMT_BGRA32, BGR32 and ABGR32 are the same in ipu_ch_param_set_size() dma: make Open Firmware device id constant ...
2010-03-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) init: Open /dev/console from rootfs mqueue: fix typo "failues" -> "failures" mqueue: only set error codes if they are really necessary mqueue: simplify do_open() error handling mqueue: apply mathematics distributivity on mq_bytes calculation mqueue: remove unneeded info->messages initialization mqueue: fix mq_open() file descriptor leak on user-space processes fix race in d_splice_alias() set S_DEAD on unlink() and non-directory rename() victims vfs: add NOFOLLOW flag to umount(2) get rid of ->mnt_parent in tomoyo/realpath hppfs can use existing proc_mnt, no need for do_kern_mount() in there Mirror MS_KERNMOUNT in ->mnt_flags get rid of useless vfsmount_lock use in put_mnt_ns() Take vfsmount_lock to fs/internal.h get rid of insanity with namespace roots in tomoyo take check for new events in namespace (guts of mounts_poll()) to namespace.c Don't mess with generic_permission() under ->d_lock in hpfs sanitize const/signedness for udf nilfs: sanitize const/signedness in dealing with ->d_name.name ... Fix up fairly trivial (famous last words...) conflicts in drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: (27 commits) Regulators: wm8400 - cleanup platform driver data handling Regulators: wm8994 - clean up driver data after removal Regulators: wm831x-xxx - clean up driver data after removal Regulators: pcap-regulator - clean up driver data after removal Regulators: max8660 - annotate probe and remove methods Regulators: max1586 - annotate probe and remove methods Regulators: lp3971 - fail if platform data was not supplied Regulators: tps6507x-regulator - mark probe method as __devinit Regulators: tps65023-regulator - mark probe method as __devinit Regulators: twl-regulator - mark probe function as __devinit Regulators: fixed - annotate probe and remove methods Regulators: ab3100 - fix probe and remove annotations Regulators: virtual - use sysfs attribute groups twl6030: regulator: Configure STATE register instead of REMAP regulator: Provide optional dummy regulator for consumers regulator: Assume regulators are enabled if they don't report anything regulator: Convert fixed voltage regulator to use enable_time() regulator: Add WM8994 regulator support regulator: enable max8649 regulator driver regulator: trivial: fix typos in user-visible Kconfig text ...
2010-03-04Merge branch 'drm-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6 * 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (151 commits) vga_switcheroo: disable default y by new rules. drm/nouveau: fix *staging* driver build with switcheroo off. drm/radeon: fix typo in Makefile vga_switcheroo: fix build on platforms with no ACPI drm/radeon: Fix printf type warning in 64bit system. drm/radeon/kms: bump the KMS version number for square tiling support. vga_switcheroo: initial implementation (v15) drm/radeon/kms: do not disable audio engine twice Revert "drm/radeon/kms: disable HDMI audio for now on rv710/rv730" drm/radeon/kms: do not preset audio stuff and start timer when not using audio drm/radeon: r100/r200 ums: block ability for userspace app to trash 0 page and beyond drm/ttm: fix function prototype to match implementation drm/radeon: use ALIGN instead of open coding it drm/radeon/kms: initialize set_surface_reg reg for rs600 asic drm/i915: Use a dmi quirk to skip a broken SDVO TV output. drm/i915: enable/disable LVDS port at DPMS time drm/i915: check for multiple write domains in pin_and_relocate drm/i915: clean-up i915_gem_flush_gpu_write_domain drm/i915: reuse i915_gpu_idle helper drm/i915: ensure lru ordering of fence_list ... Fixed trivial conflicts in drivers/gpu/vga/Kconfig
2010-03-04Merge branches 'slab/cleanups', 'slab/failslab', 'slab/fixes' and ↵Pekka Enberg
'slub/percpu' into slab-for-linus
2010-03-03Merge branch 'for-fsnotify' into for-linusAl Viro
2010-03-03vfs: add NOFOLLOW flag to umount(2)Miklos Szeredi
Add a new UMOUNT_NOFOLLOW flag to umount(2). This is needed to prevent symlink attacks in unprivileged unmounts (fuse, samba, ncpfs). Additionally, return -EINVAL if an unknown flag is used (and specify an explicitly unused flag: UMOUNT_UNUSED). This makes it possible for the caller to determine if a flag is supported or not. CC: Eugene Teo <eugene@redhat.com> CC: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>