aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2008-07-16Merge branch 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6Linus Torvalds
* 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: include to compilation UBIFS: add new flash file system UBIFS: add brief documentation MAINTAINERS: add UBIFS section do_mounts: allow UBI root device name VFS: export sync_sb_inodes VFS: move inode_lock into sync_sb_inodes
2008-07-16Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (82 commits) NFSv4: Remove BKL from the nfsv4 state recovery SUNRPC: Remove the BKL from the callback functions NFS: Remove BKL from the readdir code NFS: Remove BKL from the symlink code NFS: Remove BKL from the sillydelete operations NFS: Remove the BKL from the rename, rmdir and unlink operations NFS: Remove BKL from NFS lookup code NFS: Remove the BKL from nfs_link() NFS: Remove the BKL from the inode creation operations NFS: Remove BKL usage from open() NFS: Remove BKL usage from the write path NFS: Remove the BKL from the permission checking code NFS: Remove attribute update related BKL references NFS: Remove BKL requirement from attribute updates NFS: Protect inode->i_nlink updates using inode->i_lock nfs: set correct fl_len in nlmclnt_test() SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier SUNRPC: None of rpcb_create's callers wants a privileged source port SUNRPC: Introduce a specific rpcb_create for contacting localhost ...
2008-07-16Fix compile issues in fs/compat_ioctl.c when CONFIG_BLOCK is disabledRandy Dunlap
Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid build errors: In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd': include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq' include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type include/scsi/scsi_cmnd.h: In function 'scsi_in': include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-15Merge branch 'bkl-removal' into nextTrond Myklebust
2008-07-15Merge branch 'devel' into nextTrond Myklebust
Conflicts: fs/nfs/file.c Fix up the conflict with Jon Corbet's bkl-removal tree
2008-07-15NFSv4: Remove BKL from the nfsv4 state recoveryTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15SUNRPC: Remove the BKL from the callback functionsTrond Myklebust
Push it into those callback functions that actually need it. Note that all the NFS operations use their own locking, so don't need the BKL. Ditto for the rpcbind client. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL from the readdir codeTrond Myklebust
Page accesses are serialised using the page locks, whereas all attribute updates are serialised using the inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL from the symlink codeTrond Myklebust
Page cache accesses are serialised using page locks, whereas attribute updates are serialised using inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL from the sillydelete operationsTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove the BKL from the rename, rmdir and unlink operationsTrond Myklebust
Attribute updates are safe, and dentry operations are protected using VFS level locks. Defer removing the BKL from sillyrename until a separate patch. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL from NFS lookup codeTrond Myklebust
All dentry-related operations are already BKL-safe, since they are protected by the VFS locking. No extra locks should be needed in the NFS code. In the case of nfs_revalidate_inode(), we're only doing an attribute update (protected by the inode->i_lock). In the case of nfs_lookup(), we're instantiating a new dentry, so there should be no contention possible until after we call d_materialise_unique. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove the BKL from nfs_link()Trond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove the BKL from the inode creation operationsTrond Myklebust
nfs_instantiate() does not require the BKL, neither do the attribute updates or the RPC code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL usage from open()Trond Myklebust
All the NFSv4 stateful operations are already protected by other locks (in particular by the rpc_sequence locks. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL usage from the write pathTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove the BKL from the permission checking codeTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove attribute update related BKL referencesTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Remove BKL requirement from attribute updatesTrond Myklebust
The main problem is dealing with inode->i_size: we need to set the inode->i_lock on all attribute updates, and so vmtruncate won't cut it. Make an NFS-private version of vmtruncate that has the necessary locking semantics. The result should be that the following inode attribute updates are protected by inode->i_lock nfsi->cache_validity nfsi->read_cache_jiffies nfsi->attrtimeo nfsi->attrtimeo_timestamp nfsi->change_attr nfsi->last_updated nfsi->cache_change_attribute nfsi->access_cache nfsi->access_cache_entry_lru nfsi->access_cache_inode_lru nfsi->acl_access nfsi->acl_default nfsi->nfs_page_tree nfsi->ncommit nfsi->npages nfsi->open_files nfsi->silly_list nfsi->acl nfsi->open_states inode->i_size inode->i_atime inode->i_mtime inode->i_ctime inode->i_nlink inode->i_uid inode->i_gid The following is protected by dir->i_mutex nfsi->cookieverf Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15NFS: Protect inode->i_nlink updates using inode->i_lockTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15nfs: set correct fl_len in nlmclnt_test()Felix Blyakher
fcntl(F_GETLK) on an nfs client incorrectly returns the values for the conflicting lock. fl_len value is always 1. If the conflicting lock is (0, 4095) the F_GETLK request for (1024, 10) returns (0, 1), which doesn't even cover the requested range, and is quite confusing. The fix is trivial, set fl_end from the fl_end value recieved from the nfs server. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15Merge branch 'generic-ipi' into generic-ipi-for-linusIngo Molnar
Conflicts: arch/powerpc/Kconfig arch/s390/kernel/time.c arch/x86/kernel/apic_32.c arch/x86/kernel/cpu/perfctr-watchdog.c arch/x86/kernel/i8259_64.c arch/x86/kernel/ldt.c arch/x86/kernel/nmi_64.c arch/x86/kernel/smpboot.c arch/x86/xen/smp.c include/asm-x86/hw_irq_32.h include/asm-x86/hw_irq_64.h include/asm-x86/mach-default/irq_vectors.h include/asm-x86/mach-voyager/irq_vectors.h include/asm-x86/smp.h kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: remove DIRENTSIZ JFS: diAlloc() should return -EIO rather than EIO JFS: skip bad iput() call in error path JFS: switch to seq_files JFS: 0 is not valid errno value so return NULL from jfs_lookup
2008-07-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmwLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: [GFS2] Fix GFS2's use of do_div() in its quota calculations [GFS2] Remove unused declaration [GFS2] Remove support for unused and pointless flag [GFS2] Replace rgrp "recent list" with mru list [GFS2] Allow local DF locks when holding a cached EX glock [GFS2] Fix delayed demote race [GFS2] don't call permission() [GFS2] Fix module building [GFS2] Glock documentation [GFS2] Remove all_list from lock_dlm [GFS2] Remove obsolete conversion deadlock avoidance code [GFS2] Remove remote lock dropping code [GFS2] kernel panic mounting volume [GFS2] Revise readpage locking [GFS2] Fix ordering of args for list_add [GFS2] trivial sparse lock annotations [GFS2] No lock_nolock [GFS2] Fix ordering bug in lock_dlm [GFS2] Clean up the glock core
2008-07-15Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits) ext4: Documention update for new ordered mode and delayed allocation ext4: do not set extents feature from the kernel ext4: Don't allow nonextenst mount option for large filesystem ext4: Enable delalloc by default. ext4: delayed allocation i_blocks fix for stat ext4: fix delalloc i_disksize early update issue ext4: Handle page without buffers in ext4_*_writepage() ext4: Add ordered mode support for delalloc ext4: Invert lock ordering of page_lock and transaction start in delalloc mm: Add range_cont mode for writeback ext4: delayed allocation ENOSPC handling percpu_counter: new function percpu_counter_sum_and_set ext4: Add delayed allocation support in data=writeback mode vfs: add hooks for ext4's delayed allocation support jbd2: Remove data=ordered mode support using jbd buffer heads ext4: Use new framework for data=ordered mode in JBD2 jbd2: Implement data=ordered mode handling via inodes vfs: export filemap_fdatawrite_range() ext4: Fix lock inversion in ext4_ext_truncate() ext4: Invert the locking order of page_lock and transaction start ...
2008-07-15UBIFS: include to compilationArtem Bityutskiy
Add UBIFS to Makefile and Kbuild. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-07-15UBIFS: add new flash file systemArtem Bityutskiy
This is a new flash file system. See http://www.linux-mtd.infradead.org/doc/ubifs.html Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-07-14Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6Linus Torvalds
* 'bkl-removal' of git://git.lwn.net/linux-2.6: (146 commits) IB/umad: BKL is not needed for ib_umad_open() IB/uverbs: BKL is not needed for ib_uverbs_open() bf561-coreb: BKL unneeded for open() Call fasync() functions without the BKL snd/PCM: fasync BKL pushdown ipmi: fasync BKL pushdown ecryptfs: fasync BKL pushdown Bluetooth VHCI: fasync BKL pushdown tty_io: fasync BKL pushdown tun: fasync BKL pushdown i2o: fasync BKL pushdown mpt: fasync BKL pushdown Remove BKL from remote_llseek v2 Make FAT users happier by not deadlocking x86-mce: BKL pushdown vmwatchdog: BKL pushdown vmcp: BKL pushdown via-pmu: BKL pushdown uml-random: BKL pushdown uml-mmapper: BKL pushdown ...
2008-07-14Merge commit 'v2.6.26' into bkl-removalJonathan Corbet
2008-07-14Merge branch 'x86/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits) x86: make 64bit hpet_set_mapping to use ioremap too, v2 x86: get x86_phys_bits early x86: max_low_pfn_mapped fix #4 x86: change _node_to_cpumask_ptr to return const ptr x86: I/O APIC: remove an IRQ2-mask hack x86: fix numaq_tsc_disable calling x86, e820: remove end_user_pfn x86: max_low_pfn_mapped fix, #3 x86: max_low_pfn_mapped fix, #2 x86: max_low_pfn_mapped fix, #1 x86_64: fix delayed signals x86: remove conflicting nx6325 and nx6125 quirks x86: Recover timer_ack lost in the merge of the NMI watchdog x86: I/O APIC: Never configure IRQ2 x86: L-APIC: Always fully configure IRQ0 x86: L-APIC: Set IRQ0 as edge-triggered x86: merge dwarf2 headers x86: use AS_CFI instead of UNWIND_INFO x86: use ignore macro instead of hash comment x86: use matching CFI_ENDPROC ...
2008-07-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits) security: remove register_security hook security: remove dummy module fix security: remove dummy module security: remove unused sb_get_mnt_opts hook LSM/SELinux: show LSM mount options in /proc/mounts SELinux: allow fstype unknown to policy to use xattrs if present security: fix return of void-valued expressions SELinux: use do_each_thread as a proper do/while block SELinux: remove unused and shadowed addrlen variable SELinux: more user friendly unknown handling printk selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine) SELinux: drop load_mutex in security_load_policy SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av SELinux: open code sidtab lock SELinux: open code load_mutex SELinux: open code policy_rwlock selinux: fix endianness bug in network node address handling selinux: simplify ioctl checking SELinux: enable processes with mac_admin to get the raw inode contexts Security: split proc ptrace checking into read vs. attach ...
2008-07-14Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits) splice: fix generic_file_splice_read() race with page invalidation ramfs: enable splice write drivers/block/pktcdvd.c: avoid useless memset cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack) scsi: sr avoids useless buffer allocation block: blk_rq_map_kern uses the bounce buffers for stack buffers block: add blk_queue_update_dma_pad DAC960: push down BKL pktcdvd: push BKL down into driver paride: push ioctl down into driver block: use get_unaligned_* helpers block: extend queue_flag bitops block: request_module(): use format string Add bvec_merge_data to handle stacked devices and ->merge_bvec() block: integrity flags can't use bit ops on unsigned short cmdfilter: extend default read filter sg: fix odd style (extra parenthesis) introduced by cmd filter patch block: add bounce support to blk_rq_map_user_iov cfq-iosched: get rid of enable_idle being unused warning allow userspace to modify scsi command filter on per device basis ...
2008-07-14VFS: export sync_sb_inodesArtem Bityutskiy
This patch exports the 'sync_sb_inodes()' which is needed for UBIFS because it has to force write-back from time to time. Namely, the UBIFS budgeting subsystem forces write-back when its pessimistic callculations show that there is no free space on the media. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14VFS: move inode_lock into sync_sb_inodesHans Reiser
This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather than expect that the caller will do this. This change was previously done by Hans Reiser <reiser@namesys.com> and sat in the -mm tree. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14Merge commit 'v2.6.26' into x86/coreIngo Molnar
2008-07-14LSM/SELinux: show LSM mount options in /proc/mountsEric Paris
This patch causes SELinux mount options to show up in /proc/mounts. As with other code in the area seq_put errors are ignored. Other LSM's will not have their mount options displayed until they fill in their own security_sb_show_options() function. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: James Morris <jmorris@namei.org>
2008-07-14Security: split proc ptrace checking into read vs. attachStephen Smalley
Enable security modules to distinguish reading of process state via proc from full ptrace access by renaming ptrace_may_attach to ptrace_may_access and adding a mode argument indicating whether only read access or full attach access is requested. This allows security modules to permit access to reading process state without granting full ptrace access. The base DAC/capability checking remains unchanged. Read access to /proc/pid/mem continues to apply a full ptrace attach check since check_mem_permission() already requires the current task to already be ptracing the target. The other ptrace checks within proc for elements like environ, maps, and fds are changed to pass the read mode instead of attach. In the SELinux case, we model such reading of process state as a reading of a proc file labeled with the target process' label. This enables SELinux policy to permit such reading of process state without permitting control or manipulation of the target process, as there are a number of cases where programs probe for such information via proc but do not need to be able to control the target (e.g. procps, lsof, PolicyKit, ConsoleKit). At present we have to choose between allowing full ptrace in policy (more permissive than required/desired) or breaking functionality (or in some cases just silencing the denials via dontaudit rules but this can hide genuine attacks). This version of the patch incorporates comments from Casey Schaufler (change/replace existing ptrace_may_attach interface, pass access mode), and Chris Wright (provide greater consistency in the checking). Note that like their predecessors __ptrace_may_attach and ptrace_may_attach, the __ptrace_may_access and ptrace_may_access interfaces use different return value conventions from each other (0 or -errno vs. 1 or 0). I retained this difference to avoid any changes to the caller logic but made the difference clearer by changing the latter interface to return a bool rather than an int and by adding a comment about it to ptrace.h for any future callers. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-07-12cifs: fix wksidarr declaration to be big-endian friendlyJeff Layton
The current definition of wksidarr works fine on little endian arches (since cpu_to_le32 is a no-op there), but on big-endian arches, it fails to compile with this error: error: braced-group within expression allowed only inside a function The problem is that this static declaration has cpu_to_le32 embedded within it, and that expands into a function macro. We need to use __constant_cpu_to_le32() instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Steven French <sfrench@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12cifs: fix inode leak in cifs_get_inode_info_unixJeff Layton
Try this: mount a share with unix extensions create a file on it umount the share You'll get the following message in the ring buffer: VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds. Have a nice day... ...the problem is that cifs_get_inode_info_unix is creating and hashing a new inode even when it's going to return error anyway. The first lookup when creating a file returns an error so we end up leaking this inode before we do the actual create. This appears to be a regression caused by commit 0e4bbde94fdc33f5b3d793166b21bf768ca3e098. The following patch seems to fix it for me, and fixes a minor formatting nit as well. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12Merge branch 'linus' into x86/coreIngo Molnar
Conflicts: arch/x86/mm/ioremap.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11ext4: do not set extents feature from the kernelEric Sandeen
We've talked for a while about getting rid of any feature- setting from the kernel; this gets rid of the code which would set the INCOMPAT_EXTENTS flag on the first file write when mounted as ext4[dev]. With this patch, if the extents feature is not already set on disk, then mounting as ext4 will fall back to noextents with a warning, and if -o extents is explicitly requested, the mount will fail, also with warning. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Don't allow nonextenst mount option for large filesystemAneesh Kumar K.V
The block mapped inode format can address only blocks within 2**32. This causes a number of issues, the biggest of which is that the block allocator needs to be taught that certain inodes can not utilize block numbers > 2**32. So until this is fixed, it is simplest to fail mounting of file systems with more than 2**32 blocks if the -o noextents option is given. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Enable delalloc by default.Aneesh Kumar K.V
Enable delalloc by default to ensure it gets sufficient testing and because it makes the filesystem much more efficient. Add a nodealalloc option to disable delayed allocation, and update ext4_show_options to show delayed allocation off if it is disabled. If the data=journal mount option is used, disable delayed allocation since the delalloc code doesn't support data=journal yet. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11ext4: delayed allocation i_blocks fix for statMingming Cao
Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: fix delalloc i_disksize early update issueMingming Cao
Ext4_da_write_end() used walk_page_buffers() with a callback function of ext4_bh_unmapped_or_delay() to check if it extended the file size without allocating any blocks (since in this case i_disksize needs to be updated). However, this is didn't work proprely because the buffer head has not been marked dirty yet --- this is done later in block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to always return false. In addition, walk_page_buffers() checks all of the buffer heads covering the page, and the only buffer_head that should be checked is the one covering the end of the write. Otherwise, given a 1k blocksize filesystem and a 4k page size, the buffer head covering the first 1k stripe of the file could be unmapped (because it was a sparse file), and the second or third buffer_head covering that page could be mapped, and using walk_page_buffers() would fail in this case since it would stop at the first unmapped buffer_head and return true. The core problem is that walk_page_buffers() was intended to do work in a callback function, and a non-zero return value indicated a failure, which termined the walk of the buffer heads covering the page. It was not intended to be used with a boolean function, such as ext4_bh_unmapped_or_delay(). Add addtional fix from Aneesh to protect i_disksize update rave with truncate. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Handle page without buffers in ext4_*_writepage()Aneesh Kumar K.V
It can happen that buffers are removed from the page before it gets marked dirty and then is passed to writepage(). In writepage() we just initialize the buffers and check whether they are mapped and non delay. If they are mapped and non delay we write the page. Otherwise we mark them dirty. With this change we don't do block allocation at all in ext4_*_write_page. writepage() can get called under many condition and with a locking order of journal_start -> lock_page, we should not try to allocate blocks in writepage() which get called after taking page lock. writepage() can get called via shrink_page_list even with a journal handle which was created for doing inode update. For example when doing ext4_da_write_begin we create a journal handle with credit 1 expecting a i_disksize update for the inode. But ext4_da_write_begin can cause shrink_page_list via _grab_page_cache. So having a valid handle via ext4_journal_current_handle is not a guarantee that we can use the handle for block allocation in writepage, since we shouldn't be using credits that had been reserved for other updates. That it could result in we running out of credits when we update inodes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Add ordered mode support for delallocAneesh Kumar K.V
This provides a new ordered mode implementation which gets rid of using buffer heads to enforce the ordering between metadata change with the related data chage. Instead, in the new ordering mode, it keeps track of all of the inodes touched by each transaction on a list, and when that transaction is committed, it flushes all of the dirty pages for those inodes. In addition, the new ordered mode reverses the lock ordering of the page lock and transaction lock, which provides easier support for delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Invert lock ordering of page_lock and transaction start in delallocMingming Cao
With the reverse locking, we need to start a transation before taking the page lock, so in ext4_da_writepages() we need to break the write-out into chunks, and restart the journal for each chunck to ensure the write-out fits in a single transaction. Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> which fixes delalloc sync hang with journal lock inversion, and address the performance regression issue. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-14ext4: delayed allocation ENOSPC handlingMingming Cao
This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11percpu_counter: new function percpu_counter_sum_and_setMingming Cao
Delayed allocation need to check free blocks at every write time. percpu_counter_read_positive() is not quit accurate. delayed allocation need a more accurate accounting, but using percpu_counter_sum_positive() is frequently is quite expensive. This patch added a new function to update center counter when sum per-cpu counter, to increase the accurate rate for next percpu_counter_read() and require less calling expensive percpu_counter_sum(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>