aboutsummaryrefslogtreecommitdiff
path: root/fs/xfs/linux-2.6
AgeCommit message (Collapse)Author
2007-07-19Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6Linus Torvalds
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: [XFS] Fix inode size update before data write in xfs_setattr [XFS] Allow punching holes to free space when at ENOSPC [XFS] Implement ->page_mkwrite in XFS. [FS] Implement block_page_mkwrite. Manually fix up conflict with Nick's VM fault handling patches in fs/xfs/linux-2.6/xfs_file.c Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: fault feedback #1Nick Piggin
Change ->fault prototype. We now return an int, which contains VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte. FAULT_RET_ code tells the VM whether a page was found, whether it has been locked, and potentially other things. This is not quite the way he wanted it yet, but that's changed in the next patch (which requires changes to arch code). This means we no longer set VM_CAN_INVALIDATE in the vma in order to say that a page is locked which requires filemap_nopage to go away (because we can no longer remain backward compatible without that flag), but we were going to do that anyway. struct fault_data is renamed to struct vm_fault as Linus asked. address is now a void __user * that we should firmly encourage drivers not to use without really good reason. The page is now returned via a page pointer in the vm_fault struct. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: merge populate and nopage into fault (fixes nonlinear)Nick Piggin
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. ->populate is a layering violation because the filesystem/pagecache code should need to know anything about the virtual memory mapping. The hitch here is that the ->nopage handler didn't pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway (because that's a similar layering violation). Having the populate handler install the pte itself is likewise a nasty thing to be doing. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. [akpm@linux-foundation.org: cleanup] [randy.dunlap@oracle.com: doc. fixes for readahead] [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: fix fault vs invalidate race for linear mappingsNick Piggin
Fix the race between invalidate_inode_pages and do_no_page. Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19[XFS] Implement ->page_mkwrite in XFS.David Chinner
Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages being written to. This allows use to do correct delayed allocation and ENOSPC checking as well as remap unwritten extents so that they get converted correctly during writeback. This is done via the generic block_page_mkwrite code. SGI-PV: 940392 SGI-Modid: xfs-linux-melb:xfs-kern:29149a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-17knfsd: exportfs: add exportfs.h headerChristoph Hellwig
currently the export_operation structure and helpers related to it are in fs.h. fs.h is already far too large and there are very few places needing the export bits, so split them off into a separate header. [akpm@linux-foundation.org: fix cifs build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Freezer: make kernel threads nonfreezable by defaultRafael J. Wysocki
Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17mm: clean up and kernelify shrinker registrationRusty Russell
I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-14[XFS] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} & XFS_IOC_FSINUMBERS in compat modeMichal Marek
* 32bit struct xfs_fsop_bulkreq has different size and layout of members, no matter the alignment. Move the code out of the #else branch (why was it there in the first place?). Define _32 variants of the ioctl constants. * 32bit struct xfs_bstat is different because of time_t and on i386 because of different padding. Make xfs_bulkstat_one() accept a custom "output formatter" in the private_data argument which takes care of the xfs_bulkstat_one_compat() that takes care of the different layout in the compat case. * i386 struct xfs_inogrp has different padding. Add a similar "output formatter" mecanism to xfs_inumbers(). SGI-PV: 967354 SGI-Modid: xfs-linux-melb:xfs-kern:29102a Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Compat ioctl handler for handle operationsMichal Marek
32bit struct xfs_fsop_handlereq has different size and offsets (due to pointers). TODO: case XFS_IOC_{FSSETDM,ATTRLIST,ATTRMULTI}_BY_HANDLE still not handled. SGI-PV: 967354 SGI-Modid: xfs-linux-melb:xfs-kern:29101a Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Compat ioctl handler for XFS_IOC_FSGEOMETRY_V1.Michal Marek
i386 struct xfs_fsop_geom_v1 has no padding after the last member, so the size is different. SGI-PV: 967354 SGI-Modid: xfs-linux-melb:xfs-kern:29100a Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Concurrent Multi-File Data StreamsDavid Chinner
In media spaces, video is often stored in a frame-per-file format. When dealing with uncompressed realtime HD video streams in this format, it is crucial that files do not get fragmented and that multiple files a placed contiguously on disk. When multiple streams are being ingested and played out at the same time, it is critical that the filesystem does not cross the streams and interleave them together as this creates seek and readahead cache miss latency and prevents both ingest and playout from meeting frame rate targets. This patch set creates a "stream of files" concept into the allocator to place all the data from a single stream contiguously on disk so that RAID array readahead can be used effectively. Each additional stream gets placed in different allocation groups within the filesystem, thereby ensuring that we don't cross any streams. When an AG fills up, we select a new AG for the stream that is not in use. The core of the functionality is the stream tracking - each inode that we create in a directory needs to be associated with the directories' stream. Hence every time we create a file, we look up the directories' stream object and associate the new file with that object. Once we have a stream object for a file, we use the AG that the stream object point to for allocations. If we can't allocate in that AG (e.g. it is full) we move the entire stream to another AG. Other inodes in the same stream are moved to the new AG on their next allocation (i.e. lazy update). Stream objects are kept in a cache and hold a reference on the inode. Hence the inode cannot be reclaimed while there is an outstanding stream reference. This means that on unlink we need to remove the stream association and we also need to flush all the associations on certain events that want to reclaim all unreferenced inodes (e.g. filesystem freeze). SGI-PV: 964469 SGI-Modid: xfs-linux-melb:xfs-kern:29096a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Barry Naujok <bnaujok@sgi.com> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Vlad Apostolov <vapo@sgi.com>
2007-07-14[XFS] XFS should not be looking at filp reference countsChristoph Hellwig
A check for file_count is always a bad idea. Linux has the ->release method to deal with cleanups on last close and ->flush is only for the very rare case where we want to perform an operation on every drop of a reference to a file struct. This patch gets rid of vop_close and surrounding code in favour of simply doing the page flushing from ->release. SGI-PV: 966562 SGI-Modid: xfs-linux-melb:xfs-kern:28952a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Fix remount,readonly path to flush everything correctly.David Chinner
The remount readonly path can fail to writeback properly because we still have active transactions after calling xfs_quiesce_fs(). Further investigation shows that this path is broken in the same ways that the xfs freeze path was broken so fix it the same way. SGI-PV: 964464 SGI-Modid: xfs-linux-melb:xfs-kern:28869a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Map unwritten extents correctly for I/o completion processingDavid Chinner
If we have multiple unwritten extents within a single page, we fail to tell the I/o completion construction handlers we need a new handle for the second and subsequent blocks in the page. While we still issue the I/O correctly, we do not have the correct ranges recorded in the ioend structures and hence when we go to convert the unwritten extents we screw it up. Make sure we start a new ioend every time the mapping changes so that we convert the correct ranges on I/O completion. SGI-PV: 964647 SGI-Modid: xfs-linux-melb:xfs-kern:28797a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Handle null returned from xfs_vtoi() in xfs_setfilesize().David Chinner
SGI-PV: 965636 SGI-Modid: xfs-linux-melb:xfs-kern:28777a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Block on unwritten extent conversion during synchronous direct I/O.David Chinner
Currently we do not wait on extent conversion to occur, and hence we can return to userspace from a synchronous direct I/O write without having completed all the actions in the write. Hence a read after the write may see zeroes (unwritten extent) rather than the data that was written. Block the I/O completion by triggering a synchronous workqueue flush to ensure that the conversion has occurred before we return to userspace. SGI-PV: 964092 SGI-Modid: xfs-linux-melb:xfs-kern:28775a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Flush the block device before closing it on unmount.David Chinner
SGI-PV: 965630 SGI-Modid: xfs-linux-melb:xfs-kern:28774a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Lazy Superblock CountersDavid Chinner
When we have a couple of hundred transactions on the fly at once, they all typically modify the on disk superblock in some way. create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify free block counts. When these counts are modified in a transaction, they must eventually lock the superblock buffer and apply the mods. The buffer then remains locked until the transaction is committed into the incore log buffer. The result of this is that with enough transactions on the fly the incore superblock buffer becomes a bottleneck. The result of contention on the incore superblock buffer is that transaction rates fall - the more pressure that is put on the superblock buffer, the slower things go. The key to removing the contention is to not require the superblock fields in question to be locked. We do that by not marking the superblock dirty in the transaction. IOWs, we modify the incore superblock but do not modify the cached superblock buffer. In short, we do not log superblock modifications to critical fields in the superblock on every transaction. In fact we only do it just before we write the superblock to disk every sync period or just before unmount. This creates an interesting problem - if we don't log or write out the fields in every transaction, then how do the values get recovered after a crash? the answer is simple - we keep enough duplicate, logged information in other structures that we can reconstruct the correct count after log recovery has been performed. It is the AGF and AGI structures that contain the duplicate information; after recovery, we walk every AGI and AGF and sum their individual counters to get the correct value, and we do a transaction into the log to correct them. An optimisation of this is that if we have a clean unmount record, we know the value in the superblock is correct, so we can avoid the summation walk under normal conditions and so mount/recovery times do not change under normal operation. One wrinkle that was discovered during development was that the blocks used in the freespace btrees are never accounted for in the AGF counters. This was once a valid optimisation to make; when the filesystem is full, the free space btrees are empty and consume no space. Hence when it matters, the "accounting" is correct. But that means the when we do the AGF summations, we would not have a correct count and xfs_check would complain. Hence a new counter was added to track the number of blocks used by the free space btrees. This is an *on-disk format change*. As a result of this, lazy superblock counters are a mkfs option and at the moment on linux there is no way to convert an old filesystem. This is possible - xfs_db can be used to twiddle the right bits and then xfs_repair will do the format conversion for you. Similarly, you can convert backwards as well. At some point we'll add functionality to xfs_admin to do the bit twiddling easily.... SGI-PV: 964999 SGI-Modid: xfs-linux-melb:xfs-kern:28652a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Use generic shrinker interfaces in XFS.Andrew Morton
SGI-PV: 964986 SGI-Modid: xfs-linux-melb:xfs-kern:28642a Signed-Off-By: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Fix double free in xfs_buf_get_noaddr error handling pathChristoph Hellwig
SGI-PV: 964983 SGI-Modid: xfs-linux-melb:xfs-kern:28639a Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14[XFS] Only use refcounted pages for I/OChristoph Hellwig
Many block drivers (aoe, iscsi) really want refcountable pages in bios, which is what almost everyone send down. XFS unfortunately has a few places where it sends down buffers that may come from kmalloc, which breaks them. Fix the places that use kmalloc()d buffers. SGI-PV: 964546 SGI-Modid: xfs-linux-melb:xfs-kern:28562a Signed-Off-By: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-10sendfile: remove .sendfile from filesystems that use generic_file_sendfile()Jens Axboe
They can use generic_file_splice_read() instead. Since sys_sendfile() now prefers that, there should be no change in behaviour. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-06-19[XFS] s/memclear_highpage_flush/zero_user_page/Christoph Hellwig
SGI-PV: 957103 SGI-Modid: xfs-linux-melb:xfs-kern:28678a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-29[XFS] Write at EOF may not update filesize correctly.David Chinner
The recent fix for preventing NULL files from being left around does not update the file size corectly in all cases. The missing case is a write extending the file that does not need to allocate a block. In that case we used a read mapping of the extent which forced the use of the read I/O completion handler instead of the write I/O completion handle. Hence the file size was not updated on I/O completion. SGI-PV: 965068 SGI-Modid: xfs-linux-melb:xfs-kern:28657a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Nathan Scott <nscott@aconex.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-17Remove SLAB_CTOR_CONSTRUCTORChristoph Lameter
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Howells <dhowells@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@ucw.cz> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Merge git://oss.sgi.com:8090/xfs/xfs-2.6Linus Torvalds
* git://oss.sgi.com:8090/xfs/xfs-2.6: [XFS] Add lockdep support for XFS [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. [XFS] Get rid of redundant "required" in msg. [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. [XFS] Remove unused ilen variable and references. [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. [XFS] Fix race condition in xfs_write(). [XFS] Fix uquota and oquota enforcement problems. [XFS] propogate return codes from flush routines [XFS] Fix quotaon syscall failures for group enforcement requests. [XFS] Invalidate quotacheck when mounting without a quota type. [XFS] reducing the number of random number functions. [XFS] remove more misc. unused args [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. [XFS] The last argument "lsn" of xfs_trans_commit() is always called with
2007-05-08mm: move common segment checks to separate helper functionDmitriy Monakhov
[akpm@linux-foundation.org: cleanup] Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Christoph Hellwig <hch@lst.de> Acked-by: Anton Altaparmakov <aia21@cam.ac.uk> Acked-by: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08[XFS] Add lockdep support for XFSLachlan McIlroy
SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08[XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.Lachlan McIlroy
In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08[XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.Tim Shimmin
SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2007-05-08[XFS] Fix to prevent the notorious 'NULL files' problem after a crash.Lachlan McIlroy
The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08[XFS] Fix race condition in xfs_write().Lachlan McIlroy
This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08[XFS] propogate return codes from flush routinesLachlan McIlroy
This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Stewart Smith <stewart@flamingspork.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-07slab allocators: Remove SLAB_DEBUG_INITIAL flagChristoph Lameter
I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by SLAB. I think its purpose was to have a callback after an object has been freed to verify that the state is the constructor state again? The callback is performed before each freeing of an object. I would think that it is much easier to check the object state manually before the free. That also places the check near the code object manipulation of the object. Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was compiled with SLAB debugging on. If there would be code in a constructor handling SLAB_DEBUG_INITIAL then it would have to be conditional on SLAB_DEBUG otherwise it would just be dead code. But there is no such code in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real use of, difficult to understand and there are easier ways to accomplish the same effect (i.e. add debug code before kfree). There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be clear in fs inode caches. Remove the pointless checks (they would even be pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors. This is the last slab flag that SLUB did not support. Remove the check for unimplemented flags from SLUB. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22[PATCH] Make XFS workqueues nonfreezableRafael J. Wysocki
Since freezable workqueues are broken in 2.6.21-rc (cf. http://marc.theaimsgroup.com/?l=linux-kernel&m=116855740612755, http://marc.theaimsgroup.com/?l=linux-kernel&m=117261312523921&w=2) it's better to change the only user of them, which is XFS, to use "normal" nonfreezable workqueues. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Cc: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20[PATCH] xfs warning fixAndrew Morton
fs/xfs/linux-2.6/xfs_super.c:903: warning: 'noinline' attribute ignored Cc: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: remove insert_at_head from register_sysctlEric W. Biederman
The semantic effect of insert_at_head is that it would allow new registered sysctl entries to override existing sysctl entries of the same name. Which is pain for caching and the proc interface never implemented. I have done an audit and discovered that none of the current users of register_sysctl care as (excpet for directories) they do not register duplicate sysctl entries. So this patch simply removes the support for overriding existing entries in the sys_sysctl interface since no one uses it or cares and it makes future enhancments harder. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Corey Minyard <minyard@acm.org> Cc: Neil Brown <neilb@suse.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jan Kara <jack@ucw.cz> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] remove many unneeded #includes of sched.hTim Schmielau
After Al Viro (finally) succeeded in removing the sched.h #include in module.h recently, it makes sense again to remove other superfluous sched.h includes. There are quite a lot of files which include it but don't actually need anything defined in there. Presumably these includes were once needed for macros that used to live in sched.h, but moved to other header files in the course of cleaning it up. To ease the pain, this time I did not fiddle with any header files and only removed #includes from .c-files, which tend to cause less trouble. Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha, arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig, allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all configs in arch/arm/configs on arm. I also checked that no new warnings were introduced by the patch (actually, some warnings are removed that were emitted by unnecessarily included header files). Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] mark struct inode_operations const 3Arjan van de Ven
Many struct inode_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] Make XFS use BH_Unwritten and BH_Delay correctlyDavid Chinner
Don't hide buffer_unwritten behind buffer_delay() and remove the hack that clears unexpected buffer_unwritten() states now that it can't happen. Signed-off-by: Dave Chinner <dgc@sgi.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Timothy Shimmin <tes@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] Make BH_Unwritten a first class bufferhead flag V2David Chinner
Currently, XFS uses BH_PrivateStart for flagging unwritten extent state in a bufferhead. Recently, I found the long standing mmap/unwritten extent conversion bug, and it was to do with partial page invalidation not clearing the unwritten flag from bufferheads attached to the page but beyond EOF. See here for a full explaination: http://oss.sgi.com/archives/xfs/2006-12/msg00196.html The solution I have checked into the XFS dev tree involves duplicating code from block_invalidatepage to clear the unwritten flag from the bufferhead(s), and then calling block_invalidatepage() to do the rest. Christoph suggested that this would be better solved by pushing the unwritten flag into the common buffer head flags and just adding the call to discard_buffer(): http://oss.sgi.com/archives/xfs/2006-12/msg00239.html The following patch makes BH_Unwritten a first class citizen. Signed-off-by: Dave Chinner <dgc@sgi.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-10[XFS] Don't use kmap in xfs_iozero.David Chinner
kmap() is inefficient and does not scale well. kmap_atomic() is a better choice. Use the generic wrapper function instead of open coding the kmap-memset-dcache flush-kunmap stuff. SGI-PV: 960904 SGI-Modid: xfs-linux-melb:xfs-kern:28041a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Remove unused header files for MAC and CAP checking functionality.Eric Sandeen
xfs_mac.h and xfs_cap.h provide definitions and macros that aren't used anywhere in XFS at all. They are left-overs from "to be implement at some point in the future" functionality that Irix XFS has. If this functionality ever goes into Linux, it will be provided at a different layer, most likely through the security hooks in the kernel so we will never need this functionality in XFS. Patch provided by Eric Sandeen (sandeen@sandeen.net). SGI-PV: 960895 SGI-Modid: xfs-linux-melb:xfs-kern:28036a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Make freeze code a little cleaner.David Chinner
Fixes a few small issues (mostly cosmetic) that were picked up during the review cycle for the last set of freeze path changes. SGI-PV: 959267 SGI-Modid: xfs-linux-melb:xfs-kern:28035a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Clean up use of VFS attr flagsEric Sandeen
Use the the generic VFS attr flags where appropriate instead of open coding them to the same values. Patch provided by Eric Sandeen. SGI-PV: 960868 SGI-Modid: xfs-linux-melb:xfs-kern:28033a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Remove useless memory barrierRalf Baechle
wake_up's implementation does an implicit memory barrier so the explicit memory barrier is not needed in vfs_sync_worker. Patch provided by Ralf Baechle. SGI-PV: 960867 SGI-Modid: xfs-linux-melb:xfs-kern:28032a Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] XFS sysctl cleanupsEric W. Biederman
Removes unneeded sysctl insert at head behaviour. Cleans up sysctl definitions to use C99 initialisers. Patch provided by Eric W. Biederman. SGI-PV: 960192 SGI-Modid: xfs-linux-melb:xfs-kern:28031a Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Fix callers of xfs_iozero() to zero the correct range.Lachlan McIlroy
The problem is the two callers of xfs_iozero() are rounding out the range to be zeroed to the end of a fsb and in some cases this extends past the new eof. The call to commit_write() in xfs_iozero() will cause the Linux inode's file size to be set too high. SGI-PV: 960788 SGI-Modid: xfs-linux-melb:xfs-kern:28013a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10[XFS] Ensure a frozen filesystem has a clean log before writing the dummyDavid Chinner
record. The current Linux XFS freeze code is a mess. We flush the metadata buffers out while we are still allowing new transactions to start and then fail to flush the dirty buffers back out before writing the unmount and dummy records to the log. This leads to problems when the frozen filesystem is used for snapshots - we do log recovery on a readonly image and often it appears that the log image in the snapshot is not correct. Hence we end up with hangs, oops and mount failures when trying to mount a snapshot image that has been created when the filesystem has not been correctly frozen. To fix this, we need to move th metadata flush to after we wait for all current transactions to complete in teh second stage of the freeze. This means that when we write the final log records, the log should be clean and recovery should never occur on a snapshot image created from a frozen filesystem. SGI-PV: 959267 SGI-Modid: xfs-linux-melb:xfs-kern:28010a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>