aboutsummaryrefslogtreecommitdiff
path: root/Documentation/filesystems
AgeCommit message (Collapse)Author
2009-08-18mm: revert "oom: move oom_adj value"KOSAKI Motohiro
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve("foo-bar-cmd"); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-28driver core: documentation: make it clear that sysfs is optionalLucian Adrian Grijincu
The original text suggested that sysfs is mandatory and always compiled in the kernel. Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-24update Documentation/filesystems/LockingChristoph Hellwig
The rules for locking in many superblock operations has changed significantly, so update the documentation for it. Also correct some older updates and ommissions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-22Merge branch 'for-2.6.31' of git://fieldses.org/git/linux-nfsdLinus Torvalds
* 'for-2.6.31' of git://fieldses.org/git/linux-nfsd: (60 commits) SUNRPC: Fix the TCP server's send buffer accounting nfsd41: Backchannel: minorversion support for the back channel nfsd41: Backchannel: cleanup nfs4.0 callback encode routines nfsd41: Remove ip address collision detection case nfsd: optimise the starting of zero threads when none are running. nfsd: don't take nfsd_mutex twice when setting number of threads. nfsd41: sanity check client drc maxreqs nfsd41: move channel attributes from nfsd4_session to a nfsd4_channel_attr struct NFS: kill off complicated macro 'PROC' sunrpc: potential memory leak in function rdma_read_xdr nfsd: minor nfsd_vfs_write cleanup nfsd: Pull write-gathering code out of nfsd_vfs_write nfsd: track last inode only in use_wgather case sunrpc: align cache_clean work's timer nfsd: Use write gathering only with NFSv2 NFSv4: kill off complicated macro 'PROC' NFSv4: do exact check about attribute specified knfsd: remove unreported filehandle stats counters knfsd: fix reply cache memory corruption knfsd: reply cache cleanups ...
2009-06-18Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: clean up jbd2_journal_try_to_free_buffers() ext4: Don't update ctime for non-extent-mapped inodes ext4: Fix up whitespace issues in fs/ext4/inode.c ext4: Fix 64-bit block type problem on 32-bit platforms ext4: teach the inode allocator to use a goal inode number ext4: Use a hash of the topdir directory name for the Orlov parent group ext4: document the "abort" mount option ext4: move the abort flag from s_mount_opts to s_mount_flags ext4: update the s_last_mounted field in the superblock ext4: change s_mount_opt to be an unsigned int ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl ext4: avoid unnecessary spinlock in critical POSIX ACL path ext3: avoid unnecessary spinlock in critical POSIX ACL path ext4: convert instrumentation from markers to tracepoints jbd2: convert instrumentation from markers to tracepoints
2009-06-18isofs: let mode and dmode mount options override rock ridge mode settingJan Kara
So far, permissions set via 'mode' and/or 'dmode' mount options were effective only if the medium had no rock ridge extensions (or was mounted without them). Add 'overriderockmode' mount option to indicate that these options should override permissions set in rock ridge extensions. Maybe this should be default but the current behavior is there since mount options were created so I think we should not change how they behave. Cc: <Hans-Joachim.Baader@cjt.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18Doc fix: ext2 can only have 32,000 subdirs, not 32,768Michael Shields
ext2.txt says that dirs can have 32,768 subdirs, but the actual value of EXT2_LINK_MAX is 32000. ext3 is the same, but the doc does not mention it. One of ext4's features is to "fix 32000 subdirectory limit". Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18proc.txt: update kernel filesystem/proc.txt documentationStefani Seibold
An update for the "Process-Specific Subdirectories" section to reflect the changes till kernel 2.6.30. Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18proc: export statistics for softirq to /procKeika Kobayashi
Export statistics for softirq in /proc/softirqs and /proc/stat. 1. /proc/softirqs Implement /proc/softirqs which shows the number of softirq for each CPU like /proc/interrupts. 2. /proc/stat Add the "softirq" line to /proc/stat. This line shows the number of softirq for all cpu. The first column is the total of all softirqs and each subsequent column is the total for particular softirq. [kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop] Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp> Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: get rid of BKL in fs/sysv get rid of BKL in fs/minix get rid of BKL in fs/efs befs ->pust_super() doesn't need BKL Cleanup of adfs headers 9P doesn't need BKL in ->umount_begin() fuse doesn't need BKL in ->umount_begin() No instance of ->bmap() needs BKL remove unlock_kernel() left accidentally ext4: avoid unnecessary spinlock in critical POSIX ACL path ext3: avoid unnecessary spinlock in critical POSIX ACL path
2009-06-17No instance of ->bmap() needs BKLAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-16Merge branch 'akpm'Linus Torvalds
* akpm: (182 commits) fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset fbdev: *bfin*: fix __dev{init,exit} markings fbdev: *bfin*: drop unnecessary calls to memset fbdev: bfin-t350mcqb-fb: drop unused local variables fbdev: blackfin has __raw I/O accessors, so use them in fb.h fbdev: s1d13xxxfb: add accelerated bitblt functions tcx: use standard fields for framebuffer physical address and length fbdev: add support for handoff from firmware to hw framebuffers intelfb: fix a bug when changing video timing fbdev: use framebuffer_release() for freeing fb_info structures radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb? s3c-fb: CPUFREQ frequency scaling support s3c-fb: fix resource releasing on error during probing carminefb: fix possible access beyond end of carmine_modedb[] acornfb: remove fb_mmap function mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF mb862xxfb: restrict compliation of platform driver to PPC Samsung SoC Framebuffer driver: add Alpha Channel support atmel-lcdc: fix pixclock upper bound detection offb: use framebuffer_alloc() to allocate fb_info struct ... Manually fix up conflicts due to kmemcheck in mm/slab.c
2009-06-16oom: move oom_adj value from task_struct to mm_structDavid Rientjes
The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: split fat_generic_ioctl FAT: add 'errors' mount option
2009-06-15Merge commit 'v2.6.30' into for-2.6.31J. Bruce Fields
2009-06-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (22 commits) nilfs2: support contiguous lookup of blocks nilfs2: add sync_page method to page caches of meta data nilfs2: use device's backing_dev_info for btree node caches nilfs2: return EBUSY against delete request on snapshot nilfs2: modify list of unsupported features in caveats nilfs2: enable sync_page method nilfs2: set bio unplug flag for the last bio in segment nilfs2: allow future expansion of metadata read out via get info ioctl NILFS2: Pagecache usage optimization on NILFS2 nilfs2: remove nilfs_btree_operations from btree mapping nilfs2: remove nilfs_direct_operations from direct mapping nilfs2: remove bmap pointer operations nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function nilfs2: move get block functions in bmap.c into btree codes nilfs2: remove nilfs_bmap_delete_block nilfs2: remove nilfs_bmap_put_block nilfs2: remove header file for segment list operations nilfs2: eliminate removal list of segments nilfs2: add sufile function that can modify multiple segment usages ...
2009-06-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (31 commits) trivial: remove the trivial patch monkey's name from SubmittingPatches trivial: Fix a typo in comment of addrconf_dad_start() trivial: usb: fix missing space typo in doc trivial: pci hotplug: adding __init/__exit macros to sgi_hotplug trivial: Remove the hyphen from git commands trivial: fix ETIMEOUT -> ETIMEDOUT typos trivial: Kconfig: .ko is normally not included in module names trivial: SubmittingPatches: fix typo trivial: Documentation/dell_rbu.txt: fix typos trivial: Fix Pavel's address in MAINTAINERS trivial: ftrace:fix description of trace directory trivial: unnecessary (void*) cast removal in sound/oss/msnd.c trivial: input/misc: Fix typo in Kconfig trivial: fix grammo in bus_for_each_dev() kerneldoc trivial: rbtree.txt: fix rb_entry() parameters in sample code trivial: spelling fix in ppc code comments trivial: fix typo in bio_alloc kernel doc trivial: Documentation/rbtree.txt: cleanup kerneldoc of rbtree.txt trivial: Miscellaneous documentation typo fixes trivial: fix typo milisecond/millisecond for documentation and source comments. ...
2009-06-13Merge branch 'docs-next' of git://git.lwn.net/linux-2.6Linus Torvalds
* 'docs-next' of git://git.lwn.net/linux-2.6: Document the debugfs API Documentation: Add "how to write a good patch summary" to SubmittingPatches SubmittingPatches: fix typo docs: Encourage better changelogs in the development process document Document Reported-by in SubmittingPatches
2009-06-13ext4: document the "abort" mount optionTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-12trivial: Miscellaneous documentation typo fixesMatt LaPlante
Fix various typos in documentation txts. Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmwLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (25 commits) GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta GFS2: Fix cache coherency between truncate and O_DIRECT read GFS2: Fix locking issue mounting gfs2meta fs GFS2: Remove unused variable GFS2: smbd proccess hangs with flock() call. GFS2: Remove args subdir from gfs2 sysfs files GFS2: Remove lockstruct subdir from gfs2 sysfs files GFS2: Move gfs2_unlink_ok into ops_inode.c GFS2: Move gfs2_readlinki into ops_inode.c GFS2: Move gfs2_rmdiri into ops_inode.c GFS2: Merge mount.c and ops_super.c into super.c GFS2: Clean up some file names GFS2: Be more aggressive in reclaiming unlinked inodes GFS2: Add a rgrp bitmap full flag GFS2: Improve resource group error handling GFS2: Don't warn when delete inode fails on ro filesystem GFS2: Update docs GFS2: Umount recovery race fix GFS2: Remove a couple of unused sysfs entries GFS2: Add commit= mount option ...
2009-06-10nilfs2: modify list of unsupported features in caveatsRyusuke Konishi
This clarifies missing features of nilfs as a regular filesystem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-06Document the debugfs APIJonathan Corbet
This is an updated document covering the internal API for the debugfs filesystem. Thanks to Shen Feng for suggesting that I put this text here and noting that the old LWN version was rather out of date. Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Reported-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-06-04FAT: add 'errors' mount optionDenis Karpov
On severe errors FAT remounts itself in read-only mode. Allow to specify FAT fs desired behavior through 'errors' mount option: panic, continue or remount read-only. `mount -t [fat|vfat] -o errors=[panic,remount-ro,continue] \ <bdev> <mount point>` This is analog to ext2 fs 'errors' mount option. Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-05-21hugh: update email addressHugh Dickins
My old address will shut down in a few days time: remove it from the tree, and add a tmpfs (shmem filesystem) maintainer entry with the new address. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-19GFS2: Update docsSteven Whitehouse
Update a few things which were out of date, and fix a typo. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-02mm: close page_mkwrite racesNick Piggin
Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil <sage@newdream.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-28update Documentation/filesystems/00-INDEX with new nfsd related docs.Benny Halevy
Signed-off-by: Benny Halevy <bhalevy@panasas.com> Cc: James Lentini <jlentini@netapp.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-24CacheFiles: Fix the documentation to use the correct credential pointer namesMarc Dionne
Adjust the CacheFiles documentation to use the correct names of the credential pointers in task_struct. The documentation was using names from the old versions of the credentials patches. Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-20Documentation/filesystems: remove out of date reference to BKL being heldAdrian McMenamin
Documentation/filesystems/vfs.txt incorrectly states that the kernel is locked during the call to statfs (Documentation/filesystems/Locking correctly says it is not). This patch removes the offending sentence. remove reference to BKL being held in statfs Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-17Staging: Pohmelfs: Added IO permissions and priorities.Evgeniy Polyakov
Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-07nilfs2: clean up sketch fileRyusuke Konishi
The sketch file is a file to mark checkpoints with user data. It was experimentally introduced in the original implementation, and now obsolete. The file was handled differently with regular files; the file size got truncated when a checkpoint was created. This stops the special treatment and will treat it as a regular file. Most users are not affected because mkfs.nilfs2 no longer makes this file. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07nilfs2: add documentRyusuke Konishi
This adds a document describing the features, mount options, userland tools, usage, disk format, and related URLs for the nilfs2 file system. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits) nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4 nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc nfsd41: Documentation/filesystems/nfs41-server.txt nfsd41: CREATE_EXCLUSIVE4_1 nfsd41: SUPPATTR_EXCLCREAT attribute nfsd41: support for 3-word long attribute bitmask nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify nfsd41: pass writable attrs mask to nfsd4_decode_fattr nfsd41: provide support for minor version 1 at rpc level nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap nfsd41: access_valid nfsd41: clientid handling nfsd41: check encode size for sessions maxresponse cached nfsd41: stateid handling nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op nfsd41: destroy_session operation nfsd41: non-page DRC for solo sequence responses nfsd41: Add a create session replay cache nfsd41: create_session operation ...
2009-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (714 commits) Staging: sxg: slicoss: Specify the license for Sahara SXG and Slicoss drivers Staging: serqt_usb: fix build due to proc tty changes Staging: serqt_usb: fix checkpatch errors Staging: serqt_usb: add TODO file Staging: serqt_usb: Lindent the code Staging: add USB serial Quatech driver staging: document that the wifi staging drivers a bit better Staging: echo cleanup Staging: BUG to BUG_ON changes Staging: remove some pointless conditionals before kfree_skb() Staging: line6: fix build error, select SND_RAWMIDI Staging: line6: fix checkpatch errors in variax.c Staging: line6: fix checkpatch errors in toneport.c Staging: line6: fix checkpatch errors in pcm.c Staging: line6: fix checkpatch errors in midibuf.c Staging: line6: fix checkpatch errors in midi.c Staging: line6: fix checkpatch errors in dumprequest.c Staging: line6: fix checkpatch errors in driver.c Staging: line6: fix checkpatch errors in audio.c Staging: line6: fix checkpatch errors in pod.c ...
2009-04-03nfsd41: Documentation/filesystems/nfs41-server.txtBenny Halevy
Initial nfs41 server write up describing the status of the linux server implementation. [nfsd41: document unenforced nfs41 compound ordering rules.] [get rid of CONFIG_NFSD_V4_1] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) trivial: Update my email address trivial: NULL noise: drivers/mtd/tests/mtd_*test.c trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h trivial: Fix misspelling of "Celsius". trivial: remove unused variable 'path' in alloc_file() trivial: fix a pdlfush -> pdflush typo in comment trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL trivial: wusb: Storage class should be before const qualifier trivial: drivers/char/bsr.c: Storage class should be before const qualifier trivial: h8300: Storage class should be before const qualifier trivial: fix where cgroup documentation is not correctly referred to trivial: Give the right path in Documentation example trivial: MTD: remove EOL from MODULE_DESCRIPTION trivial: Fix typo in bio_split()'s documentation trivial: PWM: fix of #endif comment trivial: fix typos/grammar errors in Kconfig texts trivial: Fix misspelling of firmware trivial: cgroups: documentation typo and spelling corrections trivial: Update contact info for Jochen Hein trivial: fix typo "resgister" -> "register" ...
2009-04-03Staging: pohmelfs: documentation.Evgeniy Polyakov
This patch includes POHMELFS design and implementation description. Separate file includes mount options, default parameters and usage examples. Signed-off-by: Eveniy Polyakov <zbr@ioremap.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscacheLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (41 commits) NFS: Add mount options to enable local caching on NFS NFS: Display local caching state NFS: Store pages from an NFS inode into a local cache NFS: Read pages from FS-Cache into an NFS inode NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching NFS: Add read context retention for FS-Cache to call back with NFS: FS-Cache page management NFS: Add some new I/O counters for FS-Cache doing things for NFS NFS: Invalidate FsCache page flags when cache removed NFS: Use local disk inode cache NFS: Define and create inode-level cache objects NFS: Define and create superblock-level objects NFS: Define and create server-level objects NFS: Register NFS for caching and retrieve the top-level index NFS: Permit local filesystem caching to be enabled for NFS NFS: Add FS-Cache option bit and debug bit NFS: Add comment banners to some NFS functions FS-Cache: Make kAFS use FS-Cache CacheFiles: A cache that backs onto a mounted filesystem CacheFiles: Export things for CacheFiles ...
2009-04-03Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds
* 'for-linus' of git://git.open-osd.org/linux-open-osd: fs: Add exofs to Kernel build exofs: Documentation exofs: export_operations exofs: super_operations and file_system_type exofs: dir_inode and directory operations exofs: address_space_operations exofs: symlink_inode and fast_symlink_inode operations exofs: file and file_inode operations exofs: Kbuild, Headers and osd utils
2009-04-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: udf: Don't write integrity descriptor too often udf: Try anchor in block 256 first udf: Some type fixes and cleanups udf: use hardware sector size udf: fix novrs mount option udf: Fix oops when invalid character in filename occurs udf: return f_fsid for statfs(2) udf: Add checks to not underflow sector_t udf: fix default mode and dmode options handling udf: fix sparse warnings: udf: unsigned last[i] cannot be less than 0 udf: implement mode and dmode mounting options udf: reduce stack usage of udf_get_filename udf: reduce stack usage of udf_load_pvoldesc Fix the udf code not to pass structs on stack where possible. Remove struct typedefs from fs/udf/ecma_167.h et al.
2009-04-03CacheFiles: A cache that backs onto a mounted filesystemDavid Howells
Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a backing store for the cache. CacheFiles uses a userspace daemon to do some of the cache management - such as reaping stale nodes and culling. This is called cachefilesd and lives in /sbin. The source for the daemon can be downloaded from: http://people.redhat.com/~dhowells/cachefs/cachefilesd.c And an example configuration from: http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf The filesystem and data integrity of the cache are only as good as those of the filesystem providing the backing services. Note that CacheFiles does not attempt to journal anything since the journalling interfaces of the various filesystems are very specific in nature. CacheFiles creates a misc character device - "/dev/cachefiles" - that is used to communication with the daemon. Only one thing may have this open at once, and whilst it is open, a cache is at least partially in existence. The daemon opens this and sends commands down it to control the cache. CacheFiles is currently limited to a single cache. CacheFiles attempts to maintain at least a certain percentage of free space on the filesystem, shrinking the cache by culling the objects it contains to make space if necessary - see the "Cache Culling" section. This means it can be placed on the same medium as a live set of data, and will expand to make use of spare space and automatically contract when the set of data requires more space. ============ REQUIREMENTS ============ The use of CacheFiles and its daemon requires the following features to be available in the system and in the cache filesystem: - dnotify. - extended attributes (xattrs). - openat() and friends. - bmap() support on files in the filesystem (FIBMAP ioctl). - The use of bmap() to detect a partial page at the end of the file. It is strongly recommended that the "dir_index" option is enabled on Ext3 filesystems being used as a cache. ============= CONFIGURATION ============= The cache is configured by a script in /etc/cachefilesd.conf. These commands set up cache ready for use. The following script commands are available: (*) brun <N>% (*) bcull <N>% (*) bstop <N>% (*) frun <N>% (*) fcull <N>% (*) fstop <N>% Configure the culling limits. Optional. See the section on culling The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. The commands beginning with a 'b' are file space (block) limits, those beginning with an 'f' are file count limits. (*) dir <path> Specify the directory containing the root of the cache. Mandatory. (*) tag <name> Specify a tag to FS-Cache to use in distinguishing multiple caches. Optional. The default is "CacheFiles". (*) debug <mask> Specify a numeric bitmask to control debugging in the kernel module. Optional. The default is zero (all off). The following values can be OR'd into the mask to collect various information: 1 Turn on trace of function entry (_enter() macros) 2 Turn on trace of function exit (_leave() macros) 4 Turn on trace of internal debug points (_debug()) This mask can also be set through sysfs, eg: echo 5 >/sys/modules/cachefiles/parameters/debug ================== STARTING THE CACHE ================== The cache is started by running the daemon. The daemon opens the cache device, configures the cache and tells it to begin caching. At that point the cache binds to fscache and the cache becomes live. The daemon is run as follows: /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] The flags are: (*) -d Increase the debugging level. This can be specified multiple times and is cumulative with itself. (*) -s Send messages to stderr instead of syslog. (*) -n Don't daemonise and go into background. (*) -f <configfile> Use an alternative configuration file rather than the default one. =============== THINGS TO AVOID =============== Do not mount other things within the cache as this will cause problems. The kernel module contains its own very cut-down path walking facility that ignores mountpoints, but the daemon can't avoid them. Do not create, rename or unlink files and directories in the cache whilst the cache is active, as this may cause the state to become uncertain. Renaming files in the cache might make objects appear to be other objects (the filename is part of the lookup key). Do not change or remove the extended attributes attached to cache files by the cache as this will cause the cache state management to get confused. Do not create files or directories in the cache, lest the cache get confused or serve incorrect data. Do not chmod files in the cache. The module creates things with minimal permissions to prevent random users being able to access them directly. ============= CACHE CULLING ============= The cache may need culling occasionally to make space. This involves discarding objects from the cache that have been used less recently than anything else. Culling is based on the access time of data objects. Empty directories are culled if not in use. Cache culling is done on the basis of the percentage of blocks and the percentage of files available in the underlying filesystem. There are six "limits": (*) brun (*) frun If the amount of free space and the number of available files in the cache rises above both these limits, then culling is turned off. (*) bcull (*) fcull If the amount of available space or the number of available files in the cache falls below either of these limits, then culling is started. (*) bstop (*) fstop If the amount of available space or the number of available files in the cache falls below either of these limits, then no further allocation of disk space or files is permitted until culling has raised things above these limits again. These must be configured thusly: 0 <= bstop < bcull < brun < 100 0 <= fstop < fcull < frun < 100 Note that these are percentages of available space and available files, and do _not_ appear as 100 minus the percentage displayed by the "df" program. The userspace daemon scans the cache to build up a table of cullable objects. These are then culled in least recently used order. A new scan of the cache is started as soon as space is made in the table. Objects will be skipped if their atimes have changed or if the kernel module says it is still using them. =============== CACHE STRUCTURE =============== The CacheFiles module will create two directories in the directory it was given: (*) cache/ (*) graveyard/ The active cache objects all reside in the first directory. The CacheFiles kernel module moves any retired or culled objects that it can't simply unlink to the graveyard from which the daemon will actually delete them. The daemon uses dnotify to monitor the graveyard directory, and will delete anything that appears therein. The module represents index objects as directories with the filename "I..." or "J...". Note that the "cache/" directory is itself a special index. Data objects are represented as files if they have no children, or directories if they do. Their filenames all begin "D..." or "E...". If represented as a directory, data objects will have a file in the directory called "data" that actually holds the data. Special objects are similar to data objects, except their filenames begin "S..." or "T...". If an object has children, then it will be represented as a directory. Immediately in the representative directory are a collection of directories named for hash values of the child object keys with an '@' prepended. Into this directory, if possible, will be placed the representations of the child objects: INDEX INDEX INDEX DATA FILES ========= ========== ================================= ================ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry If the key is so long that it exceeds NAME_MAX with the decorations added on to it, then it will be cut into pieces, the first few of which will be used to make a nest of directories, and the last one of which will be the objects inside the last directory. The names of the intermediate directories will have '+' prepended: J1223/@23/+xy...z/+kl...m/Epqr Note that keys are raw data, and not only may they exceed NAME_MAX in size, they may also contain things like '/' and NUL characters, and so they may not be suitable for turning directly into a filename. To handle this, CacheFiles will use a suitably printable filename directly and "base-64" encode ones that aren't directly suitable. The two versions of object filenames indicate the encoding: OBJECT TYPE PRINTABLE ENCODED =============== =============== =============== Index "I..." "J..." Data "D..." "E..." Special "S..." "T..." Intermediate directories are always "@" or "+" as appropriate. Each object in the cache has an extended attribute label that holds the object type ID (required to distinguish special objects) and the auxiliary data from the netfs. The latter is used to detect stale objects in the cache and update or retire them. Note that CacheFiles will erase from the cache any file it doesn't recognise or any file of an incorrect type (such as a FIFO file or a device file). ========================== SECURITY MODEL AND SELINUX ========================== CacheFiles is implemented to deal properly with the LSM security features of the Linux kernel and the SELinux facility. One of the problems that CacheFiles faces is that it is generally acting on behalf of a process, and running in that process's context, and that includes a security context that is not appropriate for accessing the cache - either because the files in the cache are inaccessible to that process, or because if the process creates a file in the cache, that file may be inaccessible to other processes. The way CacheFiles works is to temporarily change the security context (fsuid, fsgid and actor security label) that the process acts as - without changing the security context of the process when it the target of an operation performed by some other process (so signalling and suchlike still work correctly). When the CacheFiles module is asked to bind to its cache, it: (1) Finds the security label attached to the root cache directory and uses that as the security label with which it will create files. By default, this is: cachefiles_var_t (2) Finds the security label of the process which issued the bind request (presumed to be the cachefilesd daemon), which by default will be: cachefilesd_t and asks LSM to supply a security ID as which it should act given the daemon's label. By default, this will be: cachefiles_kernel_t SELinux transitions the daemon's security ID to the module's security ID based on a rule of this form in the policy. type_transition <daemon's-ID> kernel_t : process <module's-ID>; For instance: type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; The module's security ID gives it permission to create, move and remove files and directories in the cache, to find and access directories and files in the cache, to set and access extended attributes on cache objects, and to read and write files in the cache. The daemon's security ID gives it only a very restricted set of permissions: it may scan directories, stat files and erase files and directories. It may not read or write files in the cache, and so it is precluded from accessing the data cached therein; nor is it permitted to create new files in the cache. There are policy source files available in: http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 and later versions. In that tarball, see the files: cachefilesd.te cachefilesd.fc cachefilesd.if They are built and installed directly by the RPM. If a non-RPM based system is being used, then copy the above files to their own directory and run: make -f /usr/share/selinux/devel/Makefile semodule -i cachefilesd.pp You will need checkpolicy and selinux-policy-devel installed prior to the build. By default, the cache is located in /var/fscache, but if it is desirable that it should be elsewhere, than either the above policy files must be altered, or an auxiliary policy must be installed to label the alternate location of the cache. For instructions on how to add an auxiliary policy to enable the cache to be located elsewhere when SELinux is in enforcing mode, please see: /usr/share/doc/cachefilesd-*/move-cache.txt When the cachefilesd rpm is installed; alternatively, the document can be found in the sources. ================== A NOTE ON SECURITY ================== CacheFiles makes use of the split security in the task_struct. It allocates its own task_security structure, and redirects current->act_as to point to it when it acts on behalf of another process, in that process's context. The reason it does this is that it calls vfs_mkdir() and suchlike rather than bypassing security and calling inode ops directly. Therefore the VFS and LSM may deny the CacheFiles access to the cache data because under some circumstances the caching code is running in the security context of whatever process issued the original syscall on the netfs. Furthermore, should CacheFiles create a file or directory, the security parameters with that object is created (UID, GID, security label) would be derived from that process that issued the system call, thus potentially preventing other processes from accessing the cache - including CacheFiles's cache management daemon (cachefilesd). What is required is to temporarily override the security of the process that issued the system call. We can't, however, just do an in-place change of the security data as that affects the process as an object, not just as a subject. This means it may lose signals or ptrace events for example, and affects what the process looks like in /proc. So CacheFiles makes use of a logical split in the security between the objective security (task->sec) and the subjective security (task->act_as). The objective security holds the intrinsic security properties of a process and is never overridden. This is what appears in /proc, and is what is used when a process is the target of an operation by some other process (SIGKILL for example). The subjective security holds the active security properties of a process, and may be overridden. This is not seen externally, and is used whan a process acts upon another object, for example SIGKILLing another process or opening a file. LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request for CacheFiles to run in a context of a specific security label, or to create files and directories with another security label. This documentation is added by the patch to: Documentation/filesystems/caching/cachefiles.txt Signed-Off-By: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03FS-Cache: Add and document asynchronous operation handlingDavid Howells
Add and document asynchronous operation handling for use by FS-Cache's data storage and retrieval routines. The following documentation is added to: Documentation/filesystems/caching/operations.txt ================================ ASYNCHRONOUS OPERATIONS HANDLING ================================ ======== OVERVIEW ======== FS-Cache has an asynchronous operations handling facility that it uses for its data storage and retrieval routines. Its operations are represented by fscache_operation structs, though these are usually embedded into some other structure. This facility is available to and expected to be be used by the cache backends, and FS-Cache will create operations and pass them off to the appropriate cache backend for completion. To make use of this facility, <linux/fscache-cache.h> should be #included. =============================== OPERATION RECORD INITIALISATION =============================== An operation is recorded in an fscache_operation struct: struct fscache_operation { union { struct work_struct fast_work; struct slow_work slow_work; }; unsigned long flags; fscache_operation_processor_t processor; ... }; Someone wanting to issue an operation should allocate something with this struct embedded in it. They should initialise it by calling: void fscache_operation_init(struct fscache_operation *op, fscache_operation_release_t release); with the operation to be initialised and the release function to use. The op->flags parameter should be set to indicate the CPU time provision and the exclusivity (see the Parameters section). The op->fast_work, op->slow_work and op->processor flags should be set as appropriate for the CPU time provision (see the Parameters section). FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the operation and waited for afterwards. ========== PARAMETERS ========== There are a number of parameters that can be set in the operation record's flag parameter. There are three options for the provision of CPU time in these operations: (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread may decide it wants to handle an operation itself without deferring it to another thread. This is, for example, used in read operations for calling readpages() on the backing filesystem in CacheFiles. Although readpages() does an asynchronous data fetch, the determination of whether pages exist is done synchronously - and the netfs does not proceed until this has been determined. If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags before submitting the operation, and the operating thread must wait for it to be cleared before proceeding: wait_on_bit(&op->flags, FSCACHE_OP_WAITING, fscache_wait_bit, TASK_UNINTERRUPTIBLE); (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it will be given to keventd to process. Such an operation is not permitted to sleep on I/O. This is, for example, used by CacheFiles to copy data from a backing fs page to a netfs page after the backing fs has read the page in. If this option is used, op->fast_work and op->processor must be initialised before submitting the operation: INIT_WORK(&op->fast_work, do_some_work); (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it will be given to the slow work facility to process. Such an operation is permitted to sleep on I/O. This is, for example, used by FS-Cache to handle background writes of pages that have just been fetched from a remote server. If this option is used, op->slow_work and op->processor must be initialised before submitting the operation: fscache_operation_init_slow(op, processor) Furthermore, operations may be one of two types: (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in conjunction with any other operation on the object being operated upon. An example of this is the attribute change operation, in which the file being written to may need truncation. (2) Shareable. Operations of this type may be running simultaneously. It's up to the operation implementation to prevent interference between other operations running at the same time. ========= PROCEDURE ========= Operations are used through the following procedure: (1) The submitting thread must allocate the operation and initialise it itself. Normally this would be part of a more specific structure with the generic op embedded within. (2) The submitting thread must then submit the operation for processing using one of the following two functions: int fscache_submit_op(struct fscache_object *object, struct fscache_operation *op); int fscache_submit_exclusive_op(struct fscache_object *object, struct fscache_operation *op); The first function should be used to submit non-exclusive ops and the second to submit exclusive ones. The caller must still set the FSCACHE_OP_EXCLUSIVE flag. If successful, both functions will assign the operation to the specified object and return 0. -ENOBUFS will be returned if the object specified is permanently unavailable. The operation manager will defer operations on an object that is still undergoing lookup or creation. The operation will also be deferred if an operation of conflicting exclusivity is in progress on the object. If the operation is asynchronous, the manager will retain a reference to it, so the caller should put their reference to it by passing it to: void fscache_put_operation(struct fscache_operation *op); (3) If the submitting thread wants to do the work itself, and has marked the operation with FSCACHE_OP_MYTHREAD, then it should monitor FSCACHE_OP_WAITING as described above and check the state of the object if necessary (the object might have died whilst the thread was waiting). When it has finished doing its processing, it should call fscache_put_operation() on it. (4) The operation holds an effective lock upon the object, preventing other exclusive ops conflicting until it is released. The operation can be enqueued for further immediate asynchronous processing by adjusting the CPU time provisioning option if necessary, eg: op->flags &= ~FSCACHE_OP_TYPE; op->flags |= ~FSCACHE_OP_FAST; and calling: void fscache_enqueue_operation(struct fscache_operation *op) This can be used to allow other things to have use of the worker thread pools. ===================== ASYNCHRONOUS CALLBACK ===================== When used in asynchronous mode, the worker thread pool will invoke the processor method with a pointer to the operation. This should then get at the container struct by using container_of(): static void fscache_write_op(struct fscache_operation *_op) { struct fscache_storage *op = container_of(_op, struct fscache_storage, op); ... } The caller holds a reference on the operation, and will invoke fscache_put_operation() when the processor function returns. The processor function is at liberty to call fscache_enqueue_operation() or to take extra references. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03FS-Cache: Object management state machineDavid Howells
Implement the cache object management state machine. The following documentation is added to illuminate the working of this state machine. It will also be added as: Documentation/filesystems/caching/object.txt ==================================================== IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT ==================================================== ============== REPRESENTATION ============== FS-Cache maintains an in-kernel representation of each object that a netfs is currently interested in. Such objects are represented by the fscache_cookie struct and are referred to as cookies. FS-Cache also maintains a separate in-kernel representation of the objects that a cache backend is currently actively caching. Such objects are represented by the fscache_object struct. The cache backends allocate these upon request, and are expected to embed them in their own representations. These are referred to as objects. There is a 1:N relationship between cookies and objects. A cookie may be represented by multiple objects - an index may exist in more than one cache - or even by no objects (it may not be cached). Furthermore, both cookies and objects are hierarchical. The two hierarchies correspond, but the cookies tree is a superset of the union of the object trees of multiple caches: NETFS INDEX TREE : CACHE 1 : CACHE 2 : : : +-----------+ : +----------->| IObject | : +-----------+ | : +-----------+ : | ICookie |-------+ : | : +-----------+ | : | : +-----------+ | +------------------------------>| IObject | | : | : +-----------+ | : V : | | : +-----------+ : | V +----------->| IObject | : | +-----------+ | : +-----------+ : | | ICookie |-------+ : | : V +-----------+ | : | : +-----------+ | +------------------------------>| IObject | +-----+-----+ : | : +-----------+ | | : | : | V | : V : | +-----------+ | : +-----------+ : | | ICookie |------------------------->| IObject | : | +-----------+ | : +-----------+ : | | V : | : V | +-----------+ : | : +-----------+ | | ICookie |-------------------------------->| IObject | | +-----------+ : | : +-----------+ V | : V : | +-----------+ | : +-----------+ : | | DCookie |------------------------->| DObject | : | +-----------+ | : +-----------+ : | | : : | +-------+-------+ : : | | | : : | V V : : V +-----------+ +-----------+ : : +-----------+ | DCookie | | DCookie |------------------------>| DObject | +-----------+ +-----------+ : : +-----------+ : : In the above illustration, ICookie and IObject represent indices and DCookie and DObject represent data storage objects. Indices may have representation in multiple caches, but currently, non-index objects may not. Objects of any type may also be entirely unrepresented. As far as the netfs API goes, the netfs is only actually permitted to see pointers to the cookies. The cookies themselves and any objects attached to those cookies are hidden from it. =============================== OBJECT MANAGEMENT STATE MACHINE =============================== Within FS-Cache, each active object is managed by its own individual state machine. The state for an object is kept in the fscache_object struct, in object->state. A cookie may point to a set of objects that are in different states. Each state has an action associated with it that is invoked when the machine wakes up in that state. There are four logical sets of states: (1) Preparation: states that wait for the parent objects to become ready. The representations are hierarchical, and it is expected that an object must be created or accessed with respect to its parent object. (2) Initialisation: states that perform lookups in the cache and validate what's found and that create on disk any missing metadata. (3) Normal running: states that allow netfs operations on objects to proceed and that update the state of objects. (4) Termination: states that detach objects from their netfs cookies, that delete objects from disk, that handle disk and system errors and that free up in-memory resources. In most cases, transitioning between states is in response to signalled events. When a state has finished processing, it will usually set the mask of events in which it is interested (object->event_mask) and relinquish the worker thread. Then when an event is raised (by calling fscache_raise_event()), if the event is not masked, the object will be queued for processing (by calling fscache_enqueue_object()). PROVISION OF CPU TIME --------------------- The work to be done by the various states is given CPU time by the threads of the slow work facility (see Documentation/slow-work.txt). This is used in preference to the workqueue facility because: (1) Threads may be completely occupied for very long periods of time by a particular work item. These state actions may be doing sequences of synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, getxattr, truncate, unlink, rmdir, rename). (2) Threads may do little actual work, but may rather spend a lot of time sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded workqueues don't necessarily have the right numbers of threads. LOCKING SIMPLIFICATION ---------------------- Because only one worker thread may be operating on any particular object's state machine at once, this simplifies the locking, particularly with respect to disconnecting the netfs's representation of a cache object (fscache_cookie) from the cache backend's representation (fscache_object) - which may be requested from either end. ================= THE SET OF STATES ================= The object state machine has a set of states that it can be in. There are preparation states in which the object sets itself up and waits for its parent object to transit to a state that allows access to its children: (1) State FSCACHE_OBJECT_INIT. Initialise the object and wait for the parent object to become active. In the cache, it is expected that it will not be possible to look an object up from the parent object, until that parent object itself has been looked up. There are initialisation states in which the object sets itself up and accesses disk for the object metadata: (2) State FSCACHE_OBJECT_LOOKING_UP. Look up the object on disk, using the parent as a starting point. FS-Cache expects the cache backend to probe the cache to see whether this object is represented there, and if it is, to see if it's valid (coherency management). The cache should call fscache_object_lookup_negative() to indicate lookup failure for whatever reason, and should call fscache_obtained_object() to indicate success. At the completion of lookup, FS-Cache will let the netfs go ahead with read operations, no matter whether the file is yet cached. If not yet cached, read operations will be immediately rejected with ENODATA until the first known page is uncached - as to that point there can be no data to be read out of the cache for that file that isn't currently also held in the pagecache. (3) State FSCACHE_OBJECT_CREATING. Create an object on disk, using the parent as a starting point. This happens if the lookup failed to find the object, or if the object's coherency data indicated what's on disk is out of date. In this state, FS-Cache expects the cache to create The cache should call fscache_obtained_object() if creation completes successfully, fscache_object_lookup_negative() otherwise. At the completion of creation, FS-Cache will start processing write operations the netfs has queued for an object. If creation failed, the write ops will be transparently discarded, and nothing recorded in the cache. There are some normal running states in which the object spends its time servicing netfs requests: (4) State FSCACHE_OBJECT_AVAILABLE. A transient state in which pending operations are started, child objects are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary lookup data is freed. (5) State FSCACHE_OBJECT_ACTIVE. The normal running state. In this state, requests the netfs makes will be passed on to the cache. (6) State FSCACHE_OBJECT_UPDATING. The state machine comes here to update the object in the cache from the netfs's records. This involves updating the auxiliary data that is used to maintain coherency. And there are terminal states in which an object cleans itself up, deallocates memory and potentially deletes stuff from disk: (7) State FSCACHE_OBJECT_LC_DYING. The object comes here if it is dying because of a lookup or creation error. This would be due to a disk error or system error of some sort. Temporary data is cleaned up, and the parent is released. (8) State FSCACHE_OBJECT_DYING. The object comes here if it is dying due to an error, because its parent cookie has been relinquished by the netfs or because the cache is being withdrawn. Any child objects waiting on this one are given CPU time so that they too can destroy themselves. This object waits for all its children to go away before advancing to the next state. (9) State FSCACHE_OBJECT_ABORT_INIT. The object comes to this state if it was waiting on its parent in FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself so that the parent may proceed from the FSCACHE_OBJECT_DYING state. (10) State FSCACHE_OBJECT_RELEASING. (11) State FSCACHE_OBJECT_RECYCLING. The object comes to one of these two states when dying once it is rid of all its children, if it is dying because the netfs relinquished its cookie. In the first state, the cached data is expected to persist, and in the second it will be deleted. (12) State FSCACHE_OBJECT_WITHDRAWING. The object transits to this state if the cache decides it wants to withdraw the object from service, perhaps to make space, but also due to error or just because the whole cache is being withdrawn. (13) State FSCACHE_OBJECT_DEAD. The object transits to this state when the in-memory object record is ready to be deleted. The object processor shouldn't ever see an object in this state. THE SET OF EVENTS ----------------- There are a number of events that can be raised to an object state machine: (*) FSCACHE_OBJECT_EV_UPDATE The netfs requested that an object be updated. The state machine will ask the cache backend to update the object, and the cache backend will ask the netfs for details of the change through its cookie definition ops. (*) FSCACHE_OBJECT_EV_CLEARED This is signalled in two circumstances: (a) when an object's last child object is dropped and (b) when the last operation outstanding on an object is completed. This is used to proceed from the dying state. (*) FSCACHE_OBJECT_EV_ERROR This is signalled when an I/O error occurs during the processing of some object. (*) FSCACHE_OBJECT_EV_RELEASE (*) FSCACHE_OBJECT_EV_RETIRE These are signalled when the netfs relinquishes a cookie it was using. The event selected depends on whether the netfs asks for the backing object to be retired (deleted) or retained. (*) FSCACHE_OBJECT_EV_WITHDRAW This is signalled when the cache backend wants to withdraw an object. This means that the object will have to be detached from the netfs's cookie. Because the withdrawing releasing/retiring events are all handled by the object state machine, it doesn't matter if there's a collision with both ends trying to sever the connection at the same time. The state machine can just pick which one it wants to honour, and that effects the other. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03FS-Cache: Add use of /proc and presentation of statisticsDavid Howells
Make FS-Cache create its /proc interface and present various statistical information through it. Also provide the functions for updating this information. These features are enabled by: CONFIG_FSCACHE_PROC CONFIG_FSCACHE_STATS CONFIG_FSCACHE_HISTOGRAM The /proc directory for FS-Cache is also exported so that caching modules can add their own statistics there too. The FS-Cache module is loadable at this point, and the statistics files can be examined by userspace: cat /proc/fs/fscache/stats cat /proc/fs/fscache/histogram Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03FS-Cache: Add the FS-Cache cache backend API and documentationDavid Howells
Add the API for a generic facility (FS-Cache) by which caches may declare them selves open for business, and may obtain work to be done from network filesystems. The header file is included by: #include <linux/fscache-cache.h> Documentation for the API is also added to: Documentation/filesystems/caching/backend-api.txt This API is not usable without the implementation of the utility functions which will be added in further patches. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03FS-Cache: Add the FS-Cache netfs API and documentationDavid Howells
Add the API for a generic facility (FS-Cache) by which filesystems (such as AFS or NFS) may call on local caching capabilities without having to know anything about how the cache works, or even if there is a cache: +---------+ | | +--------------+ | NFS |--+ | | | | | +-->| CacheFS | +---------+ | +----------+ | | /dev/hda5 | | | | | +--------------+ +---------+ +-->| | | | | | |--+ | AFS |----->| FS-Cache | | | | |--+ +---------+ +-->| | | | | | | +--------------+ +---------+ | +----------+ | | | | | | +-->| CacheFiles | | ISOFS |--+ | /var/cache | | | +--------------+ +---------+ General documentation and documentation of the netfs specific API are provided in addition to the header files. As this patch stands, it is possible to build a filesystem against the facility and attempt to use it. All that will happen is that all requests will be immediately denied as if no cache is present. Further patches will implement the core of the facility. The facility will transfer requests from networking filesystems to appropriate caches if possible, or else gracefully deny them. If this facility is disabled in the kernel configuration, then all its operations will trivially reduce to nothing during compilation. WHY NOT I_MAPPING? ================== I have added my own API to implement caching rather than using i_mapping to do this for a number of reasons. These have been discussed a lot on the LKML and CacheFS mailing lists, but to summarise the basics: (1) Most filesystems don't do hole reportage. Holes in files are treated as blocks of zeros and can't be distinguished otherwise, making it difficult to distinguish blocks that have been read from the network and cached from those that haven't. (2) The backing inode must be fully populated before being exposed to userspace through the main inode because the VM/VFS goes directly to the backing inode and does not interrogate the front inode's VM ops. Therefore: (a) The backing inode must fit entirely within the cache. (b) All backed files currently open must fit entirely within the cache at the same time. (c) A working set of files in total larger than the cache may not be cached. (d) A file may not grow larger than the available space in the cache. (e) A file that's open and cached, and remotely grows larger than the cache is potentially stuffed. (3) Writes go to the backing filesystem, and can only be transferred to the network when the file is closed. (4) There's no record of what changes have been made, so the whole file must be written back. (5) The pages belong to the backing filesystem, and all metadata associated with that page are relevant only to the backing filesystem, and not anything stacked atop it. OVERVIEW ======== FS-Cache provides (or will provide) the following facilities: (1) Caches can be added / removed at any time, even whilst in use. (2) Adds a facility by which tags can be used to refer to caches, even if they're not available yet. (3) More than one cache can be used at once. Caches can be selected explicitly by use of tags. (4) The netfs is provided with an interface that allows either party to withdraw caching facilities from a file (required for (1)). (5) A netfs may annotate cache objects that belongs to it. This permits the storage of coherency maintenance data. (6) Cache objects will be pinnable and space reservations will be possible. (7) The interface to the netfs returns as few errors as possible, preferring rather to let the netfs remain oblivious. (8) Cookies are used to represent indices, files and other objects to the netfs. The simplest cookie is just a NULL pointer - indicating nothing cached there. (9) The netfs is allowed to propose - dynamically - any index hierarchy it desires, though it must be aware that the index search function is recursive, stack space is limited, and indices can only be children of indices. (10) Indices can be used to group files together to reduce key size and to make group invalidation easier. The use of indices may make lookup quicker, but that's cache dependent. (11) Data I/O is effectively done directly to and from the netfs's pages. The netfs indicates that page A is at index B of the data-file represented by cookie C, and that it should be read or written. The cache backend may or may not start I/O on that page, but if it does, a netfs callback will be invoked to indicate completion. The I/O may be either synchronous or asynchronous. (12) Cookies can be "retired" upon release. At this point FS-Cache will mark them as obsolete and the index hierarchy rooted at that point will get recycled. (13) The netfs provides a "match" function for index searches. In addition to saying whether a match was made or not, this can also specify that an entry should be updated or deleted. FS-Cache maintains a virtual index tree in which all indices, files, objects and pages are kept. Bits of this tree may actually reside in one or more caches. FSDEF | +------------------------------------+ | | NFS AFS | | +--------------------------+ +-----------+ | | | | homedir mirror afs.org redhat.com | | | +------------+ +---------------+ +----------+ | | | | | | 00001 00002 00007 00125 vol00001 vol00002 | | | | | +---+---+ +-----+ +---+ +------+------+ +-----+----+ | | | | | | | | | | | | | PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak | | PG0 +-------+ | | 00001 00003 | +---+---+ | | | PG0 PG1 PG2 In the example above, two netfs's can be seen to be backed: NFS and AFS. These have different index hierarchies: (*) The NFS primary index will probably contain per-server indices. Each server index is indexed by NFS file handles to get data file objects. Each data file objects can have an array of pages, but may also have further child objects, such as extended attributes and directory entries. Extended attribute objects themselves have page-array contents. (*) The AFS primary index contains per-cell indices. Each cell index contains per-logical-volume indices. Each of volume index contains up to three indices for the read-write, read-only and backup mirrors of those volumes. Each of these contains vnode data file objects, each of which contains an array of pages. The very top index is the FS-Cache master index in which individual netfs's have entries. Any index object may reside in more than one cache, provided it only has index children. Any index with non-index object children will be assumed to only reside in one cache. The FS-Cache overview can be found in: Documentation/filesystems/caching/fscache.txt The netfs API to FS-Cache can be found in: Documentation/filesystems/caching/netfs-api.txt Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-02documentation: update Documentation/filesystem/proc.txt and ↵Shen Feng
Documentation/sysctls Now /proc/sys is described in many places and much information is redundant. This patch updates the proc.txt and move the /proc/sys desciption out to the files in Documentation/sysctls. Details are: merge - 2.1 /proc/sys/fs - File system data - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem - 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface with Documentation/sysctls/fs.txt. remove - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats since it's not better then the Documentation/binfmt_misc.txt. merge - 2.3 /proc/sys/kernel - general kernel parameters with Documentation/sysctls/kernel.txt remove - 2.5 /proc/sys/dev - Device specific parameters since it's obsolete the sysfs is used now. remove - 2.6 /proc/sys/sunrpc - Remote procedure calls since it's not better then the Documentation/sysctls/sunrpc.txt move - 2.7 /proc/sys/net - Networking stuff - 2.9 Appletalk - 2.10 IPX to newly created Documentation/sysctls/net.txt. remove - 2.8 /proc/sys/net/ipv4 - IPV4 settings since it's not better then the Documentation/networking/ip-sysctl.txt. add - Chapter 3 Per-Process Parameters to descibe /proc/<pid>/xxx parameters. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02udf: implement mode and dmode mounting optionsMarcin Slusarz
"dmode" allows overriding permissions of directories and "mode" allows overriding permissions of files. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-01Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) ext4: Regularize mount options ext4: fix locking typo in mballoc which could cause soft lockup hangs ext4: fix typo which causes a memory leak on error path jbd2: Update locking coments ext4: Rename pa_linear to pa_type ext4: add checks of block references for non-extent inodes ext4: Check for an valid i_mode when reading the inode from disk ext4: Use WRITE_SYNC for commits which are caused by fsync() ext4: Add auto_da_alloc mount option ext4: Use struct flex_groups to calculate get_orlov_stats() ext4: Use atomic_t's in struct flex_groups ext4: remove /proc tuning knobs ext4: Add sysfs support ext4: Track lifetime disk writes ext4: Fix discard of inode prealloc space with delayed allocation. ext4: Automatically allocate delay allocated blocks on rename ext4: Automatically allocate delay allocated blocks on close ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl ext4: Simplify delalloc code by removing mpage_da_writepages() ext4: Save stack space by removing fake buffer heads ...