diff options
Diffstat (limited to 'Documentation/filesystems/Exporting')
-rw-r--r-- | Documentation/filesystems/Exporting | 176 |
1 files changed, 176 insertions, 0 deletions
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting new file mode 100644 index 00000000000..31047e0fe14 --- /dev/null +++ b/Documentation/filesystems/Exporting @@ -0,0 +1,176 @@ + +Making Filesystems Exportable +============================= + +Most filesystem operations require a dentry (or two) as a starting +point. Local applications have a reference-counted hold on suitable +dentrys via open file descriptors or cwd/root. However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry. As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (out side of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentrys will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree. This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object. This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1/ The dcache must sometimes contain objects that are not part of the + proper prefix. i.e that are not connected to the root. +2/ The dcache must be prepared for a newly found (via ->lookup) directory + to already have a (non-connected) dentry, and must be able to move + that dentry into place (based on the parent and name in the + ->lookup). This is particularly needed for directories as + it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a/ A dentry flag DCACHE_DISCONNECTED which is set on + any dentry that might not be part of the proper prefix. + This is set when anonymous dentries are created, and cleared when a + dentry is noticed to be a child of a dentry which is in the proper + prefix. + +b/ A per-superblock list "s_anon" of dentries which are the roots of + subtrees that are not in the proper prefix. These dentries, as + well as the proper prefix, need to be released at unmount time. As + these dentries will not be hashed, they are linked together on the + d_hash list_head. + +c/ Helper routines to allocate anonymous dentries, and to help attach + loose directory dentries at lookup time. They are: + d_alloc_anon(inode) will return a dentry for the given inode. + If the inode already has a dentry, one of those is returned. + If it doesn't, a new anonymous (IS_ROOT and + DCACHE_DISCONNECTED) dentry is allocated and attached. + In the case of a directory, care is taken that only one dentry + can ever be attached. + d_splice_alias(inode, dentry) will make sure that there is a + dentry with the same name and parent as the given dentry, and + which refers to the given inode. + If the inode is a directory and already has a dentry, then that + dentry is d_moved over the given dentry. + If the passed dentry gets attached, care is taken that this is + mutually exclusive to a d_alloc_anon operation. + If the passed dentry is used, NULL is returned, else the used + dentry is returned. This corresponds to the calling pattern of + ->lookup. + + +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: + + 1/ provide the filehandle fragment routines described below. + 2/ make sure that d_splice_alias is used rather than d_add + when ->lookup finds an inode for a given parent and name. + Typically the ->lookup routine will end: + if (inode) + return d_splice(inode, dentry); + d_add(dentry, inode); + return NULL; + } + + + + A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block. This field must point to a "struct export_operations" +struct which could potentially be full of NULLs, though normally at +least get_parent will be set. + + The primary operations are decode_fh and encode_fh. +decode_fh takes a filehandle fragment and tries to find or create a +dentry for the object referred to by the filehandle. +encode_fh takes a dentry and creates a filehandle fragment which can +later be used to find/create a dentry for the same object. + +decode_fh will probably make use of "find_exported_dentry". +This function lives in the "exportfs" module which a filesystem does +not need unless it is being exported. So rather that calling +find_exported_dentry directly, each filesystem should call it through +the find_exported_dentry pointer in it's export_operations table. +This field is set correctly by the exporting agent (e.g. nfsd) when a +filesystem is exported, and before any export operations are called. + +find_exported_dentry needs three support functions from the +filesystem: + get_name. When given a parent dentry and a child dentry, this + should find a name in the directory identified by the parent + dentry, which leads to the object identified by the child dentry. + If no get_name function is supplied, a default implementation is + provided which uses vfs_readdir to find potential names, and + matches inode numbers to find the correct match. + + get_parent. When given a dentry for a directory, this should return + a dentry for the parent. Quite possibly the parent dentry will + have been allocated by d_alloc_anon. + The default get_parent function just returns an error so any + filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." + entries in the dcache which are too messy to work with. + + get_dentry. When given an opaque datum, this should find the + implied object and create a dentry for it (possibly with + d_alloc_anon). + The opaque datum is whatever is passed down by the decode_fh + function, and is often simply a fragment of the filehandle + fragment. + decode_fh passes two datums through find_exported_dentry. One that + should be used to identify the target object, and one that can be + used to identify the object's parent, should that be necessary. + The default get_dentry function assumes that the datum contains an + inode number and a generation number, and it attempts to get the + inode using "iget" and check it's validity by matching the + generation number. A filesystem should only depend on the default + if iget can safely be used this way. + +If decode_fh and/or encode_fh are left as NULL, then default +implementations are used. These defaults are suitable for ext2 and +extremely similar filesystems (like ext3). + +The default encode_fh creates a filehandle fragment from the inode +number and generation number of the target together with the inode +number and generation number of the parent (if the parent is +required). + +The default decode_fh extract the target and parent datums from the +filehandle assuming the format used by the default encode_fh and +passed them to find_exported_dentry. + + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it. This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls. Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. + + |