diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2009-04-05 11:06:45 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2009-04-05 11:06:45 -0700 |
commit | 3516c6a8dc0b1153c611c4cf0dc4a51631f052bb (patch) | |
tree | c54a5fc916cbe73e43dee20902642f367f44a551 /Documentation/filesystems/pohmelfs/design_notes.txt | |
parent | 714f83d5d9f7c785f622259dad1f4fad12d64664 (diff) | |
parent | ba0e1ebb7ea0616eebc29d2077355bacea62a9d8 (diff) |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (714 commits)
Staging: sxg: slicoss: Specify the license for Sahara SXG and Slicoss drivers
Staging: serqt_usb: fix build due to proc tty changes
Staging: serqt_usb: fix checkpatch errors
Staging: serqt_usb: add TODO file
Staging: serqt_usb: Lindent the code
Staging: add USB serial Quatech driver
staging: document that the wifi staging drivers a bit better
Staging: echo cleanup
Staging: BUG to BUG_ON changes
Staging: remove some pointless conditionals before kfree_skb()
Staging: line6: fix build error, select SND_RAWMIDI
Staging: line6: fix checkpatch errors in variax.c
Staging: line6: fix checkpatch errors in toneport.c
Staging: line6: fix checkpatch errors in pcm.c
Staging: line6: fix checkpatch errors in midibuf.c
Staging: line6: fix checkpatch errors in midi.c
Staging: line6: fix checkpatch errors in dumprequest.c
Staging: line6: fix checkpatch errors in driver.c
Staging: line6: fix checkpatch errors in audio.c
Staging: line6: fix checkpatch errors in pod.c
...
Diffstat (limited to 'Documentation/filesystems/pohmelfs/design_notes.txt')
-rw-r--r-- | Documentation/filesystems/pohmelfs/design_notes.txt | 70 |
1 files changed, 70 insertions, 0 deletions
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt new file mode 100644 index 00000000000..6d6db60d567 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/design_notes.txt @@ -0,0 +1,70 @@ +POHMELFS: Parallel Optimized Host Message Exchange Layered File System. + + Evgeniy Polyakov <zbr@ioremap.net> + +Homepage: http://www.ioremap.net/projects/pohmelfs + +POHMELFS first began as a network filesystem with coherent local data and +metadata caches but is now evolving into a parallel distributed filesystem. + +Main features of this FS include: + * Locally coherent cache for data and metadata with (potentially) byte-range locks. + Since all Linux filesystems lock the whole inode during writing, algorithm + is very simple and does not use byte-ranges, although they are sent in + locking messages. + * Completely async processing of all events except creation of hard and symbolic + links, and rename events. + Object creation and data reading and writing are processed asynchronously. + * Flexible object architecture optimized for network processing. + Ability to create long paths to objects and remove arbitrarily huge + directories with a single network command. + (like removing the whole kernel tree via a single network command). + * Very high performance. + * Fast and scalable multithreaded userspace server. Being in userspace it works + with any underlying filesystem and still is much faster than async in-kernel NFS one. + * Client is able to switch between different servers (if one goes down, client + automatically reconnects to second and so on). + * Transactions support. Full failover for all operations. + Resending transactions to different servers on timeout or error. + * Read request (data read, directory listing, lookup requests) balancing between multiple servers. + * Write requests are replicated to multiple servers and completed only when all of them are acked. + * Ability to add and/or remove servers from the working set at run-time. + * Strong authentification and possible data encryption in network channel. + * Extended attributes support. + +POHMELFS is based on transactions, which are potentially long-standing objects that live +in the client's memory. Each transaction contains all the information needed to process a given +command (or set of commands, which is frequently used during data writing: single transactions +can contain creation and data writing commands). Transactions are committed by all the servers +to which they are sent and, in case of failures, are eventually resent or dropped with an error. +For example, reading will return an error if no servers are available. + +POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is +possible to detach replies from requests and, if the command requires data to be received, the +caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different +servers and async threads will pick up replies in parallel, find appropriate transactions in the +system and put the data where it belongs (like the page or inode cache). + +The main feature of POHMELFS is writeback data and the metadata cache. +Only a few non-performance critical operations use the write-through cache and +are synchronous: hard and symbolic link creation, and object rename. Creation, +removal of objects and data writing are asynchronous and are sent to +the server during system writeback. Only one writer at a time is allowed for any +given inode, which is guarded by an appropriate locking protocol. +Because of this feature, POHMELFS is extremely fast at metadata intensive +workloads and can fully utilize the bandwidth to the servers when doing bulk +data transfers. + +POHMELFS clients operate with a working set of servers and are capable of balancing read-only +operations (like lookups or directory listings) between them. +Administrators can add or remove servers from the set at run-time via special commands (described +in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers. + +POHMELFS is capable of full data channel encryption and/or strong crypto hashing. +One can select any kernel supported cipher, encryption mode, hash type and operation mode +(hmac or digest). It is also possible to use both or neither (default). Crypto configuration +is checked during mount time and, if the server does not support it, appropriate capabilities +will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified). +Crypto performance heavily depends on the number of crypto threads, which asynchronously perform +crypto operations and send the resulting data to server or submit it up the stack. This number +can be controlled via a mount option. |