path: root/fs/btrfs/extent-tree.c
Age | Commit message | Author
2009-02-20 | Btrfs: try committing transaction before returning ENOSPC | Josef Bacik
This fixes a problem where we could return -ENOSPC when we may actually have plenty of space, the space is just pinned. Instead of returning -ENOSPC immediately, commit the transaction first and then try and do the allocation again. This patch also does chunk allocation for metadata if we pass the 80% threshold for metadata space. This will help with stack usage since the chunk allocation will happen early on, instead of when the allocation is happening. Signed-off-by: Josef Bacik <jbacik@redhat.com>
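The retry-after-commit idea can be sketched in userspace C. This is only an illustrative model under stated assumptions: `struct space_model` and every `model_*` function are hypothetical names, not kernel APIs, and counting pinned bytes toward the 80% threshold is an assumption the commit message does not confirm.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified accounting; not the kernel's btrfs_space_info. */
struct space_model {
    uint64_t total_bytes;   /* capacity of this space class */
    uint64_t used_bytes;    /* bytes allocated and in use */
    uint64_t pinned_bytes;  /* freed, but unusable until the transaction commits */
};

/* In this model, committing the transaction releases all pinned bytes. */
void model_commit_transaction(struct space_model *s)
{
    s->pinned_bytes = 0;
}

/* One allocation attempt: fails if the request exceeds the unpinned free space. */
int model_try_alloc(struct space_model *s, uint64_t bytes)
{
    assert(s->used_bytes + s->pinned_bytes <= s->total_bytes);
    uint64_t free_bytes = s->total_bytes - s->used_bytes - s->pinned_bytes;

    if (bytes > free_bytes)
        return -ENOSPC;
    s->used_bytes += bytes;
    return 0;
}

/* Allocate; on -ENOSPC, commit once and retry before giving up. */
int model_alloc(struct space_model *s, uint64_t bytes)
{
    int ret = model_try_alloc(s, bytes);

    if (ret == -ENOSPC && s->pinned_bytes > 0) {
        model_commit_transaction(s);
        ret = model_try_alloc(s, bytes);
    }
    return ret;
}

/* True once more than 80% of the space is consumed (assumption: pinned counts). */
bool model_should_alloc_chunk(const struct space_model *s)
{
    return (s->used_bytes + s->pinned_bytes) * 10 > s->total_bytes * 8;
}
```

With 100 bytes total, 50 used and 40 pinned, a 30-byte request fails on the first attempt and succeeds after the modeled commit releases the pinned bytes.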
2009-02-20 | Btrfs: add better -ENOSPC handling | Josef Bacik
This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc accounts for extents we've actually set up for delalloc and will be allocated at some point down the line. bytes_may_use keeps track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from being unable to allocate space for delalloc bytes later. Signed-off-by: Josef Bacik <jbacik@redhat.com>
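A minimal sketch of how the two counters might interact, written as a userspace model. The struct and function names are hypothetical; only `bytes_delalloc` and `bytes_may_use` come from the commit message, and moving bytes into `bytes_delalloc` at the moment they leave `bytes_may_use` is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters described above. */
struct space_info_model {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_delalloc; /* extents already set up for delalloc */
    uint64_t bytes_may_use;  /* reservations that may become delalloc later */
};

/* Reserve space up front; the real code would try a chunk allocation before failing. */
int model_reserve(struct space_info_model *si, uint64_t bytes)
{
    uint64_t committed = si->bytes_used + si->bytes_delalloc + si->bytes_may_use;

    assert(committed <= si->total_bytes);
    if (bytes > si->total_bytes - committed)
        return -ENOSPC;
    si->bytes_may_use += bytes;
    return 0;
}

/* When the delalloc bit is set, move the reservation to bytes_delalloc (assumption). */
void model_set_delalloc(struct space_info_model *si, uint64_t bytes)
{
    assert(bytes <= si->bytes_may_use);
    si->bytes_may_use -= bytes;
    si->bytes_delalloc += bytes;
}
```

Because the reservation is counted before any extent is marked delalloc, a later writer cannot claim the same bytes in this model.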
2009-02-12 | Btrfs: hold trans_mutex when using btrfs_record_root_in_trans | Yan Zheng
btrfs_record_root_in_trans needs the trans_mutex held to make sure two callers don't race to set up the root in a given transaction. This adds it to all the places that were missing it. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 | Btrfs: make a lockdep class for the extent buffer locks | Chris Mason
Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock ordering. The lockdep magic came from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 | Btrfs: use larger metadata clusters in ssd mode | Chris Mason
Larger metadata clusters can significantly improve writeback performance on ssd drives with large erasure blocks. The larger clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal read-modify-write cycle. On spinning media, larger metadata clusters end up spreading out the metadata more over time, which makes fsck slower, so we don't want this to be the default. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 | Btrfs: make sure all pending extent operations are complete | Josef Bacik
There's a slight problem with finish_current_insert: if we set all to 1 and then go through and don't actually skip any of the extents on the pending list, we could exit right after we've added new extents. This is a problem because inserting the new extents could have triggered new COWs, so we may have some pending updates to do or even more inserts to do after that. So this patch will only exit if we have never skipped any of the extents in the pending list and we have no extents to insert. This makes sure that all of the pending work is truly done before we return. I've been running with this patch for a few days with all of my other testing and have not seen issues. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-05 | Btrfs: Fix memory leak in cache_drop_leaf_ref | Chris Mason
The code wasn't doing a kfree on the sorted array Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 | Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks | Chris Mason
Every transaction in btrfs creates a new snapshot, and then schedules the snapshot from the last transaction for deletion. Snapshot deletion works by walking down the btree and dropping the reference counts on each btree block during the walk. If a given leaf or node has a reference count greater than one, the reference count is decremented and the subtree pointed to by that node is ignored. If the reference count is one, walking continues down into that node or leaf, and the references of everything it points to are decremented. The old code would try to work in small pieces, walking down the tree until it found the lowest leaf or node to free and then returning. This was very friendly to the rest of the FS because it didn't have a huge impact on other operations. But it wouldn't always keep up with the rate that new commits added new snapshots for deletion, and it wasn't very optimal for the extent allocation tree because it wasn't finding leaves that were close together on disk and processing them at the same time. This changes things to walk down to a level 1 node and then process it in bulk. All the leaf pointers are sorted and the leaves are dropped in order based on their extent number. The extent allocation tree and commit code are now fast enough for this kind of bulk processing to work without slowing the rest of the FS down. Overall it does less IO and is better able to keep up with snapshot deletions under high load. Signed-off-by: Chris Mason <chris.mason@oracle.com>
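The reference-count rule described above (decrement and skip shared blocks, descend into blocks held only once) can be illustrated with a small recursive model. This is a sketch of the rule only, not of the level-1 bulk processing the commit introduces; `struct snap_node` and `snap_drop` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SNAP_MAX_CHILDREN 4

/* Hypothetical in-memory stand-in for a btree block with a reference count. */
struct snap_node {
    int refs;                                      /* number of trees pointing here */
    int nr_children;                               /* 0 for a leaf */
    struct snap_node *children[SNAP_MAX_CHILDREN];
    int freed;                                     /* set once the block is released */
};

/* Drop one reference to the subtree at n; returns the number of blocks freed. */
int snap_drop(struct snap_node *n)
{
    int freed = 0;

    assert(n != NULL && n->refs > 0);
    if (n->refs > 1) {
        /* Shared with another snapshot: decrement and ignore the subtree. */
        n->refs--;
        return 0;
    }
    /* Sole owner: drop the references of everything this block points to. */
    for (int i = 0; i < n->nr_children; i++)
        freed += snap_drop(n->children[i]);
    n->refs = 0;
    n->freed = 1;
    return freed + 1;
}
```

Dropping a root that owns one private leaf and one leaf shared with another snapshot frees two blocks and leaves the shared leaf with a single remaining reference.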
2009-02-04 | Btrfs: Change btree locking to use explicit blocking points | Chris Mason
Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop. Most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able to get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is:
btrfs_tree_lock() returns with the spin lock held.
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks; it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 | Btrfs: sort references by byte number during btrfs_inc_ref | Chris Mason
When a block goes through cow, we update the reference counts of everything that block points to. The internal pointers of the block can be in just about any order, and it is likely to have clusters of things that are close together and clusters of things that are not. To help reduce the seeks that come with updating all of these reference counts, sort them by byte number before actual updates are done. Signed-off-by: Chris Mason <chris.mason@oracle.com>
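Sorting pending reference updates by byte number before applying them could look roughly like the following. The struct layout and function names are assumptions for illustration; only the idea of ordering by byte number comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one pointer found in the block being COWed. */
struct ref_sort_entry {
    uint64_t bytenr; /* disk byte number the pointer refers to */
    uint32_t slot;   /* position of the pointer inside the block */
};

/* Three-way compare on bytenr; avoids the overflow of a plain subtraction. */
int ref_sort_cmp(const void *a, const void *b)
{
    const struct ref_sort_entry *x = a;
    const struct ref_sort_entry *y = b;

    if (x->bytenr < y->bytenr)
        return -1;
    if (x->bytenr > y->bytenr)
        return 1;
    return 0;
}

/* Order the updates so they are applied in ascending disk position. */
void sort_refs_by_bytenr(struct ref_sort_entry *refs, size_t n)
{
    assert(refs != NULL || n == 0);
    qsort(refs, n, sizeof(*refs), ref_sort_cmp);
}
```

After sorting, updates that touch nearby parts of the extent tree are issued together, which is the seek reduction the message describes.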
2009-01-21 | Btrfs: fix tree logs parallel sync | Yan Zheng
To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch makes the following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert the root item into the log root tree immediately after the log tree is allocated. The root item for a log tree is inserted when the log tree gets synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first syncs the log tree, then updates the corresponding root item in the log root tree, syncs the log root tree, and then updates the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 | Btrfs: fix stop searching test in replace_one_extent | Yan Zheng
replace_one_extent searches tree leaves for references to a given extent. It stops searching if it goes beyond the last possible position. The last possible position is computed by adding the starting offset of a found file extent to the full size of the extent. The code uses the physical size of the extent as the full size. This is incorrect when compression is used. The fix is to get the full size from the ram_bytes field of the file extent item. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
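The difference between the two size fields can be shown with a small calculation. The struct below is a hypothetical simplification; only `ram_bytes` is named in the commit message, and `disk_num_bytes` is used here as an assumed name for the physical size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a file extent item. */
struct file_extent_model {
    uint64_t file_offset;    /* where the extent starts in the file */
    uint64_t disk_num_bytes; /* physical size on disk (smaller when compressed) */
    uint64_t ram_bytes;      /* full, uncompressed size */
};

/* Old behaviour: stops too early for compressed extents. */
uint64_t last_offset_physical(const struct file_extent_model *fe)
{
    return fe->file_offset + fe->disk_num_bytes;
}

/* Fixed behaviour: use the uncompressed size as the full size. */
uint64_t last_offset_ram(const struct file_extent_model *fe)
{
    assert(fe->ram_bytes >= fe->disk_num_bytes);
    return fe->file_offset + fe->ram_bytes;
}
```

For an extent at file offset 4096 that compresses 131072 bytes into 8192 bytes on disk, the physical calculation gives 12288 while the corrected one gives 135168, so the old search could stop before reaching references that still belong to the extent.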
2009-01-21 | Btrfs: remove duplicated #include | Huang Weiyi
Removed duplicated #include "compat.h" in fs/btrfs/extent-tree.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 | Btrfs: Fix infinite loop in btrfs_extent_post_op | Yan Zheng
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They both may enter infinite loops. finish_current_insert enters an infinite loop if it only finds some backrefs to update. The fix is to check for pending backref updates before restarting the loop. The infinite loop in del_pending_extents is due to the skipped variable not being properly reset before looping around. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 | Btrfs: fix locking issue in btrfs_remove_block_group | Yan Zheng
We should hold the block_group_cache_lock while modifying the block groups red-black tree. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 | Btrfs: simplify iteration codes | Qinghuang Feng
Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 | Btrfs: removed unused #include <version.h>'s | Huang Weiyi
Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-06 | Btrfs: tree logging checksum fixes | Yan Zheng
This patch contains the following changes. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-06 | Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code | Chris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 | Btrfs: Fix checkpatch.pl warnings | Chris Mason
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 | Btrfs: Fix free block discard calls down to the block layer | Liu Hui
This patch fixes the discard semantics so that Btrfs works with FTLs and SSDs. We can improve an FTL's performance by telling it which sectors the file system has freed. But if we don't give the FTL this information at the proper time, the transaction mechanism of Btrfs is broken and Btrfs cannot roll back to the previous transaction after a power loss. There are some problems in the old implementation: 1) In __free_extent(), the pinned down extents should not be discarded. 2) In free_extents(), the free extents are all pinned, so they need to be discarded at transaction commit time instead of in free_extents(). 3) The reserved extent used by the log tree should be discarded too. This patch changes the discard behavior as follows: 1) Extents that need to be freed at once are discarded in update_block_group(). 2) Discarding of pinned extents is delayed until btrfs_finish_extent_commit(), when the transaction commits. 3) Discarding is removed from free_extents() and __free_extent(). 4) A discard interface is added to btrfs_free_reserved_extent(). 5) Sectors are discarded before the free space cache is updated; otherwise, the FTL will destroy file system data.
2008-12-19 | Btrfs: set EXTENT_BOUNDARY bit before marking extent delalloc. | Yan Zheng
There is a race in relocate_inode_pages, it happens when find_delalloc_range finds the delalloc extent before the boundary bit is set. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 | Btrfs: properly update block accounting for metadata | Yan Zheng
This adds the missing block accounting code to finish_current_insert and makes block accounting for root item properly protected by the delalloc spin lock. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-16 | Btrfs: delete checksum items before marking blocks free | Chris Mason
Btrfs maintains a cache of blocks available for allocation in ram. The code that frees extents was marking the extents free and then deleting the checksum items. This meant it was possible the extent would be reallocated before the checksum item was actually deleted, leading to races and other problems as the checksums were updated for the newly allocated extent. The fix is to delete the checksum before marking the extent free. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-15 | Btrfs: Don't use spin*lock_irq for the delalloc lock | Chris Mason
The delalloc lock doesn't need to have irqs disabled, nobody that changes the number of delalloc bytes in the FS is running with irqs off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-12 | Btrfs: fix nodatasum handling in balancing code | Yan Zheng
Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains the following changes to make data relocation work. 1) Make the nodatasum/nodatacow mount options only affect new files. Checksums and COW on data are only controlled by the inode flags. 2) Check for the existence of checksums in the nodatacow checker. If checksums exist, force COW of the data extent. This ensures that the checksum for a given block is either valid or does not exist. 3) Update the data relocation code to properly handle missing checksums. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-12 | Btrfs: shared seed device | Yan Zheng
This patch makes it possible for a seed device to be shared by multiple mounted file systems. The sharing is achieved by cloning the seed device's btrfs_fs_devices structure. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-11 | Btrfs: fix leaking block group on balance | Yan Zheng
The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-10 | Btrfs: Add checking of csum tree in balancing code | Yan Zheng
This updates the space balancing code for the new checksum format. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-10 | Btrfs: Delete csum items when freeing extents | Chris Mason
This finishes off the new checksumming code by removing csum items for extents that are no longer in use. The trick is doing it without racing because a single csum item may hold csums for more than one extent. Extra checks are added to btrfs_csum_file_blocks to make sure that we are using the correct csum item after dropping locks. A new btrfs_split_item is added to split a single csum item so it can be split without dropping the leaf lock. This is used to remove csum bytes from the middle of an item. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 | Btrfs: superblock duplication | Yan Zheng
This patch implements superblock duplication. Superblocks are stored at offsets 16K, 64M and 256G on every device. Space used by superblocks is preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
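The three offsets named above follow a regular pattern: each copy sits at the previous offset multiplied by 4096. A small sketch that reproduces them is shown below; the macro and function names are hypothetical, and whether the patch computes the offsets with exactly this shift is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SB_MODEL_PRIMARY_OFFSET (16ULL * 1024) /* 16K, from the commit message */
#define SB_MODEL_MIRROR_SHIFT   12             /* assumption: each copy is 4096x further out */
#define SB_MODEL_MIRROR_MAX     3              /* three copies: 16K, 64M, 256G */

/* Byte offset of superblock copy `mirror` (0 = primary) on a device. */
uint64_t sb_model_offset(int mirror)
{
    assert(mirror >= 0 && mirror < SB_MODEL_MIRROR_MAX);
    return SB_MODEL_PRIMARY_OFFSET << (SB_MODEL_MIRROR_SHIFT * mirror);
}
```

An allocator that must avoid these ranges would need to check every offset returned for mirrors 0 through 2 that falls inside the device.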
2008-12-02 | Btrfs: make things static and include the right headers | Christoph Hellwig
Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-11-20 | Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex | Josef Bacik
This fixes the lockdep complaint by having a different mutex to guard caching the block group, so you don't end up with this backwards dependency. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-20 | Btrfs: compat code fixes | Chris Mason
The btrfs git kernel tree is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-19 | Btrfs: Fixes for 2.6.28-rc API changes | Chris Mason
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-19 | Btrfs: fix free space accounting when unpinning extents | Josef Bacik
This patch fixes what I hope is the last early ENOSPC bug left. I did not know that pinned extents would merge into one big extent when inserted on to the pinned extent tree, so I was adding free space to a block group that could possibly span multiple block groups. This is a big issue because first that space doesn't exist in that block group, and second we won't actually use that space because there are a bunch of other checks to make sure we're allocating within the constraints of the block group. This patch fixes the problem by adding the btrfs_add_free_space to btrfs_update_pinned_extents which makes sure we are adding the appropriate amount of free space to the appropriate block group. Thanks much to Lee Trager for running my myriad of debug patches to help me track this problem down. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-18 | Btrfs: Some fixes for batching extent insert. | Liu Hui
In insert_extents(), when ret==1 and last is not zero, it should check if the current inserted item is the last item in this batch of inserts. If so, it should just break from the loop. If not, 'cur = insert_list->next' will make no sense because the list is empty now, and 'op' will point to an unexpected place. There are also some trivial fixes in this patch, including one comment typo and the deletion of two redundant lines. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 | Btrfs: Add some debugging around the ENOSPC bugs | Josef Bacik
Some people are still reporting problems with early enospc. This will help narrow down the cause. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 | Btrfs: fix free space leak | Josef Bacik
In my batch delete/update/insert patch I introduced a free space leak. The extent that we do the original search on in free_extents is never pinned, so we always update the block saying that it has free space, but the free space never actually gets added to the free space tree, since op->del will always be 0 and it's never actually added to the pinned extents tree. This patch fixes this problem by making sure we call pin_down_bytes on the pending extent op and set op->del to the return value of pin_down_bytes so update_block_group is called with the right value. This seems to fix the case where we were getting ENOSPC when there was plenty of space available. Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-17 | Btrfs: Seed device support | Yan Zheng
A seed device is a special btrfs with the SEEDING super flag set, and it can only be mounted in read-only mode. Seed devices allow people to create a new btrfs on top of them. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) Split the code in btrfs_alloc_chunk into two parts. The first part makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update the device management code to handle seed devices. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, the btrfs kernel module will refuse to mount the FS. 3) Make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 | Btrfs: mount ro and remount support | Yan Zheng
This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 | Btrfs: batch extent inserts/updates/deletions on the extent root | Josef Bacik
While profiling the allocator I noticed a good amount of time was being spent in finish_current_insert and del_pending_extents, and as the filesystem filled up more and more time was being spent in those functions. This patch aims to try and reduce that problem. This happens in two ways: 1) Track if we tried to delete an extent that we are going to update or insert. Once we get into finish_current_insert we discard any of the extents that were marked for deletion. This saves us from doing unnecessary work almost every time finish_current_insert runs. 2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for each individual extent and doing the needed operation, we instead keep the leaf around and see if there is anything else we can do on that leaf. On the insert case I introduced a btrfs_insert_some_items, which will take an array of keys with an array of data_sizes and try and squeeze in as many of those keys as possible, and then return how many keys it was able to insert. In the update case we search for an extent ref, update the ref and then loop through the leaf to see if any of the other refs we are looking to update are on that leaf, and then once we are done we release the path and search for the next ref we need to update. And finally for the deletion we try and delete the extent+ref in pairs, so we will try to find extent+ref pairs next to the extent we are trying to free and free them in bulk if possible. This along with the other cluster fix that Chris pushed out a bit ago helps make the allocator perform more uniformly as it fills up the disk. There is still a slight drop as we fill up the disk since we start having to stick new blocks in odd places which results in more COWs than on an empty fs, but the drop is not nearly as severe as it was before. Signed-off-by: Josef Bacik <jbacik@redhat.com>
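The "squeeze in as many keys as possible and report how many fit" contract described for btrfs_insert_some_items can be modeled as follows. This is a sketch of the contract only: `struct leaf_model`, `leaf_insert_some`, and the per-item overhead constant are hypothetical, and the real function operates on btree leaves and keys rather than a bare byte counter.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_MODEL_ITEM_OVERHEAD 25 /* hypothetical per-item header cost in bytes */

/* Hypothetical leaf: only tracks remaining space and item count. */
struct leaf_model {
    uint32_t free_space;
    uint32_t nritems;
};

/*
 * Insert items in order until one does not fit, and return how many were
 * inserted. The caller retries the remainder on another leaf.
 */
int leaf_insert_some(struct leaf_model *leaf, const uint32_t *data_sizes, int nr)
{
    int inserted = 0;

    assert(nr >= 0);
    for (int i = 0; i < nr; i++) {
        uint32_t need = data_sizes[i] + LEAF_MODEL_ITEM_OVERHEAD;

        if (need > leaf->free_space)
            break; /* stop at the first item that does not fit to keep key order */
        leaf->free_space -= need;
        leaf->nritems++;
        inserted++;
    }
    return inserted;
}
```

With 100 bytes free and items of 20, 30 and 40 bytes, the first two fit and the call returns 2, leaving the third for the next search.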
2008-11-13 | Btrfs: Fix handling of space info full during allocations | Chris Mason
When we fail to allocate a new block group, we should still do the checks to make sure allocations try again with the minimum requested allocation size. This also fixes a deadlock that comes from a missed down_read in the chunk allocation failure handling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 | Btrfs: empty_size allocation fixes again | Chris Mason
The allocator wasn't catching all of the cases where it needed to do extra loops because the check to enforce them wasn't happening early enough. When the allocator decided to increase the size of the allocation for metadata clustering, it wasn't always setting the empty_size to include the extra (optional) bytes. This also fixes the empty_size field to be correct. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 | Btrfs: Try harder while searching for free space | Chris Mason
The loop searching for free space would exit out too soon when metadata clustering was trying to allocate a large extent. This makes sure a full scan of the free space is done searching for only the minimum extent size requested by the higher layers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 | Btrfs: Don't substract too much from the allocation target (avoid wrapping) | Chris Mason
When metadata allocation clustering has to fall back to unclustered allocs because large free areas could not be found, it was sometimes subtracting too much from the total bytes to allocate. This would make it wrap below zero. Signed-off-by: Chris Mason <chris.mason@oracle.com>
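The wrap described above is the usual unsigned underflow: subtracting a larger value from a 64-bit byte count yields a huge positive number rather than a negative one. The sketch below contrasts the two behaviours; the function names are hypothetical, and clamping at zero is an assumed remedy, since the commit message does not say how the fix was implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Unchecked subtraction: wraps modulo 2^64 when empty_size > total_needed. */
uint64_t shrink_target_unchecked(uint64_t total_needed, uint64_t empty_size)
{
    return total_needed - empty_size;
}

/* Guarded subtraction: never goes below zero (assumed fix, for illustration). */
uint64_t shrink_target_clamped(uint64_t total_needed, uint64_t empty_size)
{
    return total_needed > empty_size ? total_needed - empty_size : 0;
}
```

Asking for 4096 bytes and removing an 8192-byte cluster allowance gives a target just under 2^64 in the unchecked version, which no block group could ever satisfy.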
2008-11-07 | Btrfs: Fix more false enospc errors and an oops from empty clustering | Chris Mason
In some cases the empty cluster was added twice to the total number of bytes the allocator was trying to find. With empty clustering on, the hint byte was sometimes outside of the block group. Add an extra goto to find the correct block group. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-07 | Btfs: More metadata allocator optimizations | Chris Mason
This lowers the empty cluster target for metadata allocations. The lower target makes it easier to do allocations and still seems to perform well. It also fixes the allocator loop to drop the empty cluster when things start getting difficult, avoiding false enospc warnings. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 | Btrfs: enforce metadata allocation clustering | Chris Mason
The allocator uses the last allocation as a starting point for metadata allocations, and tries to allocate in clusters of at least 256k. If the search for a free block fails to find the expected block, this patch forces a new cluster to be found in the free list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 | Btrfs: Optimize compressed writeback and reads | Chris Mason
When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: Chris Mason <chris.mason@oracle.com>