From 8c7c6e34a1256a5082d38c8e9bd1474476912715 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Wed, 7 Jan 2009 18:08:00 -0800
Subject: memcg: mem+swap controller core

This patch implements a per-cgroup limit on the usage of memory+swap.
SwapCache is handled so that double counting of a swap cache page and its
swap entry is avoided.

The mem+swap controller works as follows.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw.limit_in_bytes.

This has the following benefits.
- A user can limit the total resource usage of mem+swap.

  Without this, because the memory resource controller doesn't take care
  of swap usage, a process can exhaust all of swap (by a memory leak, for
  example). We can avoid that case.

  And swap is a shared resource, but it cannot be reclaimed (returned to
  memory) until it is actually used. This characteristic can cause trouble
  when memory is divided into parts by cpuset or memcg.
  Assume groups A and B. After some applications have run, the system can
  end up as:

  Group A -- very large free memory space, but occupies 99% of swap.
  Group B -- under memory shortage, but cannot use swap... it's nearly full.

  The ability to set an appropriate swap limit for each group is required.

Someone may wonder, "why not just swap, rather than mem+swap?"

- The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
  moving an account from memory to swap... there is no change in the usage
  of mem+swap. In other words, when we want to limit the usage of swap
  without affecting the global LRU, a mem+swap limit is better than just
  limiting swap.

The accounting target information is stored in swap_cgroup, a per swap
entry record. Charging is done as follows.

  map
    - charge page and memsw.
  unmap
    - uncharge page/memsw if not SwapCache.
  swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information in swap_cgroup.
  swap-in (do_swap_page)
    - charged as page and memsw.
      The record in swap_cgroup is cleared and its memsw accounting is
      decremented.
  swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.

Some people work in never-swap environments and consider swap to be
something bad. For them, this mem+swap controller extension is just
overhead, so the overhead can be avoided by a config or boot option.
(see Kconfig. details are not in this patch.)

TODO:
- maybe more optimization can be done in the swap-in path (but it is not
  very safe). For now we just do simple accounting at this stage.
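
To make the swap-out transition above concrete, here is a hedged sketch
(illustration only -- the real accounting lives in mm/memcontrol.c, which
this log does not show; mem_cgroup_from_page() and the res/memsw field
names are assumptions for exposition):

	/*
	 * Sketch of the swap-out step: the plain "mem" charge is released
	 * while the "mem+swap" charge stays, so the memsw counter does not
	 * change across swap-out, exactly as described above.
	 */
	static void sketch_swapout_account(struct page *page, swp_entry_t ent)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_page(page); /* assumed helper */

		/* the page leaves memory: drop the memory charge... */
		res_counter_uncharge(&memcg->res, PAGE_SIZE);
		/* ...but remember the owner, so that swap-in and swap_free()
		 * can find whom to uncharge memsw from later */
		swap_cgroup_record(ent, memcg);
	}
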
[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Signed-off-by: Daisuke Nishimura
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b07c48b09a9..f63b20dd771 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1661,7 +1661,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
-						gfp_t gfp_mask)
+						gfp_t gfp_mask,
+						bool noswap)
 {
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -1674,6 +1675,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 	};
 	struct zonelist *zonelist;
 
+	if (noswap)
+		sc.may_swap = 0;
+
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
--
cgit v1.2.3


From 08e552c69c6930d64722de3ec18c51844d06ee28 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Wed, 7 Jan 2009 18:08:01 -0800
Subject: memcg: synchronized LRU

A big patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's own LRU (per zone).
- the LRU of page_cgroup is not synchronous with the global LRU.
- page and page_cgroup are one-to-one and statically allocated.
- to find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as
    lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
- SwapCache is handled.

And, when we handle the LRU list of page_cgroup, we do the following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is a spin_lock and we have to be afraid of deadlock with
zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't
trust that "mz" is correct.

This is an attempt to remove this dirty nesting of locks. This patch
changes mz->lru_lock to be zone->lru_lock. Then, the above sequence will
be written as

	spin_lock(&zone->lru_lock);	# in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
	}
	spin_unlock(&zone->lru_lock);	# in vmscan.c or swap.c via global LRU

This is much simpler.

(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. pc->mem_cgroup can be modified only
   - at charge.
   - at account_move().
2. at charge, the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move(), the page is isolated and not on the LRU.

Pros.
- easy for maintenance.
- memcg can make use of the laziness of pagevec.
- we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
- the LRU status of memcg will be synchronized with the global LRU's.
- the number of locks is reduced.
- account_move() is simplified very much.

Cons.
- may increase the cost of LRU rotation.
  (no impact if memcg is not configured.)
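
To make the safety argument concrete, a hedged sketch of one of the new
hooks as the description implies it (the real hooks live in
mm/memcontrol.c, which this log does not show; the list and field names
here are assumptions):

	/* Caller already holds zone->lru_lock (vmscan.c / swap.c). */
	void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);
		struct mem_cgroup_per_zone *mz;

		/* an uncharged page has no stable pc->mem_cgroup: skip it */
		if (!PageCgroupUsed(pc))
			return;
		/* pc->mem_cgroup is stable here: it is set before PCG_USED
		 * at charge time, and account_move() only runs on pages
		 * already isolated from the LRU */
		mz = page_cgroup_zoneinfo(pc);
		list_add(&pc->lru, &mz->lists[lru]);
	}
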
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f63b20dd771..45983af1de3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -512,7 +512,6 @@ redo:
 		lru = LRU_UNEVICTABLE;
 		add_page_to_unevictable_list(page);
 	}
-	mem_cgroup_move_lists(page, lru);
 
 	/*
 	 * page's status can change while we move it among lru. If an evictable
@@ -547,7 +546,6 @@ void putback_lru_page(struct page *page)
 
 	lru = !!TestClearPageActive(page) + page_is_file_cache(page);
 	lru_cache_add_lru(page, lru);
-	mem_cgroup_move_lists(page, lru);
 	put_page(page);
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
@@ -813,6 +811,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
 		return ret;
 
 	ret = -EBUSY;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
@@ -821,6 +820,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
 		 */
 		ClearPageLRU(page);
 		ret = 0;
+		mem_cgroup_del_lru(page);
 	}
 
 	return ret;
@@ -1134,7 +1134,6 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 				SetPageLRU(page);
 				lru = page_lru(page);
 				add_page_to_lru_list(zone, page, lru);
-				mem_cgroup_move_lists(page, lru);
 				if (PageActive(page) && scan_global_lru(sc)) {
 					int file = !!page_is_file_cache(page);
 					zone->recent_rotated[file]++;
@@ -1263,7 +1262,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_move_lists(page, lru);
+		mem_cgroup_add_lru_list(page, lru);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2408,6 +2407,7 @@ retry:
 
 			__dec_zone_state(zone, NR_UNEVICTABLE);
 			list_move(&page->lru, &zone->lru[l].list);
+			mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
 			__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 			__count_vm_event(UNEVICTABLE_PGRESCUED);
 		} else {
@@ -2416,6 +2416,7 @@ retry:
 			 */
 			SetPageUnevictable(page);
 			list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+			mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
 			if (page_evictable(page, NULL))
 				goto retry;
 		}
--
cgit v1.2.3


From f89eb90e33fd4e4e0cc1a6d20afd63c5a561885a Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:14 -0800
Subject: inactive_anon_is_low: move to vmscan

inactive_anon_is_low() is called only from vmscan, so it can be moved to
vmscan.c.

This patch doesn't have any functional change.

Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45983af1de3..f75d924cb4f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1291,6 +1291,26 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	pagevec_release(&pvec);
 }
 
+/**
+ * inactive_anon_is_low - check if anonymous pages need to be deactivated
+ * @zone: zone to check
+ *
+ * Returns true if the zone does not have enough inactive anon pages,
+ * meaning some active anon pages need to be deactivated.
+ */
+static int inactive_anon_is_low(struct zone *zone)
+{
+	unsigned long active, inactive;
+
+	active = zone_page_state(zone, NR_ACTIVE_ANON);
+	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+	if (inactive * zone->inactive_ratio < active)
+		return 1;
+
+	return 0;
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
--
cgit v1.2.3


From 6e9015716ae9b59e9635d692fddfcfb9582c146c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:15 -0800
Subject: mm: introduce zone_reclaim struct

Add a zone_reclaim_stat struct for later enhancement. A later patch uses
this.

This patch doesn't make any behavior change (yet).

Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 47 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f75d924cb4f..03ca923c665 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scan_global_lru(sc)	(1)
 #endif
 
+static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
+						  struct scan_control *sc)
+{
+	return &zone->reclaim_stat;
+}
+
 /*
  * Add a shrinker callback to be called from the vm
 */
@@ -1029,6 +1035,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	struct pagevec pvec;
 	unsigned long nr_scanned = 0;
 	unsigned long nr_reclaimed = 0;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1072,10 +1079,14 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 		if (scan_global_lru(sc)) {
 			zone->pages_scanned += nr_scan;
-			zone->recent_scanned[0] += count[LRU_INACTIVE_ANON];
-			zone->recent_scanned[0] += count[LRU_ACTIVE_ANON];
-			zone->recent_scanned[1] += count[LRU_INACTIVE_FILE];
-			zone->recent_scanned[1] += count[LRU_ACTIVE_FILE];
+			reclaim_stat->recent_scanned[0] +=
+							count[LRU_INACTIVE_ANON];
+			reclaim_stat->recent_scanned[0] +=
+							count[LRU_ACTIVE_ANON];
+			reclaim_stat->recent_scanned[1] +=
+							count[LRU_INACTIVE_FILE];
+			reclaim_stat->recent_scanned[1] +=
+							count[LRU_ACTIVE_FILE];
 		}
 		spin_unlock_irq(&zone->lru_lock);
 		nr_scanned += nr_scan;
@@ -1136,7 +1147,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 				add_page_to_lru_list(zone, page, lru);
 				if (PageActive(page) && scan_global_lru(sc)) {
 					int file = !!page_is_file_cache(page);
-					zone->recent_rotated[file]++;
+					reclaim_stat->recent_rotated[file]++;
 				}
 				if (!pagevec_add(&pvec, page)) {
 					spin_unlock_irq(&zone->lru_lock);
@@ -1196,6 +1207,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	struct page *page;
 	struct pagevec pvec;
 	enum lru_list lru;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
@@ -1208,7 +1220,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 */
 	if (scan_global_lru(sc)) {
 		zone->pages_scanned += pgscanned;
-		zone->recent_scanned[!!file] += pgmoved;
+		reclaim_stat->recent_scanned[!!file] += pgmoved;
 	}
 
 	if (file)
@@ -1251,7 +1263,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 * pages in get_scan_ratio.
 	 */
 	if (scan_global_lru(sc))
-		zone->recent_rotated[!!file] += pgmoved;
+		reclaim_stat->recent_rotated[!!file] += pgmoved;
 
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
@@ -1344,6 +1356,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (nr_swap_pages <= 0) {
@@ -1376,17 +1389,17 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	 *
 	 * anon in [0], file in [1]
 	 */
-	if (unlikely(zone->recent_scanned[0] > anon / 4)) {
+	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[0] /= 2;
-		zone->recent_rotated[0] /= 2;
+		reclaim_stat->recent_scanned[0] /= 2;
+		reclaim_stat->recent_rotated[0] /= 2;
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	if (unlikely(zone->recent_scanned[1] > file / 4)) {
+	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
 		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[1] /= 2;
-		zone->recent_rotated[1] /= 2;
+		reclaim_stat->recent_scanned[1] /= 2;
+		reclaim_stat->recent_rotated[1] /= 2;
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
@@ -1402,11 +1415,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	 * proportional to the fraction of recently scanned pages on
 	 * each list that were recently referenced and in active use.
 	 */
-	ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
-	ap /= zone->recent_rotated[0] + 1;
+	ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
+	ap /= reclaim_stat->recent_rotated[0] + 1;
 
-	fp = (file_prio + 1) * (zone->recent_scanned[1] + 1);
-	fp /= zone->recent_rotated[1] + 1;
+	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
+	fp /= reclaim_stat->recent_rotated[1] + 1;
 
 	/* Normalize to percentages */
 	percent[0] = 100 * ap / (ap + fp + 1);
--
cgit v1.2.3


From c9f299d9862deadf9fbee3ca28d915fdb006975a Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:16 -0800
Subject: mm: add zone nr_pages helper function

Add a zone_nr_pages() helper function. It is used by a later patch (see
the sketch below).

This patch doesn't have any functional change.
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03ca923c665..6827d35954f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,13 @@ static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 	return &zone->reclaim_stat;
 }
 
+static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
+				   enum lru_list lru)
+{
+	return zone_page_state(zone, NR_LRU_BASE + lru);
+}
+
+
 /*
  * Add a shrinker callback to be called from the vm
 */
@@ -1365,10 +1372,10 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		return;
 	}
 
-	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
-		zone_page_state(zone, NR_INACTIVE_ANON);
-	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
-		zone_page_state(zone, NR_INACTIVE_FILE);
+	anon  = zone_nr_pages(zone, sc, LRU_ACTIVE_ANON) +
+		zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
+	file  = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
+		zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
 	free  = zone_page_state(zone, NR_FREE_PAGES);
 
 	/* If we have very few page cache pages, force-scan anon pages. */
--
cgit v1.2.3


From eeee9a8cd1e93c8b94e7788790fa9e2f8910c779 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:17 -0800
Subject: mm: make get_scan_ratio() safe for memcg

Currently, get_scan_ratio() always calculates the balancing value for
global reclaim, and memcg reclaim doesn't use it. Therefore it doesn't
have a scan_global_lru() condition.

However, we plan to extend get_scan_ratio() so that it is usable for
memcg too, later. This patch therefore moves the code in get_scan_ratio()
that depends on global reclaim into an explicit scan_global_lru()
condition.

This patch doesn't have any functional change.

Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6827d35954f..e2b31a522a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1376,13 +1376,16 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
 	file  = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
 		zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
-	free  = zone_page_state(zone, NR_FREE_PAGES);
 
-	/* If we have very few page cache pages, force-scan anon pages. */
-	if (unlikely(file + free <= zone->pages_high)) {
-		percent[0] = 100;
-		percent[1] = 0;
-		return;
+	if (scan_global_lru(sc)) {
+		free  = zone_page_state(zone, NR_FREE_PAGES);
+		/* If we have very few page cache pages,
+		   force-scan anon pages. */
+		if (unlikely(file + free <= zone->pages_high)) {
+			percent[0] = 100;
+			percent[1] = 0;
+			return;
+		}
 	}
 
 	/*
--
cgit v1.2.3


From 14797e2363c2b2f1ce139fd1c5a215e4e05aa1d9 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:18 -0800
Subject: memcg: add inactive_anon_is_low()

inactive_anon_is_low() is a key component of active/inactive anon
balancing on reclaim. However, the current inactive_anon_is_low() only
considers global reclaim. Therefore, we need the following ugly
scan_global_lru() condition:
	if (lru == LRU_ACTIVE_ANON &&
	    (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		return 0;
	}

It causes memcg reclaim to always deactivate pages when shrink_list() is
called. This patch adds mem_cgroup_inactive_anon_is_low() to improve the
active/inactive anon balancing of memcg.

Acked-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Cyrill Gorcunov
Cc: "Pekka Enberg"
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e2b31a522a6..b2bc06bffcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1310,14 +1310,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	pagevec_release(&pvec);
 }
 
-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @zone: zone to check
- *
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
- */
-static int inactive_anon_is_low(struct zone *zone)
+static int inactive_anon_is_low_global(struct zone *zone)
 {
 	unsigned long active, inactive;
 
@@ -1330,6 +1323,25 @@ static int inactive_anon_is_low(struct zone *zone)
 	return 0;
 }
 
+/**
+ * inactive_anon_is_low - check if anonymous pages need to be deactivated
+ * @zone: zone to check
+ * @sc:   scan control of this context
+ *
+ * Returns true if the zone does not have enough inactive anon pages,
+ * meaning some active anon pages need to be deactivated.
+ */
+static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
+{
+	int low;
+
+	if (scan_global_lru(sc))
+		low = inactive_anon_is_low_global(zone);
+	else
+		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
+	return low;
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
@@ -1340,8 +1352,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 		return 0;
 	}
 
-	if (lru == LRU_ACTIVE_ANON &&
-	    (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
+	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
@@ -1509,9 +1520,7 @@ static void shrink_zone(int priority, struct zone *zone,
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (!scan_global_lru(sc) || inactive_anon_is_low(zone))
-		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
-	else if (!scan_global_lru(sc))
+	if (inactive_anon_is_low(zone, sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
 	throttle_vm_writeout(sc->gfp_mask);
@@ -1807,7 +1816,7 @@ loop_again:
 			 * Do some background aging of the anon list, to give
 			 * pages a chance to be referenced before reclaiming.
 			 */
-			if (inactive_anon_is_low(zone))
+			if (inactive_anon_is_low(zone, &sc))
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
--
cgit v1.2.3


From a3d8e0549d913e30968fa02e505dfe02c0a23e0d Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:19 -0800
Subject: memcg: add mem_cgroup_zone_nr_pages()

Introduce mem_cgroup_zone_nr_pages(). It is called by the zone_nr_pages()
helper function.

This patch doesn't have any behavior change.
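
The memcontrol.c side is outside this log (which is limited to
mm/vmscan.c); a plausible sketch of it, assuming the per-zone LRU
counters memcg already keeps (mem_cgroup_zoneinfo() and MEM_CGROUP_ZSTAT
are names from that file, but the exact form here is an assumption):

	unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
					       struct zone *zone,
					       enum lru_list lru)
	{
		int nid = zone->zone_pgdat->node_id;
		int zid = zone_idx(zone);
		struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);

		/* per-memcg, per-zone count of pages on the given LRU list */
		return MEM_CGROUP_ZSTAT(mz, lru);
	}
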
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Acked-by: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2bc06bffcf..d958d624d3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,6 +139,9 @@ static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
 				   enum lru_list lru)
 {
+	if (!scan_global_lru(sc))
+		return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
+
 	return zone_page_state(zone, NR_LRU_BASE + lru);
 }
 
--
cgit v1.2.3


From 3e2f41f1f64744f7942980d93cc93dd3e5924560 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:20 -0800
Subject: memcg: add zone_reclaim_stat

Introduce the mem_cgroup_per_zone::reclaim_stat member and its
statistics-collecting functions.

Now, get_scan_ratio() can calculate the correct value for memcg reclaim.

[hugh@veritas.com: avoid reclaim_stat oops when disabled]
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d958d624d3a..56fc7abe4d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -133,6 +133,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
+	if (!scan_global_lru(sc))
+		return mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
+
 	return &zone->reclaim_stat;
 }
 
@@ -1087,17 +1090,14 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
 						-count[LRU_INACTIVE_ANON]);
 
-		if (scan_global_lru(sc)) {
+		if (scan_global_lru(sc))
 			zone->pages_scanned += nr_scan;
-			reclaim_stat->recent_scanned[0] +=
-							count[LRU_INACTIVE_ANON];
-			reclaim_stat->recent_scanned[0] +=
-							count[LRU_ACTIVE_ANON];
-			reclaim_stat->recent_scanned[1] +=
-							count[LRU_INACTIVE_FILE];
-			reclaim_stat->recent_scanned[1] +=
-							count[LRU_ACTIVE_FILE];
-		}
+
+		reclaim_stat->recent_scanned[0] += count[LRU_INACTIVE_ANON];
+		reclaim_stat->recent_scanned[0] += count[LRU_ACTIVE_ANON];
+		reclaim_stat->recent_scanned[1] += count[LRU_INACTIVE_FILE];
+		reclaim_stat->recent_scanned[1] += count[LRU_ACTIVE_FILE];
+
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;
@@ -1155,7 +1155,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 			SetPageLRU(page);
 			lru = page_lru(page);
 			add_page_to_lru_list(zone, page, lru);
-			if (PageActive(page) && scan_global_lru(sc)) {
+			if (PageActive(page)) {
 				int file = !!page_is_file_cache(page);
 				reclaim_stat->recent_rotated[file]++;
 			}
@@ -1230,8 +1230,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 */
 	if (scan_global_lru(sc)) {
 		zone->pages_scanned += pgscanned;
-		reclaim_stat->recent_scanned[!!file] += pgmoved;
 	}
+	reclaim_stat->recent_scanned[!!file] += pgmoved;
 
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
@@ -1272,8 +1272,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 * This helps balance scan pressure between file and anonymous
 	 * pages in get_scan_ratio.
 	 */
-	if (scan_global_lru(sc))
-		reclaim_stat->recent_rotated[!!file] += pgmoved;
+	reclaim_stat->recent_rotated[!!file] += pgmoved;
 
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
--
cgit v1.2.3


From 9439c1c95b5c25b8031b2a7eb7e1590eb84be7f5 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:21 -0800
Subject: memcg: remove mem_cgroup_calc_reclaim()

Now that get_scan_ratio() returns the correct value during memcg reclaim,
mem_cgroup_calc_reclaim() can be removed. Memcg reclaim thereby gets the
same anon/file reclaim-balancing capability as global reclaim.

Acked-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56fc7abe4d2..66bb6ef44b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,30 +1466,23 @@ static void shrink_zone(int priority, struct zone *zone,
 	get_scan_ratio(zone, sc, percent);
 
 	for_each_evictable_lru(l) {
-		if (scan_global_lru(sc)) {
-			int file = is_file_lru(l);
-			int scan;
+		int file = is_file_lru(l);
+		int scan;
 
-			scan = zone_page_state(zone, NR_LRU_BASE + l);
-			if (priority) {
-				scan >>= priority;
-				scan = (scan * percent[file]) / 100;
-			}
+		scan = zone_page_state(zone, NR_LRU_BASE + l);
+		if (priority) {
+			scan >>= priority;
+			scan = (scan * percent[file]) / 100;
+		}
+		if (scan_global_lru(sc)) {
 			zone->lru[l].nr_scan += scan;
 			nr[l] = zone->lru[l].nr_scan;
 			if (nr[l] >= swap_cluster_max)
 				zone->lru[l].nr_scan = 0;
 			else
 				nr[l] = 0;
-		} else {
-			/*
-			 * This reclaim occurs not because zone memory shortage
-			 * but because memory controller hits its limit.
-			 * Don't modify zone reclaim related data.
-			 */
-			nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
-							priority, l);
-		}
+		} else
+			nr[l] = scan;
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
--
cgit v1.2.3


From e72e2bd6747c7a5c432197b6614cf3a387e61a0e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Wed, 7 Jan 2009 18:08:23 -0800
Subject: memcg: rename scan global lru

Rename scan_global_lru() to scanning_global_lru().
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66bb6ef44b5..f03c239440a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -125,15 +125,15 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scan_global_lru(sc)	(!(sc)->mem_cgroup)
+#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
 #else
-#define scan_global_lru(sc)	(1)
+#define scanning_global_lru(sc)	(1)
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
-	if (!scan_global_lru(sc))
+	if (!scanning_global_lru(sc))
 		return mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
 
 	return &zone->reclaim_stat;
@@ -142,7 +142,7 @@ static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
 				   enum lru_list lru)
 {
-	if (!scan_global_lru(sc))
+	if (!scanning_global_lru(sc))
 		return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
 
 	return zone_page_state(zone, NR_LRU_BASE + lru);
@@ -1090,7 +1090,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
 						-count[LRU_INACTIVE_ANON]);
 
-		if (scan_global_lru(sc))
+		if (scanning_global_lru(sc))
 			zone->pages_scanned += nr_scan;
 
 		reclaim_stat->recent_scanned[0] += count[LRU_INACTIVE_ANON];
@@ -1129,7 +1129,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		if (current_is_kswapd()) {
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
 			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		} else if (scan_global_lru(sc))
+		} else if (scanning_global_lru(sc))
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);
 
 		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
@@ -1228,7 +1228,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 * zone->pages_scanned is used for detect zone's oom
 	 * mem_cgroup remembers nr_scan by itself.
 	 */
-	if (scan_global_lru(sc)) {
+	if (scanning_global_lru(sc)) {
 		zone->pages_scanned += pgscanned;
 	}
 	reclaim_stat->recent_scanned[!!file] += pgmoved;
@@ -1337,7 +1337,7 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 {
 	int low;
 
-	if (scan_global_lru(sc))
+	if (scanning_global_lru(sc))
 		low = inactive_anon_is_low_global(zone);
 	else
 		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
@@ -1390,7 +1390,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	file  = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
 		zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
 
-	if (scan_global_lru(sc)) {
+	if (scanning_global_lru(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
@@ -1474,7 +1474,7 @@ static void shrink_zone(int priority, struct zone *zone,
 			scan >>= priority;
 			scan = (scan * percent[file]) / 100;
 		}
-		if (scan_global_lru(sc)) {
+		if (scanning_global_lru(sc)) {
 			zone->lru[l].nr_scan += scan;
 			nr[l] = zone->lru[l].nr_scan;
 			if (nr[l] >= swap_cluster_max)
@@ -1550,7 +1550,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
-		if (scan_global_lru(sc)) {
+		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 			note_zone_scanning_priority(zone, priority);
@@ -1603,12 +1603,12 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	delayacct_freepages_start();
 
-	if (scan_global_lru(sc))
+	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
 	/*
 	 * mem_cgroup will not do shrink_slab.
 	 */
-	if (scan_global_lru(sc)) {
+	if (scanning_global_lru(sc)) {
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1627,7 +1627,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			 * Don't shrink slabs when reclaiming memory from
 			 * over limit cgroups
 			 */
-			if (scan_global_lru(sc)) {
+			if (scanning_global_lru(sc)) {
 				shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
 				if (reclaim_state) {
 					sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1658,7 +1658,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			congestion_wait(WRITE, HZ/10);
 	}
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (!sc->all_unreclaimable && scan_global_lru(sc))
+	if (!sc->all_unreclaimable && scanning_global_lru(sc))
 		ret = sc->nr_reclaimed;
 out:
 	/*
@@ -1671,7 +1671,7 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scan_global_lru(sc)) {
+	if (scanning_global_lru(sc)) {
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
--
cgit v1.2.3


From a7885eb8ad465ec9db99ac5b5e6680f0ca8e11c8 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:24 -0800
Subject: memcg: swappiness

Currently, /proc/sys/vm/swappiness can change the swappiness ratio for
global reclaim. However, memcg reclaim doesn't have a tuning parameter of
its own.

In general, the optimal swappiness depends on the workload (e.g. an HPC
workload needs lower swappiness than others). Per-cgroup swappiness
therefore improves administrator tunability.

Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f03c239440a..ece2f405187 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1707,14 +1707,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
-						gfp_t gfp_mask,
-						bool noswap)
+					   gfp_t gfp_mask,
+					   bool noswap,
+					   unsigned int swappiness)
 {
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
-		.swappiness = vm_swappiness,
+		.swappiness = swappiness,
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
--
cgit v1.2.3


From c772be939e078afd2505ede7d596a30f8f61de95 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro
Date: Wed, 7 Jan 2009 18:08:25 -0800
Subject: memcg: fix calculation of active_ratio

Currently, memcg's inactive_ratio is calculated when the limit is set,
because page_alloc.c does so and the original implementation was a
straightforward port.

However, memcg recently introduced the hierarchy feature. Under
hierarchical restriction, the effective memory limit is decided not only
by the current cgroup's memory.limit_in_bytes but also by parent limits
and sibling memory usage.
The optimal inactive_ratio therefore changes frequently, so calculating it
on every call is better.

Tested-by: KAMEZAWA Hiroyuki
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ece2f405187..9a27c44aa32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1340,7 +1340,7 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (scanning_global_lru(sc))
 		low = inactive_anon_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
+		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
 	return low;
 }
--
cgit v1.2.3
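
For illustration, the "calculate every time" direction can be sketched as
follows (a hedged sketch only: the real change is in mm/memcontrol.c,
outside this log, and the per-memcg counter helper name is an assumption;
the curve mirrors the global ratio = int_sqrt(10 * size_in_GB) rule used
by page_alloc.c):

	int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
	{
		/* memcg_nr_lru_pages() is an assumed helper returning the
		 * memcg-wide total for one LRU list */
		unsigned long active   = memcg_nr_lru_pages(memcg, LRU_ACTIVE_ANON);
		unsigned long inactive = memcg_nr_lru_pages(memcg, LRU_INACTIVE_ANON);
		unsigned long gb, inactive_ratio;

		/* derive the target ratio from the current working set,
		 * not from a value frozen when the limit was set */
		gb = (inactive + active) >> (30 - PAGE_SHIFT);
		inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

		return inactive * inactive_ratio < active;
	}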