From: Larry Woodman <lwoodman@redhat.com>
Date: Wed, 30 Jul 2008 12:31:26 -0400
Subject: [mm] NUMA: system is slow when over-committing memory
Message-id: 1217435486.8250.21.camel@localhost.localdomain
O-Subject: [RHEL5-U3 patch] Prevent 100% cpu time in RHEL5 kernel under NUMA when zone_reclaim_mode=1
Bugzilla: 457264
RH-Acked-by: Rik van Riel <riel@redhat.com>

We received a report about the RHEL5 kernel running 1000 times slower than the upstream kernel for several seconds or even minutes when over-committing the memory on one node of a multi-core NUMA system. This only happens when zone_reclaim_mode is set to 1 by build_zonelists(), which does so when it determines that another node is sufficiently "far away" that it is better to reclaim pages within a zone before going off-node.

I verified this and determined that the upstream kernel does not let more than one core/cpu into __zone_reclaim() at a time. Without this change we can have multiple cores performing direct reclaim on the same zone at the same time, which causes heavy spin_lock_irq(&zone->lru_lock) contention.

The attached patch adds that logic to RHEL5 by overloading the zone->all_unreclaimable field, similar to the way zone->flags is used upstream. While this is a bit hacky, it preserves the kABI.

Fixes BZ 457264.

----------------------------------------------------------------------------

We were hoping you and possibly other VM experts might be able to give us some input on this issue.

Issue: 100% time spent in EL5 kernel under NUMA with zone_reclaim_mode=1

Hashworms is a program which simulates a finite automaton. It uses two large arrays, which it dynamically resizes using realloc(). In normal operation the arrays gradually grow until they hit a maximum memory size; periodically the program resets them to very small sizes and allows them to re-expand. When the program downsizes its arrays, it checkpoints to disk. The checkpoint is a fairly large file, 1.5-2 GB, written with fwrite().
The attached source code is configured to use at most 3GB of memory. On a dual-socket Nehalem system (with 2GB in each NUMA node, 4GB total) running Red Hat EL5 2.6.18-92.4.el5 (with zone_reclaim_mode set to '1'), it runs normally for a little while, but then seems to get lost in the kernel. 'top' reports 100% 'sys' utilization for that thread. The application continues to make progress, but at a very slow pace (1000x slower than normal). 'strace' reports that the time is not spent inside system calls made by the program. This always happens during the second or a later memory allocation phase. It persists for 5-10 minutes, then everything returns to normal (until the next time it happens).

The source code is attached. Unzip the file, compile with 'gcc -O3 hw-bigcount.c', and run it like 'a.out -p 061 -e'. To observe the failure, run top in another window and hit '1' to get the per-CPU view. You should see time split about 97/3 between usr and sys, then suddenly change to 100% sys. Near the top of the file is a #define MEMORY_SIZE which you can use to change the total amount of memory used.

We don't see this problem in any of the scenarios below:

A) If we remove the disk fwrite, the failure goes away (no disk writes are involved)
B) If we set zone_reclaim_mode to '0', the failure doesn't happen
C) If we run kernel version 2.6.25-14.fc9.x86_64, the failure doesn't happen even with zone_reclaim_mode set to '1'

While we expect zone_reclaim to throttle the process (while writing out dirty pages if a zone fills up), we were wondering why we don't see this issue on FC9, and whether there are any VM improvements in this area in recent kernels like FC9.
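The grow/checkpoint/reset behaviour described above can be sketched in plain C. This is our illustration of the reported allocation pattern, not the attached hashworms source; the function name and sizes are placeholders:

```c
#include <stdlib.h>
#include <string.h>

/* Two arrays grow via realloc() until they hit a cap, then are reset
 * to a small size and allowed to re-expand, as the report describes.
 * Returns the number of completed reset cycles, or -1 on allocation
 * failure. Touching the pages with memset() is what actually forces
 * the node's zones to fill and zone reclaim to kick in. */
static long grow_reset_cycles(size_t start, size_t cap, int cycles)
{
    char *a = NULL, *b = NULL;
    long done = 0;

    for (int c = 0; c < cycles; c++) {
        size_t sz = start;
        while (sz < cap) {
            char *na = realloc(a, sz);
            char *nb = realloc(b, sz);
            if (!na || !nb) {
                free(na ? na : a);
                free(nb ? nb : b);
                return -1;
            }
            a = na;
            b = nb;
            memset(a, 0, sz);   /* fault the pages in */
            memset(b, 0, sz);
            sz *= 2;            /* gradual growth */
        }
        /* checkpoint-and-reset phase: shrink back to the small size
         * (on the real workload this is where the 1.5-2 GB fwrite()
         * happens, leaving lots of dirty page cache behind) */
        char *sa = realloc(a, start);
        char *sb = realloc(b, start);
        if (sa) a = sa;
        if (sb) b = sb;
        done++;
    }
    free(a);
    free(b);
    return done;
}
```

With cap set near the node's memory size, each re-expansion after a checkpoint is what drives the allocator into zone reclaim on the filled node.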
application/x-compressed attachment (hashworms.tgz)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index deb05bf..d36cdb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -263,6 +263,10 @@ struct zone {
 	char			*name;
 } ____cacheline_internodealigned_in_smp;
 
+enum {
+	ZONE_ALL_UNRECLAIMABLE,		/* all pages pinned */
+	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+};
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc9b9ab..45e6c43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -423,7 +423,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 					struct list_head *list, int order)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
+	clear_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable);
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
@@ -1337,7 +1337,7 @@ void show_free_areas(void)
 			K(zone->nr_inactive),
 			K(zone->present_pages),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+			(test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea3b83d..fa94dd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -966,7 +966,7 @@ static unsigned long shrink_zones(int priority, struct zone **zones,
 		note_zone_scanning_priority(zone, priority);
 
-		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+		if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
 		sc->all_unreclaimable = 0;
@@ -1147,7 +1147,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 				continue;
 
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
@@ -1180,7 +1180,7 @@ scan:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 				continue;
 
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
@@ -1195,11 +1195,11 @@ scan:
 						lru_pages);
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
-			if (zone->all_unreclaimable)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable))
 				continue;
 			if (nr_slab == 0 && zone->pages_scanned >=
 				    (zone->nr_active + zone->nr_inactive) * 6)
-				zone->all_unreclaimable = 1;
+				set_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable);
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -1356,7 +1356,7 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int pass,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone->all_unreclaimable && prio != DEF_PRIORITY)
+		if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && prio != DEF_PRIORITY)
 			continue;
 
 		/* For pass = 0 we don't shrink the active list */
@@ -1654,6 +1654,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 {
 	cpumask_t mask;
 	int node_id;
+	int ret;
 
 	/*
 	 * Zone reclaim reclaims unmapped file backed pages and
@@ -1677,21 +1678,25 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	 * then do not scan.
 	 */
 	if (!(gfp_mask & __GFP_WAIT) ||
-		zone->all_unreclaimable ||
+		test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) ||
 		atomic_read(&zone->reclaim_in_progress) > 0 ||
 		(current->flags & PF_MEMALLOC))
 			return 0;
 
 	/*
 	 * Only run zone reclaim on the local zone or on zones that do not
-	 * have associated processors. This will favor the local processor
-	 * over remote processors and spread off node memory allocations
-	 * as wide as possible.
+	 * have associated processors and only allow one reclaim at a time.
+	 * This will favor the local processor over remote processors and
+	 * spread off node memory allocations as wide as possible.
 	 */
 	node_id = zone->zone_pgdat->node_id;
 	mask = node_to_cpumask(node_id);
 	if (!cpus_empty(mask) && node_id != numa_node_id())
 		return 0;
-	return __zone_reclaim(zone, gfp_mask, order);
+	if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->all_unreclaimable))
+		return 0;
+	ret = __zone_reclaim(zone, gfp_mask, order);
+	clear_bit(ZONE_RECLAIM_LOCKED, &zone->all_unreclaimable);
+	return ret;
 }
 #endif
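The locking scheme the patch bolts onto zone->all_unreclaimable can be modelled in userspace with C11 atomics. This is our sketch of the pattern, not the kernel's actual bitop implementations: the second bit in the word acts as a try-lock, so a contending "CPU" bails out of reclaim instead of piling onto zone->lru_lock:

```c
#include <stdatomic.h>

/* Bit positions overloaded into the one kABI-preserved word. */
enum { ZONE_ALL_UNRECLAIMABLE, ZONE_RECLAIM_LOCKED };

/* Userspace stand-ins for the kernel's atomic bitops: return/clear
 * the old value of bit nr in *word. */
static int test_and_set_bit(int nr, atomic_ulong *word)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_or(word, mask) & mask) != 0;
}

static void clear_bit(int nr, atomic_ulong *word)
{
    atomic_fetch_and(word, ~(1UL << nr));
}

/* Returns 1 if this caller may run reclaim, 0 if another already is;
 * mirrors the test_and_set_bit(ZONE_RECLAIM_LOCKED, ...) early return
 * the patch adds to zone_reclaim(). */
static int try_lock_zone_reclaim(atomic_ulong *flags)
{
    return !test_and_set_bit(ZONE_RECLAIM_LOCKED, flags);
}

static void unlock_zone_reclaim(atomic_ulong *flags)
{
    clear_bit(ZONE_RECLAIM_LOCKED, flags);
}

/* Demo: the first caller gets the lock, a second is refused while it
 * is held, the bit is free again after unlock, and the co-resident
 * ZONE_ALL_UNRECLAIMABLE bit is never disturbed. Returns 1 if all of
 * that holds. */
static int demo(void)
{
    atomic_ulong flags = 1UL << ZONE_ALL_UNRECLAIMABLE;
    int ok;

    ok  = try_lock_zone_reclaim(&flags);   /* acquired */
    ok &= !try_lock_zone_reclaim(&flags);  /* contended: refused */
    unlock_zone_reclaim(&flags);
    ok &= try_lock_zone_reclaim(&flags);   /* free again */
    ok &= (atomic_load(&flags) & (1UL << ZONE_ALL_UNRECLAIMABLE)) != 0;
    unlock_zone_reclaim(&flags);
    return ok;
}
```

Because the loser simply returns 0 and falls back to off-node allocation, contention costs one atomic RMW instead of minutes of lru_lock spinning, which is the whole point of the patch.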