From: Larry Woodman <lwoodman@redhat.com>
Date: Wed, 30 Jul 2008 12:31:26 -0400
Subject: [mm] NUMA: system is slow when over-committing memory
Message-id: 1217435486.8250.21.camel@localhost.localdomain
O-Subject: [RHEL5-U3 patch] Prevent 100% cpu time in RHEL5 kernel under NUMA when zone_reclaim_mode=1
Bugzilla: 457264
RH-Acked-by: Rik van Riel <riel@redhat.com>

We received a report about the RHEL5 kernel running 1000 times slower than the upstream kernel for several seconds or even minutes when over-committing the memory on one node of a multi-core NUMA system. This only happens when zone_reclaim_mode is set to 1 by build_zonelists(), which does so when it determines that another node is sufficiently "far away" that it is better to reclaim pages within a zone before going off-node.

I verified this and determined that the upstream kernel does not let more than one core/cpu into __zone_reclaim() at a time. Without this change we can have multiple cores performing direct reclaim on the same zone at the same time, which causes heavy spin_lock_irq(&zone->lru_lock) contention.

The attached patch adds that logic to RHEL5 by overloading the zone->all_unreclaimable field, similar to the way zone->flags is used upstream. While this is a bit hacky, it preserves the kABI.

Fixes BZ 457264.

----------------------------------------------------------------------------

We were hoping you and possibly other VM experts might be able to give us some input on this issue.

Issue: 100% time spent in EL5 kernel under NUMA with zone_reclaim_mode=1

Hashworms is a program which simulates a finite automaton. It uses two large arrays, which it dynamically resizes using realloc(). In normal operation the arrays gradually grow until they hit a maximum memory size; periodically the program resets them to very small sizes and allows them to re-expand. When the program downsizes its arrays, it checkpoints to disk. The checkpoint is a fairly large file, 1.5-2 GB, written with fwrite().
The attached source code is configured to use at most 3GB of memory. On a dual-socket Nehalem system (with 2GB in each NUMA node, 4GB total) running Red Hat EL5 2.6.18-92.4.el5 (with zone_reclaim_mode set to '1'), it runs normally for a little while, but then seems to get lost in the kernel. 'top' reports 100% 'sys' utilization for that thread. The application continues to make progress, but at a very slow pace (1000x slower than normal). 'strace' reports that the time is not spent inside system calls made by the program. This always happens during the second or a later memory allocation phase. It persists for 5-10 minutes, then everything returns to normal (until the next time it happens).

The source code is attached. Unzip the file, compile with 'gcc -O3 hw-bigcount.c', and run it like 'a.out -p 061 -e'. To observe the failure, run top in another window and hit '1' to get the per-CPU view. You should see time split about 97/3 between usr and sys, then suddenly change to 100% sys. Near the top of the file is a #define MEMORY_SIZE which you can use to change the total amount of memory used.

We don't see this problem in any of the scenarios below:

A) If we remove the disk fwrite, the failure goes away (no disk writes are involved)
B) If we set zone_reclaim_mode to '0', the failure doesn't happen
C) If we run kernel version 2.6.25-14.fc9.x86_64, the failure doesn't happen even with zone_reclaim_mode set to '1'

While we expect zone_reclaim to throttle the process (while writing out dirty pages if a zone fills up), we were wondering why we don't see this issue on FC9, and whether there are any VM improvements in this area in recent kernels like FC9.
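The grow/checkpoint/reset behaviour described above can be sketched in plain C. This is our illustration of the reported allocation pattern, not the attached hashworms source; the function name and sizes are placeholders:

```c
#include <stdlib.h>
#include <string.h>

/* Two arrays grow via realloc() until they hit a cap, then are reset
 * to a small size and allowed to re-expand, as the report describes.
 * Returns the number of completed reset cycles, or -1 on allocation
 * failure. Touching the pages with memset() is what actually forces
 * the node's zones to fill and zone reclaim to kick in. */
static long grow_reset_cycles(size_t start, size_t cap, int cycles)
{
    char *a = NULL, *b = NULL;
    long done = 0;

    for (int c = 0; c < cycles; c++) {
        size_t sz = start;
        while (sz < cap) {
            char *na = realloc(a, sz);
            char *nb = realloc(b, sz);
            if (!na || !nb) {
                free(na ? na : a);
                free(nb ? nb : b);
                return -1;
            }
            a = na;
            b = nb;
            memset(a, 0, sz);   /* fault the pages in */
            memset(b, 0, sz);
            sz *= 2;            /* gradual growth */
        }
        /* checkpoint-and-reset phase: shrink back to the small size
         * (on the real workload this is where the 1.5-2 GB fwrite()
         * happens, leaving lots of dirty page cache behind) */
        char *sa = realloc(a, start);
        char *sb = realloc(b, start);
        if (sa) a = sa;
        if (sb) b = sb;
        done++;
    }
    free(a);
    free(b);
    return done;
}
```

With cap set near the node's memory size, each re-expansion after a checkpoint is what drives the allocator into zone reclaim on the filled node.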
application/x-compressed attachment (hashworms.tgz)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index deb05bf..d36cdb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -263,6 +263,10 @@ struct zone {
 	char			*name;
 } ____cacheline_internodealigned_in_smp;
 
+enum {
+	ZONE_ALL_UNRECLAIMABLE,		/* all pages pinned */
+	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+};
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc9b9ab..45e6c43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -423,7 +423,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 					struct list_head *list, int order)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
+	clear_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable);
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
@@ -1337,7 +1337,7 @@ void show_free_areas(void)
 			K(zone->nr_inactive),
 			K(zone->present_pages),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+			(test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea3b83d..fa94dd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -966,7 +966,7 @@ static unsigned long shrink_zones(int priority, struct zone **zones,
 		note_zone_scanning_priority(zone, priority);
 
-		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+		if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
 		sc->all_unreclaimable = 0;
@@ -1147,7 +1147,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 				continue;
 
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
@@ -1180,7 +1180,7 @@ scan:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && priority != DEF_PRIORITY)
 				continue;
 
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
@@ -1195,11 +1195,11 @@ scan:
 						lru_pages);
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
-			if (zone->all_unreclaimable)
+			if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable))
 				continue;
 			if (nr_slab == 0 && zone->pages_scanned >=
 				    (zone->nr_active + zone->nr_inactive) * 6)
-				zone->all_unreclaimable = 1;
+				set_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable);
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -1356,7 +1356,7 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int pass,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone->all_unreclaimable && prio != DEF_PRIORITY)
+		if (test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) && prio != DEF_PRIORITY)
 			continue;
 
 		/* For pass = 0 we don't shrink the active list */
@@ -1654,6 +1654,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 {
 	cpumask_t mask;
 	int node_id;
+	int ret;
 
 	/*
 	 * Zone reclaim reclaims unmapped file backed pages and
@@ -1677,21 +1678,25 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	 * then do not scan.
 	 */
 	if (!(gfp_mask & __GFP_WAIT) ||
-		zone->all_unreclaimable ||
+		test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->all_unreclaimable) ||
 		atomic_read(&zone->reclaim_in_progress) > 0 ||
 		(current->flags & PF_MEMALLOC))
 			return 0;
 
 	/*
 	 * Only run zone reclaim on the local zone or on zones that do not
-	 * have associated processors. This will favor the local processor
-	 * over remote processors and spread off node memory allocations
-	 * as wide as possible.
+	 * have associated processors and only allow one reclaim at a time.
+	 * This will favor the local processor over remote processors and
+	 * spread off node memory allocations as wide as possible.
 	 */
 	node_id = zone->zone_pgdat->node_id;
 	mask = node_to_cpumask(node_id);
 	if (!cpus_empty(mask) && node_id != numa_node_id())
 		return 0;
-	return __zone_reclaim(zone, gfp_mask, order);
+	if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->all_unreclaimable))
+		return 0;
+	ret = __zone_reclaim(zone, gfp_mask, order);
+	clear_bit(ZONE_RECLAIM_LOCKED, &zone->all_unreclaimable);
+	return ret;
 }
 #endif
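The locking scheme the patch bolts onto zone->all_unreclaimable can be modelled in userspace with C11 atomics. This is our sketch of the pattern, not the kernel's actual bitop implementations: the second bit in the word acts as a try-lock, so a contending "CPU" bails out of reclaim instead of piling onto zone->lru_lock:

```c
#include <stdatomic.h>

/* Bit positions overloaded into the one kABI-preserved word. */
enum { ZONE_ALL_UNRECLAIMABLE, ZONE_RECLAIM_LOCKED };

/* Userspace stand-ins for the kernel's atomic bitops: return/clear
 * the old value of bit nr in *word. */
static int test_and_set_bit(int nr, atomic_ulong *word)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_or(word, mask) & mask) != 0;
}

static void clear_bit(int nr, atomic_ulong *word)
{
    atomic_fetch_and(word, ~(1UL << nr));
}

/* Returns 1 if this caller may run reclaim, 0 if another already is;
 * mirrors the test_and_set_bit(ZONE_RECLAIM_LOCKED, ...) early return
 * the patch adds to zone_reclaim(). */
static int try_lock_zone_reclaim(atomic_ulong *flags)
{
    return !test_and_set_bit(ZONE_RECLAIM_LOCKED, flags);
}

static void unlock_zone_reclaim(atomic_ulong *flags)
{
    clear_bit(ZONE_RECLAIM_LOCKED, flags);
}

/* Demo: the first caller gets the lock, a second is refused while it
 * is held, the bit is free again after unlock, and the co-resident
 * ZONE_ALL_UNRECLAIMABLE bit is never disturbed. Returns 1 if all of
 * that holds. */
static int demo(void)
{
    atomic_ulong flags = 1UL << ZONE_ALL_UNRECLAIMABLE;
    int ok;

    ok  = try_lock_zone_reclaim(&flags);   /* acquired */
    ok &= !try_lock_zone_reclaim(&flags);  /* contended: refused */
    unlock_zone_reclaim(&flags);
    ok &= try_lock_zone_reclaim(&flags);   /* free again */
    ok &= (atomic_load(&flags) & (1UL << ZONE_ALL_UNRECLAIMABLE)) != 0;
    unlock_zone_reclaim(&flags);
    return ok;
}
```

Because the loser simply returns 0 and falls back to off-node allocation, contention costs one atomic RMW instead of minutes of lru_lock spinning, which is the whole point of the patch.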