From: Erik Jacobson <ejacobso@redhat.com>
Subject: [RHEL5.1 Patch] BZ 220416 FPSWA exceptions taking excessive system time (performance issue)
Date: Thu, 21 Dec 2006 13:32:16 -0500
Bugzilla: 220416
Message-Id: <20061221183215.GA17230@redhat.com>
Changelog: [ia64] FPSWA exceptions take excessive system time

This is bugzilla 220416.

Improve the scalability of the fpswa code that rate-limits logging of messages. There are 2 distinctly different problems in this code.

1) If prctl is used to disable logging, last_time is never updated. The result is that fpu_swa_count is zeroed out on EVERY fp fault. This causes a very, very hot cache line. The fix reduces the wallclock time of a 1024p FP exception test from 28734 sec to 19 sec!!!

2) On VERY large systems, excessive messages are logged because multiple cpus can each reset or increment fpu_swa_count at about the same time. The result is that hundreds of messages are logged each second. The fix reduces the logging rate to ~1 per second.

Signed-off-by: Jack Steiner <steiner@sgi.com>

---

Here is some additional detail about the second problem.

The second problem occurs only at high cpu counts. The problem has been seen on 768p & 1024p systems. It does not _appear_ to occur on 512p systems, although I would not rule out that it could occur under some conditions.

The current algorithm relies on the fact that <last_time> & <fpu_swa_count> are infrequently modified. Under most circumstances, these variables are globally shared & each cpu has a read-only copy of the cache line. When one of the <last_time>/<fpu_swa_count> variables is modified, the cpu must first cause the line to be evicted from all other cpus, then obtain an exclusive copy of the cacheline. When the line is finally obtained exclusively, there is no guarantee that the contents have not changed from the values observed while the line was held shared.
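The lost-update behavior described above can be reproduced in user space. The following is a minimal sketch (not part of the patch; the names NTHREADS, ITERS, and run_counters are illustrative), using pthreads and a C11 atomic_fetch_add in place of ia64_fetchadd. The plain increment is a racy read-modify-write like the old ++fpu_swa_count; the atomic one obtains the line exclusively for the whole operation:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { NTHREADS = 4, ITERS = 100000 };

static unsigned long plain_count;   /* like the old ++fpu_swa_count */
static atomic_ulong atomic_count;   /* like ia64_fetchadd */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		plain_count++;                      /* racy read-modify-write; updates can be lost */
		atomic_fetch_add(&atomic_count, 1); /* exclusive RMW; never lost */
	}
	return NULL;
}

/* Run the threads; returns the atomic total and stores the racy total. */
unsigned long run_counters(unsigned long *plain_out)
{
	pthread_t t[NTHREADS];

	plain_count = 0;
	atomic_store(&atomic_count, 0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	*plain_out = plain_count;
	return atomic_load(&atomic_count);
}
```

On a multi-core machine the racy total typically comes up short of NTHREADS * ITERS, while the atomic total is always exact, which is precisely the difference between the old and new fpu_swa_count updates.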
Look specifically at this code and consider the case where <fpu_swa_count> is incremented:

	(1)	if ((fpu_swa_count < 4) ...
	(2)		....
	(3)	++fpu_swa_count;

- code at line (1) gets a shared copy of the cacheline containing <fpu_swa_count>.
- the value of <fpu_swa_count> is loaded into a register.
- code at line (2) generates a few more memory references & introduces a small delay between (1) & (3).
- code at line (3) does:
	- add "1" to the possibly stale register copy of <fpu_swa_count> that was read from the shared line in (1)
	- obtain an exclusive copy of the cacheline containing <fpu_swa_count> - OTHER CPUS MAY HAVE already updated this line!!
	- store the potentially stale result to <fpu_swa_count>

On heavily loaded large systems, it is probable that multiple cpus will each make the same updates to <fpu_swa_count>. The <fpu_swa_count> can be repeatedly reset to the same or smaller values. The result is that excessive messages are logged & the cacheline containing <fpu_swa_count> becomes very hot.

This is easily verified. The log from the customer site showed periods of correct behavior & periods of bad behavior. I stripped the timestamps out of the logs, and ran the "times" thru "uniq -c".

This is from one portion of a log and is what I would expect to see. Column 1 is a count of the number of "floating-point" messages that were logged in the 1 second interval. Note the pattern of a burst of 4 messages every 5 sec:

	4 00:21:29
	4 00:21:34
	5 00:21:39
	4 00:21:44
	4 00:21:50
	4 00:21:55
	4 00:22:00

Here is another portion of the log that shows the bad behavior. The system is logging 100's of messages per second!!!! Logging never stops.
	103 14:26:07
	122 14:26:08
	112 14:26:09
	 83 14:26:10
	 98 14:26:11
	 91 14:26:12
	123 14:26:13
	123 14:26:14
	 97 14:26:15
	100 14:26:16
	117 14:26:17
	 84 14:26:18

Index: linux/arch/ia64/kernel/traps.c
===================================================================
--- linux.orig/arch/ia64/kernel/traps.c	2006-12-20 13:26:47.276186100 -0600
+++ linux/arch/ia64/kernel/traps.c	2006-12-21 10:27:50.987899608 -0600
@@ -325,6 +325,15 @@ fp_emulate (int fp_fault, void *bundle,
 	return ret.status;
 }
 
+struct fpu_swa_msg {
+	unsigned long count;
+	unsigned long time;
+};
+static DEFINE_PER_CPU(struct fpu_swa_msg, cpulast);
+DECLARE_PER_CPU(struct fpu_swa_msg, cpulast);
+static struct fpu_swa_msg last __cacheline_aligned;
+
+
 /*
  * Handle floating-point assist faults and traps.
  */
@@ -334,8 +343,6 @@ handle_fpu_swa (int fp_fault, struct pt_
 	long exception, bundle[2];
 	unsigned long fault_ip;
 	struct siginfo siginfo;
-	static int fpu_swa_count = 0;
-	static unsigned long last_time;
 
 	fault_ip = regs->cr_iip;
 	if (!fp_fault && (ia64_psr(regs)->ri == 0))
@@ -343,14 +350,37 @@ handle_fpu_swa (int fp_fault, struct pt_
 	if (copy_from_user(bundle, (void __user *) fault_ip, sizeof(bundle)))
 		return -1;
 
-	if (jiffies - last_time > 5*HZ)
-		fpu_swa_count = 0;
-	if ((fpu_swa_count < 4) && !(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
-		last_time = jiffies;
-		++fpu_swa_count;
-		printk(KERN_WARNING
-			"%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
-			current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
+	if (!(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
+		unsigned long count, current_jiffies = jiffies;
+		struct fpu_swa_msg *cp = &__get_cpu_var(cpulast);
+
+		if (unlikely(current_jiffies > cp->time))
+			cp->count = 0;
+		if (unlikely(cp->count < 5)) {
+			cp->count++;
+			cp->time = current_jiffies + 5 * HZ;
+
+			/* minimize races by grabbing a copy of count BEFORE checking last.time. */
+			count = last.count;
+			barrier();
+
+			/*
+			 * Lower 4 bits are used as a count. Upper bits are a sequence
+			 * number that is updated when count is reset. The cmpxchg will
+			 * fail if seqno has changed. This minimizes multiple cpus
+			 * resetting the count.
+			 */
+			if (current_jiffies > last.time)
+				(void) cmpxchg_acq(&last.count, count, 16 + (count & ~15));
+
+			/* use fetchadd to atomically update the count */
+			if ((last.count & 15) < 5 && (ia64_fetchadd(1, &last.count, acq) & 15) < 5) {
+				last.time = current_jiffies + 5 * HZ;
+				printk(KERN_WARNING
+					"%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
+					current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
+			}
+		}
 	}
 
 	exception = fp_emulate(fp_fault, bundle, &regs->cr_ipsr, &regs->ar_fpsr, &isr, &regs->pr,
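For anyone who wants to experiment with the rate-limiting scheme outside the kernel, here is a user-space sketch of the same logic. It is not the patch itself: C11 atomics stand in for cmpxchg_acq and ia64_fetchadd, _Thread_local stands in for the per-cpu structure, the caller passes a fake jiffies value, and HZ=100, msg_limit, and should_log are assumptions of this sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define HZ 100	/* assumed tick rate for this sketch */

struct msg_limit {
	atomic_ulong count;
	unsigned long time;
};

/* stand-ins for the per-cpu copy and the shared, cacheline-aligned copy */
static _Thread_local struct msg_limit cpulast;
static struct msg_limit last;

/* Return true if a message may be logged at (caller-supplied) time `now`. */
bool should_log(unsigned long now)
{
	struct msg_limit *cp = &cpulast;

	if (now > cp->time)
		cp->count = 0;
	if (cp->count < 5) {
		cp->count++;
		cp->time = now + 5 * HZ;

		/* grab a copy of the global count BEFORE checking last.time */
		unsigned long count = atomic_load(&last.count);

		/*
		 * Low 4 bits count messages in the current window; the upper
		 * bits are a sequence number bumped on each reset, so a
		 * compare-exchange against a stale value fails and at most
		 * one thread resets the window.
		 */
		if (now > last.time)
			atomic_compare_exchange_strong(&last.count, &count,
						       16 + (count & ~15UL));

		/* fetch-and-add: only the first 5 winners in a window log */
		if ((atomic_load(&last.count) & 15) < 5 &&
		    (atomic_fetch_add(&last.count, 1) & 15) < 5) {
			last.time = now + 5 * HZ;
			return true;
		}
	}
	return false;
}
```

The per-thread check is the cheap fast path (it touches only thread-local memory, which is what fixed the hot cache line), and the sequence-number-plus-fetchadd step is what keeps multiple concurrent threads from repeatedly resetting the shared count.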