From: Jeff Moyer <jmoyer@redhat.com>
Date: Mon, 5 Apr 2010 20:23:49 -0400
Subject: [fs] aio: fix flush_workqueue deadlock
Message-id: <x494ojpk5re.fsf@segfault.boston.devel.redhat.com>
Patchwork-id: 23898
O-Subject: [rhel5 patch] aio: fix flush_workqueue deadlock
Bugzilla: 546700
RH-Acked-by: Jerome Marchand <jmarchan@redhat.com>

Hi,

From the bugzilla description:

"If many threads call io_destroy() with outstanding I/O, the aio
subsystem can deadlock.  The aio workqueue handler gets stuck in
flush_workqueue() waiting for the workqueue_mutex, while another
thread in io_destroy() is also in flush_workqueue() holding the
workqueue_mutex and waiting for the workqueue to complete.  The
attached code demonstrates the issue and a workaround.  It usually
fails in a few seconds."

To further elucidate the problem, each thread essentially performs the
following steps in an infinite loop:

	fd = open(filename, O_DIRECT);
	io_setup(NUM_IOS, &ioctx);
	io_submit(ioctx, NUM_IOS, iocbs);
	close(fd);
	io_destroy(ioctx);

Noteworthy, here, is that each thread has a separate file descriptor
for the file, and the fd is closed before the call to io_destroy.
This is what you'll see for the two processes that trigger the deadly
embrace:

PID: 476    TASK: ffff81042fb99100  CPU: 2   COMMAND: "aio/2"
 #0 [ffff81022f42bca0] schedule at ffffffff80062f66
 #1 [ffff81022f42bd78] __mutex_lock_slowpath at ffffffff80063c3f
 #2 [ffff81022f42bdb8] .text.lock.mutex at ffffffff80063c89 (via mutex_lock)
 #3 [ffff81022f42bdd8] flush_workqueue at ffffffff8009c800
 #4 [ffff81022f42bdf8] __put_ioctx at ffffffff800edd4d
 #5 [ffff81022f42be18] aio_fput_routine at ffffffff800ee2b4
 #6 [ffff81022f42be38] run_workqueue at ffffffff8004d80a
 #7 [ffff81022f42be78] worker_thread at ffffffff8004a142
 #8 [ffff81022f42bee8] kthread at ffffffff8003298b
 #9 [ffff81022f42bf48] kernel_thread at ffffffff8005dfb1

PID: 3572   TASK: ffff810217d18100  CPU: 0   COMMAND: "aio_wq_hang"
 #0 [ffff810217d5fe08] schedule at ffffffff80062f66
 #1 [ffff810217d5fee0] flush_cpu_workqueue at ffffffff8009c793
 #2 [ffff810217d5ff30] flush_workqueue at ffffffff8009c821
 #3 [ffff810217d5ff50] __put_ioctx at ffffffff800edd4d
 #4 [ffff810217d5ff70] sys_io_destroy at ffffffff800ee1ee
 #5 [ffff810217d5ff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b7ee919b677  RSP: 000000007ce19a48  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 0000000000000000  RDI: 00002aaaaaafc000
    RBP: 000000007ce1a130   R8: 00000000ffffffff   R9: 0000003dbb719840
    R10: 000000007ce1a940  R11: 0000000000000202  R12: ffffffff800ee1ee
    R13: 000000007ce1a130  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: 00000000000000cf  CS: 0033  SS: 002b

What happens is that the last put on the fd occurs in the I/O
completion path (meaning that the fd was closed before the I/O
completed), and so the final put on the fd is deferred to process
context in the aio workqueue.  One of the other threads in the program
then calls io_destroy on its own, unrelated ioctx, between the time
when the work is scheduled and the time it is run.
The io_destroy code path calls __put_ioctx as it is dropping the last
reference on its ioctx.  __put_ioctx, in turn, calls flush_workqueue,
which takes the workqueue_mutex and then goes to sleep waiting for the
workqueue processing to finish.  Given that there is only a single
workqueue_mutex, the kernel now deadlocks, as the code running within
the workqueue also tries to flush the workqueue upon the final put of
its ioctx.

The fix is to simply not flush the workqueue from within the
workqueue.  Because this is the final put on the io_context, we know
that there are no other outstanding I/Os and no other outstanding
fputs to be done.  Thus, this is a safe thing to do.  The upstream fix
is different, as they have rewritten much of the workqueue code to
avoid this problem (whether intentional or not, I cannot say).

I've modified the included reproducer to work within our RHTS test
environment.  I checked it in to /kernel/filesystems/bz546700 and
added it to the KernelTier2 tests.  I verified that the test reports
failure when run against an unpatched kernel, and that it reports
success when run against a patched kernel.  The reporter also ran a
test kernel and reported that it runs successfully in his environment.

This patch addresses bug 546700.  Comments, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jarod Wilson <jarod@redhat.com>

diff --git a/fs/aio.c b/fs/aio.c
index aa2d9d3..5bcea97 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -500,7 +500,7 @@ void fastcall exit_aio(struct mm_struct *mm)
  * Called when the last user of an aio context has gone away,
  * and the struct needs to be freed.
 */
-void fastcall __put_ioctx(struct kioctx *ctx)
+void __put_ioctx_wq(struct kioctx *ctx, int wq_context)
 {
 	unsigned nr_events = ctx->max_reqs;
 
@@ -508,7 +508,8 @@ void fastcall __put_ioctx(struct kioctx *ctx)
 		BUG();
 
 	cancel_delayed_work(&ctx->wq);
-	flush_workqueue(aio_wq);
+	if (!wq_context)
+		flush_workqueue(aio_wq);
 	aio_free_ring(ctx);
 	mmdrop(ctx->mm);
 	ctx->mm = NULL;
@@ -523,6 +524,11 @@ void fastcall __put_ioctx(struct kioctx *ctx)
 	}
 }
 
+void fastcall __put_ioctx(struct kioctx *ctx)
+{
+	__put_ioctx_wq(ctx, 0);
+}
+
 /* aio_get_req
  *	Allocate a slot for an aio request.  Increments the users count
  *	of the kioctx so that the kioctx stays around until all requests are
@@ -623,7 +629,7 @@ static void aio_fput_routine(void *data)
 			really_put_req(ctx, req);
 			spin_unlock_irq(&ctx->ctx_lock);
 
-			put_ioctx(ctx);
+			put_ioctx_from_wq(ctx);
 
 			spin_lock_irq(&fput_lock);
 	}
 	spin_unlock_irq(&fput_lock);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 2d44b0a..29369d0 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -206,6 +206,7 @@ extern int FASTCALL(aio_put_req(struct kiocb *iocb));
 extern void FASTCALL(kick_iocb(struct kiocb *iocb));
 extern int FASTCALL(aio_complete(struct kiocb *iocb, long res, long res2));
 extern void FASTCALL(__put_ioctx(struct kioctx *ctx));
+extern void __put_ioctx_wq(struct kioctx *ctx, int wq_context);
 struct mm_struct;
 extern void FASTCALL(exit_aio(struct mm_struct *mm));
 extern struct kioctx *lookup_ioctx(unsigned long ctx_id);
@@ -227,6 +228,11 @@ int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		if (unlikely(atomic_dec_and_test(&(kioctx)->users))) \
 			__put_ioctx(kioctx);				\
 	} while (0)
+#define put_ioctx_from_wq(kioctx) do {					\
+	BUG_ON(unlikely(atomic_read(&(kioctx)->users) <= 0));		\
+	if (unlikely(atomic_dec_and_test(&(kioctx)->users)))		\
+		__put_ioctx_wq(kioctx, 1);				\
+} while (0)
 #define in_aio() !is_sync_wait(current->io_wait)	/* may be used for debugging */