From: Jeff Moyer <jmoyer@redhat.com>
Date: Mon, 5 Apr 2010 20:23:49 -0400
Subject: [fs] aio: fix flush_workqueue deadlock
Message-id: <x494ojpk5re.fsf@segfault.boston.devel.redhat.com>
Patchwork-id: 23898
O-Subject: [rhel5 patch] aio: fix flush_workqueue deadlock
Bugzilla: 546700
RH-Acked-by: Jerome Marchand <jmarchan@redhat.com>

Hi,

From the bugzilla description:

"If many threads call io_destroy() with outstanding I/O, the aio
subsystem can deadlock.  The aio workqueue handler gets stuck in
flush_workqueue() waiting for the workqueue_mutex, while another
thread in io_destroy() is also in flush_workqueue() holding the
workqueue_mutex and waiting for the workqueue to complete.  The
attached code demonstrates the issue and a workaround.  It usually
fails in a few seconds."

To further elucidate the problem, each thread essentially performs the
following steps in an infinite loop:

	fd = open(filename, O_DIRECT);
	io_setup(NUM_IOS, &ioctx);
	io_submit(ioctx, NUM_IOS, iocbs);
	close(fd);
	io_destroy(ioctx);

Noteworthy, here, is that each thread has a separate file descriptor
for the file, and the fd is closed before the call to io_destroy.
This is what you'll see for the two processes that trigger the deadly
embrace:

PID: 476    TASK: ffff81042fb99100  CPU: 2   COMMAND: "aio/2"
 #0 [ffff81022f42bca0] schedule at ffffffff80062f66
 #1 [ffff81022f42bd78] __mutex_lock_slowpath at ffffffff80063c3f
 #2 [ffff81022f42bdb8] .text.lock.mutex at ffffffff80063c89 (via mutex_lock)
 #3 [ffff81022f42bdd8] flush_workqueue at ffffffff8009c800
 #4 [ffff81022f42bdf8] __put_ioctx at ffffffff800edd4d
 #5 [ffff81022f42be18] aio_fput_routine at ffffffff800ee2b4
 #6 [ffff81022f42be38] run_workqueue at ffffffff8004d80a
 #7 [ffff81022f42be78] worker_thread at ffffffff8004a142
 #8 [ffff81022f42bee8] kthread at ffffffff8003298b
 #9 [ffff81022f42bf48] kernel_thread at ffffffff8005dfb1

PID: 3572   TASK: ffff810217d18100  CPU: 0   COMMAND: "aio_wq_hang"
 #0 [ffff810217d5fe08] schedule at ffffffff80062f66
 #1 [ffff810217d5fee0] flush_cpu_workqueue at ffffffff8009c793
 #2 [ffff810217d5ff30] flush_workqueue at ffffffff8009c821
 #3 [ffff810217d5ff50] __put_ioctx at ffffffff800edd4d
 #4 [ffff810217d5ff70] sys_io_destroy at ffffffff800ee1ee
 #5 [ffff810217d5ff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b7ee919b677  RSP: 000000007ce19a48  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 0000000000000000  RDI: 00002aaaaaafc000
    RBP: 000000007ce1a130   R8: 00000000ffffffff   R9: 0000003dbb719840
    R10: 000000007ce1a940  R11: 0000000000000202  R12: ffffffff800ee1ee
    R13: 000000007ce1a130  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: 00000000000000cf  CS: 0033  SS: 002b

What happens is that the last put on the fd occurs in the I/O
completion path (meaning that the fd was closed before the I/O
completed), and so the final put on the fd is deferred to process
context in the aio workqueue.  One of the other threads in the program
then calls io_destroy on its own, unrelated ioctx, between the time
when the work is scheduled and the time it is run.
The io_destroy code path calls __put_ioctx as it is dropping the last
reference on its ioctx.  __put_ioctx, in turn, calls flush_workqueue,
which takes the workqueue_mutex and then goes to sleep waiting for the
workqueue processing to finish.  Given that there is only a single
workqueue_mutex, the kernel now deadlocks, as the code running within
the workqueue also tries to flush the workqueue upon the final put of
its ioctx.

The fix is to simply not flush the workqueue from within the
workqueue.  Because this is the final put on the io_context, we know
that there are no other outstanding I/Os and no other outstanding
fputs to be done.  Thus, this is a safe thing to do.  The upstream fix
is different, as they have rewritten much of the workqueue code to
avoid this problem (whether intentional or not, I cannot say).

I've modified the included reproducer to work within our RHTS test
environment.  I checked it in to /kernel/filesystems/bz546700 and
added it to the KernelTier2 tests.  I verified that the test reports
failure when run against an unpatched kernel, and that it reports
success when run against a patched kernel.  The reporter also ran a
test kernel and reported that it runs successfully in his environment.

This patch addresses bug 546700.  Comments, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jarod Wilson <jarod@redhat.com>

diff --git a/fs/aio.c b/fs/aio.c
index aa2d9d3..5bcea97 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -500,7 +500,7 @@ void fastcall exit_aio(struct mm_struct *mm)
  * Called when the last user of an aio context has gone away,
  * and the struct needs to be freed.
 */
-void fastcall __put_ioctx(struct kioctx *ctx)
+void __put_ioctx_wq(struct kioctx *ctx, int wq_context)
 {
 	unsigned nr_events = ctx->max_reqs;
 
@@ -508,7 +508,8 @@ void fastcall __put_ioctx(struct kioctx *ctx)
 		BUG();
 
 	cancel_delayed_work(&ctx->wq);
-	flush_workqueue(aio_wq);
+	if (!wq_context)
+		flush_workqueue(aio_wq);
 	aio_free_ring(ctx);
 	mmdrop(ctx->mm);
 	ctx->mm = NULL;
@@ -523,6 +524,11 @@ void fastcall __put_ioctx(struct kioctx *ctx)
 	}
 }
 
+void fastcall __put_ioctx(struct kioctx *ctx)
+{
+	__put_ioctx_wq(ctx, 0);
+}
+
 /* aio_get_req
  *	Allocate a slot for an aio request.  Increments the users count
  *	of the kioctx so that the kioctx stays around until all requests are
@@ -623,7 +629,7 @@ static void aio_fput_routine(void *data)
 			really_put_req(ctx, req);
 			spin_unlock_irq(&ctx->ctx_lock);
 
-			put_ioctx(ctx);
+			put_ioctx_from_wq(ctx);
 
 			spin_lock_irq(&fput_lock);
 	}
 	spin_unlock_irq(&fput_lock);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 2d44b0a..29369d0 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -206,6 +206,7 @@ extern int FASTCALL(aio_put_req(struct kiocb *iocb));
 extern void FASTCALL(kick_iocb(struct kiocb *iocb));
 extern int FASTCALL(aio_complete(struct kiocb *iocb, long res, long res2));
 extern void FASTCALL(__put_ioctx(struct kioctx *ctx));
+extern void __put_ioctx_wq(struct kioctx *ctx, int wq_context);
 struct mm_struct;
 extern void FASTCALL(exit_aio(struct mm_struct *mm));
 extern struct kioctx *lookup_ioctx(unsigned long ctx_id);
@@ -227,6 +228,11 @@ int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		if (unlikely(atomic_dec_and_test(&(kioctx)->users))) \
 			__put_ioctx(kioctx);				\
 	} while (0)
+#define put_ioctx_from_wq(kioctx) do {					\
+	BUG_ON(unlikely(atomic_read(&(kioctx)->users) <= 0));		\
+	if (unlikely(atomic_dec_and_test(&(kioctx)->users)))		\
+		__put_ioctx_wq(kioctx, 1);				\
+} while (0)
 #define in_aio() !is_sync_wait(current->io_wait)	/* may be used for debugging */