kernel-2.6.18-238.19.1.el5.centos.plus.src.rpm

From: Jeff Layton <jlayton@redhat.com>
Date: Thu, 5 Jun 2008 07:51:52 -0400
Subject: [nfs] sunrpc: fix hang due to eventd deadlock
Message-id: 1212666712-30755-3-git-send-email-jlayton@redhat.com
O-Subject: [RHEL5.3 PATCH 2/2] BZ#448754: SUNRPC: fix hang due to eventd deadlock...
Bugzilla: 448754
RH-Acked-by: Steve Dickson <SteveD@redhat.com>

When NFS needs to clean up or reconnect a socket, it queues the task to
the generic kevents workqueues. This can cause a deadlock in rare
situations if a workqueue already has a job that will block waiting
for an RPC call on that socket and the reconnect job gets submitted to
the same workqueue.

The customer who reported this saw this problem using Lustre, but I
think it could also be possible to hit this on a root-on-NFS setup. The
description of the upstream patch is below, but it's actually not
correct. usermodehelper uses its own workqueue and so simply using it
cannot cause this deadlock. Doing a usermodehelper call from work
queued to the generic workqueue can cause this and that's what seems
to be happening in the original report.
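
As an illustration only, the deadlock condition described above can be modelled in userspace C. This is a minimal sketch, not kernel code: `struct workqueue_model`, `wq_push()`, `simulate_deadlocks()` and the two scenario helpers are hypothetical names invented for this example. It assumes each workqueue has a single worker that runs jobs strictly in FIFO order, and that a job waiting on an RPC can only finish once the reconnect job has run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two job types involved in the hang. */
enum job_kind {
	JOB_WAIT_RPC,	/* blocks until the transport is reconnected */
	JOB_RECONNECT	/* the reconnect work queued by the RPC layer */
};

#define MAX_JOBS 8

/* Assumption: one worker per queue, jobs run strictly in FIFO order. */
struct workqueue_model {
	enum job_kind jobs[MAX_JOBS];
	size_t head;	/* index of the job the worker is currently on */
	size_t count;	/* number of jobs ever queued */
};

static int wq_push(struct workqueue_model *wq, enum job_kind kind)
{
	if (wq->count >= MAX_JOBS)
		return -1;
	wq->jobs[wq->count++] = kind;
	return 0;
}

/* Returns 1 if the queues get stuck, 0 if every job eventually completes. */
static int simulate_deadlocks(struct workqueue_model *queues, size_t nqueues)
{
	int connected = 0;

	for (;;) {
		int progressed = 0;
		size_t pending = 0;

		for (size_t i = 0; i < nqueues; i++) {
			struct workqueue_model *wq = &queues[i];

			assert(wq->head <= wq->count);
			if (wq->head == wq->count)
				continue;	/* this queue is drained */
			pending++;

			if (wq->jobs[wq->head] == JOB_RECONNECT) {
				connected = 1;	/* reconnect ran */
				wq->head++;
				progressed = 1;
			} else if (connected) {
				wq->head++;	/* RPC wait can now finish */
				progressed = 1;
			}
			/* otherwise the worker stays blocked on this job */
		}

		if (pending == 0)
			return 0;
		if (!progressed)
			return 1;	/* every busy worker is blocked */
	}
}

/* Reconnect lands behind the blocked job on the same queue. */
static int shared_queue_deadlocks(void)
{
	struct workqueue_model events[1] = { { .head = 0, .count = 0 } };

	wq_push(&events[0], JOB_WAIT_RPC);
	wq_push(&events[0], JOB_RECONNECT);
	return simulate_deadlocks(events, 1);
}

/* Reconnect is queued to a separate queue, as this patch does with rpciod. */
static int separate_queue_deadlocks(void)
{
	struct workqueue_model queues[2] = {
		{ .head = 0, .count = 0 },	/* generic events queue */
		{ .head = 0, .count = 0 }	/* stand-in for rpciod */
	};

	wq_push(&queues[0], JOB_WAIT_RPC);
	wq_push(&queues[1], JOB_RECONNECT);
	return simulate_deadlocks(queues, 2);
}
```

In this model `shared_queue_deadlocks()` reports a hang because the reconnect sits behind the job that is waiting for it, while `separate_queue_deadlocks()` drains both queues.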

The fix is fairly simple -- rather than queuing the reconnection and
cleanup to the generic workqueue, we queue it to rpciod's workqueue.
Testing this internally is tough since this is such a subtle race, but
the customer who submitted this upstream has tested it and it seems
to have fixed the problem for them.

This is the second submission of this patch -- the earlier one
could deadlock since rpciod could call xs_destroy(), which would try to
call flush_workqueue() on itself. This one takes a different approach
to cleaning up the queued work. If the work cannot be cancelled
(indicating that it's already running), it will wait until the
XPRT_CONNECTING bit on the transport is cleared. Once that occurs,
the work should then be complete. This obviates the need to call
flush_workqueue() and prevents the deadlock in the earlier patch.
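
The cancel-or-wait approach described above can be sketched in userspace with POSIX threads. This is a hedged analogy rather than the patch itself: `struct model_xprt` and every `model_*` function are hypothetical, a mutex and condition variable stand in for `wait_on_bit()` and `wake_up_bit()`, and the `connecting` flag plays the role of XPRT_CONNECTING.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport; not the kernel's struct rpc_xprt. */
struct model_xprt {
	pthread_mutex_t lock;
	pthread_cond_t state_changed;
	bool work_pending;	/* connect work queued but not yet started */
	bool connecting;	/* plays the role of XPRT_CONNECTING */
	bool connect_ran;	/* set once the connect work has finished */
};

static void model_xprt_init(struct model_xprt *x, bool pending)
{
	pthread_mutex_init(&x->lock, NULL);
	pthread_cond_init(&x->state_changed, NULL);
	x->work_pending = pending;
	x->connecting = true;	/* set before the work is queued */
	x->connect_ran = false;
}

/* The connect work: finish, clear the flag, then wake any waiter. */
static void *model_connect_worker(void *arg)
{
	struct model_xprt *x = arg;

	pthread_mutex_lock(&x->lock);
	x->connect_ran = true;
	x->connecting = false;				/* like clear_bit() */
	pthread_cond_broadcast(&x->state_changed);	/* like wake_up_bit() */
	pthread_mutex_unlock(&x->lock);
	return NULL;
}

/* Returns true only if the work was still pending and has been removed. */
static bool model_cancel_work(struct model_xprt *x)
{
	bool was_pending;

	pthread_mutex_lock(&x->lock);
	was_pending = x->work_pending;
	x->work_pending = false;
	pthread_mutex_unlock(&x->lock);
	return was_pending;
}

/* If the work could not be cancelled, wait for the flag instead of flushing. */
static void model_destroy(struct model_xprt *x)
{
	if (!model_cancel_work(x)) {
		pthread_mutex_lock(&x->lock);
		while (x->connecting)
			pthread_cond_wait(&x->state_changed, &x->lock);
		pthread_mutex_unlock(&x->lock);
	}
	/* from here on it is safe to tear the transport down */
}

/* Work already running: destroy must not return before it completes. */
static int destroy_while_running(void)
{
	struct model_xprt x;
	pthread_t worker;
	int ran;

	model_xprt_init(&x, false);
	if (pthread_create(&worker, NULL, model_connect_worker, &x) != 0)
		return -1;
	model_destroy(&x);
	pthread_mutex_lock(&x.lock);
	ran = x.connect_ran ? 1 : 0;
	pthread_mutex_unlock(&x.lock);
	pthread_join(worker, NULL);
	return ran;
}

/* Work still pending: cancel succeeds, so destroy returns without waiting. */
static int destroy_while_pending(void)
{
	struct model_xprt x;

	model_xprt_init(&x, true);
	model_destroy(&x);
	return x.connect_ran ? 1 : 0;
}
```

In the sketch, `destroy_while_running()` only returns after the worker has cleared the flag, and `destroy_while_pending()` returns immediately because the cancel succeeded. Neither path flushes a queue, which is the property that avoids the self-flush deadlock of the earlier submission.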

Original patch description follows:

-------------[snip]-----------------
Backported from upstream commit c1384c9c4c184543375b52a0997d06cd98145164:

Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Thu Jun 14 18:00:42 2007 -0400

    SUNRPC: fix hang due to eventd deadlock...

Brian Behlendorf writes:

The root cause of the NFS hang we were observing appears to be a rare
deadlock between the kernel provided usermodehelper API and the linux NFS
client.  The deadlock can arise because both of these services use the
generic linux work queues.  The usermodehelper API runs the specified user
application in the context of the work queue.  And NFS submits both cleanup
and reconnect work to the generic work queue for handling.  Normally this
is fine but a deadlock can result in the following situation.

  - NFS client is in a disconnected state
  - [events/0] runs a usermodehelper app with an NFS dependent operation,
    this triggers an NFS reconnect.
  - NFS reconnect happens to be submitted to [events/0] work queue.
  - Deadlock, the [events/0] work queue will never process the
    reconnect because it is blocked on the previous NFS dependent
    operation which will not complete.

The solution is simply to run reconnect requests on rpciod.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Steve Dickson <SteveD@redhat.com>
-------------[snip]-----------------

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index c3be5bc..af49932 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -303,6 +303,7 @@ static inline void xprt_clear_connecting(struct rpc_xprt *xprt)
 	smp_mb__before_clear_bit();
 	clear_bit(XPRT_CONNECTING, &xprt->state);
 	smp_mb__after_clear_bit();
+	wake_up_bit(&xprt->state, XPRT_CONNECTING);
 }
 
 static inline int xprt_connecting(struct rpc_xprt *xprt)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 27fd06b..40dbb96 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -127,7 +127,7 @@ static void xprt_clear_locked(struct rpc_xprt *xprt)
 		clear_bit(XPRT_LOCKED, &xprt->state);
 		smp_mb__after_clear_bit();
 	} else
-		schedule_work(&xprt->task_cleanup);
+		queue_work(rpciod_workqueue, &xprt->task_cleanup);
 }
 
 /*
@@ -516,7 +516,7 @@ xprt_init_autodisconnect(unsigned long data)
 	if (xprt_connecting(xprt))
 		xprt_release_write(xprt, NULL);
 	else
-		schedule_work(&xprt->task_cleanup);
+		queue_work(rpciod_workqueue, &xprt->task_cleanup);
 	return;
 out_abort:
 	spin_unlock(&xprt->transport_lock);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 6aac4df..8a2c717 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -476,6 +476,13 @@ clear_close_wait:
 	smp_mb__after_clear_bit();
 }
 
+static int
+xs_wait_bit_uninterruptible(void *word)
+{
+	schedule();
+	return 0;
+}
+
 /**
  * xs_destroy - prepare to shutdown a transport
  * @xprt: doomed transport
@@ -485,8 +492,9 @@ static void xs_destroy(struct rpc_xprt *xprt)
 {
 	dprintk("RPC:      xs_destroy xprt %p\n", xprt);
 
-	cancel_delayed_work(&xprt->connect_worker);
-	flush_scheduled_work();
+	if (!cancel_delayed_work(&xprt->connect_worker))
+		wait_on_bit(&xprt->state, XPRT_CONNECTING,
+			    xs_wait_bit_uninterruptible, TASK_UNINTERRUPTIBLE);
 
 	xprt_disconnect(xprt);
 	xs_close(xprt);
@@ -837,7 +845,7 @@ static void xs_tcp_state_change(struct sock *sk)
 		/* Try to schedule an autoclose RPC calls */
 		set_bit(XPRT_CLOSE_WAIT, &xprt->state);
 		if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
-			schedule_work(&xprt->task_cleanup);
+			queue_work(rpciod_workqueue, &xprt->task_cleanup);
 	default:
 		xprt_disconnect(xprt);
 	}
@@ -1232,14 +1240,14 @@ static void xs_connect(struct rpc_task *task)
 	if (xprt->sock != NULL) {
 		dprintk("RPC:      xs_connect delayed xprt %p for %lu seconds\n",
 				xprt, xprt->reestablish_timeout / HZ);
-		schedule_delayed_work(&xprt->connect_worker,
+		queue_delayed_work(rpciod_workqueue, &xprt->connect_worker,
 					xprt->reestablish_timeout);
 		xprt->reestablish_timeout <<= 1;
 		if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
 			xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;
 	} else {
 		dprintk("RPC:      xs_connect scheduled xprt %p\n", xprt);
-		schedule_work(&xprt->connect_worker);
+		queue_work(rpciod_workqueue, &xprt->connect_worker);
 
 		/* flush_scheduled_work can sleep... */
 		if (!RPC_IS_ASYNC(task))