From: Brad Peters <bpeters@redhat.com>
Date: Wed, 20 Aug 2008 15:12:56 -0400
Subject: [openib] ehca: local CA ACK delay has an invalid value
Message-id: 20080820191256.17528.64811.sendpatchset@squad5-lp1.lab.bos.redhat.com
O-Subject: [PATCH RHEL5.3 458378] IB/ehca:Local CA ACK Delay is set to a invalid value
Bugzilla: 458378
RH-Acked-by: Rik van Riel <riel@redhat.com>
RH-Acked-by: David Howells <dhowells@redhat.com>
RH-Acked-by: Doug Ledford <dledford@redhat.com>

RHBZ#:
======
https://bugzilla.redhat.com/show_bug.cgi?id=458378

Description:
===========
Bug fix / PPC only (as only PPC uses ehca)

Note: This patch depends on the two patches from RHBZ #443800, which are
being rolled into an OFED update by Doug Ledford.

During cluster testing we saw that some InfiniBand hardware returns an
invalid value of 0 to the device driver in the query_device() call for the
Local CA ACK Delay. This invalid value results in a wrong ACK Timeout value
for RC QPs, because applications use the Local CA ACK Delay value to
calculate the timeout. With the wrong timeout value, many RC connections are
dropped because the adapter's wait time for packet acknowledgement is too
short. The likelihood of hitting this issue increases with the size of the
InfiniBand cluster and the workload running on it.

This patch checks whether we get an invalid value (0) for Local CA ACK Delay
and substitutes a default value in that case.

RHEL Version Found:
================
RHEL 5.2

kABI Status:
============
Will test once Brew recovers

Brew:
=====
Unable to build since Brew is down

Upstream Status:
================
Posted and applied: http://lkml.org/lkml/2008/7/21/128

Test Status:
============
Tested by Stefan Roscher <IBM> by setting up a cluster environment, varying
the workload, and checking for connection drops.

===============================================================
Brad Peters 1-978-392-1000 x 23183
IBM on-site partner.

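For illustration only (not part of the patch), a minimal sketch of why a
reported value of 0 is harmful. It assumes the InfiniBand encoding in which a
delay value n stands for 4.096 usec * 2^n. The helper names
ib_delay_usec() and sanitize_local_ca_ack_delay() are hypothetical and do not
exist in the ehca driver. How an application combines this value with other
terms to derive its QP timeout is application-specific and is not modelled
here.

```c
#include <stdint.h>

/*
 * Convert an encoded InfiniBand delay value to microseconds.
 * Assumption: the value n encodes 4.096 usec * 2^n, and n is a 5-bit
 * field, so anything above 31 is clamped before shifting.
 */
static double ib_delay_usec(uint8_t n)
{
	if (n > 31)
		n = 31;
	return 4.096 * (double)(1ULL << n);
}

/*
 * Hypothetical helper mirroring the idea of the fix: treat a
 * firmware-reported value of 0 as invalid and fall back to 12.
 */
static uint8_t sanitize_local_ca_ack_delay(uint8_t fw_value)
{
	return fw_value ? fw_value : 12;
}
```

Under that encoding, a value of 0 corresponds to about 4 usec, while the
fallback of 12 corresponds to 4.096 usec * 4096, roughly 16.8 msec, which
matches the "-> 16 msec" noted in the patch description.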
Proposed Patch:
===============
This patch is based on 2.6.18-104.el5

Some firmware versions report a Local CA ACK Delay of 0. In that case,
return a more sensible default value of 12 (-> 16 msec) instead.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>

diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index f860eb3..c04cbb1 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -102,8 +102,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	}
 
 	props->max_pkeys = 16;
-	props->local_ca_ack_delay
-		= rblock->local_ca_ack_delay;
+	/* Some FW versions say 0 here; insert sensible value in that case */
+	props->local_ca_ack_delay = rblock->local_ca_ack_delay ?
+		min_t(u8, rblock->local_ca_ack_delay, 255) : 12;
 	props->max_raw_ipv6_qp
 		= min_t(unsigned, rblock->max_raw_ipv6_qp, INT_MAX);
 	props->max_raw_ethy_qp