Sophie: freeipmi-0.5.1-7.el5 x86

freeipmi-0.5.1-7.el5.x86_64.rpm

Using Hostrange Input/Output in HPC environments

by

Albert Chu
chu11@llnl.gov

1) Introduction with Pdsh
-------------------------

Much of the hostrange input/output in FreeIPMI is based on the tool
pdsh (http://pdsh.sourceforge.net).  Pdsh is a parallel shell utility
which allows you to execute an arbitrary command across a cluster.
Algorithmically, pdsh creates a sliding window of threads, each which
generates a remote shell using an underlying 'rcmd" functionality
(such as rcmd(3) or ssh(1)).  As threads complete, the new threads
launch the command on other hosts until the command has been executed
on all hosts specified.

It is utilized at Lawrence Livermore National Laboratory (LLNL) on
clusters ranging from 4 to 1152 nodes.  Commands are capable of being
executed across the entire cluster in the matter of seconds rather
then minutes it would take to execute serially in a shell prompt.

Here's an example of pdsh at work on a small cluster.

> pdsh -w "wopr[0-5]" hostname
wopr0: wopr0
wopr1: wopr1
wopr2: wopr2
wopr3: wopr3
wopr5: wopr5
wopr4: wopr4

Now, determining the hostname of every node in your cluster isn't too
useful or interesting.  However, perhaps you want to determine if
every node of your cluster booted with the same kernel.

> pdsh -w "wopr[0-5]" uname -r
wopr1: 2.6.9-65chaos
wopr0: 2.6.9-65chaos
wopr5: 2.6.9-65chaos
wopr2: 2.6.9-65chaos
wopr4: 2.6.9-65chaos
wopr3: 2.6.9-65chaos

Seems pretty useful. However, on larger clusters, this type of output
will get pretty large, especially if the command generates greater
than 1 line of output for each node.  Lets say I want to determine if
the same config file has been configured on every node of the cluster.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config"
wopr1: foo=/usr
wopr1: bar=/tmp
wopr1: baz=/etc
wopr1: xyzzy=static
wopr1:
wopr0: foo=/usr
wopr0: bar=/tmp
wopr0: baz=/etc
wopr0: xyzzy=static
wopr0:
wopr2: foo=/usr
wopr2: bar=/tmp
wopr2: baz=/etc
wopr2: xyzzy=dynamic
wopr2:
wopr4: foo=/usr
wopr4: bar=/tmp
wopr4: baz=/etc
wopr4: xyzzy=static
wopr4:
wopr5: foo=/usr
wopr5: bar=/tmp
wopr5: baz=/etc
wopr5: xyzzy=static
wopr5:
wopr3: foo=/usr
wopr3: bar=/tmp
wopr3: baz=/etc
wopr3: xyzzy=static
wopr3:

It's beginning to get pretty long and perhaps a bit hard to digest.
Pdsh also comes with a tool called dshbak for buffering this output to
make it more human readable.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config" | dshbak
----------------
wopr1
----------------
 foo=/usr
 bar=/tmp
 baz=/etc
 xyzzy=static

----------------
wopr3
----------------
 foo=/usr
 bar=/tmp
 baz=/etc
 xyzzy=static
<snip - more of the same stuff>

This is a much nicer output to read.  However, if you have a much
larger cluster (or possibly much larger output), this type of output
will still be quite difficult to handle.  Dshbak also comes with a
consolidation function to shorten the output.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config" | dshbak -c
----------------
wopr[0-1,3-5]
----------------
 foo=/usr
 bar=/tmp
 baz=/etc
 xyzzy=static

----------------
wopr2
----------------
 foo=/usr
 bar=/tmp
 baz=/etc
 xyzzy=dynamic

We see that for this particular pretend cluster config file, one
node's configuration is different.

Another problem that often comes up with large clusters is that nodes
are removed from the cluster for servicing or are down due to hardware
problems, hangs, crashes, etc.  So tools like pdsh can often sit and
eventually time out on those nodes that are removed or have problems.

In the cluster used in this example, wopr6 is a node that is currently
down and times out after awhile when you use pdsh.

> time pdsh -w "wopr[0-6]" hostname
wopr0: wopr0
wopr1: wopr1
wopr4: wopr4
wopr2: wopr2
wopr5: wopr5
wopr3: wopr3
pdsh@wopri: wopr6: mcmd: connect failed: No route to host

real    0m3.007s
user    0m0.003s
sys     0m0.007s

However, your average user may not know wopr6 is down, or does not
wish to continaully remove problem nodes (in this case wopr6) from the
list of nodes to communicate with.

The -v option in pdsh is used to selectively eliminate those nodes
that are considered down by whatsup and the libnodeupdown library
(http://whatsup.sourceforge.net).

Whatsup currently shows that wopr6 is down.

> whatsup
up:   7: wopr[0-5],wopri
down: 1: wopr6

So the -v option will have pdsh skip wopr6 automatically.

> time pdsh -v -w "wopr[0-6]" hostname
wopr1: wopr1
wopr0: wopr0
wopr2: wopr2
wopr5: wopr5
wopr4: wopr4
wopr3: wopr3

real    0m0.034s
user    0m0.005s
sys     0m0.012s

The time differences may not seem like much difference here in these
examples.  But think of when this is done across an extremeley large
cluster.

2) Hostrange input/output in FreeIPMI
-------------------------------------

Much of the hostrange input/output can be handled by running FreeIPMI
tools with pdsh.  However, pdsh requires that a shell be executed on
the remote node.  This can disrupt the CPU of running jobs on the
cluster and removes the advantages that IPMI over LAN does not
interrupt a CPU.

Starting with FreeIPMI 0.4.0, hostrange support has been added into
ipmi-chassis, ipmi-fru, ipmi-raw, ipmi-sensors, ipmi-sel, bmc-info,
and ipmimonitoring (it has existed in ipmipower since 0.1.0).  More
than one node at a time can be specified on the command line using the
hostrange format similar in pdsh.  Using a threaded model similar to
pdsh, each of the tools will create a sliding-window of threads, each
executing out-of-band IPMI in parallel.  The number of threads in the
window can be increased or decreased using the fanout -F option.

The tools now have similar functionality to pdsh, but all of the IPMI
communication is done out-of-band.  Ipmipower, which supported
hostranges since 0.1.0, has had some of its options and output
modified to to be consistent with the other tools.

(Note: On our test cluster, 'pwopr' hostnames have been used instead
of 'wopr' for configuring the IPMI IP addresses.  We have also XXXed
out our local usernames and passwords of course :-)

For example:

> ipmi-sensors -h "pwopr[0-5]" -u XXX -p YYY -s 10
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]

Dshback functionality has been added with the -B (--buffered) and -C
(--consolidated) options.

> bmc-info -h "pwopr[0-5]" -u XXX -p YYY -B
----------------
pwopr5
----------------
Device ID:         22
Device Revision:   1
Firmware Revision: 1.12
                   [Device Available (normal operation)]
IPMI Version:      2.0
Additional Device Support:
                   [Sensor Device]
                   [SDR Repository Device]
                   [SEL Device]
                   [FRU Inventory Device]
                   [Chassis Device]
Manufacturer ID:   28C5h
Product ID:        4h
Aux Firmware Revision Info: 38420000h
Channel Information:
       Channel No: 1
      Medium Type: 802.3 LAN
    Protocol Type: IPMB-1.0
       Channel No: 5
      Medium Type: Asynch. Serial/Modem (RS-232)
    Protocol Type: IPMB-1.0
<snip - there's a lot more of the same stuff>

> bmc-info -h "pwopr[0-5]" -u XXX -p YYY -C
----------------
pwopr[0-1,5]
----------------
Device ID:         22
Device Revision:   1
Firmware Revision: 1.12
                   [Device Available (normal operation)]
IPMI Version:      2.0
Additional Device Support:
                   [Sensor Device]
                   [SDR Repository Device]
                   [SEL Device]
                   [FRU Inventory Device]
                   [Chassis Device]
Manufacturer ID:   28C5h
Product ID:        4h
Aux Firmware Revision Info: 38420000h
Channel Information:
       Channel No: 1
      Medium Type: 802.3 LAN
    Protocol Type: IPMB-1.0
       Channel No: 5
      Medium Type: Asynch. Serial/Modem (RS-232)
    Protocol Type: IPMB-1.0
<snip - different firmware for pwopr[2-4]>

If you have happened to install pdsh on your system, you may use
dshbak instead of the -B or -C option.  The -B and -C options were
added since many users may have not installed pdsh.

A whatsup-like tool and library have also been developed called
ipmidetect.  It performs a similar functionality to whatsup, but
instead detects what IPMI nodes exist in the cluster for faster
hostranged output.  The tool requires the ipmidetectd daemon be setup
and configured on the client (see ipmidetectd(8) and
ipmidetectd.conf(5) for more information).  The ipmidetectd daemon
regularly ipmipings remote nodes.  The ipmidetect tool and library
will determine detected vs. undetected ipmi systems based on the most
recent ipmipings received.

> /usr/sbin/ipmidetect
detected:   6: pwopr[0-5]
undetected: 1: pwopr6

For example, we re-introduce the bad 'pwopr6' node into the hostrange:

> time ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -s 10
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr6: ipmi_open_outofband(): Connection timed out
real    0m25.000s
user    0m0.029s
sys     0m0.003s

Running with the -E option (and assuming ipmidetectd has been setup
and is running) the -E option equickly eliminates pwopr6.

> time ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -s 10 -E
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]

real    0m0.113s
user    0m0.030s
sys     0m0.003s

Notice the large affect this has on the time for the command to
complete.

3) Suggested use of hostrange input/output in FreeIPMI
------------------------------------------------------

Unlike pdsh, where you can run an arbitrary shell command, each
FreeIPMI tool has a relatively fixed type of output or sets of outputs
you can run.  Based on the features run or the output of the command,
the hostrange input/output will likely be used differently dependent
with the tool.  The following are some suggestions.  They are they
ways most will use the hostrange input/output.

bmc-info:

When using hostranges, you are probably trying to verify the firmware
version or hardware type for each BMC in your cluster.  You probably
want to run bmc-info with the consolidated output (-C) set most of the
time.

ipmi-sel:

Each node will likely have drastically different ipmi-sel output and a
massive amount of it.  Therefore buffered or consolidated output will
not be very useful.  The hostrange input is most useful for gathering
the SEL output of the entire cluster quickly and out-of-band.  You can
then grep for some type of error condition you are specifically
looking for or pipe it into a log monitoring utility.

The hostrange functionality is also very useful to quickly clear the
SEL logs across the entire cluster.

ipmi-raw:

The output of ipmi-raw will likely be only 1 long line.  The
consolidated output is likely what you're interested in using.

ipmi-sensors:

Each node of the cluster will likely have slightly different
temperatures, voltages, etc.  Therefore you may wish to run
ipmi-sensors with the -q option to make it easier to consolidate
output.

> ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -g temperature -E -C -q
----------------
pwopr[2-3]
----------------
4: CPU1 Temp: [OK]
5: CPU2 Temp: [OK]
6: CPU3 Temp: [OK]
7: CPU4 Temp: [OK]
8: Sys Temp: [OK]
----------------
pwopr[0-1,4-5]
----------------
3: CPU1 Temp: [OK]
4: CPU2 Temp: [OK]
5: CPU3 Temp: [OK]
6: CPU4 Temp: [OK]
7: Sys Temp: [OK]

(Note: the firmware on this particular cluster is not-consistent
across it, leading to the different sensor record ids in the above.)

Based on what you see, you can of course dig deeper on those
individual nodes.  I imagine many users will want to run ipmi-sensors
with the default output (each line of output is prepended with
"hostname: ").  In this mode, key error messages and the node it came
from can be easily monitored along w/ grep and sed in scripts.

ipmimonitoring:

Ipmimonitoring interprets sensor readings into NOMINAL, WARNING, and
CRITICAL states.  It performs similar functionality of ipmi-sensors,
but gives you a somewhat fixed set of strings to more easily grep.
This allows you to use hostranged input/output with Ipmimonitoring to
quickly monitor your cluster out-of-band.

The -q option will also allow you suppress sensor readings to make
things easier for consolidating the output.

I imagine many users will want to run ipmimonitoring with the default
output (each line of output is prepended with "hostname: ").  In this
mode, you can easily grep for "Warning" and "Critical" and determine
which nodes have problems via scripts.

4) Exceptions to the hostrange support in FreeIPMI
--------------------------------------------------

The hostrange input/output is not been supported in bmc-config.  There
are two major reasons for this:

1) Almost always, bmc-config must be run in-band before it can be used
out-of-band.  This is because you can't run out-of-band until the BMC
is configured with atleast the IP address and MAC address.

2) Each BMC in the cluster must be configured with a different IP
address and MAC address.  So the parallelism that the hostrange input
gives you effectively cannot be used when trying to use bmc-config's
--commit option to configure a cluster using one config file.

An alternate hostrange output model is supported in ipmipower.  This
is due partly to legacy but mostly due to technical reasons.
Ipmipower was developed with a different architecture than bmc-info,
ipmi-sensors, ipmi-sel, etc. because of advanced scalability needs and
the need for it to interact with Powerman.  So it cannot use the
parallel stdout libraries developed.  It effectively emulates the
--consolidate-output functionality of the other tools.  A buffered
output option would not make any technical sense since every output
from ipmipower is 1 line long.