Using Hostrange Input/Output in HPC environments

by Albert Chu
chu11@llnl.gov

1) Introduction with Pdsh
-------------------------

Much of the hostrange input/output in FreeIPMI is based on the tool
pdsh (http://pdsh.sourceforge.net). Pdsh is a parallel shell utility
which allows you to execute an arbitrary command across a cluster.
Algorithmically, pdsh creates a sliding window of threads, each of
which generates a remote shell using an underlying "rcmd"
functionality (such as rcmd(3) or ssh(1)). As threads complete, new
threads launch the command on other hosts until the command has been
executed on all hosts specified.

It is utilized at Lawrence Livermore National Laboratory (LLNL) on
clusters ranging from 4 to 1152 nodes. Commands can be executed across
the entire cluster in a matter of seconds rather than the minutes it
would take to execute them serially at a shell prompt.

Here's an example of pdsh at work on a small cluster.

> pdsh -w "wopr[0-5]" hostname
wopr0: wopr0
wopr1: wopr1
wopr2: wopr2
wopr3: wopr3
wopr5: wopr5
wopr4: wopr4

Now, determining the hostname of every node in your cluster isn't too
useful or interesting. However, perhaps you want to determine if every
node of your cluster booted with the same kernel.

> pdsh -w "wopr[0-5]" uname -r
wopr1: 2.6.9-65chaos
wopr0: 2.6.9-65chaos
wopr5: 2.6.9-65chaos
wopr2: 2.6.9-65chaos
wopr4: 2.6.9-65chaos
wopr3: 2.6.9-65chaos

Seems pretty useful. However, on larger clusters, this type of output
will get pretty large, especially if the command generates more than
one line of output for each node. Let's say I want to determine
whether every node of the cluster has the same config file.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config"
wopr1: foo=/usr
wopr1: bar=/tmp
wopr1: baz=/etc
wopr1: xyzzy=static
wopr1:
wopr0: foo=/usr
wopr0: bar=/tmp
wopr0: baz=/etc
wopr0: xyzzy=static
wopr0:
wopr2: foo=/usr
wopr2: bar=/tmp
wopr2: baz=/etc
wopr2: xyzzy=dynamic
wopr2:
wopr4: foo=/usr
wopr4: bar=/tmp
wopr4: baz=/etc
wopr4: xyzzy=static
wopr4:
wopr5: foo=/usr
wopr5: bar=/tmp
wopr5: baz=/etc
wopr5: xyzzy=static
wopr5:
wopr3: foo=/usr
wopr3: bar=/tmp
wopr3: baz=/etc
wopr3: xyzzy=static
wopr3:

It's beginning to get pretty long and perhaps a bit hard to digest.
Pdsh also comes with a tool called dshbak for buffering this output to
make it more human readable.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config" | dshbak
----------------
wopr1
----------------
foo=/usr
bar=/tmp
baz=/etc
xyzzy=static

----------------
wopr3
----------------
foo=/usr
bar=/tmp
baz=/etc
xyzzy=static

<snip - more of the same stuff>

This is a much nicer output to read. However, if you have a much
larger cluster (or possibly much larger output), this type of output
will still be quite difficult to handle. Dshbak also comes with a
consolidation function to shorten the output.

> pdsh -w "wopr[0-5]" "cat /tmp/pretend_config" | dshbak -c
----------------
wopr[0-1,3-5]
----------------
foo=/usr
bar=/tmp
baz=/etc
xyzzy=static

----------------
wopr2
----------------
foo=/usr
bar=/tmp
baz=/etc
xyzzy=dynamic

We see that for this particular pretend cluster config file, one
node's configuration is different.

Another problem that often comes up with large clusters is that nodes
are removed from the cluster for servicing or are down due to hardware
problems, hangs, crashes, etc. Tools like pdsh can then sit and
eventually time out on the nodes that are removed or have problems.

In the cluster used in this example, wopr6 is a node that is currently
down and times out after a while when you use pdsh.

> time pdsh -w "wopr[0-6]" hostname
wopr0: wopr0
wopr1: wopr1
wopr4: wopr4
wopr2: wopr2
wopr5: wopr5
wopr3: wopr3
pdsh@wopri: wopr6: mcmd: connect failed: No route to host

real 0m3.007s
user 0m0.003s
sys 0m0.007s

However, your average user may not know wopr6 is down, or may not wish
to continually remove problem nodes (in this case wopr6) from the list
of nodes to communicate with. The -v option in pdsh is used to
selectively eliminate those nodes that are considered down by whatsup
and the libnodeupdown library (http://whatsup.sourceforge.net).

Whatsup currently shows that wopr6 is down.

> whatsup
up: 7: wopr[0-5],wopri
down: 1: wopr6

So the -v option will have pdsh skip wopr6 automatically.

> time pdsh -v -w "wopr[0-6]" hostname
wopr1: wopr1
wopr0: wopr0
wopr2: wopr2
wopr5: wopr5
wopr4: wopr4
wopr3: wopr3

real 0m0.034s
user 0m0.005s
sys 0m0.012s

The time difference may not seem like much in these examples, but
consider the effect when this is done across an extremely large
cluster.

2) Hostrange input/output in FreeIPMI
-------------------------------------

Much of the hostrange input/output can be handled by running FreeIPMI
tools with pdsh. However, pdsh requires that a shell be executed on
the remote node. This can take CPU time away from jobs running on the
cluster and removes the advantage of IPMI over LAN, which does not
interrupt the CPU.

Starting with FreeIPMI 0.4.0, hostrange support has been added into
ipmi-chassis, ipmi-fru, ipmi-raw, ipmi-sensors, ipmi-sel, bmc-info,
and ipmimonitoring (it has existed in ipmipower since 0.1.0). More
than one node at a time can be specified on the command line using a
hostrange format similar to that of pdsh. Using a threaded model
similar to pdsh, each of the tools will create a sliding window of
threads, each executing out-of-band IPMI in parallel. The number of
threads in the window can be increased or decreased using the fanout
-F option.
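
The hostrange expansion and the sliding window of threads described
above can be illustrated with a small Python sketch. This is not how
pdsh or the FreeIPMI tools are implemented; expand_hostrange,
run_with_fanout, and ConcurrencyProbe are hypothetical names, the
"query" is a stand-in that only sleeps, and the expansion handles only
a single bracketed group of plain (not zero-padded) numbers.

```python
import re
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def expand_hostrange(spec):
    """Expand a simple 'prefix[a-b,c]' hostrange into a list of hostnames.

    Assumption: one bracketed group, no zero padding, no comma-separated
    list of several ranges.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", spec)
    if match is None:
        return [spec]  # a plain hostname with no bracketed range
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number such as "3"
        for n in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{n}{suffix}")
    return hosts


def run_with_fanout(hosts, task, fanout=4):
    """Run task(host) for every host, with at most `fanout` running at once.

    The pool starts the next host as soon as a worker thread finishes,
    which gives the sliding-window behaviour.
    """
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        futures = {host: pool.submit(task, host) for host in hosts}
        return {host: future.result() for host, future in futures.items()}


class ConcurrencyProbe:
    """Stand-in for a per-host query; records the peak number of
    simultaneous calls so the window size can be observed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, host):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # pretend to wait on the network
        with self._lock:
            self._active -= 1
        return f"{host}: ok"


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    results = run_with_fanout(expand_hostrange("pwopr[0-6]"), probe, fanout=3)
    for host in sorted(results):
        print(results[host])
    print("peak concurrent queries:", probe.peak)
```

A larger fanout finishes sooner on a healthy cluster but opens more
simultaneous connections; a smaller one is gentler on the network.
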
The tools now have similar functionality to pdsh, but all of the IPMI
communication is done out-of-band.

Ipmipower, which has supported hostranges since 0.1.0, has had some of
its options and output modified to be consistent with the other tools.

(Note: On our test cluster, 'pwopr' hostnames have been used instead
of 'wopr' for configuring the IPMI IP addresses. We have also XXXed
out our local usernames and passwords of course :-)

For example:

> ipmi-sensors -h "pwopr[0-5]" -u XXX -p YYY -s 10
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]

Dshbak functionality has been added with the -B (--buffered) and -C
(--consolidated) options.

> bmc-info -h "pwopr[0-5]" -u XXX -p YYY -B
----------------
pwopr5
----------------
Device ID: 22
Device Revision: 1
Firmware Revision: 1.12
  [Device Available (normal operation)]
IPMI Version: 2.0
Additional Device Support:
  [Sensor Device]
  [SDR Repository Device]
  [SEL Device]
  [FRU Inventory Device]
  [Chassis Device]
Manufacturer ID: 28C5h
Product ID: 4h
Aux Firmware Revision Info: 38420000h
Channel Information:
  Channel No: 1
  Medium Type: 802.3 LAN
  Protocol Type: IPMB-1.0
  Channel No: 5
  Medium Type: Asynch. Serial/Modem (RS-232)
  Protocol Type: IPMB-1.0

<snip - there's a lot more of the same stuff>

> bmc-info -h "pwopr[0-5]" -u XXX -p YYY -C
----------------
pwopr[0-1,5]
----------------
Device ID: 22
Device Revision: 1
Firmware Revision: 1.12
  [Device Available (normal operation)]
IPMI Version: 2.0
Additional Device Support:
  [Sensor Device]
  [SDR Repository Device]
  [SEL Device]
  [FRU Inventory Device]
  [Chassis Device]
Manufacturer ID: 28C5h
Product ID: 4h
Aux Firmware Revision Info: 38420000h
Channel Information:
  Channel No: 1
  Medium Type: 802.3 LAN
  Protocol Type: IPMB-1.0
  Channel No: 5
  Medium Type: Asynch. Serial/Modem (RS-232)
  Protocol Type: IPMB-1.0

<snip - different firmware for pwopr[2-4]>

If you happen to have pdsh installed on your system, you may use
dshbak instead of the -B or -C option. The -B and -C options were
added because many users may not have pdsh installed.

A whatsup-like tool and library called ipmidetect have also been
developed. Ipmidetect performs a function similar to whatsup, but
instead detects which IPMI nodes exist in the cluster, for faster
hostranged output. The tool requires that the ipmidetectd daemon be
set up and configured on the client (see ipmidetectd(8) and
ipmidetectd.conf(5) for more information). The ipmidetectd daemon
regularly ipmipings remote nodes. The ipmidetect tool and library
determine detected vs. undetected IPMI systems based on the most
recent ipmipings received.

> /usr/sbin/ipmidetect
detected: 6: pwopr[0-5]
undetected: 1: pwopr6

For example, we re-introduce the bad 'pwopr6' node into the hostrange:

> time ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -s 10
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr6: ipmi_open_outofband(): Connection timed out

real 0m25.000s
user 0m0.029s
sys 0m0.003s

Running with the -E option (and assuming ipmidetectd has been set up
and is running) quickly eliminates pwopr6.

> time ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -s 10 -E
pwopr0: 10: CPU3 Vcore (Voltage): 1.31 V (1.04/1.65): [OK]
pwopr2: 10: CPU2 Vcore (Voltage): 1.32 V (1.06/1.63): [OK]
pwopr1: 10: CPU3 Vcore (Voltage): 1.23 V (1.04/1.65): [OK]
pwopr4: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]
pwopr5: 10: CPU3 Vcore (Voltage): 1.25 V (1.04/1.65): [OK]
pwopr3: 10: CPU2 Vcore (Voltage): 1.26 V (1.06/1.63): [OK]

real 0m0.113s
user 0m0.030s
sys 0m0.003s

Notice the large effect this has on the time for the command to
complete (about 25 seconds without -E versus 0.113 seconds with it).

3) Suggested use of hostrange input/output in FreeIPMI
------------------------------------------------------

Unlike pdsh, where you can run an arbitrary shell command, each
FreeIPMI tool has a relatively fixed type of output or set of outputs.
Depending on the features used and the output of the command, the
hostrange input/output will likely be used differently with each
tool.

The following are some suggestions. They are the ways most users will
likely use the hostrange input/output.

bmc-info: When using hostranges, you are probably trying to verify the
firmware version or hardware type for each BMC in your cluster.
You probably want to run bmc-info with the consolidated output (-C)
set most of the time.

ipmi-sel: Each node will likely have drastically different ipmi-sel
output and a massive amount of it. Therefore buffered or consolidated
output will not be very useful. The hostrange input is most useful for
gathering the SEL output of the entire cluster quickly and
out-of-band. You can then grep for the specific error condition you
are looking for or pipe the output into a log monitoring utility. The
hostrange functionality is also very useful for quickly clearing the
SEL logs across the entire cluster.

ipmi-raw: The output of ipmi-raw will likely be only one long line.
The consolidated output is likely what you're interested in using.

ipmi-sensors: Each node of the cluster will likely have slightly
different temperatures, voltages, etc. Therefore you may wish to run
ipmi-sensors with the -q option to make it easier to consolidate
output.

> ipmi-sensors -h "pwopr[0-6]" -u XXX -p YYY -g temperature -E -C -q
----------------
pwopr[2-3]
----------------
4: CPU1 Temp: [OK]
5: CPU2 Temp: [OK]
6: CPU3 Temp: [OK]
7: CPU4 Temp: [OK]
8: Sys Temp: [OK]
----------------
pwopr[0-1,4-5]
----------------
3: CPU1 Temp: [OK]
4: CPU2 Temp: [OK]
5: CPU3 Temp: [OK]
6: CPU4 Temp: [OK]
7: Sys Temp: [OK]

(Note: the firmware on this particular cluster is not consistent
across nodes, leading to the different sensor record ids in the
above.)

Based on what you see, you can of course dig deeper on those
individual nodes.

I imagine many users will want to run ipmi-sensors with the default
output (each line of output is prepended with "hostname: "). In this
mode, key error messages and the nodes they came from can be easily
monitored with grep and sed in scripts.

ipmimonitoring: Ipmimonitoring interprets sensor readings into
NOMINAL, WARNING, and CRITICAL states. It performs functionality
similar to ipmi-sensors, but gives you a somewhat fixed set of strings
that are easier to grep.
This allows you to use hostranged input/output with ipmimonitoring to
quickly monitor your cluster out-of-band. The -q option will also
allow you to suppress sensor readings, which makes the output easier
to consolidate.

I imagine many users will want to run ipmimonitoring with the default
output (each line of output is prepended with "hostname: "). In this
mode, you can easily grep for "Warning" and "Critical" and determine
which nodes have problems via scripts.

4) Exceptions to the hostrange support in FreeIPMI
--------------------------------------------------

The hostrange input/output is not supported in bmc-config. There are
two major reasons for this:

1) Almost always, bmc-config must be run in-band before it can be used
   out-of-band. This is because you can't run out-of-band until the
   BMC is configured with at least the IP address and MAC address.

2) Each BMC in the cluster must be configured with a different IP
   address and MAC address. So the parallelism that the hostrange
   input gives you effectively cannot be used when trying to use
   bmc-config's --commit option to configure a cluster using one
   config file.

An alternate hostrange output model is supported in ipmipower. This is
due partly to legacy but mostly to technical reasons. Ipmipower was
developed with a different architecture than bmc-info, ipmi-sensors,
ipmi-sel, etc. because of advanced scalability needs and the need for
it to interact with Powerman. So it cannot use the parallel stdout
libraries that were developed for the other tools. It effectively
emulates the --consolidate-output functionality of the other tools. A
buffered output option would not make any technical sense since every
output from ipmipower is one line long.
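
The consolidated output discussed throughout this document (dshbak -c
and the -C option) can be illustrated with a small Python sketch that
groups "hostname: " prefixed lines by identical output and compresses
the hostnames into a range. It is only an approximation for
illustration, not the code used by dshbak, the FreeIPMI tools, or
ipmipower; group_by_host, compress_hosts, and consolidate are
hypothetical names, and zero-padded host numbers are not handled.

```python
import re
from collections import defaultdict


def group_by_host(lines):
    """Collect the output lines of each host from 'host: text' lines."""
    per_host = defaultdict(list)
    for line in lines:
        host, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'host: text' line; ignored in this sketch
        per_host[host].append(rest[1:] if rest.startswith(" ") else rest)
    return per_host


def compress_hosts(hosts):
    """Turn e.g. wopr0, wopr1, wopr3 into 'wopr[0-1,3]'."""
    by_prefix = defaultdict(list)
    plain = []
    for host in hosts:
        match = re.fullmatch(r"(.*?)(\d+)", host)
        if match:
            by_prefix[match.group(1)].append(int(match.group(2)))
        else:
            plain.append(host)  # no trailing number to fold into a range
    pieces = sorted(plain)
    for prefix, numbers in sorted(by_prefix.items()):
        numbers = sorted(set(numbers))
        if len(numbers) == 1:
            pieces.append(f"{prefix}{numbers[0]}")
            continue
        runs = []
        start = previous = numbers[0]
        for n in numbers[1:]:
            if n != previous + 1:  # a gap closes the current run
                runs.append((start, previous))
                start = n
            previous = n
        runs.append((start, previous))
        body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
        pieces.append(f"{prefix}[{body}]")
    return ",".join(pieces)


def consolidate(lines):
    """Map each compressed hostrange to the output shared by its hosts."""
    groups = defaultdict(list)
    for host, body in group_by_host(lines).items():
        groups[tuple(body)].append(host)
    return {compress_hosts(hosts): list(body) for body, hosts in groups.items()}


if __name__ == "__main__":
    sample = [
        f"wopr{n}: xyzzy={'dynamic' if n == 2 else 'static'}"
        for n in (1, 0, 2, 4, 5, 3)
    ]
    for hostrange, body in consolidate(sample).items():
        print("----------------")
        print(hostrange)
        print("----------------")
        print("\n".join(body))
```

Grouping on the full per-host output means a single differing line
(such as a sensor reading) puts a node in its own group, which is why
suppressing readings with -q makes consolidation more effective.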