Full Coverage in InfiniBand Monitoring with OSWatcher 3.0: IB Monitoring
Recently, I needed to use OSWatcher in our large data warehouse environment running on Solaris OS. After downloading the osw3.tar file (Metalink Note 301137.1) and untarring it, I noticed some new scripts in the bundle. The README explains that these scripts were introduced to track InfiniBand performance and status on the Exadata Database Machine (or on any other server using the IB stack).
In this series of posts, I will try to explain why these scripts matter for successful monitoring of the IB stack and RDS.
oswib.sh
oswib.sh is the first script we will discuss. Its content is:
1: #!/bin/ksh
2: #
3: # IB Diagnostics
4: #
5: #
6: echo "zzz ***"`date` >> $1
7: echo "IB Config on Hosts..." >> $1
8: echo "ibconfig...." >> $1
9: ifconfig >> $1
10: echo "" >> $1
11: echo "ib-bond..." >> $1
12: ib-bond --status >> $1
13: echo "" >> $1
14: echo "ibstat..." >> $1
15: ibstat >> $1
16: echo "" >> $1
17: echo "ibstatus..." >> $1
18: ibstatus >> $1
19: echo "" >> $1
20: echo "lspci -vv..." >> $1
21: lspci -vv |grep InfiniBand -A27 >> $1
22: echo "" >> $1
23: rm locks/iblock.file
Gather Basic Network Information
Let's go through what each statement does. ifconfig at line 9, as you may expect, lists all network interfaces, including the bond interfaces; among them you will see the InfiniBand and IB bond devices (bond1, ib0, and ib2 below). Some crude network statistics are attached to each interface. What matters here are the errors, dropped, and collisions counters for RX/TX. Make sure those values are either 0 or negligible compared to the total number of packets/frames sent/received.
[root@dbkon01:~]# ifconfig
bond0 Link encap:Ethernet HWaddr 00:22:64:F7:12:BC
inet addr:10.210.51.171 Bcast:10.210.51.255 Mask:255.255.255.0
inet6 addr: fe80::222:64ff:fef7:12bc/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:2718135321 errors:0 dropped:0 overruns:0 frame:0
TX packets:2609232342 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:675554897860 (629.1 GiB) TX bytes:1582459760400 (1.4 TiB)
bond0:1 Link encap:Ethernet HWaddr 00:22:64:F7:12:BC
inet addr:10.210.51.172 Bcast:10.210.51.255 Mask:255.255.255.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.16.51.71 Bcast:172.16.51.255 Mask:255.255.255.0
inet6 addr: fe80::216:35ff:ffbf:2b11/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54530368 errors:0 dropped:0 overruns:0 frame:0
TX packets:53991683 errors:0 dropped:29 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11648571830 (10.8 GiB) TX bytes:17492409378 (16.2 GiB)
eth0 Link encap:Ethernet HWaddr 00:22:64:F7:12:BC
inet6 addr: fe80::222:64ff:fef7:12bc/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2242122831 errors:0 dropped:0 overruns:0 frame:0
TX packets:1554180534 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:573128361579 (533.7 GiB) TX bytes:1111525331344 (1.0 TiB)
Interrupt:169 Memory:f8000000-f8012100
eth1 Link encap:Ethernet HWaddr 00:22:64:F7:12:BE
inet6 addr: fe80::222:64ff:fef7:12be/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:476012490 errors:0 dropped:0 overruns:0 frame:0
TX packets:1055051808 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:102426536281 (95.3 GiB) TX bytes:470934429056 (438.5 GiB)
Interrupt:177 Memory:fa000000-fa012100
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet6 addr: fe80::216:35ff:ffbf:2b11/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:54259452 errors:0 dropped:0 overruns:0 frame:0
TX packets:53991677 errors:0 dropped:29 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:11625191478 (10.8 GiB) TX bytes:17492408922 (16.2 GiB)
ib2 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet6 addr: fe80::216:35ff:ffbf:2b05/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:270916 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:23380352 (22.2 MiB) TX bytes:456 (456.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:130512043 errors:0 dropped:0 overruns:0 frame:0
TX packets:130512043 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:17683019151 (16.4 GiB) TX bytes:17683019151 (16.4 GiB)
virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:468 (468.0 b)
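Eyeballing those counters across a dozen interfaces gets tedious, so here is a minimal sketch that scans ifconfig-style output and prints only the nonzero error counters. The check_errors helper is my own invention, not part of OSWatcher:

```shell
# Sketch (not part of OSWatcher): scan ifconfig-style output and print any
# interface whose errors/dropped/overruns/carrier/frame/collisions counter
# is nonzero.
check_errors() {
  awk '
    /Link encap/ { iface = $1 }      # first field of the header line is the interface name
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^(errors|dropped|overruns|carrier|frame|collisions):/) {
          split($i, kv, ":")
          if (kv[2] + 0 > 0) print iface, kv[1], kv[2]
        }
    }'
}

# Fed the bond1 section of the capture above, the only complaint is the
# 29 dropped TX packets on the IB side:
check_errors <<'EOF'
bond1     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          RX packets:54530368 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53991683 errors:0 dropped:29 overruns:0 carrier:0
EOF
# prints: bond1 dropped 29
```

Piping the live ifconfig output through the same filter in a cron job is an easy way to get alerted before those counters grow into a real problem.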
InfiniBand Bonding Information
InfiniBand bonding is somewhat similar to classical network bonding (or aggregation), with some behavioral differences. The major difference is that an InfiniBand bonding interface runs in active/passive mode over the InfiniBand HCAs. No trunking is allowed, as it is with classical Ethernet networking. So if you have two 20 Gbit interfaces, you still have 20 Gbit of theoretical throughput on an active IB network, even though you have two (or more) interfaces. This is easy to see in the ifconfig output above: while the ib0 interface has send/receive statistics, there is almost no traffic running over the ib2 interface.
In case of a failure (or when triggered manually), the bonding interface detects the failure of the active component and fails over to the passive one; just as with Ethernet bonding, you will see an informative warning message in /var/log/messages.
In a successful RAC configuration, the failover duration should be shorter than any CRS or watchdog timeout, because no interconnect traffic (heartbeats or cache fusion) flows while the failover is in progress. If the failover takes too long, whether due to host CPU utilization, a problem in the HCA firmware, a configuration problem at the IB switch, or any other issue, the clusterware or a watchdog will decide the node must be evicted from the cluster to protect cluster integrity.
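OSWatcher does not capture failover events directly, but on Linux the bonding driver exposes the current state in /proc/net/bonding/<bond>; comparing the active slave between samples tells you when a failover happened. A minimal sketch (the active_slave helper name and the sample file are mine, assuming the standard Linux bonding /proc interface):

```shell
# Sketch (assumes the Linux bonding driver /proc interface): report the
# currently active slave of a bond. A change in this value between two
# OSWatcher samples means a failover occurred in between.
active_slave() {
  awk -F': ' '/Currently Active Slave/ { print $2 }' "$1"
}

# Against a captured copy of /proc/net/bonding/bond1:
printf 'Bonding Mode: fault-tolerance (active-backup)\nCurrently Active Slave: ib0\n' > /tmp/bond1.sample
active_slave /tmp/bond1.sample
# prints: ib0
```

On a live system you would call it as active_slave /proc/net/bonding/bond1 and log the result next to the timestamped oswib.sh output.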
To check the current status of IB bonding, oswib.sh uses the ib-bond command (line 12). However, there is a problem with how the command is invoked. Look at the output of ib-bond with the --status option:
[root@dbkon01:~]# ib-bond --status
bond0: 00:22:64:f7:12:bc 10.210.51.171/24 10.210.51.172/24
slave0: eth0 *
slave1: eth1
The bond0 interface has nothing to do with IB bonding; it is the public network bond (the IPs on it are the public and VIP addresses). This seems to be a problem with the ib-bond command, because it also displays non-IB bonding interfaces. To correct this issue, modify the related line of your oswib.sh script to:
ib-bond --status-all
[root@dbkon01:~]# ib-bond --status-all
bond0: 00:22:64:f7:12:bc 10.210.51.171/24 10.210.51.172/24
slave0: eth0 *
slave1: eth1
bond1: 80:00:00:48:fe:80:00:00:00:00:00:00:00:16:35:ff:ff:bf:2b:11 172.16.51.71/24
slave0: ib0 *
slave1: ib2
As you can see, bond1 is the IB bonding interface and ib0 is the active port.
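If you would rather not log the Ethernet bonds at all, the --status-all output can also be filtered down to bonds whose slaves are IB interfaces. A rough sketch (the ib_bonds helper is my own, not an ib-bond option):

```shell
# Sketch: keep only the bonds whose slave lines name ib* interfaces,
# dropping the Ethernet bond0 noise from the ib-bond --status-all output.
ib_bonds() {
  awk '
    /^bond/ { header = $0; shown = 0 }      # remember the bond header line
    /^[[:space:]]*slave.*: ib/ {
      if (!shown) { print header; shown = 1 }
      print
    }'
}
```

ib-bond --status-all | ib_bonds would then print only the bond1 section shown above.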
InfiniBand Adapters & Configuration Information
ibstat is another important command; it displays the status of each IB HCA port. Depending on your configuration, different outputs are possible:
[root@dbkon01:~]# ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.5.0
Hardware version: a0
Node GUID: 0x001635ffffbf2b10
System image GUID: 0x001635ffffbf2b13
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 7
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x001635ffffbf2b11
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x001635ffffbf2b12
CA 'mlx4_1'
CA type: MT25418
Number of ports: 2
Firmware version: 2.5.0
Hardware version: a0
Node GUID: 0x001635ffffbf2b04
System image GUID: 0x001635ffffbf2b07
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 6
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x001635ffffbf2b05
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x001635ffffbf2b06
In this sample configuration there are two 20 Gbit Mellanox HCAs with two ports each. Only Port 1 of each HCA is actively connected to an InfiniBand switch port; the other ports are not in use. This output is specific to our configuration. On Exadata, for example, each compute node and storage cell has a single dual-port HCA, and the two ports are paired for a redundant IB network configuration.
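For monitoring purposes you usually only care when a port that should be up is not, so it helps to reduce the ibstat output to just the offending ports. A small sketch (down_ports is my own helper name; the field names are taken from the ibstat output above):

```shell
# Sketch: walk ibstat output and report every port whose State differs
# from Active. down_ports is my own helper, not an ibstat option.
down_ports() {
  awk '
    /^CA /                      { ca = $2 }                    # adapter name, e.g. mlx4_0
    /^[[:space:]]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
    /State:/                    { if ($2 != "Active") print ca, "port", port, "is", $2 }
  '
}

# Fed the output above, it reports only the two unused ports:
#   'mlx4_0' port 2 is Down
#   'mlx4_1' port 2 is Down
```

In our setup those Down ports are expected; on an Exadata rack, where both ports of the HCA are cabled, any line from this filter would be worth an alert.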
ibstatus is another way to check the same thing; its output also includes the port speeds:
[root@dbkon01:~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0016:35ff:ffbf:2b11
base lid: 0x7
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0016:35ff:ffbf:2b12
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 10 Gb/sec (4X)
Infiniband device 'mlx4_1' port 1 status:
default gid: fe80:0000:0000:0000:0016:35ff:ffbf:2b05
base lid: 0x6
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
Infiniband device 'mlx4_1' port 2 status:
default gid: fe80:0000:0000:0000:0016:35ff:ffbf:2b06
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 10 Gb/sec (4X)
As you know, lspci displays the details of the PCI devices attached to a host. To capture the details of your HCAs, OSWatcher periodically logs the output of this command as well. The output is specific to the HCA you are using; consult your vendor's manuals to interpret some of the fields.
[root@dbkon01:~]# lspci -vv |grep InfiniBand -A27
13:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 209
Region 0: Memory at fdc00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f6800000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Mask- TabSize=256
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+
Device: Latency L0s <64ns, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x8
Capabilities: [100] Unknown (14)
--
19:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 185
Region 0: Memory at fde00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f7000000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Mask- TabSize=256
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+
Device: Latency L0s <64ns, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x8
Capabilities: [100] Unknown (14)
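One thing worth cross-checking in this output is whether the PCIe link can actually feed the HCA. The lspci output shows the link running at 2.5 GT/s with a x8 width; back of the envelope (assuming 8b/10b encoding on both the PCIe lanes and the IB link), 2.5 GT/s per lane times 8 lanes times 0.8 gives 16 Gb/s of payload bandwidth, which matches the 16 Gb/s data rate that a 20 Gb/s (4X DDR) IB link delivers after its own 8b/10b encoding:

```shell
# 8 lanes at 2.5 GT/s with 8b/10b encoding (80% efficiency):
awk 'BEGIN { printf "PCIe x8 payload: %.0f Gb/s\n", 2.5 * 8 * 0.8 }'
# prints: PCIe x8 payload: 16 Gb/s
```

If lspci ever showed a degraded width (say x4 after a reseat), the PCIe side would become the bottleneck well before the IB link does, which is exactly why OSWatcher keeps logging this output over time.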