Full Coverage in Infiniband Monitoring with OSWatcher 3.0: IB Monitoring

Recently, I needed to use OSWatcher in our large data warehouse environment running on Solaris OS. When I downloaded (Metalink Note 301137.1) and untarred the osw3.tar file, I noticed some new scripts in the bundle. After checking the README file in the bundle, I found that the new scripts were introduced to track Infiniband performance and status on the Exadata Database Machine (or any other server using the IB stack).

In this series of posts, I will try to explain why those scripts matter for successful IB stack and RDS monitoring.

oswib.sh

oswib.sh is the first script we will discuss. Its content is:

   1: #!/bin/ksh
   2: #
   3: # IB Diagnostics
   4: #
   5: #
   6: echo "zzz ***"`date` >> $1
   7: echo "IB Config on Hosts..." >> $1
   8: echo "ibconfig...." >> $1
   9: ifconfig >> $1
  10: echo "" >> $1
  11: echo "ib-bond..." >> $1
  12: ib-bond --status >> $1
  13: echo "" >> $1
  14: echo "ibstat..." >> $1
  15: ibstat >> $1
  16: echo "" >> $1
  17: echo "ibstatus..." >> $1
  18: ibstatus >> $1
  19: echo "" >> $1
  20: echo "lspci -vv..." >> $1
  21: lspci -vv |grep InfiniBand -A27 >> $1
  22: echo "" >> $1
  23: rm locks/iblock.file

Gather Basic Network Information

Let’s go through what each statement does. ifconfig at line 9, as you might expect, displays the list of all network interfaces, including the bond interfaces. In the output below, the Infiniband and IB bond devices are the ones reported with "Link encap:InfiniBand" (bond1, ib0 and ib2). You will notice that some crude network statistics are attached to each interface. What matters here are the errors, dropped and collisions statistics for RX/TX. Ensure that those values are either 0 or negligible compared to the total number of packets/frames sent and received.

[root@dbkon01:~]# ifconfig

bond0     Link encap:Ethernet  HWaddr 00:22:64:F7:12:BC 
          inet addr:10.210.51.171  Bcast:10.210.51.255  Mask:255.255.255.0

          inet6 addr: fe80::222:64ff:fef7:12bc/64 Scope:Link

          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

          RX packets:2718135321 errors:0 dropped:0 overruns:0 frame:0

          TX packets:2609232342 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:675554897860 (629.1 GiB)  TX bytes:1582459760400 (1.4 TiB)

bond0:1   Link encap:Ethernet  HWaddr 00:22:64:F7:12:BC 
          inet addr:10.210.51.172  Bcast:10.210.51.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bond1     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
          inet addr:172.16.51.71  Bcast:172.16.51.255  Mask:255.255.255.0

          inet6 addr: fe80::216:35ff:ffbf:2b11/64 Scope:Link

          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

          RX packets:54530368 errors:0 dropped:0 overruns:0 frame:0

          TX packets:53991683 errors:0 dropped:29 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:11648571830 (10.8 GiB)  TX bytes:17492409378 (16.2 GiB)

eth0      Link encap:Ethernet  HWaddr 00:22:64:F7:12:BC 
          inet6 addr: fe80::222:64ff:fef7:12bc/64 Scope:Link

          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

          RX packets:2242122831 errors:0 dropped:0 overruns:0 frame:0

          TX packets:1554180534 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:573128361579 (533.7 GiB)  TX bytes:1111525331344 (1.0 TiB)

          Interrupt:169 Memory:f8000000-f8012100

eth1      Link encap:Ethernet  HWaddr 00:22:64:F7:12:BE 
          inet6 addr: fe80::222:64ff:fef7:12be/64 Scope:Link

          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

          RX packets:476012490 errors:0 dropped:0 overruns:0 frame:0

          TX packets:1055051808 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:102426536281 (95.3 GiB)  TX bytes:470934429056 (438.5 GiB)

          Interrupt:177 Memory:fa000000-fa012100

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
          inet6 addr: fe80::216:35ff:ffbf:2b11/64 Scope:Link

          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

          RX packets:54259452 errors:0 dropped:0 overruns:0 frame:0

          TX packets:53991677 errors:0 dropped:29 overruns:0 carrier:0

          collisions:0 txqueuelen:256

          RX bytes:11625191478 (10.8 GiB)  TX bytes:17492408922 (16.2 GiB)

ib2       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
          inet6 addr: fe80::216:35ff:ffbf:2b05/64 Scope:Link

          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

          RX packets:270916 errors:0 dropped:0 overruns:0 frame:0

          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:256

          RX bytes:23380352 (22.2 MiB)  TX bytes:456 (456.0 b)

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:130512043 errors:0 dropped:0 overruns:0 frame:0

          TX packets:130512043 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:17683019151 (16.4 GiB)  TX bytes:17683019151 (16.4 GiB)

virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00 
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0

          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:0 errors:0 dropped:0 overruns:0 frame:0

          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:0 (0.0 b)  TX bytes:468 (468.0 b)
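
To make the "0 or negligible" check above less of an eyeball exercise, the counters can be pulled out of the ifconfig text and turned into a percentage. The following is only a sketch: ifstats is a hypothetical helper name, and it assumes the older net-tools output layout shown above ("RX packets:N errors:N dropped:N ..."); newer ifconfig versions print these counters differently.

```shell
#!/bin/sh
# ifstats (hypothetical helper): reads net-tools style ifconfig output on stdin
# and prints one line per interface and direction with the share of
# errored + dropped packets. Assumes the "RX packets:N errors:N ..." layout.
ifstats() {
    awk '
        # An interface block starts with the interface name in column one.
        /^[A-Za-z0-9]/ { ifname = $1 }

        # Counter lines: "RX packets:..." or "TX packets:..." (skip "RX bytes:").
        $1 == "RX" || $1 == "TX" {
            if ($2 !~ /^packets:/) next
            pk = 0; er = 0; dr = 0
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[1] == "packets") pk = kv[2]
                if (kv[1] == "errors")  er = kv[2]
                if (kv[1] == "dropped") dr = kv[2]
            }
            bad = er + dr
            pct = (pk + 0 > 0) ? 100 * bad / pk : 0
            printf "%s %s packets=%s errors=%s dropped=%s bad_pct=%.6f\n",
                   ifname, $1, pk, er, dr, pct
        }
    '
}

# Typical use (not run here): ifconfig | ifstats
```

Fed with the output above, the 29 dropped TX packets on ib0 come out as a tiny fraction of a percent of the packets sent, which is the "negligible" case.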

Infiniband Bonding Information

Infiniband bonding is similar to classical network bonding (or aggregation), with some behavioral differences. The major difference is that an Infiniband bonding interface runs in active/passive mode over the Infiniband HCAs. Trunking is not possible as it is with classical Ethernet networks. So if you have two 20 Gbit interfaces, you will still get only 20 Gbit of theoretical throughput on an active IB network, even though you have two (or more) interfaces. This is easy to see in the ifconfig output as well: while the ib0 interface has send/receive statistics, there is almost no traffic over the ib2 interface.

In case of a failure (or when triggered manually), the bonding interface will detect the failure of the active component and fail over to the passive one, and you will see an informative warning message in the /var/log/messages file, just as with Ethernet bonding.

In a successful RAC configuration, the failover duration should be less than any CRS or watchdog timeout value, because no interconnect traffic (heartbeats or cache fusion) is possible during the failover. If the failover takes too long, whether due to host CPU utilization, a problem in the HCA firmware, a configuration problem on the IB switch, or anything else, the clusterware or some watchdog will assume that the node should be evicted from the cluster to protect cluster integrity.

To check the current status of IB bonding, oswib.sh uses the ib-bond command (line 12). However, there is a bug in the way the command is used. Look at the output of ib-bond with the --status option.

[root@dbkon01:~]# ib-bond --status

bond0: 00:22:64:f7:12:bc 10.210.51.171/24 10.210.51.172/24

slave0: eth0 *

slave1: eth1

The bond0 interface has nothing to do with IB bonding; it is the public network bond (the IPs on this bond are the public and VIP addresses). This seems to be a problem with the ib-bond command, because it also displays non-IB bonding interfaces. To correct this, you can change the related line of your oswib.sh script to:

ib-bond --status-all

[root@dbkon01:~]# ib-bond --status-all

bond0: 00:22:64:f7:12:bc 10.210.51.171/24 10.210.51.172/24

slave0: eth0 *

slave1: eth1

bond1: 80:00:00:48:fe:80:00:00:00:00:00:00:00:16:35:ff:ff:bf:2b:11 172.16.51.71/24

slave0: ib0 *

slave1: ib2

As you can see, bond1 is the IB bonding interface and ib0 is the active port.
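
Since the active port is simply the slave marked with an asterisk, it can be extracted from the logged text and compared between OSWatcher snapshots to spot a failover. This is a minimal sketch that assumes the output layout shown above; active_slaves is a hypothetical helper name, not part of OSWatcher or of ib-bond.

```shell
#!/bin/sh
# active_slaves (hypothetical helper): reads "ib-bond --status-all" style text
# on stdin and prints "<bond> <active slave>" for every bond.
# Assumes the layout shown above: a "bondN:" line followed by "slaveN: <dev> [*]".
active_slaves() {
    awk '
        # Remember the bond that the following slave lines belong to.
        $1 ~ /^bond[0-9]+:$/ { bond = $1; sub(/:$/, "", bond) }

        # The active slave is the one whose line ends with an asterisk.
        $1 ~ /^slave[0-9]+:$/ && $NF == "*" { print bond, $2 }
    '
}

# Typical use (not run here): ib-bond --status-all | active_slaves
```

For the output above this would print "bond0 eth0" and "bond1 ib0"; if a later snapshot reports "bond1 ib2", the IB bond has failed over in between.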

Infiniband Adapters & Configuration Information

ibstat is another important command; it displays the status of each IB HCA port. Depending on your configuration, different outputs are possible.

[root@dbkon01:~]# ibstat

CA 'mlx4_0'

    CA type: MT25418

    Number of ports: 2

    Firmware version: 2.5.0

    Hardware version: a0

    Node GUID: 0x001635ffffbf2b10

    System image GUID: 0x001635ffffbf2b13

    Port 1:

        State: Active

        Physical state: LinkUp

        Rate: 20

        Base lid: 7

        LMC: 0

        SM lid: 1

        Capability mask: 0x02510868

        Port GUID: 0x001635ffffbf2b11

    Port 2:

        State: Down

        Physical state: Polling

        Rate: 10

        Base lid: 0

        LMC: 0

        SM lid: 0

        Capability mask: 0x02510868

        Port GUID: 0x001635ffffbf2b12

CA 'mlx4_1'

    CA type: MT25418

    Number of ports: 2

    Firmware version: 2.5.0

    Hardware version: a0

    Node GUID: 0x001635ffffbf2b04

    System image GUID: 0x001635ffffbf2b07

    Port 1:

        State: Active

        Physical state: LinkUp

        Rate: 20

        Base lid: 6

        LMC: 0

        SM lid: 1

        Capability mask: 0x02510868

        Port GUID: 0x001635ffffbf2b05

    Port 2:

        State: Down

        Physical state: Polling

        Rate: 10

        Base lid: 0

        LMC: 0

        SM lid: 0

        Capability mask: 0x02510868

        Port GUID: 0x001635ffffbf2b06

In this sample configuration there are two 20 Gbit Mellanox HCAs with two ports each. Only Port 1 of each HCA is connected to an Infiniband switch port; the other ports are not in use. This output is specific to our configuration. In Exadata, for example, there is only one dual-port HCA on each compute node or storage cell, and its ports are paired for a redundant IB network configuration.
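
When many snapshots have to be compared, the ibstat output can be condensed to one line per port. The sketch below assumes the field layout shown above (State, Physical state, then Rate within each Port block); ibports is a hypothetical helper name.

```shell
#!/bin/sh
# ibports (hypothetical helper): reads ibstat output on stdin and prints
# one summary line per HCA port. Assumes the field order shown above,
# where "Rate:" follows "State:" and "Physical state:" for each port.
ibports() {
    awk '
        # "CA <name>" starts a new HCA; "CA type: ..." has three fields, so skip it.
        $1 == "CA" && NF == 2 { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }

        # "Port 1:" starts a port block; "Port GUID: ..." does not match.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:/, "", port) }

        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" { phys = $3 }

        # Rate is the last of the three fields we need, so print here.
        $1 == "Rate:" {
            printf "%s port=%s state=%s phys=%s rate=%s\n", ca, port, state, phys, $2
        }
    '
}

# Typical use (not run here): ibstat | ibports
```

A port that flips from "state=Active phys=LinkUp" to "state=Down phys=Polling" between two snapshots is then easy to find with a plain diff.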

ibstatus is another way to check the same thing. Its output also shows the port speeds together with the link type (for example 4X DDR).

[root@dbkon01:~]# ibstatus

Infiniband device 'mlx4_0' port 1 status:

    default gid:     fe80:0000:0000:0000:0016:35ff:ffbf:2b11

    base lid:     0x7

    sm lid:         0x1

    state:         4: ACTIVE

    phys state:     5: LinkUp

    rate:         20 Gb/sec (4X DDR)

Infiniband device 'mlx4_0' port 2 status:

    default gid:     fe80:0000:0000:0000:0016:35ff:ffbf:2b12

    base lid:     0x0

    sm lid:         0x0

    state:         1: DOWN

    phys state:     2: Polling

    rate:         10 Gb/sec (4X)

Infiniband device 'mlx4_1' port 1 status:

    default gid:     fe80:0000:0000:0000:0016:35ff:ffbf:2b05

    base lid:     0x6

    sm lid:         0x1

    state:         4: ACTIVE

    phys state:     5: LinkUp

    rate:         20 Gb/sec (4X DDR)

Infiniband device 'mlx4_1' port 2 status:

    default gid:     fe80:0000:0000:0000:0016:35ff:ffbf:2b06

    base lid:     0x0

    sm lid:         0x0

    state:         1: DOWN

    phys state:     2: Polling

    rate:         10 Gb/sec (4X)

As you know, lspci displays the details of the PCI devices attached to a host. To record the details of your HCAs, OSW periodically logs the output of this command as well. This output is also specific to the HCA you are using; your vendor's manuals can help in interpreting some of the parameters.

[root@dbkon01:~]# lspci -vv |grep InfiniBand -A27

13:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

    Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s]

    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-

    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-

    Latency: 0, Cache Line Size: 64 bytes

    Interrupt: pin A routed to IRQ 209

    Region 0: Memory at fdc00000 (64-bit, non-prefetchable) [size=1M]

    Region 2: Memory at f6800000 (64-bit, prefetchable) [size=8M]

    Capabilities: [40] Power Management version 3

        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

    Capabilities: [48] Vital Product Data

    Capabilities: [9c] MSI-X: Enable+ Mask- TabSize=256

        Vector table: BAR=0 offset=0007c000

        PBA: BAR=0 offset=0007d000

    Capabilities: [60] Express Endpoint IRQ 0

        Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+

        Device: Latency L0s <64ns, L1 unlimited

        Device: AtnBtn- AtnInd- PwrInd-

        Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-

        Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-

        Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes

        Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8

        Link: Latency L0s unlimited, L1 unlimited

        Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-

        Link: Speed 2.5Gb/s, Width x8

    Capabilities: [100] Unknown (14)



19:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

    Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s]

    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-

    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-

    Latency: 0, Cache Line Size: 64 bytes

    Interrupt: pin A routed to IRQ 185

    Region 0: Memory at fde00000 (64-bit, non-prefetchable) [size=1M]

    Region 2: Memory at f7000000 (64-bit, prefetchable) [size=8M]

    Capabilities: [40] Power Management version 3

        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

    Capabilities: [48] Vital Product Data

    Capabilities: [9c] MSI-X: Enable+ Mask- TabSize=256

        Vector table: BAR=0 offset=0007c000

        PBA: BAR=0 offset=0007d000

    Capabilities: [60] Express Endpoint IRQ 0

        Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+

        Device: Latency L0s <64ns, L1 unlimited

        Device: AtnBtn- AtnInd- PwrInd-

        Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-

        Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-

        Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes

        Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8

        Link: Latency L0s unlimited, L1 unlimited

        Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-

        Link: Speed 2.5Gb/s, Width x8

    Capabilities: [100] Unknown (14)
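
One thing that can be read from this output without a vendor manual is whether each HCA's PCIe link came up at the speed and width it supports. The following is a sketch only: pcie_link is a hypothetical helper name, and it assumes the older lspci layout shown above ("Link: Supported Speed ..." and "Link: Speed ..."); other lspci versions label these lines differently.

```shell
#!/bin/sh
# pcie_link (hypothetical helper): reads "lspci -vv" text on stdin and, per device,
# compares the supported PCIe link speed/width with the one currently in use.
# Assumes the "Link: Supported Speed X, Width Y, ..." / "Link: Speed X, Width Y"
# layout shown above.
pcie_link() {
    awk '
        # Device header lines start in column one with the bus address.
        /^[0-9a-f]/ { dev = $1 }

        # Supported (capability) values.
        $1 == "Link:" && $2 == "Supported" {
            cs = $4; cw = $6
            sub(/,/, "", cs); sub(/,/, "", cw)
        }

        # Current (negotiated) values; compare and report.
        $1 == "Link:" && $2 == "Speed" {
            s = $3; w = $5
            sub(/,/, "", s); sub(/,/, "", w)
            verdict = (s == cs && w == cw) ? "ok" : "degraded"
            printf "%s supported=%s,%s current=%s,%s %s\n", dev, cs, cw, s, w, verdict
        }
    '
}

# Typical use (not run here): lspci -vv | grep InfiniBand -A27 | pcie_link
```

For both HCAs above the supported and current values are 2.5Gb/s and x8, so both would be reported as "ok".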
