Navdeep Saini
While designing its engineered systems, Oracle recognized that the then-current industry-standard network interconnects, such as 10 GigE and 4/8 Gb/s Fibre Channel, would not be able to handle the demands of the high-performance computing solution it was trying to build. With its acquisition of Sun, Oracle leapfrogged onto the InfiniBand bandwagon: Sun was one of the founding members of the InfiniBand Trade Association (IBTA) and had been actively working on IB HCAs and switches. Hence, it made perfect sense for Oracle to use InfiniBand exclusively to provide the "backplane" network fabric for its engineered systems, offering 10GigE connectivity only for uplinks to existing Datacenter networks.
 
The InfiniBand architecture provides high-bandwidth, low-latency network communication designed specifically for application clusters using IPC as well as for the network fabric of Storage Area Networks (SANs). It was conceived as a universal I/O fabric for next-generation computing requirements; however, IB never took off in the broader marketplace and has remained confined to the realm of high-performance computing environments. Adoption has been limited partly because, to get the full benefit, a lot of changes have to be made at the application layer, and the vendors promoting IBA had only hardware assets (Sun, IBM, Intel, Mellanox, etc.). Application and OS vendors never had a business incentive to make those changes for IB. Hence, Ethernet became the ubiquitous layer 2 interconnect because of its ease of deployment and plug-and-play nature.

IB vs. Ethernet

| Feature | InfiniBand | 1Gb/10Gb Ethernet |
|---------|------------|-------------------|
| Bus / Link Bandwidth (Half Duplex) | 2.5, 10, 30 Gb/sec | 1, 10 Gb/sec |
| Bus / Link Bandwidth (Full Duplex) | 5, 20, 60 Gb/sec | 2, 20 Gb/sec |
| Pin Count | 4, 16, 48 | 4, 8 |
| Transport Media | PCB, Copper and Fiber | PCB, Copper and Fiber |
| Maximum Signal Length (PCB / Copper) | 30 in., 17 m | 20 in., 100 m |
| Maximum Signal Length (Fiber) | Km | Km |

With engineered systems and its "hardware and software engineered to work together" approach, Oracle made some key changes to the software stack, both OS and applications, which have enabled engineered systems to take advantage of the InfiniBand architecture. This blog is an attempt to highlight some of these changes and to see how they should be implemented while deploying engineered systems.

Main Oracle Engineered Products Using Infiniband

  - Exadata
  - Exalogic
  - SPARC SuperCluster
  - And others…

This blog will focus on Infiniband network topology that is in use with Exalogic and Exadata.

Exadata

As we all know, Exadata consists of compute nodes, which serve as the database machines, and storage cell nodes with direct-attached storage and flash cards, all connected via a high-speed InfiniBand fabric. Without going into the technical details of Exadata, let's see how Oracle leverages InfiniBand features to provide a low-latency, high-bandwidth network backplane in Exadata.

In Exadata, the InfiniBand network consists of Sun Datacenter InfiniBand Switch 36 switches and a two-port IB HCA in each compute and storage cell node, providing a 40 Gb/sec pipe between compute-to-compute and compute-to-storage-cell nodes. Each 4-lane (QDR) InfiniBand connection is capable of delivering 80 Gb/sec full duplex, 40 Gb/sec per direction. So the hardware is in place, but how it is actually used by the OS (Oracle Enterprise Linux) and the software (Oracle Clusterware, RAC, database) is the interesting thing to notice.
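
A quick way to confirm that the IB ports are up and running at QDR speed on a node is the standard InfiniBand diagnostic tool ibstat, which is normally present on Exadata nodes; the output below is truncated and illustrative:

$ ibstat
CA 'mlx4_0'
        ...
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                ...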

In order to fully realize the benefits of the IB fabric, Oracle made changes to its OS and software to support a datagram protocol called RDS (Reliable Datagram Sockets). Exadata makes use of the RDS protocol over IB for storage traffic, RAC interconnect traffic and, optionally, external connectivity (e.g. backups to tape).

The diagram above is a simplified representation of the Exadata hardware. For the sake of simplicity, we have a 2-node RAC setup connected to 3 storage cells via the IB fabric. The database server nodes communicate with the storage cells using a protocol known as the Intelligent Database protocol. Oracle built this database communication protocol, called iDB, from the ground up, and it is responsible for one of the most important features of Exadata, Smart Scan (query offloading, cell offloading).

The iDB protocol is based on the industry-standard RDS v3 protocol. RDS is a datagram protocol just like UDP, but reliable and zero-copy, with access to IB via a socket API. RDS v3 supports RDMA reads and RDMA writes. RDMA enables a remote application to directly access application memory on the local system, bypassing the OS kernel and thereby shedding CPU overhead and load on the host. The RDS protocol is also used for Oracle RAC's inter-process communication (IPC); the red line above denotes the interconnect data path. IP over InfiniBand (IPoIB) is used by Oracle Clusterware for CSS (node monitor) communication only. It is important to remember that on the RAC private interconnect, most of the traffic is the memory synchronization mechanism, i.e. memory-to-memory (SGA-to-SGA) transfer of entire data blocks (typically 8K), with even larger payloads in PQ (Parallel Query). The CSS node monitor traffic is minuscule, as it is just a network heartbeat between RAC nodes.

Why is RDS/IB Better than UDP or TCP over Ethernet (Gigabit or 10GigE)?

 

Traditional non-Exadata Oracle RAC installations use Gigabit or, more recently, 10GigE private networks for the RAC interconnect and 10GigE or 4/8/16G Fibre Channel SAN fabric for storage traffic. The diagram above shows the data path for three types of applications on a database node. The green data path represents traditional TCP/IP-over-Ethernet traffic, which goes through the kernel stack and consumes host resources (memory and CPU cycles).

Exadata interconnect and storage traffic uses RDS, which completely bypasses the kernel stack, achieving RDMA reads and writes with lower resource consumption on the host. The difference is magnified in large, high-transaction or data warehouse database installations, where every CPU cycle is needed for query processing rather than being spent on network traffic. In speed and performance tests, RDS/IB has shown 50% less CPU utilization than IPoIB and 62% less CPU than RDS over Ethernet.

In short, Oracle made changes to the Oracle Enterprise Linux UEK kernel and the Oracle binaries to use the RDS protocol over IB instead of traditional TCP/IP or UDP over Ethernet. This lets them avoid the unnecessary CPU overhead of processing Ethernet TCP/IP or UDP packets in the kernel and achieve the holy grail of RDMA on a low-latency, high-bandwidth IB fabric, which is still faster than newer technologies being built around Ethernet infrastructure such as RoCE (http://blog.infinibandta.org/2012/02/13/roce-and-infiniband-which-should-i-choose/). In engineered systems, presenting the entire rack, whether one-eighth, quarter, half or full, fully encapsulated with compute, storage and the IB network backplane, has made it easier for Datacenter managers to adopt an IB-based solution in a plug-and-play manner.

Things to Check in Exadata

As we can see, the IB fabric hardware, in terms of the Sun InfiniBand Switch 36 and the IB HCAs in the compute and storage nodes, comes bundled in the rack. It is important that the software installed on this hardware is configured properly to fully utilize the benefits of IB.

RDS Module is Loaded in OS

By default, the OEL UEK kernel loaded on Exadata compute and cell nodes has the RDS drivers loaded and configured. You can check whether the RDS module is loaded on a node by running the following command:

$ /sbin/lsmod|grep rds

rds_rdma              117583  672

rds                   231688  1345 rds_rdma

rdma_cm                61961  2 rds_rdma,rdma_ucm

ib_core                75634  12 rds_rdma,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,mlx4_ib,ib_sa,ib_mad
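
Beyond checking that the module is loaded, you can verify RDS connectivity to another node over the IB network with rds-ping from the rds-tools package; a sketch, assuming rds-tools is installed and 192.168.10.2 is the peer node's IB address (both illustrative):

$ rds-ping -c 3 192.168.10.2
   1: 45 usec
   2: 43 usec
   3: 44 usec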

RAC and Storage Traffic on Infiniband

On Exadata compute nodes (virtual or physical), make sure the InfiniBand network is used for storage communication. By default, during Exadata deployment the private InfiniBand network is configured and a block of IP addresses is assigned to the bonded IB HCA ports. The database server InfiniBand interfaces and the cell InfiniBand interfaces must be on the same subnet in order to communicate with each other. With bonding, only one subnet is necessary for the InfiniBand addresses. Below is an example of subnet blocks assigned for IB traffic.

192.168.50.0/24 (netmask 255.255.255.0)

192.168.51.0/24 (netmask 255.255.255.0)

For example, on a DB node the bonded interface will look like this:

bondib0   Link encap:InfiniBand  HWaddr 80…………….

          inet addr:192.168.10.1  Bcast:192.168.11.255  Mask:255.255.252.0

          inet6 addr: fe80::221:2800:1ce:afab/64 Scope:Link

………

On the DB node, you can check /etc/oracle/cell/network-config/cellinit.ora for the IB IP address the compute node uses to communicate with the cells, and /etc/oracle/cell/network-config/cellip.ora for the cell node IP addresses.
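
As an illustration, the contents typically look like the following (the addresses are illustrative and the exact format can vary slightly between Exadata releases):

$ cat /etc/oracle/cell/network-config/cellinit.ora
ipaddress1=192.168.10.1/22

$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"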

When using Oracle Clusterware, make sure RDS/IB is being used for the RAC interconnect traffic. Use the following command to verify that the private network for Oracle Clusterware communication is using InfiniBand:

$ oifcfg getif -type cluster_interconnect

bondib0  192.168.8.0  global  cluster_interconnect

Oracle Binaries Linked for RDS

The Reliable Datagram Sockets (RDS) protocol should be used over the InfiniBand network for database server to cell communication and for Oracle Real Application Clusters (Oracle RAC) communication. Check the alert log to verify that the private network for Oracle RAC is running the RDS protocol over the InfiniBand network.

The following message should be in the log:

Cluster communication is configured to use the following interface(s) for this instance…….

cluster interconnect IPC version:Oracle RDS/IP (generic)

IPC Vendor 1 proto 3

Version 3.0

……

Another way to check whether the Oracle binary is linked to utilize RDS is to run the following command:

$ORACLE_HOME/bin/skgxpinfo

The output should be:

rds

Note: For Oracle software versions below 11.2.0.2, the skgxpinfo command is not present. For 11.2.0.1, you can copy over skgxpinfo to the proper path in your 11.2.0.1 environment from an available 11.2.0.2 environment and execute it against the 11.2.0.1 database home(s) using the provided command.

If the RDS protocol is not being used over the InfiniBand network, then perform the following procedure:

1. Shut down any processes that are using the Oracle binary.

2. Change to the ORACLE_HOME/rdbms/lib directory.

3. Run the following command:

make -f ins_rdbms.mk ipc_rds ioracle

If a separate Oracle home is used for ASM (Grid Infrastructure), RDS should be enabled for both homes; a consolidated sketch of the relink steps follows.
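
For reference, here is a minimal sketch of the sequence above (the ORACLE_HOME path is illustrative; make sure all databases and listeners using the home are shut down first):

$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1    # illustrative path
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ipc_rds ioracle
$ $ORACLE_HOME/bin/skgxpinfo     # should now print: rds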

Exalogic

While Exadata engineered systems were built exclusively for database workloads, and specifically Oracle databases, Oracle designed Exalogic for the omnipresent compute power in a typical Datacenter. Exalogic is designed to handle business applications, which are very compute intensive, and it is positioned as a complete middle-tier platform in a 19-inch rack chassis, a fully integrated private cloud in a box. Without going into the hardware details of Exalogic, in short it consists of many compute nodes, each with a dual-port IB HCA, and storage from a ZFS appliance, all connected by an internal I/O fabric based on low-latency, high-bandwidth IB. For external connections to the Datacenter, 10GigE connectivity is presented. Exalogic can connect to other Exa systems such as Exadata, the ZFS appliance, SPARC SuperCluster, etc. using the IB fabric.

Interestingly, in Exalogic the InfiniBand I/O and network backplane is marketed separately as Exabus, and there is a specific reason for that, as we will see. (In Exadata, similar terminology could have been "Exafabric".) Exabus is the secret sauce in Exalogic that differentiates it from just a bunch of compute nodes bundled together. The IB fabric in Exalogic consists of the IB HCAs in the compute nodes and the ZFS appliance, each providing two 40G QDR ports, two NM2-GW switches, which act as both switches and gateways, and a standard Sun Datacenter InfiniBand Switch 36, also called the spine switch. The difference in Exalogic is the NM2-GW switches, which are not present in Exadata.

Each NM2-GW switch in Exalogic provides IB traffic switching between the compute nodes and the ZFS appliance, and also acts as a gateway for 10GigE Ethernet traffic. Unlike Exadata, compute nodes in Exalogic do not have 10GigE network cards. External traffic from the Datacenter can only come in via the NM2-GW 10GigE ports, 8 on each switch, providing 8x10GigE of bandwidth per gateway to the outside Datacenter.

Exalogic Networking and Concept of Exabus

The above diagram shows the connectivity in an Exalogic eighth-rack setup. The compute nodes connect to each other, to the ZFS appliance and to other Exa systems via the I/O fabric, while Datacenter traffic comes into Exalogic via the gateway 10GigE ports. Just like the use of the RDS protocol in Exadata, the Exabus fabric also uses special, non-standard protocols to take full advantage of what IB has to offer.

Let us look at the different protocols used in the Exabus fabric:

1. IPoIB is used for private communication between compute nodes, the ZFS controllers and other Exa systems such as Exadata (in some cases). For example, parts of the WLS/Coherence cluster interconnect use IPoIB.

2. EoIB (Ethernet over InfiniBand) is used for carrying Ethernet traffic between the compute nodes and the rest of the Datacenter infrastructure, for example HTTP traffic from an external web server into the WebLogic servers running inside Exalogic.

3. Native IB (RDMA) communication is used for low-latency communication between compute nodes, for example by Coherence and Tuxedo. (An illustrative interface check follows this list.)
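
As a quick sanity check, the IPoIB and EoIB interfaces can be seen from the OS on a compute node; the interface names (bond0, bond1) and addresses below are illustrative and can differ between Exalogic releases:

$ ifconfig bond0 | head -2        # IPoIB private interface (illustrative name)
bond0     Link encap:InfiniBand  HWaddr 80:00:...
          inet addr:192.168.10.10  Bcast:192.168.11.255  Mask:255.255.252.0

$ ifconfig bond1 | head -2        # EoIB client-network interface (illustrative name)
bond1     Link encap:Ethernet  HWaddr 00:...
          inet addr:10.10.10.21  Bcast:10.10.10.255  Mask:255.255.255.0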

Benefits of Exabus InfiniBand vs. Standard TCP/IP or UDP over Ethernet

In a typical Datacenter, compute nodes are bundled together on standard network backbones such as 10GigE, with Fibre Channel fabric for storage. They typically suffer when they have to scale out to accommodate the ever-increasing demands of applications. In large compute-intensive systems, network I/O ends up being a huge bottleneck for application performance and a major impediment to horizontal scaling using technologies like application clustering. Exalogic was designed specifically to address this shortcoming. Similar to the use of RDS in Exadata, in Exalogic some of the software components (OS and applications) make use of a protocol known as Sockets Direct Protocol (SDP), which enables kernel bypass, eliminates buffer copies and uses larger 64KB packets, reducing network overhead. In general, InfiniBand provides over three times the throughput of 10GigE with 50% lower latency when using native SDP.

Sockets Direct Protocol (SDP) works by maintaining socket-level compatibility. There is a thin SDP stack in the kernel, but a lot of the transport handling and processing is done in the IB HCA, thereby providing a degree of kernel bypass. Applications using socket-level communication do not need to change when using SDP; examples are WebLogic JDBC GridLink connections to Exadata and some of the OTD communication with other components in Exalogic.
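
For Java processes, SDP is typically switched on at the JVM level rather than in application code. A minimal sketch, assuming a JDK/JRockit with SDP support as shipped for Exalogic, with illustrative paths and addresses: add the following JVM options to the server start parameters,

-Dcom.sun.sdp.conf=/u01/config/sdp.conf -Djava.net.preferIPv4Stack=true

and create an SDP rules file such as /u01/config/sdp.conf telling the JVM which traffic should use SDP:

# use SDP when binding to this node's IB address (illustrative address)
bind 192.168.10.10 *
# use SDP when connecting to services on the IB subnet (illustrative subnet)
connect 192.168.10.0/24 *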

Even though some components of Oracle Middleware on Exalogic still use IPoIB, the high-speed IB fabric provides a 40G QDR connection in comparison to 10GigE; if you are exchanging a lot of data, that bandwidth is helpful over time. All of these performance enhancements are achieved with zero changes to applications using the Exabus socket APIs.

Finally, for pieces where there is a lot of communication between components and which can benefit from ultra-low latency, such as Coherence and Tuxedo, Oracle has exposed native InfiniBand interfaces across multiple languages (C++, Java). There is a direct path from user space into the IB HCA using RDMA semantics, which is the fastest way of communicating. Oracle is still working on making this available to other layers in future Exalogic application versions.

In short, the high-speed IB fabric in Exalogic and the use of specific protocols exposed by special APIs developed by Oracle make it possible to easily scale out applications as load increases.

The Secret Sauce in Exalogic

As we can see, the secret sauce in Exalogic is Exabus, which comprises the InfiniBand network and I/O backplane and a set of changes and enhancements made to the Middleware products to leverage this high-speed backplane. Oracle delivers the Oracle Fusion Middleware products for Exalogic under a separate SKU, "Oracle Exalogic Cloud Software", as can easily be seen on the eDelivery cloud. This is separate from the standard Oracle Fusion Middleware products available for "non-Exa" systems.

The Exalogic Cloud Software is a set of Fusion Middleware products that have enhancements made for Exalogic. The Oracle WLS suite products enhanced are WebLogic Server, the JRockit JVM and the Coherence in-memory data grid. When they are deployed on Exalogic, the specific enhancements are enabled.

Some of the Important Enhancements in Fusion Middleware for Exalogic

Cluster-Level Session Replication Enhancements

As part of this enhancement, WebLogic Server uses the Exabus IB fabric for session replication and failover in a WLS cluster. With this, WebLogic Server replicates more of the session data in parallel over the network to the secondary server, using parallel socket connections instead of a single connection (parallel RJVMs). Using the enhanced JRockit JVM (part of the Exalogic Cloud Software pack), WebLogic skips the TCP/IP stack by using the Exabus IB networking protocol SDP, enabling session payloads to be sent over the network with lower latency. As a result, for web applications requiring high availability, end-user requests are responded to much more quickly.

 

Grid Link Data Source

As part of the Exalogic WLS suite, a new component called Active GridLink for RAC is provided for application server connectivity to Oracle RAC clustered databases. This replaces the existing WebLogic multi data source capability for RAC connectivity. As mentioned above, Exalogic can connect to other Exa systems like Exadata over the IB fabric. The Active GridLink data source provides JDBC connection pools across the Exabus InfiniBand fabric to Exadata using the SDP protocol. As with the WebLogic cluster-level session replication enhancements, the JDBC connections using SDP take full benefit of the higher-bandwidth, low-latency kernel bypass mechanism, thereby dramatically improving the response time of high-transaction-volume applications running on WebLogic clusters.
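
As a sketch of what this looks like in practice (the hostname, port and service name are illustrative, and an SDP listener must be configured on the Exadata IB network), the GridLink data source URL uses (PROTOCOL=SDP) in the connect descriptor:

jdbc:oracle:thin:@(DESCRIPTION=
  (ADDRESS=(PROTOCOL=SDP)(HOST=exadb01-ibvip.example.com)(PORT=1522))
  (CONNECT_DATA=(SERVICE_NAME=app_svc)))

and the WebLogic managed servers are started with the JDBC thin driver's SDP switch:

-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true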

Note that the SDP protocol is currently not supported with the R12.2 Oracle E-Business Suite AppsDataSource driver. Oracle EBS application traffic from Exalogic to Exadata will still travel over the IB fabric using native IPoIB, without any changes to the underlying techstack in EBS.

Things to Look for in Exalogic

Follow the example in MOS document# 1373571.1 for enabling the Exalogic-related enhancements in the Oracle Fusion Middleware suite of products.
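
At its core, the note enables the domain-level "Enable Exalogic Optimizations" setting (WebLogic 10.3.4 and later). A minimal sketch of what the resulting domain config.xml entry typically looks like; verify the exact steps against the MOS note for your version:

<domain>
  ...
  <exalogic-optimizations-enabled>true</exalogic-optimizations-enabled>
  ...
</domain>

The same setting can also be toggled from the WebLogic Admin Console on the domain's General configuration page.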

Bringing it all Together – Exadata and Exalogic

The above diagram shows a sample Exalogic and Exadata integration. For application traffic between Exalogic compute nodes, Exadata DB nodes and cell storage nodes, Oracle takes full advantage of the IB fabric by using the SDP, RDMA and iDB/RDS protocols, achieving kernel bypass and avoiding CPU overhead wherever possible. Other standard IP traffic uses IPoIB, taking advantage of the high-bandwidth 40G QDR network. For external connectivity to the Datacenter, 10GigE NICs are exposed, improving overall network performance.

Conclusion

Exalogic configured with Exadata is a complete high-performance compute and storage solution wholly encapsulated in racks. By utilizing a high-bandwidth, low-latency IB backplane for internal communication between the components in a rack and exposing 10GigE NICs that can easily be plugged into an existing Datacenter network infrastructure, Oracle has made it easier for customers to adopt all the benefits that IB has to offer without having to worry about setting up a separate IB fabric in their Datacenter. Oracle engineered systems are a truly converged infrastructure solution especially designed for Oracle products, and they provide high-performance computing superior to other products like Cisco UCS and IBM PureSystems, which are based on standard lower-bandwidth interconnects.

References

Oracle Clusterware and RAC Support for RDS Over Infiniband (Doc ID 751343.1)

Oracle Reliable Datagram Sockets (RDS) and InfiniBand (IB) Support for RAC Interconnect and Exadata Storage (Doc ID 745616.1)

Oracle Reliable Datagram Sockets (RDS) and InfiniBand (IB) Support (For Linux x86 and x86-64 Platforms) (Doc ID 761804.1)

http://www.oracle.com/technetwork/database/enterprise-edition/rds-installation-on-oracle-rac-10g--129858.pdf

Exalogic - How To Enable Exalogic Optimizations For WebLogic Server Running On Exalogic Machine (Doc ID 1373571.1)

Weblogic Becomes Unresponsive When Using Oracle E-Business Suite AppsDataSource Driver With SDP (Doc ID 1970352.1)