SGI 10-Gigabit Ethernet Network Adapter User Manual

View online or download the User Manual for the SGI 10-Gigabit Ethernet Network Adapter (Networking). InfiniBand and 10-Gigabit Ethernet for Dummies User Manual

Designing Cloud and Grid Computing Systems
with InfiniBand and High-Speed Ethernet
A Tutorial at CCGrid '11
by
Dhabaleswar K. (DK) Panda
The Ohio State University
http://www.cse.ohio-state.edu/~panda
Sayantan Sur
The Ohio State University
E-mail: surs@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~surs

Content summary

Page 1 - A Tutorial at CCGrid ’11

Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet. Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.…

Page 2

Hadoop Architecture • Underlying Hadoop Distributed File System (HDFS) • Fault-tolerance by replicating data blocks • NameNode: stores information on dat…
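The excerpt above summarizes HDFS: data blocks are replicated across DataNodes for fault tolerance while the NameNode tracks metadata. A minimal client-side sketch using the libhdfs C bindings is shown below; the "default" NameNode URI, the /tmp/example.dat path, and the replication factor of 3 are illustrative assumptions, not values from the tutorial.

```c
/* Sketch: write a block-replicated file into HDFS via the libhdfs C API.
 * Assumes libhdfs is installed and a NameNode is reachable as "default". */
#include <hdfs.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>

int main(void)
{
    hdfsFS fs = hdfsConnect("default", 0);       /* contact the NameNode   */
    if (!fs) { fprintf(stderr, "connect failed\n"); return 1; }

    /* replication = 3 mirrors the fault-tolerance-by-replication idea */
    hdfsFile f = hdfsOpenFile(fs, "/tmp/example.dat", O_WRONLY, 0, 3, 0);
    if (!f) { fprintf(stderr, "open failed\n"); return 1; }

    const char *msg = "hello hdfs\n";
    hdfsWrite(fs, f, msg, strlen(msg));          /* data lands on DataNodes */
    hdfsFlush(fs, f);
    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return 0;
}
```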

Page 3 - Computing Systems

CCGrid '11: OpenFabrics Stack with Unified Verbs Interface. Verbs Interface (libibverbs); vendor drivers: Mellanox (libmthca), QLogic (libipathverbs), IBM (libehca), Chelsio (li…

Page 4 - Cluster Computing Environment

• For IBoE and RoCE, the upper-level stacks remain completely unchanged• Within the hardware:– Transport and network layers remain completely unchange

Page 5 - (http://www.top500.org)

CCGrid '11: OpenFabrics Software Stack. SA: Subnet Administrator; MAD: Management Datagram; SMA: Subnet Manager Agent; PMA: Performance Manager Agent; IPoIB: IP o…

Page 6 - Grid Computing Environment

CCGrid '11 (slide 103): InfiniBand in the Top500. Percentage share of InfiniBand is steadily increasing.

Page 7

Chart: number of systems in the Top500 by interconnect family: Gigabit Ethernet 45%, InfiniBand 43%, with the remaining share split among Proprietary, Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, and Cu…

Page 8 - Compute cluster

CCGrid '11 (slide 105): InfiniBand System Efficiency in the Top500 List. Chart: efficiency (%) versus Top500 system rank (0 to 500).

Page 9 - Cloud Computing Environments

• 214 IB Clusters (42.8%) in the Nov '10 Top500 list (http://www.top500.org) • Installations in the Top 30 (13 systems): (CCGrid '11: Large-scale Infi…)

Page 10 - Hadoop Architecture

• HSE compute systems with ranking in the Nov 2010 Top500 list– 8,856-core installation in Purdue with ConnectX-EN 10GigE (#126)– 7,944-core installat

Page 11 - Memcached Architecture

• HSE has most of its popularity in enterprise computing and other non-scientific markets including Wide-area networking• Example Enterprise Computing

Page 12

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 13 - • Software components

Memcached Architecture • Distributed Caching Layer – Allows to aggregate spare memory from multiple nodes – General purpose • Typically used to cache data…
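Since the slide describes memcached as a general-purpose distributed caching layer that aggregates spare memory across nodes, here is a minimal client-side sketch using the libmemcached C API; the node1/node2 server names, the port, and the key/value strings are illustrative assumptions.

```c
/* Sketch of the "distributed caching layer" idea with the libmemcached
 * C client: several servers pooled, then a set and a get on one key. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *mc = memcached_create(NULL);
    /* Spare memory on several nodes is aggregated by listing many servers. */
    memcached_server_add(mc, "node1", 11211);
    memcached_server_add(mc, "node2", 11211);

    const char *key = "user:42", *val = "cached-profile";
    memcached_set(mc, key, strlen(key), val, strlen(val),
                  (time_t)0, (uint32_t)0);        /* no expiry, no flags */

    size_t len; uint32_t flags; memcached_return_t rc;
    char *out = memcached_get(mc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS)
        printf("get -> %.*s\n", (int)len, out);

    free(out);
    memcached_free(mc);
    return 0;
}
```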

Page 14 - • Ex: TCP/IP, UDP/IP

Modern Interconnects and Protocols (slide 110). Diagram comparing stacks: application interface (Sockets vs. Verbs), protocol implementation in kernel space (TCP/IP) vs. hardware-offloaded TCP/IP, Ethernet driver, protocol implementa…

Page 15 - – Not scalable:

• Low-level Network Performance• Clusters with Message Passing Interface (MPI)• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)• Inf

Page 16 - Myrinet (1993 -) 1 Gbit/sec

CCGrid '11 (slide 112): Low-level Latency Measurements. Charts: latency (us) versus message size (bytes) for VPI-IB, Native IB, VPI-Eth, and RoCE, small and large messages.

Page 17

CCGrid '11 (slide 113): Low-level Uni-directional Bandwidth Measurements. Chart: uni-directional bandwidth versus message size for VPI-IB, Native IB, VPI-Eth, and RoCE.

Page 18

• Low-level Network Performance• Clusters with Message Passing Interface (MPI)• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)• Inf

Page 19 - IB Trade Association

• High Performance MPI Library for IB and HSE– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)– Used by more than 1,550 organizations in 60 countries– More tha

Page 20

CCGrid '11 (slide 116): One-way Latency: MPI over IB. Chart: small-message latency (us) versus message size (bytes); latencies of 1.96, 1.54, 1.60, and 2.17 us across the MVAP…
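Latency figures like the ones summarized above are normally produced by an MPI ping-pong microbenchmark. Below is a minimal sketch of that measurement pattern (not the tutorial's benchmark code itself): two ranks bounce a 4-byte message and report the averaged one-way latency; the iteration count is arbitrary.

```c
/* Ping-pong latency sketch: run with 2 ranks, e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char buf[4] = {0};                           /* small message          */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* divide by 2*iters: one-way latency in microseconds */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```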

Page 21 - • I/O interface bottlenecks

CCGrid '11 (slide 117): Bandwidth: MPI over IB. Chart: unidirectional bandwidth (MillionBytes/sec) versus message size (bytes); peaks of 2665.6, 3023.7, 1901.1, and 1553…

Page 22

CCGrid '11 (slide 118): One-way Latency: MPI over iWARP. Chart: latency versus message size for Chelsio (TCP/IP), Chelsio (iWARP), Intel-NetEffect (TCP/IP), and Intel-NetEffect (iWARP).

Page 23

CCGrid '11 (slide 119): Bandwidth: MPI over iWARP. Chart: unidirectional bandwidth (MillionBytes/sec) versus message size (bytes); peaks of 839.8, 1169.7, 373.3, and 1245.0…

Page 24 - (not shown)

• Good System Area Networks with excellent performance (low latency, high bandwidth and low CPU utilization) for inter-processor communication (IPC) a

Page 25

CCGrid '11 (slide 120): Convergent Technologies: MPI Latency. Charts: latency (us) versus message size (bytes) for small and large messages.

Page 26

CCGrid '11 (slide 121): Convergent Technologies: MPI Uni- and Bi-directional Bandwidth. Chart: uni-directional bandwidth for Native IB, VPI-IB, VPI-Eth, and RoCE…

Page 27 - • Myricom GM

• Low-level Network Performance• Clusters with Message Passing Interface (MPI)• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)• Inf

Page 28 - IB Hardware Acceleration

CCGrid '11 (slide 123): IPoIB vs. SDP Architectural Models. Diagram: traditional model vs. possible SDP model; sockets application, sockets API, kernel TCP/I…

Page 29 - • Hardware Checksum Engines

CCGrid '11 (slide 124): SDP vs. IPoIB (IB QDR). Charts: bandwidth (MBps) and latency for IPoIB-RC, IPoIB-UD, and SDP across message sizes.
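The SDP-versus-IPoIB comparison rests on the fact that a plain sockets program needs no source changes: over IPoIB it simply connects to the IB interface's IP address through the kernel TCP/IP stack, while SDP is commonly interposed underneath the same binary (for example via an LD_PRELOAD of a libsdp-style library) so the stream rides native IB transport. The sketch below is an ordinary TCP client to make that point; the address and port are illustrative.

```c
/* Plain TCP client: unchanged whether it runs over Ethernet, IPoIB, or
 * (via an SDP preload library) over native IB transport. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(5001) };
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr);   /* e.g. an IPoIB address */

    if (connect(s, (struct sockaddr *)&addr, sizeof addr) == 0) {
        const char *msg = "ping";
        send(s, msg, strlen(msg), 0);                 /* socket API unchanged  */
    }
    close(s);
    return 0;
}
```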

Page 30 - TOE and iWARP Accelerators

• Low-level Network Performance• Clusters with Message Passing Interface (MPI)• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)• Inf

Page 31

• Option 1: Layer-1 Optical networks– IB standard specifies link, network and transport layers– Can use any layer-1 (though the standard says copper a

Page 32

Features• End-to-end guaranteed bandwidth channels• Dynamic, in-advance, reservation and provisioning of fractional/full lambdas• Secure control-plane

Page 33

• Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY• Idea is to make remote storage “appear” local• IB-WAN switch does frame conversion– IB standard allow

Page 34 - 2003 (Gen1), 2007 (Gen2)

CCGrid '11 (slide 129): InfiniBand Over SONET: Obsidian Longbows RDMA throughput measurements over USN. Diagram: Linux hosts, ORNL, 700 miles, Chicago CDCI, Seattle CDCI, Su…

Page 35

• Hardware components– Processing cores and memory subsystem– I/O bus or links– Network adapters/switches• Software components– Communication stack• B

Page 36 - IB, HSE and their Convergence

CCGrid '11 (slide 130): IB over 10GE LAN-PHY and WAN-PHY. Diagram: Linux hosts, ORNL, Seattle CDCI, ORNL CDCI, Longbow IB/S units; spans of 700, 3300, and 4300 miles…

Page 37 - Traditional Ethernet

MPI over IB-WAN: Obsidian Routers. Emulated delay vs. distance: 10 us for 2 km, 100 us for 20 km, 1000 us for 200 km, 10000 us for 2000 km. Diagram: Cluster A and Cluster B connected over a WAN link through Obsidian WAN Routers…

Page 38 - IB Overview

Communication Options in Grid• Multiple options exist to perform data transfer on Grid• Globus-XIO framework currently does not support IB natively• W

Page 39 - Components: Channel Adapters

Globus-XIO Framework with ADTS Driver. Diagram: Globus XIO Driver #n with Data Connection Management, Persistent Session Management, Buffer & File Management, and Data Transport I…

Page 40 - • Switches: intra-subnet

(Slide 134) Performance of Memory Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • ADTS…

Page 41 - – Not directly addressable

(Slide 135) Performance of Disk Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • Predic…

Page 42

(Slide 136) Application Level Performance. Chart: bandwidth (MBps, 0 to 300) for target applications CCSM and Ultra-Viz with ADTS vs. IPoIB • Application performance for FTP get opera…

Page 43 - IB Communication Model

• Low-level Network Performance• Clusters with Message Passing Interface (MPI)• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)• Inf

Page 44 - Queue Pair Model

A New Approach towards OFA in Cloud. Diagram: current approach vs. towards OFA in Cloud; application, accelerated sockets, 10 GigE or InfiniBand, verbs / hardware offload; curr…

Page 45 - Memory Registration

Memcached Design Using Verbs• Server and client perform a negotiation protocol– Master thread assigns clients to appropriate worker thread• Once a cli

Page 46 - Memory Protection

• Ex: TCP/IP, UDP/IP• Generic architecture for all networks• Host processor handles almost all aspects of communication– Data buffering (copies on sen

Page 47 - (Send/Receive Model)

Memcached Get Latency• Memcached Get latency– 4 bytes – DDR: 6 us; QDR: 5 us– 4K bytes -- DDR: 20 us; QDR:12 us• Almost factor of four improvement ove

Page 48 - Hardware ACK

Memcached Get TPS• Memcached Get transactions per second for 4 bytes– On IB DDR about 600K/s for 16 clients – On IB QDR 1.9M/s for 16 clients• Almost

Page 49

Hadoop: Java Communication Benchmark• Sockets level ping-pong bandwidth test• Java performance depends on usage of NIO (allocateDirect)• C and Java ve

Page 50

Hadoop: DFS IO Write Performance• DFS IO included in Hadoop, measures sequential access throughput• We have two map tasks each writing to a file of in

Page 51 - Hardware Protocol Offload

Hadoop: RandomWriter Performance• Each map generates 1GB of random binary data and writes to HDFS• SSD improves execution time by 50% with 1GigE for t

Page 52 - • Switching and Multicast

Hadoop Sort Benchmark• Sort: baseline benchmark for Hadoop• Sort phase: I/O bound; Reduce phase: communication bound• SSD improves performance by 28%

Page 53 - Buffering and Flow Control

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 54 - Virtual Lanes

• Presented network architectures & trends for Clusters, Grid, Multi-tier Datacenters and Cloud Computing Systems• Presented background and detail

Page 55 - Service Levels and QoS

CCGrid '11 (slide 148): Funding Acknowledgments. Funding support by and equipment support by the sponsors shown on the slide.

Page 56 - Traffic Segregation Benefits

CCGrid '11: Personnel Acknowledgments. Current Students – N. Dandapanthula (M.S.) – R. Darbha (M.S.) – V. Dhanraj (M.S.) – J. Huang (Ph.D.) – J. Jose (P…

Page 57 - Identifiers)

• Traditionally relied on bus-based technologies (last mile bottleneck) – E.g., PCI, PCI-X – One bit per wire – Performance increase through: • Increasing…

Page 58 - Switch Complex

CCGrid '11: Web Pointers. http://www.cse.ohio-state.edu/~panda ; http://www.cse.ohio-state.edu/~surs ; http://nowlab.cse.ohio-state.edu ; MVAPICH Web Page: http…

Page 59 - – 3D Torus (Sandia Red Sky)

• Network speeds saturated at around 1Gbps– Features provided were limited– Commodity networks were not considered scalable enough for very large-scal

Page 60 - More on Multipathing

• Industry Networking Standards• InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks• InfiniBand aimed at

Page 61 - IB Multicast Example

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 62

• IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)• Goal: To design a scalable and high

Page 63 - IB Transport Services

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 64 - Reliability

• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step• Goal: To achieve a scalable and high performanc

Page 65 - Transport Layer Capabilities

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks (CCGrid '11, slide 21: Tackling Communication Bottlenecks with IB and…)

Page 66 - Data Segmentation

• Bit serial differential signaling– Independent pairs of wires to transmit independent data (called a lane)– Scalable to any number of lanes– Easy to

Page 67 - Transaction Ordering

CCGrid '11: Network Speed Acceleration with IB and HSE. Ethernet (1979 - ) 10 Mbit/sec; Fast Ethernet (1993 -) 100 Mbit/sec; Gigabit Ethernet (1995 -) 10…

Page 68 - Message-level Flow-Control

Chart: IB roadmap, bandwidth per direction (Gbps), 2005 - 2011: 32G-IB-DDR, 48G-IB-DDR, 96G-IB-QDR, 48G-IB-QDR, 200G-IB-EDR, 112G-IB-FDR, 300G-IB-EDR, 1…

Page 69

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks (CCGrid '11, slide 25: Tackling Communication Bottlenecks with IB and…)

Page 70

• Intelligent Network Interface Cards• Support entire protocol processing completely in hardware (hardware protocol offload engines)• Provide a rich c

Page 71 - Concepts in IB Management

• Fast Messages (FM)– Developed by UIUC• Myricom GM– Proprietary protocol stack from Myricom• These network stacks set the trend for high-performance

Page 72 - Subnet Manager

• Some IB models have multiple hardware accelerators– E.g., Mellanox IB adapters• Protocol Offload Engines– Completely implement ISO/OSI layers 2-4 (l

Page 73

• Interrupt Coalescing– Improves throughput, but degrades latency• Jumbo Frames– No latency impact; Incompatible with existing switches• Hardware Chec

Page 74 - HSE Overview

CCGrid '11 (slide 3): Current and Next Generation Applications and Computing Systems • Diverse Range of Applications – Processing and dataset characteristics…

Page 75 - Differences

• TCP Offload Engines (TOE)– Hardware Acceleration for the entire TCP/IP stack– Initially patented by Tehuti Networks– Actually refers to the IC on th

Page 76 - – Multi Stream Semantics

• Also known as “Datacenter Ethernet” or “Lossless Ethernet”– Combines a number of optional Ethernet standards into one umbrella as mandatory requirem

Page 77

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks (CCGrid '11, slide 32: Tackling Communication Bottlenecks with IB and…)

Page 78

• InfiniBand initially intended to replace I/O bus technologies with networking-like technology– That is, bit serial differential signaling– With enha

Page 79

• Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit) (CCGrid '11…)

Page 80

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 81

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics– Novel Features– Subnet Management and Services• High-spee

Page 82

CCGrid '11 (slide 37): Comparing InfiniBand with Traditional Networking Stack. Diagram: Application Layer (MPI, PGAS, File Systems); Transport Layer (OpenFabrics Verbs, RC (rel…

Page 83 - Offloaded TCP

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics• Communication Model• Memory registration and protection•

Page 84

• Used by processing and I/O units to connect to fabric• Consume & generate IB packets• Programmable DMA engines with protection features• May hav
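Channel adapters are exposed to software through the OpenFabrics verbs interface mentioned earlier (libibverbs). A minimal sketch of attaching to the first adapter and allocating a protection domain follows; the helper name open_first_hca and the choice of device 0 are mine, while the ibv_* calls are standard verbs.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Open the first channel adapter (HCA) and allocate a protection domain.
 * The opened context is handed back through ctx_out. */
struct ibv_pd *open_first_hca(struct ibv_context **ctx_out)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return NULL;

    printf("using adapter: %s\n", ibv_get_device_name(devs[0]));
    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* programmable DMA engine */
    ibv_free_device_list(devs);
    if (!ctx)
        return NULL;

    *ctx_out = ctx;
    return ibv_alloc_pd(ctx);      /* protection domain later used by MRs and QPs */
}
```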

Page 85 - Myrinet Express (MX)

CCGrid '11: Cluster Computing Environment. Diagram: compute cluster with LAN, frontend, meta-data manager, I/O server nodes (meta-data and data), and compute nodes…

Page 86 - Datagram Bypass Layer (DBL)

• Relay packets from a link to another• Switches: intra-subnet• Routers: inter-subnet• May support multicastCCGrid '11Components: Switches and Ro

Page 87 - • Solarflare approach:

• Network Links– Copper, Optical, Printed Circuit wiring on Back Plane– Not directly addressable• Traditional adapters built for copper cabling– Restr

Page 88

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics• Communication Model• Memory registration and protection•

Page 89

CCGrid '11 (slide 43): IB Communication Model. Basic InfiniBand communication semantics.

Page 90 - Hardware

• Each QP has two queues – Send Queue (SQ) – Receive Queue (RQ) – Work requests are queued to the QP (WQEs: "Wookies") • QP to be linked to a Complete Que…
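A hedged sketch of that Queue Pair model in verbs terms: create a completion queue, then a reliable-connection QP whose send and receive queues will hold the posted WQEs. The queue depths and the helper name create_rc_qp are illustrative.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* One CQ plus one RC QP; completions from both SQ and RQ land in the CQ. */
struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                            struct ibv_cq **cq_out)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.send_cq = cq;                 /* SQ completions              */
    attr.recv_cq = cq;                 /* RQ completions              */
    attr.qp_type = IBV_QPT_RC;         /* reliable connection service */
    attr.cap.max_send_wr  = 64;        /* WQE slots in the send queue */
    attr.cap.max_recv_wr  = 64;        /* WQE slots in the recv queue */
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    *cq_out = cq;
    return ibv_create_qp(pd, &attr);
}
```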

Page 91 - IB Transport

1. Registration Request • Send virtual address and length 2. Kernel handles virtual->physical mapping and pins region into physical memory • Process…
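In libibverbs the registration request is a single call, ibv_reg_mr, which pins the region and returns the local and remote keys. A minimal sketch (the buffer size and access flags are chosen for illustration):

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin a buffer and obtain the l_key/r_key that later work requests and
 * remote peers must present to the HCA. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);          /* virtual address handed to the kernel */
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr)
        printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);
    return mr;                        /* region stays pinned until ibv_dereg_mr */
}
```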

Page 92 - IB iWARP/HSE RoE RoCE

• To send or receive data the l_key must be provided to the HCA • HCA verifies access to local memory • For RDMA, initiator must have the r_key for the r…

Page 93

CCGrid '11: Communication in the Channel Semantics (Send/Receive Model). Diagram: memory, QP (send/receive queues), and CQ on each side's InfiniBand device; memory segments referenced by the send WQE…
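A sketch of the channel semantics in verbs: the receiver pre-posts a receive WQE, the sender posts a send WQE, and both sides discover completion by polling the CQ. It assumes a QP that is already connected and a registered MR; the helper names are mine.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = { .addr   = (uintptr_t)mr->addr,
                           .length = (uint32_t)mr->length,
                           .lkey   = mr->lkey };
    struct ibv_recv_wr wr, *bad;
    memset(&wr, 0, sizeof wr);
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad);
}

int post_send(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = { .addr   = (uintptr_t)mr->addr,
                           .length = (uint32_t)mr->length,
                           .lkey   = mr->lkey };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof wr);
    wr.opcode     = IBV_WR_SEND;          /* consumes a posted receive */
    wr.send_flags = IBV_SEND_SIGNALED;    /* generate a CQ entry       */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    return ibv_post_send(qp, &wr, &bad);
}

int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)  /* busy-poll for one completion */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```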

Page 94 - IB Hardware Products

CCGrid '11: Communication in the Memory Semantics (RDMA Model). Diagram: memory, QP, and CQ on each InfiniBand device; send WQE cont…
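For the memory semantics, the initiator supplies the remote virtual address and r_key (exchanged out of band) in the work request, and the data is placed without involving the remote CPU. A minimal RDMA-write sketch:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Write the contents of local_mr directly into a remote buffer described
 * by (remote_addr, rkey), learned through an out-of-band exchange. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr   = (uintptr_t)local_mr->addr,
                           .length = (uint32_t)local_mr->length,
                           .lkey   = local_mr->lkey };     /* local access key */
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;   /* where to place the data */
    wr.wr.rdma.rkey        = rkey;          /* permission to touch it  */
    return ibv_post_send(qp, &wr, &bad);
}
```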

Page 95 - Tyan Thunder S2935 Board

CCGrid '11: Communication in the Memory Semantics (Atomics). Diagram: memory, QP, and CQ on each InfiniBand device; send WQE contain…

Page 96 - IB Hardware Products (contd.)

CCGrid '11: Trends for Computing Clusters in the Top500 List (http://www.top500.org). Nov. 1996: 0/500 (0%); Nov. 2001: 43/500 (8.6%); Nov. 2006: 361…

Page 97 - – Nortel Networks

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics• Communication Model• Memory registration and protection•

Page 98 - • Support for VPI and RoCE

CCGrid '11 (slide 51): Hardware Protocol Offload. Complete hardware implementations exist.

Page 99 - – OFED 1.6 is underway

• Buffering and Flow Control • Virtual Lanes, Service Levels and QoS • Switching and Multicast (CCGrid '11, slide 52: Link/Network Layer Capabilities)

Page 100 - (libibverbs)

• IB provides three-levels of communication throttling/control mechanisms– Link-level flow control (link layer feature)– Message-level flow control (t

Page 101 - • Within the hardware:

• Multiple virtual links within same physical link– Between 2 and 16• Separate buffers and flow control– Avoids Head-of-Line Blocking• VL15: reserved

Page 102 - OpenFabrics Software Stack

• Service Level (SL):– Packets may operate at one of 16 different SLs– Meaning not defined by IB• SL to VL mapping:– SL determines which VL on the nex
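In a verbs program the Service Level is chosen by the application when it fills in the address-vector attributes while moving an RC QP to the RTR state; the SL-to-VL mapping itself is then applied hop by hop by the fabric. A sketch follows; the SL value, MTU, and the other connection parameters are placeholders.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Transition an RC QP to RTR, carrying the chosen Service Level in ah_attr. */
int connect_qp_with_sl(struct ibv_qp *qp, uint16_t dlid, uint32_t dqpn,
                       uint32_t rq_psn, uint8_t sl)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = dqpn;
    attr.rq_psn             = rq_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = dlid;
    attr.ah_attr.sl         = sl;        /* one of the 16 service levels */
    attr.ah_attr.port_num   = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```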

Page 103 - InfiniBand in the Top500

• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link• Providing the benefits of i

Page 104 - SP Switch

• Each port has one or more associated LIDs (Local Identifiers)– Switches look up which port to forward a packet to based on its destination LID (DLID
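The LID a port was assigned by the Subnet Manager can be read back through ibv_query_port, which is what endpoints typically exchange before setting up connections. A small sketch (port 1 assumed):

```c
#include <infiniband/verbs.h>
#include <stdio.h>

int print_local_lid(struct ibv_context *ctx)
{
    struct ibv_port_attr pa;
    if (ibv_query_port(ctx, 1, &pa))
        return -1;
    /* Switches forward packets by looking up this value as the DLID. */
    printf("port 1 LID: 0x%x\n", pa.lid);
    return 0;
}
```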

Page 105

• Basic unit of switching is a crossbar– Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars• Switches available in the ma

Page 106 - CCGrid '11

• Someone has to setup the forwarding tables and give every port an LID– “Subnet Manager” does this work• Different routing algorithms give different

Page 107 - • Integrated Systems

CCGrid '11 (slide 6): Grid Computing Environment. Diagram: compute clusters with LAN, frontend, meta-data manager, I/O server nodes (meta-data and data), and compute nodes…

Page 108 - Other HSE Installations

• Similar to basic switching, except…– … sender can utilize multiple LIDs associated to the same destination port• Packets sent to one DLID take a fix

Page 109 - Presentation Overview

CCGrid '11 (slide 61): IB Multicast Example.

Page 110 - InfiniBand

CCGrid '11 (slide 62): Hardware Protocol Offload. Complete hardware implementations exist.

Page 111 - Case Studies

• Each transport service can have zero or more QPs associated with it – E.g., you can have four QPs based on RC and one QP based on UD (CCGrid '11: IB…)

Page 112 - Message Size (bytes)

CCGrid '11 (slide 64): Trade-offs in Different Transport Types. Table: attributes compared across Reliable Connection, Reliable Datagram, eXtended Reliable Connection, Unreliable Connection, and Unrel…

Page 113 - Bandwidth (MBps)

• Data Segmentation • Transaction Ordering • Message-level Flow Control • Static Rate Control and Auto-negotiation (CCGrid '11: Transport Layer Capabili…)

Page 114

• IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP)• Application can hand over a large message– Netwo

Page 115 - MVAPICH/MVAPICH2 Software

• IB follows a strong transaction ordering for RC• Sender network adapter transmits messages in the order in which WQEs were posted• Each QP utilizes

Page 116 - One-way Latency: MPI over IB

• Also called as End-to-end Flow-control– Does not depend on the number of network hops• Separate from Link-level Flow-Control– Link-level flow-contro

Page 117 - Bandwidth: MPI over IB

• IB allows link rates to be statically changed– On a 4X link, we can set data to be sent at 1X– For heterogeneous links, rate can be set to the lowes

Page 118

CCGrid '11 (slide 7): Multi-Tier Datacenters and Enterprise Computing. Diagram: enterprise multi-tier datacenter with Tier 1 through Tier 3, routers/servers, switch, database server, appli…

Page 119 - Bandwidth: MPI over iWARP

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics• Communication Model• Memory registration and protection•

Page 120

• Agents– Processes or hardware units running on each adapter, switch, router (everything on the network)– Provide capability to query and set paramet

Page 121 - Convergent Technologies:

CCGrid '11: Subnet Manager. Diagram: compute nodes, switch, and subnet manager; active and inactive links; multicast join and multicast setup messages…

Page 122

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics– Novel Features– Subnet Management and Services• High-spee

Page 123 - InfiniBand CA

• High-speed Ethernet Family– Internet Wide-Area RDMA Protocol (iWARP)• Architecture and Components• Features– Out-of-order data placement– Dynamic an

Page 124 - SDP vs. IPoIB (IB QDR)

CCGrid '11: IB and HSE RDMA Models: Commonalities and Differences. Table (IB vs. iWARP/HSE): Hardware Acceleration: Supported / Supported; RDMA: Supported / Supported; Atomi…

Page 125

• RDMA Protocol (RDMAP)– Feature-rich interface– Security Management• Remote Direct Data Placement (RDDP)– Data Placement and Delivery– Multi Stream S

Page 126 - IB on the WAN

• High-speed Ethernet Family– Internet Wide-Area RDMA Protocol (iWARP)• Architecture and Components• Features– Out-of-order data placement– Dynamic an

Page 127 - Features

• Place data as it arrives, whether in or out-of-order• If data is out-of-order, place it at the appropriate offset• Issues from the application’s per

Page 128 - “appear” local

• Part of the Ethernet standard, not iWARP– Network vendors use a separate interface to support it• Dynamic bandwidth allocation to flows based on int

Page 129 - Sunnyvale

CCGrid '11: Integrated High-End Computing Environments. Diagram: compute cluster with meta-data manager, I/O server nodes (meta-data and data), and compute nodes…

Page 130 - 3300 miles 4300 miles

• Can allow for simple prioritization:– E.g., connection 1 performs better than connection 2– 8 classes provided (a connection can be in any class)• S

Page 131 - Cluster B

• High-speed Ethernet Family– Internet Wide-Area RDMA Protocol (iWARP)• Architecture and Components• Features– Out-of-order data placement– Dynamic an

Page 132 - Communication Options in Grid

• Regular Ethernet adapters and TOEs are fully compatible• Compatibility with iWARP required• Software iWARP emulates the functionality of iWARP on th

Page 133 - Modern WAN

CCGrid '11: Different iWARP Implementations. Diagram: regular Ethernet adapters; application, high-performance sockets, sockets, network adapter, TCP/IP, device driver, offl…

Page 134 - Data Transfer

• High-speed Ethernet Family– Internet Wide-Area RDMA Protocol (iWARP)• Architecture and Components• Features– Out-of-order data placement– Dynamic an

Page 135 - IPoIB-64MB

• Proprietary communication layer developed by Myricom for their Myrinet adapters– Third generation communication layer (after FM and GM)– Supports My

Page 136 - Application Level Performance

• Another proprietary communication layer developed by Myricom– Compatible with regular UDP sockets (embraces and extends)– Idea is to bypass the kern

Page 137

CCGrid '11 (slide 87): Solarflare Communications: OpenOnload Stack. Diagram: typical HPC networking stack vs. typical commodity networking stack • HPC Networking Stack provi…

Page 138

• InfiniBand– Architecture and Basic Hardware Components– Communication Model and Semantics– Novel Features– Subnet Management and Services• High-spee

Page 139 - Memcached Design Using Verbs

• Single network firmware to support both IB and Ethernet• Autosensing of layer-2 protocol– Can be configured to automatically work with either IB or

Page 140 - Memcached Get Latency

CCGrid '11 (slide 9): Cloud Computing Environments. Diagram: LAN connecting physical machines each hosting VMs, a virtual FS with meta-data, and I/O servers holding data…

Page 141 - Memcached Get TPS

• Native convergence of IB network and transport layers with Ethernet link layer• IB packets encapsulated in Ethernet frames• IB network layer already

Page 142 - Bandwidth with C version

• Very similar to IB over Ethernet– Often used interchangeably with IBoE– Can be used to explicitly specify link layer is Converged (Enhanced) Etherne

Page 143

CCGrid '11: IB and HSE: Feature Comparison. Table (IB / iWARP-HSE / RoE / RoCE): Hardware Acceleration: Yes / Yes / Yes / Yes; RDMA: Yes / Yes / Yes / Yes; Congestion Control: Yes / Opti…

Page 144

• Introduction• Why InfiniBand and High-speed Ethernet?• Overview of IB, HSE, their Convergence and Features• IB and HSE HW/SW Products and Installati

Page 145 - Hadoop Sort Benchmark

• Many IB vendors: Mellanox+Voltaire and Qlogic– Aligned with many server vendors: Intel, IBM, SUN, Dell– And many integrators: Appro, Advanced Cluste

Page 146

CCGrid '11 (slide 95): Tyan Thunder S2935 Board (courtesy Tyan). Similar boards from Supermicro with LOM features are also available.

Page 147 - Concluding Remarks

• Customized adapters to work with IB switches– Cray XD1 (formerly by Octigabay), Cray CX1• Switches:– 4X SDR and DDR (8-288 ports); 12X SDR (small si

Page 148 - Funding Acknowledgments

• 10GE adapters: Intel, Myricom, Mellanox (ConnectX)• 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)• 40GE adapters: Mellanox ConnectX2-

Page 149 - Personnel Acknowledgments

• Mellanox ConnectX Adapter• Supports IB and HSE convergence• Ports can be configured to support IB or HSE• Support for VPI and RoCE– 8 Gbps (SDR), 16

Page 150 - Web Pointers

• Open source organization (formerly OpenIB)– www.openfabrics.org• Incorporates both IB and iWARP in a unified manner– Support for Linux and Windows–
