FAQ [Last updated May 2007]






Q. Where is Gigabit Ethernet's evolution in terms of performance and line utilization in respect to Fast Ethernet?

Q. What type of performance can be expected with Gigabit Ethernet?

Q. What type of system, speed and features are recommended for Gigabit Ethernet?

Q. How many Gigabit Ethernet cards can be installed per PCI bus segment?

Q. What is the basic formula for determining how fast of a CPU is needed for Gigabit Ethernet?

Q. What are the typical performance-limiting issues with Gigabit Ethernet and when using protocol stacks like TCP/IP?

Q. How much of the CPU bandwidth is consumed copying payload data across the socket API interface in send() or recv() system calls?

Q. What are the solutions for gigabit networking in next generation designs?

Q. What is Intel's solution to the protocol stack and operating system overhead issues?

Q. What do the terms RDMA, iWarp and RNIC refer to?

Q. How are the high-speed protocol stack performance issues being addressed, what is the status of this work?

Q. Will full TCP/IP offload (TOE) help solve performance limiting issues in Gigabit Ethernet?

Q. Please summarize the primary issues and solutions regarding new and existing TOE products.

Q. What types of loopbacks and frame generators are available with your Gigabit Ethernet controllers and drivers?

Q. In multi-vendor integrated systems, what type of system level problems can most affect gigabit performance and how can these be resolved?

Q. Why is the system controller so important to Gigabit Ethernet performance?

Q. Are there known performance issues with some system controller's PCI interfaces?

Q. What type of CPU-offload and other features are supported by your Gigabit cards and drivers?

Q. I am evaluating interface cards and systems for Gigabit Ethernet, what do you recommend I do?

Q. Should I use Copper (1000 Base T) or Fiber (1000 Base SX/LX)?

Q. Is auto-negotiation required when using Gigabit Ethernet in 1000-base-T or 1000-base-X gigabit modes?

Q. What advantages are there for using PCI-X?

Q. Do UDP sockets perform better than TCP sockets and should they be used if the application permits?

Q. What are these "full socket" counts seen in the UDP statistics when I run an end-to-end UDP blaster test?

Q. What about Fibre Channel vs. Gigabit Ethernet?

Q. How robust and reliable are the VxWorks, Linux and Windows (2000, XP) drivers supplied with the OEM developer kit?

Q. What sort of effort should be expected to integrate the VxWorks driver into my target system?

Q. What types of issues are involved in integrating the VxWorks driver in a typical system?

Q. Is source code to the VxWorks, Linux and Windows (2000, XP) drivers available? How is technical support provided for these products?

Q. What else is contained on the OEM Developer Kit?

Q. Can the VxWorks or Linux drivers be ported to other operating systems?

Q. Can polling rather than using interrupts help improve performance?




Q. Where is Gigabit Ethernet's evolution in terms of performance and line utilization in respect to Fast Ethernet over a single connection?

A. In the early days of Fast Ethernet, line utilization per connection was relatively low -- usually anywhere from 20 - 50 percent. 6 megabytes per second (MB/sec) of throughput (50 % line utilization) was considered very good. Today, many Fast Ethernet connections are being utilized nearing 100 percent at 10-12MB/sec. This same evolution pattern is occurring for Gigabit Ethernet at a faster pace. Depending on system speed and operating system, many typical systems are now in the 60-80 percent line utilization range (approx. 75 - 100MB/sec) and ever increasing. Per-port line utilization nearing 100 percent is attainable (100 - 122MB/sec) given a system with around a 2+ GHZ processor and a 64-bit, 66 MHZ PCI (or PCI-X) bus and DDR memory.

Q. What type of performance can be expected with Gigabit Ethernet?

A. Benchmarks for Gigabit Ethernet under various system configurations and test parameters are provided for users and developers of Gigabit Ethernet as an aid in understanding overall system requirements and performance metrics. These benchmarks are performed in three primary categories as follows:

  1. Raw driver throughputs (loopback and end-to-end tests)
  2. Connectionless protocol (UDP/IP) using socket API
  3. Connection oriented protocol stack (TCP/IP) using socket API

Test results on specified systems:

Test system #1: 2.8-GHZ Xeon PC, E7525 chipset, 133/100 PCI-X (or PCIe), DDR2 SDRAM, Linux 2.4.30

Raw driver throughput: 245 megabytes per second sustained per port full-duplex (1.96 Gbps aggregate total send and receive), 490 MB/sec (3.9 Gbps) over two ports and 974MB/sec (7.79Gbps) over 4-ports

TCP sockets application test (blaster/blastee): 114 megabytes/sec sustained (912 Mbps), 30 percent CPU utilization using a single connection

UDP sockets application test (blaster/blastee): 118 megabytes/sec sustained (944 Mbps), 30 percent CPU utilization using a single attachment

Maximum frame rate: 850,000 - 1M frames per second per port (over 1.8 million fps over four ports) using short 60-100 byte payload frames

Test system #2: 1-GHZ PowerPC 7455, 64/66 PCI, PC133 SDRAM, Tornado 2.2 -- VxWorks 5.5 (MVME-5500 board)

Raw driver throughput: 198 MB/sec (2-ports, tx + rx), CPU util: 49%

TCP sockets application test (tx) (blaster): 74.4 MB/sec, CPU util: 73%

TCP sockets application test (rx) (blastee): 100.9 MB/sec, CPU util: 95%

Test system #3: 2-GHZ Xeon PC, 64-bit, 66 MHZ PCI-X (64/66 PCI mode), DDR SDRAM, Tornado 2.2/VxWorks 5.5

Raw driver throughput: 245 megabytes per second sustained per port (1.96 Gbps aggregate total send and receive), 484 MB/sec (3.87 Gbps) using two ports

TCP sockets application test (blaster/blastee): 116.2 megabytes/sec sustained (930 Mbps), 44 percent CPU utilization using a single connection

UDP sockets application test (blaster/blastee): 118.76 megabytes/sec sustained (950 Mbps), 20 percent CPU utilization using a single attachment

Maximum frame rate: 850,000 frames per second per port (over 1.2 million over two ports) using short 60-100 byte payload frames

Test system #4: 1GHZ dual-Pentium III PC, 64-bit, 66 MHZ PCI, PC133 SDRAM, Linux 2.4.18

Raw driver throughput: 242 megabytes per second sustained (aggregate total send and receive)

UDP sockets application test (blaster/blastee): 118 megabytes/sec sustained (944 Mbps), 42 percent CPU utilization

TCP sockets application test (blaster/blastee): 114 megabytes/sec sustained (912 Mbps), 50 percent CPU utilization

Maximum frame rate: Over 850,000 frames per second (using short 60 byte payload frames)

Note: The blaster/blastee tests were conducted between two systems through a Gigabit Ethernet switch at approximately 50 ft. cable distances using a single TCP/IP connection with one side sending and one side receiving.
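
For reference, the "blaster" side of these tests is essentially a tight send loop over a single TCP connection. The following is only a minimal sketch of such a loop, not our actual test program; the peer address, port and message size shown are arbitrary example values.

/* Minimal TCP "blaster" sketch -- sends a fixed-size message in a loop over a
 * single connection.  Not the actual benchmark program; the peer address,
 * port and message size are arbitrary example values. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MSG_SIZE (64 * 1024)   /* larger messages reduce per-call socket API overhead */

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer;
    static char buf[MSG_SIZE];

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                      /* example port */
    peer.sin_addr.s_addr = inet_addr("192.168.1.2");  /* example blastee address */

    if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    for (;;) {                                        /* blast until interrupted */
        if (send(sock, buf, sizeof(buf), 0) < 0) {
            perror("send");
            break;
        }
    }
    close(sock);
    return 0;
}

The blastee side is the mirror image: a recv() loop that counts bytes received per second.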

Q. What type of system, speed and features are recommended for Gigabit Ethernet?

A. We provide the following information to assist in determining system requirements for supporting Gigabit Ethernet:

Gigabit Ethernet is fast technology -- the PCI bus, memory and CPU performance should be balanced to accommodate the performance required. Connection-oriented protocol stacks and APIs (primarily the socket API itself) and TCP/IP may have a high overhead factor when processing data at gigabit speeds. TCP performs better using larger window and message sizes -- windows of 16KB - 64KB are sometimes necessary to reduce latency issues. The current TCP implementation also has characteristically poor performance in networks with a large bandwidth-delay product (long distances at high rates). Transmitting short messages may also impact performance; UDP, however, performs very well with short frame lengths. These are important factors to consider when using protocol stacks with high-speed Gigabit Ethernet. Alternatives to TCP such as HighSpeed TCP, UDT, UDP, SCTP or RTP should also be explored to obtain the best performance and service levels while minimizing the overhead factor on the CPU and memory. However, as system speed increases and protocol stacks become more optimized for gigabit networks, these factors are having less impact on system performance. Additionally, there are new experimental RFCs in the IETF meant to address these issues and provide high-speed TCP.
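
To illustrate the window/buffer sizing point above, the per-socket send and receive buffer sizes (which bound the offered TCP window on most stacks) can be raised with setsockopt(). The following is only a sketch; the 64 KB value is an example, not a recommendation for every system, and some stacks also impose system-wide limits that must be raised separately.

/* Enlarge the socket buffers (ideally before connect()/listen()) so the TCP
 * window can grow.  64 KB is an example value only. */
#include <sys/types.h>
#include <sys/socket.h>

int set_socket_buffers(int sock)
{
    int size = 64 * 1024;

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char *)&size, sizeof(size)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&size, sizeof(size)) < 0)
        return -1;
    return 0;
}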

CPU and driver: The Gigabit Ethernet technology, including the device driver, amounts to moderately low overhead on the system. This is shown by the benchmarks, which show near wire-speed transmission capabilities when the TCP/IP protocol stack is not used. The overhead of the protocol stack and applications, which are the producers and consumers of the data, amounts to a substantial percentage of the overhead when processing data at gigabit speeds. Processing this much data through the protocol stack and application may require a high performance system. UDP can be used to minimize the protocol stack overhead while still providing an "IP based" solution.

Transferring data: The Gigabit Ethernet controllers use PCI bus-master DMA technology to transfer frames directly to and from SDRAM; aside from the PCI and memory bandwidth the DMA itself consumes, this results in low overhead to the host CPU. Our Gigabit Ethernet driver does not copy frame buffers as its processing is entirely frame-buffer based and supports scatter/gather. In addition, Gigabit Ethernet uses a descriptor-list technology that allows the Gigabit Ethernet controller to transfer packet data in parallel with the host CPU. The driver does process transmit and receive frame buffer descriptors, but it processes many (tens or even hundreds) of these buffer descriptors for each interrupt from the Gigabit Ethernet controller. Our benchmark testing shows the controller and driver are capable of near wire-speed transmission (about 96 percent line utilization) of approximately 242 megabytes per second and over 800,000 Ethernet frames per second per port.

CPU offload: Host offload features including TCP/UDP/IP checksum offloading, interrupt coalescing and jumbo frame support can also be enabled in our driver further reducing overhead to the host CPU running the protocol stack.

In general, a system with a medium to fast 32-bit CPU (1 GHZ and up), DDR/SDRAM memory and a 64-bit or 66 MHZ PCI (preferably PCI-X) interface is recommended for use with our Gigabit Ethernet controller. Gigabit Ethernet can be used on a 32-bit, 33 MHZ PCI bus, but at gigabit speeds a 32/33 PCI bus becomes a bottleneck. A 64-bit and/or 66 MHZ PCI or 100/133 PCI-X bus is highly recommended.

For medium performance of about 50-70 megabytes per second of TCP/IP or 95 MB/sec of UDP/IP with wire-speed raw driver throughput, a 700+ MHZ PowerPC 750/7410 or an 800+ MHZ Pentium III or better is recommended.

For wire speed performance using host based protocol stacks including TCP/IP, we recommend systems with performance as good or better than the following:

PowerPC based:

- Motorola PowerPC 7400-7455 (700 MHZ - 1.2 GHZ+)

- IBM PowerPC 750FX

- Motorola PowerPC 8540/8560, 1-GHZ

- High speed PC133 or 64/128-bit DDR SDRAM memory (1-2 GB/sec memory bandwidth, DDR SDRAM preferred)

- 64-bit, 66 MHZ PCI bus (always recommended), PCI-X for newer systems

- High-end system controller (PCI/PCI-X bus interface, SDRAM controller)

Intel based:

- Single or dual 1 - 3 GHZ Pentium4, Pentium-M, Core-Duo or Xeon (5000/5300 series). Multiple cores recommended.

- High speed 64-bit DDR SDRAM memory (DDR2/DDR3) (1-5 GB/sec memory bandwidth)

- 64-bit, 66 MHZ PCI bus (minimum recommended), PCI-X or PCIe for newer systems

- High-end system controller (PCI/PCI-X/PCIe bus interfaces, SDRAM controller), ServerWorks LE, Intel E7520/7525 or Intel 3000/5000 series

Q. How many Gigabit Ethernet cards can be installed per PCI bus segment?

A. Depending on the load, two or more Gigabit Ethernet controllers can be used in 64-bit, 66 MHZ PCI slots per PCI bus segment. Many systems today contain two or more PCI bus segments (channels) allowing additional loads; however, the total SDRAM bandwidth must also be considered. Since Gigabit Ethernet can run at very high bus speeds of up to 250 MB/sec per port, two interfaces can potentially consume much of the available PCI bandwidth if running at full capacity simultaneously. Using additional PCI bus segments is recommended. One Gigabit Ethernet card can be used in a 32-bit, 33 MHZ PCI slot up to the 90-118 MB/sec range, but throughputs above this may overrun the 32-bit PCI bus. 64-bit and/or 66 MHZ PCI bus based systems are highly recommended for Gigabit Ethernet.
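
As a rough back-of-the-envelope check, the theoretical bus bandwidth can be compared against the per-port load. The sketch below uses theoretical peak numbers only and ignores arbitration, wait states and PCI protocol overhead, so real-world headroom is lower.

/* Rough PCI bandwidth budget check -- theoretical peaks only; real-world
 * throughput is lower due to arbitration, wait states and protocol overhead. */
#include <stdio.h>

int main(void)
{
    double pci_64_66   = 64.0 * 66e6  / 8 / 1e6;  /* ~528 MB/sec theoretical peak  */
    double pcix_64_133 = 64.0 * 133e6 / 8 / 1e6;  /* ~1064 MB/sec theoretical peak */
    double per_port_fd = 250.0;  /* MB/sec, one gigabit port running full-duplex   */

    printf("64/66 PCI peak:   %.0f MB/sec -> ~%d full-duplex gigabit ports\n",
           pci_64_66, (int)(pci_64_66 / per_port_fd));
    printf("64/133 PCI-X peak: %.0f MB/sec -> ~%d full-duplex gigabit ports\n",
           pcix_64_133, (int)(pcix_64_133 / per_port_fd));
    return 0;
}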

Q. What is the basic formula for determining how fast of a CPU is needed for Gigabit Ethernet?

A. The basic industry rule of thumb applicable to the TCP/IP protocol is 1 MHZ of CPU for each megabit per second of data, plus an allowance for application processing. This equates to a 1-GHZ CPU for a wire-speed transfer in one direction or a 2-GHZ CPU for a bi-directional transfer in full-duplex mode when using the TCP/IP protocol stack. Using other protocols including UDP/IP or direct IP can provide better performance with lower CPU utilization.
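
As a worked example of this rule of thumb (the figures are industry approximations, not guarantees):

/* "1 MHZ of CPU per megabit per second of TCP/IP data" rule of thumb.
 * Approximate only; application processing must be added on top. */
#include <stdio.h>

int main(void)
{
    double rate_mbps = 2 * 1000.0;  /* full-duplex wire speed: ~1000 Mbps each way */

    printf("Estimated CPU for the TCP/IP stack alone: ~%.0f MHz\n", rate_mbps);
    printf("(add headroom on top of this for the application itself)\n");
    return 0;
}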

Q. What are the typical performance-limiting issues with Gigabit Ethernet and when using protocol stacks like TCP/IP?

A. In Gigabit Ethernet as well as other gigabit-speed transmission technologies, performance bottlenecks can occur due to hardware and software components that are "out of balance" with each other. In addition, performance bottlenecks are commonly found in "shared resources" used in the path of the communications channel. These are categorized as hardware issues (CPU, memory and bus bandwidth). Other overhead and performance limitations exist in the operating system, task switching, Socket API and the TCP/IP protocol stack itself. In Gigabit Ethernet, the primary limiting factors are not the physical wire, gigabit controller or device driver because these components employ dedicated hardware and high-speed intelligent burst DMA technology. In addition, the device driver is not directly involved in byte-level data transfer -- its overhead is based on frame rates. The driver acts only as a buffer manager and dispatcher operating on entire linked lists of buffers and sending DMA instructions to the gigabit controller. The gigabit controller itself is capable of transferring (via controller initiated bus-master DMA) hundreds of Ethernet frames without CPU intervention. Due to the design and architecture of the controller hardware, the driver operates efficiently and depending on frame rate consumes low to moderate CPU resources and is not a limiting factor in Gigabit Ethernet performance.

Primary limiting factors or bottlenecks in Gigabit Ethernet are typically found in the following areas:

1. PCI bus coupled with SDRAM access latency and poor performance of some system controllers. These issues include PCI transaction disconnects, retries and wait states due to fetching, cache coherency and bus and SDRAM arbitration. These factors normally only impact continuous transmission of short frames of less than 300 bytes and/or high frame-rate applications of greater than 500,000 frames per second, but can also occur with some poorly configured or poorly performing system controllers. Also, the impact of these variables is typically greater on the transmit side (burst reads) than the receive side (posted writes), as disconnects, retries and wait states occur more frequently on reads from SDRAM.

2. Too much overhead at the socket API and protocol stack for a given CPU, excessive system overhead and task switching, and limited memory speed/bandwidth. Although the DMA-driven Gigabit Ethernet controller and driver can operate at wire speeds with less than a 400 MHZ processor, processing Gigabit Ethernet using the TCP/IP protocol stack and socket API may require a 1 to 2 GHZ CPU to process payload data at speeds approaching wire speed. This is primarily due to the socket API interface which copies payload data between user and system buffers, system overhead including task switching, and the overhead incurred in the TCP/IP protocol stack having to do with checksums, header processing and segmentation.

3. TCP/IP protocol stack end-to-end acknowledgement latency. This may require use of TCP window sizes of 16-32K or greater. The primary problem with the current TCP implementation is that it has characteristically poor performance in networks with a large bandwidth-delay product (long distances at high rates). Alternatively, a connectionless protocol like UDP/IP performs well even with short message sizes; however, UDP is lossy and does not guarantee that packets won't be dropped or delivered out of sequence.

Additional note(s): As system speeds are rapidly increasing, system-level performance bottlenecks are also decreasing making Gigabit Ethernet very deployable using technology available today. Still, careful consideration is required to ensure that the overall system and components are capable of processing data at gigabit speeds. Regarding use of protocol stack, TCP/IP may not be an optimal solution for real-time, short message and/or high frame-rate applications.

Q. How much of the CPU bandwidth is consumed copying payload data across the socket API interface in send() or recv() system calls?

A. On a 1-GHZ Intel Pentium III CPU with PC133 memory, about 46 percent of the CPU is consumed copying data in send() and recv() socket API system calls during a TCP/IP transfer at 110 megabytes per second (MB/sec) and about 24 percent on a 2-GHZ Intel Xeon CPU with DDR SDRAM. Based on this and the benchmark CPU utilizations, this would indicate that these Socket API buffer copies are likely the most significant single CPU/memory overhead factor in TCP/IP over Gigabit Ethernet -- much more so than the TCP/IP protocol stack itself.

Q. What are the solutions for gigabit networking in next generation designs?

A. Next-generation designs use fast, tightly coupled, low-latency hardware; CPUs with multiple cores and threads for partitioning and virtualization; system-level I/O acceleration; a high-speed advanced FSB; DDR2/DDR3 memory; chipsets with PCI-Express bus interfaces; and PCI-Express Gigabit interfaces that incorporate advanced I/O acceleration and data-path offload techniques for lower CPU utilization and better network performance.

Q. What is Intel's solution to the protocol stack and operating system overhead issues?

A. Intel has a system-level technology called "I/OAT" (IOAT2) that is intended to solve the problem at the system level -- addressing the operating system, task switching, API and protocol stack bottlenecks without requiring any recoding of the application interfacing to the Socket API, or any changes to the existing host based protocol stack. For additional information on IOAT2, please visit the Intel I/OAT website pages found at the following link (the short animated demo is very informative):

http://www.intel.com/technology/ioacceleration/index.htm

Q. What do the terms RDMA, iWarp and RNIC refer to?

A. RDMA is a relatively new advanced technology which allows the Gigabit controller to use DMA to move Ethernet messages directly to and from application buffers, thus relieving the current methods of copying data across the socket API interface. iWarp is a technology that uses RDMA in conjunction with TCP/IP data-path offload to offload the CPU for TCP/IP data transfer while keeping the host protocol stack intact. This is in contrast to the current TOE technology which implements a separate TCP/IP stack residing on the TOE controller. RDMA and iWarp show great promise, especially for 10-Gigabit Ethernet where it is most needed, but to be successful they must gain the support of not only hardware vendors, but also the Internet protocol community and major operating system vendors. Another issue facing iWarp/RDMA is that applications have to be recoded to use a new API and there are additional protocol stack layers on top of TCP/UDP/IP required to support it. RDMA and iWarp are expected to be first supported in the Windows operating system's "Chimney" architecture and it is currently unclear when this support will be fully available in Linux or VxWorks.

Demonstrating recent and promising activity toward RDMA/iWarp technology, Broadcom has announced the industry's first RDMA/iWarp based Gigabit Ethernet controller called "NetXtreme II", and information on this device can be found at the following link:

Broadcom RDMA/iWarp product information

Q. How are the high-speed protocol stack performance issues being addressed, what is the status of this work?

A. This is a new, evolving section intended to provide summary information and links; please check back often for additional updates. There is much work being done in the standards, academic research and scientific communities, including the IETF, ICIR and Open Group, as well as in the high-speed network technology industry to address these issues, which can be categorized as follows:

1. High-speed TCP: Design goals for 1 to 10 gigabit transmission over long-distance networks to address issues with high bandwidth-delay networks. To find information, use search engine keywords "HighSpeed TCP", "Scalable TCP", "FAST protocol" and "UDT protocol". In addition, there is an experimental RFC from the Transport Area Working Group within the IETF (RFC 3649, December 2003) for HighSpeed TCP. The following links may provide additional useful information:

http://www.ietf.org/rfc/rfc3649.txt
http://netlab.caltech.edu/FAST

2. New Socket API extensions: Will include features for efficient data transfer and buffer management among other new capabilities. Much of this work seems to be stemming from the RDMA Consortium with the "Sockets Direct Protocol" (SDP), the Infiniband/RDMA industry, the Open Group Transport and IETF Socket API Working Groups. There is a new draft standard for RDDP submitted to the IETF which can be found at the link below. Use search words including "SDP", "RDMA", "RDDP", and "Socket API extensions". This section will be extended as more information becomes available. Please also see the following links for additional information:

http://www.rdmaconsortium.org/home
http://www.ietf.org/html.charters/rddp-charter.html
http://www.infinibandta.org/events/past/spring2002/2_Sockets_Direct_Protocol.pdf

3. Protocol efficiency: Features designed to improve efficiency and lower CPU utilization for high-speed multi-gigabit networking, including new protocols such as SCTP and enhancements to existing protocols. To be completed as information becomes available. Information on SCTP can be found at the following links:

http://www.ietf.org/html.charters/sigtran-charter.html
http://www.ietf.org/html.charters/tsvwg-charter.html

Other links:

http://www.icir.org
http://www.opengroup.org/icsc

Q. Will full TCP/IP offload (TOE) help solve performance limiting issues in Gigabit Ethernet?

A. TOE refers to "TCP offload engine" whereby the full TCP/IP protocol stack is offloaded to a dedicated controller effectively bypassing the host-resident protocol stack. It attempts to solve a system-level problem by pushing it out to the NIC, but most of the critical system level issues remain. Due to this and the lack of a standard, there are many operating system, integration, security and maintenance issues associated with TOE and to date TOE has not gained significant deployment or acceptance. In addition there has been significant progress in the industry to identify and address the issues specifically relating to TCP/IP/protocol stack performance and CPU utilization. The results of these efforts are now primarily in the RDMA Consortium and the IETF "RDDP" Working Group and this new standards-based technology is known as "RDMA" or "iWarp". In contrast to TOE, RDMA/iWarp allows for the host-based protocol stack to remain intact by offloading only the TCP/IP data path processing. In addition it employs RDMA technology to provide direct movement of data in and out of application buffers by the Gigabit controller using DMA. Because this approach is standardized and allows the hosts protocol stack to remain intact while addressing the real issues with protocol stack performance and CPU utilization, the RDMA/iWarp technology should have a much better chance of gaining widespread industry support. In addition to these new software standards, new system level hardware architectures like Intel's I/OAT are being developed to solve the system level hardware issues for both one and 10G Ethernet.

Q. Please summarize the primary issues and solutions regarding new and existing TOE products.

A. Issues with current (proprietary) TOE implementations:

  • TOE tries to address a problem without knowledge of all of the real underlying system level issues. It attempts to push the solution out to the NIC (TOE), however many of the system level issues still remain. For example, what benefit is a TOE if the PCI bus and system controller have performance issues and are a bottleneck? What if the SDRAM bandwidth is causing a bottleneck? How is the issue of data movement (copy) in the Socket API solved without recoding applications?
  • TOE typically performs worse than an intelligent NIC with PCI-X and DMA bursting capabilities when short to medium frame sizes are used and in high frame rate applications.
  • The major CPU and Ethernet silicon providers have come to realize that this problem must be solved at the system level with advanced hardware I/O acceleration architectures including hyper-threading, other parallel architectures, in-memory DMA instructions, PCI-Express, faster CPU's and faster DDR memory.
  • Little market acceptance - most TOE board and silicon vendors to-date have failed due to several reasons including cost, complexity, market acceptance and viability.
  • Because of its non-viability, TOE is not mainstream technology and has since been abandoned by all the major Ethernet silicon vendors. Standards based solutions including system-level hardware acceleration and iWarp/RDMA are now emerging.
  • New TOE solutions are only being developed by 2nd or 3rd tier manufacturers and will likely fall behind in support for new technologies including PCI Express. PCI Express is critical for technologies like Gigabit Ethernet.
  • The major top-tier silicon vendors competing in this field are hard at work solving the 1G and 10G issues from a system-level standpoint -- the real problems are being solved with viable solutions.
  • TOE is not implemented exclusively in hardware as some claim. Most implementations use one or more embedded RISC microcontrollers and protocol stack software and/or software state machines.
  • Existing TOE is not standards based and different proprietary implementations are employed by each vendor.
  • Not supported by major operating systems including Windows, Linux and VxWorks - essentially today still a protocol stack bypass. System level issues still remain.
  • High cost and complexity for a network interface card -- typically $600 - $1200 for one or two ports.
  • The major GbE silicon vendors including Intel and Broadcom have CPU offload features that solve the primary performance issues related to the protocol stack - leaving operating system issues including data movement (in socket API) and task switching overhead as the primary CPU/memory utilization factors. The remaining issues are being solved from a system level standpoint. In contrast, most TOE implementations will only partially solve some issues and in limited cases utilizing large message sizes.
  • Will not likely alleviate issues in socket API data copy overhead on CPU and may only partially reduce task switching.

Other issues to consider:

  • May require applications be re-coded to use a non-standard or iSCSI type API to get around socket API issues.
  • TOE NICs will quickly become outdated or obsolete and not keep up with the system level performance increases.
  • Does not solve TCP protocol bandwidth/delay issues which are based on transmission distance.
  • Probable deployment, management and security issues.
  • Protocol stack functionality in TOEs is restricted in comparison to the rich features in Linux network protocol support, routing and other network capabilities.
  • Is not open source based and may not contain all of the latest applicable RFCs.
  • Has maintainability issues -- how quickly will compatibility and feature support issues be resolved with the TOE vendors stack?
  • Not proven -- where are the benchmarks utilizing various frame sizes?
  • High price to pay for a risky unproven solution offering limited benefit.

TOE summary of issues

  • Technology is too late for Gigabit Ethernet. TOE gained attention mostly in 2002, but has largely died out due to lack of standards, viable solutions and market acceptance.
  • Possibility for one and 10-gigabit Ethernet based on new RDMA/RDDP/iWarp standards in the IETF and standards-based silicon technology from companies including Intel and Broadcom.
  • However, issues are also being solved at the system level by the top-tier major silicon vendors in this field, including Intel with its I/OAT technology.

Alternative solutions to TOE

  • Use hyperthreaded and/or multicore CPUs on a Linux multiprocessor system. This is a standards-based, low-cost alternative to TOE which offers better performance and less complexity.
  • Be realistic and balance the hardware resources. Avoid using outdated hardware with known performance bottlenecks (CPU, memory, bus interfaces).
  • Use systems with fast DDR memory and PCI-Express (PCIe) bus interfaces. These offer real and effective solutions.
  • Use the latest standards including RDMA/iWarp which allow for efficient transfers within the existing framework.
  • Solve the problem at the system level -- it is not advisable to use proprietary TOE band-aids for these stated reasons.

Q. What types of loopbacks and frame generators are available with your Gigabit Ethernet controllers and drivers?

A. Loopback tests and frame generators are available for both copper and fiber interfaces. For copper, the most useful types of loopbacks are internal loopbacks at the transceiver. These are available in our cards with 1000 base T copper transceivers. Internal transceiver loopbacks allow verification and performance tests to be isolated to a single system. Our frame generators can be used in conjunction with loopbacks to exercise the hardware, driver and operating system at maximum bus and wire speeds where CPU utilization can be measured. Fiber optic interfaces using SerDes do not have internal transceiver loopbacks; however, they can easily be placed in a physical loopback using a standard loopback plug or simply by looping a single fiber optic cable from the transmit back to the receive. External loopback connectors for 1000 base T copper interfaces are not available in Gigabit Ethernet because a "master/slave" link establishment is required and all 8 wires in an RJ-45/CAT5 cable are used for both transmit and receive. However, a physical loopback between two cards can be done by cabling the ports directly together and using the built-in frame generator. End-to-end frame generator loopbacks are also available and useful for additional testing in a more realistic environment.

Q. In multi-vendor integrated systems, what type of system level problems can most affect gigabit performance and how can these be resolved?

A. The most serious system-level problems are ones that affect PCI burst-mode bus-master DMA initiated by the gigabit controller. They usually occur in centrally arbitrated resources like the system controller and SDRAM interfaces. A system controller/SDRAM interface that is not configured for PCI burst transfers, is not configured for pre-fetching, incurs excessive retries or wait states, or otherwise prematurely terminates or truncates a burst-mode bus-master DMA from a high-speed controller like gigabit Ethernet can severely limit Gigabit Ethernet performance. It can easily cut performance in half and in some cases reduce it to 20 percent of its capacity. In these cases, when the Gigabit Ethernet controller requests bus-master DMA, it is dependent on the performance of the host board, system controller and SDRAM interface. The system board vendor should be asked to verify gigabit operation of PCI-based burst-mode bus-master DMA and provide performance statistics. In many cases a PCI bus analyzer is needed to analyze and isolate the problem, and a vendor-supplied patch or BSP upgrade may be necessary to correct it. Other issues include a disabled data cache, lack of support for hardware cache coherency or other cache coherency-related issues.

Q. Why is the system controller so important to Gigabit Ethernet performance?

A. In hardware capacity and performance terms, the system controller is critical to performance because it is the central system resource that controls activity on the PCI bus as well as access to SDRAM from the PCI bus. Gigabit Ethernet controllers are bus-master initiators and designed and capable of transferring data across the PCI bus at wire-speed. The system controller is the target of these burst DMA transfers. For memory read operations (packet transmit), it is the system controller that interacts with the SDRAM interface to fetch the data and deliver it to the Gigabit controller. If the system controller cannot deliver the data at PCI bus or wire-speeds, then the read performance is constrained by the system controller's speed and capability. For memory writes (packet receive), the system controller also must arbitrate and interact with the SDRAM interface and can also constrain performance. Using the STOP# and TRDY# control signals, the system controller can limit performance in several ways -- by prematurely terminating burst transfers, causing retries and adding excessive wait states. In many cases it may be configuration related and the PCI bus may have been given too low of a priority or limits on burst lengths in accessing SDRAM resources.

Q. Are there known performance issues with some system controller's PCI interfaces?

A. Yes, there is a known issue with the Discovery1 system controller affecting transmit DMA on boards with PowerPC 74xx processors. During transmit DMA burst reads from SDRAM initiated by a bus-master, the Discovery1 target may disconnect frequently and prematurely (after one 32-byte cache line is transferred) while snooping the cache and/or fetching data from SDRAM. This results in excessive retries and poor transmit DMA performance (85MB/sec or less even with 64-bit/66 MHZ PCI). It also may cause excessive transmit latency, further inhibiting driver and protocol stack performance. Flushing and invalidating the cache in the driver during transmit (normally not necessary with hardware cache coherency) helps to minimize the impact of this problem; however, transmit DMA performance may still be constrained on systems utilizing this controller.

Q. What type of CPU-offload and other features are supported by your Gigabit cards and drivers?

A. Our drivers and controllers provide feature-rich CPU offload capabilities and support many features including programmable DMA burst lengths, checksum offload (TCP, UDP and IP), programmable interrupt latency, jumbo frames, advanced packet filtering, auto negotiation, pause frame flow control, VLAN tagging and insertion, priority queues, loopback and a built-in frame generator. Intel-based controllers additionally support packet segmentation and reassembly offload. RFC MIB statistics are also kept by the controller hardware and are accessible via the driver. These features are configurable within the driver during initialization. Fail-over is also available as an option with many of the board level products. The drivers are also instrumented for integration-level verification using software trace, detailed statistics, register dumps and frame buffer capture. These features help make the integration process simple and verifiable to ensure the driver and controller are fully integrated and working properly.

Q. I am evaluating interface cards and systems for Gigabit Ethernet, what do you recommend I do?

A. Whatever vendor you choose, we recommend that you take the following steps in evaluating and selecting your Gigabit Ethernet card and system:

1. Specify your application's performance requirements in megabytes per second and average message size.

2. Determine which protocol stack best fits your application requirements. TCP/IP is generally more suited for large message sizes and larger transfer lengths and may not perform well in applications with short messages or high frame rates. Also consider using UDP, SCTP or another specialized protocol, and if your application runs in a closed network, consider using direct IP or raw frames.

3. Evaluate all the hardware you plan on using including the CPU, memory speed and system controllers. Make certain these have the performance and capacity for your application. Specifically, check specifications and ask vendors for performance and benchmark information for memory bandwidth and burst DMA performance from a fast PCI bus-master like Gigabit Ethernet. This is especially important because their SDRAM and system controller must be able to support these rates from the PCI bus otherwise the gigabit controller's throughput is constrained. If the system's performance is inadequate or the vendor can not provide benchmark and performance information or has not otherwise performance stress-tested their system with a fast bus-master like Gigabit Ethernet, then you can probably expect to have performance issues.

4. Evaluate the performance and features of the Gigabit Ethernet controllers. Make certain that the Gige vendor provides their own drivers and has performance and benchmark information. Benchmark information may not always apply to every system, but more importantly it shows that the vendor is knowledgeable and serious about performance issues and has stress-tested their products at maximum performance. It is also important to have features like built-in performance statistics, loopbacks and frame generators for integration and performance testing. Again, if the vendor can not provide this information or has not performance stress-tested their Gigabit Ethernet card on various systems or does not provide and support their own driver, you can probably expect to have integration or performance issues.

Q. Should I use Copper (1000 Base T) or Fiber (1000 Base SX/LX)?

A. Many customers use Gigabit Ethernet over copper for Local Area Networks. It is less expensive and normally easier to deploy over CAT5 than fiber. They are about equivalent in performance. Most fiber based gigabit controllers are not backward compatible with 100 Mbit Fast Ethernet and only support gigabit mode with auto negotiation. However, fiber mode controllers support much greater distances (400m with 1000 Base SX, 5 or more kilometers with 1000 Base LX) and are preferred in many embedded applications. We provide solutions for both copper and fiber to suit different applications and network topologies.

Q. Is auto-negotiation required when using Gigabit Ethernet in 1000-base-T or 1000-base-X gigabit modes?

A. Yes. Auto negotiation is required when using Gigabit Ethernet in gigabit-speed mode (1000 base T or X) because the link must resolve a master/slave relationship and this is performed using auto negotiation. It is not required for 10/100 modes but is required for gigabit operation. The IEEE recommended practice for limiting allowable link connection modes and speeds is to leave auto-negotiation enabled and limit which modes are allowed by setting the auto-negotiation advertisement registers. Please refer to the IEEE 802.3-2000 specification for additional information.

Q. What advantages are there for using PCI-X?

A. PCI-X is backwards compatible with 64-bit, 66MHZ PCI. It can up to double the speed of a 64-bit, 66MHZ PCI bus by raising the clock to up to 133 MHZ, and it allows more loads (cards) on the PCI bus. It also imposes new restrictions on retries and wait states for burst transfers -- thereby decreasing latency and improving performance. Typically, 2 cards running at 100 MHZ can be installed on a given PCI-X bus segment. There are system boards available from Intel, SuperMicro and other vendors with PCI-X slots and multiple PCI-X bus segments. Our newer cards, including the dual-port models 6162, 5262 and 5362, feature a PCI-X bus interface.

Q. Do UDP sockets perform better than TCP sockets and should they be used if the application permits?

A. Our tests show that UDP has much less overhead than TCP (up to 40 percent less) and our benchmarks show excellent performance and throughput using UDP sockets. If the application permits using UDP or if TCP is performance constrained, then we would suggest considering using UDP. Many voice, multimedia and streaming media applications are migrating to directly or indirectly using UDP or raw transmission services from the IP layer. Real-time and/or high bandwidth, low latency protocols including UDT, RTP and HighSpeed TCP may also be a good alternative to TCP.
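
For comparison with the TCP blaster shown earlier, the UDP equivalent is simply a sendto() loop on a datagram socket. The sketch below is illustrative only; the address, port and payload size are example values, and note that UDP provides no flow control or retransmission.

/* Minimal UDP "blaster" sketch -- fire-and-forget datagrams to a fixed peer.
 * Address, port and payload size are example values only. */
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MSG_SIZE 1472   /* fits one standard Ethernet frame (1500 MTU - IP/UDP headers) */

int udp_blast(const char *peer_ip, unsigned short port)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer;
    static char buf[MSG_SIZE];

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    peer.sin_addr.s_addr = inet_addr(peer_ip);

    for (;;) {
        if (sendto(sock, buf, sizeof(buf), 0,
                   (struct sockaddr *)&peer, sizeof(peer)) < 0)
            break;   /* no retransmission or flow control -- UDP is lossy */
    }
    close(sock);
    return -1;
}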

Q. What are these "full socket" counts seen in the UDP statistics when I run an end-to-end UDP blaster test?

A. Full socket counts indicate that the UDP/IP protocol stack is running out of receive buffering resources, typically buffers of the 64- or 128-byte sizes. Since UDP is connectionless, UDP stores the message source IP address and port in the receive buffer. Therefore, an additional 16 bytes per message are needed for the receive buffer and this buffer must be allocated from a pool by the protocol stack during packet receive. If there is no space, as determined by the space calculation, the received packet is dropped and the error is reported by udpstatShow as a full sockets error. As such, these errors can occur when running end-to-end UDP/IP throughput tests for reasons including the following:

1. MBLK buffer pool sizes are not large enough. You may need to increase the number of 64- and 128-byte buffers used for receive UDP address information (the driver maintains its own pool of receive frame buffers) to a number greater than the number of receive buffers and the descriptor list size in the driver. A safe recommended value for the number of size 64/128 UDP receive buffers is 512 or greater.

2. The sending station is a faster system or is otherwise transmitting at a rate faster than the receiver.

3. The receiver application and UDP/protocol stack are not consuming data fast enough to keep up with the transmitter. The gigabit controller and driver are forwarding receive frames upstream at a rate the protocol stack and application cannot keep up with.

4. There is no end-to-end flow control in a connectionless protocol like UDP (it is lossy in nature) and no way to force the transmitter to slow down to the receiver's pace. While it may be possible to use "pause frame" or other "tuning" techniques to flow control the transmitter, these are not designed to provide flow control for the protocol stack and may only have limited effectiveness.

To resolve this issue you must either ensure the receiver has enough receive buffers and is fast enough to keep up with the transmitter, or provide a mechanism in your application for end-to-end flow control based on the receiver's ability to consume data. Increasing the number of 128-byte MBLKs may help, but will not completely alleviate the problem if the receiver cannot keep up.

Note: We have also seen that doubling the size of the UDP/IP receive socket buffer and message size helps reduce the number of full sockets. Using larger message and buffer sizes on the receive side still preserves message delineation.
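
The receive-side socket buffer mentioned in the note above can be enlarged in the same way as for TCP. The following is only a sketch (generic BSD-sockets style; exact types differ slightly between operating systems), and the stack may silently clamp the requested value, so it is worth reading it back.

/* Enlarge the UDP receive socket buffer on the blastee side.  The stack may
 * clamp the value to a system limit, so read back what was actually granted. */
#include <sys/types.h>
#include <sys/socket.h>

int grow_udp_rcvbuf(int sock, int bytes)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&actual, &len) < 0)
        return -1;
    return actual;   /* what the stack actually granted */
}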

Q. What about Fibre Channel vs. Gigabit Ethernet?

A. Fibre Channel is primarily for Storage Area Networks (SAN) and data center server applications and is more costly and complex to deploy than Gigabit Ethernet. When used in networking with protocol stacks, fibre channel does not offer significant advantages and indeed the added complexity can cause significant development and integration issues. Gigabit Ethernet is more suitable for a variety of applications including telecommunications, VoIP, multimedia, broadband video, digital imaging, internet security, streaming media and other high performance data intensive applications.

Q. How robust and reliable are the VxWorks, Linux and Windows (2000, XP) drivers supplied with the OEM developer kit?

A. Our drivers are very reliable, robust and mature and have been deployed by hundreds of customers over several years. We constantly stress test and benchmark our cards and drivers to maximum wire-speed performance. Our benchmark results are posted on our website and we also publish performance information on product pages and in datasheets. Please note, however, that the Windows operating system is not real-time and has high interrupt and task-switching latency, which may limit performance.

Q. What sort of effort should be expected to integrate the VxWorks driver into my target system?

A. Because VxWorks may have some board and BSP specific differences on systems from different vendors, some integration as well as testing should be expected on the customer's part. The effort required normally falls within a few days to one week for development, plus time for additional testing. It is not a major effort as our driver has been integrated into many different VxWorks based systems including PowerPC, Intel, MIPS and XScale. Many times the driver needs only to be recompiled for the target system. The driver is designed to work in various systems with minimal changes. The required changes are encapsulated within easily identifiable "ifdefs", and a VxWorks integration manual is provided that describes the integration procedures in detail. We have found that most customers complete this effort in just a few days to one week, with follow-on testing of perhaps a few weeks depending on the customer's level of test requirements.

Q. What types of issues are involved in integrating the VxWorks driver in a typical system?

A. Because VxWorks runs on so many types of embedded and real-time systems and CPU types, some minor board and BSP specific differences requiring integration should be expected. Specifically, the gigabit controller and driver require the following system resources be mapped properly in order to function:

1. PCI configuration of the memory-mapped I/O address used to access the controller registers as a PCI target from the CPU, and its mapping to a local bus address.

2. PCI configuration of the interrupt and its mapping to a local CPU interrupt vector.

3. Mapping of local SDRAM buffer and buffer descriptor addresses to the PCI bus addresses used by the controller for bus-master DMA operations.

In most cases, all that is needed is to re-compile the driver -- the default settings are appropriate and the driver is written to accommodate these minor variations in BSP implementations.
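
To give a feel for what those ifdefs typically look like, the sketch below is purely illustrative -- the macro and symbol names are hypothetical and are not taken from the actual driver source. The board-specific piece is usually the translation between local (CPU) addresses and the PCI bus addresses used for bus-master DMA (item 3 above).

/* Illustrative only -- macro and symbol names are hypothetical and do not
 * correspond to the actual driver source.  The point is that the only
 * board-specific piece is usually the local-address <-> PCI-address mapping. */

#ifndef PCI_DRAM_OFFSET
#define PCI_DRAM_OFFSET 0x80000000UL   /* example value for illustration only */
#endif

#ifdef TARGET_MVME5500
  /* Example: a BSP where PCI bus addresses equal local addresses (1:1 map). */
  #define LOCAL_TO_PCI_ADDR(addr)   ((unsigned long)(addr))
#else
  /* Example: a BSP that applies a fixed offset between the CPU and PCI views
   * of SDRAM.  PCI_DRAM_OFFSET would be a BSP-provided constant. */
  #define LOCAL_TO_PCI_ADDR(addr)   ((unsigned long)(addr) + PCI_DRAM_OFFSET)
#endif

/* The driver would then program descriptor buffer addresses along these lines:
 *     desc->bufAddr = LOCAL_TO_PCI_ADDR(pFrameBuffer);                        */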

Q. Is source code to the Linux, VxWorks and Windows (2000, XP) drivers available? How is technical support provided for these products?

A. We supply an OEM developer kit which includes documentation and source code for our VxWorks (5.4/5.5), Linux (2.4/2.6) and Windows (2000, XP) drivers, optionally available with the purchase of board level products. We provide 24/7 technical support via email requests to support@dssnetworks.com. Response time is usually one day or less depending on the information requested.

Q. What else is contained on the OEM Developer Kit?

A. In addition to VxWorks, Linux and Windows driver source code, we provide test programs, loopback tests, frame generators, sample code, utilities, datasheets, user manuals, release notes and an integration guide in the OEM developer kit. Windows driver executables and installation are also included.

Q. Can the VxWorks or Linux drivers be ported to other operating systems?

A. Yes, we have ported our VxWorks driver to Linux, Windows and Solaris, and our customers have ported it to other operating systems, systems with a custom or proprietary operating system, or no operating system at all. The driver is designed to be portable and is well modularized. It also has a top and bottom half to make it more easily ported to other environments.

Q. Can polling for receive buffers rather than using interrupts help improve performance?

A. While processing transmit completions within the "send" function itself is effective at reducing transmit completion interrupts, polling for receive buffers is not recommended for the following reasons:

Polling for receive buffers would require polling at a high fixed rate depending on the number of descriptor lists configured. A typical polling rate would be around every 100-200 microseconds. This latency cannot be too high or the buffer descriptor lists will overrun. Regardless of the operating system, it may be difficult to poll at such an interval and to guarantee this poll rate without using a high-speed system hardware timer. Since a hardware timer also produces an interrupt, the polling may not be better than the controller's interrupt itself. If polling is performed with a background task, it would be required to run at a low system priority so as not to block other system tasks. It would likely be difficult to tune the poll rate to be neither too slow nor too fast. Also, the task or interrupt thread that normally services the receive and transmit buffer descriptor lists must be one of the highest priority tasks in the system so as to prevent receive buffer overruns and to free and make available the transmitted buffers.

Because of these system issues, we believe polling is too complex, less reliable, would likely consume more CPU, and would invoke similar context switching as the interrupt processing itself. CPUs are designed for very fast interrupt processing and context switching. In addition, Gigabit Ethernet normally processes many receive frames and transmit completions per hardware interrupt, so interrupt processing is not a major factor in system performance. The Gigabit Ethernet controllers also have programmable interrupt latency which allows interrupts to be deferred in units of 100 microseconds. This programmable latency period allows more events to queue up -- allowing many receive buffers and transmit completions to be processed for each interrupt.

On the other hand, processing transmit completions during the transmit process can be done effectively; however, it seems to be of real benefit only for uni-directional UDP/IP transmit-only tests.
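
To illustrate why interrupt overhead is amortized, the sketch below shows a receive interrupt service routine draining every completed descriptor in the ring before returning, so one hardware interrupt can cover tens or hundreds of frames. The descriptor structure, field and function names are hypothetical and are not taken from the actual driver.

/* Hypothetical sketch of amortized interrupt servicing -- structure and field
 * names are illustrative only.  One hardware interrupt drains every completed
 * receive descriptor queued since the last service. */

struct rx_desc {
    volatile unsigned int status;   /* DESC_DONE set by the controller via DMA */
    void                 *buffer;   /* receive frame buffer                     */
    unsigned int          length;   /* received frame length                    */
};

#define DESC_DONE   0x1
#define RING_SIZE   256

extern struct rx_desc rx_ring[RING_SIZE];
extern void deliver_frame_upstream(void *buf, unsigned int len);
extern void refill_descriptor(struct rx_desc *d);

void rx_interrupt_service(void)
{
    static unsigned int next = 0;   /* next descriptor to examine */

    /* Keep consuming until a descriptor the hardware has not yet filled. */
    while (rx_ring[next].status & DESC_DONE) {
        deliver_frame_upstream(rx_ring[next].buffer, rx_ring[next].length);
        refill_descriptor(&rx_ring[next]);   /* return the buffer slot to the ring */
        next = (next + 1) % RING_SIZE;
    }
    /* With interrupt coalescing/latency enabled in the controller, many
     * descriptors typically complete per interrupt, keeping per-frame cost low. */
}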


Copyright 2000-2016 DSS Networks Inc. All Rights Reserved.