|
http://www.xasun.com/article/2a/1991$3.html
1. With the Xeon 5500 series processors, Intel has moved from its traditional Symmetric Multiprocessing (SMP) architecture to a Non-Uniform Memory Access (NUMA) architecture. In a two-processor scenario, the Xeon 5500 series processors are connected through a serial coherency link called QuickPath Interconnect (QPI). The QPI is capable of 6.4, 5.6 or 4.8 GT/s (gigatransfers per second), depending on the processor model.
The Xeon 5500 series integrates the memory controller within the processor, resulting in two memory controllers in a two-socket system. Each memory controller has three memory channels and supports DDR3 memory. Depending on the processor model, the type of memory used, and how the memory is populated, memory may be clocked at 1333MHz, 1066MHz or 800MHz. Each memory channel supports up to 3 DIMMs per channel (DPC), for a theoretical maximum of 9 DIMMs per processor or 18 per 2-socket server. (See Figure 1 for illustration.) However, the actual maximum number of DIMMs per system depends on the system design.
Summary: The new Xeon 5500 series CPUs have changed from the earlier SMP structure to a NUMA structure. The two CPUs no longer manage a common pool of memory; instead, the memory controller is integrated into each CPU, and each CPU manages 3 channels with up to 9 DIMMs in total. The CPUs are interconnected through QPI (which can be thought of as an internal bus). The highest memory frequency that can be reached depends on both the CPU and the DIMMs.
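The topology arithmetic above can be sketched in a few lines of Python. The channel and DPC counts come from the text. The QPI payload width of 2 bytes per transfer per direction is an assumption taken from commonly quoted QPI figures, not from this article, and the function names are purely illustrative.

```python
# Figures quoted in the text above.
CHANNELS_PER_PROCESSOR = 3  # memory channels per integrated memory controller
MAX_DIMMS_PER_CHANNEL = 3   # upper DPC limit per channel

# ASSUMPTION (not stated in the article): each QPI transfer carries
# 2 bytes of payload in each direction.
QPI_BYTES_PER_TRANSFER = 2


def max_dimms(sockets: int) -> int:
    """Theoretical DIMM ceiling for a system with the given socket count."""
    return sockets * CHANNELS_PER_PROCESSOR * MAX_DIMMS_PER_CHANNEL


def qpi_gb_per_s(gigatransfers_per_s: float) -> float:
    """Approximate QPI bandwidth per direction, in GB/s."""
    return gigatransfers_per_s * QPI_BYTES_PER_TRANSFER


print("DIMMs per processor:", max_dimms(1))
print("DIMMs per 2-socket server:", max_dimms(2))
for rate in (6.4, 5.6, 4.8):
    print(f"{rate} GT/s is roughly {qpi_gb_per_s(rate):.1f} GB/s per direction")
```

These are theoretical ceilings only; as the text notes, the real DIMM count depends on the system design.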
2.Memory Performance
With the varied number of configurations possible in Xeon 5500 series processor-based systems, a number of variables emerge that influence processor/memory performance. The main variables are memory speed, memory interleaving, memory ranks and memory population across the memory channels and processors. Depending on the processor model and number of DIMMs, memory performance on the Xeon 5500 platform can vary widely. We will look at each of these factors more closely in the next sections.
Summary: The factors most relevant to memory performance include the CPU type, the number of DIMMs installed per channel, the performance of the memory itself, the way memory is interleaved, the number of memory ranks, and so on.
2.1 Memory Speed
As mentioned earlier, the memory speed is determined by the combination of the processor model, DIMM speed, and DIMMs per channel.
2.1.1 Processor model
The initial Xeon 5500 series processor-based offerings will be categorized into 3 bins called Performance, Volume and Value. The 3 bins can clock memory at different maximum speeds:
– 1333MHz (X55xx processor models)
– 1066MHz (E552x or L552x and up)
– 800MHz (E550x)
So, the processor model limits the maximum frequency of the memory.
Note: Because the memory controllers are integrated into the processor, the former front-side bus (FSB) no longer exists (there is no FSB concept, just as with AMD processors).
2.1.2 DDR3 DIMM Speed
DDR3 memory will be available in various sizes at speeds of 1333MHz and 1066MHz. 1333MHz represents the maximum speed at which memory can be clocked. However, the BIOS will never clock the memory faster than the processor model allows.
2.1.3 DIMMs per Channel (DPC)
The number and type of DIMMs and the channels in which they reside will also determine the speed at which memory will be clocked. Table 1 describes the behavior of the platform. The table assumes a 1333MHz-capable processor model (X55xx). If a slower processor model is used, then the memory runs at the lower of the speed given in the table and the maximum speed supported by the processor model. If the DPC is not uniform across all the channels, then the system will clock to the frequency of the slowest channel.
Summary: The memory operating frequency differs with the number of DIMMs installed per channel; see the table below for details.
Table 1
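The clocking rules of section 2.1 can be sketched as a small Python function. The processor caps come from section 2.1.1. The DPC caps are an assumption inferred from the DIMM counts this article quotes elsewhere for a two-socket system (6 DIMMs at 1333MHz, 12 at 1066MHz, 18 at 800MHz), since the table itself is not reproduced here. The sketch ignores DIMM type and rank count (for example, the down-clocking caused by quad-rank DIMMs), and all names are hypothetical.

```python
# Maximum memory clock per processor bin (MHz), from section 2.1.1.
PROCESSOR_CAP_MHZ = {"X55xx": 1333, "E552x/L552x": 1066, "E550x": 800}

# ASSUMPTION: caps per DIMMs-per-channel, inferred from the DIMM counts
# quoted in the article rather than read from the missing table.
DPC_CAP_MHZ = {1: 1333, 2: 1066, 3: 800}


def effective_memory_speed(processor_bin, dimm_speed_mhz, dimms_per_channel):
    """Return the memory clock (MHz) the system would settle on.

    dimms_per_channel lists the DIMM count of every channel in the system.
    The most heavily populated channel sets the DPC cap, because the system
    clocks to the frequency of the slowest channel.
    """
    populated = [count for count in dimms_per_channel if count > 0]
    if not populated:
        raise ValueError("at least one channel must hold a DIMM")
    worst_dpc = max(populated)
    if worst_dpc not in DPC_CAP_MHZ:
        raise ValueError("a channel supports at most 3 DIMMs")
    return min(PROCESSOR_CAP_MHZ[processor_bin],
               dimm_speed_mhz,
               DPC_CAP_MHZ[worst_dpc])


# Two-socket examples: six channels in total.
print(effective_memory_speed("X55xx", 1333, [1, 1, 1, 1, 1, 1]))
print(effective_memory_speed("X55xx", 1333, [2, 2, 2, 2, 2, 2]))
print(effective_memory_speed("E550x", 1333, [1, 1, 1, 1, 1, 1]))
```

Taking the minimum of the three caps mirrors the rule that the slowest constraint wins, whether it is the processor bin, the DIMM rating, or the channel population.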
2.1.4 Low-level Performance Specifics
It is important to understand how memory speed affects the performance of the Xeon 5500 series platform. We will use both low-level memory tools and application benchmarks to quantify the impact of memory speed.
Summary: The parameters that characterize memory performance are latency and throughput.
Two of the key low-level metrics that are used to measure memory performance are memory latency and memory throughput. We use a base Xeon 5500 2.93GHz, 1333MHz-capable 2-socket system for this analysis. The memory configurations for the three memory speeds in the following benchmarks are as follows:
– 1333MHz: 6 x 4GB dual-rank 1333MHz DIMMs
– 1066MHz: 12 x 2GB dual-rank DIMMs at 1066MHz
– 800MHz: 12 x 2GB dual-rank DIMMs clocked down to 800MHz in BIOS
Note: Memory ranks are explained in detail in section 2.3.
Table 2 below shows the unloaded latency to local memory. The unloaded latency is measured at the application level and is designed to defeat processor prefetch mechanisms. As shown in Table 2, the difference between the fastest and slowest speeds is about 10%. This represents the high watermark for latency-sensitive workloads. Another important thing to note is that this is almost a 50% decrease in memory latency when compared to the previous generation Xeon 5400 series processor on 5000P chipset platforms.
Summary: Memory latency differs by about 10% between the fastest and slowest memory speeds, while the 5500 series CPUs improve on the 5400 series CPUs by about 50% overall.
Table 2
A better indicator of application performance is memory throughput. We use the triad component of the STREAM benchmark to compare the performance at different memory speeds. The memory throughput figures assume that all memory is allocated locally and that all 8 cores are accessing main memory. As shown in Table 3, the performance gain from running memory at 1066MHz versus 800MHz is 28%, and the performance gain from running at 1333MHz versus 1066MHz is 9%. So, the performance penalty of clocking memory down to 800MHz is far greater than that of clocking it down to 1066MHz.
This new processor design comes with some trade-offs in memory capacity, performance, and cost. For example, using more lower-cost, lower-capacity DIMMs means lower memory speed. Alternatively, fewer higher-capacity DIMMs cost more but offer higher performance.
Note: Dropping the memory frequency from 1333MHz to 1066MHz costs less than dropping it from 1066MHz to 800MHz.
Table 3
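To make the throughput figures concrete, here is a small worked calculation in Python. It uses only the two measured step gains quoted above (28% and 9%). The comparison against the raw clock ratio and the helper name are illustrative additions, not part of the original analysis.

```python
# Measured STREAM triad gains quoted in the text, keyed by (slower, faster) MHz.
MEASURED_STEP_GAIN = {(800, 1066): 0.28, (1066, 1333): 0.09}

# Compare each measured gain with the raw increase in memory clock.
for (slow, fast), gain in MEASURED_STEP_GAIN.items():
    clock_gain = fast / slow - 1.0
    print(f"{slow} -> {fast}MHz: clock +{clock_gain:.0%}, bandwidth +{gain:.0%}")

# Gains compound multiplicatively across the two steps.
total_gain = (1 + 0.28) * (1 + 0.09) - 1
print(f"800 -> 1333MHz: bandwidth +{total_gain:.1%}")


def downclock_penalty(gain: float) -> float:
    """Bandwidth lost when moving back down a step that gained `gain`."""
    return 1 - 1 / (1 + gain)


print(f"1333 -> 1066MHz: -{downclock_penalty(0.09):.1%}")
print(f"1333 -> 800MHz:  -{downclock_penalty(total_gain):.1%}")
```

The compounded gain works out to just under 40%, which is consistent with the "almost 40%" bandwidth improvement cited in section 2.1.5.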

Regardless of memory speed, the Xeon 5500 platform represents a significant improvement in memory bandwidth over the previous Xeon 5400 platform. At 1333MHz, the improvement is almost 500% over the previous generation. This huge improvement is mainly due to the dual integrated memory controllers and faster DDR3 1333MHz memory. This improvement translates into improved application performance and scalability.
Summary: The Xeon 5500 series CPUs improve memory bandwidth by almost 500% over the previous 5400 series CPUs.
2.1.5 Application Performance
In this section, we will discuss the impact of memory speed on the performance of three commonly used benchmarks: SPECint2006_rate, SPECfp2006_rate and SPECjbb2005. In each case, the benchmark scores are relative to the score at 800MHz as shown in Figure 8.
SPECint2006_rate is typically used as an indicator of performance for commercial applications. It tends to be more sensitive to processor frequency and less sensitive to memory bandwidth. Very few components of SPECint2006_rate are memory bandwidth intensive, so this workload gains the least from memory speed improvements. In fact, most of the difference observed is due to one of the sub-benchmarks that shows a high sensitivity to memory frequency. There is an 8% improvement going from 800MHz to 1333MHz while the improvement in memory bandwidth is almost 40%.
SPECfp2006_rate is used as an indicator for HPC (high-performance computing) workloads. It tends to be memory bandwidth intensive, so it should show significant improvements as memory frequency increases. As expected, a number of sub-benchmarks demonstrate improvements as high as the difference in memory bandwidth. As shown in Figure 8, there is a 13% gain going from 800MHz to 1066MHz and another 6% improvement with 1333MHz. SPECfp2006_rate captures almost 50% of the memory bandwidth improvement.
SPECjbb2005 is a workload that does not stress memory but keeps the data bus moderately utilized. This workload provides a middle ground and the performance gains reflect that trend. As shown in Table 4, there is an 8% gain from 800MHz to 1066MHz and another 2% upside with 1333MHz.
Table 4
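The claim that SPECfp2006_rate captures almost 50% of the bandwidth improvement can be checked with a short calculation. The sketch below uses only percentages quoted in sections 2.1.4 and 2.1.5. The helper name `compound_gain` is illustrative, and treating the step gains as multiplicative is an assumption about how the figures were reported.

```python
def compound_gain(*step_gains: float) -> float:
    """Combine successive relative gains (0.13 means +13%) into one total."""
    total = 1.0
    for gain in step_gains:
        total *= 1.0 + gain
    return total - 1.0


# Memory bandwidth: +28% (800 -> 1066MHz), then +9% (1066 -> 1333MHz).
bandwidth_gain = compound_gain(0.28, 0.09)

# Benchmark gains from 800MHz to 1333MHz, as quoted in the text.
workload_gain = {
    "SPECint2006_rate": 0.08,                    # quoted as a single figure
    "SPECfp2006_rate": compound_gain(0.13, 0.06),
    "SPECjbb2005": compound_gain(0.08, 0.02),
}

for name, gain in workload_gain.items():
    captured = gain / bandwidth_gain
    print(f"{name}: +{gain:.1%}, about {captured:.0%} of the bandwidth gain")
```

Under these assumptions SPECfp2006_rate lands at roughly half of the bandwidth gain, with SPECjbb2005 near a quarter and SPECint2006_rate near a fifth, matching the ordering described above.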

2.2 Memory Interleaving
Memory interleaving refers to how physical memory is interleaved across the physical DIMMs. A balanced system provides the best interleaving. A Xeon 5500 series processor-based system is balanced when all memory channels on a socket have the same amount of memory. The simplest way to enforce optimal interleaving is to populate 6 identical DIMMs at 1333MHz, 12 identical DIMMs at 1066MHz or 18 identical DIMMs (where supported by the platform) at 800MHz. Any population that leaves some channels with more memory than others is unbalanced, and this leads to lessened performance.
Figure 9 shows the impact of reduced interleaving. The first configuration is a balanced baseline configuration where the memory is down-clocked to 800MHz in BIOS. The second configuration populates four channels with 50% more memory than the other two channels, causing an unbalanced configuration. The third configuration balances the memory on all channels by populating the channels that have fewer DIMM slots with a DIMM that is double the capacity of the others. (For example, two channels with 3 x 4GB DIMMs and one channel with 1 x 4GB and 1 x 8GB DIMMs.) This ensures that all channels have the same capacity. As Table 6 shows, the first and third balanced configurations significantly outperform the unbalanced configuration. Depending on the memory footprint of the application and its memory access pattern, the impact could be higher or lower than for the two applications cited in the figure.
Note: The more DIMMs installed, the lower the memory operating frequency: 12 DIMMs run at 1066MHz and 18 DIMMs run at 800MHz. See Table 7 for details.
Table 6, Table 7
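The balance rule from section 2.2 is easy to check programmatically. The sketch below is a minimal illustration, assuming each socket is described as a list of channels and each channel as a list of DIMM sizes in GB. The function names and the data layout are hypothetical.

```python
def channel_capacities(channels):
    """Total capacity (GB) of each channel, given DIMM sizes per channel."""
    return [sum(dimms) for dimms in channels]


def is_balanced(channels):
    """True when every channel on the socket holds the same, non-zero capacity."""
    capacities = channel_capacities(channels)
    return len(set(capacities)) == 1 and capacities[0] > 0


# One socket of a 16-slot platform filled with identical 4GB DIMMs:
# two channels with 3 DIMMs and one channel with only 2 slots.
unbalanced = [[4, 4, 4], [4, 4, 4], [4, 4]]

# The fix described in the text: the short channel gets one double-size DIMM.
rebalanced = [[4, 4, 4], [4, 4, 4], [4, 8]]

for label, socket in (("unbalanced", unbalanced), ("rebalanced", rebalanced)):
    print(label, channel_capacities(socket), is_balanced(socket))
```

In the unbalanced case the full channels hold 12GB against 8GB in the short channel, which is the "50% more memory" situation of the second configuration.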


2.3 Memory Ranks
A memory rank is simply a segment of memory that is addressed by a specific address bit. DIMMs typically have 1, 2 or 4 memory ranks, as indicated by their size designation. A typical memory DIMM description is: 2GB 4R x8 DIMM
– The 4R designator is the rank count for this particular DIMM (R for rank, so 4 ranks).
– The x8 designator is the data width of the rank.
It is important to ensure that DIMMs with the appropriate number of ranks are populated in each channel for optimal performance. Whenever possible, it is recommended to use dual-rank DIMMs in the system. Dual-rank DIMMs offer better interleaving and hence better performance than single-rank DIMMs. For instance, a system populated with 6 x 2GB dual-rank DIMMs outperforms a system populated with 6 x 2GB single-rank DIMMs by 7% for SPECjbb2005. Dual-rank DIMMs are also better than quad-rank DIMMs because quad-rank DIMMs will cause the memory speed to be down-clocked.
Another important guideline is to populate equivalent ranks per channel. For instance, mixing single-rank and dual-rank DIMMs in a channel should be avoided.
Summary: The rank count reflects how a DIMM is built. The total number of ranks each channel can support is limited, so in practice memory capacity has to be balanced against memory frequency. Dual-rank DIMMs are usually the recommended choice.
2.4 Memory Population across Memory Channels
It is important to ensure that all three memory channels in each processor are populated. The relative memory bandwidth is shown in Figure 10, which illustrates the loss of memory bandwidth as the number of populated channels decreases. This is because the bandwidth of all the memory channels is needed to support the capability of the processor. So, as the number of populated channels decreases, the burden of supplying the requisite bandwidth falls on the remaining channels, causing them to become a bottleneck.
Table 8
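As a rough illustration of why unpopulated channels cost bandwidth, the sketch below computes the theoretical peak bandwidth of one socket for 1, 2 and 3 populated channels. It assumes the standard 64-bit (8-byte) DDR3 data bus per channel and treats the quoted "MHz" figures as transfer rates in MT/s. Neither assumption is stated in this article, and the results are upper bounds rather than the measured values behind Figure 10.

```python
# ASSUMPTION: each DDR3 channel moves 8 bytes (64 bits) per transfer.
BYTES_PER_TRANSFER = 8
CHANNELS_PER_PROCESSOR = 3


def peak_bandwidth_gb_per_s(memory_mhz: float, populated_channels: int) -> float:
    """Theoretical peak bandwidth of one socket in GB/s (1 GB = 1e9 bytes)."""
    if not 0 <= populated_channels <= CHANNELS_PER_PROCESSOR:
        raise ValueError("a Xeon 5500 socket has 3 memory channels")
    bytes_per_second = memory_mhz * 1e6 * BYTES_PER_TRANSFER * populated_channels
    return bytes_per_second / 1e9


full = peak_bandwidth_gb_per_s(1333, CHANNELS_PER_PROCESSOR)
for channels in (3, 2, 1):
    peak = peak_bandwidth_gb_per_s(1333, channels)
    print(f"{channels} channel(s): {peak:.1f} GB/s peak, {peak / full:.0%} of full")
```

The theoretical peak scales linearly with the number of populated channels, so leaving one channel empty forfeits a third of the ceiling before any workload effects are considered.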
2.5 Memory Population Across Processor Sockets
Because the Xeon 5500 series uses NUMA architecture, it is important to ensure that both memory controllers in the system are utilized, by providing both processors with memory. If only one processor is installed, only the associated DIMM slots can be used. Adding a second processor not only doubles the amount of memory available for use, but also doubles the number of memory controllers, thus doubling the system memory bandwidth. It is also optimal to populate memory for both processors in an identical fashion to provide a balanced system.
Using Figure 11 as an example, Processor 0 has DIMMs populated but no DIMMs are populated for Processor 1. In this case, Processor 0 will have access to low-latency local memory and high memory bandwidth. However, Processor 1 has access only to remote or "far" memory. So, threads executing on Processor 1 will have a long latency to access memory as compared to threads on Processor 0. This is due to the latency penalty incurred to traverse the QPI links to access the data on the remote memory controller. The latency to access remote memory is almost 75% higher than local memory access. The bandwidth to remote memory is also limited by the capability of the QPI links. So, the goal should be to always populate both processors with memory.
Table 9
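The cost of remote accesses can be sketched with a simple weighted average. The model below uses only the figure from the text (remote latency almost 75% higher than local) and assumes, purely for illustration, that average latency is a linear blend of local and remote accesses. It ignores QPI bandwidth limits and loaded-latency effects, and the function name is hypothetical.

```python
# From the text: remote access latency is almost 75% higher than local.
REMOTE_LATENCY_FACTOR = 1.75


def relative_latency(remote_fraction: float) -> float:
    """Average memory latency relative to all-local access (1.0 = local).

    remote_fraction is the share of memory accesses served by the other
    socket's memory controller, between 0.0 and 1.0.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) + remote_fraction * REMOTE_LATENCY_FACTOR


# Threads on Processor 1 when only Processor 0 has DIMMs: every access is remote.
print(f"all remote: {relative_latency(1.0):.2f}x local latency")
# A thread whose data is spread evenly across both sockets.
print(f"half remote: {relative_latency(0.5):.3f}x local latency")
# Both sockets populated and all data allocated locally.
print(f"all local: {relative_latency(0.0):.2f}x local latency")
```

Even this simplified model shows why populating both sockets matters: with memory on one socket only, every thread on the other socket pays the full remote penalty on every access.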
3.0 Best Practices (optimal configuration methods)
In this section, we recap the rules to be followed for optimal memory configuration on Xeon 5500 based platforms.
3.1 Maximum Performance
Follow these rules for peak performance:
– Always populate both processors with equal amounts of memory to ensure a balanced NUMA system. (Both CPUs use the same memory capacity.)
– Always populate all 3 memory channels on each processor with equal memory capacity. (All 3 memory channels of each CPU use the same memory capacity.)
– Ensure an even number of ranks are populated per channel. (Each channel holds an appropriate number of ranks.)
– Use dual-rank DIMMs whenever appropriate. (Use dual-rank memory where possible.)
– For optimal 1333MHz performance, populate 6 dual-rank DIMMs (3 per processor).
– For optimal 1066MHz performance, populate 12 dual-rank DIMMs (6 per processor).
– For optimal 800MHz performance with high DIMM counts:
  – On 12 DIMM platforms, populate 12 dual-rank or quad-rank DIMMs (6 per processor).
  – On 16 DIMM platforms, do one of the following:
    – Populate 12 dual-rank or quad-rank DIMMs (6 per processor).
    – Populate 14 dual-rank DIMMs of one size and 2 dual-rank DIMMs of double the size as described in the interleaving section.
With the above rules, it is not possible to have a performance-optimized system with 4GB, 8GB, 16GB, or 128GB. With 3 memory channels and the interleaving rules, customers need to configure systems with 6GB, 12GB, 18GB, 24GB, 48GB, 72GB, 96GB, etc., for optimized performance.
3.2 Other Considerations
3.2.1 Plugging Order
Take care to populate empty DIMM sockets in the specific order for each platform when adding DIMMs to Xeon 5500 series platforms. The DIMM socket farthest away from its associated processor, per memory channel, is always plugged first. Consult the documentation for your specific system for details.
3.2.2 Power Guidelines
This document is focused on maximum performance configuration for Xeon 5500 series processor-based systems. Here are a few power guidelines for consideration:
– Fewer, larger DIMMs (for example, 6 x 4GB DIMMs vs. 12 x 2GB DIMMs) will generally have lower power requirements.
– x8 DIMMs (x8 data width of rank, see section 2.3) will generally draw less power than equivalently sized x4 DIMMs.
– Consider BIOS configuration settings (see section 3.2.4).
3.2.3 Reliability
Here are two reliability guidelines for consideration:
– Using fewer, larger DIMMs (for example, 6 x 4GB DIMMs vs. 12 x 2GB DIMMs) is generally more reliable.
– Xeon 5500 series memory controllers support IBM Chipkill memory protection technology with x4 DIMMs (x4 data width of rank; see section 2.3), but not with x8 DIMMs.
3.2.4 BIOS Configuration Settings
There are a number of BIOS configuration settings on servers using the Xeon 5500 series processors that can also affect memory performance or benchmark results. For example, most platforms allow the option of decreasing the memory clock speed below the supported maximum. This may be useful for power savings but, obviously, decreases memory performance. Meanwhile, options like Hyper-Threading Technology (formerly known as Simultaneous Multi-Threading) and Turbo Boost Technology can also significantly affect benchmark results. Specific memory configuration settings important to performance include:
Table 10
Original authors: Ganesh Balakrishnan, IBM System x and BladeCenter Performance; Ralph M. Begun, IBM System x Development |