According to Facebook, DDR memory is the #2 source of failures in the data center. A Carnegie Mellon paper that studied DDR memory failures in Facebook’s data centers reported that FB swapped DIMMs out of ~1% of its servers every month. Given the number of servers Facebook has, this suggests they are swapping DIMMs every hour of every week of every month, all year long! So I asked around: what do they do with the DIMMs they swap out? The answer came back in a LinkedIn group response from someone who collects the ‘bad’ DIMMs. His response? They destroy them. With DRAM prices skyrocketing, and larger-capacity, more expensive DIMMs becoming the norm, when will data centers get a clue? IT MIGHT NOT BE THE DIMM. Even Google’s 2009 study concluded that memory failures sometimes ‘followed the system’.
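To put a rough number on the "every hour" claim, here is a back-of-the-envelope sketch. The ~1% monthly figure is the one reported above; the fleet size is a purely hypothetical placeholder (I don't have Facebook's real server count), and 730 hours is just an average month.

```python
HOURS_PER_MONTH = 730  # average month: 365 days * 24 hours / 12


def servers_serviced_per_hour(fleet_size: int, monthly_fraction: float = 0.01) -> float:
    """Average number of servers per hour that get a DIMM swapped.

    fleet_size is an assumption supplied by the caller; monthly_fraction
    defaults to the ~1% per month figure cited in the text.
    """
    return fleet_size * monthly_fraction / HOURS_PER_MONTH


# Hypothetical fleet of 100,000 servers (illustrative, not a Facebook number).
rate = servers_serviced_per_hour(100_000)
print(f"{rate:.2f} servers per hour")
```

Plug in your own fleet size. At a 1% monthly swap rate, any fleet larger than roughly 73,000 servers averages more than one server serviced per hour, around the clock.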
So who is to blame?
Well, I don’t want to point any fingers, but we have seen some pretty cheap motherboards (missing ground planes, bad connectors, routing issues), BIOS updates that program the memory controller incorrectly, and BIOS bugs that misinterpret the DIMM (or SODIMM) SPD (a small EEPROM holding the module’s timing and characteristics). See my article on ROUNDING. Oh, and occasionally a memory controller bug or two. Not to mention poor overall ground/power design that radiates noise from one memory channel to another.
So what’s in your Data Center?
Today’s data centers are driving server complexity up, but the market is driving server prices down. Suppliers are squeezed on margins, and test and design validation get neglected. Memory errors don’t scale well, so if you don’t want to be throwing good DIMMs away, you might want to ensure that your servers are well designed by performing a Memory Channel Audit. This review checks that the electrical design of the server delivers good signal integrity to and from the DIMMs on all signals, that the DIMM SPD is compliant with the JEDEC specification, and that the BIOS programming is consistent with the SPD. A quick JEDEC Protocol Violation check will flag any memory controller issues, and a Performance Audit will tell you if you’re leaving any clock cycles on the table (meaning you could make your existing hardware run faster).
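As a minimal sketch of the "BIOS programming is consistent with the SPD" part of such an audit: the SPD gives minimum timings in time units, the BIOS programs whole clock cycles, and the programmed cycles must cover the minimum at the running clock period. The function names, the dictionary layout, and the sample numbers below are all hypothetical and not taken from any real DIMM or BIOS. The sketch also uses a plain ceiling, while JEDEC defines its own rounding rules that are not reproduced here.

```python
def min_cycles(t_min_ps: int, tck_ps: int) -> int:
    """Smallest whole number of clocks that covers t_min_ps (plain ceiling)."""
    return -(-t_min_ps // tck_ps)


def audit_timings(spd_min_ps: dict, bios_cycles: dict, tck_ps: int) -> dict:
    """Compare BIOS-programmed cycle counts against SPD minimum timings.

    Returns one verdict per timing parameter:
      "violation" - fewer cycles than the SPD minimum requires
      "slack"     - more cycles than required (possible lost performance)
      "ok"        - exactly the minimum
    """
    findings = {}
    for name, t_min_ps in spd_min_ps.items():
        required = min_cycles(t_min_ps, tck_ps)
        programmed = bios_cycles[name]
        if programmed < required:
            findings[name] = "violation"
        elif programmed > required:
            findings[name] = "slack"
        else:
            findings[name] = "ok"
    return findings


# Illustrative values only: 833 ps clock period, 13.75 ns minimum timings.
spd = {"tAA": 13750, "tRCD": 13750, "tRP": 13750}
bios = {"tAA": 16, "tRCD": 17, "tRP": 19}
print(audit_timings(spd, bios, tck_ps=833))
```

In this made-up case the tAA setting is too aggressive (16 clocks cover only 13,328 ps), tRCD is exactly right, and tRP is two clocks slower than it needs to be, which is the kind of "cycles on the table" a performance audit looks for.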