Data Center | FuturePlus Systems - Part 2

DDR4 3DS DIMMs: The next big thing in the Data Center

To give DDR4 a mid-life kicker, memory vendors are upping their game and producing 3DS DDR4 DIMMs. What is 3DS, you ask? It’s 3-Dimensional Stacking of die in a single package. It is not to be confused with ‘twin die’, which is just 2 die next to each other and not stacked. 3DS uses TSVs (through-silicon vias) to make the connection between the dies. 3DS is a game changer when it comes to density. Capacities of 128GB, 256GB and possibly 512GB on a single DIMM are enabled by this technology. RDIMMs or LRDIMMs can implement 3DS and have up to 4 ranks. The 3DS protocol works by introducing the concept of logical ranks in addition to physical ranks.

The screen shot below from the DDR Detective shows what the traffic on a 3DS DDR4 memory bus looks like.

Waveform showing interleaved traffic between the different physical and logical ranks on a single DDR4 3DS DIMM.

The 3DS protocol is also different, as timing parameters between the physical ranks and the logical ranks have to be controlled. FuturePlus Systems, which took the lead role in JEP 175 DDR4 Protocol Checks, has also created the 3DS protocol checks found in the 3DS option of its FS2800 DDR Detective product.

DDR Detective 3DS specific violations.

These checks run continuously, never missing a clock edge, and can run for days to make sure that no protocol errors occur that could lead to data corruption.

What’s in your Server? Well, if it’s 3DS you will want to make sure you’re getting your money’s worth, as these DIMMs can be $4000 or more...

Data Centers: Don’t Throw that DIMM Away!

According to Facebook, DDR Memory is the #2 cause of failures in the Data Center. A Carnegie Mellon paper that studied DDR Memory failures in Facebook’s data center reported that Facebook swapped DIMMs out of ~1% of its servers on a monthly basis. Given the number of servers that Facebook has... this suggests they are swapping DIMMs every hour of every week of every month, all year long! So I asked around... what do they do with the DIMMs they swap out? The answer came back in a LinkedIn group response from someone who collects the ‘bad’ DIMMs. His response? They destroy them. With DRAM prices skyrocketing, and larger capacity and more expensive DIMMs becoming the norm, when will Data Centers get a clue? ... IT MIGHT NOT BE THE DIMM. Even Google’s 2009 study concluded that memory failures sometimes ‘followed the system’.

So who is to blame? Well, I don’t want to point any fingers, but... we have seen some pretty cheap motherboards (missing ground planes, bad connectors, routing issues), BIOS updates that program the memory controller incorrectly, and BIOS bugs that incorrectly interpret the DIMM (or SODIMM) SPD (a small EEPROM holding timing and other characteristics). See my article on ROUNDING. Oh... and occasionally a Memory Controller bug or two. Not to mention poor overall ground/power design that radiates noise from one memory channel to the other.

So what’s in your Data Center? Today’s Data Centers are driving Server complexity up, but the market is driving Server price down. Thus suppliers are squeezed for margins, and test/design validation gets neglected. Memory errors don’t scale well, so if you don’t want to be throwing good DIMMs away...

Speed up your Server’s Memory Performance by Understanding Rounding!

Those of you who read my previous blog post on the new DDR4 Revision B spec know that DDR4 has a much better defined Rounding Algorithm. So why is this important?

The Early Days

Timing parameters in JEDEC DDR specs are sometimes listed in nanoseconds, microseconds or milliseconds. When testing or simulating logic that runs on the DDR bus, one often has to convert a time listed in ns, us or ms into clock cycles, and a non-integer number results. Since clock cycles come only in whole numbers, we have to either round up or round down. As a carryover from spec to spec over the years, there were various notes in the spec that referred to a ‘simple round up’. As the specifications evolved, more timing parameters were added, and the notes on how to handle all these new parameters became sparser and sparser.

With the advent of DIMMs and SODIMMs, there needed to be a method for associating a specific value, or an allowable range of values, with a particular DIMM or SODIMM, and a method for listing the optional features implemented on that particular DIMM/SODIMM. Thus the SPD was created. SPD stands for Serial Presence Detect. This EEPROM-like part is on every DIMM and SODIMM and is read by the BIOS in order to properly configure the DIMM/SODIMM for the system it resides in.

So what does the SPD have to do with Rounding? Well, the folks who work on that specification quickly realized the issue and the need for a specified rounding method. When the BIOS...
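To make the conversion concrete, here is a minimal Python sketch that turns a timing parameter into clock cycles two ways: the ‘simple round up’ mentioned above, and a round up that tolerates a small guardband. The function names, the 2.5% guardband default and the timing values are assumptions chosen for illustration, not figures quoted from the JEDEC spec, so check the specification for the exact algorithm before relying on it.

```python
def cycles_simple_round_up(param_ps: int, tck_ps: int) -> int:
    """'Simple round up': any fraction of a clock costs a whole extra clock."""
    # Ceiling division done in integer picoseconds to avoid float error.
    return -(-param_ps // tck_ps)


def cycles_with_guardband(param_ps: int, tck_ps: int,
                          guardband_permille: int = 25) -> int:
    """Round up, unless the overshoot past a whole clock is within the guardband.

    guardband_permille is a hypothetical knob: 25 means an overshoot of up to
    2.5% of a clock is treated as rounding noise and does not add a cycle.
    """
    # Ratio scaled by 1000 and truncated, so all math stays in integers.
    scaled = param_ps * 1000 // tck_ps
    return (scaled + 999 - guardband_permille) // 1000


if __name__ == "__main__":
    # Illustrative values only: a 15.000 ns parameter and a clock period
    # that was itself rounded to 1071 ps.
    print(cycles_simple_round_up(15000, 1071))  # 15
    print(cycles_with_guardband(15000, 1071))   # 14
```

In this made-up case the true ratio is about 14.006 clocks. A simple round up charges 15 clocks, while the guardbanded version settles on 14. One clock may not sound like much, but it is paid on every access that uses that parameter, which is why the rounding method matters for performance.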

Data Center Down Time: DRAM Row Hammer Failures in the field

DDR3 memory is at the heart of almost all cloud computing servers today. A recently publicized failure mechanism in DDR3 memory, coined Row Hammer, has been shown to be not only a reliability issue but also a security risk for servers, laptops, desktops and embedded systems around the world. In short, excessive accesses to a single Row in memory can cause bit flips in adjacent locations, causing system crashes, corrupted data and even security exploits. Several research papers have been published, and more work is being done to help the industry understand this problem. No industry standards group, government agency or trade association has signed up to address this issue. Data Centers and end users are on their own.

Computer architecture relies on three basic building blocks: the CPU, or central processing unit; the I/O, or Input and Output; and the Memory. When it comes to the memory, the dominant technology is DRAM, or Dynamic Random Access Memory. Today’s most prevalent version of memory is called DDR3, which stands for the 3rd generation of Double Data Rate Memory. In the quest to make memories smaller and faster, memory vendors have had to use very small physical geometries. These small geometries put memory cells very close together, and as such one memory cell’s charge can leak into an adjacent one, causing a bit flip. It has come to the attention of the industry that this is indeed happening under certain conditions. Very simply, the problem occurs when the memory controller, under command of the software, issues an ACTIVATE command to a single row address repetitively. If the physically adjacent rows have...
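As a rough illustration of the access pattern described above, the Python sketch below counts ACTIVATE commands per row within a fixed time window and flags rows that exceed a threshold. Everything here is hypothetical: the trace format, the function names, the window length and the threshold are assumptions for the example, not values from any DRAM datasheet or from the DDR Detective. It also assumes that row addresses n-1 and n+1 are the physical neighbors of row n, which real devices do not guarantee.

```python
from collections import Counter


def hammered_rows(activates, window_ns, threshold):
    """Return sorted (bank, row) pairs activated more than `threshold`
    times inside any one fixed window of `window_ns` nanoseconds.

    `activates` is an iterable of (timestamp_ns, bank, row) tuples, one per
    ACTIVATE command (a hypothetical trace format).
    """
    counts = Counter()
    for ts_ns, bank, row in activates:
        # Bucket each command by the window its timestamp falls into.
        counts[(ts_ns // window_ns, bank, row)] += 1
    return sorted({(bank, row)
                   for (_, bank, row), n in counts.items()
                   if n > threshold})


def potential_victims(bank, row):
    """Rows assumed adjacent to the hammered row (logical == physical)."""
    return [(bank, r) for r in (row - 1, row + 1) if r >= 0]


if __name__ == "__main__":
    # Made-up trace: row 5 of bank 0 is activated 10 times, row 9 once.
    trace = [(i * 100, 0, 5) for i in range(10)] + [(50, 0, 9)]
    for bank, row in hammered_rows(trace, window_ns=64_000_000, threshold=5):
        print(bank, row, potential_victims(bank, row))
```

Fixed windows keep the sketch short, but a burst that straddles a window boundary gets split across two buckets and may be missed. A sliding window would catch that, at the cost of more bookkeeping.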
Request More Information/Quote or Call: (603) 472-5905