Data Center | FuturePlus Systems

The Threat That Just Keeps Getting Bigger: DRAM Row Hammer

One of the more recent papers from Carnegie Mellon University and ETH Zurich tries to forewarn the tech industry about the prevalence of DRAM Row Hammer failures and lays out possible strategies to combat them. The testing they performed on DRAM devices (DDR3, DDR4, and LPDDR4) illustrates the magnitude of this error, which appears to get worse with newer technologies. This means that Row Hammer is a problem that needs to be addressed sooner rather than later. Row Hammer is an electrical interference glitch: repeatedly activating one row disturbs cells in nearby rows and causes what are called “bit flips”. A bit flip turns a “zero” into a “one” or vice versa.

This characterization study also explores the organization of DRAM technology in order to thoroughly explain the Row Hammer error and give suggestions for moving forward. The easiest suggestion is to refresh the memory more frequently, which restores the charge in the victim row and thus makes it less susceptible to bit flips. But this suggestion takes a toll on power consumption and performance, both of which are very important to data center operators. The other suggestion is to reconsider the design of DRAM at the component level, which DRAM manufacturers are unwilling to do.

This in-depth exploration helps readers understand the Row Hammer error a little better. It offers insight into the prevalence of the error in older technologies versus newer technologies. It also offers suggestions on how to mitigate the error. There are big risks in newer technologies being more vulnerable to attacks such as Row Hammer. Hopefully as more research dives deep into this topic,...

Who is to blame for DDR Memory ECC errors?

Is it the DIMM or the System? For DDR4 DIMMs and SODIMMs that support ECC, the ECC (Error Correcting Code) is calculated by the Memory Controller on each write. It amounts to one check bit per data byte (8 check bits for each 64 bits of data) and is stored in a different device from the data it is protecting. However, there is no checking of the write data once it reaches the DRAM. The ECC is really only used to protect the data on the read. Once the data is read back, the Memory Controller checks the ECC and, if it is incorrect, tries to do some kind of recovery. That recovery is system dependent and not specified by the JEDEC spec. In fact, the ECC calculations and algorithms are also not specified, and many system vendors do not release their ECC algorithms. If it is a single bit error, the Memory Controller will do the correction and write the corrected data back to the DRAM. Single bit errors are also called ‘Soft Errors’. If it detects a double bit error it cannot do any correction, as the ECC algorithm is mathematically limited: it can detect and correct a single bit error, but it can only detect a double bit error. You may have seen the acronym SECDED, and this is where it comes from: Single Error Correction, Double Error Detection. Double bit errors are sometimes referred to as ‘Hard Errors’ and they will usually cause a machine check and a system crash. System log files should show all of the soft errors and the address that each error occurred on. In addition, it should indicate...
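To make the SECDED limits described above concrete, here is a minimal sketch using a textbook extended Hamming code over 4 data bits. This is an assumption chosen purely for illustration: as noted, real memory controllers use vendor-specific, often unpublished algorithms over wider words, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    overall = sum(code) % 2             # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # single error: syndrome names the bad bit
        status = "corrected"
    else:
        return "uncorrectable", None    # even number of flips: detect only
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                        # one flipped bit is repaired
    print(decode(word))
    word[2] ^= 1                        # a second flip can only be reported
    print(decode(word))
```

With one flipped bit the decoder recovers the original data; with two it can only report that the word is bad, which is the point at which a real system would typically raise a machine check.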

What do you mean there is NO Validation Report?

In our Services department we see all sorts of systems: network switches, routers, medical devices, etc. They all share a common theme: the DDR Memory does not work right. The engineers sending us these problem systems are frustrated, and we often hear ‘we started getting failures in the field after having it work for years’ or ‘the applications now can’t tolerate any failures’. We even get the occasional ‘this memory stick fails but this one does not, can you tell us why?’. As we go through our Memory Channel Audit we often ask the customers ‘Where is the Validation Report for this system?’ The customers almost always say ‘we have no idea!’. Call me old-fashioned, but I recall working for a large enterprise vendor (DEC) where you had to thoroughly test and validate a system and produce a report that proved, at the very least, that you tested it and looked at the Signal Integrity. Given that our society is addicted to the internet, high speed communications, phones, laptops, air travel and online everything, you would think that the platforms and systems that run all of these applications and make all of these critical calculations would at least have some kind of Validation Report. But they don’t, and their customers are buying literally millions of them, and the general public has become overly reliant on them. The engineers who deploy these systems and are responsible for them in the field should not buy them unless the suppliers PROVE they are good. Given that we are so addicted to the online world we have created we should pressure...

Row Hammer. The problem no DRAM vendor wants to talk about, except for one, Zentel

Zentel has published data showing zero Row Hammer failures for its new DDR3 DRAM. I fondly recall talking to a large vendor’s ‘tiger team’ about Row Hammer failures a few years back. I asked them what their DDR3 users should do if they started to experience Row Hammer failures. Their response? ‘Upgrade to DDR4!’ ‘How convenient’, I responded, ‘forcing the industry to throw away all those DDR3 based systems so you can sell more DDR4’. And as it turned out, DDR4, although a bit better, still experienced Row Hammer failures! We here at FuturePlus Systems are glad to see our colleagues in academia are still hunting down those Row Hammer vulnerabilities (https://rambleed.com/). They can feel good about causing the industry to look for solutions, and it appears that Zentel has answered the call. To the best of our knowledge Zentel has the ONLY Row Hammer hardened DDR3 memory on the market. See the Zentel data sheets here. #rowhammer, #DDR3, #JEDEC, #AP Memory,...

Want to Ca$h in on Bitcoin, BlockChain and Cryptocurrency? Speed up your DDR Memory Accesses

Attention all Bitcoin and Ethereum Miners, Blockchain Fans and Distributed Ledger Technology experts. Do you REALLY understand the computing limits of your hardware? These applications are among the most compute intensive applications today. As with most compute intensive applications, DDR Memory is involved. There is some confusion over memory bandwidth versus memory latency. Latency is the time to first access. See below examples of a memory subsystem that, for some parameters, is performing well short of the minimum latency allowed by the JEDEC DDR4 specification. Identifying these bottlenecks can dramatically reduce your memory access time and thus speed up your mining application. Tuning your system for minimum latencies can add $$ to your crypto wallets.

Figure 1: DDR4 Memory Latencies measured on every clock cycle continuously. Measurement made by the DDR Detective from FuturePlus Systems.

Bandwidth, on the other hand, is the amount of data that can be transferred over a certain time. This is the megabytes per second metric. See below. This metric is important as it determines the amount of data bandwidth that can be sustained over a longer period of time. If your latency can be improved, this number will also improve.

Figure 2: DDR4 Memory Bandwidth measured on every clock cycle continuously on a per bank, per rank basis. Measurement made by the DDR Detective from FuturePlus Systems.

If your mining hardware is using the latest DDR4 Memory there is another metric (new relative to DDR3) that needs to be considered. That is Bank Group tuning. In DDR4, back to back transactions to the same Bank Group result in a performance penalty. Back to back accesses to different Bank Groups is...
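As a rough back-of-the-envelope sketch of the bandwidth metric and the Bank Group penalty discussed above, the arithmetic might look like the following. Everything here is an illustrative assumption rather than a measurement: the function names are hypothetical, and the data rate, bus width, and column-to-column spacings (4 clocks across Bank Groups, 6 clocks within one) are example values that should be replaced with the figures for your own DRAM speed bin.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak: transfers per second times bytes moved per transfer."""
    return mega_transfers_per_s * bus_width_bits // 8


def sustained_fraction(burst_clocks, column_spacing_clocks):
    """Fraction of peak reached by a stream of back to back bursts.

    burst_clocks: clocks the data bus is busy per burst (4 for a burst of 8
    on a double data rate bus).
    column_spacing_clocks: minimum clocks between column commands.
    """
    return burst_clocks / max(burst_clocks, column_spacing_clocks)


if __name__ == "__main__":
    peak = peak_bandwidth_mb_s(2400)           # assumed DDR4-2400, 64-bit channel
    different_groups = sustained_fraction(4, 4)  # assumed spacing across Bank Groups
    same_group = sustained_fraction(4, 6)        # assumed spacing within a Bank Group
    print(f"peak: {peak} MB/s")
    print(f"different Bank Groups: {peak * different_groups:.0f} MB/s")
    print(f"same Bank Group: {peak * same_group:.0f} MB/s")
```

Under these assumed numbers, a stream that keeps hitting the same Bank Group can sustain only about two thirds of the peak, which is why access patterns that alternate between Bank Groups are worth checking for.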

What is DDR4 Memory Gear-Down Mode?

A Reliability, Availability and Serviceability (aka RAS) feature more clearly documented in the new JEDEC DDR4 Rev B spec, Gear-down mode allows the DRAM Address/Command and Control bus to use every other rising edge of the DDR4 Memory bus clock. The Memory Controller indicates that it wants the DRAM to operate in Gear-down mode by setting bit 3 in Mode Register 3 at boot time. The system then follows this operation with a sync pulse, which is a single clock assertion of Chip Select. The DRAM notes that sync pulse assertion and syncs to that rising clock edge. It then uses every other rising edge of the clock after that. So even though the memory controller clock frequency has not changed, the DRAM only uses every other rising edge. Since the data uses both edges of the clock and the DRAM Address/Command and Control now uses every other rising edge of the clock, they refer to it as ¼ rate or 2N. Normally the Address/Command/Control uses every rising edge of the clock. This is called ½ rate. The screen shot below shows what the bus actually looks like from the memory controller’s point of view in Gear-down mode.

Waveform as seen on the FS2800 DDR Detective.

To reflect what the DRAM is actually using, the test equipment needs to be able to adjust to Gear-down mode and show what the DRAM is actually seeing on the DDR4 memory bus.

State Listing as seen on the FS2800 DDR Detective: what the DRAM sees for DDR4 bus operations while in Gear-down mode.

Some little nuances come to light when a...
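The sequence described above (set bit 3 of Mode Register 3, issue a sync pulse, then sample every other rising edge) can be modeled with a small sketch. The helper names `enable_geardown` and `sampled_edges` are hypothetical, and rising clock edges are simply numbered from zero; this only illustrates the bookkeeping, not how any particular controller or analyzer implements it.

```python
GEARDOWN_BIT = 3  # bit position in Mode Register 3, as stated in the text


def enable_geardown(mr3_value):
    """Return the MR3 value with the Gear-down bit set, other bits untouched."""
    return mr3_value | (1 << GEARDOWN_BIT)


def sampled_edges(sync_edge, total_edges, geardown):
    """List the rising edges on which Address/Command/Control is sampled.

    sync_edge: index of the rising edge where the sync pulse was seen.
    total_edges: number of rising edges in the window being modeled.
    geardown: True for 1/4 rate (every other rising edge), False for 1/2 rate.
    """
    step = 2 if geardown else 1
    return list(range(sync_edge, total_edges, step))


if __name__ == "__main__":
    print(bin(enable_geardown(0b0000)))
    print(sampled_edges(sync_edge=5, total_edges=12, geardown=True))
    print(sampled_edges(sync_edge=5, total_edges=12, geardown=False))
```

The second list contains every rising edge from the sync point onward, while the first skips alternate edges. That difference is what test equipment has to account for in order to show the bus as the DRAM sees it.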
Request More Information/Quote or Call: (603) 472-5905