
Flash Forward: The Future of Memory

What is flash memory, why is it important, why is it so darn expensive, and when can buyers expect some relief? When talking about flash, chances are that one of these questions is relevant to the discussion. Some of these questions have straightforward answers, others are far more complex. This primer will examine the state of flash as we move into 2019.

Flash is a computer data storage technology that’s faster than the preceding mass market permanent storage technology, the hard disk drive. Hard disks – affectionately, but somewhat inaccurately known as “spinning rust” – rotate a metal disk at high velocity underneath a magnetic write head. The write head aligns particles on the rotating platter in one direction or the other, allowing the read head in the drive to discern 1s and 0s.

Unfortunately, hard drives can only spin their platters so fast, and they become considerably more complex with each read or write head one adds. In other words, spinning rust is only ever going to be so fast, and we crossed the 80/20 point of diminishing returns on that quite some time ago. The average hard drive can be thought of as providing about 100 I/O operations per second (IOPS), with the really sexy ones maybe getting to 500 IOPS at a 70/30 read/write workload split.

A single modern flash drive can push between 150,000 IOPS (100% write) and 250,000 IOPS (70/30 workload split). This makes the average flash drive somewhere between 500 and 1,500 times faster than the average spinning rust drive. On paper, at least. The details do matter.
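For readers who want to see where the 500x and 1,500x figures come from, here is a minimal Python sketch of the arithmetic. The IOPS numbers are the ones quoted above; which flash figure is paired with which hard drive figure is an assumption on my part, chosen because it reproduces the stated range.

```python
# IOPS figures quoted in the text above.
HDD_IOPS_TYPICAL = 100      # average hard drive
HDD_IOPS_FAST = 500         # high-end hard drive, 70/30 read/write split
SSD_IOPS_WRITE = 150_000    # flash drive, 100% write
SSD_IOPS_MIXED = 250_000    # flash drive, 70/30 read/write split


def speedup(ssd_iops: int, hdd_iops: int) -> float:
    """Return how many times more IOPS the SSD delivers than the hard drive."""
    return ssd_iops / hdd_iops


# Assumed pairing for the low end: mixed-workload flash vs. a fast hard drive.
low = speedup(SSD_IOPS_MIXED, HDD_IOPS_FAST)
# Assumed pairing for the high end: pure-write flash vs. an average hard drive.
high = speedup(SSD_IOPS_WRITE, HDD_IOPS_TYPICAL)

print(f"{low:.0f}x to {high:.0f}x")  # 500x to 1500x
```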

Physical Flash

Flash drives are, for the most part, made up of NAND chips. NAND flash is page addressable, which means that one cannot address individual bits of storage directly. To read or write a single bit from NAND flash, one must read or write an entire page.

NAND page sizes vary. Some NAND pages are as small as 2KiB, while 16KiB pages are increasingly the norm. Reading a single bit of data from a flash drive with 16KiB pages requires reading 131072 bits of data to find the one you want.

Blocks contain multiple pages, with a block typically containing between 128 and 256 pages (Figure 1). Due to the physics of how flash drives work, erasing data from a flash drive requires erasing an entire block. This means that on SSDs with large block and page sizes – which are increasingly common in the “bulk storage” SSDs with low write lifetimes – erasing a single bit of data could require erasing up to 4MiB – or 33554432 bits – of data.
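The page and block figures above are simple unit conversions, and a short Python sketch makes them easy to check. The function names are my own, purely for illustration; the sizes are the ones given in the text.

```python
KIB = 1024          # bytes in one KiB
MIB = 1024 * KIB    # bytes in one MiB


def page_bits(page_kib: int) -> int:
    """Bits that must be read or written to touch one page."""
    return page_kib * KIB * 8


def block_bytes(page_kib: int, pages_per_block: int) -> int:
    """Bytes that must be erased to erase one block."""
    return page_kib * KIB * pages_per_block


print(page_bits(16))                 # 131072 bits per 16KiB page
print(block_bytes(16, 256) // MIB)   # 4 (MiB per block of 256 pages)
print(block_bytes(16, 256) * 8)      # 33554432 bits per block
```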

Like reads, writes occur at the page level. If one wanted to write a single bit to a page in an empty block, one would only have to write a single page in order to write that bit. On the other hand, if any page in the destination block has data, then the write process requires that one read the entire block, erase that block, and then re-write the data.
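The write rules just described can be sketched as a toy model in Python. This follows the simplified description above and is not how any particular controller's firmware works; the class name, method names, and the four-page block are all hypothetical.

```python
from typing import List, Optional


class FlashBlock:
    """Toy model of one NAND block: pages are written whole, erased only as a block."""

    def __init__(self, pages_per_block: int = 4) -> None:
        self.pages: List[Optional[bytes]] = [None] * pages_per_block
        self.erase_count = 0   # how many times the whole block was erased
        self.page_writes = 0   # how many page-sized writes were performed

    def is_empty(self) -> bool:
        return all(page is None for page in self.pages)

    def erase(self) -> None:
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def write_page(self, index: int, data: bytes) -> None:
        if self.is_empty():
            # Empty block: writing one page is all that is needed.
            self.pages[index] = data
            self.page_writes += 1
            return
        # Block already holds data: read it all, erase, then re-write.
        snapshot = list(self.pages)
        snapshot[index] = data
        self.erase()
        for i, page in enumerate(snapshot):
            if page is not None:
                self.pages[i] = page
                self.page_writes += 1


block = FlashBlock(pages_per_block=4)
block.write_page(0, b"a")   # empty block: one page write, no erase
block.write_page(1, b"b")   # occupied block: one erase, two pages re-written
print(block.erase_count, block.page_writes)  # 1 3
```

Two logical page writes cost three physical page writes and one erase in this model, which is the overhead the rest of this section is concerned with.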

Figure 1. Anatomy of a flash drive.

Every erasure reduces the “write life” of the drive: individual flash blocks can be erased about 1,000 times. To compensate for this, SSD manufacturers have incorporated sophisticated wear levelling into their drives. Wear levelling helps to ensure erase and write operations only occur when there is an entire block to manipulate. For this reason – and we’ll touch on this more later – SSDs work best with large queue depths: they like to have lots of writes to work with so that they can erase or write only in whole-block chunks.
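To make the link between queue depth and whole-block writes concrete, here is a minimal Python sketch of a write buffer that only flushes when a full block's worth of pages has queued up. It illustrates the batching idea only; the class, its fields, and the 128-page block size (the low end of the range quoted earlier) are assumptions, not a description of real drive firmware.

```python
from typing import List

PAGES_PER_BLOCK = 128  # low end of the 128 to 256 range quoted earlier


class BlockBuffer:
    """Hypothetical write buffer that flushes only whole blocks of pages."""

    def __init__(self, pages_per_block: int = PAGES_PER_BLOCK) -> None:
        self.pages_per_block = pages_per_block
        self.pending: List[bytes] = []   # pages queued but not yet written
        self.blocks_written = 0          # whole-block writes performed so far

    def submit(self, page: bytes) -> None:
        """Queue one page and flush whenever a full block is available."""
        self.pending.append(page)
        while len(self.pending) >= self.pages_per_block:
            del self.pending[: self.pages_per_block]
            self.blocks_written += 1


write_buffer = BlockBuffer()
for _ in range(300):
    write_buffer.submit(b"\x00")  # placeholder page payload

print(write_buffer.blocks_written, len(write_buffer.pending))  # 2 44
```

With only a handful of writes in flight, nothing in this model ever reaches a full block, which is one way to picture why shallow queues leave flash underused.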

Other Storage Types

There are other approaches to memory beyond the type of NAND used in flash. Main system memory – also known as Random Access Memory (RAM) – is not NAND-based; it uses a different type of memory cell that is both bit-addressable and bit-alterable. This makes it much faster than flash. Unfortunately, the design of RAM also means that data cannot be permanently stored in RAM, and once the power is off, the data disappears. The flip side of this is that RAM doesn’t wear out in the same way flash does.

In addition to RAM there is 3D XPoint. Details are scarce on the physics underlying 3D XPoint, but we now know what the drives themselves have to offer. 3D XPoint is both byte-addressable and byte-alterable. This means reading or writing a bit only requires reading or writing 8 bits. Not quite RAM, but a significant improvement over flash. Most importantly, this change in addressability size means that one doesn’t have to keep the I/O path flooded with requests to achieve maximum performance: 3D XPoint-based drives (such as Intel’s Optane) work just fine at low queue depths.

In addition, 3D XPoint drives claim to be able to have 1000x the write life of NAND flash. This claim should be taken with a grain of salt. It is highly unlikely that an individual bit of 3D XPoint storage can be written to 1000x more than an individual bit in a NAND flash drive. That said, it is entirely believable that under normal operating circumstances, one can push 1000x as much write data through a 3D XPoint drive, when compared to a NAND flash drive.

If 3D XPoint proves to have better write life, this is likely because the change in addressability size means that 3D XPoint is better suited to how real applications store data: as a series of tiny little writes, not in great big strips. When it comes to Solid State Drives (SSDs), write patterns matter as much as the physics of the storage in use.
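The effect of addressability on tiny writes can be sketched with a few lines of Python. This is a back-of-the-envelope model under stated assumptions: it uses the 16KiB page and one-byte unit discussed above, picks a hypothetical 64-byte application write, and ignores erases, caching, and anything else a real controller does.

```python
import math


def bytes_touched(write_bytes: int, unit_bytes: int) -> int:
    """Bytes the device must write when it can only write whole units."""
    return math.ceil(write_bytes / unit_bytes) * unit_bytes


def amplification(write_bytes: int, unit_bytes: int) -> float:
    """Ratio of bytes physically written to bytes the application asked for."""
    return bytes_touched(write_bytes, unit_bytes) / write_bytes


NAND_PAGE_BYTES = 16 * 1024   # page-addressable NAND with 16KiB pages
XPOINT_UNIT_BYTES = 1         # byte-addressable, per the text
SMALL_WRITE_BYTES = 64        # hypothetical tiny application write

print(amplification(SMALL_WRITE_BYTES, NAND_PAGE_BYTES))    # 256.0
print(amplification(SMALL_WRITE_BYTES, XPOINT_UNIT_BYTES))  # 1.0
```

In this simplified model the same 64-byte write costs 256 times as much physical writing on page-addressable storage, which is the kind of gap that could plausibly account for a large difference in usable write life.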

Ecosystem-Wide Considerations

To put all the above discussion about storage technologies in perspective, the rule of thumb for sizing a mixed workload hyper-converged infrastructure (HCI) cluster is 1 gigabit of network capacity for every 1,000 write IOPS you want to sustain. In HCI clusters, writes need to be sent over the network, so a really rudimentary way of looking at drive performance is that it will take about 10 spinning rust drives working together under your average mixed virtualized workloads to use up the network capacity of a modern storage system.

To compare, a single SSD can flatten a 10GbE link, which is why you see so many hybrid HCI appliances out there with 2x 10GbE ports (two for redundancy), and only 2x SSDs (again for redundancy) per node. All-flash HCI nodes – the well-architected ones, at least – tend to have multiple 40GbE or 100GbE network links on them. You don’t see many 10GbE ports on HCI clusters whose nodes are only pushing rust.
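The sizing rule of thumb above is easy to turn into a quick calculation. The Python sketch below applies it to the drive figures quoted earlier in this article; the function name is my own, and the rule itself is only the rough guide described above, not a formal sizing method.

```python
WRITE_IOPS_PER_GBIT = 1_000  # rule of thumb: 1 gigabit per 1,000 write IOPS


def network_gbits_needed(write_iops: float) -> float:
    """Network capacity, in gigabits per second, suggested by the rule of thumb."""
    return write_iops / WRITE_IOPS_PER_GBIT


# Ten average hard drives at about 100 IOPS each.
hdd_group_gbits = network_gbits_needed(10 * 100)
# One flash drive at the 150,000 IOPS (100% write) figure quoted earlier.
ssd_gbits = network_gbits_needed(150_000)

print(hdd_group_gbits)  # 1.0
print(ssd_gbits)        # 150.0
print(ssd_gbits / 10)   # 15.0 (equivalent number of 10GbE links)
```

By this rough measure a single flash drive running flat out could generate far more write traffic than one 10GbE link can carry, which is consistent with the port counts described above.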

With IT teams today dealing with an ever-increasing number of clouds, containers, and VMs, it’s easy to understand that over the past several years demand for flash drives has consistently outpaced supply. This is the reason traditional hard disks haven’t gone away: they may be 500+ times slower than SSDs, but they store 10x as much data per dollar. Unfortunately, this is where things get really complicated.

Balance

NAND flash, RAM, and whatever it is 3D XPoint is made out of are all made in semiconductor fabrication facilities. Not only are these three technologies all made in semiconductor fabs, but the fabs themselves are more-or-less interchangeable. It doesn’t take much (in semiconductor fab terms) to convert a facility between traditional flash, RAM, and 3D XPoint.

This means that semiconductor companies are constantly balancing which of these technologies to produce in a given fab. Fabs cost anywhere from $1 billion to $14 billion each, and tend to have useful lifespans measured in single-digit years before a significant upgrade is required to replace the fabrication technology with newer equipment that can print more semiconductors on the same surface area.

Older fabs with older technology can have their useful lives extended by shifting production to meet different demand, but this has limits. It isn’t as simple to change, for example, a CPU-producing fab into a NAND flash fab as it is to convert a RAM fab into a NAND flash fab.

CPU fabs can also have longer lifespans, as there’s a rich contract manufacturing market for production of custom CPUs using older lithography processes. This isn’t quite true for SSD/RAM fabs. People who buy storage tend to be most interested in getting as much storage as possible for the lowest possible price, and that comes from constantly shrinking the lithography. Older, larger processes aren’t as valuable in the storage space.

If the above is straightforward to understand, everything else about flash is less so, in large part because it involves attempting to predict human behaviour from an imperfect set of metrics.

The Flash Market in 2019

Shortly after the non-volatile memory express (NVMe) standard was developed, demand for flash exploded. There are a number of reasons why, but one of the big ones is that the NVMe standard increased the size and number of I/O queues that can be made available to SSDs.

This meant that traditional (read: comparatively inexpensive) NAND flash SSDs could be pushed to deliver IOPS, throughput, and latency closer to what the physics of their construction says that they should be capable of. In other words, SSDs were no longer crippled by the legacy SAS and SATA storage interfaces that held them back: storage interfaces designed in an era where spinning rust was dominant, and flash SSDs largely a pipe dream.

The NVMe standard meant that server manufacturers could start cramming large numbers of inexpensive flash drives into a single server chassis. While it’s functionally impossible to squeeze the maximum hypothetical IOPS out of such a system, that was never the point of these types of systems. The point was to put large numbers of high-capacity flash drives into them, write relatively infrequently to the systems, but be able to read any data, from any drive, at spectacularly low latency.

NVMe would ensure that these flash-heavy systems were able to absorb huge IOPS spikes gracefully, and the modern era of Bulk Data Computational Analysis (BDCA) was born. Entire data centers have been built with thousands upon thousands of extremely low latency, high IOPS storage. Data scientists can work on datasets that 10 years ago would have seemed impossible.

The world’s demand for data storage never quite seems to abate. More to the point, the world’s demand for low latency data storage continues to defy expectations.

Rationally, this supply and demand imbalance – and the associated high prices – cannot continue forever. Eventually either new supply will come online, or demand will abate. The Register’s Chris Mellor is predicting that the supply side has overcorrected, and that this will result in a NAND supply glut in 2019. Some people, including me, have doubts.

Flash Demand Is Tricky

Ultimately, there are only so many fabs in the world, and only so much capital available to build new ones, and/or upgrade old ones to newer technologies. The same is true of traditional hard drive production.

Over the past decade, investment in traditional hard drive manufacturing has stalled, and even declined. Flash sales are cannibalizing high margin spinning rust, and there’s no longer a point to making or selling 10k or 15k RPM hard disk drives, except to squeeze every last dollar out of the equipment investment you’ve already made (see Figure 2).

What the world wants from hard disk manufacturers is slow, reliable bulk storage with the lowest possible $/GiB. Basically, we want from hard disks the same thing we want from tape storage, but with better latency, and fewer robots required.

In response, hard drive manufacturers have been investing in flash production. Flash is where the margin is, and while these vendors will continue to create new technologies to advance hard disk drives, they’re not going to be dramatically scaling the number of units produced per year. We may never see another major hard disk drive plant open again.

Figure 2. Memory is taking a bigger bite of the overall IT pie.

If all of this screams “flash glut incoming,” just hold on. Traditional disk drive manufacturers have some huge hurdles to overcome in transitioning to flash vendors. The first is simply scale: the primary buyers of anything in IT these days are the major cloud providers. Cloud providers only buy in bulk, and they don’t like leaving a lot of margin on the table.

If you don’t already produce SSDs in serious volume, the paltry margin that cloud providers allow isn’t likely to be enough to drive both ongoing R&D and continued investment in new fabs. Without new fabs, you can’t increase your volume. As the cloud providers squeeze every last cent out of the SSD manufacturers, that volume is critical to having enough capital available to keep up with the Joneses; this makes entering the flash market something of a Catch-22, unless you’re seriously flush with cash at the outset.

This has made the past few years very hard for the likes of Western Digital and Seagate, who have not had the easiest time transitioning to being flash vendors. There still remained enough enterprise and consumer demand to keep things going, but as public cloud adoption increases, the high-margin customers start to dry up.

The Media Plays a Role

Mellor – and others – argue that additional fab capacity from the large NAND flash players, such as Intel and Samsung, will finally start meeting cloud provider demand, allowing additional units to enter the on-premises IT and consumer markets, pushing down prices. We’ve already started to see some of this.

Where I disagree with their analysis is that I also see pent-up demand from non-cloud buyers. Some projects have been on hold for months, even years, waiting for the cost of flash to come down. Similarly, increased investment in edge computing in 2019 looks to drive non-public-cloud equipment purchases, and these units will almost certainly be all-flash servers.

IoT devices that were otherwise too expensive become viable as one of the most expensive components – NAND flash storage – comes down. Most importantly, while demand for flash capacity is increasing, demand for flash speed is not. This is demonstrated ably by the market failure of 3D XPoint to date, with Micron intending to buy Intel out of their joint IMFT flash efforts, taking over the joint 3D XPoint production facility, and taking the technology in a different direction than Intel’s Optane vision.

Most people seem to want bulk storage that looks a lot like hard drives, costs about what hard drives do, but has better latency and lower failure rates than hard drives. This says to me that while the average $/GiB of flash is set to go down in 2019, we’re likely to see more units shipped in 2019 than 2018, as well as significantly more capacity shipped in 2019 than any previous year.

This makes 2019 more of a reversion to the mean regarding unit shipments and the change in $/GiB over time than a “glut,” though I suspect flash manufacturers will prefer Mellor’s interpretation, especially during shareholder calls.

The media plays a role in this. We’ve been saying for years that flash demand is extreme. Now we’re talking about a glut. How we shape the conversation does have an impact on how willing different buyers are to wait for the market to change. Flash is a commodity, no different than rice or oil, and in many ways speculation drives prices more than actual consumption.

Upcoming Changes

Intel and Micron are both expected to continue investment in 3D XPoint over the course of 2019, each evolving the technology in different directions. This looks to lead us into a world in which computer storage has multiple tiers.

For the next few years, at least, RAM will continue to be volatile, bit addressable and alterable, about 10x faster than 3D XPoint, while costing several times what 3D XPoint does. 3D XPoint, meanwhile, will be byte addressable and alterable, about 1000x faster than NAND flash, and cost several times what NAND flash does.

NAND flash, meanwhile, will be page-addressable and block-alterable, about 500x-1500x faster than spinning rust, and cost several times what rust drives do. This is all as it was in 2018, and it will be the same in 2020. The exact values will waver a bit, but the fundamental sorting of storage capabilities and pricing is not going to change any time soon.
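The tiering described above can be written down as a small table and multiplied through, which gives a feel for how far apart the tiers sit. This Python sketch uses only the rough multipliers quoted in this section; the class and function names are hypothetical, and the results are order-of-magnitude illustrations, not benchmarks.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    # (low, high) speed multiple over the next slower tier, as quoted in the text.
    speedup: Tuple[int, int]


# Ordered from slowest to fastest.
TIERS: List[Tier] = [
    Tier("hard disk", (1, 1)),
    Tier("NAND flash", (500, 1500)),   # 500x-1500x faster than spinning rust
    Tier("3D XPoint", (1000, 1000)),   # about 1000x faster than NAND flash
    Tier("RAM", (10, 10)),             # about 10x faster than 3D XPoint
]


def relative_to_hard_disk(tiers: List[Tier]) -> Dict[str, Tuple[int, int]]:
    """Multiply the per-tier speedups through to get each tier's range vs. hard disk."""
    low, high = 1, 1
    result: Dict[str, Tuple[int, int]] = {}
    for tier in tiers:
        low *= tier.speedup[0]
        high *= tier.speedup[1]
        result[tier.name] = (low, high)
    return result


for name, (low, high) in relative_to_hard_disk(TIERS).items():
    print(f"{name}: {low:,}x to {high:,}x")
```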

What’s likely to happen is that a new tier of storage will start to be introduced in late 2019 or early 2020 that slots in between traditional NAND flash and hard disk drives. 128-layer flash promises to drive the $/GiB down significantly, while also reducing flash write life quite a bit, producing a storage technology that is not merely read-optimized, but borders on Write Once Read Many (WORM). It will not be what you run a database on. It will be what you shove a trillion Facebook photos on, and then never overwrite.

Chinese chip fab YMTC looks set to produce 128-layer NAND in 2020, and it is expected to live somewhere between true WORM flash and the low-write-life 96-layer flash that one can find in bulk flash drives today. YMTC is not alone, though they are likely to be producing this flash in serious volume, threatening hard disk drive manufacturers, as the battle for bulk storage contracts heats up towards the end of 2019.

Flash sales overall, however, promise to be healthy. This is in part because flash manufacturers are constantly innovating. Samsung, for example, is investing in key-value SSDs, similar to Seagate’s Kinetic key-value hard drives.

While individual categories of flash may finally revert to the mean, and drag prices for the rest of us down closer to what large cloud providers pay, flash shipments overall will continue to increase, as flash storage WORMs its way into electronics niches at the edge of our networks that we didn’t even know we had.