Advanced Computing in the Age of AI | Wednesday, July 24, 2024

Facebook Loads Up Innovative Cold Storage Datacenter 

The next time you load an old photo from your Facebook archive, the odds are increasingly good that it won't be coming out of one of the social network's three datacenters, but rather from a new cold storage facility that just opened adjacent to the company's Prineville, Oregon datacenter.

The cold storage technologies that Facebook has developed have implications for any large-scale enterprise dealing with large volumes of infrequently accessed data that nonetheless have to be more or less online. Specifically, the data has to be more online than a tape library with active archiving software that automagically retrieves it and plunks it on storage arrays, but less online than a fast storage array.

Facebook revealed some feeds and speeds about the cold storage servers and their datacenters back in January at the Open Compute Summit. At the time, Jay Parikh, vice president of infrastructure, explained the problem that this new cold storage datacenter, and its very sophisticated storage arrays, are trying to solve. Back then, Facebook had over 240 billion photos that had been uploaded by its more than 1 billion users worldwide, and users were adding a staggering 350 million photos a day. That is an incremental 7 PB of storage per day that Facebook has to add to a photo archive that is several exabytes in total capacity. Keeping that much data on fast storage is very expensive.
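Those two figures imply a hefty footprint per upload, presumably because each photo is stored at multiple resolutions and replicated. A quick back-of-the-envelope check (the per-photo breakdown is our arithmetic, not Facebook's):

```python
# Back-of-the-envelope: storage added per uploaded photo, assuming the
# 7 PB/day figure covers multiple stored resolutions plus replication.
PHOTOS_PER_DAY = 350_000_000
BYTES_PER_DAY = 7 * 10**15  # 7 PB, in decimal petabytes

bytes_per_photo = BYTES_PER_DAY / PHOTOS_PER_DAY
print(f"{bytes_per_photo / 10**6:.0f} MB of raw capacity per uploaded photo")
```

That works out to roughly 20 MB of capacity burned per photo uploaded, far more than a single JPEG, which is consistent with multiple copies at multiple sizes.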

As it turns out, most of these photos are rarely accessed. Here's the access pattern, plotting traffic versus the age of the photos:

Photo read traffic versus the age of the photos. (Chart credit: Facebook)
As you can see, 82 percent of the traffic pulling photos out of the archive is only accessing 8 percent of the capacity, and that means the other 92 percent can be stored on slower, lower-powered disk arrays.
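That skew is what makes age-based tiering pay off. A minimal sketch of the routing decision such a system makes on every read (the 90-day cutoff and the `pick_tier` helper are illustrative assumptions, not Facebook's actual policy):

```python
from datetime import timedelta

# Hypothetical tiering rule in the spirit of the access data above:
# photos past a certain age live in the cold tier. The 90-day cutoff
# is an illustrative assumption, not Facebook's actual threshold.
HOT_TIER_MAX_AGE = timedelta(days=90)

def pick_tier(photo_age: timedelta) -> str:
    """Route a read to fast storage for recent photos, cold storage otherwise."""
    return "fast" if photo_age <= HOT_TIER_MAX_AGE else "cold"

print(pick_tier(timedelta(days=3)))    # recent upload -> fast
print(pick_tier(timedelta(days=365)))  # year-old photo -> cold
```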

Facebook came up with a scheme to alter its existing Open Vault storage array with its own funky hierarchical storage management software that reflects the very low usage of this data. It is also adopting disk drives based on a new technique called shingled magnetic recording, or SMR, to significantly increase the capacity of its Open Vault arrays.

SMR drives are going to be in volume production from Seagate and HGST, to name two vendors, shortly. The technology gets its name from the fact that tracks of data are overlapped on the disk drive, like shingles on a roof. Tracks on a disk platter are about 75 nanometers apart today and cannot be scrunched any closer together because the write head cannot be made any narrower. But read heads can read at a much finer scale, so with SMR, Seagate can turn a 4 TB drive into one with a 5 TB capacity by overlapping the write bands. Because the write heads are not fine grained, there is a performance hit with SMR: writing to a track partially overwrites the overlapping track laid down after it, so modifying data in place means reading and rewriting all of the subsequent tracks in that shingled band. So there is a bit of juggling with the data as new information is written to the disk, and presumably it gets worse as a drive fills up. That said, Seagate thinks it will be able to push the capacity of a 3.5-inch disk up to 20 TB using SMR, and that is the kind of thing Facebook loves to hear. (Facebook is not saying whose disks it is using, by the way.)
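The write penalty can be sketched with a toy model: tracks in a shingled band overlap in one direction, so updating a track forces a rewrite of every track laid down after it in the same band. The 20-track band size here is illustrative, not any vendor's actual geometry:

```python
# Toy model of SMR write amplification. Tracks in a shingled band overlap
# like roof shingles, so an in-place update to track i clobbers the tracks
# written after it; the drive must read and rewrite tracks i..end of band.
# The band size is an illustrative assumption, not a vendor spec.
BAND_TRACKS = 20

def tracks_rewritten(updated_track: int, band_tracks: int = BAND_TRACKS) -> int:
    """Number of tracks physically rewritten to update one logical track."""
    return band_tracks - updated_track

# Updating the last track in a band touches only itself; updating the
# first track forces a rewrite of the whole band.
print(tracks_rewritten(BAND_TRACKS - 1))  # 1 track rewritten
print(tracks_rewritten(0))                # 20 tracks rewritten
```

Appending new data at the end of a band is cheap, which is exactly why SMR suits a write-once archive like a photo store better than a churn-heavy workload.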

Facebook's Open Vault storage array. (Photo credit: Facebook)


The other clever bit of this cold storage facility designed by Facebook is the way the photos are stored. Instead of just dumping a photo onto a disk drive, the company's engineers have broken each photo into pieces and used a Reed-Solomon encoding technique plus checksums to spread a photo over multiple disk drives in multiple disk arrays. What this means is that a photo can be served up from cold storage in parallel from multiple drives, thus making it look to applications like it is coming from a single very fast disk. (And you thought parallelization was just for compute. . . . )

This Reed-Solomon encoding is also used to recreate photos if any drives in the storage vaults fail; the surviving fragments and parity data are enough to recompute whatever was sitting on a failed drive.
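The stripe-plus-redundancy idea can be sketched in a few lines. Facebook uses Reed-Solomon codes, which tolerate multiple simultaneous drive failures; for brevity this sketch substitutes a single XOR parity fragment, which illustrates the same split-and-recover mechanics but survives only one lost fragment:

```python
from functools import reduce

def stripe(photo: bytes, n_data: int):
    """Split a blob into n_data equal-size fragments plus one XOR parity.

    Simplified stand-in for Reed-Solomon: one parity fragment, so only
    a single lost fragment can be rebuilt.
    """
    chunk = -(-len(photo) // n_data)              # ceiling division
    padded = photo.ljust(chunk * n_data, b"\0")   # pad so fragments align
    frags = [padded[i * chunk:(i + 1) * chunk] for i in range(n_data)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), frags)
    return frags, parity

def recover(frags, parity, lost: int) -> bytes:
    """Rebuild the fragment at index `lost` from the survivors plus parity."""
    survivors = [f for i, f in enumerate(frags) if i != lost] + [parity]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

photo = b"not actually a JPEG, just stand-in bytes for one photo"
frags, parity = stripe(photo, n_data=4)
assert recover(frags, parity, lost=2) == frags[2]  # "failed drive" rebuilt
```

In the real system each fragment lands on a different drive in a different array, which is also what lets a read pull all the fragments back in parallel.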

There is one other neat bit about the cold storage facility. To keep power consumption as low as possible and in line with the actual access patterns of the Facebook photo archive, only one disk drive in each Open Vault storage server is turned on at a time. All other drives are shut down to conserve power, and the time it takes to spin up a drive and serve data in parallel fashion is so small that Facebook users don't even know their photos are coming out of cold storage rather than disk arrays right next to the servers.

The Open Rack-Open Vault cold storage. (Photo credit: Facebook)


The Facebook cold storage facility uses modified Open Racks, which have been open sourced through the Open Compute Project founded by the company, and each rack of Open Vault storage servers can house 2 PB of storage. That is eight times the density of the prior Open Vault arrays, by the way. Each rack was designed to burn only 2 kilowatts of power, and the plan was for an exabyte of storage to consume only 1.5 megawatts, using 480 volt power and no redundant power systems in the datacenter. These cold storage variants of the Open Vault were projected to cost one-third as much as the existing storage arrays, and the datacenter housing them was designed to cost one-fifth as much as a conventional Facebook datacenter. (Which, by the way, is anything but conventional.)
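Those figures hang together on a quick check, assuming decimal units (1 EB = 1,000 PB) and that the gap between the raw rack draw and the 1.5 MW target is facility overhead, which is our reading rather than Facebook's stated breakdown:

```python
# Sanity check on the stated rack and power figures.
PB_PER_RACK = 2   # capacity per Open Vault cold storage rack
KW_PER_RACK = 2   # design power per rack

racks_per_exabyte = 1000 // PB_PER_RACK            # 1 EB = 1,000 PB (decimal)
raw_power_kw = racks_per_exabyte * KW_PER_RACK     # power at the racks alone
print(racks_per_exabyte, "racks per exabyte;", raw_power_kw, "kW before overhead")
```

That is 500 racks and 1 MW at the racks for an exabyte, leaving about 0.5 MW of headroom inside the 1.5 MW target for everything else in the facility.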

What the new cold storage facility that Facebook has built clearly demonstrates is that enterprises with extreme scale have to take a holistic view of their storage, from data access patterns and volumes to the configuration of the systems and storage arrays that cope with them, all the way out to the datacenter that houses them. In a funny way, Facebook's cold storage datacenter is a very large, multi-exabyte disk drive.

EnterpriseTech is going to hook up with Facebook to see if the design worked out as well as the plan. Stay tuned.