How Shutterstock Keeps Its Hyperscale Infrastructure In Check
Since it was founded 11 years ago, Shutterstock has grown to become one of the largest image sharing marketplaces on the Internet. Managing that growth has not always been easy for the company, which runs its own IT infrastructure. But thanks to an in-memory analytic database that collects and analyzes 20,000 operational data points per second, it has managed to stay on top of the infrastructure expansion.
Shutterstock was founded in 2003 by Jon Oringer, who had created one of the Web's first pop-up blockers in the late 1990s. Oringer was looking to build upon that success by selling firewalls, cookie blockers, and other utilities. He realized that the email pitches for these products always did better with pictures, and figured that other Web entrepreneurs would benefit from low-cost, generic stock photos too.
Over the next year, Oringer took 30,000 photos with his Canon Rebel, and sold them on his new website, Shutterstock.com. The idea took off, other photographers signed up, and today Shutterstock is the second biggest provider of stock photos behind industry giant Getty Images. The New York City-based company brought in about $235 million in revenues last year, and enjoys a company valuation north of $2 billion (it went public on the NYSE in early 2013).
Unlike many Web companies today, Shutterstock runs its own server infrastructure. It is a complex setup that is composed of nearly 5,000 server nodes split between primary datacenters in Texas and Boston and smaller datacenters in Washington state and New Jersey.
The company's generic Linux servers do pretty much everything: uploading and compressing photos, videos, and vector graphics in an object store; hosting the Web-based catalog; processing credit cards and log-ins; and serving downloads to clients. Today the setup hosts 40 million images, which customers are downloading at the rate of three per second.
The company has used a variety of tools over the years to monitor the health of its infrastructure, but none of them could provide the deep, real-time inspection of operations that Shutterstock desired, explains Chris Fischer, the company's vice president of technology operations.
"It's a pretty sophisticated, complex architecture, and that's the reason some of this data is really important," Fischer tells EnterpriseTech. "Without the data it's not easy to be able to ensure health of the aggregate system without doing pretty rich analysis. It's not as simple as if we just had a few Web servers we were monitoring."
The company tried building its own monitoring solutions using open source tools such as Nagios, Redis, and MariaDB, but that approach either didn't scale, didn't run in real time, or didn't have the richness the company desired. "You can use something to paint a line or draw a graph," Fischer says. "But if you want to query in real-time data structures that large, it's a pretty complicated problem."
The scale of Shutterstock's operations has grown to the point where no monitoring solution can simply be dropped into place. "We're doing sophisticated data analysis against system metrics, so it's not just up or down," he says. "It's the standard deviation over time or specific windows. We're doing computationally intensive analysis on a real-time data stream, and to do that, there's not a ton of tools out there. The closest thing to do something like that are time-series databases, but they're very, very new."
In 2012, Fischer and his Shutterstock colleagues began exploring in-memory databases as a way to get the operational analytics they desired for their hyperscale IT infrastructure. They had heard of MemSQL, a scale-out, in-memory relational database designed for distributed applications. Fischer checked out some of MemSQL's competitors in what is colloquially called the NewSQL space, but in the end selected MemSQL, citing the maturity of the company's products and leadership.
Today Shutterstock runs a rack of sixteen MemSQL nodes, each equipped with 256 GB of memory. More than 20,000 data points continuously flow into the cluster, providing a second-by-second update of everything that matters to the operations desk: CPU, disk, and RAM utilization; inbound and outbound network traffic; concurrent user count and failed authorization attempts; pictures uploaded and downloaded; API utilization; and credit card transaction rates and revenue per minute.
The database continuously performs calculations and generates the operational metrics that Fischer and his team need to determine the health of Shutterstock's IT infrastructure. The company keeps all the data from the last month live in the MemSQL system, providing the company instant notification if any metric dips or rises compared to the same readings from a minute, hour, day, or month ago.
"The part that's hard to do in these systems is that you're calculating that for tens of thousands of metrics every single second," Fischer says. "You just can't do that on disk-based systems. You need something that's in-memory or highly tuned to be able to do that kind of computational math all the time."
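To make the idea concrete, the following is a minimal Python sketch of the kind of windowed check Fischer describes: comparing each new reading of a metric against the mean and standard deviation of its recent history. It is not Shutterstock's code and does not use MemSQL. The class name, window size, and threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingDeviationCheck:
    """Hypothetical helper: flags readings far from a metric's recent mean."""

    def __init__(self, window_size=60, threshold=3.0):
        # Keep only the most recent `window_size` readings (assumed: one per second).
        self.window = deque(maxlen=window_size)
        # Number of standard deviations that counts as anomalous (assumed value).
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous versus the current window, then record it."""
        anomalous = False
        if len(self.window) >= 2:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat window: any change at all is a deviation.
                anomalous = value != mu
        self.window.append(value)
        return anomalous


# Usage: five ordinary CPU readings followed by a spike.
check = RollingDeviationCheck(window_size=5, threshold=3.0)
flags = [check.observe(v) for v in [50, 52, 49, 51, 50, 95]]
print(flags)
```

Running one such check per metric is cheap. The difficulty Fischer points to comes from doing it for tens of thousands of metrics every second, against several comparison windows at once.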
Standard monitoring tools may be able to write tens of thousands of data points to a log every second, but their limited ability to interpret that data in real time is what restricts their usefulness in hyperscale environments, explains MemSQL co-founder and CEO Eric Frenkiel.
"It's the query-ability factor," he says. "A lot of the traditional systems just do rote monitoring. They can only write. They can't read, which means you cannot do correlation analytics or trend analysis or run ad hoc queries."
In Shutterstock's case, this query-ability--via good old SQL--brings big benefits when monitoring an API for abuse. Shutterstock has a deal with Facebook, for example, that allows Facebook users to programmatically access its services.
"They want to catch spammers and people who are abusing the API, and they want to do it intelligently," Frenkiel says, "because if you ban people on the APIs without a positive match, you upset a potential customer and create dissatisfaction. With MemSQL they can monitor when the API is being abused and isolate it very quickly."
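As a rough illustration of what such an ad hoc SQL query might look like, the Python sketch below flags API keys only when they show both high volume and a high failure rate, so that a busy but legitimate client is not banned. Everything here is an assumption: the api_requests table, its columns, the thresholds, and the sample data are hypothetical, and SQLite's in-memory mode stands in for MemSQL purely so the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per API request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_requests (api_key TEXT, ts INTEGER, status INTEGER)")

rows = []
# A busy, well-behaved partner: 150 successful calls.
rows += [("partner-a", i % 60, 200) for i in range(150)]
# A suspicious key: 300 calls in the same minute, 240 of them rejected (401).
rows += [("key-b", i % 60, 401 if i % 5 else 200) for i in range(300)]
# A low-volume key with a few failed logins: too little evidence to act on.
rows += [("key-c", i, 401 if i else 200) for i in range(5)]
conn.executemany("INSERT INTO api_requests VALUES (?, ?, ?)", rows)


def find_suspect_keys(conn, since_ts, min_calls=100, min_failure_ratio=0.5):
    """Return (api_key, calls, failures) for keys that are both busy and mostly failing."""
    query = """
        SELECT api_key,
               COUNT(*) AS calls,
               SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) AS failures
        FROM api_requests
        WHERE ts >= ?
        GROUP BY api_key
        HAVING COUNT(*) >= ?
           AND SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) >= ?
        ORDER BY calls DESC
    """
    return conn.execute(query, (since_ts, min_calls, min_failure_ratio)).fetchall()


suspects = find_suspect_keys(conn, since_ts=0)
print(suspects)
```

Requiring both conditions in the HAVING clause is one simple way to approximate the "positive match" Frenkiel mentions. A production rule would likely weigh more signals than these two.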
Shutterstock has done a lot of things to maintain a high level of uptime with its website and infrastructure, and MemSQL is just one part of that strategy. Just the same, it has been an important part, and one that lets the operations team sleep better at night.
"Anything we can do to make the website more resilient, more robust, faster, or show better visibility to our engineering teams of what's going wrong or what's going great--that's all driven by data and having the ability to look at that data in various ways is super important," Fischer says. "It makes my job easier."