Advanced Computing in the Age of AI | Tuesday, May 21, 2024

Retailer Sparks Customer Insights from Decades of Data – ‘Fast Is Never Fast Enough’

We often hear that online retail is among the business sectors furthest along on its digital transformation and AI journeys. The giant internet retailer (2017 revenues: $1.75 billion) known mostly for furniture and home decor is a case in point.

The Midvale, UT, company has built out a digitized ecommerce environment that captures and analyzes the behavior of millions of monthly visitors to its website – behaviors such as adds and removals from shopping carts, product queries, product comparisons, time of day and so forth – amounting to billions of individual site events.

For Chris Robison, a self-described “data enthusiast” and head of marketing data science, the retailer’s enormous data stores are “where I knew I wanted to take my career… ecommerce web log data is the playground we all dream of.”

Formed in 1999 during the dot-com boom, the company has captured two decades of customer data that Robison and the company’s teams of data scientists, engineers and analysts mine for “actionable insights,” as they say in the analytics world. The team faces not only a formidable data science challenge but data science at scale.

“As a culture we made a deliberate decision to never be complacent,” Robison told EnterpriseTech. “Fast is never fast enough. The rate at which our customers are becoming more sophisticated in terms of what they expect from their experience, and the rate the tech is changing, speed is key. We have to have horizontally scalable technologies that allow us to keep progressing, keep pushing and keep refining these processes, so we’re delivering more insights and delivering more personalized customer experiences in near real time.”

Robison, who joined the company nearly two years ago, said the initial problem he and his team tackled was figuring out when a customer is close to making a purchase. “With nearly 5 million products for sale onsite and billions of visits and page views in our historic web logs,” Robison wrote in a blog, “sifting through the massive amounts of sparse data to construct adequate features and identify useful signals in a never-ending stream of cart interactions, page views, and product attribute selections proved to be an enormous computational challenge.”


To help the organization handle the live event stream coming from the site – individual log entries for every action a customer performs – the company decided last year to convert itself into a Spark shop, partnering with Databricks and its Unified Analytics Platform to help build and train machine learning models. The Databricks platform is built on Apache Spark, the open source framework for horizontally scaling clusters for big data processing. In fact, one reason the company chose Databricks is that its founders were Apache Spark’s creators “and they’re still making the majority of contributions to the open source project,” said Robison.

Databricks enables the team to shorten the gap between proof of concept and production. “The last mile of productionizing models at scale is the most painful part of traditional deployments,” he said. “Databricks allows us to POC and productionize models all in the same environment on the same datasets. Most importantly, our data scientists can do both in R, Python or Scala, allowing for true flexibility across a variety of libraries and toolkits in addition to native Spark.”

Besides AWS-based automated cluster management, Databricks offers IPython-style data science notebooks – of which Robison is a particular fan – whose interfaces combine word processing with the shell and kernel of that notebook's programming language. The notebooks enable the team to analyze data together and work on code across different languages.

“It gives us an environment where our data scientists can collaborate much more effectively,” Robison said. “We have very senior data scientists coming in with Python experience, and analysts coming out of (college) with more R experience, and we have others with various NoSQL backgrounds. We needed an environment where all those skill sets could come together. Databricks provides that from a back-end infrastructure standpoint. Then they take it a little further with their actual collaboration tools.”

Inside of a collaborative session using a notebook, Robison said, he “can have an analyst in a single cell doing some SQL or some light R, I can log in and collaborate with them, and in the next cell I can be writing code in Python or Scala… You can move back and forth from cell to cell, so you can have multiple languages all in one notebook, which means you can have team members collaborating in the language they know best but still driving toward a holistic product and a holistic solution.”
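Databricks notebooks expose this mixing through per-cell language magics (%sql, %python, %scala and %r are documented Databricks commands). A sketch of the kind of session Robison describes – the table and variable names here are illustrative, not the retailer’s actual schema:

```
%sql
-- Cell 1 (analyst): light SQL exploration
SELECT device_type, COUNT(*) AS sessions
FROM web_sessions
GROUP BY device_type

%python
# Cell 2 (data scientist): pick up the same table in Python
sessions = spark.table("web_sessions")
by_device = sessions.groupBy("device_type").count()

%scala
// Cell 3: or continue in Scala against the same cluster
val sessions = spark.table("web_sessions")
```

Each cell runs against the same cluster and the same underlying data, which is what lets team members stay in the language they know best.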

Robison described a typical collaborative working session. Let’s say there’s a new team member working on a first project. Robison receives an after-hours text saying that the colleague is struggling with some code and asks Robison to look at his notebook.

“I can do that on my laptop, we can get together in one place in real time, make comments and revisions back and forth together. It allows for heightened levels of collaboration and coaching… So I can onboard people quicker, I can point them toward example notebooks that the analysts or data scientists have created that can include demos to help them get up to speed quicker. It really gets down to breaking down silos between individuals and teams and getting everyone in a similar environment.”

Between its scaled-out big data compute capabilities and its notebooks, the analytics team, according to Robison, can stand up new machine learning models five times faster, it can make intra-day improvements to existing models without new deploys, and it can quickly spin clusters up and down through self-service and thus respond to business partner needs faster. In addition, Databricks’ in-notebook version control “allows us to roll back single moves inside a notebook, making exploration and a general trial/error approach to exploratory analysis seamless.”

In all, the cost of moving models to production has been cut by nearly 50 percent, according to Robison.

The ultimate objective of all of this is to understand the company’s customers better and propel them to make purchases when they’re on the site.

“Our goal was to identify a large collection of features (customer behaviors) that could be used to answer these questions: when do you usually shop, when do you purchase, and what device do you prefer to use in each case,” Robison said. “Through our exploratory analysis of the session data we noticed some interesting trends in shopping behavior. People tend to window shop during the day while they are at work. However, they tend to wait until the evening to make purchases, especially large purchases. Moreover, our customers tend to shop on mobile devices, but convert on desktops.”

The team utilizes 1, 7, 14, and 30-day look-back customer behavior windows for empirical features and histograms. This generates signals showing changes in behavior as a customer moves closer to purchase.
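The article doesn’t show the team’s feature code, but the look-back idea can be sketched in plain Python: count a customer’s events inside trailing 1-, 7-, 14- and 30-day windows, so that rising short-window counts signal a customer moving toward purchase. The helper and event list below are hypothetical.

```python
from datetime import datetime, timedelta

# Trailing windows from the article: 1, 7, 14 and 30 days.
WINDOWS_DAYS = (1, 7, 14, 30)

def lookback_counts(event_times, now, windows=WINDOWS_DAYS):
    """Count events falling inside each trailing window ending at `now`."""
    counts = {}
    for days in windows:
        start = now - timedelta(days=days)
        counts[days] = sum(1 for t in event_times if start <= t <= now)
    return counts

now = datetime(2024, 5, 21)
# Illustrative events 0.5, 3, 10, 25 and 40 days ago.
events = [now - timedelta(days=d) for d in (0.5, 3, 10, 25, 40)]
print(lookback_counts(events, now))
# → {1: 1, 7: 2, 14: 3, 30: 4}
```

A cluster of recent events (a large 1-day count relative to the 30-day count) is exactly the kind of behavioral shift such windows are meant to surface.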

“With some initial testing and cross-validation we realized that the web-log behavior could be enriched by identifying what a customer’s state was for any specific session, i.e., for a given session in time how long ago did this user make a return?” he said. “Are they a priority club member? If so, how long ago did they join the club, or cancel their membership? However, the joins between our sessionized data and various customer information tables presented another set of computational issues.”

For this, the team leveraged Spark’s Snowflake connector with query pushdown to make these complicated joins and aggregations efficient across millions of users per day.
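The Spark–Snowflake connector takes its connection details as an option map, and its `autopushdown` setting is what lets eligible filters, joins and aggregations execute inside Snowflake before data moves to Spark. A hedged sketch – the credentials and object names are placeholders, and the read shown in the comment assumes a live Spark session with the connector installed:

```python
def snowflake_options(url, user, password, database, schema, warehouse):
    """Build the option map for Spark's Snowflake connector."""
    return {
        "sfURL": url,
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
        # Enable query pushdown: eligible Spark operations are translated
        # into SQL that Snowflake executes before results are transferred.
        "autopushdown": "on",
    }

opts = snowflake_options("account.snowflakecomputing.com", "etl_user", "***",
                         "ANALYTICS", "PUBLIC", "ETL_WH")

# In a Spark session, the dict would be used roughly as:
#   df = (spark.read.format("snowflake")
#           .options(**opts)
#           .option("dbtable", "CUSTOMER_SESSIONS")
#           .load())
```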

Robison said once features are generated, model training using Spark on Databricks proves to be straightforward.

“We enhanced Spark’s cross-validation method to allow for cross-validation and hyper-parameter tuning inside each of three algorithms,” he said, “then picked an optimal algorithm and parameter set out of the three. Our cross-validation method allowed us to test several models for each retraining and constantly promote the best performing model based on new data. Next, we leveraged the multi-language support in Databricks to generate rich reports and visualizations to describe each retraining and execution of our model. Reports include high-level model descriptions and versioning, visualization of classifications and underlying data distributions, classical statistical tests for changes in those distributions, metrics, parameter descriptions, and default runtime settings.”
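The selection loop Robison describes – cross-validate inside each candidate algorithm, then promote the best scorer – can be shown with a deliberately tiny, self-contained stand-in. The three “algorithms” below are toys (a majority-class predictor and two opposed threshold rules), not the team’s actual models.

```python
import statistics

def kfold_score(fit, predict, xs, ys, k=3):
    """Mean accuracy over k interleaved folds."""
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    scores = []
    for fold in folds:
        train = [i for i in range(len(xs)) if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return statistics.mean(scores)

# Candidate "algorithms": each is a (fit, predict) pair.
def fit_majority(xs, ys):
    return 1 if ys.count(1) > ys.count(0) else 0

def predict_majority(model, x):
    return model

def fit_threshold(xs, ys):
    return sum(xs) / len(xs)  # split at the training mean

def predict_threshold(model, x):
    return 1 if x >= model else 0

def predict_inverse(model, x):
    return 0 if x >= model else 1

candidates = {
    "majority": (fit_majority, predict_majority),
    "threshold": (fit_threshold, predict_threshold),
    "inverse": (fit_threshold, predict_inverse),
}

# Toy data where the threshold rule is the right model.
xs = list(range(12))
ys = [1 if x >= 6 else 0 for x in xs]

scores = {name: kfold_score(fit, pred, xs, ys)
          for name, (fit, pred) in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # → threshold
```

Retraining on fresh data and re-running the same loop is what “constantly promote the best performing model” amounts to: the winner can change as behavior shifts.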

Because Databricks is hosted in AWS, the company’s IT organization can leverage cloud compute elasticity both during the exploratory phase, when machine learning models are under development, and when site traffic peaks, such as on Black Friday and around the holidays.

“In the exploratory phase, the true analytics part of data science…you tend to ask questions and try to evaluate hypotheses that are extremely expensive in terms of computation power,” Robison said, “but in my general experience the first questions never lead me anywhere. Finding your way around in the dark as you get used to new data sets or new problems or new areas of the company, that early research phase can be very resource intensive. But you also have a time consideration. What I hate to see from one of my team members is to have a very interesting idea to explore but have to wait several days for the resources to answer those questions.”

With scalable, elastic compute resources, the team can stand up as many resources as needed for a short time period to get answers to exploratory questions. “So the beauty of a platform like Databricks is it prevents my team members from lags in the creative cycle. They’re able to keep innovating, keep asking questions and keep seeking answers, and it heightens the focus and accelerates progress.”

Running analytics and site marketing in the cloud avoids disrupting the company’s ecommerce system.

A perfect storm can occur when site traffic hits high levels, which also raises the compute the ML algorithms need to interact with customers on the site in near real time.

“No matter how fantastic the machine my team has developed, if we slow down the site at all then the machine learning is going to be the first thing to go – and rightfully so,” Robison said. “We’re an ecommerce site. Slowing down page load times, slowing down checkouts, can be devastating. So we made the decision to move towards more of a cloud infrastructure utilizing new technologies like Databricks…to get the computational power when we need it at those heightened times. And it allows us to achieve our work without getting in the way of the actual site, if that makes sense, to make sure we have all the resources we need when we need them, and we can spin down cloud resources when we don’t need them.”
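The spin-up/spin-down pattern Robison describes maps to cluster autoscaling. A hedged sketch of a cluster definition fragment – the field names follow the Databricks Clusters API, but the specific values here are illustrative:

```json
{
  "cluster_name": "ml-exploration",
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 12 },
  "autotermination_minutes": 30
}
```

A cluster like this grows toward `max_workers` under load, shrinks when work drains, and shuts itself down after 30 idle minutes, which is how exploratory bursts avoid both contention with the live site and idle-resource cost.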