
One FinServe Company’s Strategy for Feeding the CCAR Beast 


Before even addressing the devilish complexity of CCAR simulations, the first problem financial services industry (FSI) firms must solve is finding the compute capacity to run resource-hogging CCAR workloads. These are monster jobs run only once a year, placing hugely uneven demands on data centers and requiring more capacity than most FSI companies have on hand.

Thus, the CCAR Consumption Dilemma. Companies can either:

  1. Buy a carload of new servers, but this could be a bad CAPEX bet that leaves excess capacity for the rest of the year when CCAR workloads aren’t running, or...
  2. Try to squeeze CCAR workloads into existing capacity, but this could disrupt regular operations and ignite a zero-sum steel cage fight among end users grabbing compute capacity for themselves.

CCAR (Comprehensive Capital Analysis and Review), instituted by the Federal Reserve after the financial crisis, is an annual exercise that assesses the health of major institutions. Major banks and other large FSI institutions analyze macroeconomic scenarios with two-year time horizons, accounting for employment, interest rates, stock market performance and other factors. For each scenario, institutions must revalue their assets and liabilities, and prepare financial statements.

At one major financial institution (that has requested anonymity), CCAR had a potentially large impact on capital management actions, including share repurchases, dividend payments to shareholders and liquidity requirements. One of its business lines is variable annuities (VAs): complex, path-dependent instruments that represent billions of dollars of balance sheet exposure and are valued using compute-intensive Monte Carlo walk-forward simulations. CCAR testing places additional strain on the company’s internal resources, a burden that grows as model resolutions increase and reporting deadlines tighten.
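
To make that compute profile concrete, the sketch below shows the general shape of a walk-forward Monte Carlo valuation for a path-dependent guarantee. It is illustrative only: the drift, volatility, guarantee terms and path count are placeholder assumptions, not the institution’s actual model. What the sketch does capture is why these jobs are such capacity hogs: cost grows linearly with paths and time steps, and path dependence forces the simulation to walk forward one step at a time.

```python
import numpy as np

def value_guarantee(s0=100.0, guarantee=100.0, mu=0.03, sigma=0.2,
                    years=2, steps_per_year=252, n_paths=100_000,
                    r=0.02, seed=42):
    """Toy walk-forward Monte Carlo for a ratcheting minimum-benefit guarantee.

    Real VA models add policyholder behavior, fees, and nested projections,
    but the compute profile is the same: cost grows linearly with paths and
    time steps, and every path is independent of every other.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps_per_year
    s = np.full(n_paths, s0)
    high_water = s.copy()
    # Walk forward one step at a time; path dependence (the ratchet)
    # means we cannot jump straight to the terminal value.
    for _ in range(years * steps_per_year):
        z = rng.standard_normal(n_paths)
        s *= np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        high_water = np.maximum(high_water, s)  # guarantee steps up with the market
    payoff = np.maximum(np.maximum(guarantee, high_water) - s, 0.0)
    return np.exp(-r * years) * payoff.mean()

print(f"estimated liability per contract: {value_guarantee():.2f}")
```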

Faced with the CCAR Consumption Dilemma, the company rejected the options listed above and instead took the problem out to AWS, leveraging the scale-on-demand / pay-as-you-go cost model of a public cloud. But this strategy posed integration, governance and control problems in its own right. For these and other issues, the company worked with Cycle Computing to create a parallel universe of the institution’s internal environment. CycleCloud provides a “tool chain” that addresses the security model, the support model, operations, cost reporting, planning and utilization, to orchestrate large-scale computing and storage resources in cloud environments. It also helped close the institution’s skills gap in porting applications and data feeds to the cloud.

According to Jason Stowe, Cycle’s CEO, the FSI institution has total data set volume “in the low terabytes, running hundreds of thousands to millions of hours of computing” off of that data.

From a practical standpoint, Stowe said, running CCAR jobs “would have taken up the entire environment, the entire cluster, for the better part of a month. The exciting part about this is we were able to connect them to the cloud and enable them to take advantage of several thousand additional processors at a fraction of the cost of buying those processors and get the results back an order of magnitude faster (weeks shortened to days) than would have been possible otherwise.”

Stowe said the AWS-based CCAR implementation saved the company 60 percent compared to the cost of adding internal compute capacity.

“It makes running CCAR workloads more cost effective but it also makes it so you end up with very flexible working models and you can make your people more productive, so no one’s ever waiting for compute,” he said. “They’re able to get access to compute power when they need it, and be able to turn on and off the environment as soon as they want to.”

Cycle began by working with the institution’s internal team to understand the scope of challenges it faced across the broader organization, including the logistics of how the runs would work and how the data would be transferred. Stowe said Cycle provided a reference architecture covering both data transfer to/from the cloud, and computations and data management within the cloud. The reference architecture integrated the existing internal production environment with new cloud-based requirements and was the basis for the production design.
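
While the details of that reference architecture weren’t disclosed, the transfer leg of such designs typically reduces to staging inputs in cloud object storage and pulling results back when the run completes. Below is a minimal, hypothetical sketch using boto3 against S3; the bucket name, key layout and run-ID convention are invented for illustration, not taken from the institution’s setup.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "ccar-staging-example"   # hypothetical staging bucket

def upload_inputs(local_paths, run_id):
    """Stage scenario and policy data in the cloud before the run."""
    for path in local_paths:
        key = f"runs/{run_id}/inputs/{path.split('/')[-1]}"
        s3.upload_file(path, BUCKET, key)

def download_results(run_id, dest_dir):
    """Pull completed valuation results back to the internal data center."""
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix=f"runs/{run_id}/results/")
    for page in pages:
        for obj in page.get("Contents", []):
            name = obj["Key"].split("/")[-1]
            s3.download_file(BUCKET, obj["Key"], f"{dest_dir}/{name}")
```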

Naturally, security was a primary issue. Stowe said that in helping the company integrate its internal and external infrastructures, Cycle’s focus was on making sure “this big compute environment had all the same authentication, compliance, reporting” capabilities in the cloud. For this, Cycle partnered with Second Watch, a managed service provider focused on AWS environments that helps organizations comply with federal regulations and reporting requirements. They put in place a secure-connection data transfer structure for communication between the internal data center and AWS, building a virtual private cloud for the cloud environment. This included using AWS’s identity management tools to control access rights, along with its security capabilities and certifications (e.g., ISO 27001, ISO 27017, ISO 27018). The idea was to protect cloud components (storage, schedulers, execute nodes, etc.) within a virtual private network.
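
As a rough illustration of the virtual-private-cloud side of that setup, the hypothetical boto3 sketch below carves out a VPC and a security group that only admits traffic from a corporate address range. Every CIDR block, port and name here is a placeholder assumption; a production build would layer on VPN or Direct Connect links, IAM policies, and audit logging, none of which is shown.

```python
import boto3

ec2 = boto3.client("ec2")

# Isolated network for the compute environment (example address ranges).
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
subnet = ec2.create_subnet(VpcId=vpc["VpcId"],
                           CidrBlock="10.0.1.0/24")["Subnet"]  # nodes launch here

# Security group that only admits encrypted traffic originating from the
# internal data center's address range (192.0.2.0/24 is a placeholder).
sg = ec2.create_security_group(
    GroupName="ccar-compute", Description="CCAR execute nodes",
    VpcId=vpc["VpcId"])
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "192.0.2.0/24"}],
    }])
```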

An initial aspect of the implementation process was verifying that results obtained with the cloud approach were consistent and correct, Stowe said. Parallel runs were performed on both the internal and cloud environments and results were analyzed for accuracy.
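
A check of that kind can be as simple as valuing the same portfolio in both environments and comparing per-policy results within a tolerance sized for floating-point noise. A minimal sketch, with an illustrative tolerance:

```python
import numpy as np

def results_match(internal, cloud, rel_tol=1e-6):
    """Compare per-policy valuations from the internal grid and the cloud.

    With identical code, inputs and random seeds the numbers should agree
    to floating-point noise; a looser tolerance would be needed if seeds
    or summation order differ across environments.
    """
    internal = np.asarray(internal)
    cloud = np.asarray(cloud)
    diff = np.abs(internal - cloud) / np.maximum(np.abs(internal), 1e-12)
    worst = diff.max()
    print(f"worst relative difference: {worst:.2e}")
    return bool(worst <= rel_tol)
```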

Then came performance tuning. Cycle monitored performance and software configurations to ensure that expected throughput was delivered, and recommended configuration changes to accommodate differences in CPU and I/O throughput between the internal and cloud environments. Stowe said this also helped manage cloud costs through the use of targeted instance types and other techniques.
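
Instance-type targeting of this sort often comes down to simple arithmetic: for a fixed core-hour budget, dollars per run vary with cores per instance, hourly price and relative throughput. The sketch below works through that arithmetic; all figures are made-up placeholders, not AWS pricing or the institution’s numbers.

```python
CORE_HOURS_PER_RUN = 500_000  # total compute per CCAR run (illustrative)

candidates = {
    # name: (cores per instance, $ per instance-hour, relative throughput)
    "compute-optimized": (36, 1.53, 1.00),
    "general-purpose":   (32, 1.54, 0.85),
    "memory-optimized":  (32, 2.02, 0.90),
}

for name, (cores, price, speed) in candidates.items():
    # Fewer effective core-hours per instance-hour => more instance-hours.
    instance_hours = CORE_HOURS_PER_RUN / (cores * speed)
    print(f"{name:18s} ~${instance_hours * price:,.0f} per run")
```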

According to Stowe, the cloud implementation met the requirements for performing yearly CCAR reports for VAs. He said the application was able to scale larger than originally planned, shortening runtimes, while the choice of instance types lowered costs.

Over time, the institution identified three performance drivers within the cloud, according to Stowe:

  • Cores: The main driver of increased throughput is the vast capacity of a public cloud compared with internal capacity. The institution found that increasing the number of cores by 7x cut runtimes from one week to one day, according to Stowe (see the arithmetic sketched after this list).
  • Matching hardware with workload: Stowe said the flexibility of the cloud improves throughput because hardware (RAM, cores per processor, etc.) is specified by the user at the start of each run, so the hardware profile evolves in step with changes in the workload. In contrast, the internal grid’s hardware profile typically evolves with an annual budget cycle and new equipment acquisitions.
  • Hardware refresh: Productivity increased because cloud providers tend to keep their hardware (CPU, GPU, storage, etc.) in sync with OEM release cycles, while internal hardware is upgraded on multi-year depreciation schedules.
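
The cores bullet is straightforward strong-scaling arithmetic, assuming the workload parallelizes near-ideally (which per-policy Monte Carlo valuation approximately does):

```python
baseline_hours = 7 * 24    # one week of runtime on the internal grid
core_multiplier = 7        # 7x more cores available in the cloud
print(baseline_hours / core_multiplier, "hours")  # 24.0 -> about one day
```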

These factors led the institution’s internal team to transfer other production workloads to AWS. Stowe said this came about because many financial batch workloads, including VA analytics, are well suited to cloud computing: each policy (or instrument, or security) is independent of the others in the portfolio, he said, so adding cores cuts run time without adding complexity to the computation. According to Stowe, the institution moved month-end and daily batch reporting workloads to the public cloud.
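
To see why such workloads port so cleanly, note that a portfolio valuation is an embarrassingly parallel map over policies. The toy sketch below, with invented policy fields and a deliberately trivial per-policy model standing in for the real one, fans a book out across processes with no inter-task communication; that independence is why adding cores shortens wall-clock time almost linearly.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def value_policy(policy, n_paths=50_000):
    """Toy per-policy valuation: expected terminal shortfall under GBM.
    Stands in for the full Monte Carlo model; parameters are illustrative."""
    rng = np.random.default_rng(policy["seed"])
    z = rng.standard_normal(n_paths)
    s = policy["account_value"] * np.exp(-0.02 + 0.2 * z)
    return np.maximum(policy["guarantee"] - s, 0.0).mean()

def value_portfolio(policies, workers=8):
    # Policies never interact, so the map parallelizes with no communication;
    # doubling the worker count roughly halves the wall-clock time.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(value_policy, policies))

if __name__ == "__main__":
    book = [{"account_value": 100.0, "guarantee": 100.0, "seed": i}
            for i in range(1_000)]
    values = value_portfolio(book)
    print(f"total liability: {sum(values):,.2f}")
```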

The daily batch cycle was ported to AWS as well, enabling portfolio managers to use more sophisticated models and, Stowe said, meet the demands of the daily production process. “The benefit of running this workload on the cloud is deeper understanding of the risk profile of the book without increasing the total compute expense,” Stowe said.

EnterpriseAI