IRS Ups Its Fight Against Tax Fraud Using Nvidia GPUs, Cloudera Data Platform, Spark 3.0
As the U.S. Internal Revenue Service processes hundreds of millions of tax returns for individuals and businesses each year, there is much more to the task than just calculating taxes, refunds and balances that are due.
Also critical to the process is an elaborate system to fight tax fraud by uncovering and analyzing irregular patterns in those returns, so that the government collects its money and individuals and businesses pay what they owe.
For the IRS, that involves huge data sets that have become unwieldy and difficult to analyze using existing tools as the data sets continue to grow.
Recently, things got easier for the IRS through a partnership with vendors Nvidia and Cloudera that brings together new processes that dramatically reduce the burden on the agency when it scans those gargantuan data sets in the search for fraud.
Deborah Tylor, an IRS data scientist, was tasked with combing a data set larger than 3TB, but she found that even after running all night on a large bank of CPUs and servers, the job would not complete, according to a recent Nvidia blog post. She tried to rerun the job, but it failed again.
Serendipitously, Nasheb Ismaily, senior solutions engineer and streaming analytics subject matter expert lead with Cloudera, approached Rahul Tikekar, manager of a technical team that supports data analysts at the IRS. Ismaily asked if the agency might be interested in using the company’s Cloudera Data Platform (CDP), which had recently been integrated with Apache Spark 3.0, an open-source unified analytics engine for large-scale data processing that can be accelerated by GPUs.
Tikekar said he was interested in trying it, particularly since the agency already had Nvidia graphics cards on standalone servers. Until then, though, using Spark to run them on a distributed cluster had eluded the agency, he added.
Tylor’s failing job presented a great use case for trying the possible fix.
Initial testing quickly found that it sped up many parts of Tylor’s work, with 5x performance improvements and no code changes, but that adjustments were still needed, the blog post continued.
A team of Nvidia data scientists was called in to examine the code, and they determined that some of the more complicated tasks were still running on CPUs, so they wrote new code to resolve the issues, the post stated. They inserted that fresh code into Spark’s software interface for RAPIDS, an open software library for running data analytics on GPUs.
More testing found that this step solved the remaining issues, and the IRS data set was then running on GPUs in a distributed Spark cluster.
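For readers curious what "without code changes" can look like in practice, here is a minimal sketch of how GPU acceleration is typically switched on for a Spark 3.x job through configuration alone. It is not the IRS's actual setup, which the blog post does not detail. The configuration keys are those documented for the RAPIDS Accelerator for Apache Spark and for Spark's GPU resource scheduling; the values, the helper name `spark_submit_command` and the script name `fraud_job.py` are assumptions made for illustration.

```python
import shlex

# Illustrative settings only; the article does not list the IRS's real configuration.
RAPIDS_CONF = {
    # Load the RAPIDS Accelerator plugin so supported SQL/DataFrame work runs on GPUs.
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    # Log the operations that stay on the CPU, the kind of gap the Nvidia team had to close.
    "spark.rapids.sql.explain": "NOT_ON_GPU",
    # Spark 3.0 GPU-aware scheduling: one GPU per executor, shared by four tasks (assumed values).
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
}


def spark_submit_command(app_path, conf=RAPIDS_CONF):
    """Build a spark-submit command line that applies the given settings (hypothetical helper)."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    # Quote each argument so the string can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # "fraud_job.py" is a placeholder name, not a real IRS script.
    print(spark_submit_command("fraud_job.py"))
```

A real deployment would also need the RAPIDS Accelerator jar on the cluster's classpath and a GPU discovery script configured, both of which are omitted here. The point of the sketch is that the existing job is left as written, while the `NOT_ON_GPU` explain setting reports the steps that fall back to CPUs.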
“The speedup was remarkable — Deb’s running the whole program on a four-node cluster right now,” Tikekar told Nvidia.
“Before Spark 3.0, this was not possible, but now we’re upping the ante with GPUs and we can dream of solving problems that were once impossible,” Tikekar added.
Scott McClellan, the senior director of Nvidia’s data science group, told EnterpriseAI that while the use of GPUs has been well-established for deep learning for a long time, it has not achieved the same mindshare for data processing, Extract, Transform and Load (ETL) and to some extent for machine learning.
“Nvidia's point of view is that [it] has the same breakthrough potential for all of those, in terms of unlocking what is possible,” said McClellan. “Maybe it is not where the speedup is the same, but the fundamentals are the same in the sense that it unlocks use cases that were either impossible or impractical across ETL, data processing, machine learning and inference, not just deep learning. Now it makes sense.”
For Nvidia, helping to solve the technical problem for the IRS meant working with a customer that was already using GPUs and finding answers with another partner, Cloudera, to make it all work.
The IRS “had a hard problem that needed the [GPU] acceleration and was eager to try it,” said McClellan. The IRS was using Apache Spark but had not yet moved to Spark 3.0.
“It just all fell into place,” said McClellan. “It was more of a ‘here is a breakthrough we can make’ by speeding up this process [so that the IRS] can more effectively look for fraud in tax returns than it was able to do” previously.
“It was just impractical to use the technology without some acceleration in the way they needed to use it because you look for a pattern and then you have to make some changes and look for another pattern,” he said. “And every time it is taking days to turn around the output of the job, and then make some changes, and then do it again. It was just not practical.”
By bringing together the technologies from Nvidia, Cloudera and Spark 3.0, the project allows the IRS to reduce the time needed for its data set analyses – from days down to hours – said McClellan. “And there is an economic benefit to that as well. You can solve your problem at that scale with less hardware. But more important than that economic benefit is that you can solve the problem, period.”