HPC for Advanced Analytics at the USPS
Today, the United States Postal Service is on its third generation of supercomputers, with each generation more capable than its predecessor. IDC believes the USPS embrace of HPC exemplifies an important, accelerating IT trend: Leading organizations in the private and public sectors are increasingly turning to high-performance computing to tackle challenging big data analytics workloads that traditional enterprise IT technology alone cannot handle effectively.
The USPS case history is instructive. Each HPC generation employed by the USPS has been based on an SGI UV supercomputer provided by prime contractor FedCentric Technologies, based in Chevy Chase, Md. The SGI UV system stands out from the cluster crowd primarily because of its large, NUMA shared memory space that is designed to scale as high as 64TB. In this instance, architecture matters.
The Postal Service reports that the ability to process very large data problems entirely in memory can boost performance by three to six orders of magnitude. Conversely, having to process data outside of a shared memory can severely cut performance: by about 50 percent when moving data from one blade to a neighboring blade; by two-thirds when moving from the top blade in a rack to the middle blade; by a factor of 25 when moving data from the top blade to the bottom blade in the same rack; and by a far greater magnitude when transferring data between racks.
(This recognition gave rise to the USPS-coined term "high-density supercomputing," which refers to the notion of keeping memory and processing in tight physical proximity, or "affinity" in USPS terminology, and using an in-memory database to attack big data problems.)
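To put the quoted penalties on a common scale, the sketch below converts each one into a relative throughput and the corresponding slowdown. It assumes that "cut by 50 percent" and "cut by two-thirds" mean the remaining throughput is one-half and one-third of the in-memory baseline, and that "a factor of 25" means one twenty-fifth. The dictionary keys and the function name are illustrative, not USPS terminology.

```python
# Relative throughput implied by each penalty, with in-memory processing as 1.0.
# These readings of the quoted figures are assumptions (see the text above).
relative_throughput = {
    "shared memory": 1.0,
    "neighboring blade": 1 - 0.50,      # cut by about 50 percent
    "top to middle blade": 1 - 2 / 3,   # cut by two-thirds
    "top to bottom blade": 1 / 25,      # cut by a factor of 25
}


def slowdown(throughput: float) -> float:
    """How many times longer the same work takes at this relative throughput."""
    return 1 / throughput


for path, throughput in relative_throughput.items():
    print(f"{path}: {throughput:.2f}x throughput, {slowdown(throughput):.0f}x as long")
```

The between-rack case is left out because the text gives no figure for it.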
IDC uses the term high-performance data analysis (HPDA) to refer to workloads that are daunting enough to require HPC technology. The primary factors driving the HPDA trend are the complexity and time criticality of the most challenging big-data workloads. HPC can enable organizations to aim more complex questions at their data infrastructures and obtain answers faster, even with more variables included. IDC forecasts the global market for HPDA servers and external storage will grow robustly from $1.4 billion in 2013 to $4.3 billion in 2018.
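As a quick check on what "robustly" means here, the forecast endpoints imply a compound annual growth rate of roughly 25 percent. The sketch below derives that figure from the two numbers quoted above; the growth rate is a calculation from those endpoints, not a number stated by IDC in this text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Forecast endpoints from the text, in billions of dollars.
rate = cagr(start_value=1.4, end_value=4.3, years=2018 - 2013)
print(f"Implied annual growth: {rate:.1%}")
```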
Notably, the FedCentric/SGI solution employed by the USPS is unusually adept at ingesting very large, complex HPDA problems and processing them entirely in memory. Enterprise IT technology alone was unable to keep pace with the Postal Service's rapidly growing daily volumes of batch-submitted data. HPC technology has enabled the USPS not only to achieve near-real-time response rates on this expanding data but also to begin exploiting mission-critical competitive opportunities.
It’s worth looking at USPS actions in this context.
The United States Post Office was founded in 1775, during the American Revolution, by the Second Continental Congress and was operated by the federal government until 1971, when the Postal Reorganization Act turned it into the independent United States Postal Service. Today, the USPS exploits advanced technology, including high-performance computing, to stay competitive with other delivery services and to ensure timely, accurate delivery of 160 billion pieces of mail each year.
The Postal Accountability and Enhancement Act of 2006 significantly changed how the Postal Service operates and conducts business. The act provided new flexibility, particularly in competitive pricing for shipping services. The 2006 legislation allowed the Postal Service to respond to dynamic market conditions and changing customer needs with competitive practices that are still closely regulated but can be more agile and assertive than before.
One can make the case that the 2006 act arrived none too soon. By then, the USPS' shipment volumes and revenue were in steep decline. First-Class Mail volume plummeted 29 percent between 1998 and 2008, thanks mainly to the escalating use of email and other Internet-based communications. During the same period, competition for package delivery from FedEx, UPS, and others intensified. The new legislation allowed the USPS to begin addressing its many challenges with greater flexibility. Improving the way the USPS handles big data would need to play a key role in this transformation.
The Postal Service had been computerized for many years, using batch processing on business mainframes and Unix servers to handle data-intensive tasks. But around the time when the Postal Accountability and Enhancement Act of 2006 was enacted, daily data volumes were nearing the petabyte range (despite the decline in the volume of mailed items). By the end of 2006, it took 36 hours to process every 24 hours' worth of batch-submitted data. Clearly, the Postal Service's enterprise IT technology couldn't keep up alone and needed help.
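The 36-for-24 figure implies a backlog that grows without limit. The sketch below works through the arithmetic under two simplifying assumptions that the text does not state: processing runs around the clock at a constant rate, and the system starts with no backlog. The function name is hypothetical.

```python
from fractions import Fraction

HOURS_OF_DATA_PER_DAY = 24     # one day's batch-submitted data
HOURS_TO_PROCESS_ONE_DAY = 36  # time needed to process that day's data


def backlog_after(days: int) -> Fraction:
    """Days' worth of data still unprocessed after `days` calendar days."""
    # Fraction of one day's data that can be processed in one calendar day.
    processed_per_day = Fraction(HOURS_OF_DATA_PER_DAY, HOURS_TO_PROCESS_ONE_DAY)
    return days * (1 - processed_per_day)


for days in (3, 30):
    print(f"After {days} days: {backlog_after(days)} days of data unprocessed")
```

Under these assumptions, one-third of each day's data goes unprocessed, so a full day of backlog accumulates every three days.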
The USPS first tried, unsuccessfully, to exploit barcode sorters to help solve its big data challenges. This led to the idea of buying a supercomputer, with the goal of moving the Postal Service from batch processing to real-time processing of its expanding data volumes. Initial HPDA applications targeted for supercomputing were sorting ("sortation") and revenue protection. Features of USPS operations then included:
- At the time the RFP for its first supercomputer went out, the USPS was using scanning devices at thousands of post offices and other delivery locations in the United States and its territories to scan 4 billion pieces of mail and packages per day. This data was sent in batches to a central facility, where the Postal Service's mainframes and business servers compared the data with hundreds of billions of records to catch instances of insufficient postage, unpaid postage (e.g., Xeroxing a stamp), larger fraud schemes, and other revenue-reducing anomalies. The USPS computers were unable to perform this task in real time and had fallen behind the growing data volumes.
- In 2006–2007, only 1.5 percent of postal fraud was large scale, but these cases averaged $50,000 each to move through the federal court system. At that time, the USPS lacked the manpower to handle the small cases that together accounted for 98.5 percent of revenue loss to fraud.
- Dynamic sorting presented an equally daunting HPDA challenge. Picture thousands of packages arriving each day at a post office or other delivery unit. The sorting challenges go to the most experienced, highest-paid human sorters. They handle the task with error rates of 5 percent or less, but part of the problem is keeping up with route changes, new housing developments, and other alterations. In 2007 and the following years, in the midst of an economic recession and mounting competition, adding more high-priced human sorters simply was not an option. Pressure was in the opposite direction — to eliminate some of these existing highly paid jobs through attrition.
As is often the case with organizations that adopt HPC technology for advanced analytics, the Postal Service's supercomputer is used to augment, rather than replace, the capabilities of traditional business mainframes. Key technologies employed in the USPS HPC solution today include:
- An SGI UV 2000 supercomputer system with 4,096 Intel Xeon processor cores and 32TB of shared memory
- The Oracle TimesTen In-Memory database
- The FedCentric Memory-Centric database (MCDB) accelerator
- The FedCentric Memory-Centric database with GPU accelerator
The Postal Service's supercomputer, located in the Minneapolis suburb of Eagan, Minn., is connected with the USPS' passive adaptive scanning system (PASS) units in 15,000 post offices and other delivery facilities in the United States and its territories around the world. PASS units scan packages and transfer data to the supercomputer via the Internet. The supercomputer analyzes the data, comparing it with existing information in billions of records. In addition to performing revenue protection functions, the supercomputer sends sorting and routing instructions back to the delivery facilities.
Even with the relative slowness of the Internet, round-trip duration for the data averages only 50–100ms within the continental U.S. and just 225ms for a location as distant from Eagan as Guam. The Postal Service says that with this level of performance, it can provide near-real-time responses for the PASS system's 15,000 scanners at peak levels as high as 10 million packages per hour. The new SGI UV 2000 supercomputer should be able to boost that capability even further.
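The peak figure is easier to grasp per second. The sketch below converts the quoted numbers and, as an added assumption, applies Little's law (requests in flight = arrival rate × round-trip time) to estimate how many scans are outstanding at once. The even spread of load across scanners is also an assumption made for illustration; real traffic is unlikely to be uniform.

```python
PEAK_PACKAGES_PER_HOUR = 10_000_000  # peak level quoted in the text
SCANNERS = 15_000                    # PASS scanners quoted in the text

# Aggregate arrival rate at the supercomputer.
per_second = PEAK_PACKAGES_PER_HOUR / 3600

# Average load per scanner, assuming scans are spread evenly (an assumption).
per_scanner_per_hour = PEAK_PACKAGES_PER_HOUR / SCANNERS


def in_flight(round_trip_ms: float) -> float:
    """Little's law estimate of scans awaiting a response at any instant."""
    return per_second * round_trip_ms / 1000


print(f"{per_second:,.0f} scans per second at peak")
print(f"{per_scanner_per_hour:,.0f} scans per scanner per hour")
for latency_ms in (50, 100, 225):
    print(f"{latency_ms} ms round trip: about {in_flight(latency_ms):,.0f} scans in flight")
```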
The USPS plans to take dynamic routing much further in the coming years in order to provide real-time routing solutions based on information from handheld devices and GPS. The goal is to increase delivery efficiencies by addressing the major uncertainties that affect the first and last mile — especially changing traffic and weather conditions.
This geospatial data is a good fit for array analysis, and general-purpose graphics processing units (GPGPUs) are tailor-made for this. GPUs were originally designed to handle array analysis for visual processing, that is, pixelation. Their GPGPU cousins perform similar analysis on arrays of numbers, analogous to the vector processors that dominated supercomputer designs from the 1960s until about 1995. Standard x86 CPUs excel at instruction-level parallelism (ILP), while GPGPUs are far better at data-level parallelism (DLP).
This is one major reason why the USPS' SGI UV 2000 supercomputer contains both Intel x86-based processors and GPGPU accelerators. Another reason is the USPS' goal of high-density supercomputing. Each of the USPS' Intel CPUs contains 15 cores, while a GPGPU can include as many as 3,000 cores (albeit with much narrower threads). A supercomputer with, say, 1 million x86 cores would need far more space than one with 1 million GPGPU cores.
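A rough back-of-the-envelope check of the density argument, using only the per-device core counts quoted above. The function name is hypothetical, and the comparison counts devices only: it ignores memory, power, cooling, and the difference in per-core capability that the text notes.

```python
import math

CORES_PER_CPU = 15        # cores per Intel CPU, per the text
CORES_PER_GPGPU = 3_000   # upper figure for a GPGPU, per the text


def devices_needed(total_cores: int, cores_per_device: int) -> int:
    """Smallest number of devices that together provide `total_cores` cores."""
    return math.ceil(total_cores / cores_per_device)


target = 1_000_000
cpus = devices_needed(target, CORES_PER_CPU)
gpgpus = devices_needed(target, CORES_PER_GPGPU)
print(f"{target:,} cores: {cpus:,} CPUs or {gpgpus:,} GPGPUs")
```

On these figures, the x86-only design needs about 200 times as many devices as the GPGPU design.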
USPS accomplishments with supercomputer technology include the following:
- The USPS moved from batch processing to stream and complex event processing, delivering near-real-time results and capacity for up to 15,000 scanning devices at post offices and processing facilities across the United States and its territories.
- The USPS recorded five-nines (99.999 percent) availability for the supercomputer-based system.
- The USPS augmented the sorting work of senior clerks with specialized knowledge, enabling some of these highly paid positions to be eliminated through attrition.
- The USPS used geospatial technology and inferencing to accurately predict and report real-time events.
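For a sense of scale on the availability figure in the list above, the sketch below converts an availability percentage into permitted downtime per year. It assumes a 365-day year and that availability is measured over all hours; the text does not say how the USPS measured it.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


minutes = downtime_minutes_per_year(99.999)
print(f"Five-nines allows about {minutes:.2f} minutes of downtime per year")
```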
About the Author
Steve Conway, Research Vice President in IDC's High Performance Computing group, plays a major role in directing and implementing HPC research related to the worldwide market for technical servers and supercomputers. A 25-year veteran of the HPC and IT industries, Mr. Conway authors key IDC studies, reports and white papers, helps organize and advance the HPC User Forum, and provides thought leadership and practical guidance for users, vendors and other members of the HPC community.
Before joining IDC, Mr. Conway was vice president of investor relations and corporate communications for Cray Inc. He was also a divisional leader for SGI and headed corporate communications and analyst relations for Cray Research and CompuServe Corporation. Mr. Conway had a 12-year career in university teaching and administration at Boston University and Harvard University. A former Senior Fulbright Fellow, he holds bachelor's and master's degrees in German from Columbia University and a master's in comparative literature from Brandeis University, where he also completed doctoral coursework and exams.