Advanced Computing in the Age of AI | Thursday, April 18, 2024

High Performance Embedded Computing: It’s What’s Under the Hood that Counts 

In 2000, Orbotech Ltd. went in search of a better image processing system for their Automated Optical Inspection (AOI) machines. They quickly realized that existing solutions addressed either "classic" HPC workloads or embedded applications, but did not suit their specific embedded HPC needs. So the company set out to create their own solution, and began the project that would eventually be called GRAPE (Graph Processing Environment).

Most of what you read about industrial HPC applications is concerned with the adaptation of "classic" HPC into an industrial setting. While the running of massive simulations in record time and the use of office computers to form all-night "grid" supercomputers is certainly interesting, what I want to talk about is a very different type of application. The use of small high-performance clusters embedded in industrial robotics.

Making Hardware with Software

I work for a company called Orbotech Ltd. (ORBK). Orbotech provides Automated Optical Inspection and repair (AOI/AOR) equipment, primarily for the manufacture of printed circuit boards and flat panel displays. In order to detect and/or repair defects in electronic components, our machines must be capable of processing from one to 100 gigabytes per second of continuous video from multiple sources. Generally, they are expected to return results within half a minute of the first video frames arriving. The defects we need to detect can be as small as 1.5 microns on an object as large as 2x2 meters. This is like trying to find a grain of rice sitting on a white object in Central Park, New York, within 20 seconds!
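The Central Park analogy can be sanity-checked with a little arithmetic. The 1.5-micron defect and 2x2-meter panel come from the text above; the rice and park dimensions are rough assumptions of mine, used only for illustration:

```python
# Rough sanity check of the "grain of rice in Central Park" analogy.
# The defect size and panel size come from the article; the rice-grain
# and park dimensions are assumed typical values, not Orbotech figures.

defect_m = 1.5e-6   # smallest detectable defect (from the article)
panel_m = 2.0       # inspected panel edge length (from the article)

rice_m = 6e-3       # a grain of rice is roughly 6 mm long (assumption)
park_m = 4000.0     # Central Park's long axis is roughly 4 km (assumption)

defect_ratio = defect_m / panel_m   # feature size relative to search area
rice_ratio = rice_m / park_m

print(f"defect-to-panel ratio: {defect_ratio:.1e}")
print(f"rice-to-park ratio:    {rice_ratio:.1e}")
```

The two ratios come out within a factor of two of each other (7.5e-7 versus 1.5e-6), so the analogy is about right in linear scale.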

Before the year 2000, all our systems used custom hardware (ASICs) for image processing and computer vision. The time-to-prototype from algorithm to hardware was about one year, at a cost of over a million dollars. In addition, once an algorithm was implemented in hardware it was "carved into silicon" and could not be changed without another long and expensive development cycle. As most of you know, the electronics industry is very dynamic — components are shrinking in size and growing in complexity at a rapid rate (Moore's Law, more or less). In addition, our customers are so secretive about their future plans that we often don't get to see the exact component we'll be inspecting until the machine is on the factory floor. This adds the requirement that our machines be very flexible. Ideally we would like to be able to respond to customization demands within weeks, not months.

In 1998 Orbotech started building its first machine to use a software-based image processing system running on a cluster of COTS computers. This project was a success and showed that more flexible systems could be created in much less time and at much lower cost. However, many problems remained unsolved. Adding more computers to these systems required rewriting large parts of the software in order to take advantage of the extra computing power. In addition, the interactions between the various parallel algorithms were found to be very difficult to program and debug. Training and knowledge were also problems — our typical algorithm engineer is an applied mathematician with little knowledge of (or desire to engage in) parallel programming. Put ten of these engineers on a single project and you very quickly find yourself in "parallel" spaghetti-code hell (one level down from regular spaghetti-code hell).

In 2000, after a post-mortem on our first HPeC machine, we realized that we needed a better system for writing and executing our algorithms. We first engaged in an in-depth survey of existing solutions. We quickly realized that our type of application had "fallen between the cracks" of current solutions. There were many software systems designed for executing "classic" HPC workloads, and many for designing embedded systems. But the special requirements of embedded HPC systems were not being addressed.

At this point we made the fateful decision to create our own solution. What we needed was a system that:

  • Didn't require an algorithm engineer to be familiar with parallel processing.
  • Could deal with complex systems of 100's of interacting algorithms and multiple sources of real-time I/O.
  • Could achieve very high hardware utilization. This was necessary because the cost of processing equipment would be part of our B.O.M. and passed on to our customers.
  • Would allow algorithms to be updated quickly without recompiling or rebuilding the system.
  • Would work on clusters of COTS computers.
  • Would allow easy scaling up (or down) by simply adding or subtracting computers from the cluster.

Thus began the project that would eventually be called GRAPE (Graph Processing Environment), which we currently use to run our most computationally intensive systems. While we borrowed "ideas" freely from HPC, embedded computing and academia, in the end there were too many differences in our systems and work environment to allow us to adopt an existing solution.

Software Running Software

What we created was a system based on a "data-flow" graph. Each node on the graph is an algorithm, and data flows between algorithms over the connections between the nodes. Each algorithm engineer writes their particular algorithm as a node on the graph, and these nodes are then connected to create a full application. A simple mapping language allows the graph to be mapped to different hardware configurations without rewriting any code. Once the system is set up, parallelization across multiple processors and multiple computers takes place automatically. In addition, a unique data synchronization language was created that allows data integrity to be automatically ensured across the entire system, regardless of out-of-order parallel operations.

But most important of all, the algorithm engineer does not write parallel code. A system engineer maps the graph to hardware, using a simple declarative mapping language and their own knowledge of the target hardware, and the GRAPE system takes care of the rest.
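To make the idea concrete, here is a toy sketch of a data-flow graph in the spirit described above. Everything in it is hypothetical — the `Node` and `Graph` classes, the node names, and the mapping dictionary are inventions of mine for illustration; GRAPE's real API is internal to Orbotech and is not shown in this article:

```python
# Toy data-flow graph: algorithm engineers write ordinary sequential
# functions as nodes; a separate declarative mapping assigns nodes to
# hardware. All names and APIs here are hypothetical, not GRAPE's.
from collections import deque

class Node:
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, list(inputs)

class Graph:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def run(self, sources):
        """Execute each node once its inputs are ready; 'sources' feeds entry nodes."""
        results = dict(sources)
        pending = deque(self.nodes.values())
        while pending:
            node = pending.popleft()
            if node.name in results:
                continue
            if all(i in results for i in node.inputs):
                results[node.name] = node.func(*(results[i] for i in node.inputs))
            else:
                pending.append(node)  # dependencies not ready yet; retry later
        return results

# The algorithm engineer writes plain, non-parallel functions...
graph = Graph([
    Node("denoise", lambda frame: [p // 2 for p in frame], ["camera"]),
    Node("edges",   lambda img: [abs(a - b) for a, b in zip(img, img[1:])], ["denoise"]),
    Node("defects", lambda e: [i for i, v in enumerate(e) if v > 10], ["edges"]),
])

# ...while a system engineer writes a separate, declarative mapping of
# nodes to machines (the format here is invented for illustration):
mapping = {"denoise": "host-1", "edges": "host-2", "defects": "host-2"}

out = graph.run({"camera": [0, 60, 0, 90, 0]})
print(out["defects"])  # → [0, 1, 2, 3]
```

The key property is the separation of concerns: changing `mapping` retargets the same graph to a different cluster without touching any node's code, which is what lets a system scale up or down by adding or removing computers.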

Massively Parallel Algorithms

The GRAPE system was designed for optimal performance across large clusters of multicore computers. However, recent advances in computing are leading manufacturers like Intel, AMD and NVIDIA to create "massively" parallel systems. These systems have hundreds or thousands of tiny computational cores running on a single chip. Indeed, as the multicore revolution has proceeded, we have seen the size of Orbotech's clusters diminishing. Whereas we had originally designed GRAPE to scale up to hundreds of computers, now we need to design downward, to run on thousands of "micro-cores" on a single chip.

This is the challenge of our current multi-year project, GrapeCL. This product will allow us to continue taking advantage of the latest "exotic" processing technologies produced by Intel, NVIDIA and others. In addition, its sister project, the AlgCL language, will create a friendly environment for algorithm engineers to design massively parallel computer vision algorithms. Together, these will ensure that Orbotech's products can continue to grow and maintain their cutting edge into the future.

About the Author

David Minor is a veteran of the early Mac/Amiga games industry, authoring some of the first games sold on these platforms. He then went to work for Apple in 1987, where he was one of the developers of both HyperCard and Macromedia Director. He moved to Israel in 1992 and was one of the founding employees of K.S. Waves, creating award-winning tools for pro-audio music production. He joined Orbotech 10 years ago and is the architect of the GRAPE distributed processing system used in Orbotech's computer vision systems.