Advanced Computing in the Age of AI | Monday, July 22, 2024

Lucera Opens Door On High Frequency Trading Cloud 

If you think supercomputing is challenging, you should try high frequency trading.

High frequency trading and traditional supercomputing simulation have plenty in common, but there are big differences, too. They both require extreme systems, with HFT systems focusing on latency and speed while supercomputer clusters are designed for large scale and capacity; both have employed coprocessors to help boost the overall performance of underlying systems.

Supercomputers try to model something in the physical world – how a protein folds or how gasoline burns in an engine – while HFT systems are trying to model something that only exists in the electronic world of finance. And importantly, HFT systems are looking for patterns in the buying and selling of assets occurring at the nanosecond level that are, by and large, being generated by other HFT systems. When HFT systems change their behavior to try to make money, the behavior of the entire system starts changing and everyone has to go back to the drawing board and recreate their models.

Here's how Jacob Loveless, the CEO of Lucera, a new cloud dedicated to high frequency trading, liquidity matching, and foreign exchange, explains the underlying frustration of this business. "HFT model development is like you discover gravity one day and codify the natural law, and the next day it stops working and you have to start all over again."

The Lucera cloud is a spinout from New York-based financial services firm Cantor Fitzgerald, one of the early innovators in high frequency trading that actually left the field a few years back. Loveless is a former Department of Defense contractor who is an expert in data heuristics (and who cannot talk about the classified work he did for the US government), and he came back to Wall Street to start Cantor Lab, a research and development group that, among other things, built HFT systems for the financial firm and initially focused on Treasury bonds, not equities, back in the early 2000s.

"What we found is we needed to go down to this really low level of the data, that you couldn't aggregate it," explains Loveless. "You needed the data raw in order for any of the patterns in the data to actually be meaningful or dependable. So you could not, for example, look at bond movements from one day to the next, but you had to look at bond movements from one minute to the next minute. When you get to a small enough timescale, all of a sudden all of these patterns get to be somewhat dependable."

Meaningful patterns emerged at the millisecond level in the financial data back in 2003 and 2004, says Loveless. "Making a system react in a couple of milliseconds was not that difficult. It was difficult enough that you couldn't write garbage code, but it wasn't impossible."

But then everybody in the financial industry figured out how to do high frequency trading, and it became a systems and networking arms race. If you wanted to trade in equities, for example, you had to move your systems into co-location facilities next to the exchanges. It got so contentious that the New York Stock Exchange, which operates its data center in Mahwah, New Jersey, had to give all HFT customers wires of exactly the same length where they hooked into its systems, so every customer would see the same latency.

"By 2008 and 2009, this was crazy," says Loveless. "Systems had gotten to the point of absurdity where these opportunities only existed in the microsecond range. We were building systems that could react to patterns – take in information and look at it against a hash table and do something – in under 50 microseconds. To give you some perspective, 50 microseconds is the access time for a solid state drive."

Networking between exchanges similarly got crazy. About this same time, some traders figured out that the latency on microwave communications was lower than for signals going through fiber optic cables, and suddenly there were microwave links between New York and Chicago, and soon there were links connecting financial centers up and down the Eastern seaboard and across Western Europe. And these days, people are using millimeter band communications links, says Loveless, because microwave links are too slow.

The servers underneath high-frequency trading systems kept getting beefier, and the use of field programmable gate array (FPGA) coprocessors proliferated inside of systems, inside of switches (particularly those from Arista Networks), and inside of network adapter cards (from Solarflare). Loveless knows traders who run full-on trading systems on the Arista 7124FX switch, using its 24 in-bound ports to get data from the exchanges, with a model coded in the FPGA so they trade from the switch instead of from servers. But, oddly enough, using FPGAs has fallen out of favor because the models in high frequency trading are changing too fast for FPGA programmers to keep up.

"The reason is that you need to change the models too often," says Loveless. "The development cycle working in Verilog or VHDL is too long. Even if you get the greatest Verilog programmer ever, you are still talking about turning models around in weeks, not days."

As this HFT escalation was reaching a fever pitch, Cantor Fitzgerald took a step back three years ago, says Loveless, and projected that the balance between what the firm would be making on a daily basis and what it would be spending on infrastructure was not going to work five years out. And so it decided to build a utility computing environment that is differentiated with software, such as its homegrown variant of the Solaris Unix environment, and services, like market data streams, and then sell raw infrastructure to high frequency traders who did not want to do all of this work themselves. Or could not.

And thus Lucera was born. By spinning off Lucera, Cantor Fitzgerald is emulating online retailer Amazon and its Amazon Web Services subsidiary a bit. There are plenty of companies that want to do high frequency trading, and Lucera has expertise in building HFT systems and networks. Moreover, there are applications that Cantor Fitzgerald itself runs on the Lucera cloud, such as its equity wholesale desk and foreign exchange trading operations. Just as AWS customers help subsidize the cost of Amazon's IT operations, Lucera does the same for Cantor Fitzgerald.

"We decided to be absurd, but not absurd absurd and do so many things in hardware," says Loveless with a laugh. "We are going to stay on an Intel-standard chip. We are going to have extreme systems, but it is not going to be as fun as it was. It is going to be really, really fast, but not stupid, stupid fast. There is still going to be somebody somewhere who is faster than us at some things, but we are going to be able to do things at a price point that they can't match."

The basic business model is to buy iron in much higher bulk than any high frequency trading firm typically does and to leverage that economic might to not only get aggressive pricing on systems and storage, but also to get other things it needs. For instance, the Lucera systems employ 10 Gb/sec Ethernet adapter cards from Chelsio Communications, and Lucera got access to the source code for the drivers for these adapters because a lot of what the company does to goose performance and reduce latency is to hack drivers for peripherals.

Most of the time HFT applications are written in C, but sometimes you need even more performance and you have to get even closer to the iron.

"Most of the code is in C, and you have to do nasty bits in assembler," says Loveless. "It sucks, but that's reality. Every high frequency trader on the planet writes stuff in assembler because any sufficiently advanced compiler is still not going to get the job done. The beauty of writing code for high frequency – and Donald Knuth would be horrified to hear this – is that there is no such thing as premature optimization. All optimization is necessary optimization. If you go through code and change it so it will drop five microseconds out of the runtime of that piece of code, you do that. It is totally worth it."

The Lucera cloud has not given up on FPGA accelerators completely. The company has created what is called a ticker plant in the lingo, which is a box that consolidates the market data feeds from a dozen exchanges and publishes them in various formats for HFT applications. These ticker plants cost on the order of $250,000 a pop, and you need two of them for redundancy.


Under normal circumstances, an HFT company would have systems of its own to do this, and generally they would be equipped with FPGAs to accelerate this feed consolidation. (The applications use this data to find the best buy and sell prices for an equity across those exchanges.) The reason this part of the HFT stack can stay in FPGAs is that the exchanges do not change their data formats all that often – perhaps once or twice a year – and so you can implement your feed consolidation in hardware. Coding market models in hardware such as in FPGAs is no longer practical, as mentioned above, because these are changing constantly.

The applications that codify those models run on very fast Xeon machines in the Lucera cloud, as it turns out. These machines are customized versions of systems designed by Scalable Informatics, a server and storage array maker based in Plymouth, Michigan that caters to financial services, oil and gas, and media and entertainment companies that need screaming performance.

Lucera is using a variant of Scalable Informatics' JackRabbit server line for the compute portion of the cloud. The servers running in the Lucera cloud are based on the latest "Ivy Bridge-EP" processors from Intel, and specifically, they have two of the Xeon E5-2687W v2 chips designed for workstations, which run at 3.4 GHz. There are 44 servers in a rack, and each server has sixteen cores for a total of 704 cores per rack. The Lucera cloud has facilities in Equinix data centers in New York and London up and running today with 22 racks of machines in each facility, and another 22 racks are being put into a Chicago facility. That will be a total of 46,464 cores. Lucera chose the workstation versions of the Ivy Bridge chips because HFT workloads do not have a lot of heavily multithreaded code, and it turns off all of the power saving features in the chips to push clock speeds up to 3.6 GHz or 3.7 GHz; sometimes, it can get a machine to behave dependably and predictably at 3.8 GHz. The machines also have high-end DDR3 memory that can be overclocked to 2.1 GHz instead of 1.67 GHz or 1.87 GHz. And it tunes up all of the caches in the machine, too.

Each Lucera server has a dozen flash-based solid state drives configured in a RAID 1+0 setup, which is two mirrors of five drives plus two hot spares. There are two disk controllers for redundancy. The machines also have four 10 Gb/sec Ethernet ports, with two of them coming from Chelsio for very low latency work and two being on the motherboard and not quite as zippy.

And the funny bit is that this hardware will all be tossed out in a year and a half or less.

"In high frequency trading, you have got to be on the edge," Loveless explains. "You amortize your hardware costs over 18 months, and if you actually get 18 months out of something, well, that's just awesome."

The Lucera cloud is based on a custom variant of the open source SmartOS operating system, which was created by Joyent for its public cloud. Joyent took the open source variant of the Solaris operating system controlled by Oracle and added the KVM hypervisor to it. This KVM layer allows for Windows or Linux to be run on the cloud if necessary, but Loveless says most customers run in bare-metal mode atop SmartOS for performance reasons.

The Lucera SmartOS has its own orchestration engine to manage workloads on its cloud, and it uses Solaris zones to isolate workloads from each other on the cloud. Because HFT applications are so latency sensitive, slices of the machines are pinned to specific processors and the memory that hangs off those processors is pegged to those processors. Network interrupts for a network card are tied specifically to a socket as well. The underlying file system for the cloud is ZFS, also created by Sun and also open sourced before Oracle acquired Sun more than three years ago.


Networking is probably the most challenging part of running a high frequency trading cloud, and Lucera has worked with Scalable Informatics to create a custom router to link market data feeds to its clouds and has a homegrown software-defined networking stack to make use of it, too.

"There isn't just one source of information and that makes it hard," says Loveless. "Take foreign exchange, for example, where you have to be connected to hundreds of networks in order to make a decision. And so if you are talking about building a utility computing environment, you are going to have to be able to support at the edge of the utility environment hundreds and hundreds of private networks. This is not like Amazon Web Services where you have the Internet and maybe one or two private networks. Here, you have literally hundreds of pieces of fiber terminating at the edge and you need to manage that. So we wrote our own software-defined network that does that, and it runs on custom routers that are based on X86 processors."

Not many enterprise customers build hot-rod servers and custom routers and tweak up their own variants of an open source operating system, of course. But in businesses where speed and low latency matter, this practice could become more common in the future.