Pete’s Blog: Can FPGAs Overcome the FUD?
So, those party animals at ACTIV Financial have invited me to a little gathering tonight - at the Nasdaq MarketSite in Times Square - to formally launch ActivFeed MPU, their new hardware-accelerated feed handling software. It should be a great event, and I’m looking forward to finding out more about the black art of programming FPGAs (that’s Field Programmable Gate Arrays) - an emerging technology that could have a huge impact on low latency systems.
ACTIV is one of only a couple of market data players that I am aware of that are leveraging FPGAs. Down in St. Louis, Exegy is also in the game, with its ticker plant, but the approach the two companies are taking is pretty different. Whereas Exegy has built an appliance - a hardware and software black box that uses FPGAs to maximise performance - ACTIV has re-written its existing feed handling software to exploit the HyperTransport technology used by AMD as part of its Torrenza coprocessor initiative.
With ACTIV’s solution, some software functions are executed on the main CPU, and some are run on the FPGA coprocessor, which comes from San Jose-based Altera. Since HyperTransport is being backed by a consortium that includes Apple, Sun Microsystems and Cisco Systems, ACTIV hopes its approach will find more acceptability than a proprietary approach.
We’ll see. Geek that I am, I remain a tad cautious in my expectations for real-life FPGA deployment on Wall Street. My take is it’s gonna take a while for the FUD level - we are talking about relying on a new technology that’s understood by few - to be overcome. Until it is, I suspect the more traditional multi-core, multi-threading technology direction will remain the focus of low latency development.
Perhaps one area where we will see early take-up of FPGAs is in latency monitoring offerings like TipOff from TS-Associates, since investing in monitoring software is perceived as way less mission critical than rolling out new ticker plants. It will also be interesting to see what happens as Intel rolls out its own QuickAssist coprocessor technology.
Until next time … here’s some good music.
May 17th, 2007 at 11:33 am
Interesting to note also that during conversations for a story about a Wombat/AMD/Cisco test at STAC in our April issue of Market Data Insight, AMD’s HyperTransport was seen as a key element. Interesting too to note that we got the impression that Cisco was very much driving this particular benchmark test in order to lend credence to its InfiniBand Server Fabric Switches (SFSs) within the financial services vertical. The music isn’t quite Underworld, is it?
June 1st, 2007 at 10:33 pm
I’m guessing that the best role for FPGAs is highly compute-intensive work. Maybe something like
modeling bond yields? I work more on the infrastructure side of the world rather
than on the financial side so I can’t identify the ideal applications. But I’m thinking that they’ll
all have a high compute-to-network-I/O ratio.
I’m guessing that applications like feed handlers that have low compute-to-network-I/O ratios
are unlikely to benefit from FPGAs. I’ll assume that FPGAs might be used for trading, since
it’s a little easier to see the tradeoffs in that case.
Consider this thought experiment: Two twin sisters are given identical pools of capital to spend
building systems that do statistical arbitrage, automated market making, high-frequency trading,
etc. Whatever they do, it’s a zero-sum game: when one sister wins, the other loses. They set out
to implement the same trading strategy on two different paths.
One sister decides to express the most compute-intensive parts of the trading strategy in FPGAs
while her twin just writes code in C, Java, or .NET using the tools common on Linux or Windows.
The FPGA sister has to learn VHDL and takes longer to implement and debug the trading
strategy. She still has to know C/Java/.NET to do the network I/O necessary to make her
trading strategy useful. Her strategy will run faster since it’s implemented in hardware, but
it will take her longer to bring the strategy to the market since it takes longer to implement.
The sister using C/Java/.NET for everything implements her strategy more quickly since she has
better tools. However, her strategy runs more slowly, having to be interpreted by a
general-purpose CPU and OS rather than being executed by an FPGA.
It seems likely that the C/Java/.NET sister will start trading first and will likely be the first
to pay back the capital invested in implementing the strategy. However, if sufficient capital is
available, the FPGA sister may eventually bring a faster strategy to market and make up for
lost time by winning trades against her sister because her strategy is faster.
The unstated premise is that the trading strategy doesn’t appreciably change while it’s
being implemented. If it did, both sisters would have to start over and the sister using
C/Java/.NET would eventually win, just because she could bring changing strategies to market more
quickly.
So I see the critical variables as being the rate of change in trading strategies and the benefits
of faster execution of trading strategies. FPGAs seem to make the most sense where the VHDL
code would change slowly and the costs of faster execution could be paid back more quickly.
Otherwise, it seems best to implement everything in a higher-level language running on a
general-purpose operating system so that you have the best tools to bring your strategy to
market quickly.
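The sisters' race can be put into rough numbers. The sketch below is only an illustration: it assumes each strategy earns a constant daily profit once it is live and ignores the zero-sum interaction between the two. The function names, launch days and profit figures are all hypothetical.

```python
import math


def cumulative_profit(day, launch_day, daily_profit):
    """Profit accumulated by `day` for a strategy that goes live on `launch_day`."""
    return max(0, day - launch_day) * daily_profit


def catch_up_day(sw_launch, sw_daily, hw_launch, hw_daily):
    """First day on which the FPGA sister's cumulative profit matches her
    software sister's, or None if she never catches up.

    Assumes the FPGA strategy launches no earlier than the software one.
    """
    if hw_launch < sw_launch:
        raise ValueError("this sketch assumes the FPGA strategy launches later")
    if hw_daily <= sw_daily:
        return None  # a later start with no extra edge never catches up
    # Solve (d - hw_launch) * hw_daily == (d - sw_launch) * sw_daily for d.
    d = (hw_launch * hw_daily - sw_launch * sw_daily) / (hw_daily - sw_daily)
    return math.ceil(d)


# Hypothetical figures: software live on day 90 earning 1000 a day,
# FPGA live on day 180 earning 1500 a day.
print(catch_up_day(90, 1000, 180, 1500))  # 360
```

If the strategy has to be replaced before the catch-up day, the software sister stays ahead, which is the rate-of-change point made above.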
All of the above discussion has been about trading strategies because it simplifies the analysis.
Now let’s look at other tasks.
I’m not aware of any way to implement a full TCP/IP stack in hardware. If a TCP/IP stack is
going to be required for any part of the task at hand, then you’ll probably be running a
general-purpose OS and have access to the best higher-level languages and debugging tools.
A compute-intensive task might benefit from an FPGA because the overhead of getting anything
off the stack and onto the FPGA could be amortized over a lot of computation.
Otherwise, I’d guess you’re better off to take the time-to-market advantage of doing the
task in software.
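The amortization argument can be written down as a one-line model. This is a rough sketch, assuming a fixed round-trip overhead to move data to the FPGA and back; the function name and the microsecond figures are hypothetical and not measurements of any real system.

```python
def offload_speedup(cpu_compute_us, fpga_compute_us, transfer_overhead_us):
    """Speedup from offloading one task, counting the round trip to the FPGA.

    A result below 1.0 means the offloaded version is slower overall.
    """
    return cpu_compute_us / (transfer_overhead_us + fpga_compute_us)


# Low compute-to-I/O ratio: the transfer overhead dominates.
light = offload_speedup(cpu_compute_us=2, fpga_compute_us=0.2, transfer_overhead_us=5)

# High compute-to-I/O ratio: the same overhead is amortized over much more work.
heavy = offload_speedup(cpu_compute_us=1000, fpga_compute_us=50, transfer_overhead_us=5)

print(f"light task speedup: {light:.2f}x, heavy task speedup: {heavy:.2f}x")
```

In this model, anything that shrinks the transfer overhead moves the break-even point towards lighter tasks.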
Does this analysis oversimplify the problem?
June 6th, 2007 at 6:58 pm
Bob - an interesting analysis, but one that misses some of the capabilities that hardware solutions offer over conventional processor/software stacks. One such capability is content addressable memory. FPGA and VLWL hardware is normally paired with content addressable memory, which provides a huge speedup for indexing, searching and matching operations. Conventional programming languages completely lack syntax for expressing this data access paradigm.
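For readers who haven't met content addressable memory, here is a small software emulation of the idea. It is an illustration only: the class name, the 16-bit key layout and the use of a ternary (masked) match are my assumptions, and the loop stands in for comparisons that the hardware performs against every entry at once.

```python
class TernaryCAM:
    """Software emulation of a content addressable memory with "don't care" bits."""

    def __init__(self):
        self._entries = []  # (value, mask, payload), in priority order

    def add(self, value, mask, payload):
        # Bits set in `mask` must match the key; cleared bits are ignored.
        self._entries.append((value & mask, mask, payload))

    def search(self, key):
        # Hardware compares the key with all entries in parallel;
        # this loop performs the same comparisons one after another.
        for value, mask, payload in self._entries:
            if (key & mask) == value:
                return payload
        return None


# Hypothetical key layout: high byte = exchange code, low byte = instrument.
cam = TernaryCAM()
cam.add(0x0203, 0xFFFF, "instrument 3 on exchange 2")      # exact match
cam.add(0x0100, 0xFF00, "any instrument on exchange 1")    # wildcard low byte

print(cam.search(0x0142))  # any instrument on exchange 1
print(cam.search(0x0203))  # instrument 3 on exchange 2
print(cam.search(0x0301))  # None
```

The lookup is driven by the content of the key rather than by an address or index, which is the access pattern that has no direct syntax in conventional languages.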
When you’re playing in the hardware space, there are some really fast, clever and devious things you can do which software running on a conventional processor can’t touch. In the ActivFeed MPU case, it’s all about managing and querying the initial value cache. And given that Activ’s querying capabilities represent a considerable advance on the traditional Triarch/TIB/RMDS style initial value cache, they’re able to leverage FPGAs very effectively, both for update passthrough with minimal latency and for complex initial value queries with blinding response times.
Pete mentions my firm’s use of FPGA based technology in our TipOff middleware analysis and passive latency monitoring product. It is a key enabler, and our product simply would not perform adequately without it. When you’re doing deep multi-dimensional analysis of messaging traffic generated by a trading system, you have to do so with performance “greater” than that of the system you’re monitoring.
But the proprietary hardware buzz doesn’t end with FPGAs (and their sisters, ASICs). They just allow you to crank through your existing 32 or 64 bit words faster, expressing algorithms in concise hardware rather than verbose software. We’re now looking beyond that to the next speedup dimension - what I call VLWL - Very Long Word Length - processors. And I’m not talking 128 or 256 bits (as in GPUs), but word lengths upwards of 16 Kbits, where entire market data messages can be processed in a fraction of the number of operations a conventional 32/64 bit processor would require. Couple that with 2GB of content addressable RAM, turn the way you express algorithms completely on its head, and you’re doing some real interesting stuff.
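The word-length point comes down to simple arithmetic. The sketch below counts how many machine words are needed to cover a message once; the 200-byte message size is a hypothetical figure, and a word count is only a crude proxy for real operation counts.

```python
def words_needed(message_bytes, word_bits):
    """Number of word-sized chunks needed to cover a message (ceiling division)."""
    message_bits = message_bytes * 8
    return (message_bits + word_bits - 1) // word_bits


# A hypothetical 200-byte market data message is 1600 bits long.
for bits in (32, 64, 16 * 1024):
    print(f"{bits}-bit words: {words_needed(200, bits)}")
```

With these assumptions the message spans 50 words at 32 bits and 25 at 64 bits, but fits in a single 16 Kbit word.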