Silkenweb Example: Hackernews Clone

C++ patterns for low-latency applications including high-frequency trading

389 points by chris_overseas 1 year ago | 231 comments

nickelpro 1 year ago
Fairly trivial base introduction to the subject.
In my experience teaching undergrads they mostly get this stuff already. Their CompArch class has taught them the basics of branch prediction, cache coherence, and instruction caches; the trivial elements of performance.
I'm somewhat surprised the piece doesn't deal at all with a classic performance killer, false sharing, although it seems mostly concerned with single-threaded latency. The total lack of "free" optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout was also surprising.
Neither this piece, nor my undergraduates, deal with the more nitty-gritty elements of performance. These mostly get into the usage specifics of particular IO APIs, synchronization primitives, IPC mechanisms, and some of the more esoteric compiler builtins.
Besides all that, what the nascent low-latency programmer almost always lacks, and the hardest thing to instill in them, is a certain paranoia. A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers. A creeping feeling that causes them to compulsively run the benchmarks through callgrind looking for calls into the object cache that miss and go to an allocator in the middle of the hot loop.
I think a formative moment for me was when I was writing a low-latency server and I realized that constructing a vector I/O operation ended up being overall slower than just copying the small objects I was dealing with into a contiguous buffer and performing a single write. There's no such thing as a free copy, and that includes fat pointers.
- chipdart 1 year ago
  > Fairly trivial base introduction to the subject.
  Might be, but low-latency C++, in spite of being a field on its own, is a desert of information.
  The best resources available at the moment on low-latency C++ are a handful of lectures from C++ conferences, which leave much to be desired.
  Putting aside the temptation to grandstand, this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject. Vague claims that you can piece together similar info from other courses do not count as a contribution, and help no one.
  - Aurornis 11 months ago
    > this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject.
    I don't know how you arrive at this conclusion. The document really is an introduction to the same basic performance techniques that have been covered over and over. Loop unrolling, inlining, and the other techniques have appeared in countless textbooks and blog posts already.
    I was disappointed to read the paper because they spent so much time covering really basic micro techniques but then didn't cover any of the more complicated issues mentioned in the parent comment.
    I don't understand why you'd think this is an "outstanding contribution to the field" when it's basically a recap of simple techniques that have been covered countless times in textbooks and other works already. This paper may seem profound if someone has never, ever read anything about performance optimization before, but it's likely mundane to anyone who has worked on performance before or even wondered what inlining or -funroll-loops does while reading some other code.
    - chipdart 11 months ago
      > I don't know how you arrive at this conclusion.
      I am well aware of this fact because I've researched the topic and I can state it without any degree of uncertainty. The only resources out there are scattered loose notes and conference presentations, such as Timur Doumler's work for C++ On Sea, and even so he's clear that his work is mainly focused on real-time audio processing, which has different requirements than, say, HFT.
      > The document really is an introduction to the same basic performance techniques that have been covered over and over. Loop unrolling, inlining, and the other techniques have appeared in countless textbooks and blog posts already.
      Go ahead and cite the absolute best example you can come up with from this incredible list of yours. The very top of your list will suffice to evaluate your whole list. I'll wait.
    - fsloth 11 months ago
      > covered countless times in textbooks
      What’s your favourite textbook on the subject?
- radarsat1 1 year ago
  > A creeping feeling that causes them to compulsively run the benchmarks through callgrind
  I'm happy I don't deal with such things these days, but I feel where the real paranoia always lies is the Heisenberg feeling of not even being able to trust these things, the sneaking suspicion that the program is doing something different when I'm not measuring it.
- mbo 1 year ago
  Out of interest, do you have any literature that you'd recommend instead?
  - nickelpro 1 year ago
    On the software side I don't think HFT is as special a space as this paper makes it out to be.[1] Each year at cppcon there's another half-dozen talks going in depth on different elements of performance that cover more ground collectively than any single paper will.
    Similarly, there's an immense amount of formal literature and textbooks out of the game development space that can be very useful to newcomers looking for structural approaches to high performance compute and IO loops. Games care a lot about local and network latency, the problem spaces aren't that far apart (and writing games is a very fun way to learn).
    I don't have specific recommendations for holistic introductions to the field. I learn new techniques primarily through building things, watching conference talks, reading source code of other low latency projects, and discussion with coworkers.
    [1]: HFT is quite special on the hardware side, which is discussed in the paper. The NICs, network stacks, and extensive integration of FPGAs do heavily differentiate the industry and I don't want to insinuate otherwise.
    You will not find a lot of SystemVerilog programmers at a typical video game studio.
    - Maxatar 1 year ago
      As someone who does quant trading professionally and game development as a hobby, they both are performance sensitive, but they emphasize different kinds of performance. Trading is about minimizing latency while video games are about maximizing bandwidth.
      Video games try to cram as much work as possible within about 16 milliseconds, whereas for most trading algorithms 16 milliseconds is too slow to do anything. You want to process and produce a response to input within the span of microseconds, which is 3 orders of magnitude faster than a single frame in a video game.
      The problem spaces really are quite distinct in this respect and a lot of techniques from one space don't really carry over to the other.
    - mgaunard 1 year ago
      Latency isn't even as important in HFT as people claim. What's most important is deterministically staying within a reasonable envelope (even at the 99.9th percentile) to satisfy real-time requirements and not fall behind.
      When it's really important, it's implemented in FPGA or with an ASIC.
    - hawk_ 1 year ago
      What other low latency projects in public domain are worth learning from?
  - _aavaa_ 1 year ago
    I'd recommend: https://www.computerenhance.com
    The author has a strong game development (engine and tooling) background and I have found it incredibly useful.
    It also satisfies the requirement for "A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers."
    - hi_dang_ 1 year ago
      Very anecdotal but of the people I know in game studios who are tasked with engine work, and people who make a killing doing FPGA work for HFT firms, both camps shook their head at Casey’s HMH thing. Uniformly I do not know of a single professional developer of this sort of caliber who looked at HMH and thought it looks great. Quite the opposite. I think they found his approach and justifications unsound as it would instil awful practices of premature unfounded optimization and a disdain for normal library code in favour of hand-rolling your own half-baked implementations based on outdated trivia. I agree with them on the basis that HMH exposed an unprepared and inexperienced audience to something that has to be regarded with utmost care. For this, I refer to Jonathan Blow’s presentation of “a list is probably good enough” as an antidote. I think JB’s recommendations are more in line with actual practices, whereas Casey just raised red flags uniformly from here-and-now engine devs shipping multi platform games.
- contingencies 1 year ago
  How I might approach it. Interested in feedback from people closer to the space.
  First, split the load into simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc. Input may be a FIX stream or similar, output is a series of asset-specific binary event streams along low-latency buses, split into asset-specific segments of a scalable cluster of low-end MCUs. Second, get rid of the general purpose operating system assumption on your asset-specific MCU-based execution platform to enable faster turnaround using low-level code you can actually find people to write on hardware you can actually purchase. Third, profit? In such a setup you'd need to monitor the overall state with a general purpose OS based governor which could pause or change strategies by reprogramming the individual elements as required.
  Just how low are the latencies involved? At a certain point you're better off paying to get the hardware closer to the core than bothering with engineering, right? I guess that heavily depends on the rules and available DCs / link infrastructure offered by the exchanges or pools in question. I would guess a number of profitable operations probably don't disclose which pools they connect to and make a business of front-running, regulations or terms of service be damned. In such cases, the relative network geographic latency between two points of execution is more powerful than the absolute latency to one.
  - nickelpro 1 year ago
    The work I do is all in the single-to-low-double-digit microsecond range to give you an idea of timing constraints. I'm peripheral to HFT as a field though.
    > First, split the load in to simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc.
    This is largely incorrect, or more generously out-of-date, and it influences everything downstream of your explanation. Think of FPGAs as far more flexible GPUs and you're in the right arena. Input parsing and filtering are the obvious applications, but this is by no means the end state.
    A wide variety of sanity checks, monitoring features, fixed calculation tasks, and output generation are pushed to the FPGAs. It is possible for the entire stack for some (or most, or all) transactions to be implemented at the FPGA layer. For such transactions the time magnitudes are mid-to-high triple digit nanoseconds. The stacks I've seen with my own two eyeballs talked to supervision algorithms over PCIe (which themselves must be not-slow, but not in the same realm as <10us work), but otherwise nothing crazy fancy. This is well covered in the older academic work on the subject [1], which is why I'm fairly certain it's long out of date by now.
    HRT has some public information on the pipeline they use for testing and verifying trading components implemented in HDL.[2] With the modern tooling, namely Verilator, development isn't significantly different than modern software development. If anything, SystemVerilog components are much easier to unit test than typical C++ code.
    Beyond that it gets way too firm-specific to really comment on anything, and I'm certainly not the one to comment. There's maybe three dozen HFT firms in the entire United States? It's not a huge field with widely acknowledged industry norms.
    [1]: https://ieeexplore.ieee.org/document/6299067
    [2]: https://www.hudsonrivertrading.com/hrtbeat/verify-custom-har...
    - mgaunard 1 year ago
      Doing anything other than the dumb trigger-release logic on FPGA is counter-productive IMHO.
      You're already heavily constrained by placement in order to achieve the lowest latency; you can't afford to have logic that's too complicated.
- PoignardAzur 11 months ago
  > optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout
  If you do PGO, aren't hinting attributes counter-productive?
  In fact, the common wisdom I mostly see compiler people express is that most of the time they're counter-productive even without PGO, and modern compilers trust their own analysis passes more than they trust these hints and will usually ignore them.
  FWIW, the only times I've seen these hints in the wild were in places where the compiler could easily insert them, eg the null check after a malloc call.
  - nickelpro 11 months ago
    I said "or even"; if you're regularly using PGO they're irrelevant, but not everyone regularly uses PGO in a way that covers all their workloads.
    The hinting attributes are exceptional for lone conditionals (not if/else trees) where the compiler has no obvious context for whether the branch will usually be taken or skipped. Compilers are frequently conservative with such things and keep the code in the hot path.
    The [[likely]] attribute then doesn't matter so much, but [[unlikely]] is absolutely respected and gets the code out of the hot path, especially when inlined into a large section. Godbolt is useful to verify this but obviously there's no substitute for benchmarking the performance impact.
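    As a rough illustration of the lone-conditional case, here is a minimal sketch assuming a C++20 compiler. `Message` and `process` are hypothetical names made up for this example, and whether the cold block actually moves out of the hot path still has to be checked in the generated assembly.
```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical message type, used only for this illustration.
struct Message {
    std::uint32_t length;
    std::uint32_t payload;
};

// Hot-path handler with a lone, rarely taken error check.
// [[unlikely]] (C++20) hints that the branch body is cold, so the compiler
// may place it away from the fall-through path. It is a hint, not a guarantee.
inline std::uint64_t process(const Message& msg, std::uint64_t acc) {
    if (msg.length == 0) [[unlikely]] {
        throw std::invalid_argument("empty message");
    }
    return acc + msg.payload;
}
```
    The attribute does not change behavior, only (possibly) code layout, so a benchmark is still the only way to know whether it helped.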
- matheusmoreira 11 months ago
  > allocations, copies, and other performance killers
  Please elaborate on those other performance killers.
twic 1 year ago
My emphasis:
> The output of this test is a test statistic (t-statistic) and an associated p-value. The t-statistic, also known as the score, is the result of the unit-root test on the residuals. A more negative t-statistic suggests that the residuals are more likely to be stationary. The p-value provides a measure of the probability that the null hypothesis of the test (no cointegration) is true. The results of your test yielded a p-value of approximately 0.0149 and a t-statistic of -3.7684.
I think they used an LLM to write this bit.
It's also a really weird example. They look at correlation of once-a-day close prices over five years, and then write code to calculate the spread with 65 microsecond latency. That doesn't actually make any sense as something to do. And you wouldn't be calculating statistics on the spread in your inner loop. And 65 microseconds is far too slow for an inner loop. I suppose the point is just to exercise some optimisation techniques - but this is a rather unrepresentative thing to optimise!
sneilan1 1 year ago
I've got an implementation of a stock exchange that uses the LMAX disruptor pattern in C++ https://github.com/sneilan/stock-exchange
And a basic implementation of the LMAX disruptor as a couple C++ files https://github.com/sneilan/lmax-disruptor-tutorial
I've been looking to rebuild this in rust however. I reached the point where I implemented my own websocket protocol, authentication system, SSL etc. Then I realized that memory management and dependencies are a lot easier in rust. Especially for a one man software project.
- JedMartin 1 year ago
  It's not easy to get data structures like this right in C++. There are a couple of problems with your implementation of the queue. Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper. In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
  - sneilan1 1 year ago
    >> In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
    I did not realize this. Thank you so much for pointing this out. I'm going to take a look.
    >> use std::atomic for your producer
    Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX algorithm which did not mention atomic. https://martinfowler.com/articles/lmax.html I'll check out the paper.
    - hi_dang_ 1 year ago
      I sincerely doubt the big HFT firms use anything of Fowler’s. Their optimizations are down to making their own hardware. LL is very context dependent and Amdahl’s law applies here.
    - JedMartin 11 months ago
      I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:
      1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.
      2. You're using positions for synchronization between the producer and consumer. When incrementing the reader position, you're basically freeing a slot for the producer, which means that you need to make sure all reads happen before you do it. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all the stores to that slot happen before that. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required here. Reordering by the compiler can be prevented using a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not need to add a memory barrier instruction as well to prevent reordering by the CPU at runtime. The good news is that std::atomic does all that for you if you pick the right memory ordering, and by default, it uses the strongest one (sequentially-consistent ordering). I think in this particular case you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3].
      [1] https://preshing.com/20130618/atomic-vs-non-atomic-operation...
      [2] https://en.wikipedia.org/wiki/Memory_ordering
      [3] https://en.cppreference.com/w/cpp/atomic/memory_order
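      To make the points above concrete, here is a minimal single-producer/single-consumer sketch, not the code from the repository. `SpscRing`, `try_push`, and `try_pop` are names invented for this example, the 64-byte alignment is an assumed cache-line size, and `try_pop` returns a copy so the slot is never exposed after it has been released.
```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Minimal SPSC ring buffer sketch (illustrative names, C++17).
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Call from the producer thread only.
    bool try_push(const T& value) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store to head_.
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;  // full
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write above is visible before the new tail is.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Copies the element out before
    // freeing the slot, so the producer cannot overwrite data still in use.
    std::optional<T> try_pop() {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to tail_.
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;  // empty
        T value = slots_[head & (Capacity - 1)];
        // Release: the read above completes before the slot is handed back.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    // Separate cache lines for the two positions to avoid false sharing.
    // 64 bytes is an assumption; check the target CPU.
    alignas(64) std::atomic<std::uint64_t> head_{0};  // consumer position
    alignas(64) std::atomic<std::uint64_t> tail_{0};  // producer position
    alignas(64) T slots_[Capacity]{};
};
```
      Returning a copy is the simplest way to avoid the overwrite problem; a zero-copy variant would need a separate "release slot" step after the caller is done with the element.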
    - 6keZbCECT2uB 1 year ago
      Fowler's implementation is written in Java which has a different memory model from C++. To see another example of Java memory model vs a different language, Jon Gjengset ports ConcurrentHashMap to Rust
- jstimpfle 1 year ago
  Instead of this:
```
  T *item = &this->shared_mem_region
                 ->entities[this->shared_mem_region->consumer_position];
  this->shared_mem_region->consumer_position++;
  this->shared_mem_region->consumer_position %= this->slots;
```
  you can do this:
```
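  const uint64_t mask = slot_count - 1;  // all 1's in binary (slot_count must be a power of two)

  T *item = &slots[pos & mask];

  pos++;  // free-running sequence number, never reduced modulo slot_count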
```
  i.e. you can replace a division / modulo with a bitwise AND, saving a bit of computation. This requires that the size of the ringbuffer is a power of two.
  What's more, you get to use sequence numbers over the full range of e.g. uint64_t. Wraparound is automatic. You can easily subtract two sequence numbers, this will work without a problem even accounting for wraparound. And you won't have to deal with stupid problems like having to leave one empty slot in the buffer because you would otherwise not be able to discern a full buffer from an empty one.
  Naturally, you'll still want to be careful that the window of "live" sequence numbers never exceeds the size of your ringbuffer "window".
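  The index arithmetic described above can be sketched in isolation as follows. The helper names are invented for this example; the only facts relied on are that unsigned arithmetic in C++ wraps modulo 2^64 and that the slot count is a power of two.
```cpp
#include <cassert>
#include <cstdint>

// True if n is a non-zero power of two (required for the mask trick).
constexpr bool is_power_of_two(std::uint64_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Slot index for a free-running sequence number.
// Equivalent to seq % slot_count when slot_count is a power of two.
constexpr std::uint64_t slot_index(std::uint64_t seq, std::uint64_t slot_count) {
    return seq & (slot_count - 1);
}

// Number of live items between reader and writer. Unsigned subtraction is
// modulo 2^64, so the result stays correct after write_seq wraps past zero.
constexpr std::uint64_t live_count(std::uint64_t write_seq, std::uint64_t read_seq) {
    return write_seq - read_seq;
}

// Full and empty are distinguishable without sacrificing a slot.
constexpr bool is_empty(std::uint64_t write_seq, std::uint64_t read_seq) {
    return write_seq == read_seq;
}
constexpr bool is_full(std::uint64_t write_seq, std::uint64_t read_seq,
                       std::uint64_t slot_count) {
    return live_count(write_seq, read_seq) == slot_count;
}
```
  For example, with 8 slots, a reader at 2^64 - 2 and a writer that has wrapped around to 1 are still exactly 3 items apart.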
- worstspotgain 1 year ago
  I briefly looked over your stock exchange code:
  - For memory management, consider switching to std::shared_ptr. It won't slow anything down and will put that concern to rest entirely.
  - For sockets, there are FOSS libraries that will outperform your code and save you a ton of headaches dealing with caveats and annoyances. For example, your looping through FD_ISSET is slower than e.g. epoll or kqueue.
  - For dependencies, C++ is definitely wilder than other languages. Dependencies are even harder to find than they are to manage. There's a lot of prospective library code, some of it hidden in little forgotten folds of the Internet. Finding it is basically a skill unto itself, one that can pay off handsomely.
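  On the sockets point, here is a minimal Linux-only sketch of the epoll pattern, not a drop-in replacement for the exchange code. `watch_for_read` and `wait_readable` are hypothetical helper names; the underlying calls are the standard `epoll_ctl` and `epoll_wait`. The kernel hands back only the descriptors that are ready, so there is no per-iteration scan over every descriptor as with FD_ISSET.
```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Register fd with the epoll instance for read readiness.
// Returns true on success.
inline bool watch_for_read(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Wait up to timeout_ms for one ready descriptor.
// Returns 1 and stores the descriptor in *ready_fd if one is ready,
// 0 on timeout, -1 on error.
inline int wait_readable(int epfd, int timeout_ms, int* ready_fd) {
    epoll_event ev{};
    const int n = epoll_wait(epfd, &ev, 1, timeout_ms);
    if (n > 0 && ready_fd != nullptr) {
        *ready_fd = ev.data.fd;
    }
    return n;
}
```
  A real server would pass a larger event array to `epoll_wait` and loop over the returned entries; a single event keeps the sketch short.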
  - zxcvbn4038 1 year ago
    When I did low latency everyone was offloading TCP to dedicated hardware.
    They would shut down every single process on the server and bind the trading app to the CPUs during trading hours to ensure nothing interrupted.
    Electrons travel slower than light so they would rent server space at the exchange so they had direct access to the exchange network and didn't have to traverse miles of cables to send their orders.
    They would multicast their traffic and there were separate systems to receive the multicast, log packets, and write orders to databases. There were redundant trading servers that would monitor the multicast traffic so that if they had to take over they would know all of the open positions and orders.
    They did all of their testing against simulators - never against live data or even the exchange test systems. They had a petabyte of exchange data they could play back to verify their code worked and to see if tweaks to the algorithm yielded better or worse trading decisions over time.
    A solid understanding of the underlying hardware was required: you would make sure network interfaces were arranged in a way they wouldn't cause contention on the PCI bus. You usually had separate interfaces for market data and orders.
    All changes were done after exchange hours once trades had been submitted to the back office. The IT department was responsible for reimbursing traders for any losses caused by IT activity - there were shady traders who would look for IT problems and bank them up so they could blame a bad trade on them at some future time.
    - shaklee3 1 year ago
      You don't need to shut down processes on the server. All you have to do is isolate CPU cores and move your workloads onto those cores. That's been a common practice in low latency networking for decades.
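      The "move your workloads onto those cores" half can be sketched on Linux with `sched_setaffinity`. This is a minimal example under the assumption of a Linux/glibc target; the helper names are invented here, and isolating the cores from the general scheduler is kernel configuration that is not shown.
```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to a single CPU (pid 0 means "the caller").
// Returns true on success.
inline bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Number of CPUs the calling thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
inline int allowed_cpu_count() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    return CPU_COUNT(&set);
}

// First CPU the calling thread is currently allowed to run on, or -1.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```
      Pinning only controls where this thread runs; keeping other work off that core is what the isolation step is for.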
    - gohwell 1 year ago
      I’ve worked at a few firms and never heard of an IT budget for f-ups. Sounds like a toxic work environment.
    - rramadass 1 year ago
      Any good books/resources you can recommend to learn about the above architectures/techniques?
    - ra0x3 1 year ago
      A great insightful comment, thank you!
  - sneilan1 1 year ago
    I did not know std::shared_ptr would not slow things down. I've learned something new today! :)
    Yes, I agree, epoll is a lot better than FD_ISSET.
    Maybe I can keep moving with my C++ code but do people still trust C++ projects anymore? My ideal use case is a hobbyist who wants a toy stock exchange to run directly in AWS. I felt that C++ has a lot of bad publicity and if I want anyone to trust/try my code I would have to rebuild it in rust.
    - worstspotgain 1 year ago
      Here's how to maximize shared_ptr performance:
      - In function signatures, use const references: foo(const std::shared_ptr<bar> &p). This will prevent unnecessary bumps of the refcount.
      - If you have an inner loop copying a lot of pointers around, you can dereference the shared_ptr's to raw pointers. This is 100% safe provided that the shared_ptr continues to exist in the meantime. I would consider this an optimization and an edge case, though.
      I would say people trust C++ projects at least as much as any other professional language - more so if you prove that you know what you're doing.
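      A small sketch of the two tips, with `Order` and the function names made up for illustration. `use_count()` makes the refcount effect visible: passing by value copies the pointer and bumps the count, passing by const reference does not.
```cpp
#include <cassert>
#include <memory>

// Hypothetical payload type, used only for this illustration.
struct Order {
    int quantity = 0;
};

// By value: the shared_ptr is copied, which increments the refcount
// (and decrements it again when the parameter is destroyed).
inline long count_by_value(std::shared_ptr<Order> p) {
    return p.use_count();
}

// By const reference: no copy, so no refcount traffic.
inline long count_by_ref(const std::shared_ptr<Order>& p) {
    return p.use_count();
}

// Inner loop on a raw pointer. This is only valid while `owner`
// (or another shared_ptr to the same object) stays alive.
inline int total_quantity(const std::shared_ptr<Order>& owner, int repeats) {
    const Order* raw = owner.get();
    int total = 0;
    for (int i = 0; i < repeats; ++i) {
        total += raw->quantity;
    }
    return total;
}
```
      With a single owner, `count_by_ref` reports 1 and `count_by_value` reports 2 for the same pointer.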
    - fooker 1 year ago
      Reference counting definitely slows down tight loops if you are not careful.
      The way to avoid that in low latency code is to break the abstraction and operate with the raw pointer in the few areas where this could be a bottleneck.
      It is usually not a bottleneck if your code is decently exploiting ipc, an extra addition or subtraction easily gets executed while some other operation is waiting a cycle for some cpu resource.
    - shaklee3 1 year ago
      That's not true. It does slow things down because it has an atomic access. How slow depends on the platform.
      unique_ptr does not slow things down.
    - pjmlp 1 year ago
      C++ might have a bad reputation, but in many fields the only alternative, in terms of ecosystem, tooling and tribal knowledge is C.
      Between those two, I'd rather pick the "Typescript for C" one.
    - chipdart 1 year ago
      > I felt that C++ has a lot of bad publicity and if I want anyone to trust/try my code I would have to rebuild it in rust.
      C++ gets bad publicity only from evangelists of the flavour of the month of self-described "successor of C++". They don't have a sales pitch beyond "C++ bad" and that's what they try to milk.
      And yet the world runs on C++.
    - rmlong 1 year ago
      std::shared_ptr definitely slows things down. It's non-intrusive and therefore requires a memory indirection.
- pclmulqdq 1 year ago
  The LMAX disruptor is a great data structure when you have threads bound to cores and most/all of them are uncontended. It has some terrible pathologies at the tails if you aren't using this pattern. Threads getting descheduled at bad times can really hurt.
  SPSC ring buffers are going to be hard to beat for the system you are thinking of, and you can likely also implement work stealing using good old locks if you need it.
- temporarely 1 year ago
  fun fact: the original LMAX was designed for and written in Java.
  https://martinfowler.com/articles/lmax.html
  - sneilan1 1 year ago
    I think it made sense at the time. From what I understand, you can make Java run as fast as C++ if you're careful with it and use JIT. However, I have never tried such a thing and my source is hearsay from friends who have worked in financial institutions. Then you get added benefit of the Java ecosystem.
    - nine_k 1 year ago
      From my hearsay, you absolutely can, given two things: fewer pointer-chasing data structures, and, most crucially, fewer or no allocations. Pre-allocate arrays of things you need, run ring buffers on them if you have to use a varying number of things.
      A fun but practical approach which I again heard (second-hand) to be used, is just drowning your code in physical RAM, and switching the GC completely off. Have enough RAM to run a trading day, then reboot. The cost is trivial, compared to spending engineering hours on different approaches.
    - bb88 1 year ago
      All the Java libs that you use can never do an allocation -- ever! So you don't really get that much benefit from the Java ecosystem (other than IDEs). You have to audit the code you use to make sure allocations never happen during the critical path.
      Fifteen years ago, the USN's DDX software program learned this the hard way when they had a hard real-time requirement in the milliseconds.
- stargrazer 1 year ago
  I'll have to take a look at the code. Maybe I can integrate it into https://github.com/rburkholder/trade-frame
- fooker 1 year ago
  You say it's easier in Rust, but you still have a complete C++ implementation and not a Rust one. :)
  - sneilan1 1 year ago
    It took me about a year to build all of this stuff in C++. So I imagine since I've had to learn rust, it will probably take me the same amount of time if I can save time with dependencies.
    - fooker 1 year ago
      In my experience, once you know the problem really well, yes you're right.
      If you are building a complex prototype from scratch, you'll usually spend more time fighting the Rust compiler than trying out alternate design decisions.
  - deepsun 1 year ago
    Linus said he wouldn't start Linux if Unix was ready at that time.
    - deepsun 11 months ago
      Don't know why so many downvotes, probably because I confused Unix with "GNU kernel".
      Here's the source (Jan 29, 1992):
      > If the GNU kernel had been ready last spring, I'd not have bothered to even start my project: the fact is that it wasn't and still isn't. Linux wins heavily on points of being available now.
      https://groups.google.com/g/comp.os.minix/c/wlhw16QWltI/m/P8...
    - pjmlp 1 year ago
      Minix....
jeffreygoesto 1 year ago
Reminds me of https://github.com/CppCon/CppCon2017/blob/master/Presentatio...
- munificent 1 year ago
  This is an excellent slideshow.
  The slide on measuring by having a fake server replaying order data, a second server calculating runtimes, the server under test, and a hardware switch to let you measure packet times is so delightfully hardcore.
  I don't have any interest in working in finance, but it must be fun working on something so performance critical that buying a rack of hardware just for benchmarking is economically feasible.
  - nine_k 1 year ago
    Delightfully hardcore indeed!
    But of course you don't have to buy a rack of servers for testing, you can rent it. Servers are a quickly depreciating asset, why invest in them?
    - CyberDildonics 1 year ago
      Why would replaying data for testing be "Delightfully hardcore indeed!"? That's how people program in general, they run the same data through their program.
      > Servers are a quickly depreciating asset, why invest in them?
      I don't think they are a quickly depreciating asset compared to the price of renting, but you would want total control over them in this scenario anyway.
    - a_t48 1 year ago
      You'd want it to be the exact same hardware as in production, for one.
  - a_t48 1 year ago
    The self driving space does this :)
winternewt 1 year ago
I made a C++ logging library [1] that has many similarities to the LMAX disruptor. It appears to have found some use among the HFT community.
The original intent was to enable highly detailed logging without performance degradation for "post-mortem" debugging in production environments. I had coworkers who would refuse to include logging of certain important information for troubleshooting, because they were scared that it would impact performance. This put an end to that argument.
[1] https://github.com/mattiasflodin/reckless
munificent 1 year ago
> The noted efficiency in compile-time dispatch is due to decisions about function calls being made during the compilation phase. By bypassing the decision-making overhead present in runtime dispatch, programs can execute more swiftly, thus boosting performance.
The other benefit with compile-time dispatch is that when the compiler can statically determine which function is being called, it may be able to inline the called function's code directly at the callsite. That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).
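As a rough side-by-side, here is a minimal sketch of the two dispatch styles. The pricer types and function names are hypothetical; the point is only that the template version fixes the callee per instantiation, which is what gives the compiler the option to inline it.
```cpp
#include <cassert>

// Runtime dispatch: the call goes through the vtable unless the compiler
// can prove the dynamic type and devirtualize it.
struct Pricer {
    virtual ~Pricer() = default;
    virtual int price(int qty) const = 0;
};

struct FlatPricer final : Pricer {
    int price(int qty) const override { return qty * 2; }
};

inline int total_runtime(const Pricer& p, int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) {
        sum += p.price(i);  // indirect call in the general case
    }
    return sum;
}

// Compile-time dispatch: P is known for each instantiation, so the callee
// is statically resolved and can be inlined and optimized across the call.
struct StaticFlatPricer {
    int price(int qty) const { return qty * 2; }
};

template <typename P>
int total_static(const P& p, int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) {
        sum += p.price(i);  // direct call, candidate for inlining
    }
    return sum;
}
```
Both functions compute the same result; any difference shows up only in the generated code and in measurements.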
- foobazgt 1 year ago
  > That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).
  AFAIK, the speedup is almost never function call overhead. As you mention at the tail end, it's all about the compiler optimizations being able to see past the dynamic branch. Good JITs support polymorphic inlining. My (somewhat dated) experience for C++ is that PGO is the solve for this, but it's not widely used. Instead people tend to avoid dynamic dispatch altogether in performance sensitive code.
  I think the more general moral of the story is to avoid all kinds of unnecessary dynamic branching in hot sections of code in any language unless you have strong confidence that your compiler/JIT is seeing through it.
- binary132 1 year ago
  The real performance depends on the runtime behavior of the machine as well as compiler optimizations. I thought this talk was very interesting on this subject.
  https://youtu.be/i5MAXAxp_Tw
- xxpor 1 year ago
  OTOH, it might be a net negative in latency if you're icache limited. Depends on the access pattern among other things, of course.
  - munificent 1 year ago
    Yup, you always have to measure.
    Though my impression is that compilers tend to be fairly conservative about inlining so that they don't risk the inlining being a pessimization.
    - foobazgt 1 year ago
      My experience has been that it's rather heuristic based. It's a clear win when you can immediately see far enough in advance to know that it'll also decrease the amount of generated code. You can spot trivial cases where this is true at the point of inlining. However, if you stopped there, you'd leave a ton of optimizations on the table. Further optimization (e.g. DCE) will often drastically reduce code size from the inlining, but it's hard to predict in relationship to a specific inlining decision.
      So, statistics and heuristics.
    - rasalas 1 year ago
      In my experience it's the "force inline" directives that can make this terrible.
      I had a coworker who loved "force inline". A symptom was stupidly long codegen times on MSVC.
globular-toast 1 year ago
Is there any good reason for high-frequency trading to exist? People often complain about bitcoin wasting energy, but oddly this gets a free pass despite this being a definite net negative to society as far as I can tell.
- jeffreyrogers 1 year ago
  Bid/ask spreads are far narrower than they were previously. If you look at the profits of the HFT industry as a whole they aren't that large (low billions) and their dollar volume is in the trillions. Hard to argue that the industry is wildly prosocial but making spreads narrower does mean less money goes to middlemen.
  - akira2501 1 year ago
    > but making spreads narrower does mean less money goes to middlemen.
    On individual trades. I would think you'd have to also argue that their high overall trading volume is somehow also a benefit to the broader market or at the very least that it does not outcompete the benefits of narrowing.
    - jeffreyrogers 11 months ago
      Someone is taking the other side of the trade. Presumably they have a reason for making that trade, so I don't see how higher volume makes people worse off. Probably some of those trades are wealth destroying (due to transaction costs) but it is destroying traders' and speculators' wealth, not some random person who can't afford it, since if you trade rarely your transaction costs are lower than before HFT became prominent.
  - sesuximo 1 year ago
    Why do high spreads mean more money for middlemen?
    - Arnt 1 year ago
      When you buy stock, you generally buy it from a "market maker", which is a middleman. When you sell, you sell to a market maker. Their business is to let you buy and sell when you want instead of waiting for a buyer/seller to show up. The spread is their profit source.
- TheAlchemist 1 year ago
  Because it's not explicitly forbidden?
  I would argue that HFT is a rather small space, albeit pretty concentrated. It's several orders of magnitude smaller in terms of energy wasting than Bitcoin.
  The only positive from HFT is liquidity and tighter spreads, but it also depends on what people include in the definition of HFT. For example, Robinhood and free trading probably wouldn't exist without it.
  They are taking a part of the cake that previously went to brokers and banks. HFT is not in the business of screwing 'the little guy'.
  From my perspective there is little to no negative for society. If somebody is investing long term in the stock market, he couldn't care less about HFT.
  - musicale 11 months ago
    Buying and quickly selling a stock requires a buyer and a seller, so it doesn't seem like an intrinsic property that real liquidity would be increased. Moreover, fast automated trading seems likely to increase volatility.
    I tend to agree that for long term investment it probably doesn't make a huge difference except for possible cumulative effects of decreased liquidity, increased volatility, flash crashes, etc. Also possibly a loss of small investor confidence since the game seems even more rigged in a way that they cannot compete with.
- bravura 1 year ago
  Warren Buffett proposed that the stock market should be open less frequently, like once a quarter or similar. This would encourage long-term investing rather than reacting to speculation.
  Regardless, there are no natural events that necessitate high-frequency trading. The underlying value of things rarely changes very quickly, and if it does it's not volatile, rather it's a firm transition.
  - astromaniak 1 year ago
    > Warren Buffett proposed that the stock market should be open less frequently
    This will result in another market where deals will be made and then finalized on the 'official' one when it opens. It's like with employee stock. You can sell it before you can...
    - FooBarBizBazz 11 months ago
      > It's like with employee stock. You can sell it before you can...
      I thought that this was explicitly forbidden in most SV employment contracts? "Thou shalt not offer your shares as collateral or (I forget the exact language) write or purchase any kind of derivative to hedge downside." No buying PUTS! No selling CALLs! No stock-backed loans!
      Or do people make secondary deals despite this, because, well, the company doesn't know, does it?
    - munk-a 1 year ago
      Not if we disallow it. We have laws in place to try and prevent a lot of natural actions of markets.
  - Arnt 1 year ago
    AIUI the point of HFT isn't to trade frequently, but rather to change offer/bid prices in the smallest possible steps (small along both axes) while waiting for someone to accept the offer/bid.
  - affyboi 11 months ago
    Why shouldn’t people be allowed to do both? I don’t see much of an advantage to making the markets less agile.
    It would be nice to be able to buy and sell stocks more than once a quarter, especially given plenty of events that do affect the perceived value of a company happen more frequently than that
  - musicale 11 months ago
    The value of a stock usually doesn't change every millisecond. There doesn't seem to be a huge public need to pick winners and losers based on ping time.
  - htrp 1 year ago
    most trading volume already happens at open or before close.
  - kolbe 1 year ago
    Seems like a testable hypothesis. Choose the four times a year that the stock is at a fair price, and buy when it goes below it and sell when it goes above it?
- FredPret 1 year ago
  Non-bitcoin transactions are just a couple of entries in various databases. Mining bitcoin is intense number crunching.
  HFT makes the financial markets a tiny bit more accurate by resolving inconsistencies (for example three pairs of currencies can get out of whack with one another) and obvious mispricings (for various definitions of "obvious")
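  A rough sketch of the three-currency consistency check, using hypothetical mid prices and ignoring spreads, fees, and latency (the struct and function names are invented for the example). Converting EUR to USD, USD to JPY, and JPY back to EUR should return about 1.0 when the three quotes agree; a value away from 1.0 is the kind of inconsistency meant here.

  ```cpp
  #include <cassert>
  #include <cmath>

  // Hypothetical mid prices; real quotes have bid/ask spreads and fees.
  struct TriangleQuote {
      double eur_usd;  // USD per 1 EUR
      double usd_jpy;  // JPY per 1 USD
      double eur_jpy;  // JPY per 1 EUR
  };

  // Value of 1 EUR after going EUR -> USD -> JPY -> EUR.
  // Exactly 1.0 means the three quotes are mutually consistent.
  double round_trip(const TriangleQuote& q) {
      assert(q.eur_jpy > 0.0);
      return q.eur_usd * q.usd_jpy / q.eur_jpy;
  }

  // True when the round trip deviates from 1.0 by more than `threshold`,
  // which in practice would have to cover all trading costs.
  bool is_mispriced(const TriangleQuote& q, double threshold) {
      return std::fabs(round_trip(q) - 1.0) > threshold;
  }
  ```

  With quotes of 1.10, 150.0 and 165.0 the round trip is 1.0 up to floating-point rounding; moving the EUR/JPY quote to 164.0 makes it about 1.006, which is flagged for any threshold below that gap.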
  - jeffbee 1 year ago
    That's a nice fairy tale that they probably tell their kids when asked, but what the profitable firms are doing at the cutting edge is inducing responses in the other guys' robots, in a phase that the antagonist controls, then trading against what they know is about to happen. It is literally market manipulation. A way to kill off this entire field of endeavor is to charge a tax on cancelled orders.
    - alchemist1e9 1 year ago
      Exchanges put limits on cancellation rates as measured by a multiple of filled orders.
      You have to allow strategies that can induce other strategies, as by definition those also increase liquidity. It’s a difficult problem to explain to anyone except the very few people who can understand the extremely complicated feedback loops that result from bots fighting bots. However, the regulators actually have access to counterparty-tagged exchange event data, and what is found when this is analyzed is that the net cost for liquidity that is extracted by market makers and short-term traders from longer-term participants is continuously decreasing, not increasing. The system is becoming more and more efficient, not less. This is good for markets and the economy. There are also fewer people working in financial markets per capita than ever before. Granted, those who are might include a higher percentage of highly skilled, specialized, and educated individuals than previously, which some might argue might be better used in some other industry, but that is rightfully not what the market wants.
      There is absolutely no logical reason to “kill off this entire field”; those sentiments are purely envy-based reactions from those who don’t understand what is happening.
    - blakers95 1 year ago
      Yep and what's worse is many hft firms aren't in the market-making business at all but actually REMOVE liquidity.
    - affyboi 11 months ago
      The exchanges already charge fees per order https://www.nyse.com/publicdocs/nyse/markets/nyse-arca/NYSE_...
    - FredPret 1 year ago
      If one outlawed / disincentivized the hostile bot behavior you described, there would still be the opportunity to do the good and profitable things I described.
- idohft 1 year ago
  How far have you tried to tell, and do you buy/sell stocks?
  There's someone on the other side of your trade when you want to trade something. You're more likely than not choosing to interact with an HFT player at your price. If you're getting a better price, that's money that you get to keep.
  *I'm going to disagree on "free pass" also. HFT is pretty often criticized here.
- pi-rat 1 year ago
  Generally gets attributed with:
  - Increased liquidity. Ensures there's actually something to be traded available globally, and swiftly moves it to places where it's lacking.
  - Tighter spreads, the difference between you buying and then selling again is lower. Which often is good for the "actual users" of the market.
  - Global prices / less geographical differences in prices. Generally you can trust you get the right price no matter what venue you trade at, as any arbitrage opportunity has likely already been executed on.
  - etc..
  - munk-a 1 year ago
    > Tighter spreads, the difference between you buying and then selling again is lower. Which often is good for the "actual users" of the market.
    I just wanted to highlight this one in particular - the spread is tighter because HFTs eat the spread and reduce the error that market players can benefit from. The spread is disappearing because of rent-seeking from the HFTs.
    - yxhuvud 1 year ago
      What needs to be pointed out is that the rent and spread is the same thing in this equation. Before the rise of HFT actual people performed these functions, and then the spread/rent-seeking was a lot higher.
- rcxdude 1 year ago
  Yes: it put a much larger, more expensive, and less efficient part of wall street out of business. Before it was done with computers, it was done with lots of people doing the same job. What was that job? Well, if you want to go to a market and sell something, you generally would like for there to be someone there who's buying it. But it's not always the case that there's a buyer there right at the time who actually wants the item for their own use. The inverse is also true for a prospective buyer. Enter middle-men or market makers who just hang around the marketplace, learning roughly how much people will buy or sell a given good for, and buy it for slightly less than they can sell it later for. This is actually generally useful if you just want to buy or sell something.
  Now, does this need to get towards milliseconds or nanoseconds? No, this is just the equivalent of many of these middle-men racing to give you an offer. But it's (part of) how they compete with each other, and as they do so they squeeze the margins of the industry as a whole: in fact, the profits of HFT firms have decreased as a percentage of the overall market and in absolute terms after the initial peak as they displaced the day traders doing the same thing.
  - bostik 1 year ago
    > it's not always the case that there's a buyer there right at the time
    This hits the nail on the head. For a trade to happen, counterparties need to meet in price and in time. A market place is useless if there is nobody around to buy or sell at the same time you do.
    The core service market makers provide is not liquidity. It's immediacy: they offer (put up) liquidity in order to capture trades, but the value proposition for other traders - and the exchanges! - is that there is someone to take the other side of a trade when a non-MM entity wants to buy or sell instruments.
    It took me a long time to understand what the difference is. And in order to make sure that there is sufficient liquidity in place, exchanges set up both contractual requirements and incentive structures for their market makers.
- torlok 1 year ago
  To go a step further, I don't think you should be required to talk to middlemen when buying stocks, yet here we are. The house wants its cut.
  - smabie 1 year ago
    So who would sell it to you then? At any given time there are not very many actual natural buyers and sellers for a single security.
  - mrbald 1 year ago
    Central counterparty concept implemented by most exchanges is a valid service, as otherwise counterparty risk management would be a nightmare - an example of a useful “middleman”.
    - immibis 1 year ago
      Yes, but you and I can't even talk to the exchange. We have to talk to one of many brokers that are allowed to talk to the exchange, and brokers do much more than just passing your orders to exchanges. For example, IIRC they can legally front-run your orders.
  - astromaniak 1 year ago
    Under-the-carpet deals would make it less transparent. It would be hard to detect that a top manager sold all his stock to competitors and now is making decisions in their favor.
- arcimpulse 1 year ago
  It would be trivial (and vastly more equitable) to quantize trade times.
  - FredPret 1 year ago
    You mean like settling trades every 0.25 seconds or something like that? Wouldn't there be a queue of trades piling up every 0.25 seconds, incentivizing maximum speed anyway?
    - foobazgt 1 year ago
      Usually the proposal is to randomize the processing of the queue. So, as long as your trades get in during the window, there's no advantage to getting in any earlier. In theory the window is so small as to not have any impact on liquidity but wide enough to basically shut down all HFT.
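      A minimal sketch of that idea, assuming fixed windows keyed on arrival time and a seeded RNG (the `Order` struct and `batch_and_shuffle` are invented for the example, not any exchange's actual mechanism). Orders are grouped by window, and the order within each window is randomized, so arriving earlier inside a window gives no priority.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <random>
      #include <vector>

      // Hypothetical order record: just an id and an arrival timestamp.
      struct Order {
          std::uint64_t id;
          std::uint64_t arrival_ns;  // arrival time in nanoseconds
      };

      // Returns the processing sequence: windows stay in time order,
      // but orders inside the same window are shuffled.
      std::vector<Order> batch_and_shuffle(std::vector<Order> orders,
                                           std::uint64_t window_ns,
                                           std::mt19937_64& rng) {
          assert(window_ns > 0);
          // Group orders by window index (arrival time / window length).
          std::stable_sort(orders.begin(), orders.end(),
                           [&](const Order& a, const Order& b) {
                               return a.arrival_ns / window_ns < b.arrival_ns / window_ns;
                           });
          auto first = orders.begin();
          while (first != orders.end()) {
              const std::uint64_t window = first->arrival_ns / window_ns;
              // Find the end of the run of orders that share this window.
              auto last = std::find_if(first, orders.end(), [&](const Order& o) {
                  return o.arrival_ns / window_ns != window;
              });
              std::shuffle(first, last, rng);  // erase intra-window time priority
              first = last;
          }
          return orders;
      }
      ```

      The window length is the knob: there is still a race to make it into a given window at all, which is why the proposal relies on the window being wide relative to the speed differences between participants.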
- lifeformed 1 year ago
  Just from an energy perspective, I'm pretty sure HFT uses many orders of magnitude less energy than bitcoin mining.
- mkoubaa 1 year ago
  Why does it exist?
  Because it's legal and profitable.
  If you don't like it, try to convince regulators that it shouldn't be legal and provide a framework for criminalizing/fining it without unintended consequences, and then find a way to pay regulators more in bribes than the HFT shops do, even though their pockets are deeper than yours, and then things may change.
  If that sounds impossible, that's another answer to your question
- lmm 1 year ago
  > a definite net negative to society as far as I can tell.
  What did you examine to reach that conclusion? If high-frequency trading were positive for society, what would you expect to be different?
  The reason for high-frequency trading to exist is that the sub-penny rule makes it illegal to compete on price so you have to compete on speed instead. Abolishing the sub-penny rule would mean high-frequency trading profits got competed-away to nothing, although frankly they're already pretty close. The whole industry is basically an irrelevant piece of plumbing anyway.
- calibas 1 year ago
  A net negative to society, but a positive for the wealthiest.
  - cheonic730 1 year ago
    > A net negative to society, but a positive for the wealthiest.
    No.
    When your passive index fund manager rebalances every month because “NVDA is now overweighted in VTI, QQQ” the manager does not care about the bid/ask spread.
    When VTI is $1.6 trillion, even a $0.01 difference in price translates to a loss of $60 million for the passive 401k, IRA, investors.
    HFT reduces the bid/ask spread, and “gives this $60 million back” to the passive investors for every $0.01 price difference, every month. Note that VTI mid price at time of writing is $272.49.
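    The arithmetic behind that figure can be sketched as below, under the simplifying assumption that the fund's entire value is priced at the quoted share price (the function name is made up for the example). $1.6 trillion at $272.49 per share is about 5.87 billion shares, and $0.01 on each of them is roughly $58.7 million, which rounds to the $60 million quoted.

    ```cpp
    #include <cassert>

    // Dollar impact of a per-share price difference across a whole fund.
    // Assumes every dollar of fund value is held at `share_price_usd`.
    double price_impact_usd(double fund_value_usd,
                            double share_price_usd,
                            double price_diff_usd) {
        assert(share_price_usd > 0.0);
        const double shares = fund_value_usd / share_price_usd;  // implied share count
        return shares * price_diff_usd;
    }
    ```

    Calling `price_impact_usd(1.6e12, 272.49, 0.01)` gives a value a little under 59 million.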
astromaniak 1 year ago
Just in case you are a pro developer, the whole thing is worth looking at:
https://github.com/CppCon/CppCon2017/tree/master/Presentatio...
and up
ykonstant 11 months ago
I am curious: why does (or did) this field use C++ instead of C for the logic? What benefits does C++ have over C in the domain? I am proficient in C/assembly but completely ignorant of the practices in HFT so please go easy on the explanations!
- jqmp 11 months ago
  C++ is more expressive and allows much more abstraction than C. For a long time C++ was the only mainstream language that provided C-level performance as well as rich abstractions, which is why it became popular in fields that require complex domain modeling, like HFT, gamedev, and graphics. (Of course one can debate whether this expressivity is worth the enormous complexity of the language, but in practice people have empirically chosen C++.)
ibeff 1 year ago
The structure and tone of this text reeks of LLM.
poulpy123 11 months ago
the irony being that if something should not be high frequency, it is trading
apantel 11 months ago
Anyone know of resources like this for Java?
- Hixon10 11 months ago
  https://www.reddit.com/r/java/comments/1ctpebe/low_latency/
  - apantel 11 months ago
    Thanks! Looks like a great list of resources.
gedanziger 1 year ago
Very cool intro to the subject!