(comments)

Original link: https://news.ycombinator.com/item?id=40940241

The discussion covers a technique for traversing arrays and linked lists in C and C++. When you increment a pointer to an array element, it points to the next object of that type in memory, which makes iterating over an array straightforward. In the linked-list case discussed here, adding 1 to a node pointer works the same way, and the next list node may in fact be the next array element when the nodes share one contiguous allocation. Commenters note that modern CPUs use pipelining to execute several instructions at once; bottlenecks appear when instructions depend on one another, producing slowdowns known as pipeline stalls, and memory loads introduce these bottlenecks especially often because loads are slower than most other operations. To get around this, processors execute instructions out of order, while making the results appear as if they had run in program order. This lets the CPU speculatively start the next iteration of the loop before confirming whether the next node in memory really is the next node in the list. If the guess matches reality, the CPU has avoided a pipeline stall; if it does not, the speculative results are simply not committed and execution proceeds in the expected order.

Commenters acknowledge that the technique looks elaborate and arguably unnecessary, since a contiguous array can usually be handled efficiently with a plain vector. They suggest that specific scenarios involving frequently modified linked lists might benefit from this approach to minimize the impact of expensive operations. They also note that rearranging a linked list's memory layout into a single large allocation can improve performance through better cache locality and less pointer chasing, and that a segmented array storing the list elements in chunks can achieve similar benefits. In practice, however, such techniques may be overly complex for typical applications, which call for simpler solutions such as unrolled linked lists or representing links as indices into a vector.

Related Articles

Original Article


It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.
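For concreteness, here is a minimal sketch of that multiple-accumulator idea in C; the function name and the assumption that the length is a multiple of four are made up for illustration.

  #include <stdint.h>
  #include <stddef.h>

  // Four independent accumulators form four separate dependency chains,
  // so additions from different chains can overlap in the pipeline.
  uint64_t sum_multi(const uint64_t *a, size_t n) {  // assumes n % 4 == 0
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;  // combine only after the loop body
  }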

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if it's wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.



>Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?



You want to use SIMD and multiple accumulators. In fact, you not only want as many accumulators as there are SIMD ALUs; since SIMD operations usually have longer latency, you typically also unroll SIMD loops for software pipelining, using even more accumulators to break loop-carried dependencies.
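As a rough illustration of that combination, here is a hedged sketch using AVX2 intrinsics with four vector accumulators; the function name and the assumption that the length is a multiple of 16 are invented for the example.

  #include <immintrin.h>
  #include <stdint.h>
  #include <stddef.h>

  uint64_t sum_avx2(const uint64_t *a, size_t n) {  // assumes AVX2 and n % 16 == 0
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    __m256i acc2 = _mm256_setzero_si256();
    __m256i acc3 = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 16) {
      // Each accumulator has its own dependency chain, hiding the add latency.
      acc0 = _mm256_add_epi64(acc0, _mm256_loadu_si256((const __m256i *)(a + i)));
      acc1 = _mm256_add_epi64(acc1, _mm256_loadu_si256((const __m256i *)(a + i + 4)));
      acc2 = _mm256_add_epi64(acc2, _mm256_loadu_si256((const __m256i *)(a + i + 8)));
      acc3 = _mm256_add_epi64(acc3, _mm256_loadu_si256((const __m256i *)(a + i + 12)));
    }
    // Combine the accumulators only after the loop.
    __m256i acc = _mm256_add_epi64(_mm256_add_epi64(acc0, acc1),
                                   _mm256_add_epi64(acc2, acc3));
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
  }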



Very nice! I was struggling to come up with a realistic real-world scenario for value speculation, but this is a perfect example. Any kind of multi-thread/contention-related code seems like an ideal target for making a guess-based fast path corrected by a slow-path consistency check.



It’s interesting to me that the final assembly-trick-free version almost no longer looks like a hack.

If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.



That is amazing, isn't it?

I want to understand, in:

  uint64_t sum5(Node *node) {
    uint64_t value = 0;
    Node *next = NULL;
    for (; node; node = node->next) {
      for (;;) {
        value += node->value;
        if (node + 1 != node->next) {
          break;
        }
        node++;
      }
    }
    return value;
  }
How is this magic:
        if (node + 1 != node->next) {
          break;
        }
Better than the more intuitive:
        if (node->next == NULL) {
          break;
        }
Also... why is `node + 1` even a valid comparison? (Please forgive my rusty C/C++, college was a while ago)


> Also... why is `node + 1` even a valid comparison here?

In C and C++, when you add 1 to a pointer, you actually make it point to the next object of that size in memory, e.g. if I have a pointer to a 4-byte integer at address 0x8000, incrementing it will make it point to address 0x8004.

Because arrays are contiguous chunks of memory, you can use this to iterate through arrays.

  int array[50];
  int *p = &array[0]; // Take the address of the first array element
  int sum = 0;
  for (int i = 50; i != 0; i--) {
      sum += *p;
      p++; // Point to the next element
  }
After this loop, p will be equal to &array[50], which is one past the last element of the array, because the loop will have run 50 times, and p is incremented once per loop.

What OP did is allocate an array of linked list nodes, and test to see if the next linked list node was actually just the next array element.
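For reference, a small sketch of what that kind of setup might look like; the size and function name are invented here, and the article's own code may differ.

  #include <stdint.h>
  #include <stdlib.h>

  #define N 1000

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  Node *make_contiguous_list(void) {
    Node *nodes = malloc(N * sizeof(Node));  // one allocation holds every node
    for (size_t i = 0; i < N; i++) {
      nodes[i].value = i;
      // Each node's successor is literally the next array element,
      // so the guess "node + 1 == node->next" is always right here.
      nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }
    return nodes;
  }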



CPUs can execute arithmetic instructions fast, because we've been trying to execute sequential instructions faster for years. One way to do this is to create a pipeline: while the result of one instruction is being calculated, start reading the next; by the time that result is being written back to a register, you're already executing the next one.

Bottlenecks are introduced by dependencies, like if an instruction modifies one register and the next one immediately uses its value. With a pipeline, the result of the previous instruction might not be written back to the register in time for the current instruction to read the correct value from it, causing the CPU to execute the instruction incorrectly. So, you have to stall the pipeline and do nothing until that register is written to.

Memory loads cause bottlenecks for similar reasons. In fact, loading memory is usually the slowest instruction.

One way of getting around this is to execute instructions out of order, but do it in such a way that it looks like it's executing in-order. For dealing with branches, you can speculatively execute both branches, but only commit one result to the CPU state.

By testing whether the next node in memory was actually the next node in the list, the CPU could speculatively start executing the next iteration of the loop body as though it were; if it turns out not to be, then it can just not commit the results. If it is, then you've avoided a pipeline stall.

Note: I'm not a CPU expert; I just took a class in computer architecture recently. I could be mistaken about something, so feel free to correct me.



This jibes with my understanding of pipelining architecture and speculative execution, thank you for connecting the dots! Seriously cool to understand why that line works better.

It seems unfortunate the compiler isn't able to realize "no further assignments are ever performed on node->next, it is side-effect dependency free, optimize the spec exec accordingly". Though, how often would this be the case in a real-world program of actual utility... probably rare.



The other person who replied may have already cleared this up for you, but the two conditions you listed mean very different things. The first means “if the next node is not adjacent in memory”, while the second means “if there is no next node”.



This is crucially followed by the node++ instruction. As noted elsewhere, this increases the value of the node pointer by the size of one Node, thereby allowing the CPU to execute at the faster of its memory prefetch speed and the speculated inner-loop execution speed, for as long as the sequence of list nodes is contiguous and sequential.



Finding 'the hack' to get the compilers to behave is one of the areas where I'd like to see better warnings (which could get promoted to errors).
  while (node) {
    value += node->value;
    next = node->next;
    node++;  // Line 101
    if (node != next) {
      node = next;
    }
  }
  // Compiler warning
  // Warning: Lines 101 102 103: always true evaluation combined
The pivot here is moving the fallback out of the tight inner loop; this also introduces a new statement block. The additional loop exit condition depends on values in memory and cannot be optimized out, since it now directs control flow directly instead of always assuring that the value of node becomes node->next (an indirect method of flow control).
  while (node) {
    while (node) {
      value += node->value;
      if (node + 1 != node->next) break;
      node++;
    }
    node = node->next;
  }


This got me wondering - it's said that C is based on the lie that all computers have the same architecture as a PDP-11. (At least, I'm pretty sure I remember people saying that).

So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?



As far as I am aware, this saying is based on the reasoning behind C types rather than serious compiler considerations. In today's world such cpu-specific concerns are left to the compiler to figure out.

I'm sure you could contrive a language where this functionality is exposed, but I'm struggling to come up with an example where this would be seriously beneficial across multiple platforms.

I strongly suspect that integrating editors of existing languages with tooling that informs programmers on how a given chunk of code performs with parallel execution units would be far more beneficial than inventing a language dedicated to such concerns at this time.



Well, C++ has the likely/unlikely attributes to somewhat prefer one branch over the other in the eyes of the branch predictor, and C++, Rust, and some other low-level languages do have native SIMD support (note: C doesn’t have an official one, just compiler-specific ones, so in this vein it is actually higher level than Rust or C++).
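In C specifically, the closest commonly used equivalent is a compiler builtin rather than a standard attribute; a hedged sketch using GCC/Clang's __builtin_expect, with made-up wrapper macros and function:

  // __builtin_expect tells the compiler which outcome to expect,
  // so it can lay out the hot path as the fall-through case.
  #define LIKELY(x)   __builtin_expect(!!(x), 1)
  #define UNLIKELY(x) __builtin_expect(!!(x), 0)

  int read_slot(const int *slot) {
    if (UNLIKELY(slot == NULL)) {  // error path, expected to be rare
      return -1;
    }
    return *slot + 1;              // hot path
  }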



The linear node layout is not the point at all.

It's serving two purposes here:

1) Providing us with a correct "guess" of what the next node is.

2) Ensuring that in all cases we're running from the L1 cache.

In real world code, you'd be correct -- getting things to run out of L1/L2 is the most important attribute. This is specifically about a micro-optimization that allows you to beat the obvious code even when running completely from cache!



The example is poorly chosen in terms of practicality for this reason, but otherwise, no, this is a poor summary that misses something interesting.

The memory layout isn't changing in the faster versions, and there are no additional cache misses. It's easy to convince yourself that the only difference between the naive linked list and assuming linear layout is the extra pointer load - but TFA shows this is false! The execution pipeline incurs extra costs, and you can influence it.



I think a more practical example might have been to have a mostly contiguous list with a few discontinuous nodes inserted randomly in the middle. That’s more like a real case, and exercises the advantages of linked lists over simple arrays, but should still perform well, since there would only be a few value speculation misses.
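A hedged sketch of that kind of test setup, with made-up names and sizes: build an array-backed list, then splice a few separately allocated nodes into random positions so the node + 1 guess misses only occasionally.

  #include <stdint.h>
  #include <stdlib.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  Node *make_mostly_contiguous(size_t n, size_t strays) {
    Node *base = malloc(n * sizeof(Node));
    for (size_t i = 0; i < n; i++) {
      base[i].value = i;
      base[i].next = (i + 1 < n) ? &base[i + 1] : NULL;
    }
    for (size_t k = 0; k < strays; k++) {
      Node *stray = malloc(sizeof(Node));  // lives outside the contiguous block
      Node *at = &base[rand() % (n - 1)];  // pick a random insertion point
      stray->value = n + k;
      stray->next = at->next;
      at->next = stray;                    // only these links defeat the node + 1 guess
    }
    return base;
  }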



This is great! Mostly when I think about branch prediction, I'm thinking about the end of a loop so this was a great read.

There have been a lot of comments about the example presented being quite artificial and I agree but it is simple enough to help the reader understand what's happening and why it's faster.

In fact, it would be fairly common for the nodes in linked lists to be sequential in RAM anyway. For example, this code shows that the next node is easy to guess. The nodes do end up exactly in sequence in memory:

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct Node {
    int value;
    struct Node *next;
  } Node;

  Node *head = NULL;
  Node *tail = NULL;

  int main(int argc, char **argv) {

    // Allocate some nodes
    for (int i = 0; i < 100; i++) {
      Node *new_node = malloc(sizeof(Node));
      new_node->value = rand();
      new_node->next = NULL;
      if (tail == NULL) {
        head = tail = new_node;
      } else {
        tail->next = new_node;
        tail = new_node;
      }
    }

    // Print their locations in memory
    for (Node *current = head; current != NULL; current = current->next) {
      printf("%p\n", current);
    }
  }


That's a controversial part of it. I think that, strictly, if the nodes are allocated as part of one array, it is permissible to use current++ to traverse from one to the other, while it would be UB if they are in separate allocations, even though it logically should work all the same way.



It's a neat trick, but I think a linked list (with the very specific layout where nodes are allocated in order) is the only situation where this trick could possibly be useful?

And I think it only works if Spectre mitigations are disabled anyway?

What the trick does is replace sequential fetches (where each fetch address depends on the result of the previous fetch because, well, linked lists) with parallel fetches. It takes the minimum fetch-to-fetch latency from an L1 cache hit (roughly 3 cycles IIRC) to a cycle or less (most CPUs can do multiple parallel fetches per cycle).

If your data is stored in a vector or a B-tree, accesses are already parallel by default and you'll never need this trick.



Hmm. What do we think about the alternative guess that the address of the next node’s value field is the next address after our current node’s value field memory address? This is, I guess, essentially a guess that we’re pointing at sequential elements of a big array, which sort of begs the question “why not just use an array?” But I wonder if a slightly permuted array is not so unusual, or at least might come up occasionally.



I feel like there are a number of data structures that you might initially set up (or load in) in some preferred contiguous order, which will still remain largely contiguous after modification, so that you get a good tradeoff between cheap operations and fast traversal. You’d then have the option to do partial defragmentation at convenient times, without having to have a convoluted hybrid data structure. But it’s definitely something you’d do in specific critical cases after a lot of analysis.



I may have misread the graphs, but I didn't see the article compare the throughput of going over a fully contiguous linked list vs. a randomized linked list?



The example is slightly contrived (the entire linked list was preallocated), but it seems like the same technique could be a useful performance optimization, such as if successive calls to malloc commonly return pointers with a particular stride.



Ok… all the comments are pretty nitty-gritty and obscure. So unless you’re a compiler hacker or HFT assembly dev, where can someone like me learn all this stuff (besides the Intel/Arm manuals, even though the i386 manuals were nice)?



I enjoyed the read and it taught me new things, I just wish that the reference example would have some minimal practical value.

I don’t think there is any reasonable scenario where you would be using a linked list but the memory is contiguous most of the time.



It's not that unlikely.

- Create the list in some weird order. You know you're going to traverse it a bunch, so you sort it.

- The list is in creation order, and creation is in allocation order. Once in a while you go back and insert or delete a small handful of nodes, hence the linked list.

- There is a natural sort order that you can make contiguous, but you support relatively rare alternative orderings and optimize for the natural order.

Then again, I pretty much agree with you. I think it's a clever trick, but I can't come up with a time when I would use it. Largely that's probably because if the memory is contiguous most of the time, then you should probably be using a vector instead. You can insert/remove by shifting stuff around to handle the rare cases that require the linked list. If performance cliffs are a problem, you can mitigate with a segmented array.



I do think there are examples out there for this value prediction trick that could make sense, which is why I was a bit frustrated by the example they chose.

Someone mentioned sparse vectors for example, where you are guessing 0 rather than pointer++. Anytime where a value is somewhat predictable this could come in handy.



Yeah, this layout of a linked list within one big allocation seems like a niche thing at best.

* I tend to default to a simple vector because it's denser, more CPU cache friendly, less pointer chasy.

* If I really wanted a linked list to avoid occasional expensive ops, I probably would want to be able to grow it without reallocation (much less touching up all the old pointers), so they might all be in separate allocations, and most general-purpose memory allocators won't give me the nice sequential-ness.

* And then I'd probably end up with an unrolled linked list (probably the same thing you're describing as a "segmented array"?), which would reduce the pointer chasing and also make better use of the memory bandwidth through better density, so it'd outperform this trick (see the sketch after this list).

* I guess if I really wanted to be able to reorder stuff within a vector cheaply but was still comfortable with the amortized constant time for growth, I might represent the links as indices within the vector (or in a parallel vector). It'd be useful then. Maybe something like a linked hash map too. But if those ops are common, the 100% accurate prediction of this benchmark is way too optimistic. Could sort it sometimes as you mentioned, but that seems even more niche and fussy.
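For the unrolled-list / segmented-array idea mentioned above, a minimal node layout might look like this; the chunk size and names are made up.

  #include <stdint.h>

  #define CHUNK 16

  typedef struct Segment {
    uint32_t count;          // how many slots of values[] are in use
    uint64_t values[CHUNK];  // several elements per node: denser, cache friendlier
    struct Segment *next;    // one pointer chase per CHUNK elements instead of per element
  } Segment;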

There might be other cases of this trick that I'm more likely to use than this data structure, but I'm scratching my head to think of them.



Neat trick. Though it seems unlikely to be very useful in practice. How often are you going to know the probable value of a pointer without knowing the actual value? I would guess it's pretty rare. Interesting anyway!



It's possible to have "sparse" matrices where most of the values are 0 and only a few are nonzero. So you can guess 0 and cross your fingers.

(There are libraries that implement sparse matrices in a more memory-efficient way. I needed them for a program in Python, but I'm not an expert in Python. I found a few ways, but they are only useful for big matrices with very few nonzero coefficients and have other restrictions to get an improved speed. I finally gave up and used normal np matrices.)
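For what it's worth, the memory-efficient layouts such libraries use are typically something like compressed sparse row (CSR); a rough C sketch of the idea, with invented names:

  #include <stddef.h>

  typedef struct {
    size_t rows;
    const double *vals;     // nonzero values, stored row by row
    const size_t *cols;     // column index of each nonzero value
    const size_t *row_ptr;  // row i owns vals[row_ptr[i] .. row_ptr[i + 1])
  } CsrMatrix;

  // y = A * x: the work is proportional to the number of nonzeros, not rows * cols.
  void csr_matvec(const CsrMatrix *A, const double *x, double *y) {
    for (size_t i = 0; i < A->rows; i++) {
      double acc = 0.0;
      for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
        acc += A->vals[k] * x[A->cols[k]];
      }
      y[i] = acc;
    }
  }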



In the last case they were about 100x100, but only two rows or columns were nonzero, so only about 2% full. We were using einsums to multiply them with 4D arrays, and I solved the problem by writing some code with @guvectorize.

In an old case they were about 1000x1000, like a block checkerboard where one of the colors held only 0s, so about 50% full, but the blocks had different sizes. It was an old project in Fortran, and we used a "for" (do) loop over the blocks and another inside each block.



Neat. That’s fairly sparse. I’m 0% surprised to hear that small sparse matrices haven’t got as much existing code out there; seems like a good excuse to roll your own :)



Bigger-picture, this method amounts to manually assisted speculative execution. And it's not about knowing the not-yet-loaded value, but about knowing what will (very likely) happen as a consequence of that value.



Well, in the article's case, it's the linked list `next` pointers:

https://mazzo.li/posts/value-speculation.html#value-speculat...

In a happy case, those will be laid out sequentially in memory so you can guess the value of the pointer easily.

(That said, your comment still stands, since using linked lists in the first place is much rarer.) But I suppose there are probably a lot of other domains where you might have a performance-critical loop where some hacky guessing might work.



The optimization could be vaguely interesting if you are implementing a Lisp (or some other list-heavy language) and don't want to perform too aggressive non-local optimizations to optimize the layout of lists.



It might be useful in cases where you pre-allocate a large array which you don't randomly access and whose structure mostly doesn't change, but sometimes does. Then you could either reallocate the array and pay a (large) one-time cost, or use this trick.



The article states that the CPU has a limit of 4 instructions per cycle, but the sum2 method issues 5 instructions per cycle. Presumably one of them (maybe the increment) is trivial enough to be executed as a fifth instruction.



depends how you define causality. if you consider the execution of one operation to cause the execution of the next operation in program order, then causality was already broken by simple reordering. if it's a read-write dependency, on the other hand, then it won't be broken (because cpus respect control dependencies); hence, you cannot, for example, replicate oota this way. what's broken is specifically read-read data dependencies. and only on weakly-ordered architectures; it won't do anything on x86



The nodes are adjacent in sum1.

The nodes are adjacent in sum2, and sum2 executes more instructions than sum1, and sum2 is faster than sum1.

The nodes are adjacent in sum3, and sum3 executes more instructions than sum1, and sum3 is faster than sum1.
