| InfiniBand uses RDMA, which is different from ordinary DMA. Your IB card sends the data point to point to the client's IB card, which writes it directly into RAM. The IB driver notifies you that the data has arrived (generally via IB-accelerated MPI), and you directly LOAD your data from that memory location [0].
IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what led NVIDIA to acquire them. From the linked document: "Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer." [0]: https://docs.redhat.com/en/documentation/red_hat_enterprise_... |
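To make that concrete, here is a minimal sketch of the one-sided programming model using MPI_Put, which an MPI library can map onto RDMA writes over InfiniBand; the buffer size, ranks, and values are arbitrary assumptions for illustration, not anything from the linked document.

```c
/* Minimal MPI one-sided sketch: rank 0 writes directly into rank 1's
 * exposed buffer. Over InfiniBand, the MPI library can implement this
 * as an RDMA write, so the data lands in the target's memory without
 * the target CPU copying it. Sizes and ranks are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0};          /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(buf, sizeof buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);           /* open the access epoch */
    if (rank == 0) {
        double payload[1024];
        for (int i = 0; i < 1024; i++) payload[i] = i;
        /* Write payload straight into rank 1's buf. */
        MPI_Put(payload, 1024, MPI_DOUBLE, 1, 0, 1024, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);           /* close the epoch: data now visible */

    if (rank == 1)
        printf("buf[42] = %f\n", buf[42]);  /* just LOAD it */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Run with at least two ranks, e.g. mpirun -np 2.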
| This leads, in the extreme, to the idea of a huge array of very simple cores, which I believe is something that has been tried but never really caught on. |
| That description reminds me of GreenArrays' (https://www.greenarraychips.com) Forth chips that have 144 cores – although they call them "computers" because they're more independent than regular CPU cores, and e.g. each has its own memory and so on. Each "computer" is very simple and small – with a 180nm geometry they can cram 8 of them into 1mm^2 – and the chip is fairly energy-efficient.
Programming for these chips is apparently a bit of a nightmare, though. Because the "computers" are so simple, even e.g. calculating MD5 turns into a fairly tricky proposition: you have to spread the algorithm across multiple computers with very small amounts of memory, so something that would be very simple on a more classic processor turns into a very low-level multithreaded ordeal. |
| Worth noting that the GreenArrays chip is 15 years old. 144 cores was a BIG DEAL back then. I wonder what a similar architecture fabricated on a modern process could achieve. 1440 cores? More? |
| Those weren't "real" cores. You know what current chip has FUs that it falsely calls "cores"? That's right: Nvidia GPUs. I think that's the answer to your question (pushing 20k). |
| In what way were they not “real” cores? They had their own operating environment, completely independent of other cores. GPU execution units, on the other hand, are SIMD: a single instruction stream. |
| IIRC the first Phi had SMT4 in a round-robin fashion, similar to the Cell PPUs. To make a core run at full speed, you should schedule 4 threads on it. |
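As a rough illustration of what "schedule 4 threads on it" looks like, here is a minimal Linux pthread sketch that pins 4 workers to logical CPUs 0–3; the assumption that those four IDs are the hardware threads of a single core is mine, so check the real topology with lscpu before reading anything into the numbers.

```c
/* Pin 4 worker threads to 4 logical CPUs assumed to belong to one
 * physical core (IDs 0-3 are illustrative, not guaranteed by any
 * particular chip). Linux-specific GNU extensions. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    volatile double x = 0;
    for (long i = 0; i < 100000000L; i++)    /* placeholder compute loop */
        x += i * 0.5;
    printf("thread %ld finished on cpu %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++) {
        pthread_attr_t attr;
        cpu_set_t set;
        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);               /* one hardware thread each */
        pthread_attr_setaffinity_np(&attr, sizeof set, &set);
        pthread_create(&t[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```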
| They basically do: it's pretty common to clock gate inactive parts of the ALU, which reduces their power consumption greatly. Modern processor power usage is very workload-dependent for this reason. |
| Intel is removing SMT from their next-gen mobile processors.
My guess is this will help them improve ST perf. We will see how well it works, and whether AMD will follow. |
| Most benchmarks are fine-tuned code that provides pretty much the perfect case for running with SMT off, because the real-world use cases that benefit from SMT are absent in those benchmarks. |
| I find that LLMs with web access are a good fit for this kind of search, at least to point me in the right direction. The URLs provided are mostly hallucinations, however. |
| The whole point of SMT is to maximize utilization of a superscalar execution engine.
I wonder if that trend means people think superscalar is less important than it used to be. |
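One rough way to see the utilization argument in code: a thread stuck on a long dependency chain leaves most of a wide core's issue slots idle, and that idle width is exactly what an SMT sibling can use. A sketch, with arbitrary iteration counts and an assumed CPU numbering in the taskset example:

```c
/* Sketch: each loop iteration depends on the previous result, so one
 * thread keeps only part of a superscalar core busy per cycle. Run one
 * copy, then two copies pinned to SMT siblings of the same core, e.g.
 * `taskset -c 0,4 ./a.out 2` (the 0/4 pairing is an assumption about a
 * typical Intel numbering). If wall time barely changes, the second
 * thread ran in issue slots the first one left empty. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *chain(void *arg) {
    (void)arg;
    volatile unsigned long long x = 1;
    for (long i = 0; i < 500000000L; i++)
        x = x * 6364136223846793005ULL + 1;  /* serial dependency chain */
    return NULL;
}

int main(int argc, char **argv) {
    int n = argc > 1 ? atoi(argv[1]) : 1;    /* number of threads */
    pthread_t t[8];
    for (int i = 0; i < n && i < 8; i++)
        pthread_create(&t[i], NULL, chain, NULL);
    for (int i = 0; i < n && i < 8; i++)
        pthread_join(t[i], NULL);
    puts("done");
    return 0;
}
```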
| Good summary overall, although it seemed a little muddled in places.
Would love to know some of the tricks of the trade from insiders (not relating to security, at least). |
| What I think is worth knowing is that the compute units in GPUs also use SMT, usually at a level of 7 to 10 threads per CU. This helps to hide latency. |
| One of the biggest mistakes users make is a mental model of SMT that imagines the existence of one "real core" and one inferior one. The threads are coequal in all observable respects. |
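One way to see this on Linux: sysfs simply lists the logical CPUs of each physical core as peer "thread siblings", with no field marking either of them as primary. A small sketch (the sysfs path is Linux-specific, and the 8-CPU cap is an arbitrary assumption):

```c
/* Print the SMT siblings of each logical CPU from Linux sysfs.
 * Siblings list each other symmetrically; nothing designates a
 * "primary" vs "secondary" hardware thread. */
#include <stdio.h>

int main(void) {
    char path[128], line[64];
    for (int cpu = 0; cpu < 8; cpu++) {      /* 8 CPUs: arbitrary cap */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                           /* no such CPU: stop */
        if (fgets(line, sizeof line, f))
            printf("cpu%d siblings: %s", cpu, line);
        fclose(f);
    }
    return 0;
}
```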
| > For example, the BEAM isn't optimized for throughput...
A weird take, given that Erlang powers telecom systems with literally millions of connections at a time. |
| What’s a dumb way to do computation? Using objects? I’m generally suspicious of divergence from the ideal form of computation (math, applied to a big dumb array), but C++ is quite popular.
I gather that in the early days the LPDDR used in laptops was slower too, and since cores were scarce, this was more valuable there. Lately, though, we often have more cores than we can scale with, and the value is harder to appreciate. We even avoid scheduling work on a core shared with an important thread, to avoid cache contention, because we know single-threaded performance will be the bottleneck.
A while back I was testing Efficient/Performance cores and SMT threads for MT rendering with DirectX 12; on my i7-12700K I found no benefit to either: just using P-cores took about the same time to render a complex scene as P+SMT or P+E+SMT. It's not always a wash, though: on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.