Well, I just told it to use gdb when necessary; MCP wasn't required at all! It also helps to tell it to integrate cpptrace and to always look at the stacks.
Oops, I may have assumed too much familiarity with the problem. The premise of the problem is this:
You're designing a CPU, either for your job or for fun, in your favourite hardware description language. SystemVerilog, VHDL, hell, maybe you went to Berkeley or work at SiFive and you use Chisel. If you aren't familiar with HDLs or digital logic design, you should look into them; it's fun.
To test your CPU, you use a simulator and give your simulated CPU a program to execute. But shit -- you have a bug, and some instruction is producing the wrong value. It would be really nice to step through the code and see what was happening inside your CPU to pinpoint where the bug is.
At this point, you can tell the simulator to generate a waveform (some call it a dump; there may be other names). The waveform contains every signal inside your design -- every clock, every control signal, you name it. This is typically a MASSIVE amount of data, and analyzing these waveforms is how people debug their designs. Worse yet, you just have "raw signals". To find out what PC is being executed, you have to pull up the PC signal in the waveform viewer, then all of the register signals, and then cross-correlate that with the disassembly. In short, you get very intimate information about the design, but organizing that information is the responsibility of the person using the waveform viewer, and it can be tedious.
The insight behind this tool is that all of the data you need to have a gdb-like interface when debugging a simulated CPU is already in the waveform; jpdb just organizes and presents the data in a way that is easy for the developer to parse.
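To make the idea concrete, here's a toy sketch of the cross-correlation step (my own illustration, not jpdb's actual code or interface): given the PC signal sampled from a waveform and a disassembly map, you can present execution as instruction-level steps instead of raw signal values.

```python
# Toy illustration of the idea (not jpdb's actual code): the PC signal sampled
# per cycle from a waveform, plus a disassembly map, is enough to show a
# source-level "step" view instead of raw signal traces.
pc_trace = [0x80000000, 0x80000004, 0x80000004, 0x80000008]  # pc value each cycle
disasm = {
    0x80000000: "addi sp, sp, -16",
    0x80000004: "sw   ra, 12(sp)",   # held for two cycles -> a stall
    0x80000008: "jal  ra, main",
}

last = None
for cycle, pc in enumerate(pc_trace):
    if pc != last:  # only report a "step" when the PC actually changes
        print(f"cycle {cycle:4d}  pc={pc:#010x}  {disasm.get(pc, '<unknown>')}")
    last = pc
```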
It's nice to see a microarchitecture take a risk, and getting perspective on how this design fares in performance, power, and area would be interesting.
Very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm. The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction; I can imagine scenarios where this acts as a "double jeopardy," causing scheduling to lock up because the latency was mispredicted, but one could also speculate that this isn't important because the workload is already memory-bound.
There's an intuition in computer architecture that designs that lean on "static" instruction scheduling mechanisms are less performant than more dynamic mechanisms for general purpose compute, but we've had decades of compiler development since Itanium "proved" this. Efficient Computer (or whatever their name is) is doing something cool too; it's exciting to see where this will go.
> Very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm.
The point appears to be losing maybe a few percent (5%-7%) of performance, in exchange for saving tens of percent of energy consumption.
> The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction
That's just the initial assumption: that load results will appear 4 cycles after the load is started. If it gets to that +4 time and the result has not appeared, then it looks for an empty execution slot starting another 14 cycles along (+18 in total) for a possible L2 cache return.
The original slot's result is marked as "poison", so if any instruction is reached that depends on it, it will also be moved further along the RoB and its original slot marked as poison, and so on.
If the dependent instruction was originally at least 18 cycles along from the load being issued then I think it will just pick up the result of the L2 hit (if that happens), and not have to be moved.
If L2 also misses and the result still has not been returned when you get to the moved instruction, a spare execution slot will again be searched for starting at another 20 cycles along (+38 in total), in preparation for an L3 hit.
The article says that when searching for an empty slot, only a maximum of 8 cycles' worth of slots are searched. It doesn't say what happens if there are no empty execution slots within that 8-cycle window. I suspect the instruction just gets moved right to the end, where all slots are empty.
It also doesn't say what happens if the load doesn't hit in L3. As the main memory latency is under control of the SoC vendor and/or the main board or system integrator (for sure not Condor), I suspect that L3 misses are also moved to the very end.
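To pin down how I'm reading that policy, here's a toy model (my interpretation only, not Condor's implementation; the +4/+18/+38 offsets and the 8-cycle search window come from the description above, everything else is invented):

```python
# Toy model of the replay policy as described above, not Condor's actual design.
# A load's result is first assumed at +4 cycles (L1 hit); on a miss, a free
# execution slot is searched for from +18 (L2) and then +38 (L3), scanning at
# most 8 cycles' worth of slots each time.

EXPECTED_RESULT = {1: 4, 2: 18, 3: 38}   # cache level that hits -> cycles after issue
SEARCH_WINDOW = 8                        # max cycles' worth of slots scanned

def place_replay(free_slots, issue_cycle, missed_level):
    """Pick the cycle a replayed/dependent op lands in after a miss at `missed_level`.

    free_slots: dict of cycle -> number of free execution slots (missing = 1 free)
    missed_level: 1 means the L1 assumption failed, so aim for the L2 return, etc.
    """
    target = issue_cycle + EXPECTED_RESULT[missed_level + 1]
    for cycle in range(target, target + SEARCH_WINDOW):
        if free_slots.get(cycle, 1) > 0:              # empty slot inside the window
            free_slots[cycle] = free_slots.get(cycle, 1) - 1
            return cycle
    # No free slot in the window: the article doesn't say, but (as guessed above)
    # just push the op to the end, where everything is still empty.
    return max(free_slots, default=target) + SEARCH_WINDOW

# A load issued at cycle 100 misses L1, and cycles 118-121 are fully booked:
busy = {c: 0 for c in range(118, 122)}
print(place_replay(busy, issue_cycle=100, missed_level=1))   # -> 122
```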
> we've had decades of compiler development since Itanium "proved" this
Sure, but as long as someone is still relying on "the assumption that the latency of a load will be an L1 hit," they're in trouble for most of what we think of as "general purpose" computing.
I think you get it, but there's this overall trope that the issue with Itanium was purely compiler-related: that we didn't have the algorithms or compute resources to parallelize enough of a single program's control flow to correctly fill the dispatch slots in a bundle. I really disagree with this notion: this might have been _a_ problem, but it wasn't _the_ problem.
Even an amazing compiler that can successfully resolve all data dependencies inside a single program and produce a binary containing ideal instruction bundling has no idea what's in dcache after an interrupt or context switch. So every load, and all of its dependents, risks a stall (or, in this case, a replay) on a statically scheduled architecture, while a modern out-of-order architecture can happily keep going, even speculatively taking both sides of branches.
The modern approach to optimize datacenter computing is to aggressively pack in context switches, with many execution contexts (processes, user groups/containers, whatever) per guest domain and many guest domains per hypervisor.
Basically: I have yet to see someone successfully use the floor plan they took back from not doing out-of-order to effectively fill in for memory latency in a "general purpose" datacenter computing scenario. Most designers just add more cores, which only makes the problem worse (even adding more cache would be better than more cores!).
VLIW and this kind of design have a place: I could see a design like this being useful in place of Cortex-A or even Cortex-X in a lot of edge compute use cases, and of course GPUs and DSPs already rely almost exclusively on some variety of "static" scheduling already. But as a stated competitor to something like Neoverse/Graviton/Veyron in the datacenter space, the "load-bearing load" (I like your description!) seems like it's going to be a huge problem.
> we've had decades of compiler development since Itanium "proved" this.
I think an equally large change is the enormous rise of open source and supply-chain focus. When Itanium came out, there was tons of code businesses ran that had been compiled years ago, lots of internal reimplementation of what would now be library code, and places commonly didn't upgrade for years because an upgrade was often also a licensing purchase. Between open source and security, it's a lot more reasonable now to think people will be running optimized binaries from day one, and in many cases the common need to support both x86 and ARM will have flushed out a lot of compatibility warts, along with encouraging the use of libraries rather than writing so much in-house.
This is still using a Tomasulo-like algorithm; it's just been shifted from the back end to the front end. And instructions don't lock up on an L1 miss. Instead, the results of that instruction are marked as poisoned, and the front end replays the affected micro-ops further forward in the execution stream once the L1 miss is resolved. As the article points out, this replay is likely to fill otherwise unused execution slots on general purpose code, as OoO CPUs rarely sustain their full execution width.
It's a smart idea, and has some parallels to the Mill CPU design. The back end is conceptually similar to a statically scheduled VLIW core, and the front end races ahead using its matrix scorecard, trying to queue up as much as it can in the face of unpredictable latencies.
The last post on their forum was a month ago; they claimed that they were alive and making progress, but I dunno ...
What I'm afraid of is that perhaps they have been shifting what their goal is a little too often, which of course would delay their time to market.
For example, I think they have shifted from straightforward fixed-SIMD to scalable vectors of some sort, and last I heard they were talking about AI .. which usually means that there's some kind of support for matrix multiplication.
A point of frustration with newer languages, one that sus continues, is the lack of thought given to simulation and testbench design, and how they integrate with the language.
While it would be nice to have more elegant support for "modern" codegen in sv/verilog/vhdl, the real unergonomic experience is testbench design and integration. The only real options (for sv/verilog; I have less experience with vhdl) are: use Verilator and write your tb in cpp, use Verilator and write your testbench in cocotb, or work at a chip design company, use one of the big 3's compilers, and maybe use UVM or cocotb. Verilator and cocotb are okay, but you're crossing language boundaries and referencing generated code -- it is both mechanical and complex to get any design working with them.
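For flavor, this is roughly what the cocotb path looks like for a hypothetical counter DUT (the module and port names are made up): the design stays in (System)Verilog, the testbench lives in Python, and every signal access crosses the language boundary through cocotb's handle layer.

```python
# Minimal cocotb testbench for a hypothetical counter DUT with clk/rst/count
# ports. The DUT itself is elsewhere in (System)Verilog; all access goes
# through cocotb's generated handles on `dut`.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def counter_increments(dut):
    # Drive a 10 ns clock and pulse reset for one cycle.
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
    dut.rst.value = 1
    await RisingEdge(dut.clk)
    dut.rst.value = 0
    await RisingEdge(dut.clk)

    # Assuming the counter increments once per clock when out of reset.
    before = int(dut.count.value)
    await RisingEdge(dut.clk)
    assert int(dut.count.value) == before + 1
```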
If sus had first-class interfaces to create testbenches that could map to UVM or Verilator, it would be much more interesting. Spade does some interesting things by having its own package manager, but doesn't (afaik) expose a ton within the language itself.
There is a natural tension between developing an API that is nice to use and having a full fledged graph compiler. Most graph compilers, and the hardware that requires them will be complex and difficult to approach. The "original sin" was pytorch vs tensorflow -- tensorflow capturing the entire graph and then compiling it with XLA (or whatever it was before, I'm probably mixing up tf1 and tf2 here) was such an intractable mess to actually hack on (also the runtime had unapproachable complexity, from what I recall). This has probably changed, but pytorch won out because it was both nice to use and develop.
There are clear reasons why a hardware company would use a graph compiler -- they think such an approach is higher performance, and makes Tenstorrent look better on performance per dollar when compared to competitors (read: nvda).
There is some legitimate criticism of TT here: their hardware is composed of simple blocks that compose into a complex system (5 separate CPUs being programmed per Tensix tile, many tiles per chip), and that complexity has to be wrangled in the software stack -- paying that complexity in hardware so there is less of a VLIW model in software might remove a few abstractions.
Smelt is build system agnostic, but it seeks to be the "invoker", so to speak -- we've used it with many build systems, including make, cmake, and bazel.
There are a few reasons to separate build and test systems:
Build systems struggle to express constrained random testing, and some build systems even struggle to express directed testing sweeps, which are common patterns in design verification (there's a sketch of this below).
The other reason is that testing is often treated as a "leaf node" in a build graph, and it's not possible to describe tests depending on other tests.
Overall, testing is a different problem than building, and circuit design often requires complex testing flows that build systems aren't designed for.
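As a purely hypothetical sketch (this is not smelt's API) of what a build graph has trouble expressing: the actual test set only exists after a seed sweep is expanded, and the coverage-merge step depends on every test in that sweep.

```python
# Hypothetical test-list expansion, illustrating two things a build system
# handles poorly: a constrained-random seed sweep that generates the test set
# at run time, and a "test" (coverage merge) that depends on other tests.
# The VCS-style plusargs and `urg` invocation are just for flavor.
import random

def constrained_random_tests(base_name, n_seeds, master_seed=0):
    rng = random.Random(master_seed)          # reproducible seed sweep
    tests = []
    for i in range(n_seeds):
        seed = rng.getrandbits(32)
        tests.append({
            "name": f"{base_name}_seed{i}",
            "cmd": ["./simv", f"+ntb_random_seed={seed}", f"+UVM_TESTNAME={base_name}"],
            "depends_on": [],
        })
    # A downstream step that only makes sense after every seed has run.
    tests.append({
        "name": f"{base_name}_merge_coverage",
        "cmd": ["urg", "-dir", "cov.vdb"],
        "depends_on": [t["name"] for t in tests],
    })
    return tests

for t in constrained_random_tests("alu_smoke", n_seeds=3):
    print(t["name"], "->", t["depends_on"])
```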
Agreed. One of the main properties of a (good) build system is very aggressive caching. This only works if your build steps are deterministic. From my experience (testing firmware for embedded devices), even if your firmware simulation is totally deterministic, there is plenty of non-determinism created by UB, memory corruption, crazy things interns do with test scripts... So we don't typically cache our test results.
So do you support the APIs of different tools? Is there a list somewhere of tools that are supported? Questa? VCS? Or do users need to figure that out?
Reproducible seeding, and more generically having "test list arguments," are definitely on our radar and wouldn't be outside the scope of the project.
Tagging is an interesting problem -- do you find that any querying schemes emerge when trying to run a specific set of tests, or does running a test group named "A" normally suffice? Test tagging should be on the roadmap, thanks for calling it out.
Re: test weighting, I'm unsure what you're describing exactly -- would you like a mechanism to describe how many times a "root test" gets duplicated with some sampling of random input parameters?
I appreciate the code snippets they put in the pub; when HW papers abstract that out, the system doesn't feel grounded in reality. Still, the open problem for this class of architecture IMO is programmability. A composable, well-designed API for many-core machines would be worth its weight in gold.