There is a finite number of register windows, usually 8, but only 7 can be used because the 8th serves as a "sentinel" to detect overflow and underflow.
Once register windows are full (a function call wants to activate the next register window but there is no unused window left) window overflow occurs and a trap handler is activated.
The trap handler "unwinds" the register windows and stores all the contents in memory (stack). Now the next function can continue with an empty set of register windows. Once you return from the function, the contents of the windows have to be restored (window underflow trap).
The problem is that the trap handlers can't know which of the registers in each window were actually in use, so all of them have to be saved and restored. The ratio of useful to wasted memory traffic worsens when you write smaller functions that use fewer registers and nest deeper.
So there are two issues: 1. You can't really know at which point in your program an underflow or overflow will occur, because that changes depending on the exact path of execution through the program. 2. Unnecessary memory write/read operations. While ca. 120 x 32-bit words is not that much, with an 8-bit wide SRAM, some waitstates and EDAC this might be noticeable. (Consider that the LEON processors have a data cache for read access, but for writing only a "store buffer" that queues a few memory writes.)
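To make the first point concrete, here is a toy model in C of how the trap count depends on call depth. It is only a sketch: the names (`WindowModel`, `model_call`, `model_return`, `run_chain`) are made up for illustration, it assumes 7 usable windows and 16 words (locals plus ins) per window, and the number of windows moved per trap is a parameter because handlers differ in how much they spill at once.

```c
/* Toy model of SPARC register-window traps. All names are hypothetical. */
enum { USABLE_WINDOWS = 7, WORDS_PER_WINDOW = 16 };

typedef struct {
    int depth;         /* live call frames */
    int resident;      /* frames currently held in the register file */
    int spill;         /* windows moved per trap (assumption: handler-specific) */
    long overflows;    /* overflow traps taken */
    long underflows;   /* underflow traps taken */
    long words_moved;  /* 32-bit words written to or read from the stack */
} WindowModel;

/* Windows moved by one trap: at least 1, at most what is available. */
static int windows_per_trap(const WindowModel *m, int available) {
    int n = m->spill < 1 ? 1 : m->spill;
    return n < available ? n : available;
}

/* A call needs a fresh window; if none is free, the oldest ones are spilled. */
static void model_call(WindowModel *m) {
    if (m->resident == USABLE_WINDOWS) {
        int n = windows_per_trap(m, m->resident);
        m->resident -= n;
        m->overflows++;
        m->words_moved += (long)n * WORDS_PER_WINDOW;
    }
    m->resident++;
    m->depth++;
}

/* A return needs the caller's window; if it was spilled, it is filled back. */
static void model_return(WindowModel *m) {
    if (m->depth == 0) {
        return;
    }
    m->depth--;
    m->resident--;
    if (m->resident == 0 && m->depth > 0) {
        int n = windows_per_trap(m, m->depth);
        m->resident += n;
        m->underflows++;
        m->words_moved += (long)n * WORDS_PER_WINDOW;
    }
}

/* Nest `calls` frames deep, then return all the way out. */
static WindowModel run_chain(int spill, int calls) {
    WindowModel m = {0, 0, spill, 0, 0, 0};
    for (int i = 0; i < calls; i++) {
        model_call(&m);
    }
    for (int i = 0; i < calls; i++) {
        model_return(&m);
    }
    return m;
}
```

In this model `run_chain(1, 7)` never traps, while `run_chain(1, 10)` takes three overflow and three underflow traps and moves 96 words. Feeding it a different call/return sequence moves the traps to different places, which is exactly the unpredictability described above.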
With -mflat, every register is saved on the stack by the caller or callee (as the ABI demands). This means that the memory accesses are predictable and spread out over each function call.
So, my personal conclusion is that register windows are an intriguing idea on the surface but become useless when you aren't writing 80s spaghetti code. There were many similar ideas at that time, e.g., Am29000.
We'd considered using -mflat, but we're not that performance constrained (and prefer the slightly smaller binary size with register windows enabled). I may do some profiling of the underflow/overflow traps though, since you've now got me second-guessing myself.
Registers asr22/23 contain a cycle counter that you can use to time stuff. If it's not present, there's a register in the DSU that counts cycles but that requires an access via the AHB bus. You can measure a lot of things with those cycle counters, like context switch and interrupt handling times, memcpy vs naive for-loop, linear vs. binary search on small arrays...
I'd expect a few microseconds per overflow at most, but it depends a lot on the characteristics of the system. Of course, if the application is not sensitive to a few microseconds here and a few microseconds there, that optimization might not be worth it.
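A rough sketch of how such a measurement could look in C. The helper names are made up, and it assumes that %asr23 holds the low 32 bits of the up-counter and that the counter is enabled on your part, so check the processor manual first. The non-SPARC branch is only there so the snippet compiles on a host machine; it is not cycle-accurate.

```c
#include <stdint.h>
#include <time.h>

/* Read a free-running 32-bit counter (hypothetical helper). */
static inline uint32_t read_cycles(void) {
#if defined(__sparc__)
    uint32_t v;
    /* Assumption: %asr23 is the low word of the LEON up-counter. */
    __asm__ volatile ("rd %%asr23, %0" : "=r"(v));
    return v;
#else
    /* Host fallback so the sketch builds off-target; not cycle-accurate. */
    return (uint32_t)clock();
#endif
}

/* Unsigned subtraction wraps modulo 2^32, so a single counter rollover
 * between the two samples still yields the right difference. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end) {
    return (uint32_t)(end - start);
}

/* Convert cycles to microseconds; clock_mhz is the CPU clock in MHz. */
static inline double cycles_to_us(uint32_t cycles, double clock_mhz) {
    return (double)cycles / clock_mhz;
}
```

Sample `read_cycles()` before and after the code you care about, pass both values to `cycles_elapsed()`, and convert with your actual clock frequency. At an assumed 50 MHz, for example, 1000 cycles would be 20 microseconds.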