No, Yes and Yes. The instruction decode wars aren't over, it's just that the competition has plateaued :) One of the real big hurdles was decoding enough ops/cycle to compete, and the Mill's 33 ops/cycle is way, way more than everyone else manages, thanks to the split-stream and multi-phase decoding cunning.
And of course this is intimately dovetailed to cache, so Yes Yes Yes to everything you said too.
We aggressively speculate the cheap branches (which are the common type) and we aggressively inline, but mostly it's to software-pipeline loops (which make up 80% of most code).
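
For anyone unfamiliar with the term, here is a rough sketch of what software pipelining a loop means in general. It is only an illustration of the textbook technique, not of how the Mill actually does it; the function names and the three-stage load/multiply/store split are made up for the example. The idea is that each step of the pipelined loop does work for three different iterations at once, so a wide machine can issue all three stages in the same cycle.

```python
def plain_loop(src, k):
    """Reference version: each iteration runs load, multiply, store in sequence."""
    out = [0] * len(src)
    for i in range(len(src)):
        x = src[i]      # stage 1: load
        y = x * k       # stage 2: multiply
        out[i] = y      # stage 3: store
    return out


def pipelined_loop(src, k):
    """Software-pipelined version (hypothetical sketch).

    At each step the store stage handles iteration step-2, the multiply
    stage handles iteration step-1, and the load stage handles iteration
    step. The three stages in one step do not depend on each other, which
    is what would let wide hardware run them in parallel.
    """
    n = len(src)
    out = [0] * n
    loaded = None    # value loaded in the previous step, waiting for multiply
    product = None   # value multiplied in the previous step, waiting for store

    # n + 2 steps: the first two fill the pipeline (prologue),
    # the last two drain it (epilogue).
    for step in range(n + 2):
        if step >= 2:            # store stage, iteration step-2
            out[step - 2] = product
        if 1 <= step <= n:       # multiply stage, iteration step-1
            product = loaded * k
        if step < n:             # load stage, iteration step
            loaded = src[step]
    return out
```

Both functions return the same result; the pipelined one just overlaps the stages of neighbouring iterations instead of finishing one iteration before starting the next.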