
I don't quite understand MRPO.

So during the final stage they try to ensure the model doesn't get the right answer every time, but only about 50% of the time, so as to avoid killing all variability -- very sensible. Then they compute a measure of this, take the negative exponential of that measure, and scale the advantage by it.
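As a concrete reading of that mechanism (a sketch only -- the function name, the 0.5 target, and the exact form of the "measure" are my guesses, not the paper's):

```python
import math

def reweighted_advantages(rewards, advantages, target=0.5, beta=1.0):
    # Hypothetical sketch of the scheme as I read it; the 0.5 target and
    # the choice of |p - target| as the "measure" are my assumptions.
    # rewards: 0/1 correctness for each sampled answer to one question
    # advantages: the per-sample advantages for that question
    p = sum(rewards) / len(rewards)       # empirical solve rate
    deviation = abs(p - target)           # distance from the 50% sweet spot
    weight = math.exp(-beta * deviation)  # negative exponential: peaks at 50%
    return [weight * a for a in advantages]
```

Under that reading, a question at a 50% solve rate keeps its advantages untouched (weight 1), while one at 0% or 100% has them shrunk by exp(-beta/2) -- which is what makes it look like curriculum weighting to me.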

So a question matters in proportion to the variability of its answers. Isn't this more curriculum learning than actually suppressing things that don't vary enough?

Basically focusing on questions that are still hard instead of trying to push the probability of problems it can often already solve to 99.99%?

Also very reasonable, but this isn't how they describe it. Instead, from their description I would think they're sort of forcing entropy to be high somehow.

I think the way I'd have titled it would be something like "Dynamic curricula to preserve model entropy".


