https://github.com/ixaxaar/pytorch-dni
The concept here goes a bit further and tries to replicate backprop with an external network, arguing that this is probably what the brain actually does.
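To make that concrete, here is a minimal sketch of the idea in PyTorch. It assumes a small MLP as the gradient predictor; the class name `SyntheticGradient` and all dimensions are illustrative and not taken from the linked repo.

```python
import torch
import torch.nn as nn

class SyntheticGradient(nn.Module):
    """Small auxiliary net that predicts dLoss/dActivation from the activation itself."""
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, activation):
        return self.net(activation)

layer = nn.Linear(784, 256)                        # layer we want to update right away
dni = SyntheticGradient(256)                       # its gradient predictor
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(32, 784)
h = layer(x)

# Seed the backward pass with a *predicted* gradient of the loss w.r.t. h,
# so `layer` can be updated without waiting for the true back-propagated signal.
predicted_grad = dni(h.detach())
h.backward(predicted_grad.detach())
opt.step()

# When the true gradient for h eventually arrives from the layers above, the
# predictor itself is trained by regression against it, e.g.
# torch.nn.functional.mse_loss(predicted_grad, true_grad.detach()).
```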
I'm not seeing the connection. This work is about low-level optimization of matrix multiplication. The repo you linked seems to be about replacing back-propagated gradients with a cheaper estimate. What's the similarity you see between these two?
Correct, I think I misread it as "use a small neural net to approximate matrix multiplication", when it actually seems to be "use cheaper replacements for matrix multiplication without much accuracy loss".
This feels like a "no free lunch" situation. I would imagine that any time saved by approximating the gradients this way would be lost to the extra training iterations needed because of the loss in gradient accuracy. Is that not the case?
However, if this really is the biological analogue of credit assignment, it might scale better than training LLMs from scratch every time. Even if it could only approximate the gradients of a new network to a certain degree, normal backprop could then fine-tune for a few epochs or so, dramatically reducing overall training costs.
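Roughly, the hybrid could look like this hypothetical two-phase schedule. `grad_predictor`, `synthetic_step`, `bulk_batches` and `finetune_batches` are all made-up names for the sake of the sketch, not anything from the linked repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-in for a DNI-style module that predicts dLoss/dLogits from the logits.
# In a real setup, predictors would sit at layer boundaries so each layer can
# update without waiting for the layers above.
grad_predictor = nn.Linear(10, 10)

def synthetic_step(x):
    # Cheap update: seed the backward pass with a *predicted* gradient
    # instead of the true loss gradient.
    opt.zero_grad()
    logits = model(x)
    logits.backward(grad_predictor(logits.detach()).detach())
    opt.step()

def backprop_step(x, y):
    # Exact update: ordinary backprop, used only for the final fine-tuning epochs.
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()

# Phase 1: bulk of training with approximate gradients.
# for x, _ in bulk_batches: synthetic_step(x)
# Phase 2: a few epochs of exact backprop to recover accuracy lost to the approximation.
# for x, y in finetune_batches: backprop_step(x, y)
```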