Well, we do know why gradient descent works (for smooth loss functions), at least for finding a local minimum, because descending toward a minimum is what it does by construction. Similarly, we certainly know how back-propagation works, because it's simple calculus: a backwards application of the chain rule.
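To make the chain-rule point concrete, here's a minimal sketch in plain Python (no frameworks, the toy function f(x) = w2 * relu(w1 * x) and its variable names are just illustrative): the backward pass is nothing more than multiplying local derivatives together from the output back toward the inputs, and a finite-difference check confirms the result.

```python
def forward(x, w1, w2):
    z = w1 * x           # pre-activation
    a = max(z, 0.0)      # ReLU
    y = w2 * a           # output
    return z, a, y

def backward(x, w1, w2, z, a, dy=1.0):
    # Chain rule applied backwards from the output:
    # dy/dw2 = a,  dy/da = w2,  da/dz = 1 if z > 0 else 0,  dz/dw1 = x
    dw2 = dy * a
    da = dy * w2
    dz = da * (1.0 if z > 0 else 0.0)
    dw1 = dz * x
    return dw1, dw2

x, w1, w2 = 2.0, 0.5, -3.0
z, a, y = forward(x, w1, w2)
dw1, dw2 = backward(x, w1, w2, z, a)

# Sanity check: compare dw1 against a finite-difference approximation.
eps = 1e-6
_, _, y_eps = forward(x, w1 + eps, w2)
print(dw1, (y_eps - y) / eps)  # the two numbers should agree closely
```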
Perhaps what you're trying to say is that we don't know why finding local minima of these objectives is so good at solving the underlying problem?