Thanks for your reply; you raise a very good point that transformer models are a lot more complex. I'd argue that conceptually they're the same, just with the data and process more abstracted. Autoencoded data implies efficient representations, basically semantically abstracted data, and opting for measures like backpropagation through time.
So, as in my sister reply, I don't see the backprop, but maybe I'm missing it. This article does use the word, but only in a generic way:
"For example, when doing the backpropagation (the technique through which the models learn), the gradients can become too large"
But I think this is more of a borrowing; the term isn't used again in the description and may just be a misconception. There's no use of the backprop term in the original paper, nor any stage of learning where output errors are run through the whole network in a deep regression.
What I do see in transformers is localized use of gradient descent, and backprop in NNs also uses GD... but that seems to be the extent of it.
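To make the distinction I'm drawing concrete, here's a rough numpy sketch (toy data, made-up shapes, purely illustrative): gradient descent is just the weight-update rule, while backprop is the chain-rule bookkeeping that pushes the output error back through earlier layers to get their gradients.

```python
# Sketch only: gradient descent vs. backprop. Toy data, arbitrary dimensions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))            # toy inputs
y = rng.normal(size=(8, 1))            # toy targets

# "Localized" gradient descent: one linear layer, gradient computed directly,
# no error propagated through any other layer.
W = rng.normal(size=(3, 1))
for _ in range(100):
    err = x @ W - y                    # forward pass
    grad_W = x.T @ err / len(x)        # gradient of MSE wrt W
    W -= 0.1 * grad_W                  # gradient-descent update

# Backprop proper: two layers, output error pushed back through the hidden
# layer via the chain rule to get the first layer's gradient.
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
for _ in range(100):
    h = np.tanh(x @ W1)                # forward, hidden layer
    err = h @ W2 - y                   # forward, output layer
    grad_W2 = h.T @ err / len(x)       # output-layer gradient
    d_h = (err @ W2.T) * (1 - h**2)    # error propagated backward through tanh
    grad_W1 = x.T @ d_h / len(x)       # hidden-layer gradient via chain rule
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2
```

Both loops end with the same GD update rule; the difference is only in how the gradients are obtained.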
Backpropagation Through Time (BPTT) is an adaptation of backpropagation used for training recurrent neural networks (RNNs), which are designed to process sequences of data and have internal memory. Because the output at a given time step might depend on inputs from previous time steps, the forward pass involves unfolding the RNN through time, which essentially converts it into a deep feedforward neural network with shared weights across the time steps. The error for each time step is computed, and then BPTT is used to calculate the gradients across the entire unfolded sequence, propagating the error not just backward through the layers but also backward through the time steps. Updates are then made to the network weights in a way that should minimize errors for all time steps. This is computationally more involved than standard backpropagation and has its own challenges such as exploding or vanishing gradients.
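For anyone who wants to see the unfolding mechanics, here's a rough numpy sketch of BPTT on a toy vanilla RNN (made-up dimensions, no biases, purely illustrative): the per-step errors flow backward through the layers and backward through the time steps, accumulating into gradients for the shared weights.

```python
# Sketch only: BPTT on a tiny vanilla RNN. Toy data, arbitrary dimensions, no biases.
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 2, 4                     # sequence length, input dim, hidden dim
xs = rng.normal(size=(T, n_in))            # toy input sequence
ys = rng.normal(size=(T, 1))               # toy target for each time step
Wx = rng.normal(size=(n_in, n_h)) * 0.5    # input-to-hidden weights
Wh = rng.normal(size=(n_h, n_h)) * 0.5     # recurrent weights, shared across steps
Wo = rng.normal(size=(n_h, 1)) * 0.5       # hidden-to-output weights

# Forward pass: unfold the RNN through time, keeping every hidden state.
hs, errs = [np.zeros(n_h)], []
for t in range(T):
    h = np.tanh(xs[t] @ Wx + hs[-1] @ Wh)
    hs.append(h)
    errs.append(h @ Wo - ys[t])            # per-step output error

# Backward pass (BPTT): walk the time steps in reverse, accumulating
# gradients for the shared weights and carrying error into earlier steps.
gWx, gWh, gWo = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wo)
d_next = np.zeros(n_h)                     # error arriving from step t+1
for t in reversed(range(T)):
    gWo += np.outer(hs[t + 1], errs[t])
    d_h = (Wo @ errs[t] + d_next) * (1 - hs[t + 1] ** 2)   # back through tanh
    gWx += np.outer(xs[t], d_h)
    gWh += np.outer(hs[t], d_h)
    d_next = Wh @ d_h                      # propagate the error backward in time

for W, g in ((Wx, gWx), (Wh, gWh), (Wo, gWo)):
    W -= 0.1 * g / T                       # one gradient-descent update
```

The `d_next = Wh @ d_h` line is the repeated multiplication by the recurrent weights as the error travels back through the steps, which is exactly where the exploding/vanishing gradient issue mentioned above comes from.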