> Scholars believe that the Novgorod Republic had an unusually high level of literacy for the time, with literacy apparently widespread throughout different classes and among both sexes.
To bring things full circle: the cross-entropy loss is the KL divergence. So intuitively, when you're minimizing cross-entropy loss, you're trying to minimize the "divergence" between the true distribution and your model distribution.
This intuition really helped me understand CE loss.
Cross-entropy is not the KL divergence. There is an additional term in cross-entropy, which is the entropy of the data distribution (i.e., independent of the model). So you're right that minimizing one is equivalent to minimizing the other.
Yes, you are totally correct, but isn't that term effectively irrelevant in the cross-entropy loss used in machine learning? Since it is a constant, it does not contribute to the optimization.
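To make the decomposition concrete, here is a minimal numerical sketch (the distributions p and q below are made up purely for illustration, and numpy is assumed): it checks that H(p, q) = H(p) + KL(p || q), so cross-entropy and KL divergence differ only by the entropy of the data distribution, which the model cannot change.

```python
# Minimal sketch: cross-entropy decomposes as H(p, q) = H(p) + KL(p || q).
# p and q are arbitrary example distributions, not taken from the discussion above.
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" data distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
entropy = -np.sum(p * np.log(p))            # H(p): independent of the model q
kl_divergence = np.sum(p * np.log(p / q))   # KL(p || q)

print(cross_entropy)            # prints the same number as the line below
print(entropy + kl_divergence)  # H(p) + KL(p || q)

# Because H(p) does not depend on q, the gradient of H(p, q) with respect to
# the model is the gradient of KL(p || q): minimizing one minimizes the other.
```

(For one-hot labels, H(p) is zero, so the two quantities coincide exactly.)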
Except for when it doesn't. E.g., lightbulbs are now 20x more efficient than in my childhood, but I certainly don't use >20x more light(bulbs). In fact, I'm quite sure it's 1x.