Hey. In machine learning we often have to measure distances between different probability distributions. This is actually a hard problem, and there is a common misconception that the KL divergence is the default (and best) way to do it. I believe that this is not true.
In fact, the KL divergence is not a metric: it is not symmetric, and it does not satisfy the triangle inequality. It has nice properties that make it useful, but there are other metrics that are more useful in specific scenarios. Sadly, these metrics are not taught in introductory machine learning courses, yet there is a research trend of using them in common architectures where the loss usually involves the KL divergence, such as VAEs and DDPMs.
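To see the asymmetry concretely, here is a quick check (a small sketch using `scipy.stats.entropy`, which computes the KL divergence when you pass it two distributions; the example distributions are just made up):

```python
import numpy as np
from scipy.stats import entropy

# Two discrete distributions over the same three outcomes.
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])

# entropy(p, q) computes KL(p || q) = sum_i p_i * log(p_i / q_i).
print(entropy(p, q))  # KL(p || q) ~ 0.347
print(entropy(q, p))  # KL(q || p) ~ 0.416 -- a different number, so KL is not symmetric
```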
For example, here is some code for a few common divergences and distances.
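(This is a minimal sketch, assuming discrete distributions given as 1-D NumPy arrays over the same support; the function names are my own, not from any library.)

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps avoids log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: a symmetrized, bounded variant of KL."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def total_variation(p, q):
    """Total variation distance: half the L1 distance between the distributions."""
    return 0.5 * np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

def wasserstein_1d(p, q):
    """1-Wasserstein (earth mover's) distance for distributions on the same
    unit-spaced 1-D grid: the L1 distance between the two CDFs."""
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))
```

For production use, SciPy already ships `scipy.spatial.distance.jensenshannon` and `scipy.stats.wasserstein_distance`, which cover the last two cases.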
Anyway, for more details, here is the Colab where I implemented some of these metrics and more: Link to code