Audio models are actually quite similar to image models, but there are a few key differences. First, the autoencoder needs to be designed much more carefully: human hearing is insanely good, and music requires orders of magnitude more compression (image AEs typically do 8×8 spatial downsampling, while audio AEs need to downsample the waveform by thousands of times). Second, the model itself needs to be really good at placing lyrics and beats (similar to placing text in image diffusion): a sixth finger in an image model is fine, but a missed beat can ruin a song. That's why language-model approaches have been really popular in audio: they have a stronger sequential inductive bias than diffusion models, which helps with rhythm and lyric placement.
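To make the compression gap concrete, here's a toy sketch (in PyTorch, with made-up strides and channel counts, not any real model's architecture): stacking strided 1D convolutions with strides 2, 4, 4, 8, 8 gives ~2048× temporal downsampling, versus the 64× (8×8) spatial downsampling of a typical image AE.

```python
# Minimal sketch of why audio autoencoders need far more downsampling than
# image autoencoders. All strides/channel sizes are illustrative assumptions,
# not taken from any specific published model.
import torch
import torch.nn as nn


class ToyAudioEncoder(nn.Module):
    """Stack of strided 1D convolutions; each stride multiplies the
    overall temporal downsampling factor."""

    def __init__(self, strides=(2, 4, 4, 8, 8), channels=64, latent_dim=32):
        super().__init__()
        layers = []
        in_ch = 1  # mono waveform
        for s in strides:
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=2 * s, stride=s, padding=s // 2),
                nn.GELU(),
            ]
            in_ch = channels
        layers.append(nn.Conv1d(channels, latent_dim, kernel_size=1))
        self.net = nn.Sequential(*layers)
        self.downsample = 1
        for s in strides:
            self.downsample *= s  # 2 * 4 * 4 * 8 * 8 = 2048x along time

    def forward(self, wav):  # wav: (batch, 1, samples)
        return self.net(wav)


if __name__ == "__main__":
    enc = ToyAudioEncoder()
    wav = torch.randn(1, 1, 44100 * 10)  # 10 s of 44.1 kHz mono audio
    z = enc(wav)
    print(f"temporal downsampling ~{enc.downsample}x")  # 2048x
    print(wav.shape, "->", z.shape)  # ~441,000 samples -> ~215 latent frames
    # Compare: an 8x8 image AE turns a 512x512 image into a 64x64 latent,
    # i.e. only 64x fewer spatial positions.
```

The second point (rhythm and lyric placement) is exactly where the autoregressive, token-by-token structure of language-model approaches pays off over diffusion's parallel denoising.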
If you're interested in papers (IMO not a great starting point for newcomers, since they make everything seem more complicated than it is):
Stable Audio (similar to our architecture): https://arxiv.org/abs/2402.04825 (code: https://github.com/Stability-AI/stable-audio-tools)
MusicGen (Suno-style architecture): https://arxiv.org/abs/2306.05284 (code: https://github.com/facebookresearch/audiocraft/tree/main)