How does BDH handle long-range dependencies compared to Transformers, given its locally interacting neuron particles? Does the scale-free topology implicitly support efficient global information propagation?
From the authors: great question. If you take an "easy" long-range-dependency task where a Mamba-like architecture flies (and the Transformer doesn't, or gets messy), the hatchling should fly as well. For more ambitious benchmarks, give it a try on a problem you care about. The paper is deliberately vanilla and focused on explaining what's happening inside the model, but it should be a good starting point for architecture tweaks and experiments.
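As a concrete way to probe this, here is a minimal sketch (not from the paper) of a synthetic associative-recall task of the kind often used to stress long-range dependencies in Mamba-style evaluations; the sequence lengths, vocabulary size, and function names are illustrative choices, not part of BDH or its codebase.

```python
import random

def make_recall_example(num_pairs=64, vocab_size=256, seed=None):
    """Build one synthetic associative-recall example (illustrative only).

    The sequence lists `num_pairs` (key, value) token pairs, then repeats a
    key drawn from early in the sequence; the target is that key's value.
    A model must carry the key-value binding across the whole context,
    so accuracy here reflects long-range information propagation.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)            # distinct keys
    values = [rng.randrange(vocab_size) for _ in range(num_pairs)]
    sequence = [tok for kv in zip(keys, values) for tok in kv]  # k1 v1 k2 v2 ...
    query_idx = rng.randrange(max(1, num_pairs // 4))           # query an early pair -> long gap
    sequence.append(keys[query_idx])                            # final token is the query key
    target = values[query_idx]
    return sequence, target

if __name__ == "__main__":
    seq, tgt = make_recall_example(seed=0)
    print(f"sequence length: {len(seq)}, target token: {tgt}")
```

Training any of the architectures mentioned above (Transformer, Mamba-like, or the hatchling) to predict `target` from `sequence` gives a quick, controlled comparison before moving on to the more ambitious benchmarks the authors suggest.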