People say this, but when it comes to AI models, the training data is usually not owned by the companies or groups training them, so it cannot be "open sourced" in any sense. And the training code is basically built around accessing that same data, so it cannot be shared either. So the fully open source model you wish for would have to be trained only on data that can be released, and it could only provide subpar results.
They could easily list the data used, though.
These datasets are mostly known and floating around.
When a dataset is constructed in-house, instructions for replicating it could be provided too.