Hey! One of the lead devs here. A cloud computing company called CoreWeave is giving us the compute for free in exchange for us releasing the model. We're currently at the ~10B scale and are working on understanding datacenter-scale parallelized training better, but we expect to train the full model on 300-500 V100s for 4-6 months.
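For a rough sense of what that hardware budget buys, here's a quick back-of-envelope (the ~30% sustained utilization and the range midpoints are my own assumptions, not hard numbers):

```python
# Back-of-envelope: total training FLOPs available from the cluster described above.
# Assumptions: ~30% of the V100's ~125 TFLOPS FP16 tensor-core peak is sustained,
# and we take the midpoints of the stated ranges (400 GPUs, 5 months).

V100_PEAK_FLOPS = 125e12      # V100 FP16 tensor-core peak, FLOP/s
NUM_GPUS = 400                # midpoint of the 300-500 range
UTILIZATION = 0.30            # assumed fraction of peak actually sustained
SECONDS = 5 * 30 * 24 * 3600  # ~5 months of wall-clock time

total_flops = V100_PEAK_FLOPS * NUM_GPUS * UTILIZATION * SECONDS
print(f"~{total_flops:.2e} FLOPs available")  # roughly 2e23
```

GPT-3 175B reportedly took about 3.14e23 FLOPs to train, so that budget is in the right ballpark, assuming we actually hit decent utilization at that scale.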
I imagine recreating the model will be computationally cheaper because they will not have to sift through the same huge hyperparameter space as the initial GPT-3 team had to.
This is not true. The OpenAI team trained only one full-sized GPT-3; the hyperparameter sweeps were conducted on much smaller models (see the scaling laws paper: https://arxiv.org/abs/2001.08361). The compute saved by skipping that sweep is negligible and does not meaningfully change the feasibility of the project.
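To put rough numbers on it, using the standard C ≈ 6·N·D approximation for training compute (N parameters, D tokens): a sweep over small models is a rounding error next to the single full-sized run. The sweep sizes below are made up purely for illustration; the 175B-parameter / 300B-token figures are the reported GPT-3 training configuration.

```python
# Training compute via the common approximation C ≈ 6 * N * D
# (N = parameters, D = training tokens).

def train_flops(params, tokens):
    return 6 * params * tokens

# The one full-sized run: 175B parameters on ~300B tokens (reported GPT-3 config).
full_run = train_flops(175e9, 300e9)            # ~3.15e23 FLOPs

# A hypothetical sweep: 100 small runs averaging 100M params on 10B tokens each.
sweep = 100 * train_flops(100e6, 10e9)          # ~6e20 FLOPs

print(f"sweep / full run = {sweep / full_run:.1%}")  # ~0.2%
```

Even if the actual sweep were ten times that size, it would still come to only a few percent of the cost of the big run, which is why skipping it barely moves the needle.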