
I'm always curious about the cost of these training runs. Some back-of-the-envelope calculations:

> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run

1 trillion / 1900 ≈ 526,315,789 chip-seconds ≈ 150,000 chip-hours.

Assuming "on-demand" pricing [1], that's about $500,000 in training cost.

[1] https://cloud.google.com/tpu/pricing
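For anyone who wants to rerun the numbers, here's a quick sketch. The ~$3.22/chip-hour TPU v4 on-demand rate is my assumption from eyeballing [1] (it varies by region); the throughput is from the post.

    # Back-of-the-envelope training cost
    tokens = 1e12                     # 1 trillion tokens
    tokens_per_sec_per_chip = 1900    # throughput reported in the post
    chip_seconds = tokens / tokens_per_sec_per_chip   # ~5.26e8
    chip_hours = chip_seconds / 3600                  # ~146,000
    usd_per_chip_hour = 3.22          # assumed on-demand TPU v4 price [1]
    print(f"{chip_hours:,.0f} chip-hours -> ~${chip_hours * usd_per_chip_hour:,.0f}")
    # 146,199 chip-hours -> ~$470,760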



At these levels of spending, the actual cost is heavily negotiated and usually far below the advertised on-demand pricing.

Considering I could negotiate A100s for under a dollar/hr eight months ago, when they were in high demand, I wouldn't be surprised if the cost was closer to $100k for this training run.
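Rough sketch, assuming a negotiated rate somewhere around $0.70/chip-hour (my guess, not a quoted price):

    # Same ~146k chip-hours as above, at an assumed negotiated rate
    chip_hours = 1e12 / 1900 / 3600   # ~146,000
    negotiated_rate = 0.70            # $/chip-hour, a guess
    print(f"~${chip_hours * negotiated_rate:,.0f}")
    # ~$102,339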


Nobody in their right mind is using GCE for training. Take a look at real prices: https://vast.ai/


I got the impression that kind of thing (buying time on GPUs hosted in people's homes) isn't useful for training large models, because training requires extremely high-bandwidth connections between the GPUs, such that you effectively need them in the same rack.


I suspect most A100s on vast.ai are actually in a datacenter, and might even be on other public clouds, such as AWS. I don't see why either vast.ai or AWS would care if that were the case.


Is there a good resource that describes the impact of bandwidth and latency between GPUs?

I assume that it's completely impractical to train on distributed systems?


Anyone training this size of model is almost certainly using AWS/GCE.

The GPU marketplaces are nice for people who need smaller/single-GPU setups, don't have big reliability or SLA concerns, and aren't worried about data privacy risks.


Well, or Azure.


Ha, yes, of course. But has anyone actually been able to get instances on Azure? I thought OpenAI had them all reserved.


Aren't they explicitly using TPUs in their training? Vast AI are only offering GPUs.


These nodes typically have slow downstream bandwidth, and are thus hard to use when training requires pulling a huge dataset.


Only 19 GPUs with 30+ GB of VRAM in all of North America.

I might be misreading it; it might be just 12 GPUs.



They haven't trained a model on 1 trillion tokens yet. They have only done 200B tokens so far.


Google is generous in giving out TPUs for free for research, so this run is likely using that. A more representative number is the one from Meta, which required 87k A100-hours, coming to roughly $100-200k for a 7B model training run.
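To spell out the arithmetic, assuming A100 rates somewhere between ~$1.10 and ~$2.30 per GPU-hour (my assumed range, not quoted prices):

    # Meta's reported 87k A100-hours at an assumed $1.10-$2.30/GPU-hour
    a100_hours = 87_000
    for rate in (1.10, 2.30):         # $/GPU-hour, assumed range
        print(f"${a100_hours * rate:,.0f} at ${rate}/hr")
    # $95,700 at $1.1/hr
    # $200,100 at $2.3/hr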




