sid-the-kid's comments | Hacker News

Does he? I can't find him.


Looked it up. Cool reference.


Toggle the tab to public


IMO, most video models will be fully real-time within 2 years. You will be able to pick a model, imagine any world, and then be fully immersed in it. Walk around any city interacting with people, play first-person shooter games on any map with crazy monsters, or just let the model auto-pilot an adventure for you.


Probably not, but even if so, how much will that cost? There's AI that will take a prompt like "what's the best weed whacker in 2025" and build a whole web page to publish the review. It's great, awesome. $10 in tokens to do that.

And that's probably still a subsidized cost!

Bfw "what is the best weed whacker" is John C. Dvorak's "AI test"


The system just crashed. Sorry! Working on getting things live again as fast as we can!


We are live again folks! Sorry about that. We ran out of storage space.


Ah the ole HN soak test.


Ya. You always think you've crossed your Ts. But the law always holds.


haha one of the reasons launching on HN is great!


Nice! Thanks for sharing. I hadn't seen that paper before. Looks like they take in a real-world video and then re-generate the mouth to get lip sync. In our solution, we take in an image and then generate the entire video.

I am sure they will have open source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.


For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly.

The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
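
Here's a rough sketch of that recursive loop (illustrative only: generate_chunk and the chunk sizes are placeholders, not our actual code):

    def stream_video(model, image_latents, audio_emb, n_chunks,
                     chunk_frames=16, context_frames=4):
        """Yield RGB frame chunks. As long as each chunk is generated
        faster than it plays back, the client can stream in real time."""
        context = image_latents  # the single input image, causal-VAE encoded
        for _ in range(n_chunks):
            # Condition on the tail of the previous output so the video
            # stays temporally coherent across chunk boundaries.
            chunk = model.generate_chunk(context, audio_emb,
                                         num_frames=chunk_frames)
            yield chunk                        # ship these frames to the client
            context = chunk[-context_frames:]  # the recursion step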


I'm not at that level but reminded me of https://news.ycombinator.com/item?id=43736193


Nice find! I hadn't seen this before (and will take a deeper look later). It looks like this is an approach to better utilize GPU memory. And we would probably benefit from this to get more of a speed-up, which would also help us get better video quality.

I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." At 25 fps, even the faster 1.5 seconds/frame works out to 37.5s of compute per 1 second of video, which is fast for video generation but way slower than real time.


Yep, this is way slower, but it's considered SOTA among open-source video generation.

I mostly meant that the insight of using previous frames to generate new frames is what reminded me of it; I lack knowledge on the specifics of the work.

Glad if it's useful for your work/research to check out the paper.

Edit: the real-time-ness of it also has to take into account what hardware you are running the model on; it's obviously easier to achieve on an H100 than on a 3090. But these memory optimizations really help make these models usable at all for local stuff, which I think is a great win for overall adoption and for further stuff being built upon them, a bit like how sd-webui from AUTOMATIC1111, alongside the open-sourced Stable Diffusion weight models, was a boom for image gen a couple of years back.


Nice! What infra do you use for inference? I'm wondering what the cost-effective platforms are for projects like this. GPUs on AWS and Azure are incredibly expensive for personal use.


We use Modal (https://modal.com/). They give us GPUs on-demand, which is critical for us so we are only paying for what we are using. Pricing is about $2/hr per GPU (as a baseline of the costs). Long story short, things get VERY expensive quickly.


Nice. This is also how recent advances in ML weather forecasting work. Weather forecasting really is just "video generation" but in higher dimensions.


We just removed email signup. You can try it out now without logging in. It was easier than expected to do technically, so we just shipped a quick update.


Thanks! This is amazing


Glad you like it! IMO, biggest things to improve on are 1) time to video response and 2) being able to generate more complicated videos (2 people talking to each other, a person walking + talking, scene cuts while talking).


That's fair. We just removed the sign-in for HN. Should be live shortly.

Each person gets a dedicated GPU, so we were worried about costs before. But let's just go for it.


I think it's not going well? I keep getting to the "start a new call" page, it fails, and it takes me back to the live page. I assume your servers are on fire, but implementing some messaging ("come back later"), or even better a queueing system ("you're N in line"), would help a lot.

Really looking forward to trying this out!


We're back online! One of our cache systems ran out of memory. Oops. Agree on improved messaging.


Do you use a cloud-based GPU provider?


Yes. We use Modal (https://modal.com/), and are big fans of them. They are very ergonomic for development, and allow us to request GPU instances on demand. Currently, we are running our real-time model on A100s.
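
To give a sense of the shape of it, here's a simplified sketch of an on-demand Modal GPU function (the names and body are placeholders, not our actual inference code):

    import modal

    app = modal.App("realtime-video")

    @app.function(gpu="A100", timeout=60 * 10)
    def run_session(session_id: str):
        # Model load + the real-time generation loop would live here.
        # Modal spins the A100 up on demand and bills only while it runs,
        # which is what makes a dedicated GPU per user workable.
        ...

    @app.local_entrypoint()
    def main():
        run_session.remote("demo-session")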


I see you are paying $2/h. Shoot me an email at victor at borg.games if your model would fit on an RTX 3090 24G, to get it down to $0.2/h (fellow startup).


Maybe demos could run at a downsampled bitrate/size on commercial GPUs.


How much would this demo cost you from the HN traffic if you don't mind me asking?


Good question. I guess it depends on how many users we get. Each user gets their own dedicated GPU. Most video generation systems (and AI systems in general) can share GPUs during generation. Since we are real time, we don't do that. So, each user minute is a GPU minute. This is the biggest driver of the cost.
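
Back-of-the-envelope, using the ~$2/hr per-GPU baseline mentioned above (the 1,000-visitor figure is purely hypothetical):

    GPU_HOURLY_USD = 2.00                     # approximate on-demand GPU rate
    per_user_minute = GPU_HOURLY_USD / 60     # 1 user-minute = 1 GPU-minute
    print(f"~${per_user_minute:.3f} per user-minute")             # ~$0.033
    print(f"~${per_user_minute * 10:.2f} per 10-minute session")  # ~$0.33
    print(f"~${per_user_minute * 10 * 1000:,.0f} if 1,000 visitors "
          f"each try a 10-minute session")                        # ~$333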


Feels like the next logical step for you to get economies of scale is to allow users generating the video to automatically stream it to N platforms, so each GPU can be generating one stream for many humans to watch simultaneously, with maybe one human in the driver's seat deciding what to generate, or more AI, idk.
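
Something like one broadcast loop per GPU, with every viewer subscribing to the same chunks. Toy asyncio sketch, all names made up:

    import asyncio

    viewers: set[asyncio.Queue] = set()

    async def broadcast(generate_chunk):
        """One GPU generates each chunk once; every subscriber gets a copy."""
        while True:
            chunk = await generate_chunk()   # single generation pass per chunk
            for q in viewers:
                q.put_nowait(chunk)          # fan out to all connected viewers

    async def watch(send):
        """Each viewer consumes the shared stream instead of owning a GPU."""
        q: asyncio.Queue = asyncio.Queue()
        viewers.add(q)
        try:
            while True:
                await send(await q.get())    # e.g. push over WebSocket/WebRTC
        finally:
            viewers.discard(q)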


That's a good idea! It would be especially cool if the human is charismatic and does a good job driving the convo. Maybe we can try it out with a streamer.


Vtuber comes to mind


Neuro/Vedal is already there, although not with the video model as well.


I am with you. I want this too. Maybe somebody can make it with their API?


Okay. I know these guys IRL. BUT, I genuinely think they have the best music model out there. Hands down. The songs are just more unique and have a wider range of musical variation. With Suno/Udio, the songs just sound the same after a while (just with different lyrics).

That could just be me though. I am curious what users of Udio/Suno think?


Quality has improved so much too, I tried it a few months ago at Demo Day and I’m blown away by how good it is now.

