TL;DR: Diffusion-based TTS models sound amazing but break down for real-time streaming because they require full-sequence attention. StreamFlow introduces a block-wise guided attention scheme that lets diffusion transformers generate speech chunk-by-chunk with near–SOTA quality and predictable low latency.
Why this matters:
Current diffusion speech models need to see the entire audio sequence, making them too slow and memory-heavy for assistants, agents, or anything that needs instant voice responses. Purely causal masks make the output sound robotic; naive chunking adds audible seams at block boundaries. Streaming TTS has been stuck with a quality–latency tradeoff.
The idea:
StreamFlow restricts attention using sliding windows over blocks (a rough mask sketch follows the list):
Each block can see W_b past blocks and W_f future blocks
Compute becomes roughly O(B × W × N) instead of full O(N²)
Prosody stays smooth, latency stays constant, and boundaries disappear with small overlaps + cross-fades
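For intuition, here is a minimal sketch of how such a block-wise mask could be built; the block size and the `w_past`/`w_future` values below are illustrative choices, not the paper's exact settings or implementation.

```python
import torch

def blockwise_attention_mask(num_frames: int, block_size: int,
                             w_past: int = 3, w_future: int = 1) -> torch.Tensor:
    """Boolean mask (True = attend): each frame only attends to frames whose
    block index lies within [own_block - w_past, own_block + w_future].
    Hypothetical helper illustrating the idea, not the paper's code."""
    block_idx = torch.arange(num_frames) // block_size      # block id per frame
    diff = block_idx[None, :] - block_idx[:, None]          # key block - query block
    return (diff >= -w_past) & (diff <= w_future)

# Example: 10 frames, blocks of 2 frames, 1 past block and 1 future block visible.
# The resulting mask can be passed as attn_mask (True = keep) to
# torch.nn.functional.scaled_dot_product_attention.
mask = blockwise_attention_mask(10, block_size=2, w_past=1, w_future=1)
```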
How it works:
The system is still a Diffusion Transformer, but trained in two phases:
Full-attention pretraining for global quality
Block-wise fine-tuning to adapt to streaming constraints
Generates mel-spectrograms; BigVGAN vocoder runs in parallel.
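A rough sketch of how chunk-by-chunk generation with the vocoder running in parallel might be wired up; `generate_mel_block` and `vocoder` are hypothetical callables standing in for the diffusion model and BigVGAN, not the released API.

```python
import queue
import threading

def stream_tts(text, generate_mel_block, vocoder, num_blocks):
    """Sketch of a streaming pipeline: the diffusion model emits mel blocks
    while the vocoder consumes them on a separate thread."""
    mel_q: queue.Queue = queue.Queue(maxsize=2)   # small buffer keeps latency bounded
    audio_chunks = []

    def vocode_worker():
        while True:
            mel = mel_q.get()
            if mel is None:                        # sentinel: generation finished
                break
            audio_chunks.append(vocoder(mel))      # e.g. one BigVGAN forward pass

    worker = threading.Thread(target=vocode_worker)
    worker.start()

    for b in range(num_blocks):
        # Each block only attends to a few past/future blocks, so it can be
        # denoised and handed off as soon as its small future context exists.
        mel_q.put(generate_mel_block(text, block_index=b))

    mel_q.put(None)
    worker.join()
    return audio_chunks    # cross-fade adjacent chunks at boundaries before playback
```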
Performance:
~180ms first-packet latency (80ms model, 60ms vocoder, 40ms overhead)
No latency growth with longer speech
MOS tests show near-indistinguishable quality vs. non-streaming diffusion
Speaker similarity within ~2%; prosody continuity preserved
Key ablation takeaways:
Past context helps up to ~3 blocks; more adds little
Even a tiny future window greatly boosts naturalness
Best results: 0.4–0.6s block size, ~10–20% overlap
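To make those settings concrete, a small back-of-the-envelope sketch (a 24 kHz sample rate and a linear cross-fade are assumptions here; the paper's exact values may differ):

```python
import numpy as np

SR = 24_000          # assumed sample rate
block_s = 0.5        # block size inside the reported 0.4-0.6 s sweet spot
overlap_frac = 0.15  # within the ~10-20% overlap range

block_len = int(block_s * SR)            # 12_000 samples per block
overlap = int(block_len * overlap_frac)  # 1_800 samples (~75 ms) shared between blocks

def crossfade(prev_tail: np.ndarray, next_head: np.ndarray) -> np.ndarray:
    """Linear cross-fade over the overlapping region to hide block seams."""
    fade = np.linspace(0.0, 1.0, len(prev_tail))
    return prev_tail * (1.0 - fade) + next_head * fade

# Usage: stitch two adjacent audio blocks a and b
a = np.random.randn(block_len)
b = np.random.randn(block_len)
stitched = np.concatenate([a[:-overlap],
                           crossfade(a[-overlap:], b[:overlap]),
                           b[overlap:]])
```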
Comparison:
Autoregressive TTS → streaming but meh quality
GAN TTS → fast but inconsistent
Causal diffusion → real-time but degraded
StreamFlow → streaming + near-SOTA quality
Bigger picture:
Smart attention shaping lets diffusion models work in real time without throwing away global quality. The same technique could apply to streaming music generation, translation, or interactive media.