
Our Product · Case Study

ViralForge AI Video Pipeline

Industry

AI Media · Video Processing

Scope

AI Engineering · GPU Infrastructure · Pipeline Architecture

Stack

Python · GPU Workers · Speech-to-Text · LLMs

Background

ViralForge (at sucut.xyz) is an AI-powered video processing pipeline built by TrueLeaf Tech that transforms long-form video content into platform-ready short-form clips. The product takes a single source video — a podcast episode, a webinar recording, an interview — and produces a set of clips optimised for the format requirements of TikTok, Instagram Reels, and YouTube Shorts.

This case study describes our own product. ViralForge is built and operated by TrueLeaf Tech. The architectural decisions and operational lessons here come from running an AI-heavy video pipeline at the throughput and quality bar that real users demand.

The problem we were solving

The shape of short-form video distribution rewards consistency and volume. A creator producing one piece of long-form content per week — a podcast, a long YouTube video, a recorded talk — has, in principle, dozens of clip-sized moments inside that content. Surfacing those moments, cutting them cleanly, and reformatting them for vertical-first platforms is the kind of work that is technically simple per clip and operationally crushing at volume.

Existing tools at the time tended toward one of two extremes. Some were essentially automated cutters that produced low-quality clips with no editorial judgement. Others were fully manual editors that required hours of work per video. The product hypothesis behind ViralForge was that the right answer sat in between: an AI-driven pipeline that did the heavy lifting of clip selection, captioning, reformatting, and metadata generation, but exposed each stage in a way that a human editor could direct or override.

What we built

The processing pipeline

ViralForge is structured as a multi-stage pipeline, with each stage producing artefacts that the next stage consumes. The stages, simplified, are:

  1. Ingest and transcribe. The source video is ingested, its audio extracted, and a transcript generated using a high-accuracy speech-to-text model.
  2. Segment and score. The transcript is broken into candidate clip segments, each scored on a set of criteria: hook strength, narrative completeness, emotional arc, and platform fit.
  3. Select and refine. Top-scoring segments are selected, with overlap and pacing constraints applied. The exact clip boundaries are refined for natural cuts.
  4. Reformat for vertical. Each clip is reformatted from its source aspect ratio to the vertical format required by short-form platforms, with intelligent framing that keeps the subject in shot.
  5. Caption. Word-level captions are generated and styled for readability on mobile, with timing aligned precisely to the speech.
  6. Render. The final clip is rendered, encoded, and prepared for download or direct platform publishing.

Each stage runs as an independent service, with explicit contracts between them. The architecture lets us improve any stage independently, swap underlying models without rewriting the orchestration, and parallelise the work across many videos. The disciplines we describe in boring agent infrastructure apply directly to this pipeline, even though it is not literally an agent: durable state, explicit retry semantics, comprehensive observability.
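As a rough illustration of what explicit contracts between stages look like, here is a minimal Python sketch. The artefact types, stage functions, and the persist and TransientError helpers are hypothetical stand-ins rather than the actual ViralForge code; the point is that each stage consumes one typed artefact, produces the next, and the orchestrator persists results and retries each stage independently.

from dataclasses import dataclass

class TransientError(Exception):
    """Raised by a stage for failures that are safe to retry (illustrative)."""

@dataclass
class Transcript:
    video_id: str
    words: list  # word-level entries with text plus start/end timestamps

@dataclass
class ClipCandidate:
    video_id: str
    start_s: float
    end_s: float
    score: float  # combined hook / completeness / arc / platform-fit score

def transcribe(video_id: str) -> Transcript:
    """Stage 1: extract audio and run speech-to-text (stubbed here)."""
    return Transcript(video_id=video_id, words=[])

def segment_and_score(transcript: Transcript) -> list:
    """Stage 2: split the transcript into candidate clips and score each one."""
    return [ClipCandidate(transcript.video_id, 0.0, 42.0, 0.87)]

def persist(stage_name: str, artefact) -> None:
    """Durable state between stages; in production this would be a datastore write."""
    print(f"saved artefact from {stage_name}")

def run_stage(stage, payload, max_attempts: int = 3):
    """Run one stage with explicit retry semantics, persisting its output
    before the next stage is allowed to start."""
    for attempt in range(1, max_attempts + 1):
        try:
            artefact = stage(payload)
            persist(stage.__name__, artefact)
            return artefact
        except TransientError:
            if attempt == max_attempts:
                raise

# Orchestration: stages are chained through their artefacts, never shared state.
transcript = run_stage(transcribe, "video-123")
candidates = run_stage(segment_and_score, transcript)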

The GPU fleet

Several stages — particularly speech-to-text and video reformatting — are GPU-bound. We run a managed fleet of GPU workers, scaled based on real-time queue depth, with each worker pulling jobs from the central orchestrator and returning results. The cost discipline here is important: GPU instances are expensive, and a pipeline that runs them at fifteen percent utilisation is bleeding money. We invested in queue management, batching strategies, and worker lifecycle policies that keep our fleet utilisation high without sacrificing job latency.
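To make the queue-depth-driven scaling decision concrete, here is a simplified sketch; the throughput figures, drain target, and fleet bounds are placeholders rather than our production numbers, and the real policy also accounts for worker warm-up time and batching.

import math

def desired_worker_count(queue_depth: int,
                         jobs_per_worker_per_hour: float,
                         target_drain_minutes: float = 30.0,
                         min_workers: int = 1,
                         max_workers: int = 40) -> int:
    """Size the GPU fleet so the current backlog drains within the target
    window, clamped to a floor and a ceiling."""
    jobs_per_worker_in_window = jobs_per_worker_per_hour * (target_drain_minutes / 60.0)
    needed = math.ceil(queue_depth / max(jobs_per_worker_in_window, 1e-6))
    return max(min_workers, min(needed, max_workers))

# Example: 120 queued jobs, ~12 jobs/hour per worker, 30-minute drain target -> 20 workers.
print(desired_worker_count(queue_depth=120, jobs_per_worker_per_hour=12))

The core trade-off is the same whatever the exact numbers: drain the queue quickly enough to protect job latency without leaving idle GPUs running.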

The editor surface

The user-facing application is built around an editor that exposes each stage of the pipeline. Users can accept the AI's selections, override them, regenerate captions, adjust framing, or recombine clips manually. The design principle is that the AI does ninety percent of the work, and the human does the last ten percent — but the last ten percent is what makes the output feel intentional rather than algorithmic.

The most useful thing an AI pipeline can do is leave room for human judgement at the moments where it actually matters. Everything else should be automated away.

Engineering trade-offs we made

Model quality versus latency versus cost

Almost every stage of the pipeline has a choice between cheaper, faster, lower-quality models and more expensive, slower, higher-quality ones. The right answer is not the same for every stage. For transcription, quality matters disproportionately — a transcription error compounds through the rest of the pipeline. For scoring, speed matters more — we need to evaluate hundreds of candidate segments per video, and the marginal quality improvement of a slower model is not worth the latency cost.
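A simplified way to picture this is a per-stage selection table. The priorities below reflect the reasoning above, but the model identifiers are placeholders rather than the models we actually run.

STAGE_MODEL_POLICY = {
    # Transcription errors compound downstream, so quality wins here.
    "transcribe":        {"model": "high-accuracy-stt", "priority": "quality"},
    # Hundreds of candidate segments per video, so latency and cost win here.
    "segment_and_score": {"model": "fast-small-llm",    "priority": "latency"},
    # Final-cut refinement is low volume and visible to users, so quality again.
    "select_and_refine": {"model": "larger-llm",        "priority": "quality"},
}

def model_for(stage: str) -> str:
    return STAGE_MODEL_POLICY[stage]["model"]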

Eager rendering versus lazy rendering

Early versions of the product rendered every selected clip eagerly during processing. This produced fast user experiences but wasted significant compute on clips the user would discard. The current architecture renders metadata and previews eagerly but defers full rendering until the user has confirmed their selection. The cost reduction was substantial; the user experience cost was negligible.
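A minimal sketch of the eager/lazy split follows, with hypothetical state and function names; in the real system these steps are queued through the same orchestrator as the rest of the pipeline.

from dataclasses import dataclass
from enum import Enum

class ClipState(Enum):
    SCORED = "scored"        # metadata only, produced during processing
    PREVIEWED = "previewed"  # cheap low-resolution preview, rendered eagerly
    CONFIRMED = "confirmed"  # user has accepted the clip
    RENDERED = "rendered"    # full-quality encode, produced on demand

@dataclass
class Clip:
    clip_id: str
    state: ClipState = ClipState.SCORED

def render_preview(clip: Clip) -> None:
    """Cheap low-resolution render; inexpensive enough to run for every clip."""

def enqueue_full_render(clip: Clip) -> None:
    """Expensive full-quality GPU render; queued only after user confirmation."""

def on_processing_complete(clip: Clip) -> None:
    # Eager path: metadata and a preview only.
    render_preview(clip)
    clip.state = ClipState.PREVIEWED

def on_user_confirmed(clip: Clip) -> None:
    # Lazy path: the expensive render happens only for clips the user keeps.
    enqueue_full_render(clip)
    clip.state = ClipState.CONFIRMED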

Multi-platform versus single-platform optimisation

Vertical short-form video looks superficially similar across TikTok, Instagram Reels, and YouTube Shorts — but the actual platform conventions differ in small, important ways. Caption styles, ideal length, hook conventions, and pacing all vary. We chose to optimise outputs per platform rather than produce a single generic vertical clip. The complexity cost is real; the quality difference is meaningful enough to justify it.
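In practice the per-platform outputs are driven by small profile objects rather than branching logic scattered through the pipeline. The sketch below uses placeholder values, not the real platform tuning or official platform limits.

from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformProfile:
    name: str
    aspect_ratio: str
    target_max_seconds: int      # illustrative ceilings, not official platform limits
    caption_style: str
    hook_window_seconds: float   # how quickly the clip needs to land its hook

PROFILES = (
    PlatformProfile("tiktok", "9:16", 60, "bold-word-by-word", 1.5),
    PlatformProfile("reels",  "9:16", 90, "clean-phrase-level", 2.0),
    PlatformProfile("shorts", "9:16", 60, "high-contrast-blocks", 2.5),
)

def render_targets(clip) -> list:
    # One tuned output per platform for each clip, rather than a single generic vertical render.
    return [(profile.name, profile) for profile in PROFILES]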

What we learned

How this informs our client work

The patterns from ViralForge — pipeline architecture for AI-heavy workloads, GPU fleet management, model selection trade-offs, and the discipline of leaving room for human judgement — apply directly to client engagements involving production AI features. If you are building something with similar characteristics, we have run the playbook ourselves and the lessons travel well.

Get in touch if you are working on AI infrastructure that needs to feel reliable to real users.

Work with us

Have a similar challenge in front of you?

If something in this case study resonates with what you're trying to build — or if you'd like to talk through a related problem — we'd be glad to spend a half-hour helping you think it through.

Start a conversation →

Frequently asked questions

What is ViralForge designed to do?

ViralForge is an AI pipeline that takes long-form video content — podcasts, webinars, interviews — and produces platform-ready short-form clips for TikTok, Instagram Reels, and YouTube Shorts. It handles clip selection, captioning, vertical reformatting, and rendering, with human editorial control at each stage.

How does the pipeline architecture handle scale?

Each stage runs as an independent service with explicit contracts between stages. GPU-heavy stages run on a managed worker fleet with queue-depth-driven autoscaling. The architecture allows us to improve any stage independently and parallelise work across many videos without coordination overhead.

What models power the AI stages of the pipeline?

We use a combination of speech-to-text models, language models for scoring and editorial decisions, and computer vision models for framing and reformatting. The specific model choice for each stage balances quality, latency, and cost — and we have deliberately built the architecture to make model swaps cheap as the underlying model landscape evolves.
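One way to make model swaps cheap, shown here as a hedged sketch rather than our actual code: every stage talks to a narrow interface instead of a vendor SDK directly, so changing the underlying model is a one-file change. The protocol and adapter names are hypothetical.

from typing import Protocol

class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> list:
        """Return word-level entries with text and start/end timestamps."""
        ...

class VendorSTTAdapter:
    """Wraps one concrete speech-to-text backend behind the shared interface."""
    def transcribe(self, audio_path: str) -> list:
        ...  # call the chosen model or API here

def run_transcription(transcriber: Transcriber, audio_path: str) -> list:
    # The pipeline only knows about the Transcriber protocol, so the backend
    # can change without touching the orchestration code.
    return transcriber.transcribe(audio_path)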

Can the architecture be applied to other AI-heavy media workflows?

Yes. The pipeline pattern — independent stages with explicit contracts, GPU fleet management, queue-driven autoscaling, and human override at key decision points — applies to any AI-heavy media workflow including audio processing, document generation, image transformation, and content moderation.

