Gen2Balance: Generative Balancing for
Long-Tailed Video Action Recognition

Prajwal Gatti1 Simon Jenni2 Fabian Caba Heilbron2 Dima Damen1

1University of Bristol  ·  2Adobe Research

ECCV 2026

TL;DR: We address long-tailed video action recognition by filling under-represented classes with synthetic videos from text-to-video generation models, and show that a two-stage training strategy works best, improving over the strongest existing methods.

Comparison of real and generated videos for six actions. Despite known issues with synthetic physics, we show that carefully prompted videos are convincing and class-faithful enough to reach state-of-the-art performance in long-tailed video action recognition.
Abstract

What is Gen2Balance?

Gen2Balance is a framework that tackles long-tailed video action recognition by converting imbalanced datasets into a balanced mix of real and synthetic video clips. We augment minority classes using a text-to-video generative model, which is conditioned on diverse prompts grounded in real training exemplars and detailed action profiles. To effectively learn from this augmented data, Gen2Balance employs a two-stage training strategy that mitigates synthetic domain shift. Evaluated on long-tailed versions of standard benchmarks, UCF-101 (UCF-LT) and a temporally challenging Kinetics subset (K100-LT), our approach surpasses the strongest baselines, exhibiting significant accuracy gains for tail and few-shot actions.
Method

How do we generate diverse, class-faithful videos?

Naive, templated prompts (just the class label) lack diversity and are often semantically ambiguous.

To solve this, a multimodal LLM analyses real video exemplars to build an action profile and write diverse, class-faithful text prompts. A text-to-video model then uses these to create clips that fill the long tail.

Pipeline diagram: action class (e.g. Robot Dancing) + a few real exemplars → multimodal LLM (Gemini 2.5 Pro) writes an action profile and diverse, class-faithful prompts → text-to-video model (WAN 2.1) generates clips per prompt → generated videos are added to the training set.
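
The generation pipeline above can be sketched as a small driver that fills one minority class. This is a minimal sketch, not the authors' implementation: `write_prompts` and `generate_clip` are hypothetical callables standing in for the multimodal LLM and the text-to-video model, and `GeneratedClip`, `fill_class`, and `clips_per_prompt` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class GeneratedClip:
    label: str   # action class the clip is meant to depict
    prompt: str  # text prompt used to generate it
    path: str    # location of the (hypothetical) video file


def fill_class(
    label: str,
    exemplars: Sequence[str],
    n_needed: int,
    write_prompts: Callable[[str, Sequence[str], int], List[str]],
    generate_clip: Callable[[str], str],
    clips_per_prompt: int = 1,
) -> List[GeneratedClip]:
    """Generate n_needed synthetic clips for one minority class.

    write_prompts(label, exemplars, k) is assumed to return k diverse,
    class-faithful prompts grounded in the real exemplars.
    generate_clip(prompt) is assumed to return the path of one new clip.
    """
    if n_needed <= 0:
        return []
    # Ceiling division: request just enough prompts to cover n_needed clips.
    n_prompts = -(-n_needed // clips_per_prompt)
    prompts = write_prompts(label, exemplars, n_prompts)
    clips: List[GeneratedClip] = []
    for prompt in prompts:
        for _ in range(clips_per_prompt):
            if len(clips) == n_needed:
                return clips
            clips.append(GeneratedClip(label, prompt, generate_clip(prompt)))
    return clips
```

Passing the two models in as callables keeps the sketch independent of any particular LLM or video-generation API, so the same loop can be tested with simple stubs.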
Method

How do we train with the balanced data?

Surprisingly, naively adding the synthetic clips to training does worse than using real data alone, an effect of the synthetic-to-real shift.

To fix this, we train in two stages. We first learn features from the balanced real + generated clips using loss margins based on real data frequencies. A brief rehearsal on just the real data then corrects the domain shift.

The Gen2Balance two-stage training strategy diagram.
Stage 1 trains on the filled dataset (D_aug = D_train ∪ D_gen) with Balanced Softmax loss; Stage 2 fine-tunes on the real data with the same loss.
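
To make the loss concrete, here is a minimal single-sample sketch of Balanced Softmax, which adds the log of each class count to the logits before the usual softmax cross-entropy. Following the description above, the counts are assumed to come from the real training data only, even in Stage 1 where the training set also contains generated clips. The function name and the pure-Python form are illustrative, not the released code.

```python
import math
from typing import Sequence


def balanced_softmax_loss(
    logits: Sequence[float], target: int, real_counts: Sequence[int]
) -> float:
    """Balanced Softmax loss for one sample.

    real_counts[j] is the number of REAL training clips in class j
    (assumption: generated clips are not counted).
    """
    # Shift every logit by the log prior of its class.
    adjusted = [z + math.log(n) for z, n in zip(logits, real_counts)]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[target]
```

With equal counts this reduces to standard cross-entropy. With skewed counts, a rare class incurs a larger loss than a frequent class at the same logits, which pushes the model to assign rare classes a larger margin.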
Results

State-of-the-art on the long tail

We introduce two long-tailed benchmarks built from popular datasets: K100-LT and UCF-LT.

Class-average accuracy (C/A) against standard and SOTA long-tail baselines, all using the same VideoMAE backbone. Gen2Balance achieves the best overall accuracy, with the largest gains on tail and few-shot classes.

Table columns (per benchmark): Method | Gen. | Few | Tail | Head | Avg C/A

Gen. = uses our generated data. The grayed CE (full dataset) row trains on the full balanced data and is an upper-bound reference.
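
For readers unfamiliar with the metric, the sketch below shows how class-average accuracy differs from plain sample-level accuracy: each class contributes equally regardless of how many test clips it has. The function names and the `group_of` mapping are illustrative assumptions; the criteria that assign a class to head, tail, or few-shot are not restated here.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of each class's test clips that were predicted correctly."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def grouped_class_average(
    per_class: Mapping[str, float], group_of: Mapping[str, str]
) -> Dict[str, float]:
    """Mean per-class accuracy within each group, plus the overall mean ("avg")."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for c, acc in per_class.items():
        buckets[group_of[c]].append(acc)  # e.g. "head", "tail", "few"
        buckets["avg"].append(acc)
    return {g: sum(v) / len(v) for g, v in buckets.items()}
```

In a toy case with four test clips of one class (three correct) and two of another (one correct), sample-level accuracy is 4/6 while class-average accuracy is (0.75 + 0.5) / 2 = 0.625.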

Interactive

Interact and explore the results!

Each bar represents a K100-LT class and shows our accuracy improvement over the baseline. The gains grow toward the data-starved tail and few-shot classes.

Bars: green = improvement, red = regression. Shaded bands group head / tail / few-shot classes.

Can Gen2Balance extend to rare actions?

The RareAct dataset features compositionally rare actions, formed by unlikely verb-noun pairs such as cut keyboard, drill phone and hammer phone, that almost never occur in real footage. We curate 22 of them and append them to K100-LT as few-shot classes with only 5 real training clips each to ask: can Gen2Balance extend to rare actions?

+31.9% over the strongest baseline on rare actions

Cross-Entropy: 11.3%
Balanced Softmax: 23.2%
Logit Adjustment: 27.8%
Gen2Balance: 59.7%

Class-average accuracy on the 22 RareAct classes.

Real vs. generated rare actions

How much compute do you actually need?

There is no upper limit on how many clips can be generated, but you don't need to fully balance the dataset to see major gains. Drag the filling threshold B to explore the trade-off: the left panel shows accuracy vs. cost (H100 GPU hours), and the right panel visualizes the resulting per-class training-set distribution.

Panels: Accuracy & cost vs. B (left); Training-set size per class (right).

Takeaway

Partial balancing (B = 330) recovers +6.4% average and +13.5% few-shot accuracy, close to full balancing's +8.1% / +13.5%, at just 27% of the generation compute (2.5K vs 9.2K GPU-hours).

You can capture the bulk of the gains for a fraction of the cost, and this is a one-time offline cost that shrinks as generative models get faster.
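
The effect of the filling threshold B can be sketched with a few lines of arithmetic. The sketch assumes that each class below B is topped up to exactly B clips, that every generated clip costs the same, and that full balancing means filling every class up to the largest real class. These are illustrative assumptions, and the class counts below are toy values, not the K100-LT statistics.

```python
from typing import Dict, Mapping


def clips_to_generate(real_counts: Mapping[str, int], threshold: int) -> Dict[str, int]:
    """Synthetic clips needed per class to reach `threshold` clips in total."""
    return {label: max(0, threshold - n) for label, n in real_counts.items()}


def relative_generation_cost(real_counts: Mapping[str, int], threshold: int) -> float:
    """Generation cost at `threshold` as a fraction of full balancing.

    Assumes uniform cost per clip and that full balancing fills each
    class up to the size of the largest real class.
    """
    full = sum(clips_to_generate(real_counts, max(real_counts.values())).values())
    partial = sum(clips_to_generate(real_counts, threshold).values())
    return partial / full if full else 0.0


# Toy counts, invented for illustration only.
toy_counts = {"head_class": 500, "tail_class": 120, "few_shot_class": 5}
plan = clips_to_generate(toy_counts, threshold=330)
fraction = relative_generation_cost(toy_counts, threshold=330)
```

For the reported numbers, the same ratio is 2.5K / 9.2K ≈ 0.27 of the full-balancing generation compute.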

Our Contributions

1. Generative balancing

We balance long-tailed video datasets with synthetic clips from a pre-trained text-to-video model.

2. A pipeline & a released dataset

An automatic LLM-driven pipeline (Gemini 2.5 Pro + WAN 2.1) for diverse, class-faithful prompts, and a public release of 140K videos across 223 classes.

3. A study of training strategies

Two-stage training (combined → real) with a class-balanced loss using real-data margins works best.

4. State-of-the-art results

Up to +6.7% (K100-LT) and +5.5% (UCF-LT few-shot) over the strongest baselines, and +31.9% on rare RareAct actions.

Citation

BibTeX

@inproceedings{gatti2026gen2balance,
  title     = {Gen2Balance: Generative Balancing for Long-Tailed Video Action Recognition},
  author    = {Gatti, Prajwal and Jenni, Simon and Caba Heilbron, Fabian and Damen, Dima},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
Acknowledgements

This work was supported by EPSRC Fellowship UMPIRE (EP/T004991/1) and a charitable donation from Adobe to the University of Bristol. We acknowledge the usage of GPU Node hours granted as part of the AIRR Gateway project “HOI Foundational Model from Egocentric Data” (Dec 2025–Mar 2026) and the Sovereign AI Unit call project “Gen Model in Ego-sensed World” (Aug 2025–Nov 2025).