Gen2Balance: Generative Balancing for Long-Tailed Video Action Recognition

Abstract

What is Gen2Balance?

Gen2Balance is a framework that tackles long-tailed video action recognition by converting imbalanced datasets into a balanced mix of real and synthetic video clips. We augment minority classes using a text-to-video generative model, which is conditioned on diverse prompts grounded in real training exemplars and detailed action profiles. To effectively learn from this augmented data, Gen2Balance employs a two-stage training strategy that mitigates synthetic domain shift. Evaluated on long-tailed versions of standard benchmarks, UCF-101 (UCF-LT) and a temporally challenging Kinetics subset (K100-LT), our approach surpasses the strongest baselines, exhibiting significant accuracy gains for tail and few-shot actions.

Method

How do we generate diverse, class-faithful videos?

Naive, templated prompts (just the class label) lack diversity and are often semantically ambiguous.

To solve this, a multimodal LLM analyses real video exemplars to build an action profile and write diverse, class-faithful text prompts. A text-to-video model then uses these to create clips that fill the long tail.

Pipeline: action class (e.g. 🤖 Robot Dancing) + a few real exemplars → multimodal LLM (Gemini 2.5 Pro), which produces an action profile and diverse, class-faithful prompts → text-to-video model (WAN 2.1), which generates clips per prompt → generated videos, added to the training set.
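The data flow described above can be sketched in a few lines. This is only an illustration of the three steps (profile, prompts, clips), not the released pipeline: `ActionProfile`, `generate_for_class` and the three callables are hypothetical names, the model calls are injected as plain functions rather than real Gemini or WAN API calls, and the sketch assumes one clip per prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActionProfile:
    """Hypothetical container for what the multimodal LLM extracts about a class."""
    label: str
    description: str


def generate_for_class(
    label: str,
    exemplars: List[str],
    n_needed: int,
    profile_fn: Callable[[str, List[str]], ActionProfile],
    prompt_fn: Callable[[ActionProfile, int], List[str]],
    t2v_fn: Callable[[str], str],
) -> List[str]:
    """Return the paths of generated clips for one minority class.

    profile_fn, prompt_fn and t2v_fn stand in for the multimodal LLM and
    text-to-video calls; their signatures are assumptions of this sketch.
    """
    if n_needed <= 0:
        return []
    # Step 1: build an action profile grounded in the real exemplars.
    profile = profile_fn(label, exemplars)
    # Step 2: ask for diverse, class-faithful prompts based on that profile.
    prompts = prompt_fn(profile, n_needed)
    # Step 3: render one clip per prompt (an assumption; several clips per prompt also works).
    return [t2v_fn(prompt) for prompt in prompts[:n_needed]]
```

Passing the model calls in as functions keeps the sketch runnable with stubs and makes it easy to swap in a different LLM or video generator.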

Method

How do we train with the balanced data?

Surprisingly, naively adding the synthetic clips to the training set performs worse than training on real data alone, a consequence of the synthetic-to-real domain shift.

To fix this, we train in two stages. We first learn features from the balanced real + generated clips using loss margins based on real data frequencies. A brief rehearsal on just the real data then corrects the domain shift.

Figure: the Gen2Balance two-stage training strategy. Stage 1 trains on the filled dataset (D_aug = D_train ∪ D_gen) with Balanced Softmax loss; Stage 2 fine-tunes on the real data with the same loss.
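To make the loss concrete, here is a minimal pure-Python sketch of Balanced Softmax for a single sample. Balanced Softmax adds the log of each class count to the corresponding logit before applying cross-entropy. Following the description above, the sketch assumes the counts come from the real training set D_train in both stages, even when the batch contains generated clips; the function name and arguments are illustrative, not the authors' code.

```python
import math
from typing import Sequence


def balanced_softmax_loss(
    logits: Sequence[float], target: int, real_counts: Sequence[int]
) -> float:
    """Cross-entropy on logits shifted by the log of the real class counts.

    real_counts[j] is the number of real training clips of class j and must
    be positive. Using real (not augmented) counts is an assumption taken
    from the text above.
    """
    # Shift every logit by log(n_j), so frequent classes get a head start.
    shifted = [z + math.log(n) for z, n in zip(logits, real_counts)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(shifted)
    log_norm = m + math.log(sum(math.exp(s - m) for s in shifted))
    # Negative log-probability of the target class.
    return log_norm - shifted[target]
```

With equal logits and real counts of [9, 1], the loss is log 10 ≈ 2.30 when the target is the rare class and log(10/9) ≈ 0.105 when it is the frequent class, so the model has to produce a larger logit margin for the rare class to reach the same loss.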

Results

State-of-the-art on the long tail

We introduce two long-tailed benchmarks built from popular datasets: K100-LT and UCF-LT.

Class-average accuracy (C/A) against standard and SOTA long-tail baselines, all using the same VideoMAE backbone. Gen2Balance achieves the best overall accuracy, with the largest gains on tail and few-shot classes.

Benchmark

Method	Gen.	Few	Tail	Head	Avg C/A

Gen. = uses our generated data. The grayed CE (full dataset) row trains on the full balanced data and is an upper-bound reference.
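For reference, class-average accuracy weights every class equally, so a model cannot score well by getting only the head classes right. The sketch below assumes C/A is the usual mean of per-class accuracies; the function name is illustrative.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def class_average_accuracy(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean of per-class accuracies over the classes present in y_true."""
    if len(y_true) == 0:
        raise ValueError("y_true must not be empty")
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Average the per-class accuracies, not the per-sample outcomes.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

As a toy example, four correct predictions on one class and a single wrong prediction on another class give 80% plain accuracy but only 50% class-average accuracy.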

Interactive

Interact and explore the results!

Click on any bar! Each bar represents a K100-LT class, showing our accuracy improvement over the baseline. Notice the gains grow toward the data-starved tail and few-shot classes.

📊 Class names, head/tail/few-shot groups and the real/generated counts are real (from annotations/). The accuracy values are placeholders; regenerate them with your own numbers via webpage/tools/make_perclass_data.py --real-acc <file>, and export clips to fill the video slots.

Bars: green = improvement, red = regression. Shaded bands group head / tail / few-shot classes.

Can Gen2Balance extend to rare actions?

The RareAct dataset features compositionally rare actions, formed by unlikely verb-noun pairs such as cut keyboard, drill phone and hammer phone, that almost never occur in real footage. We curate 22 of them and append them to K100-LT as few-shot classes, with only 5 real training clips each, to ask: can Gen2Balance extend to rare actions?

+31.9% over the strongest baseline on rare actions

Class-average accuracy on the 22 RareAct classes:

Cross-Entropy: 11.3%
Balanced Softmax: 23.2%
Logit Adjustment: 27.8%
Gen2Balance: 59.7%

Real vs. generated rare actions

Each action is shown as a real clip next to a generated clip: Cut keyboard, Drill phone, Hammer phone, Peel corn, Spray shoes, Weigh tomato.

How much compute do you actually need?

Generation can scale without bound, but you don't need to fully balance the dataset to see major gains. Drag the filling threshold B to explore the trade-off: the left plot shows accuracy vs. cost (H100 GPU-hours), and the right plot visualizes the per-class training-set distribution.

Accuracy & cost vs. B

Training-set size per class

Takeaway

Partial balancing (B = 330) recovers +6.4% average and +13.5% few-shot accuracy, close to full balancing's +8.1% / +13.5%, at just 27% of the generation compute (2.5K vs 9.2K GPU-hours).

You can capture the bulk of the gains for a fraction of the cost, and this is a one-time offline cost that shrinks as generative models get faster.
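One way to read the filling threshold B is as a per-class target size: any class with fewer than B real clips is topped up with generated clips, and larger classes are left alone. The sketch below assumes exactly that rule; the function name and the example counts are hypothetical and not taken from K100-LT.

```python
from typing import Dict


def clips_to_generate(real_counts: Dict[str, int], B: int) -> Dict[str, int]:
    """Number of synthetic clips needed per class to reach the threshold B.

    Assumes each class below B is filled up to exactly B and classes at or
    above B receive no generated clips.
    """
    return {name: max(0, B - n) for name, n in real_counts.items()}


# Made-up class sizes for illustration only.
example = clips_to_generate({"head": 500, "tail": 100, "few": 5}, B=330)
total_generated = sum(example.values())
```

Because generation cost grows with the total number of clips, lowering B shrinks the budget quickly while the smallest classes still receive most of their extra data.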

Our Contributions

Generative balancing

We balance long-tailed video datasets with synthetic clips from a pre-trained text-to-video model.

A pipeline & a released dataset

An automatic LLM-driven pipeline (Gemini 2.5 Pro + WAN 2.1) for diverse, class-faithful prompts, and a public release of 140K videos across 223 classes.

A study of training strategies

Two-stage training (combined → real) with a class-balanced loss using real-data margins works best.

State-of-the-art results

Up to +6.7% (K100-LT) and +5.5% (UCF-LT few-shot) over the strongest baselines, and +31.9% on the rare actions from RareAct.

Citation

BibTeX

@inproceedings{gatti2026gen2balance,
  title     = {Gen2Balance: Generative Balancing for Long-Tailed Video Action Recognition},
  author    = {Gatti, Prajwal and Jenni, Simon and Caba Heilbron, Fabian and Damen, Dima},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Acknowledgements

This work was supported by EPSRC Fellowship UMPIRE (EP/T004991/1) and a charitable donation from Adobe to the University of Bristol. We acknowledge the usage of GPU Node hours granted as part of the AIRR Gateway project “HOI Foundational Model from Egocentric Data” (Dec 2025–Mar 2026) and the Sovereign AI Unit call project “Gen Model in Ego-sensed World” (Aug 2025–Nov 2025).