
indirect_caption GRPO Leaderboard

GRPO RL on Qwen2.5-VL with vLLM judge. Caption-metric eval on COCO val_lite + NoCaps val_lite + COCO Karpathy test. (CaptionQA moved to the caption-substitute board below — it's the same caption→judge framework as cap-substitute, just on MC questions.) Column headers show the max_new_tokens we set per task. Sorted by Avg by default — click any header to re-sort. Use the Show metrics multi-select to toggle which metrics appear in the multi-cells (mean/tok lines stay). Sorting and the Avg column recompute from the enabled metrics on every toggle. CLIPScore / RefCLIPScore / PAC-S / RefPAC-S / POLOS appear once their sidecar JSON lands for that (model, task) cell.

Source: projects/indirect-caption/data/eval_results.csv

Caption-substitute VQA

Each captioner describes the image (caption max_new_tokens=256); qwen3-30b-instruct answers the original lmms-eval question text-only from that description (answer max_tokens=128). 12 VQA-style tasks × 300 random samples each (seed=0) + CaptionQA (4-subdomain mean) as the first column. CaptionQA always uses its own canonical prompt "Describe this image in detail.", so the same value is shown across all 3 variant rows for a given captioner. Original task prompts and metrics preserved verbatim. Standalone page: cap_substitute/.

Source: projects/indirect-caption/cap_substitute/data/cap_substitute_results.csv

Benchmark reference

What each main-board benchmark measures, the caption prompt sent to the model, our run's max_new_tokens, and the ground-truth distribution against which the captions are scored. GT lengths measured with the Qwen2.5-VL tokenizer.

Benchmark	Caption prompt	max_new_tokens (ours)	Refs per image	GT length (tok)	Metric(s)
`captionqa` (4 subdomains)	"Describe this image in detail."	1024	~17 MC questions / image	mean=2.7 p50=2 (single-letter answer; choice-text for judge prompt)	judge-decided correct-letter rate (`captionqa_score`)
`coco2017_cap_val_lite`	"Provide a one-sentence caption for the provided image."	1024	5	mean=12.1 p50=11 p90=16	BLEU-1/4, METEOR, ROUGE-L, CIDEr
`nocaps_val_lite`	"Provide a one-sentence caption for the provided image."	1024	10	mean=13.2 p50=12 p90=17	BLEU-1/4, METEOR, ROUGE-L, CIDEr
`coco_karpathy_test`	"Describe the image briefly."	64	5	mean=11.9 p50=11 p90=15	BLEU-1/4, METEOR, ROUGE-L, CIDEr
`docci_test` (ECCV 2024)	"Generate a detailed image description with around 120 words, but you may adjust the length if you want." (authors' prompt)	256	1	mean≈180 long-form descriptive	BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
`localized_narratives_test` (ECCV 2020)	Text-only adaptation of the annotator instruction (focus on concrete objects; no speculation).	128	1	mean≈50 narrated/spoken style	BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
`detailcaps_4870` (CAPTURE, 2024)	"Describe the image in detail." (canonical CAPTURE prompt)	512	3 (GPT-4O / GPT-4V / Gemini1.5Pro)	mean≈120 dense visual facts	BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
`sharegpt4v_coco_eval` (ShareGPT4V, 2023)	"Analyze the image in a comprehensive and detailed manner." (Share-Captioner training prompt)	512	1 (GPT-4V re-caption of COCO)	mean≈180 GPT-4V dense caption	BLEU-1/4, METEOR, ROUGE-L, CIDEr (n=500, our COCO subset on HF: `Icey444/ShareGPT4V-COCO-eval`)

→ The first three n-gram benches expect ≈ 12-tok references. When a captioner emits 20–60-tok outputs (Qwen 7B base under the brief prompt), BLEU-4 / CIDEr collapse because of the length mismatch: the extra n-grams dilute BLEU precision and shrink CIDEr's cosine similarity. That's why 7B base's karpathy_CIDEr ≈ 0.0001. The last four rows are long-form benches (refs 50–180 tok) we PR'd upstream to lmms-eval (branches add-docci-test, add-localized-narratives-test, add-detailcaps-4870, add-sharegpt4v-coco-eval on Irisicy4/lmms-eval).

How each metric scores against multiple references

COCO val_lite, NoCaps val_lite, and COCO Karpathy test each have multiple reference captions per image (5, 10, 5 respectively). Each metric handles the multi-reference case differently.

BLEU-1 / BLEU-4 — n-gram precision

Take the candidate caption and count its n-grams: 1-grams only for BLEU-1; 1- through 4-grams for BLEU-4, which is the geometric mean of the four modified precisions.
For each candidate n-gram, the clip count = min(count_in_candidate, max(count_in_any_reference)). Multi-ref handling: you get credit if any reference contains that n-gram, capped by how often.
Modified precision = sum(clip_counts) / sum(candidate_counts).
Brevity penalty: if candidate is shorter than the closest reference length, multiply by exp(1 − r/c). So short outputs are penalised; long outputs are not — which is why verbose 7B captions don't trigger BP but kill precision.
Aggregate across the dataset by summing the clipped-count numerators and candidate-count denominators over all samples (corpus BLEU), not by averaging per-row scores.

BLEU-1 measures word-level agreement (lenient). BLEU-4 requires four consecutive words to match — collapses fast with verbose, paraphrased outputs.
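The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scorer the leaderboard runs: it assumes whitespace-tokenised inputs, applies no smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_counts(cand, refs, n):
    """(clipped matches, total candidate n-grams) for one sample."""
    cand_counts = ngrams(cand, n)
    max_ref = Counter()
    for ref in refs:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)  # best count in any reference
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

def closest_ref_len(cand, refs):
    """Reference length closest to the candidate length (shorter wins ties)."""
    return min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]

def brevity_penalty(c, r):
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def corpus_bleu(samples, max_n):
    """samples: list of (candidate_tokens, [reference_tokens, ...])."""
    num = [0] * max_n
    den = [0] * max_n
    c_len = r_len = 0
    for cand, refs in samples:
        for n in range(1, max_n + 1):
            m, t = clipped_counts(cand, refs, n)
            num[n - 1] += m
            den[n - 1] += t
        c_len += len(cand)
        r_len += closest_ref_len(cand, refs)
    if min(num) == 0:  # no smoothing: any empty order zeroes the score
        return 0.0
    log_p = sum(math.log(num[i] / den[i]) for i in range(max_n)) / max_n
    return brevity_penalty(c_len, r_len) * math.exp(log_p)
```

With the candidate "a cat sits on the mat" and the references "a cat is on the mat" / "the cat sits on a mat", every unigram is covered by some reference, so BLEU-1 is 1.0, yet no 4-gram matches either reference, so BLEU-4 is 0.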

METEOR — alignment with synonyms / stems

Build a unigram alignment between candidate and each reference using exact match → Porter stems → WordNet synonyms → paraphrase tables.
Compute precision P, recall R over the best alignment.
F_mean = 10·P·R / (R + 9·P) (weighted harmonic mean, biased toward recall).
Fragmentation penalty: group the matched words into chunks that are contiguous and in the same order in both candidate and reference. More chunks → larger penalty.
Multi-reference: score against each ref individually, take the max.
Corpus-level: mean across samples.

METEOR is much friendlier to paraphrase than BLEU and is the steadiest single number across the leaderboard.
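A small sketch of how the pieces combine once an alignment is known. It takes the match and chunk counts as inputs rather than computing the alignment itself. The penalty constants (0.5 and exponent 3) come from the original 2005 METEOR formulation and are an assumption here; tuned METEOR releases use different parameters. Function names are hypothetical.

```python
def f_mean(p, r):
    """Recall-weighted harmonic mean: 10PR / (R + 9P)."""
    if p == 0 or r == 0:
        return 0.0
    return 10 * p * r / (r + 9 * p)

def meteor_single(matches, chunks, cand_len, ref_len):
    """Score against one reference, given the alignment statistics."""
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    # Assumed constants from the original formulation: 0.5 * (chunks/matches)^3
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean(p, r) * (1 - penalty)

def meteor_multi(alignments, cand_len):
    """alignments: one (matches, chunks, ref_len) tuple per reference.
    Multi-reference handling: keep the best single-reference score."""
    return max(meteor_single(m, ch, cand_len, rl) for m, ch, rl in alignments)
```

The recall bias is easy to see: `f_mean(0.5, 1.0)` is about 0.91, while `f_mean(1.0, 0.5)` is about 0.53.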

ROUGE-L — longest common subsequence

Find the longest common subsequence (LCS — not substring; words can be non-contiguous) between candidate and reference.
Recall R = LCS / len(ref), precision P = LCS / len(cand), F-score F = (1+β²)·P·R / (R + β²·P) with β=1.2.
Multi-reference: F-score against each ref, take the max.
Corpus-level: mean of per-sample max-F.

ROUGE-L cares about word order without requiring contiguity. Modest sensitivity to length.
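The description above translates into a short dynamic-programming sketch. It assumes pre-tokenised inputs and follows the max-F rule as stated in the text; some implementations instead take the best precision and best recall over the references before forming F, so check the scorer you actually use. Function names are hypothetical.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(cand, refs, beta=1.2):
    """Best F-score over the references (max-F rule described in the text)."""
    best = 0.0
    for ref in refs:
        lcs = lcs_len(cand, ref)
        if lcs == 0:
            continue
        p = lcs / len(cand)
        r = lcs / len(ref)
        f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
        best = max(best, f)
    return best
```

For "the cat sat on the mat" against "the cat is on the mat", the LCS is "the cat on the mat" (5 tokens), so P = R = 5/6 and F = 5/6.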

CIDEr — TF-IDF n-gram consensus (the one that crushed Qwen 7B)

CIDEr uses all references jointly, not max:

Tokenise + Porter-stem candidate and all references.
For n = 1..4, compute TF-IDF weights, where IDF is computed across the entire corpus's reference captions — n-grams that appear in many refs across the dataset are downweighted.
Per sample, for each n: build a TF-IDF vector for the candidate; build TF-IDF vectors for each of the M references; compute cosine(cand_vec, ref_vec) for each ref, then average over refs.
CIDEr-n = mean across samples.
Final CIDEr = mean of CIDEr-1..CIDEr-4 × 10 (paper convention — some implementations omit ×10).

Why CIDEr is brutal on verbose outputs: the cosine divides by the L2 norm of the candidate's TF-IDF vector. A long, descriptive caption adds many n-grams that no reference contains; they inflate that norm without adding to the dot product, so the similarity to any short reference vector becomes vanishingly small. The 64-tok-cap Karpathy run on Qwen 7B base hit CIDEr ≈ 0.0001 not because the captions were wrong, but because they were so much longer and more elaborate than the ≈ 12-tok GT references that the cosine similarity vanished.
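A simplified sketch of the computation makes the length effect concrete. It assumes pre-tokenised input, skips stemming, and omits the Gaussian length penalty and count clipping that the CIDEr-D variant adds, so its numbers will not match a production scorer. Function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_idf(all_refs, n):
    """IDF over the reference corpus: document frequency is the number of
    images whose reference set contains the n-gram."""
    df = Counter()
    for refs in all_refs:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n))
        df.update(seen)
    num_images = len(all_refs)
    # Unseen n-grams get df = 1, i.e. the maximum IDF.
    return lambda gram: math.log(num_images / max(1.0, df[gram]))

def tfidf(tokens, n, idf):
    return {g: c * idf(g) for g, c in ngram_counts(tokens, n).items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(cands, all_refs, max_n=4, scale=10.0):
    """cands: one token list per image; all_refs: list of reference lists."""
    per_n = []
    for n in range(1, max_n + 1):
        idf = build_idf(all_refs, n)
        scores = []
        for cand, refs in zip(cands, all_refs):
            cv = tfidf(cand, n, idf)
            # Average over references: all refs contribute, not just the best.
            scores.append(sum(cosine(cv, tfidf(r, n, idf)) for r in refs) / len(refs))
        per_n.append(sum(scores) / len(scores))
    return scale * sum(per_n) / len(per_n)
```

Appending extra words to an otherwise exact caption leaves the dot product unchanged but raises the candidate norm, so the score drops even though nothing in the caption is wrong.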

CaptionQA — judge-decided correct-letter rate

For each (image, question, choices) tuple, build a prompt: You are given a caption of an image and a multiple-choice question… Caption: {caption} Question: {q} Options: A. … B. … C. … D. … Answer:
Send to the judge LLM (qwen3-30b-instruct in our setup; Qwen2.5-72B-Instruct in the paper). Judge returns A/B/C/D/Cannot answer.
Per question: 1 if matches GT letter, 0 otherwise. Cannot answer counts as 0 for captionqa_score and gets tracked separately as captionqa_cannot_answer_rate.
Aggregate: captionqa_score = mean(per_question_scores) as a flat mean over all questions in the split.

CaptionQA has no "multiple references" — each question has one correct letter. It's a utility-of-caption measure, sidestepping the n-gram-overlap pathology.
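The aggregation is simple enough to sketch directly. This assumes the judge output has already been parsed into a letter or the string "Cannot answer"; the record layout and function name are hypothetical, and only the two metric names come from the description above.

```python
def captionqa_metrics(records):
    """records: iterable of (judge_answer, gt_letter) pairs.
    Returns the flat mean over all questions plus the cannot-answer rate."""
    total = correct = cannot = 0
    for pred, gt in records:
        total += 1
        answer = pred.strip()
        if answer.lower() == "cannot answer":
            cannot += 1  # scores 0, tracked separately
        elif answer.upper() == gt.strip().upper():
            correct += 1
    if total == 0:
        return {"captionqa_score": 0.0, "captionqa_cannot_answer_rate": 0.0}
    return {
        "captionqa_score": correct / total,
        "captionqa_cannot_answer_rate": cannot / total,
    }
```

For four questions with judge answers A (correct), B (wrong), "Cannot answer", and D (correct), this gives a score of 0.5 and a cannot-answer rate of 0.25.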

WinVsBase — GPT-5.4 pairwise arena vs matching-size base

For each sample we feed GPT-5.4 the image plus two unlabeled captions, A and B, one from each captioner being compared. A/B order is randomized per sample (seeded RNG) to neutralise position bias; verdicts are then un-swapped so that slot A always denotes the trained model.
Judge system prompt: You are an expert evaluator of image captions. You will see an image and two candidate captions A and B describing it. Pick which caption is overall better (more accurate, more complete, more useful to a reader who has not seen the image). Respond with exactly one of the JSON objects: {"winner":"A"} or {"winner":"B"} or {"winner":"TIE"}.
User message: the image as a base-64 data URL plus Caption A: <trained model caption> Caption B: <base model caption> Which caption is better? (Order shown before the per-sample swap.)
Judge model: Azure gpt-5.4-2026-03-05, temperature=0, max_tokens=32, concurrency=16 per arena run. Runs are serialised — stacking 5 in parallel (= 80 concurrent requests) hit ~80 % Azure rate-limit errors.
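Score: WinVsBase = 2 · a_win_rate, where a_win_rate is the trained model's share of wins over all judged samples. With no ties, 1.0 = parity with base, > 1 = beats base, < 1 = loses. Ties are counted in the denominator, so they pull the score below 1.0; errors are discarded. Base models display 1.0000 by convention.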
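A minimal sketch of the swap, un-swap, and scoring logic described above. The helper names, the 50/50 swap rule, and the default seed are assumptions for illustration, not the actual arena script.

```python
import random

def assign_swaps(doc_ids, seed=0):
    """Decide per sample whether the trained model is shown in slot B.
    Seeded so the same assignment can be reproduced (seed value assumed)."""
    rng = random.Random(seed)
    return {d: rng.random() < 0.5 for d in doc_ids}

def unswap(verdict, swapped):
    """Map the judge's verdict back so that 'A' always means the trained model."""
    if verdict == "TIE" or not swapped:
        return verdict
    return "B" if verdict == "A" else "A"

def win_vs_base(verdicts):
    """verdicts: un-swapped 'A' / 'B' / 'TIE', or None for a failed request.
    Errors are discarded; ties stay in the denominator."""
    valid = [v for v in verdicts if v in ("A", "B", "TIE")]
    if not valid:
        return None
    a_win_rate = sum(v == "A" for v in valid) / len(valid)
    return 2 * a_win_rate
```

For example, the verdicts A, B, TIE plus one error give 2 · (1/3) ≈ 0.667: the error is dropped, but the tie still counts against the trained model.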

Sample size per cell = 300 common doc_ids (the intersection of the A and B samples_*.jsonl files), shuffled with the same seed across all pairs for comparability. Pairings always use the matching size-class base: a 3B model vs Qwen2.5-VL-3B-Instruct, 7B vs 7B.
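The sampling rule can be sketched as follows. The function name and the seed value are hypothetical; the sketch sorts the intersection before shuffling so the result does not depend on file order, which is one way (an assumption, not necessarily what the arena script does) to keep samples comparable across pairs.

```python
import random

def common_sample(ids_a, ids_b, k=300, seed=0):
    """Pick k doc_ids present in both runs, deterministically for a given seed."""
    common = sorted(set(ids_a) & set(ids_b))  # sort for order-independence
    random.Random(seed).shuffle(common)
    return common[:k]
```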