GRPO RL on Qwen2.5-VL with vLLM judge. Caption-metric eval on COCO val_lite + NoCaps val_lite + COCO Karpathy test. (CaptionQA moved to the caption-substitute board below — it's the same caption→judge framework as cap-substitute, just on MC questions.)
Column headers show the max_new_tokens we set per task. Sorted by Avg by default — click any header to re-sort.
Use the Show metrics multi-select to toggle which metrics appear in the multi-cells (mean/tok lines stay). Sorting and the Avg column recompute from the enabled metrics on every toggle. CLIPScore / RefCLIPScore / PAC-S / RefPAC-S / POLOS appear once their sidecar JSON lands for that (model, task) cell.
Each captioner describes the image (caption max_new_tokens=256); qwen3-30b-instruct answers the original lmms-eval question text-only from that description (answer max_tokens=128).
12 VQA-style tasks × 300 random samples each (seed=0) + CaptionQA (4-subdomain mean) as the first column. CaptionQA always uses its own canonical prompt "Describe this image in detail.", so the same value is shown across all 3 variant rows for a given captioner.
Original task prompts and metrics preserved verbatim. Standalone page: cap_substitute/.
What each main-board benchmark measures, the caption prompt sent to the model, our run's max_new_tokens, and the ground-truth distribution against which the captions are scored. GT lengths measured with the Qwen2.5-VL tokenizer.
| Benchmark | Caption prompt | max_new_tokens (ours) | Refs per image | GT length (tok) | Metric(s) |
|---|---|---|---|---|---|
| captionqa (4 subdomains) | "Describe this image in detail." | 1024 | ~17 MC questions / image | mean=2.7 p50=2 (single-letter answer; choice-text for judge prompt) | judge-decided correct-letter rate (captionqa_score) |
| coco2017_cap_val_lite | "Provide a one-sentence caption for the provided image." | 1024 | 5 | mean=12.1 p50=11 p90=16 | BLEU-1/4, METEOR, ROUGE-L, CIDEr |
| nocaps_val_lite | "Provide a one-sentence caption for the provided image." | 1024 | 10 | mean=13.2 p50=12 p90=17 | BLEU-1/4, METEOR, ROUGE-L, CIDEr |
| coco_karpathy_test | "Describe the image briefly." | 64 | 5 | mean=11.9 p50=11 p90=15 | BLEU-1/4, METEOR, ROUGE-L, CIDEr |
| docci_test (ECCV 2024) | "Generate a detailed image description with around 120 words, but you may adjust the length if you want." (authors' prompt) | 256 | 1 | mean≈180 long-form descriptive | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300) |
| localized_narratives_test (ECCV 2020) | Text-only adaptation of the annotator instruction (focus on concrete objects; no speculation). | 128 | 1 | mean≈50 narrated/spoken style | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300) |
| detailcaps_4870 (CAPTURE, 2024) | "Describe the image in detail." (canonical CAPTURE prompt) | 512 | 3 (GPT-4O / GPT-4V / Gemini1.5Pro) | mean≈120 dense visual facts | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300) |
| sharegpt4v_coco_eval (ShareGPT4V, 2023) | "Analyze the image in a comprehensive and detailed manner." (Share-Captioner training prompt) | 512 | 1 (GPT-4V re-caption of COCO) | mean≈180 GPT-4V dense caption | BLEU-1/4, METEOR, ROUGE-L, CIDEr (n=500, our COCO subset on HF: Icey444/ShareGPT4V-COCO-eval) |
→ The first three n-gram benches expect ≈ 12-tok references. When a captioner emits 20–60-tok outputs (Qwen 7B base under the brief prompt), BLEU-4 / CIDEr collapse via length-cosine mismatch — that's why 7B base's karpathy_CIDEr ≈ 0.0001. The four lower rows are long-form benches (refs 50–180 tok) we PR'd upstream to lmms-eval (branches add-docci-test, add-localized-narratives-test, add-detailcaps-4870, add-sharegpt4v-coco-eval on Irisicy4/lmms-eval).
COCO val_lite, NoCaps val_lite, and COCO Karpathy test each have multiple reference captions per image (5, 10, 5 respectively). Each metric handles the multi-reference case differently.
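BLEU, for each n-gram order n:
- Clipped count per candidate n-gram: min(count_in_candidate, max(count_in_any_reference)). Multi-ref handling: you get credit if any reference contains that n-gram, capped by how often it occurs in the reference that uses it most.
- Modified precision: sum(clip_counts) / sum(candidate_counts).
- Brevity penalty: exp(1 − r/c) when the candidate length c is shorter than the reference length r, and 1 otherwise. So short outputs are penalised; long outputs are not, which is why verbose 7B captions don't trigger BP but kill precision.

BLEU-1 measures word-level agreement (lenient). BLEU-4 requires four consecutive words to match, so it collapses fast with verbose, paraphrased outputs.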
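The clipping and brevity-penalty steps can be sketched in a few lines of Python. This is an illustrative toy, not the scorer used for the leaderboard: the helper names (`ngrams`, `modified_precision`, `brevity_penalty`), the whitespace tokenisation, and the example captions are all assumptions made for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    # For each n-gram, the cap is the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(cand_len, ref_len):
    """1.0 when the candidate is at least as long as the reference, exp(1 - r/c) otherwise."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)


# Hypothetical captions, tokenised by whitespace for illustration only.
refs = ["a dog runs on the beach".split(), "a brown dog running along the shore".split()]
short = "a dog runs on the beach".split()
verbose = "a happy brown dog is joyfully running across a wide sandy beach at sunset".split()

p1_short = modified_precision(short, refs, 1)
p1_verbose = modified_precision(verbose, refs, 1)
p4_verbose = modified_precision(verbose, refs, 4)
```

In this toy case the verbose caption pays no brevity penalty, yet its unigram precision drops and its 4-gram precision is zero, which is the same pattern described for the verbose 7B outputs.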
METEOR:
- Compute unigram precision P and recall R over the best alignment.
- F_mean = 10·P·R / (R + 9·P) (weighted harmonic mean, biased toward recall).

METEOR is much friendlier to paraphrase than BLEU and is the steadiest single number across the leaderboard.
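To see how strongly the F_mean formula favours recall, here is a minimal sketch that evaluates it for two mirrored precision/recall pairs. It covers only the F_mean step as written above; the alignment stage and any other parts of a full METEOR scorer are left out, and the function name is an assumption for illustration.

```python
def meteor_f_mean(precision, recall):
    """Recall-weighted harmonic mean: 10*P*R / (R + 9*P)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)


# Same pair of values, swapped between precision and recall.
high_recall = meteor_f_mean(precision=0.5, recall=1.0)
high_precision = meteor_f_mean(precision=1.0, recall=0.5)
```

With these inputs the high-recall case scores well above the high-precision one, so a longer caption that covers the reference is punished far less than under BLEU.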
ROUGE-L:
- Recall R = LCS / len(ref), precision P = LCS / len(cand), F-score F = (1+β²)·P·R / (R + β²·P) with β=1.2.

ROUGE-L cares about word order without requiring contiguity. Modest sensitivity to length.
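A minimal single-reference sketch of the ROUGE-L formula, assuming whitespace-tokenised captions. How multiple references are combined is not covered here, and the function names and example sentences are illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """F-score over LCS-based precision and recall, with beta weighting recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


cand = "a dog runs on the beach".split()
ref = "a dog is running on the beach".split()
lcs = lcs_length(cand, ref)
score = rouge_l(cand, ref)
```

The shared subsequence here skips over "is running", which shows why ROUGE-L rewards matching word order without needing contiguous matches.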
CIDEr uses all references jointly, not max:
- For each n = 1..4, compute TF-IDF weights, where IDF is computed across the entire corpus's reference captions, so n-grams that appear in many refs across the dataset are downweighted.
- For each n: build a TF-IDF vector for the candidate; build TF-IDF vectors for each of the M references; compute cosine(cand_vec, ref_vec) for each ref, then average over refs.
- CIDEr-n = mean across samples.
- CIDEr = mean of CIDEr-1..CIDEr-4 × 10 (paper convention; some implementations omit ×10).

Why CIDEr is brutal on verbose outputs: the cosine normalises the candidate's TF-IDF vector by its own L2 norm. A long, descriptive caption spreads its weight across many n-grams that never occur in the references, so the per-component overlap with any short reference vector becomes vanishingly small. The 64-tok-cap Karpathy run on Qwen 7B base hit CIDEr ≈ 0.0001 not because the captions were wrong; they were just so much longer and more elaborate than the ≈ 12-tok GT references that the cosine similarity vanished.
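The TF-IDF-and-cosine recipe can be sketched on a toy corpus. This is a simplified illustration under stated assumptions, not the scorer behind the leaderboard numbers: the three-image corpus, the whitespace tokenisation, the plain log(N/df) IDF, and the choice to treat unseen n-grams as df = 1 are all choices made for this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(corpus_refs, n):
    """For each n-gram, the number of images whose reference set contains it."""
    df = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n).keys())
        df.update(seen)
    return df


def tfidf_vector(tokens, n, df, num_images):
    """Sparse TF-IDF vector; n-grams unseen in the corpus are treated as df = 1."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return {
        gram: (count / total) * math.log(num_images / max(df[gram], 1))
        for gram, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cider(candidate, refs, corpus_refs, max_n=4):
    """Per-image score: cosine averaged over refs, then over n = 1..max_n, times 10."""
    num_images = len(corpus_refs)
    per_n = []
    for n in range(1, max_n + 1):
        df = document_frequency(corpus_refs, n)
        cand_vec = tfidf_vector(candidate, n, df, num_images)
        sims = [cosine(cand_vec, tfidf_vector(ref, n, df, num_images)) for ref in refs]
        per_n.append(sum(sims) / len(sims))
    return 10 * sum(per_n) / max_n


# Hypothetical three-image corpus with two references per image.
corpus = [
    ["a dog runs on the beach".split(), "a dog running on the sand".split()],
    ["a cat sleeps on a sofa".split(), "a cat lying on the couch".split()],
    ["two men play soccer in a park".split(), "men playing football on grass".split()],
]
short = "a dog runs on the beach".split()
verbose = ("a happy brown dog is joyfully running across a wide sandy beach "
           "at sunset while waves roll in").split()

cider_short = cider(short, corpus[0], corpus)
cider_verbose = cider(verbose, corpus[0], corpus)
```

Even though the verbose caption describes the same scene, it shares no 2-, 3-, or 4-grams with the references in this toy corpus, so three of the four per-n terms are zero and the remaining unigram cosine is diluted by the extra words.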
CaptionQA: for each (image, question, choices) tuple, build a prompt:
You are given a caption of an image and a multiple-choice question…
Caption: {caption}
Question: {q}
Options: A. … B. … C. … D. …
Answer:
- Send the prompt to the judge (qwen3-30b-instruct in our setup; Qwen2.5-72B-Instruct in the paper). Judge returns A/B/C/D/Cannot answer.
- Score 1 if the answer matches the GT letter, 0 otherwise. Cannot answer counts as 0 for captionqa_score and gets tracked separately as captionqa_cannot_answer_rate.
- captionqa_score = mean(per_question_scores) as a flat mean over all questions in the split.

CaptionQA has no "multiple references": each question has one correct letter. It's a utility-of-caption measure, sidestepping the n-gram-overlap pathology.
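The scoring rule can be sketched as below. The function name, the input shape (a list of judge-answer / ground-truth-letter pairs), and the string normalisation are assumptions for illustration; only the two metric names come from the description above.

```python
def captionqa_metrics(judgements):
    """Flat-mean scores over (judge_answer, gt_letter) pairs for one split."""
    if not judgements:
        raise ValueError("need at least one judged question")
    n = len(judgements)
    # A question scores 1 only when the judge's letter equals the GT letter.
    correct = sum(
        1 for answer, gt in judgements
        if answer.strip().upper() == gt.strip().upper()
    )
    # "Cannot answer" never matches a letter, so it already counts as 0 above;
    # it is tallied again here to report its rate separately.
    cannot = sum(
        1 for answer, _ in judgements
        if answer.strip().lower() == "cannot answer"
    )
    return {
        "captionqa_score": correct / n,
        "captionqa_cannot_answer_rate": cannot / n,
    }


# Hypothetical judge outputs for four questions.
metrics = captionqa_metrics([
    ("A", "A"),
    ("B", "C"),
    ("Cannot answer", "D"),
    ("D", "D"),
])
```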
- Take captions A and B from the two captioners being compared. The trained model occupies slot A after un-swap; A/B order is randomized per sample (seeded RNG) to neutralise position bias.
- Judge prompt:

You are an expert evaluator of image captions. You will see an image and two candidate captions A and B describing it. Pick which caption is overall better (more accurate, more complete, more useful to a reader who has not seen the image). Respond with exactly one of the JSON objects: {"winner":"A"} or {"winner":"B"} or {"winner":"TIE"}.
Caption A: <trained model caption>
Caption B: <base model caption>
Which caption is better?
- Judge: gpt-5.4-2026-03-05, temperature=0, max_tokens=32, concurrency=16 per arena run. Runs are serialised: stacking 5 in parallel (= 80 concurrent requests) hit ~80 % Azure rate-limit errors.
- WinVsBase = 2 · a_win_rate, so 1.0 = tied with base, > 1 = beats base, < 1 = loses. Ties counted in the denominator; errors discarded. Base models display 1.0000 by convention.
- Sample size per cell = 300 common doc_ids (the intersection of the A and B samples_*.jsonl files), shuffled with the same seed across all pairs for comparability. Pairings always use the matching size-class base: 3B model vs Qwen2.5-VL-3B-Instruct, 7B vs 7B.
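A minimal sketch of the slot randomisation, un-swap, and WinVsBase arithmetic described above. The function names, the 50/50 swap rule, and the use of `None` to mark a failed judge call are assumptions made for illustration, not the actual arena code.

```python
import random


def assign_slots(trained_caption, base_caption, rng):
    """Randomly pick which caption the judge sees as A; return (shown_a, shown_b, swapped)."""
    swapped = rng.random() < 0.5
    if swapped:
        return base_caption, trained_caption, True
    return trained_caption, base_caption, False


def unswap(winner, swapped):
    """Map the judge's verdict back so that 'A' always means the trained model."""
    if swapped and winner in ("A", "B"):
        return "B" if winner == "A" else "A"
    return winner


def win_vs_base(verdicts):
    """2 * (share of A wins); ties stay in the denominator, errors (None) are dropped."""
    valid = [v for v in verdicts if v is not None]
    if not valid:
        return float("nan")
    a_win_rate = sum(v == "A" for v in valid) / len(valid)
    return 2 * a_win_rate


rng = random.Random(0)  # seeded so the slot order is reproducible
shown_a, shown_b, swapped = assign_slots("trained caption", "base caption", rng)

# Hypothetical un-swapped verdicts: two wins, one loss, one tie, one failed call.
score = win_vs_base(["A", "B", "TIE", "A", None])
```

Because ties remain in the denominator, a model that ties on every sample would score 0.0 rather than 1.0 under this formula, so the value is best read as twice the outright win share.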