
indirect_caption GRPO Leaderboard

GRPO RL on Qwen2.5-VL with vLLM judge. Caption-metric eval on COCO val_lite + NoCaps val_lite + COCO Karpathy test. (CaptionQA moved to the caption-substitute board below — it's the same caption→judge framework as cap-substitute, just on MC questions.) Column headers show the max_new_tokens we set per task. Sorted by Avg by default — click any header to re-sort. Use the Show metrics multi-select to toggle which metrics appear in the multi-cells (mean/tok lines stay). Sorting and the Avg column recompute from the enabled metrics on every toggle. CLIPScore / RefCLIPScore / PAC-S / RefPAC-S / POLOS appear once their sidecar JSON lands for that (model, task) cell.


Caption-substitute VQA

Each captioner describes the image (caption max_new_tokens=256); qwen3-30b-instruct answers the original lmms-eval question text-only from that description (answer max_tokens=128). 12 VQA-style tasks × 300 random samples each (seed=0) + CaptionQA (4-subdomain mean) as the first column. CaptionQA always uses its own canonical prompt "Describe this image in detail.", so the same value is shown across all 3 variant rows for a given captioner. Original task prompts and metrics preserved verbatim. Standalone page: cap_substitute/.

Benchmark reference

What each main-board benchmark measures, the caption prompt sent to the model, our run's max_new_tokens, and the ground-truth distribution against which the captions are scored. GT lengths measured with the Qwen2.5-VL tokenizer.

Benchmark | Caption prompt | max_new_tokens (ours) | Refs per image | GT length (tok) | Metric(s)
captionqa (4 subdomains) | "Describe this image in detail." | 1024 | ~17 MC questions / image | mean=2.7 p50=2 (single-letter answer; choice-text for judge prompt) | judge-decided correct-letter rate (captionqa_score)
coco2017_cap_val_lite | "Provide a one-sentence caption for the provided image." | 1024 | 5 | mean=12.1 p50=11 p90=16 | BLEU-1/4, METEOR, ROUGE-L, CIDEr
nocaps_val_lite | "Provide a one-sentence caption for the provided image." | 1024 | 10 | mean=13.2 p50=12 p90=17 | BLEU-1/4, METEOR, ROUGE-L, CIDEr
coco_karpathy_test | "Describe the image briefly." | 64 | 5 | mean=11.9 p50=11 p90=15 | BLEU-1/4, METEOR, ROUGE-L, CIDEr
docci_test (ECCV 2024) | "Generate a detailed image description with around 120 words, but you may adjust the length if you want." (authors' prompt) | 256 | 1 | mean≈180, long-form descriptive | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
localized_narratives_test (ECCV 2020) | Text-only adaptation of the annotator instruction (focus on concrete objects; no speculation). | 128 | 1 | mean≈50, narrated/spoken style | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
detailcaps_4870 (CAPTURE, 2024) | "Describe the image in detail." (canonical CAPTURE prompt) | 512 | 3 (GPT-4O / GPT-4V / Gemini1.5Pro) | mean≈120, dense visual facts | BLEU-1/4, METEOR, ROUGE-L, CIDEr (subsample n=300)
sharegpt4v_coco_eval (ShareGPT4V, 2023) | "Analyze the image in a comprehensive and detailed manner." (Share-Captioner training prompt) | 512 | 1 (GPT-4V re-caption of COCO) | mean≈180, GPT-4V dense caption | BLEU-1/4, METEOR, ROUGE-L, CIDEr (n=500, our COCO subset on HF: Icey444/ShareGPT4V-COCO-eval)

→ The three short-reference n-gram benches (COCO val_lite, NoCaps val_lite, COCO Karpathy test) expect ≈ 12-tok references. When a captioner emits 20–60-tok outputs (Qwen 7B base under the brief prompt), BLEU-4 and CIDEr collapse because the candidate is several times longer than every reference, which is why 7B base's karpathy_CIDEr ≈ 0.0001. The four lower rows are long-form benches (refs 50–180 tok) that we submitted upstream to lmms-eval as PRs (branches add-docci-test, add-localized-narratives-test, add-detailcaps-4870, add-sharegpt4v-coco-eval on Irisicy4/lmms-eval).

How each metric scores against multiple references

COCO val_lite, NoCaps val_lite, and COCO Karpathy test each have multiple reference captions per image (5, 10, 5 respectively). Each metric handles the multi-reference case differently.

BLEU-1 / BLEU-4 — n-gram precision

  1. Take the candidate caption and count its n-grams: 1-grams for BLEU-1, and 1-grams through 4-grams for BLEU-4.
  2. For each candidate n-gram, clip count = min(count in candidate, max over references of its count in that reference). Multi-ref handling: you get credit if any reference contains that n-gram, capped by the most times any single reference uses it.
  3. Modified precision p_n = sum(clip_counts) / sum(candidate_counts). BLEU-1 uses p_1 alone; standard BLEU-4 is the geometric mean of p_1..p_4.
  4. Brevity penalty: if the candidate length c is shorter than the closest reference length r, multiply by exp(1 − r/c). Short outputs are penalised; long outputs are not, which is why verbose 7B captions don't trigger BP but still lose precision.
  5. Aggregate across the dataset by summing the per-sample numerators and denominators (corpus BLEU), not by averaging per-row scores.

BLEU-1 measures word-level agreement (lenient). BLEU-4 additionally requires runs of up to four consecutive words to match, so it collapses fast with verbose, paraphrased outputs.
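The BLEU steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-tokenised input and the standard cumulative definition of BLEU-n (geometric mean of the modified precisions for 1..n); the function names are hypothetical and this is not the code the leaderboard runs.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_counts(cand, refs, n):
    """Return (clipped matches, total candidate n-grams) for one sample."""
    cand_counts = ngrams(cand, n)
    max_ref = Counter()
    for ref in refs:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped, sum(cand_counts.values())


def closest_ref_len(cand, refs):
    """Reference length closest to the candidate length (shorter wins ties)."""
    return min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]


def corpus_bleu(samples, max_n=4):
    """samples: list of (candidate_tokens, [reference_tokens, ...])."""
    match = [0] * max_n
    total = [0] * max_n
    c_len = r_len = 0
    for cand, refs in samples:
        c_len += len(cand)
        r_len += closest_ref_len(cand, refs)
        for n in range(1, max_n + 1):
            m, t = clipped_counts(cand, refs, n)
            match[n - 1] += m
            total[n - 1] += t
    if c_len == 0 or min(match) == 0:
        return 0.0
    # Geometric mean of the corpus-level modified precisions.
    log_p = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    # Brevity penalty only fires when the corpus of candidates is too short.
    bp = 1.0 if c_len >= r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_p)
```

Calling `corpus_bleu(samples, max_n=1)` gives BLEU-1 and the default gives BLEU-4. Note that a candidate longer than every reference leaves `bp` at 1.0, so any drop in score comes entirely from the precision terms.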

METEOR — alignment with synonyms / stems

  1. Build a unigram alignment between candidate and each reference using exact match → Porter stems → WordNet synonyms → paraphrase tables.
  2. Compute precision P, recall R over the best alignment.
  3. F_mean = 10·P·R / (R + 9·P) (weighted harmonic mean, biased toward recall).
  4. Fragmentation penalty: group the matched words into chunks of adjacent, in-order matches. More chunks for the same number of matches → larger penalty.
  5. Multi-reference: score against each ref individually, take the max.
  6. Corpus-level: mean across samples.

METEOR is much friendlier to paraphrase than BLEU and is the steadiest single number across the leaderboard.
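A small sketch of how the METEOR pieces above combine once an alignment is known. The F_mean formula is the one stated in the list; the penalty form 0.5 · (chunks / matches)³ is an assumption taken from the original METEOR formulation, and the toolkit version actually used for the leaderboard may use differently tuned parameters. The alignment step itself (exact, stem, synonym, paraphrase matching) is not implemented here, and the function names are hypothetical.

```python
def meteor_score(matches, cand_len, ref_len, chunks, gamma=0.5, beta=3.0):
    """Score one candidate/reference pair from precomputed alignment stats.

    matches : number of aligned unigrams
    chunks  : number of contiguous, in-order groups the matches fall into
    gamma, beta : penalty weight and exponent (assumed original defaults)
    """
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)  # recall-weighted harmonic mean
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


def meteor_multi_ref(alignments):
    """alignments: one dict of meteor_score kwargs per reference; take the max."""
    return max(meteor_score(**a) for a in alignments)
```

Because recall carries nine times the weight of precision in F_mean, a long candidate that covers a whole reference scores much better than a short one that covers only half of it, which is consistent with METEOR being the most length-tolerant of the n-gram metrics here.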

ROUGE-L — longest common subsequence

  1. Find the longest common subsequence (LCS — not substring; words can be non-contiguous) between candidate and reference.
  2. Recall R = LCS / len(ref), precision P = LCS / len(cand), F-score F = (1+β²)·P·R / (R + β²·P) with β=1.2.
  3. Multi-reference: F-score against each ref, take the max.
  4. Corpus-level: mean of per-sample max-F.

ROUGE-L cares about word order without requiring contiguity. Modest sensitivity to length.
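The ROUGE-L recipe above, sketched directly: an LCS by dynamic programming, then the β = 1.2 F-score per reference and a max over references. This follows the description in the list; some implementations instead combine the best precision and best recall across references before computing F, so exact numbers may differ slightly. Function names are hypothetical.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(cand, refs, beta=1.2):
    """Max over references of the LCS-based F-score."""
    best = 0.0
    for ref in refs:
        lcs = lcs_len(cand, ref)
        if lcs == 0:
            continue
        p = lcs / len(cand)
        r = lcs / len(ref)
        f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
        best = max(best, f)
    return best
```

The corpus-level number is then the plain mean of `rouge_l` over all samples.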

CIDEr — TF-IDF n-gram consensus (the one that crushed Qwen 7B)

CIDEr uses all references jointly, not max:

  1. Tokenise + Porter-stem candidate and all references.
  2. For n = 1..4, compute TF-IDF weights, where IDF is computed across the entire corpus's reference captions — n-grams that appear in many refs across the dataset are downweighted.
  3. Per sample, for each n: build a TF-IDF vector for the candidate; build TF-IDF vectors for each of the M references; compute cosine(cand_vec, ref_vec) for each ref, then average over refs.
  4. CIDEr-n = mean across samples.
  5. Final CIDEr = mean of CIDEr-1..CIDEr-4 × 10 (paper convention — some implementations omit ×10).

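Why CIDEr is brutal on verbose outputs: the candidate's TF-IDF vector is normalised by its own L2 norm, so every extra n-gram that no reference contains inflates the denominator of the cosine without adding to the dot product. A long, descriptive caption spreads its weight across many n-grams, and its normalised overlap with any short reference vector becomes vanishingly small. CIDEr-D style implementations additionally apply a Gaussian penalty on the candidate/reference length difference, which would compound the effect if that variant is the one in use. The 64-tok-cap Karpathy run on Qwen 7B base hit CIDEr ≈ 0.0001 not because the captions were wrong, but because they were so much longer and more elaborate than the ≈ 12-tok GT references that the cosine similarity vanished.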
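The length effect can be reproduced with a stripped-down CIDEr. The sketch below is a simplification under stated assumptions: whitespace tokens, no stemming, no count clipping and no Gaussian length penalty, with IDF taken as log(N / df) over the reference sets of N images. All names and the toy captions are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def doc_freq(corpus_refs, n):
    """For each n-gram, the number of images whose references contain it."""
    df = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n).keys())
        df.update(seen)
    return df


def tfidf(tokens, n, df, num_images):
    # Unseen n-grams get df = 1, i.e. the maximum IDF.
    return {g: tf * math.log(num_images / max(1, df[g]))
            for g, tf in ngram_counts(tokens, n).items()}


def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def cider_sample(cand, refs, corpus_refs, max_n=4):
    """10 x mean over n of the reference-averaged TF-IDF cosine."""
    num_images = len(corpus_refs)
    total = 0.0
    for n in range(1, max_n + 1):
        df = doc_freq(corpus_refs, n)
        c_vec = tfidf(cand, n, df, num_images)
        total += sum(cosine(c_vec, tfidf(r, n, df, num_images))
                     for r in refs) / len(refs)
    return 10.0 * total / max_n


corpus_refs = [
    ["a dog runs on the beach".split(), "a dog running along the shore".split()],
    ["a man rides a red bike".split(), "a cyclist on a city street".split()],
]
short = "a dog runs on the beach".split()
verbose = ("a dog runs on the beach while gentle waves roll in under a bright "
           "cloudless sky and distant gulls circle above the dunes").split()
print(cider_sample(short, corpus_refs[0], corpus_refs))
print(cider_sample(verbose, corpus_refs[0], corpus_refs))
```

The verbose caption contains the short one verbatim, yet scores lower, because the added n-grams enlarge its norm without matching anything in the references. The corpus-level score is the mean of `cider_sample` over all images.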

CaptionQA — judge-decided correct-letter rate

  1. For each (image, question, choices) tuple, build a prompt: You are given a caption of an image and a multiple-choice question… Caption: {caption} Question: {q} Options: A. … B. … C. … D. … Answer:
  2. Send to the judge LLM (qwen3-30b-instruct in our setup; Qwen2.5-72B-Instruct in the paper). Judge returns A/B/C/D/Cannot answer.
  3. Per question: 1 if matches GT letter, 0 otherwise. Cannot answer counts as 0 for captionqa_score and gets tracked separately as captionqa_cannot_answer_rate.
  4. Aggregate: captionqa_score = mean(per_question_scores) as a flat mean over all questions in the split.

CaptionQA has no "multiple references" — each question has one correct letter. It's a utility-of-caption measure, sidestepping the n-gram-overlap pathology.
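The aggregation in steps 3 and 4 amounts to two flat means over the questions of a split. A minimal sketch, assuming the judge output has already been parsed into one of "A", "B", "C", "D" or the literal string "Cannot answer" (that exact string and the function name are assumptions, not the harness's actual parsing):

```python
def captionqa_metrics(judge_answers, gt_letters):
    """Flat mean over all questions; 'Cannot answer' always scores 0."""
    if len(judge_answers) != len(gt_letters):
        raise ValueError("one judge answer is required per question")
    n = len(gt_letters)
    correct = sum(a == g for a, g in zip(judge_answers, gt_letters))
    cannot = sum(a == "Cannot answer" for a in judge_answers)
    return {
        "captionqa_score": correct / n,
        "captionqa_cannot_answer_rate": cannot / n,
    }
```

Tracking the cannot-answer rate separately helps distinguish a caption that omits the needed detail from one that states something wrong.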

WinVsBase — GPT-5.4 pairwise arena vs matching-size base

  1. For each sample we feed GPT-5.4 the image plus two unlabeled captions, A and B, from the two captioners being compared. A/B order is randomized per sample (seeded RNG) to neutralise position bias; verdicts are un-swapped afterwards so that slot A always denotes the trained model.
  2. Judge system prompt: You are an expert evaluator of image captions. You will see an image and two candidate captions A and B describing it. Pick which caption is overall better (more accurate, more complete, more useful to a reader who has not seen the image). Respond with exactly one of the JSON objects: {"winner":"A"} or {"winner":"B"} or {"winner":"TIE"}.
  3. User message: the image as a base-64 data URL plus Caption A: <trained model caption> Caption B: <base model caption> Which caption is better? (Slots are shown here in canonical order, before the per-sample swap.)
  4. Judge model: Azure gpt-5.4-2026-03-05, temperature=0, max_tokens=32, concurrency=16 per arena run. Runs are serialised — stacking 5 in parallel (= 80 concurrent requests) hit ~80 % Azure rate-limit errors.
  5. Score: WinVsBase = 2 · a_win_rate, so 1.0 = tied with base, > 1 = beats base, < 1 = loses. Ties counted in the denominator; errors discarded. Base models display 1.0000 by convention.

Sample size per cell = 300 common doc_ids (the intersection of the A and B samples_*.jsonl files), shuffled with the same seed across all pairs for comparability. Pairings always use the base model of the matching size class: 3B model vs Qwen2.5-VL-3B-Instruct, 7B vs 7B.
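The swap, un-swap and scoring logic can be sketched as follows. This is an illustration under assumptions, not the arena script: the function names are hypothetical, the swap is taken to be an independent 50/50 draw from a seeded `random.Random`, and a judge error is represented as `None`.

```python
import random


def assign_slots(doc_ids, seed=0):
    """Decide per sample whether the trained caption is shown in slot B."""
    rng = random.Random(seed)
    return {d: rng.random() < 0.5 for d in doc_ids}  # True = swapped


def unswap(winner, swapped):
    """Map the judge's verdict back so that 'A' always means the trained model."""
    if winner == "TIE" or not swapped:
        return winner
    return "B" if winner == "A" else "A"


def win_vs_base(verdicts, swapped):
    """verdicts: doc_id -> 'A' / 'B' / 'TIE', or None for a judge error.

    Errors are discarded; ties stay in the denominator.
    """
    valid = [(d, v) for d, v in verdicts.items() if v in ("A", "B", "TIE")]
    if not valid:
        return float("nan")
    wins = sum(unswap(v, swapped[d]) == "A" for d, v in valid)
    return 2 * wins / len(valid)
```

With ties kept in the denominator, a run in which the judge declares many ties pulls both models' scores below 1.0, so the 1.0 reference point is exact only when ties are rare.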