CLIP vs SigLIP Experiment

Client: AI | Published: 23.04.2026

I have a clean, well-documented codebase that already wires the SmallCap captioning architecture to three retrieval encoders: CLIP, SigLIP, and SigLIP2. All hyperparameters, data splits, and preprocessing steps for MS-COCO 2017 are locked down. Your task is to finish the implementation where noted, run the comparative training and inference cycle, and return a fully reproducible set of results. Evaluation must cover CIDEr, METEOR, BLEU, and SPICE, with CIDEr called out as the primary figure of merit in the final table.

I will need:

• the updated training / evaluation scripts
• raw and aggregated metric outputs (JSON or CSV) for each encoder checkpoint
• a concise report or notebook that plots the head-to-head scores and explains any anomalies
• a README detailing exact commands, environment specs, and seeds so I, or any reviewer, can replicate the runs end-to-end on MS-COCO 2017

Everything should slot straight back into the existing repo (PyTorch-based) without refactoring. No new model exploration is required; just faithful execution of the defined plan and neat packaging of the evidence that it worked.
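Since the brief requires seed-for-seed reproducibility across the CLIP/SigLIP/SigLIP2 runs, a single seeding helper called at the top of every training and inference script keeps the runs comparable. The sketch below is an illustrative helper (the function name `set_seed` is my choice, not from the repo); it seeds Python's RNG unconditionally and NumPy/PyTorch only if they are installed, so it also works in a minimal environment:

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed every RNG the pipeline touches so runs are repeatable.

    NumPy and PyTorch seeding is attempted only if those libraries are
    importable, so the helper is safe in a stripped-down environment.
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade a little speed for bit-exact cuDNN convolution results.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

# Re-seeding must reproduce the exact same draws.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
assert first == second
```

Recording the seed value itself in the README (as the brief asks) is what lets a reviewer pass the same integer back into this helper.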
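For the "raw and aggregated metric outputs (JSON or CSV)" deliverable, one small stdlib-only aggregator can collapse per-seed scores into a single table per encoder, sorted by CIDEr since that is the primary figure of merit. This is a sketch under my own assumptions about the raw format (a dict from encoder name to a list of per-seed metric dicts); the repo's actual score files may be shaped differently:

```python
import csv
import json
import statistics
from pathlib import Path

# Metric names are assumed; BLEU-4 stands in for whichever BLEU variant
# the repo's evaluation scripts emit.
METRICS = ["CIDEr", "METEOR", "BLEU-4", "SPICE"]

def aggregate_runs(raw_scores: dict) -> list:
    """Collapse per-seed metric dicts into one mean/std row per encoder.

    `raw_scores` maps an encoder name ("clip", "siglip", "siglip2")
    to a list of metric dicts, one per training seed.
    """
    rows = []
    for encoder, runs in raw_scores.items():
        row = {"encoder": encoder}
        for m in METRICS:
            vals = [r[m] for r in runs]
            row[m] = round(statistics.mean(vals), 2)
            row[f"{m}_std"] = round(statistics.pstdev(vals), 2)
        rows.append(row)
    # CIDEr is the primary figure of merit: best encoder first.
    rows.sort(key=lambda r: r["CIDEr"], reverse=True)
    return rows

def write_outputs(rows: list, out_dir: Path) -> None:
    """Emit the aggregated table in both requested formats."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "aggregated.json").write_text(json.dumps(rows, indent=2))
    with open(out_dir / "aggregated.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A dummy call such as `aggregate_runs({"clip": [per_seed_dict, ...], "siglip": [...]})` yields rows ready for both the CSV and the head-to-head plot; the numbers fed in would of course come from the actual evaluation runs, not from this sketch.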