Generalizable Geometric Image Caption Synthesis

Submitted to NeurIPS Datasets and Benchmarks Track (under review), 2025

This work proposes Geo-Image-Textualization, a reinforcement learning-based framework for generating semantically aligned geometry image-caption pairs. We also construct GeoReasoning-10K, the first dataset with full modality equivalence for geometric reasoning, which is intended to improve the cross-modal alignment of multimodal large language models (MLLMs).

Key Contributions

  • Developed a novel RL-based framework for geometry-text alignment
  • Created GeoReasoning-10K dataset with full modality equivalence
  • Demonstrated significant improvements in Qwen2.5-VL performance across geometric, arithmetic, algebraic, and numeric domains

Status: Under Review at NeurIPS 2025 Datasets and Benchmarks Track

Recommended citation: Wenyuan Wang*, Yue Xin*, Rui Pan*, BingXu Meng*, Renjie Pi, Tong Zhang. "Generalizable Geometric Image Caption Synthesis." Submitted to NeurIPS Datasets and Benchmarks Track.