Generalizable Geometric Image Caption Synthesis
Under review at NeurIPS Datasets and Benchmarks Track, 2025
This work proposes Geo-Image-Textualization, a reinforcement learning-based framework for generating semantically aligned geometry image-caption pairs. We also construct GeoReasoning-10K, the first dataset with full modality equivalence for geometric reasoning, which improves the cross-modal alignment of multimodal large language models (MLLMs).
Key Contributions
- Developed a novel RL-based framework for geometry-text alignment
- Created GeoReasoning-10K dataset with full modality equivalence
- Demonstrated significant improvements in Qwen2.5-VL performance across geometric, arithmetic, algebraic, and numeric domains
Status: Under Review at NeurIPS 2025 Datasets and Benchmarks Track
Recommended citation: Wenyuan Wang*, Yue Xin*, Rui Pan*, BingXu Meng*, Renjie Pi, Tong Zhang. "Generalizable Geometric Image Caption Synthesis." Submitted to NeurIPS Datasets and Benchmarks Track.
Download Paper | Download Slides | Download BibTeX