I am broadly interested in multimodal foundation models and spatial intelligence. Currently, I focus on exploring effective multimodal reasoning chains and building models that can reason about dynamics in 2D/3D environments. Reach out if you want to discuss research projects, collaborations, or anything else!
We propose Mirage, which interleaves latent visual tokens, compact representations of imagined visual features, with explicit text tokens to solve diverse multimodal reasoning tasks, boosting reasoning performance without full pixel-level image generation.
We introduce VCA, a curiosity-driven video agent with self-exploration capabilities that autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences.
We propose a simple yet effective MLLM for language-instructed video segmentation. It emphasizes global-local video understanding and achieves SOTA performance on multiple benchmarks.
We show that synthetic videos and natural images can replace real videos for pre-training, achieving comparable or better performance while offering a controllable, transparent alternative.