I obtained my Bachelor's degree from Zhejiang University (ZJU), where I conducted research under the supervision of Professors Kaiwei Wang, Guofeng Zhang, and Zhihai Xu.
During my undergraduate studies, I started at SenseTime, working on computer vision with an emphasis on 2D/3D optical flow prediction.
Then, I joined the Joint Algorithm Department of Rockchip & ZJU, focusing on low-level vision and face-related algorithms, and led the open-sourcing of RKNN (★ ~3k).
At Baidu, I worked in the Autonomous Driving Foundation Model Department (ADFM), where I oversaw the world model and concurrently served as head of ADFM research, with work spanning 3D scene understanding, vision-language models, implicit rendering, 4D generation, and world models.
I am currently with X Square Robot, where I lead research on the world unified model and continue to explore the intersection of generative models, large language models, and embodied intelligence. Please feel free to reach out if you're interested in joining.
TIE: Time Interval Encoding for Video Generation over Events
Zhilei Shu*, Shangwen Zhu*, Zihang Liang, Xiaofan Li, Qianyu Peng, Xinyu Cui, Bo Ye, Yiming Li, Fan Cheng, Jian Zhao, Yang Cao, Zheng-Jun Zha†, Ruili Feng
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
James Wang, Primo Pu, Zephyr Fung, Alex Wang, Sam Wang, Bender Deng, Kevin Wang, Zivid Liu, Chris Pan, Panda Yang, Andy Zhai, Lucy Liang, Shalfun Li, Johnny Sun, Jacky Xu, Will Tian, Kai Yan, Kohler Ye, Scott Li, Qian Wang, Roy Gan, Hao Wang
arXiv preprint arXiv:2604.13001, 2026
[paper]
[code]
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
arXiv preprint arXiv:2603.19235, 2026
[paper]
[code]
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
YuAn Wang*, Xiaofan Li*†, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang
(* equal contribution, † Corresponding author)
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
Computer Vision and Pattern Recognition Conference (CVPR), 2026
[paper]
[code]
ARGUS: A Training-free Geometry-aware Universal Method for Fine-grained 3D Asset Generation
Mingyang Du, Dingkang Liang, Xin Zhou, Yumeng Zhang, Xiaofan Li, Kui Xia, Xiao Tan, Xiang Bai
Computer Vision and Pattern Recognition Conference (CVPR), 2026
BUGS: Universal 3D Gaussian Splatting with a Bi-directional Gaussian Growing Mechanism
Fan Duan, Yumeng Zhang, Xiaofan Li, Xiao Tan, Li Chen
IEEE Transactions on Multimedia (TMM), 2026
[paper]
Artemis: Structured Visual Reasoning for Perception Policy Learning
Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li†, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li
(† Corresponding author)
International Conference on Machine Learning (ICML), 2026
[paper]
[code]
[project page]
Benchmarking Dense and Indiscernible Object Counting with Blueberries
Weihao Bo, Jingwen Qin, Yanpeng Sun, Fei Shen, Xiaofan Li, Zechao Li
International Conference on Machine Learning (ICML), 2026
DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
Xiaofan Li, Yifu Zhang, Xiaoqing Ye
European Conference on Computer Vision (ECCV), 2024
[paper]
[code (★ 500+)]
Computer Vision and Pattern Recognition Conference (CVPR), 2025 [Highlight]
[paper]
U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
Xiaofan Li, Zhihao Xu, Chenming Wu, Zhao Yang, Yumeng Zhang, Jiang-Jiang Liu, Haibao Yu, Xiaoqing Ye, YuAn Wang, Shirui Li, Xun Sun, Ji Wan, Jun Wang
International Conference on Computer Vision (ICCV), 2025
[paper]
Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li†, Shengchuan Zhang, Liujuan Cao, Rongrong Ji†
(† equal advising)
ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), 2025
[paper]
[code]
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Dingkang Liang, Yumeng Zhang, Ji Wan, Jun Wang
ACM International Conference on Multimedia (ACM MM), 2025
[paper]
[code (★ 200+)]
Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
[paper]
[project page]
WALL-OSS: Igniting VLMs toward the Embodied Space
Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, Zach Xu
arXiv preprint arXiv:2509.11766, 2025
[paper]
[code (★ 800+)]
Towards Support-Effective Fabrication of 3D Mesh Generation with Preference Alignment
Chenming Wu*, Xiaofan Li*, Chengkai Dai
(* equal contribution)
IEEE Robotics and Automation Letters (RAL), 2025
[paper]
BevWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space
Yumeng Zhang*, Shi Gong*, Kaixin Xiong*, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang
arXiv preprint arXiv:2407.05679, 2024
[paper]
[code]
DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
Zhao Yang, Zezhong Qian, Xiaofan Li†, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, Longjun Liu†
(† equal advising)
IEEE International Conference on Robotics and Automation (ICRA), 2025
[paper]
[code]
The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey
Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li†, Xiang Bai†
(† equal advising)
International Joint Conference on Artificial Intelligence (IJCAI), 2024
[paper]
[code (★ ~2k)]
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu
International Conference on Computer Vision (ICCV), 2025 [Highlight]
[paper]
[code]
VRP-SAM: SAM with visual reference prompt
Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Xiaofan Li, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
Computer Vision and Pattern Recognition Conference (CVPR), 2024
[paper]
[code]
Exploring effective factors for improving visual in-context learning
Yanpeng Sun, Qiang Chen, Xiaofan Li, Jian Wang, Jingdong Wang, Zechao Li
IEEE Transactions on Image Processing (TIP), 2025
[paper]
From Prompts to Printable Models: Support-Effective 3D Generation via Offset Direct Preference Optimization
Chenming Wu*, Xiaofan Li*, Chengkai Dai
(* equal contribution)
IEEE Robotics and Automation Letters (RAL), 2026
[paper]
NeRF-DetS: Enhanced Adaptive Spatial-Wise Sampling and View-Wise Fusion Strategies for NeRF-Based Indoor Multi-View 3D Object Detection
Chi Huang, Xinyang Li, Yansong Qu, Changli Wu, Xiaofan Li†, Shengchuan Zhang, Liujuan Cao†
(† equal advising)
International Joint Conference on Artificial Intelligence (IJCAI), 2024
[paper]
Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities
Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang
arXiv preprint arXiv:2412.16418, 2024
[paper]
Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving
Lingyu Xiao, Jiang-Jiang Liu, Sen Yang, Xiaofan Li, Xiaoqing Ye, Wankou Yang, Jingdong Wang
IEEE International Conference on Robotics and Automation (ICRA), 2025
[paper]
[code]
UniFuture: A Unified Driving World Model for Future Generation and Perception
Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
IEEE International Conference on Robotics and Automation (ICRA), 2026
[paper]
[code]
Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample
Ze Feng, Jiang-Jiang Liu, Sen Yang, Lingyu Xiao, Xiaofan Li, Wankou Yang, Jingdong Wang
arXiv preprint arXiv:2506.03928, 2025
[paper]
Serving a Free Lunch for Fine Grained 3D Geometry via Entropy Guided Attention
Mingyang Du, Dingkang Liang, Xin Zhou, Yumeng Zhang, Xiaofan Li, Kui Xia, Xiao Tan, Xiang Bai
Under review
LoopGen: Generative Street Scene Expansion via Diffusion-Aided Outlier Repair
Xiaofan Li, Yuan Wang, Ji Wan, Jun Wang
Under review
Rethinking Autonomous Driving Planner Beyond Tweaking the Framework
Lingyu Xiao, Jiang-Jiang Liu, Xiaofan Li, Xiaoqing Ye, Wankou Yang
Under review
EIDOS: Democratizing High-Fidelity 3D Generative Models via Discriminative Prior Transfer
Xiao Luo, Xin Zhou, Mingyang Du, Tianrui Feng, Xiwu Chen, Xiaofan Li, Dingkang Liang
Under review
Talks & Presentations
[02/2025] Self-supervised vision, multimodal foundation models, and the evolving role of visual representation learning @ Baidu
[10/2024] Invited talk: DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model @ Brown Institute for Media Innovation, Stanford University (with Prof. Maneesh Agrawala)
[06/2023] Controllable multi-view video generation and self-supervised 3D motion representation learning @ Baidu
[03/2022] Diffusion-based models for visual perception and planning in autonomous driving @ Baidu
[07/2020] RKNN: A neural network operator library for inference acceleration on NPU and DSP @ Rockchip & Nvidia
[02/2020] Monocular dynamic-video face reconstruction and image signal processing @ Rockchip & Intel
[05/2019] Multi-resolution high dynamic range imaging in general-purpose computer vision @ ZJU-IPLab & ETH Zürich
[09/2017] One-shot face recognition: discriminative and generative formulations @ ZJU-3DV Group
Academic Service
Conference Reviewer: NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, MM, ICRA, etc.