Xiaofan Li (Shalfun Li)

I obtained my Bachelor's degree from Zhejiang University (ZJU), where I conducted research under the supervision of Professors Kaiwei Wang, Guofeng Zhang, and Zhihai Xu.

During my undergraduate studies, I started at SenseTime, working on computer vision with an emphasis on 2D/3D optical flow prediction.

Then, I joined the Joint Algorithm Department of Rockchip & ZJU, focusing on low-level vision and face-related algorithms, and led the open-sourcing of RKNN (⭐ ~3k).

At Baidu, I worked in the Autonomous Driving Foundation Model Department (ADFM), where I oversaw the world model and concurrently served as head of ADFM research, with work spanning 3D scene understanding, vision-language models, implicit rendering, 4D generation, and world models.

I am currently with X Square Robot, leading research on the world unified model, where I continue exploring the intersection of generative models, large language models, and embodied intelligence. Please feel free to reach out if you're interested in joining.

[Email] [Github] [Google Scholar] [Linkedin]

News (Recent 2 years)

  • [2026/05] We released the technical report WALL-WM: Carving World Action Modeling at the Event Joints
  • [2026/05] We released the technical report WALL-OSS-0.5
  • [2026/05] We released TIE: Time Interval Encoding for Video Generation over Events and GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
  • [2026/05] 🚀 2 papers accepted by ICML 2026. - Two as Project Lead
  • [2026/02–04] We released XRZero-G0 (technical report), When Numbers Speak, Generation Models Know Space, and BUGS.
  • [2026/03] 🚀 We released VEGA-3D, and it ranked #1 on 🤗 Hugging Face Trending.
  • [2026/02] 🚀 1 paper accepted by ICRA 2026. - Co-author
  • [2026/02] 🚀 1 paper accepted by TMM. - Co-author
  • [2026/02] 🚀 7 papers accepted by CVPR 2026. - Two as First Author & Two as Project Lead
  • [2026/02] 🚀 1 paper accepted by RAL 2025. - First Author
  • [2025/11–12] We released Artemis, FVAR, FaithFusion, ViLoMeo, Video4Edit, and Support-Effective 3D Generation.
  • [2025/06–09] We released WALL-OSS (technical report; ⭐ 800+ stars), U-ViLAR, and CoopTrack.
  • [2025/07] 🚀 1 paper accepted by ACM MM 2025. - First Author
  • [2025/06] 🚀 2 papers accepted by ICCV 2025 (Highlight). - One as First Author
  • [2025/04–06] We released Vision Remember, MPDrive, and DriVerse.
  • [2025/04] 🚀 1 paper accepted by CVPR 2025 (Highlight). - First Author
  • [2025/09] 🚀 DriVerse was officially recommended by Wan2.1 as a contribution to Wan2.1 Community Works.
  • [2025/04] 🚀 2 papers accepted by IJCAI 2025. - Project Lead
  • [2025/03] 🚀 1 paper accepted by SIGGRAPH 2025. - Project Lead
  • [2025/03] We released DualDiff+, UniFuture, and Controllable Panoramic Video Generation.
  • [2025/03] We open-sourced WanControl, a ControlNet version for Wan2.1, within one week of the Wan2.1 release.
  • [2025/02] We released The Role of World Models in Shaping Autonomous Driving (⭐ 2k+ stars).
  • [2025/01] 🚀 2 papers accepted by ICRA 2025. - One as Project Lead
  • [2024/09–12] We released Drag Your Gaussian, Descriptive Caption Enhancement with Visual Specialists, Revisiting MLLMs, and Learning Multiple Probabilistic Decisions from Latent World Model.
  • [2024/08] We launched Apollo ADFM, an autonomous driving foundation model.
  • [2024/07] 🚀 3 papers accepted by CVPR 2024. - Co-author
  • [2024/07] 🚀 1 paper accepted by ECCV 2024. - First Author
  • [2024/02–07] We released BevWorld, NeRF-DetS, VRP-SAM, and DrivingDiffusion (⭐ 500+ stars).

Publications

    TIE: Time Interval Encoding for Video Generation over Events
    Zhilei Shu*, Shangwen Zhu*, Zihang Liang, Xiaofan Li, Qianyu Peng, Xinyu Cui, Bo Ye, Yiming Li, Fan Cheng, Jian Zhao, Yang Cao, Zheng-Jun Zha†, Ruili Feng

    arXiv preprint arXiv:2605.10543, 2026
    [paper] [project page]
    GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
    Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu

    arXiv preprint arXiv:2605.20752, 2026
    [paper] [code]
    XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
    James Wang, Primo Pu, Zephyr Fung, Alex Wang, Sam Wang, Bender Deng, Kevin Wang, Zivid Liu, Chris Pan, Panda Yang, Andy Zhai, Lucy Liang, Shalfun Li, Johnny Sun, Jacky Xu, Will Tian, Kai Yan, Kohler Ye, Scott Li, Qian Wang, Roy Gan, Hao Wang
    arXiv preprint arXiv:2604.13001, 2026
    [paper] [code]
    Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
    Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
    arXiv preprint arXiv:2603.19235, 2026
    [paper] [code]
    Visual Autoregressive Modeling via Next Focus Prediction
    Xiaofan Li*, Chenming Wu*, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, Dingkang Liang
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    [paper] [project page]
    FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
    YuAn Wang*, Xiaofan Li*†, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang
    (* equal contribution, † Corresponding author)

    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    [paper] [code] [project page]
    ViLoMeo: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
    Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    [paper] [code] [project page]
    Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
    Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    [paper]
    FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
    Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Jiarui Li, Qi Lv, Yiwen Tang, Li Kang, Heng Zhou, Xianqiang Gao, Yuhang Tang, Xiaofan Li, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, Xuelong Li
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
    Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    [paper] [code]
    ARGUS: A Training‑free Geometry‑aware Universal Method for Fine-grained 3D Asset Generation
    Mingyang Du, Dingkang Liang, Xin Zhou, Yumeng Zhang, Xiaofan Li, Kui Xia, Xiao Tan, Xiang Bai
    Computer Vision and Pattern Recognition Conference (CVPR), 2026
    BUGS: Universal 3D Gaussian Splatting with a Bi-directional Gaussian Growing Mechanism
    Fan Duan, Yumeng Zhang, Xiaofan Li, Xiao Tan, Li Chen
    IEEE Transactions on Multimedia (TMM), 2026
    [paper]
    Artemis: Structured Visual Reasoning for Perception Policy Learning
    Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li†, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li
    († Corresponding author)

    International Conference on Machine Learning (ICML), 2026
    [paper] [code] [project page]
    Benchmarking Dense and Indiscernible Object Counting with Blueberries
    Weihao Bo, Jingwen Qin, Yanpeng Sun, Fei Shen, Xiaofan Li, Zechao Li
    International Conference on Machine Learning (ICML), 2026
    DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
    Xiaofan Li, Yifu Zhang, Xiaoqing Ye
    European Conference on Computer Vision (ECCV), 2024
    [paper] [code(⭐ 500+)]
    MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
    Zhiyuan Zhang*, Xiaofan Li*, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, Shuangping Huang
    (* equal contribution)

    Computer Vision and Pattern Recognition Conference (CVPR), 2025 [Highlight]
    [paper]
    U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
    Xiaofan Li, Zhihao Xu, Chenming Wu, Zhao Yang, Yumeng Zhang, Jiang-Jiang Liu, Haibao Yu, Xiaoqing Ye, YuAn Wang, Shirui Li, Xun Sun, Ji Wan, Jun Wang
    International Conference on Computer Vision (ICCV), 2025
    [paper]
    Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
    Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li†, Shengchuan Zhang, Liujuan Cao, Rongrong Ji†
    († equal advising)

    ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), 2025
    [paper] [code]
    DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
    Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Dingkang Liang, Yumeng Zhang, Ji Wan, Jun Wang
    ACM International Conference on Multimedia (ACM MM), 2025
    [paper] [code(⭐ 200+)]
    Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
    Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
    [paper] [project page]
    WALL-OSS: Igniting VLMs toward the Embodied Space
    Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, Zach Xu
    arXiv preprint arXiv:2509.11766, 2025
    [paper] [code(⭐ 800+)]
    Towards Support-Effective Fabrication of 3D Mesh Generation with Preference Alignment
    Chenming Wu*, Xiaofan Li*, Chengkai Dai
    (* equal contribution)

    IEEE Robotics and Automation Letters (RAL), 2025
    [paper]
    BevWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space
    Yumeng Zhang*, Shi Gong*, Kaixin Xiong*, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang
    arXiv preprint arXiv:2407.05679, 2024
    [paper] [code]
    DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
    Zhao Yang, Zezhong Qian, Xiaofan Li†, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, Longjun Liu†
    († equal advising)

    IEEE International Conference on Robotics and Automation (ICRA), 2025
    [paper] [code]
    The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey
    Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li†, Xiang Bai†
    († equal advising)

    International Joint Conference on Artificial Intelligence (IJCAI), 2025
    [paper] [code(⭐ ~2k)]
    CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
    Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu
    International Conference on Computer Vision (ICCV), 2025 [Highlight]
    [paper] [code]
    VRP-SAM: SAM with visual reference prompt
    Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Xiaofan Li, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
    Computer Vision and Pattern Recognition Conference (CVPR), 2024
    [paper] [code]
    Exploring effective factors for improving visual in-context learning
    Yanpeng Sun, Qiang Chen, Xiaofan Li, Jian Wang, Jingdong Wang, Zechao Li
    IEEE Transactions on Image Processing (TIP), 2025
    [paper]
    From Prompts to Printable Models: Support-Effective 3D Generation via Offset Direct Preference Optimization
    Chenming Wu*, Xiaofan Li*, Chengkai Dai
    (* equal contribution)

    IEEE Robotics and Automation Letters (RAL), 2026
    [paper]
    NeRF-DetS: Enhanced Adaptive Spatial-Wise Sampling and View-Wise Fusion Strategies for NeRF-Based Indoor Multi-View 3D Object Detection
    Chi Huang, Xinyang Li, Yansong Qu, Changli Wu, Xiaofan Li†, Shengchuan Zhang, Liujuan Cao†
    († equal advising)

    International Joint Conference on Artificial Intelligence (IJCAI), 2024
    [paper]
    Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities
    Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang
    arXiv preprint arXiv:2412.16418, 2024
    [paper]
    Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving
    Lingyu Xiao, Jiang-Jiang Liu, Sen Yang, Xiaofan Li, Xiaoqing Ye, Wankou Yang, Jingdong Wang
    IEEE International Conference on Robotics and Automation (ICRA), 2025
    [paper] [code]
    UniFuture: A Unified Driving World Model for Future Generation and Perception
    Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
    IEEE International Conference on Robotics and Automation (ICRA), 2026
    [paper] [code]
    Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample
    Ze Feng, Jiang-Jiang Liu, Sen Yang, Lingyu Xiao, Xiaofan Li, Wankou Yang, Jingdong Wang
    arXiv preprint arXiv:2506.03928, 2025
    [paper]
    Controllable Panoramic Video Generation with 360-Degree Motion Consistency
    Yuzhi Chen, Qi Zeng, Muyang Zhang, Leilei Fan, Xiaofan Li†, Changwei Wang, Rongtao Xu, Yanchao Liu, MingMing Yu, Weiliang Meng
    († Corresponding author)

    SSRN (Elsevier), 2025
    [paper]
    Serving a Free Lunch for Fine Grained 3D Geometry via Entropy Guided Attention
    Mingyang Du, Dingkang Liang, Xin Zhou, Yumeng Zhang, Xiaofan Li, Kui Xia, Xiao Tan, Xiang Bai
    Under review
    LoopGen: Generative Street Scene Expansion via Diffusion-Aided Outlier Repair
    Xiaofan Li, Yuan Wang, Ji Wan, Jun Wang
    Under review
    Rethinking Autonomous Driving Planner Beyond Tweaking the Framework
    Lingyu Xiao, Jiang-Jiang Liu, Xiaofan Li, Xiaoqing Ye, Wankou Yang
    Under review
    EIDOS: Democratizing High-Fidelity 3D Generative Models via Discriminative Prior Transfer
    Xiao Luo, Xin Zhou, Mingyang Du, Tianrui Feng, Xiwu Chen, Xiaofan Li, Dingkang Liang
    Under review

    Talks & Presentations

    • [02/2025] Self-supervised vision, multimodal foundation models, and the evolving role of visual representation learning @ Baidu
    • [10/2024] Invited talk: DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model @ Brown Institute for Media Innovation, Stanford University (with Prof. Maneesh Agrawala)
    • [06/2023] Controllable multi-view video generation and self-supervised 3D motion representation learning @ Baidu
    • [03/2022] Diffusion-based models for visual perception and planning in autonomous driving @ Baidu
    • [07/2020] RKNN: A neural network operator library for inference acceleration on NPU and DSP @ Rockchip & Nvidia
    • [02/2020] Monocular dynamic-video face reconstruction and image signal processing @ Rockchip & Intel
    • [05/2019] Multi-resolution high dynamic range imaging in general-purpose computer vision @ ZJU-IPLab & ETH Zürich
    • [09/2017] One-shot face recognition: discriminative and generative formulations @ ZJU-3DV Group

    Academic Service

    • Conference Reviewer: NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, MM, ICRA, etc.
    • Journal Reviewer: IJCV, Pattern Recognition, Neurocomputing
    © Xiaofan Li | Last update: June 2025