Baifeng Shi

I am a Ph.D. student advised by Prof. Trevor Darrell at UC Berkeley. Previously, I graduated from Peking University with a B.S. degree in computer science.

I build generalist vision and robotics models.

Email  /  Google Scholar  /  Github  /  CV  /  WeChat

Selected Publications
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu*, Ligeng Zhu*, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu (Danny) Yin^, Song Han^, Yao Lu*^
preprint, 2024
abstract / website / demo / pdf / code / models

NVILA is a family of open VLMs designed to optimize both efficiency and accuracy for video understanding and multi-image understanding. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to process high-resolution images and long videos efficiently. We also conduct a systematic investigation to enhance NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks, while reducing training costs by 4.5×, fine-tuning memory usage by 3.4×, pre-filling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×. We make our code and models available to facilitate reproducibility.

When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell
ECCV, 2024
abstract / pdf / code

We find that smaller vision models (e.g., ViT-B or ViT-L) run at larger image scales usually outperform larger models (e.g., ViT-H or ViT-G), and can also learn representations similar to those of larger models.

Humanoid Locomotion as Next Token Prediction
Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
NeurIPS, 2024
Spotlight
abstract / pdf / website

We formulate humanoid locomotion as a next token prediction problem. This enables learning to walk from in-the-wild data such as YouTube videos.

Robot Learning with Sensorimotor Pre-training
Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell*, Jitendra Malik*
CoRL, 2023
Oral Presentation
abstract / pdf / website

We make imitation learning easier via masked autoencoder (MAE) pre-training on sensorimotor sequences.

Top-Down Visual Attention from Analysis by Synthesis
Baifeng Shi, Trevor Darrell, Xin Wang
CVPR, 2023
Highlight
website / abstract / pdf / code / Zhihu

We build ViTs with top-down attention, i.e., the ability to steer their attention toward specific objects when given a prompt.

Invited Talks

[Jun 2024]   Scaling Up Visual Pre-Training: What’s Next?, AI Tea Talk Singapore

[Apr 2024]   Scaling Up Visual Pre-Training: What’s Next?, VGG group, University of Oxford   [slides]

[Mar 2024]   Scaling Up Visual Pre-Training: What’s Next?, Prof. Yi Ma's group, UC Berkeley

[Oct 2023]   Principles and Applications of Bottom-Up and Top-Down Visual Attention, Peking University   [slides]

[Jun 2023]   Principles and Applications of Bottom-Up and Top-Down Visual Attention, TechBeat