Baifeng Shi
I am a Ph.D. student advised by Prof. Trevor Darrell at UC Berkeley. Previously, I graduated from Peking University with a B.S. degree in computer science.
I build generalist models for vision and robotics.
Email / Google Scholar / GitHub / CV / WeChat
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu*, Ligeng Zhu*, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu (Danny) Yin^, Song Han^, Yao Lu*^
preprint, 2024
abstract / website / demo / pdf / code / models
NVILA is a family of open VLMs designed to optimize both efficiency and accuracy for video and multi-image understanding. Building on top of VILA, we improve the model architecture by first scaling up the spatial and temporal resolutions and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to process high-resolution images and long videos efficiently. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks, while reducing training costs by 4.5×, fine-tuning memory usage by 3.4×, pre-filling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×. We make our code and models available to facilitate reproducibility.
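As a rough illustration, here is a minimal sketch of the "scale-then-compress" idea in PyTorch; the pooling-based compression and all function names are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def scale_then_compress(frames, vision_encoder, scale=2, pool=4):
    """Scale-then-compress, sketched: encode at higher resolution
    (more visual tokens in), then shrink the token count before the
    tokens reach the LLM. Mean pooling stands in for the paper's
    learned compression; all names here are illustrative."""
    # Scale up: raise spatial resolution before encoding.
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear")
    tokens = vision_encoder(frames)                # (B, N, D), N large

    # Compress: merge every `pool` neighboring tokens into one,
    # so downstream attention cost drops by roughly `pool`x.
    B, N, D = tokens.shape
    tokens = tokens[:, : N - N % pool]             # make N divisible by pool
    tokens = tokens.reshape(B, -1, pool, D).mean(dim=2)
    return tokens                                  # (B, ~N // pool, D)
```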
When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell
ECCV, 2024
abstract / pdf / code
We find that smaller vision models (e.g., ViT-B or ViT-L) run at larger image scales usually outperform larger models (e.g., ViT-H or ViT-G) and can also learn similar representations to those of larger models.
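A minimal sketch of the multi-scale recipe, assuming a small ViT that returns one pooled feature vector per image; the crop splitting and merging here are simplified placeholders for the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def multiscale_features(image, small_vit, base=224, scales=(1, 2)):
    """Run one small ViT at several image scales. Larger scales are
    split into base-sized crops, each crop is encoded by the same
    model, and per-scale features are averaged over crops and
    concatenated channel-wise."""
    B, C = image.shape[:2]
    feats = []
    for s in scales:
        x = F.interpolate(image, size=(s * base, s * base), mode="bilinear")
        # Split the upscaled image into an s x s grid of base-sized crops.
        crops = x.unfold(2, base, base).unfold(3, base, base)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(B * s * s, C, base, base)
        f = small_vit(crops).reshape(B, s * s, -1)  # (B, s*s, D)
        feats.append(f.mean(dim=1))                 # merge crop features
    return torch.cat(feats, dim=-1)                 # (B, D * len(scales))
```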
Humanoid Locomotion as Next Token Prediction
Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
NeurIPS, 2024
Spotlight
abstract / pdf / website
We formulate humanoid locomotion as a next token prediction problem. This enables learning to walk from in-the-wild data such as YouTube videos.
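A minimal sketch of the formulation: a causal transformer trained to predict the next token of a sensorimotor trajectory. The dimensions, tokenization, and regression loss are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorimotorGPT(nn.Module):
    """Causal transformer over a trajectory of interleaved
    observation and action tokens; predicts the token at step t+1
    from all tokens up to step t."""
    def __init__(self, token_dim=64, dim=512, depth=8, heads=8, ctx=256):
        super().__init__()
        self.embed = nn.Linear(token_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, ctx, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, token_dim)

    def forward(self, tokens):                      # (B, T, token_dim)
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos[:, :T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        return self.head(self.blocks(x, mask=causal))

# One training step on a trajectory `traj` of shape (B, T, token_dim):
# loss = F.mse_loss(model(traj[:, :-1]), traj[:, 1:])
```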
Robot Learning with Sensorimotor Pre-training
Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell*, Jitendra Malik*
CoRL, 2023
Oral Presentation
abstract / pdf / website
We make imitation learning easier by MAE-style pre-training on sensorimotor sequences.
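A minimal sketch of the pre-training objective, assuming tokens are already extracted into a (B, T, D) sequence; masking-by-zeroing here is a simplification of MAE's dropped-token encoder:

```python
import torch
import torch.nn.functional as F

def masked_pretrain_step(model, seq, mask_ratio=0.75):
    """MAE-style pre-training on sensorimotor sequences: randomly
    mask most tokens (vision, proprioception, actions) and train
    the model to reconstruct them. `model` maps a (B, T, D)
    sequence to a (B, T, D) reconstruction."""
    B, T, _ = seq.shape
    mask = torch.rand(B, T, device=seq.device) < mask_ratio  # True = masked
    pred = model(seq.masked_fill(mask.unsqueeze(-1), 0.0))
    # As in MAE, the loss is computed only on the masked positions.
    return F.mse_loss(pred[mask], seq[mask])
```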
Top-Down Visual Attention from Analysis by Synthesis
Baifeng Shi, Trevor Darrell, Xin Wang
CVPR, 2023
Conference Highlight
website / abstract / pdf / code / Zhihu
We build ViTs with top-down attention, i.e., the ability to steer attention to specific objects when given a prompt.
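A minimal sketch of prompt-conditioned attention, simplifying the paper's analysis-by-synthesis formulation; the reweighting rule and all weights are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def top_down_attention(tokens, prompt, w_q, w_k, w_v):
    """Top-down attention, sketched: a prompt vector reweights the
    tokens before self-attention, so attention concentrates on
    prompt-relevant objects rather than on purely salient ones."""
    d = tokens.shape[-1]
    # Top-down signal: how well each token matches the prompt.
    relevance = F.softmax(tokens @ prompt.unsqueeze(-1) / d ** 0.5, dim=1)
    tokens = tokens * relevance                    # (B, N, D) * (B, N, 1)

    # Standard self-attention over the reweighted tokens.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```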
[Jun 2024] Scaling Up Visual Pre-Training: What’s Next?, AI Tea Talk Singapore
[Apr 2024] Scaling Up Visual Pre-Training: What’s Next?, VGG group, University of Oxford [slides]
[Mar 2024] Scaling Up Visual Pre-Training: What’s Next?, Prof. Yi Ma's group, UC Berkeley
[Oct 2023] Principles and Applications of Bottom-Up and Top-Down Visual Attention, Peking University [slides]
[Jun 2023] Principles and Applications of Bottom-Up and Top-Down Visual Attention, TechBeat