Chunyuan Li

I am a principal researcher at Microsoft Research, Redmond. My recent research focuses on large-scale pre-training in computer vision and natural language processing. Some recent works include:

  • Building large multimodal models that follow human intents [1]
  • Vision-and-language pre-training [1, 2, 3]
  • Deep generative models at scale [1, 2, 3, 4]

I obtained my PhD in machine learning at Duke University, advised by Prof. Lawrence Carin. My PhD research focused on probabilistic deep learning. I have served as an Area Chair for NeurIPS, ICML, ICLR, EMNLP & AAAI, and as a Guest Editor of IJCV on "the promises and dangers of large vision models".


news

Oct/Nov, 2023 LLaVA is upgraded:
  • LLaVA-1.5 achieves SoTA on 11 benchmarks among open-source LMMs. It uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses prior SoTA methods that use billion-scale data. [Project] [Paper] [Github] [Demo] [Model Zoo]
  • LLaVA-Interactive: Experience the future of human-AI multimodal interaction with an all-in-one demo for image chat, segmentation, generation and editing. [Project] [Paper] [Github] [Demo]
  • LLaVA-Plus expands the capabilities of LLaVA by learning to use external tools for creating multimodal agents. [Project] [Paper] [Github] [Demo]
September 20, 2023 A 110-page paper is released to share our perspective on LMMs: "Multimodal Foundation Models: From Specialists to General-Purpose Assistants". This is based on our CVPR 2023 Tutorial. [Note on Large Multimodal Models] [Slides] [YouTube] [Bilibili]
June 1, 2023 LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)
April 17, 2023 Visual Instruction Tuning with GPT-4! We release LLaVA, a Large Language-and-Vision Assistant towards multimodal GPT-4 level capabilities. NeurIPS 2023 (Oral Presentation) [Project] [Paper] [Github] [Demo] [Data] [Model] [Scaling Note]
April 7, 2023 Instruction Tuning with GPT-4! A "first attempt" to use GPT-4 data for LLM self-instruct tuning. [Paper] [Github] [My Learnings]
March, 2023 CVPR 2023:
  • REACT improves foundation models on various vision tasks by customizing them with retrieval-augmented multimodal knowledge. [Code] (Highlight, 2.5%)
  • GLIGEN enables a new capability for frozen text-to-image generation models: open-set grounding. [Demo] [Code] [YouTube]
  • X-Decoder: a generalist decoder for pixel, image and language [Demo] [Code]
Feb, 2023 The 2nd Workshop and Challenge on Computer Vision in the Wild (CVinW) at CVPR 2023. For those who are new to this topic, please check out the CVinW Reading List. [Workshop] [SGinW Challenge] [RF100 Challenge]
Oct 23, 2022 The 1st Workshop and Challenge on Computer Vision in the Wild (CVinW) at ECCV 2022. Please check out the videos of this event on [YouTube] and [BiliBili]. [Workshop] [ICinW Challenge] [ODinW Challenge]
Oct 17, 2022 "Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends", A 100-page survey paper in Foundations and Trends® in Computer Graphics and Vision
Sep 16, 2022 NeurIPS 2022: K-LITE (Oral, 1%), ELEVATER, and FocalNet, a team effort to push CVinW. [CVPR Tutorial]
  • K-LITE demonstrates the effectiveness of external knowledge to improve language-image models in zero-/few-shot task transfer
  • ELEVATER is a platform with 20 image classification and 35 object detection public datasets for evaluating language-image models in task-level visual transfer. [Benchmark Website]
  • FocalNet achieves SoTA on COCO object detection with a simple attention-free architecture. [paper, code, demo, blog]
Mar 25, 2022 Upcoming events as a co-organizer:
Mar 1, 2022 CVPR 2022:
June 17, 2021 EsViT achieves SoTA 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with an order of magnitude higher throughput. [GitHub]

recent publications

  1. K-LITE
    K-LITE: Learning Transferable Visual Models with External Knowledge
    Shen, Sheng*, Li, Chunyuan*, Hu, Xiaowei*, Xie, Yujia, Yang, Jianwei, Zhang, Pengchuan, Rohrbach, Anna, Gan, Zhe, Wang, Lijuan, Yuan, Lu, Liu, Ce, Keutzer, Kurt, Darrell, Trevor, and Gao, Jianfeng
    NeurIPS 2022
  2. ELEVATER
    ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
    Li, Chunyuan*, Liu, Haotian*, Li, Liunian Harold, Zhang, Pengchuan, Aneja, Jyoti, Yang, Jianwei, Jin, Ping, Lee, Yong Jae, Hu, Houdong, Liu, Zicheng, and Gao, Jianfeng
    NeurIPS (Datasets and Benchmarks Track) 2022
  3. UniCL
    Unified Contrastive Learning in Image-Text-Label Space
    Yang, Jianwei*, Li, Chunyuan*, Zhang, Pengchuan*, Xiao, Bin*, Liu, Ce, Yuan, Lu, and Gao, Jianfeng
    CVPR 2022