
Artificial intelligence & robotics

Saining Xie

Developing next-generation infrastructure for visual understanding and generation.

Year Honored
2023

Organization
New York University

Region
China

Hails From
China
Just as the Cambrian Explosion was inseparable from the evolution of visual organs, the future of artificial intelligence depends on the development of vision and perception. Over the past decade, Saining Xie has focused on frontier research in visual intelligence. Together with his research interns, he developed ConvNeXt, a next-generation convolutional neural network, and the Diffusion Transformer (DiT).

ConvNeXt is a pure convolutional design: it uses no attention mechanism, adopts a staged structure, and matches the Transformer architecture in accuracy, scalability, and robustness. This result demonstrates that even as Transformers dominate visual recognition tasks, well-designed convolutional neural networks remain highly competitive.
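
The heart of the design is the ConvNeXt block. Below is a minimal PyTorch sketch of that block, simplified from the published design (layer scale and stochastic depth are omitted here); it is an illustration of the idea, not the reference implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt block sketch: depthwise 7x7 convolution,
    LayerNorm, a pointwise MLP with GELU, and a residual connection.
    Layer scale and stochastic depth from the full model are omitted."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)          # (N, C, H, W)
        x = x.permute(0, 2, 3, 1)   # (N, H, W, C) so LayerNorm/Linear act on C
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)   # back to (N, C, H, W)
        return residual + x
```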

DiT is an innovative image generation framework that applies the Transformer architecture to diffusion models, improving both the quality and the efficiency of image generation.
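
Conceptually, DiT splits the image (or its latent) into patch tokens, conditions them on the diffusion timestep, and processes them with Transformer blocks instead of the usual U-Net. The sketch below illustrates that flow under simplifying assumptions: a vanilla PyTorch Transformer encoder stands in for DiT's blocks (the real model conditions via adaptive LayerNorm), and all sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Illustrative sketch of the DiT idea: patchify a latent image into
    tokens, add a timestep embedding, run Transformer blocks, and predict
    per-patch noise. A vanilla encoder stands in for DiT's adaLN blocks."""
    def __init__(self, in_ch=4, patch=2, dim=256, depth=4, heads=4, size=32):
        super().__init__()
        n_patches = (size // patch) ** 2
        self.patchify = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch * in_ch)  # noise per patch

    def forward(self, x, t):
        tokens = self.patchify(x).flatten(2).transpose(1, 2) + self.pos
        tokens = tokens + self.t_embed(t[:, None].float())[:, None, :]  # broadcast over tokens
        return self.head(self.blocks(tokens))  # (N, n_patches, patch*patch*in_ch)

noise_pred = TinyDiT()(torch.randn(2, 4, 32, 32), torch.tensor([10, 500]))
```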

Recently, Saining co-proposed the Scalable Interpolant Transformer (SiT). Built on the DiT backbone, SiT connects two distributions in a more flexible way than the standard diffusion formulation. On the ImageNet 256x256 benchmark, with exactly the same backbone, parameter count, and floating-point operations (FLOPs), SiT outperforms DiT across the board.
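
The flexibility comes from the interpolant view: rather than committing to a fixed forward noising process, the model defines a family of paths between noise and data and learns to follow them. A sketch of this idea, with notation treated as approximate rather than taken verbatim from the paper:

```latex
% x_* is a data sample and \varepsilon \sim \mathcal{N}(0, I); the schedules
% \alpha_t, \sigma_t are design choices, e.g. linear: \alpha_t = 1 - t, \sigma_t = t.
x_t = \alpha_t x_* + \sigma_t \varepsilon,
\qquad \alpha_0 = \sigma_1 = 1, \quad \alpha_1 = \sigma_0 = 0.
% The network is trained to match the velocity of the interpolant:
v_\theta(x, t) \approx
\mathbb{E}\!\left[\dot\alpha_t x_* + \dot\sigma_t \varepsilon \,\middle|\, x_t = x\right].
```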

In 2023, Saining moved to academia as an assistant professor at New York University, hoping to solve practical problems through innovative research and to explore the boundaries of visual intelligence. He is also researching more efficient training and deployment methods for multimodal models, as well as the safety, ethics, and privacy issues raised by AI deployment, to help ensure that future AI systems benefit all of humanity. A representative achievement in this area is Cambrian-1, a vision-centric multimodal large model developed in collaboration with Turing Award winner Yann LeCun and others. The model's weights, code, and datasets have been fully open-sourced.