Tsinghua unveils Vidu, China's first Sora-level text-to-video large model, to rival OpenAI

At the 2024 ZGC Forum, Tsinghua University, together with Shengshu Technology, unveiled Vidu, China's first video large model to offer extended duration, exceptional consistency, and dynamic capabilities. Built on the team's original U-ViT architecture, which combines Diffusion and Transformer, Vidu can generate high-definition videos up to 16 seconds long at 1080P resolution.

It is reported that Vidu can not only simulate the real physical world but also render richly imaginative scenes, with multi-view generation and high spatiotemporal consistency. It is a groundbreaking video large model that, following the release of Sora, rivals the international benchmark, and it continues to improve through accelerated iteration.

During the Forum, Zhu Jun, a professor at Tsinghua University and Chief Scientist of Shengshu Technology, explained that Vidu, like Sora, can generate high-quality videos up to 16 seconds long from textual descriptions. Beyond the breakthrough in video duration, it also shows significant improvements in video quality, particularly in the simulation of the real physical world, multi-camera perspectives, high spatiotemporal consistency, and the understanding of specifically Chinese elements.

According to Zhu Jun, Vidu's swift breakthrough stems from the team's long-term accumulation of expertise and its multiple original achievements in Bayesian machine learning and multimodal large models. The core U-ViT architecture, proposed and independently developed by the team in September 2022, predates DiT, the architecture adopted by Sora, and was the world's first architecture to integrate Diffusion and Transformer.

Within just two months of Sora's release in February 2024, the team drew on its in-depth understanding of the U-ViT architecture and its extensive experience in engineering and data to further advance key technologies in long-video representation and processing, and then developed and launched the Vidu video large model with improved video coherence and dynamism.

"The name Vidu not only sounds like 'Video' but also carries the meaning of 'We do,'" said Professor Zhu Jun. He described the model's breakthrough as a multidimensional, cross-domain process that requires deep integration of technology and industrial applications. He also expressed his hope to strengthen cooperation with upstream and downstream companies in the industry chain, as well as with other research institutions, to jointly promote the development of video large models.

After Vidu's release, Professor Zhu Jun posted on WeChat Moments: "Vidu, we do, we did, we do together! Thanks to the relentless dedication of my partners, the laboratory has blossomed with fruitful results."

Behind Vidu is Shengshu Technology, a high-profile startup that originated at Tsinghua University.

According to publicly available information, Shengshu Technology was established in March 2023, with core members drawn from Tsinghua University's Institute for Artificial Intelligence. It is dedicated to independently developing world-leading, controllable, general-purpose multimodal large models. The CEO, Tang Jiayu, holds both bachelor's and master's degrees from Tsinghua University's Department of Computer Science and Technology, while the Chief Scientist, Zhu Jun, is the Deputy Dean of Tsinghua's Institute for Artificial Intelligence. The CTO, Bao Fan, is a doctoral student in Tsinghua's Department of Computer Science and Technology and a member of Professor Zhu Jun's research group, with a long-term focus on diffusion model research.

Source: Yangcheng Evening News