China is eyeing a leading position in generative AI, as evidenced by the successes of DeepSeek and Kimi k1.5. It has also made major inroads into the vision domain with OmniHuman and Goku, and has recently launched Step-Video-T2V, challenging the dominance of top text-to-video models.
As it happens, major competitors in the text-to-video field, such as Sora, Veo 2, and Movie Gen, are now facing a new contender to the throne in the form of Step-Video-T2V, developed by Shanghai-based startup Stepfun AI in partnership with Geely Automobile Group, per a report published on February 19.
Step-Video-T2V technical specifications
Specifically, the Sora-like Stepfun product is an open-source, pre-trained video generation model with 30 billion parameters. It supports direct generation of high-quality videos of up to 204 frames at 540p resolution, accepts native bilingual (Chinese and English) input, and understands cinematographic language.
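For a sense of what 30 billion parameters entails, here is a back-of-envelope sketch of the raw weight footprint at common numeric precisions; the precision choices are illustrative assumptions, not deployment figures from Stepfun.

```python
# Back-of-envelope weight footprint for a 30B-parameter model.
# Precisions listed are illustrative, not Stepfun's published setup.
PARAMS = 30e9
for precision, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{precision}: ~{gib:.0f} GiB of weights")
```

Even at half precision, the weights alone run to roughly 56 GiB, which is why open releases of this scale typically target multi-GPU inference.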
It deploys a deep-compression Variational Autoencoder, Video-VAE, designed for video generation tasks; according to the developers, it achieves 16×16 spatial and 8× temporal compression while maintaining superior video reconstruction quality.
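To make those ratios concrete, the snippet below works out the spatio-temporal grid a 204-frame clip would shrink to in latent space. The 544×992 frame size is an assumed 540p-class resolution chosen to divide evenly by 16, not a published spec.

```python
# Rough latent-shape arithmetic implied by Video-VAE's stated compression
# ratios (16x16 spatial, 8x temporal). Frame size is an assumption.
FRAMES, HEIGHT, WIDTH = 204, 544, 992
T_RATIO, S_RATIO = 8, 16

latent_frames = FRAMES // T_RATIO    # 204 / 8  -> 25
latent_height = HEIGHT // S_RATIO    # 544 / 16 -> 34
latent_width = WIDTH // S_RATIO      # 992 / 16 -> 62

pixels = FRAMES * HEIGHT * WIDTH
latents = latent_frames * latent_height * latent_width
print(f"pixel grid : {FRAMES} x {HEIGHT} x {WIDTH} = {pixels:,}")
print(f"latent grid: {latent_frames} x {latent_height} x {latent_width} = {latents:,}")
print(f"~{pixels // latents}x fewer spatio-temporal positions")
```

A roughly 2,000-fold reduction in positions is what makes attending over an entire clip tractable for the downstream transformer.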
Furthermore, the model encodes user prompts using two bilingual text encoders, allowing it to handle both English and Chinese. A diffusion transformer (DiT) with 3D full attention, trained using Flow Matching, denoises input noise into latent frames, while a video-focused direct preference optimization stage, Video-DPO, reduces artifacts and improves visual quality.
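The report names Flow Matching as the training objective; the sketch below shows a generic rectified flow-matching training step in PyTorch to illustrate the idea. It is not Stepfun's code: the model call signature, tensor shapes, and uniform time sampling are all assumptions.

```python
# A generic (rectified) flow-matching training step, sketched in PyTorch.
# Illustrates the objective only; the model interface is a placeholder.
import torch
import torch.nn.functional as F

def flow_matching_step(model, latents, text_emb):
    """One training step: regress the velocity carrying noise to data."""
    noise = torch.randn_like(latents)                        # x_0 ~ N(0, I)
    t = torch.rand(latents.size(0), device=latents.device)   # uniform time
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))            # broadcastable

    # Straight-line interpolation between noise and data defines the path.
    x_t = (1.0 - t_) * noise + t_ * latents
    target_velocity = latents - noise                        # d x_t / d t

    pred = model(x_t, t, text_emb)   # DiT predicts the velocity field
    return F.mse_loss(pred, target_velocity)
```

The network learns the constant velocity of the straight-line path from noise to data; at inference, a sampler integrates that predicted field to turn pure noise into latent frames, which the VAE then decodes to pixels.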
The company has evaluated the model on a novel video generation benchmark and made the results public, and users can also try the model out themselves for free.
Meanwhile, Stepfun has also launched the Step-Audio large model, which can produce emotional expression, dialects, multiple languages, singing, and personalized styles to suit different scene requirements, and can engage in natural, high-quality dialogue with users. The company bills it as the industry's first product-level open-source voice interaction model.