Introducing Ruyi: Our first image-to-video model

We are thrilled to announce that CreateAI has officially launched "Ruyi", its image-to-video large model, today. We have also open-sourced the Ruyi-Mini-7B version, which users can now download and use via Hugging Face. By open-sourcing it, we aim to let AIGC enthusiasts and community members freely explore and experience "Ruyi."

"Ruyi" is designed specifically to run on consumer-grade GPUs such as the RTX 4090. It ships with detailed deployment instructions and a ComfyUI workflow, so users can get started quickly and easily.

About Ruyi

Ruyi is CreateAI's first officially released image-to-video model, delivering results at an artistic level. It features exceptional frame-to-frame consistency, generating videos with smooth motion, exquisite details, vibrant colors, and elegant composition. Aiming for master-level visual quality, Ruyi embodies our belief that transitioning from static images to dynamic videos is the best way to tell stories.

Ruyi achieves significant breakthroughs in motion amplitude control and camera control, and its deep learning capabilities for anime and gaming scenarios make it an ideal creative companion for ACG (Anime, Comics, and Games) enthusiasts.

Outstanding Performance

- Multi-resolution and multi-duration generation:

Ruyi supports resolutions ranging from a minimum of 384×384 to a maximum of 1024×1024, accommodating any aspect ratio. It can generate videos up to 120 frames or 5 seconds in length.
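
As a quick check of the figures above, 120 frames over 5 seconds implies 24 frames per second. The sketch below encodes those limits. The helper names are hypothetical, and reading the resolution range as a per-side bound of 384 to 1024 pixels is an assumption rather than something stated in this announcement.

```python
# Limits quoted in the announcement.
MIN_SIDE = 384
MAX_SIDE = 1024
MAX_FRAMES = 120
MAX_SECONDS = 5


def implied_fps() -> float:
    """Frame rate implied by '120 frames or 5 seconds'."""
    return MAX_FRAMES / MAX_SECONDS


def frames_for(seconds: float) -> int:
    """Frame count for a clip of the given duration at the implied rate."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds")
    return round(seconds * implied_fps())


def within_limits(width: int, height: int) -> bool:
    """Assumption: each side must fall between MIN_SIDE and MAX_SIDE pixels."""
    return MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE


if __name__ == "__main__":
    print(implied_fps())             # 24.0
    print(frames_for(2.5))           # 60
    print(within_limits(768, 512))   # True
```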

- Keyframe control for generation:

Ruyi allows video creation with up to 5 starting frames and 5 ending frames. Through iterative superposition, videos of any length can be generated.
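
One way to picture iterative superposition is as a chain of segments, where the last few frames of one segment serve as the starting frames of the next. The sketch below only plans the frame ranges and does not call the model. The function name, the 120-frame segment length, and the 5-frame overlap are illustrative assumptions drawn from the limits quoted in this post, not a description of Ruyi's actual implementation.

```python
from typing import List, Tuple


def plan_segments(total_frames: int,
                  segment_len: int = 120,
                  overlap: int = 5) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering total_frames.

    Each segment after the first begins `overlap` frames before the
    previous segment ends, so those frames can act as its starting frames.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be in [0, segment_len)")

    segments = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end >= total_frames:
            break
        # Reuse the tail of this segment as the head of the next one.
        start = end - overlap
    return segments


if __name__ == "__main__":
    print(plan_segments(236))  # [(0, 120), (115, 235), (230, 236)]
```

With a 5-frame overlap, each additional segment contributes at most 115 new frames, so a 235-frame video needs two segments and a 236-frame video needs three.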

- Motion amplitude control:

Ruyi offers four levels of motion amplitude adjustment, enabling users to control the degree of overall visual changes with ease.

- Camera control:

Ruyi provides five types of camera movement controls: tilt-up, tilt-down, pan-left, pan-right, and static.

Technical Overview

- Model Architecture

Ruyi is an image-to-video generation model based on the DiT architecture. It comprises two key components:

Causal VAE Module: Handles video compression and decompression. It reduces spatial resolution to 1/8 and temporal resolution to 1/4, with each pixel represented by 16 channels in BF16 after compression.
Diffusion Transformer Module: Generates compressed video data using 3D full attention, with 2D RoPE for the spatial dimensions, sin-cos position embeddings for the temporal dimension, and DDPM (Denoising Diffusion Probabilistic Models) for model training. The model includes approximately 7.1 billion parameters and was trained on a dataset of about 200 million video clips.
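
To make the compression factors concrete, the sketch below computes the latent tensor shape and its size in bytes for a given clip. It assumes the frame count and both sides divide evenly by the stated factors and that the input is 8-bit RGB. The real VAE may treat the first frame or padding differently, and the function names are hypothetical.

```python
from typing import Tuple

SPATIAL_FACTOR = 8     # spatial resolution reduced to 1/8
TEMPORAL_FACTOR = 4    # temporal resolution reduced to 1/4
LATENT_CHANNELS = 16   # channels per latent pixel
BF16_BYTES = 2         # bytes per BF16 value


def latent_shape(frames: int, height: int, width: int) -> Tuple[int, int, int, int]:
    """Return (frames, channels, height, width) of the compressed video."""
    if frames % TEMPORAL_FACTOR or height % SPATIAL_FACTOR or width % SPATIAL_FACTOR:
        # Assumption: this sketch only handles evenly divisible sizes.
        raise ValueError("dimensions must divide evenly by the compression factors")
    return (frames // TEMPORAL_FACTOR, LATENT_CHANNELS,
            height // SPATIAL_FACTOR, width // SPATIAL_FACTOR)


def latent_bytes(frames: int, height: int, width: int) -> int:
    """Size of the BF16 latent tensor in bytes."""
    t, c, h, w = latent_shape(frames, height, width)
    return t * c * h * w * BF16_BYTES


def rgb_bytes(frames: int, height: int, width: int) -> int:
    """Size of the uncompressed clip, assuming 3 channels of 8 bits each."""
    return frames * height * width * 3


if __name__ == "__main__":
    print(latent_shape(120, 1024, 1024))   # (30, 16, 128, 128)
    ratio = rgb_bytes(120, 1024, 1024) / latent_bytes(120, 1024, 1024)
    print(ratio)                           # 24.0
```

Under these assumptions, a 120-frame 1024×1024 clip becomes a 30×16×128×128 latent, about 24 times smaller in bytes than the raw 8-bit RGB frames.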

- Training Data and Methodology

The training process is divided into four phases:

Phase 1: Pre-training from scratch with ~200M video clips and ~30M images at 256 resolution, using a batch size of 4096 for 350,000 iterations to achieve full convergence.
Phase 2: Fine-tuning with ~60M video clips for multi-scale resolutions (384–512), with a batch size of 1024 for 60,000 iterations.
Phase 3: High-quality fine-tuning with ~20M video clips and ~8M images for 384–1024 resolutions, with dynamic batch sizes according to GPU memory and 10,000 iterations.
Phase 4: Final video training with ~10M curated high-quality video clips, using a batch size of 1024 for ~10,000 iterations.
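
The schedule above can be turned into a rough estimate of how many samples each phase processes, by multiplying batch size by iteration count. The sketch below does this arithmetic. The dataset sizes are the approximate figures quoted above, the class and field names are hypothetical, and Phase 3 is left uncomputed because its batch size is dynamic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    clips: int                  # approximate number of video clips
    images: int                 # approximate number of images
    batch_size: Optional[int]   # None means dynamic batch size
    iterations: int

    @property
    def samples_seen(self) -> Optional[int]:
        """Total samples processed, or None if the batch size is dynamic."""
        if self.batch_size is None:
            return None
        return self.batch_size * self.iterations

    @property
    def approx_passes(self) -> Optional[float]:
        """Rough number of passes over the phase's data."""
        seen = self.samples_seen
        if seen is None:
            return None
        return seen / (self.clips + self.images)


PHASES = [
    Phase("Phase 1", 200_000_000, 30_000_000, 4096, 350_000),
    Phase("Phase 2", 60_000_000, 0, 1024, 60_000),
    Phase("Phase 3", 20_000_000, 8_000_000, None, 10_000),
    Phase("Phase 4", 10_000_000, 0, 1024, 10_000),
]

if __name__ == "__main__":
    for p in PHASES:
        print(p.name, p.samples_seen, p.approx_passes)
```

By this estimate, Phase 1 processes about 1.43 billion samples, roughly six passes over its data, while Phases 2 and 4 each amount to about one pass.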

- Input Format and Output Options

Ruyi requires a single image as input and allows customization of output parameters, including video duration, resolution, motion amplitude and camera movement direction. Based on the input image, Ruyi generates a video with a maximum duration of 5 seconds.
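
The options described here could be bundled into a single validated structure. The sketch below is purely illustrative: the class, field names, and defaults are hypothetical and are not Ruyi's actual interface. Numbering the four motion amplitude levels 1 to 4 is also an assumption, since this post does not name them.

```python
from dataclasses import dataclass
from enum import Enum


class CameraMotion(Enum):
    """The five camera movement types listed in this post."""
    STATIC = "static"
    TILT_UP = "tilt-up"
    TILT_DOWN = "tilt-down"
    PAN_LEFT = "pan-left"
    PAN_RIGHT = "pan-right"


@dataclass(frozen=True)
class GenerationOptions:
    image_path: str                  # the single input image
    seconds: float = 5.0             # output duration, at most 5 seconds
    width: int = 512
    height: int = 512
    motion_level: int = 2            # assumption: levels numbered 1 to 4
    camera: CameraMotion = CameraMotion.STATIC

    def __post_init__(self) -> None:
        if not 0 < self.seconds <= 5:
            raise ValueError("seconds must be in (0, 5]")
        if self.motion_level not in range(1, 5):
            raise ValueError("motion_level must be 1, 2, 3, or 4")
        if not isinstance(self.camera, CameraMotion):
            raise TypeError("camera must be a CameraMotion value")


if __name__ == "__main__":
    opts = GenerationOptions("input.png", seconds=3, camera=CameraMotion.PAN_LEFT)
    print(opts.camera.value)  # pan-left
```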

Existing Defects

Ruyi still has issues such as hand deformation, facial detail collapse in multi-person scenarios, and uncontrollable scene transitions. We are working on improving these shortcomings and will fix them in future updates.

Next Step

As competition in the AIGC field intensifies, CreateAI believes that the most promising applications are those driving the development of generative AI tools. The company is committed to leveraging large models to reduce the development time and costs associated with anime and game content. Accordingly, Ruyi will continue focusing on addressing core industry challenges.

The current release of the Ruyi model already supports:

Keyframe-based Generation: Generating up to 5 seconds of video content based on an input keyframe.
Transition Content Creation: Generating intermediate content between two keyframes to streamline workflows and reduce production timelines.

In the future, Ruyi aims to address more advanced scene requirements and achieve breakthroughs in direct CUT generation. The next release will include two versions, offering creators more flexible options tailored to varying needs.

For additional support and resources, please scan the QR code below to join our WeChat community. We look forward to your feedback and experiences to help continuously improve the model.

Create with Ruyi: https://huggingface.co/IamCreateAI/Ruyi-Mini-7B
