
Alibaba's new open source video model may become a Chinese Adobe

Zhiwei · 2025-05-16


Alibaba's open-source Wan2.1-VACE video editing model, with its strong multitask capability and editability, has the potential to become a Chinese counterpart to Adobe. The model supports not only basic text-to-video generation but also image reference, video repainting, local editing, and other functions, greatly improving the flexibility and controllability of video generation. This article looks at the model's technical highlights, its application scenarios, and its potential impact on the future of the video production industry.




Last night, Alibaba officially open-sourced Tongyi Wanxiang Wan2.1-VACE, an all-in-one video editing model that just might make Alibaba the future Adobe of China in the field of video production.


Why say so? Before introducing VACE, let's first set the stage with the current state of video generation products.


The deepest impression these products leave on the general public is usually the wow factor of instant generation. It is not only the output quality: the gacha-like nature of generation, where the same input yields a different result every time, makes the experience as entertaining as opening a blind box.


For professionals who treat AI as a productivity tool, however, rolling the gacha is only the first step of the job; in practice, they routinely hit a wall at the second and subsequent rounds of editing.


Imagine a scenario: a startup wants to post a 30-second promotional video for a new product on social media. The product is a portable coffee machine aimed at urban white-collar workers and travel enthusiasts, and the staff want AI to help produce the short film. In practice, relying solely on AI's one-shot output of material, or on endless gacha pulls, will never get the job done.


The design requirements themselves keep shifting from the very first generation onward: the selling points proposed by the marketing department ("quick extraction", "USB-C power", "lightweight") are often adjusted on the spot in later meetings. Good ideas also need repeated polishing, covering the pacing of the visuals, the tone of the copy, and the shot transitions, and only after seeing a first draft can anyone tell whether it feels right.




If AI can only produce a result in one shot, any later modification is difficult, and may amount to starting over. Only with multi-round interaction and editability can the creative cycle be shortened dramatically while creative flexibility is preserved; only then does AI become a true productivity tool.


Close human-machine interaction is therefore the path that best fits AI's development today, but it is very hard to achieve.


Compared with text, controllable generation of pixel-based content is clearly harder. Setting aside semantic and physical constraints and comparing only the number of possible states: a 10-token sentence, with GPT-4 as the example and a vocabulary of roughly 10^5, has (10^5)^10 = 1.0 × 10^50 possible states. A color RGB video (3 channels per pixel, 256 values per channel, counted here as 768 values per pixel), say 128 × 128 pixels for 3 seconds at 10 frames per second, contains 491,520 pixels in total and has on the order of 768^491520 potential states, a magnitude far beyond that of text.
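To make the comparison concrete, here is a minimal Python sketch that reproduces the back-of-the-envelope arithmetic above; the vocabulary size, resolution, frame rate, and the 768-values-per-pixel counting are the illustrative assumptions used in this article, not measured properties of any model:

```python
import math

# Text: a 10-token sentence with a GPT-4-style vocabulary of ~10^5 entries
vocab_size = 10 ** 5
num_tokens = 10
text_states_log10 = num_tokens * math.log10(vocab_size)        # = 50

# Video: 128 x 128 pixels, 3 seconds at 10 frames per second (30 frames),
# counted as 768 values per pixel (3 RGB channels x 256 values), as above
num_frames = 3 * 10
num_pixels = 128 * 128 * num_frames                             # = 491,520
video_states_log10 = num_pixels * math.log10(768)

print(f"text : ~10^{text_states_log10:.0f} possible states")    # ~10^50
print(f"video: ~10^{video_states_log10:,.0f} possible states")  # ~10^1,418,000+
```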


It is then easy to see why today's video generation products are generally slow and expensive, which in turn underlines the efficiency and cost advantage of targeted re-editing over mindless gacha rolling.


Controllability has made good progress for image and 3D generation, but controllable video generation has only recently produced results visible to the naked eye. Mainstream products in this space still have significant limitations, which sharply constrain how creative ideas can be realized.


The Tongyi Wanxiang team told Zhiwei that video generation and editing face major challenges:


Fragmentation of video generation and editing: traditional methods usually target a single task (such as text-to-video, reference-based generation, or object replacement) and lack a unified framework, so different tasks require separate models, pipelines are chained together inefficiently, and inference costs are high.

Insufficient controllability: existing methods struggle to support multi-dimensional or multi-task editing at once (such as referencing subject, content, and structure simultaneously), and users cannot adjust a video as flexibly as they edit text.

Demand for high-quality content: the short-video and film industries need high-fidelity, highly consistent video generation, yet existing models are prone to inter-frame flickering and semantic inconsistency.

Take professional photo-editing software as an example. The reason a design tool can truly prove itself in a tight production pipeline is that it offers a rich ecosystem of tools that can be combined on demand: from healing brushes and content-aware fill, to channel mixers and bitmap/vector masks, to action scripts and third-party plugins, almost every creative need can find a corresponding tool.


This lets designers switch ideas and techniques flexibly at different stages of a project without ever leaving the working interface.


And last night, with the official open-sourcing of Tongyi Wanxiang Wan2.1-VACE, Alibaba brought production-grade multitasking capability to AI video.




The open-source addresses are as follows:


GitHub: https://github.com/Wan-Video/Wan2.1

HuggingFace: https://huggingface.co/Wan-AI

ModelScope: https://www.modelscope.cn/organization/Wan-AI?tab=model

Wan2.1-VACE comes in two versions, 1.3B and 14B. The 1.3B version is suited to local deployment and experimental fine-tuning, runs on consumer-grade graphics cards (a Preview build was released earlier), and supports 480P; the 14B version offers higher generation quality and supports both 480P and 720P.


Developers can now download and try it on GitHub, Hugging Face, and ModelScope. The model will also roll out gradually on the Tongyi Wanxiang website and Alibaba Cloud Bailian.
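For readers who want to pull the weights locally, the sketch below shows one plausible way to do it with the huggingface_hub library; the exact repository IDs (e.g. "Wan-AI/Wan2.1-VACE-1.3B") are assumptions inferred from the version names above, so check the Wan-AI organization page linked earlier for the actual names:

```python
# Minimal download sketch (assumed repo ID; verify on https://huggingface.co/Wan-AI).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Wan-AI/Wan2.1-VACE-1.3B",   # 1.3B variant: 480P, runs on consumer GPUs
    local_dir="./Wan2.1-VACE-1.3B",
)
print(f"Model files downloaded to {local_path}")
```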




Wan2.1-VACE is positioned around "the most complete feature set" and "editability". A single model supports not only basic text-to-video generation but many other functions at once, so there is no need to train a new expert model for each function, and the overhead of deploying multiple models disappears. The Tongyi Wanxiang team states that Wan2.1-VACE is the first all-in-one model built on a video DiT architecture to support such a broad range of tasks.


Text conditioning has greatly improved the editability of video generation, but it is not enough to precisely control every detail of a video (exact layout, object shapes, and so on). Wan2.1-VACE therefore extends its multitask capabilities to enable finer-grained editing.


Overall, Wan2.1-VACE's multitasking capabilities include:


Image reference: given a reference subject (a face or an object) and a background, generate video content with consistent elements.

Video repainting: pose transfer, motion control, structure control, recoloring, and so on (driven by depth maps, optical flow, layout, grayscale, line art, and pose controls);

Local editing: subject reshaping, subject removal, background extension, duration extension, and so on.

Take image-reference generation. In the example, Wan2.1-VACE generated a video from reference images of a small snake and a girl, in which the girl gently strokes the snake. Image-reference generation is important for adding new elements and for keeping elements consistent across multi-shot videos.


Prompt: In a joyful, festive scene, a little girl in bright red New Year clothes plays with her cute cartoon snake. Her outfit is embroidered with golden auspicious patterns, radiating a celebratory air, and her face beams with a bright smile. The snake is a vivid green, rounded in shape, and its wide eyes make it look both friendly and humorous. The girl happily strokes the snake's head with her hand as the two enjoy this warm moment together. Colorful lanterns and ribbons decorate the surroundings, and sunlight falls on them, creating a New Year atmosphere full of warmth and happiness.




Local editing is likewise indispensable for efficient editability: it makes it possible to delete or replace existing elements and to add new ones. In the example below, Wan2.1-VACE uses local video editing to remove the tablet from the woman's hands without leaving a trace.


Prompt: Documentary photography style. A real-estate blogger stands in the center of a modern living room, dressed simply and fashionably, smiling, both hands raised in front of him with nothing in them as he introduces the home to the camera. The background is a spacious, bright living room with minimalist modern furniture and a lush garden outside the floor-to-ceiling windows. The room is well lit and cozy. Medium full-body shot, eye-level perspective, with a slight sense of motion, as if a finger were tapping the screen.




In addition, by further combining video repainting, Wan2.1-VACE can control how new elements are presented through different motion-control capabilities.


For example, sketches and edge maps are well suited to controlling an object's overall motion trajectory. The example below shows the fighter-jet motion and camera motion that Wan2.1-VACE generated from a sketched motion trajectory and a fighter-jet reference image.


Prompt: Fighter-jet point of view, spinning rapidly, dogfighting with enemy planes in the clouds, a sudden roll, a rapid dive, missiles grazing the fuselage, the exhaust flame tracing an arc through the clouds.




Grayscale video provides information about light and dark, which can guide the model in colorization or detail reconstruction. In the example below, Wan2.1-VACE generated a video of a man riding a horse beside a moving train from grayscale footage.


Prompt: A foreign man rides a brown horse along the railway tracks. He wears a gray shirt and a black cowboy hat; in the background, a steam train made up of several carriages is moving and trailing smoke. The sky is an orange sunset.




Human pose maps (skeleton keypoints) provide intuitive structural information and are well suited to controlling character motion in a video. In the example below, Wan2.1-VACE generated a video of a boy practicing karate from a pose map.


Prompt: Realistic-style photography. A 10-year-old white boy in a white martial-arts uniform with a yellow belt practices karate in a spacious, bright room. He punches and takes stances with focus and discipline, his movements smooth and effortless. The background is blurred, with stacked mats and other gym equipment faintly visible. The camera follows his movements, panning left and right to capture medium-to-close shots that show his controlled, fluid motion.




Optical flow describes per-pixel motion between frames and is an important modality for expressing fine-grained motion structure. In the example below, Wan2.1-VACE generated a scene of a preserved plum dropping into water and sending up a splash, based on an optical-flow map.


Prompt: Documentary photography style. A deep-purple preserved plum falls slowly into a transparent glass, sending up crystal-clear droplets. The moment is captured in slow motion, the splash blooming in mid-air in a beautiful arc. The water in the glass is perfectly clear, contrasting sharply with the plum's color. The background is clean, highlighting the subject. Extreme close-up, vertical top-down perspective, showing the beauty of the details.




Wan2.1-VACE also supports video background extension and video duration extension. In the example below, background extension expands a close-up of a woman playing the violin into the full concert scene already implied by the original content.


Prompt: An elegant lady plays the violin with passion, an entire symphony orchestra behind her.




Through duration extension, Wan2.1-VACE shows the off-road rider starting behind the camera and riding up to the small slope ahead.


Prompt: An off-road motocross race scene: a fully equipped rider takes a motorcycle up a dirt slope, the wheels kicking up high sprays of soil.




Taken together, these examples show how Wan2.1-VACE's multitask abilities interlock. Image reference and local editing provide the basic means to delete, replace, and add elements; video repainting controls exactly how new elements are presented, with each modality having its own strengths; and background extension and duration extension open up imaginative space, or restore a complete scene, across space and time.


To get the full benefit of Wan2.1-VACE, therefore, one should explore free combinations of its atomic abilities; only then do production-grade scenarios become achievable, and Wan2.1-VACE supports this very well.


For example, in the multi-shot promotional video below, Wan2.1-VACE freely combines multiple abilities to meet the needs of each shot while keeping the characters consistent across shots.


This clip combines frame extension, pose transfer, and image reference: frame extension enlarges the window, pose transfer has the girl do stretching exercises, and image reference adds more birds.




This clip combines local editing and image reference, using image reference to teleport the little elephant doll through an "Anywhere Door" into the marked region of the park scene.




This clip combines motion control and image reference, allowing the elephant to float from the ground and soar into the sky.




This clip combines local editing, pose transfer, and image reference, using pose transfer to control the girl's gait and allowing her to quickly change clothes through local editing and image reference.




Finally, this clip combines pose transfer and image reference, giving the girl a professional skateboarder's stance and, combined with different landscape images, letting her ride her skateboard through the city, the desert, and the sea.




How could developers not love a production-grade tool like this? Tongyi Wanxiang's track record so far already makes the point.


Since February this year, Tongyi Wanxiang has successively open-sourced its text-to-video, image-to-video, and first-and-last-frame video models. Downloads in the open-source community now exceed 3.3 million, and the project has earned more than 11,000 stars on GitHub, making it the most popular video generation model of its period. Wan2.1-VACE is expected to bring another wave of community activity.


How did Tongyi Wanxiang manage to fuse so many capabilities into a single model? To answer that question, Zhiwei spoke with the team.


Tongyi Wanxiang told Zhiwei that getting there involves quite a few challenges:


Multi-task unified modeling: how to accommodate generation, editing, and other tasks within a single architecture while maintaining high performance.

Fine-grained control: how to decouple attributes such as content (objects), motion (temporal behavior), and style (appearance) within a video so they can be edited independently.

Data and training complexity: multi-task data construction requires processing each task according to its own characteristics and assembling a high-quality training set.

On the modeling side, the VCU (Video Condition Unit) is the core module through which Wan2.1-VACE achieves comprehensive, controllable editing. "Using the VCU is the source of task unification, and it is what distinguishes the model from other special-purpose models that support only specific tasks." The VCU's importance shows up in:


Unified representation: the inputs for video generation and editing are defined as input video, input mask, reference image, and so on;

Multi-task unification: the VCU acts as an intermediate layer that isolates task differences (such as generation versus editing), so that the representations of different tasks can be injected into the generation module;

Fine-grained control: the VCU's decoupled design makes task differentiation and fine-grained control possible.

To briefly explain the VCU's makeup: Wan2.1-VACE's multitask capability can in fact be expressed as a unified input interface over three data modalities, namely text prompts, reference images, and masks.


Based on what each video task requires from these three kinds of multimodal input, the tasks fall into four categories:


Text-to-video generation (T2V);

Reference-image generation (R2V);

Video-to-video editing (V2V), i.e. video repainting;

Masked video-to-video editing (MV2V), i.e. local video editing.

VCU uses a unified representation method to represent the inputs of the above four types of tasks in the same triplet form (T, F, M), where T is the text prompt, F is the reference image or context frame, and M is the mask:


In T2V, there is no need for context frames or masks. Each frame defaults to a 0 input value, and each mask defaults to a 1 input value, indicating that all pixels in these 0-value frames will be regenerated.

For R2V, additional reference frames (such as faces, objects, etc.) are inserted before the default 0-value frame sequence. In the mask sequence, the mask of the default frame is all 1s, and the mask of the reference frame is all 0s, which means that the default frame should be regenerated while the reference frame should remain unchanged.

In V2V, the context frames are the input video frames (depth, grayscale, pose, and so on), and the masks default to 1, indicating that all input video frames will be regenerated.

For MV2V, both context frames and masks are required; the mask is 0 in some regions and 1 in others, and the parts where the mask is 1 are regenerated.

In this way, the different tasks are unified within a single model, as shown in the figure below.




VCU: A unified input representation for four types of video processing tasks.


Image source: https://arxiv.org/pdf/2503.07598
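To make the (T, F, M) unification above more tangible, here is a minimal Python sketch, not the team's actual implementation, of how the four task types could be packed into a single triplet interface; the class name, tensor shapes, and helper functions are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class VCU:
    """Illustrative Video Condition Unit: text prompt T, frames F, masks M."""
    text: str            # T: text prompt
    frames: np.ndarray   # F: (num_frames, H, W, 3); all-zero frames mean "to be generated"
    masks: np.ndarray    # M: (num_frames, H, W); 1 = regenerate, 0 = keep unchanged


def t2v(prompt: str, n: int, h: int, w: int) -> VCU:
    # T2V: no context frames or masks, so frames default to 0 and masks to 1.
    return VCU(prompt, np.zeros((n, h, w, 3)), np.ones((n, h, w)))


def r2v(prompt: str, refs: List[np.ndarray], n: int, h: int, w: int) -> VCU:
    # R2V: reference frames (mask 0, keep) are inserted before the default zero-value frames (mask 1).
    frames = np.concatenate([np.stack(refs), np.zeros((n, h, w, 3))])
    masks = np.concatenate([np.zeros((len(refs), h, w)), np.ones((n, h, w))])
    return VCU(prompt, frames, masks)


def v2v(prompt: str, control_frames: np.ndarray) -> VCU:
    # V2V: context frames carry depth/grayscale/pose/etc.; masks are all 1, so everything is regenerated.
    return VCU(prompt, control_frames, np.ones(control_frames.shape[:3]))


def mv2v(prompt: str, video: np.ndarray, region_masks: np.ndarray) -> VCU:
    # MV2V: regenerate only where the mask is 1 and keep the rest of the input video.
    return VCU(prompt, video, region_masks)
```

Whatever the task, the downstream generation module then only ever has to consume one input format.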


The VCU's structure is simple and elegant, but it grew out of the team's long-term technical accumulation. As Tongyi Wanxiang put it: "ACE and ACE++ were our first attempts at unified generation and editing in the image domain, and they achieved good results. VACE is in turn a flexible application of ACE to video, and the design of the VCU evolved from the unified input module used for images."


Implementing the VCU itself also poses challenges. Tongyi Wanxiang explained: "VACE adopts a unification strategy on the input side (early fusion), unlike approaches that use additional encoding modules to process different input modalities; simplicity and uniformity are our design principles. The core challenge is for a single model to match the results of special-purpose models."


Building a multitask model also raises the bar on data quality. The Tongyi Wanxiang team annotates the first frame of each video, identifying which objects appear and drawing bounding boxes to localize them, and filters out videos whose target regions are too small or too large. Along the time dimension, they also check whether the target remains present throughout the video, to avoid abnormal scenes caused by the target becoming too small or disappearing.


To make the model adapt to flexible combinations of abilities, the Tongyi Wanxiang team trains on random combinations of all the tasks. For every operation involving masks, they apply augmentation at arbitrary granularity to cover local-generation needs at every scale, roughly as sketched below.
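As a rough illustration of this setup, the sketch below shows what random task combination plus arbitrary-granularity mask augmentation might look like; the task names, combination sizes, and sampling ranges are assumptions for illustration, not the team's published recipe:

```python
import random

# Hypothetical atomic task labels standing in for the capabilities described above.
ATOMIC_TASKS = ["t2v", "r2v", "v2v_depth", "v2v_pose", "v2v_flow", "mv2v_inpaint", "mv2v_outpaint"]

def sample_task_combo(max_tasks=3):
    """Pick a random combination of atomic tasks for one training sample."""
    k = random.randint(1, max_tasks)
    return random.sample(ATOMIC_TASKS, k)

def sample_mask_size(height, width):
    """Arbitrary-granularity augmentation: mask regions from tiny patches to near-full-frame."""
    scale = random.uniform(0.05, 0.95)
    return max(1, int(height * scale)), max(1, int(width * scale))

for _ in range(3):
    print(sample_task_combo(), sample_mask_size(480, 832))
```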


Training proceeds in stages, from easy to hard. Starting from a pre-trained text-to-video model, the team first focuses on tasks such as mask inpainting and extension. It then moves gradually from single to multiple input reference frames, and from single tasks to composite tasks. Finally, higher-quality data and longer sequences are used to fine-tune output quality. As a result, the training inputs can adapt to arbitrary resolutions, dynamic durations, and variable frame rates.


In recent years, AI video generation models have evolved rapidly, making the leap from "being able to generate" to "being able to steer generation". The evolution of multimodal input reflects a shift from "one key opens one lock" to "multiple cues conducting in concert".


Each modality has its own strengths: text supplies abstract semantics, images supply appearance details, pose and sketches constrain structure, optical flow constrains motion continuity, reference frames keep identity constant, and so on. This trajectory shows the potential of AI video: by continually introducing new control dimensions, humans keep strengthening their ability to have AI create video according to intent.


By fusing different control dimensions, video generation models are beginning to understand and decide holistically, balancing competing requirements under complex conditions. This greatly improves the editability of generation and also makes the models better suited to real creative workflows, where many kinds of material are mixed together.


It is clear that Wan2.1-VACE is a key achievement in completing this transformation.


Looking ahead, further improving realism, extending duration, enhancing interactivity (for example, adjusting generated video in real time), and incorporating physics and 3D knowledge to avoid distortion will remain ongoing research priorities. What is certain is that the editable, multi-condition video generation paradigm is now largely established and will become the new paradigm for digital media production.


And this production paradigm may, in time, completely reshape video post-production workflows, displacing the Premiere Pro, After Effects, and Final Cut tools in the hands of video professionals.


Author: Liu Dagu · Editor: Da Bing


This article was written by [Zhiwei] (WeChat official account: [Zhiwei]), an author on Everyone Is a Product Manager, where it is published as original/authorized content. Reproduction without permission is prohibited.


The header image is from Unsplash, under the CC0 license.





