OpenAI's Sora is dominating the news, but Tencent's latest generative video model, Hunyuan, has received far less attention. Here's why I think it's significant.

Hunyuan is a 13B-parameter model providing text-to-video, avatar animation, and, notably, video-to-audio capabilities. Tencent claims outputs are "comparable to, if not superior to" other leading generative video models, with independent evaluations finding Hunyuan outperformed Runway Gen-3 Alpha and Luma 1.6. I've found the quality impressive but inconsistent: compared to Sora, the outputs lagged on motion fluidity and human subjects, although others have had better results.

So why is it so significant? Hunyuan is open source and represents the most powerful and dynamic open generative video model currently available. There has been progress in open-source generative video (such as Mochi 1), but most advances and multi-functional capabilities are found in proprietary, closed models. Accessible closed models like Sora may be in the hands of many users right now, but open models like Hunyuan give the global open-source community the foundations to experiment and develop novel applications. As we've seen with other open generative models and modalities, these permutations could reshape how we view what's possible with generative video, for better and/or for worse.

https://lnkd.in/eKYc6KvW
Advancements in Open-Source Video Models
Summary
Advancements in open-source video models refer to the rapid improvements in AI systems that can create, analyze, and understand videos using freely accessible code and resources. These developments make it possible for researchers and companies worldwide to experiment with and apply cutting-edge video generation and analysis tools without relying on expensive, proprietary software.
- Explore new releases: Check out the latest open-source video models and tools, as many are now rivaling or surpassing commercial options in quality and versatility.
- Experiment and collaborate: Take advantage of open-source platforms to test, adapt, and share your creative or research projects, making it easier to innovate and solve diverse challenges together.
- Utilize efficient techniques: Look for models and architectures that include smart training and processing methods, which help reduce costs and speed up video generation or analysis without sacrificing performance.
---
MIT's new "Radial Attention" makes generative video 4.4x cheaper to train and 3.7x faster to run. Here's why:

The problem with current AI video? It's BRUTALLY expensive. Every frame must "pay attention" to every other frame. With thousands of frames, costs explode quadratically. Training one model? $100K+. Running it? Painfully slow.

MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence just changed the game. Their breakthrough insight: video attention works like physics.
- Sound gets quieter with distance
- Light dims as it travels
- Heat dissipates over space
It turns out AI video tokens follow the same rules. Why waste compute on distant, irrelevant connections?

Enter Radial Attention. Instead of checking EVERY connection:
• Nearby frames → full attention
• Distant frames → sparse attention
• Computation scales as O(n log n) instead of O(n²)
Translation: MASSIVE efficiency gains.

Real-world results on production models:
📊 HunyuanVideo (Tencent): 2.78x training speedup, 2.35x inference speedup
📊 Mochi 1: 1.78x training speedup, 1.63x inference speedup
Quality? Maintained or IMPROVED.

What this unlocks:
- 4x longer videos with the same resources
- 4.4x cheaper training costs
- 3.7x faster generation
- Works with existing models (no retraining!)
And MIT open-sourced everything: https://lnkd.in/gETYw8eT

The bigger picture: the internet is transforming.
BEFORE: a place to store videos from the real world.
NOW: a machine that generates synthetic content on demand.
Think about it:
• TikTok filled with AI-generated content
• YouTube creators using AI for entire videos
• Streaming services producing personalized shows
• Educational content generated for each student
This changes everything.

Remember when only big tech could afford generative AI?
2020: GPT-3 → only OpenAI
2022: Stable Diffusion → everyone
2024: Midjourney everywhere
Video AI is next.
Radial Attention probably just accelerated the timeline. The future isn't coming. It's here. And it's more accessible than ever. Want to ride this wave? → Follow me for weekly AI breakthroughs → Share if this opened your eyes → Try the code: https://lnkd.in/gETYw8eT What will YOU create when video AI costs 4x less? #AI #VideoGeneration #MachineLearning #TechInnovation #FutureOfContent
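The sparsity pattern described above can be illustrated in a few lines. This is a toy sketch of the idea, not the released CUDA kernels: a dense local band plus long-range links whose stride doubles with each octave of distance, so each query attends to roughly O(log n) positions (the `window` size is an illustrative choice).

```python
import numpy as np

def radial_mask(n, window=4):
    """Toy radial sparsity pattern: tokens within `window` of each other
    attend densely; beyond that, long-range links are subsampled with a
    stride that doubles with each octave of distance."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            if d <= window:
                mask[i, j] = True                       # dense local band
            else:
                stride = 2 ** int(np.log2(d / window))  # doubles per octave
                mask[i, j] = (j % stride == 0)          # sparse long-range link
    return mask
```

On a 256-token toy sequence this mask keeps only a small fraction of the full n² connections, and that fraction keeps shrinking as n grows, which is where the training and inference speedups come from.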
---
This DEFINITELY flew under the radar: just a few days ago, AI at Meta released V-JEPA 2.1, taking a massive step toward closing the gap between image and video domains. For a long time, image backbones were the only option for solving dense vision tasks. This model disagrees, showing that universal spatial understanding also emerges from large-scale video models! 🎥

Quick recap on V-JEPA: it is a joint embedding predictive architecture built on a classic teacher-student setup. The teacher sees the full video, and its weights slowly update as an exponential moving average of the student. The student sees a masked input and predicts the latent features of the missing regions rather than reconstructing them in pixel space.

What changed between V1 and V2 was largely a matter of scale. The encoder grew to a 1B-parameter ViT-g, the dataset from 2M to 22M videos, training got longer and progressive, and clips were pushed to higher temporal and spatial resolution. V2 also introduced images into the mix via temporal duplication, training on 1M ImageNet samples.

But the difference between V2 and V2.1 is conceptual, not just more scaling. Sure, they pushed the model to 2B parameters and expanded the image dataset from 1M to 142M, but the real breakthrough lies in the training loss. In V-JEPA 2, supervision was only applied to the masked regions, despite the predictor outputting a token for every input, masked or not. Thus, the visible tokens were free to ignore local structure and aggregate global information if that would minimize the loss, similar to register tokens.

V-JEPA 2.1 fixes this by extending supervision to the visible tokens too. Every patch, masked or visible, now has a training signal forcing it to encode where things actually are in space and time. This results in feature maps that look nothing like before: spatially structured, semantically coherent, and temporally consistent.
Looking at the features below, you would almost think this is some small variant of DINOv3 (with due respect), except these results came from video pretraining! 🤯 This feature quality obviously translates to downstream tasks. Motion benchmarks got only a small buff, but spatial tasks are where the gains are staggering, with improvements ranging anywhere from 30 to 95%. The idea that we now basically have a SOTA image encoder baked into video features is crazy to me, and as someone working with video models on a daily basis, I could not be happier to put this to the test and distill it down into variants even smaller and faster than the smallest 80M one. Resources are down in the comments. Try it out if you were using the previous version, and let me know how it goes! ⏬
---
The last few weeks have been huge for open-source video generation and research. After two years of limited usability in open-source video generative models, we're finally seeing major advancements. Some of these new models now rival or outperform commercial ones, including Runway, Pika Labs, and Luma Labs.
- Mochi, released by Genmo a week ago, ranks 2nd among top generative video models and permits commercial use under an Apache 2.0 license
- CogVideoX-5B from Tsinghua University, released last month, supports both text2video and image2video and allows commercial use for companies with <1M users
- Allegro from Rhymes AI is a small model capable of generating a wide range of content, from human close-ups to diverse, dynamic scenes, and permits commercial use under an Apache 2.0 license

Also over the last few weeks, Meta announced MovieGen for generating HD personalized videos with synchronized audio, and Peking University openly released Pyramid Flow.

On the proprietary side, Runway released a new tool for transforming simple video and voice inputs into expressive character performances, Pika Labs released Pikaffects to transform video subjects with surreal effects, and Luma Labs announced API access to its generative video models. As for OpenAI's Sora, who knows; it might launch soon after the US elections are out of the way (this Tuesday).

Generative video models leaderboard: https://lnkd.in/grWJDVkd
Links to the mentioned models are in the comments.
---
Let's go!! Meta released a new video LLM on Hugging Face, and it sets a new SOTA (state of the art) for open-source video understanding. 🔥

The model is called LongVU, a new multimodal large language model capable of processing long videos (for things like answering questions about them, summarizing them, identifying important passages, etc.). LongVU can handle very long videos thanks to several clever compression techniques, which progressively reduce the number of tokens used to represent a video (and which a Transformer needs to process in parallel).

First, the authors employ DINOv2, a self-supervised image model also open-sourced by Meta, to remove redundant frames that exhibit high feature similarity across time. Next, features for the remaining frames are combined with features from SigLIP, an important vision encoder open-sourced by Google. The large language model (the text decoder part of LongVU) is conditioned on these features.

Next, after temporal reduction, the authors apply spatial reduction (reducing the width and height dimensions of certain video features). Based on the embeddings of the text query (e.g. "What did this man put on the pizza?"), less important frames get their features' resolution reduced, whereas the most important frames' features keep their original resolution.

Finally, spatial token compression (STC) further reduces the number of tokens. Here a non-overlapping window is slid over the tokens, and tokens that exhibit high cosine similarity with the first frame of each window are removed.

In terms of performance, the model gets SOTA results on EgoSchema, MVBench, VideoMME, and MLVU. Only on VideoMME is the gap with closed-source models (GPT-4o and Gemini) still large, but the results are quite impressive to see.
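The first compression stage, temporal reduction, amounts to dropping frames whose features barely change. A toy sketch of that idea (the 0.95 threshold is hypothetical, and the real pipeline computes these features with DINOv2):

```python
import numpy as np

def drop_redundant_frames(feats, threshold=0.95):
    """Keep a frame only if the cosine similarity between its feature
    vector and that of the last *kept* frame falls below `threshold`.
    feats: (num_frames, dim) per-frame features, e.g. from DINOv2."""
    keep = [0]                          # always keep the first frame
    for i in range(1, len(feats)):
        a, b = feats[i], feats[keep[-1]]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos < threshold:             # frame differs enough: keep it
            keep.append(i)
    return keep
```

A static shot collapses to a single frame, while a scene with constant change keeps every frame, which is exactly the behavior you want before spending LLM tokens on the video.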
Resources:
* paper: https://lnkd.in/eG-rC8Fg
* Gradio demo for you to try: https://lnkd.in/e8Ey9ci7
* checkpoints: https://lnkd.in/eJr8WWQB
* project page: https://lnkd.in/ee7dirPR
#huggingface #video #largelanguagemodels #generativeai #ai
---
AI video creation crossed a new milestone! Alibaba Group has released Wan2.2-S2V-14B, an open-source video generation model that feels like it is finally ready for real-world use.

Here's what you can do with Wan2.2:
1. Film-style control: adjust lighting, color, and overall "cinematic feel" just like a director would.
2. Voice-to-video: record a short audio clip, upload an image, and Wan2.2 turns it into a full cinematic sequence. Not just a talking head, but expressive, dynamic scenes.
3. Realistic motion: Wan2.2 was trained on one of the largest video datasets yet, which means smoother, more natural movements.
4. Works on consumer hardware: this is really amazing. You can run it on a single 4090 GPU and still get 720p @ 24fps video. Before, you'd need data-center-level hardware for that.
5. One tool, many modes: text-to-video, image-to-video, or hybrid, all supported in the same model.

Think about what can be built with this:
- A marketer can create a brand video in hours, not weeks
- An e-commerce team can turn product images into lifestyle clips instantly
- An educator can narrate a lesson and have it auto-converted into an engaging video
- A filmmaker can prototype entire scenes before ever stepping on set

This isn't just about faster video. It's about who gets to participate in video creation. If you had access to film-grade video AI on your laptop tomorrow, what's the very first thing you'd create? Let me know in the comments section.

Follow Avani Rajput for more such AI insights.
---
Stability AI recently released Stable Video Diffusion, a latent video diffusion model designed for cutting-edge text-to-video and image-to-video generation.

🔹 The approach builds upon existing latent diffusion models used for 2D image synthesis, which are adapted into generative video models through the integration of temporal layers and fine-tuning on limited, high-quality video datasets.

🔹 Despite these advancements, the field lacks consensus on standardized training methodology, leading to considerable variation in the training approaches documented in the literature. The paper undertakes a comprehensive exploration of three stages pivotal for effective training of video latent diffusion models (LDMs): text-to-image pretraining, video pretraining, and fine-tuning on high-quality video datasets. By identifying and evaluating these stages, the authors aim to establish a structured and unified strategy for optimizing the training process of video LDMs, addressing critical aspects of data curation and model performance enhancement.

You can read more here!
Code: https://lnkd.in/dQtSTYH9
Paper: https://lnkd.in/dnsD4cCv
#artificialintelligence #technology #innovation