Smart Video Generation from Text Using Deep Neural Networks


Creating animated videos doesn’t have to be a laborious process anymore. Artificial intelligence and deep neural networks can now turn datasets into videos in far less time. This blog details the different AI models and techniques used for smart video generation from text.

It’s no surprise that creating animated videos takes time. It’s hard work and involves several man-hours. Even with the use of technology, animated videos are still not easy to produce. However, the entry of artificial intelligence has brought new developments. 

Researchers from the Allen Institute for Artificial Intelligence and the University of Illinois have worked together to create an AI model called CRAFT, which stands for Composition, Retrieval, and Fusion Network. The CRAFT model takes text descriptions (captions) from users and generates scenes from the famous cartoon series The Flintstones.

CRAFT is entirely different from pixel-generation models, where each pixel value is predicted from the values of previously generated pixels. Instead, it uses text-to-entity segment retrieval to collect data from a video database. The model was trained on more than 25,000 videos, each clip three seconds (75 frames) long. Every video was individually annotated with details of the characters in the scene and information about what the scene dealt with. That is still labor-intensive, as the team has to add captions to each scene.

How can AI experts help generate video from text using automated video generation models?

First, let’s take a look at the problems in creating videos from different POVs.

Problems in Creating Videos

The major problems in creating animated videos can be categorized into the following:

Problems from the General Point of View

Time Consuming and Effort-Intensive

There’s a high demand for animated videos, leading to a gap between demand and supply. Kids and adults love animated videos, games, and more, but the supply doesn’t keep pace with what viewers would like. This is because the technology still hasn’t reached the stage where we can generate content in minutes and meet the rising expectations. Video generation remains a time-consuming and laborious process that requires substantial resources and input data.

Computers are Not Enough

It might seem that computers are the answer to everything. However, computers and existing software are not yet advanced enough to transform the video creation process. While researchers and experts are working on new applications to create videos quickly, we still need to wait to experience a higher level of innovation.

Problems from the Deep Learning Point of View

Manually Adding Text

Artificial intelligence has helped develop video generation software to speed up the process. However, even AI doesn’t offer a solution to everything as yet. For example, some videos don’t have captions. But you still need to create a video from existing clips. What do you do? Well, you’ve got to manually add the captions so that the software can convert the text to video. Imagine doing that for thousands of video clips! 

Improper Labeling

The problem doesn’t end at manually adding captions. You’ve got to label the videos as well. Now, with so many clips to work on, it’s highly possible that you might mislabel something or give a wrong caption to a couple of videos. What if you notice the error only after the smart video is generated from the given text captions? Wouldn’t that lead to more wastage of resources, and poor-quality videos? 

More than CRAFT Model

While the CRAFT model is indeed a worthy invention, the world needs something better and more advanced than this. Moreover, the CRAFT model is limited to creating cartoons and cannot work with all kinds of video clips.

Flowchart of generalized approach for video generation using text

Introduction to NLP and CV

Well, we’ve seen the challenges faced by the video industry and AI researchers. Wouldn’t it be great to find a solution to overcome these challenges? Oh, yes! That’s exactly what we’ll be doing in this blog. First, though, we’ll get a basic idea of the two major concepts that are an inherent part of smart video generation from text: NLP (Natural Language Processing) and CV (Computer Vision), two branches of artificial intelligence.

Natural Language Processing (NLP)

NLP can be termed as a medium of communication between a human and a machine. This is, of course, a layman’s explanation. Just like how we use languages to communicate with each other, computers use their own language (the binary code) to exchange information. But things get complex when a human has to communicate with a machine. We are talking about how the machine processes and understands what you say and write. 

NLP models can train a computer not only to read what you write or speak but also to understand the emotions and intent behind the words. How else will a computer know that you’re being sarcastic? Applications like sentiment classification, named entity recognition, chatbots (our virtual friends), question-answering systems, story generation, and more have been developed using NLP models to make computers smarter than before.

Computer Vision (CV)

Computer vision is yet another vital aspect of artificial intelligence. Let’s consider a scenario where you spot a familiar face in the crowd. If you know the person very well, you’ll mostly be able to recognize them among a group of strangers. But if you don’t? What if you need to identify someone by watching the CCTV recording? Overwhelming, isn’t it? 

Now, what if the computer can identify a person from a series of videos on your behalf? It would save you so much time, effort, and confusion. But how does the computer do it? That’s where CV enters the picture (pun intended). We (as in the AI developers) provide the model with annotated datasets of images to train it to correctly identify a person based on their features. 

Possible Approaches other than CRAFT model

Researchers have been toiling to find ways to use artificial intelligence and deep learning to facilitate video generation from text. The solutions involve pre-trained open-source models and extensive datasets.

Pre-Trained Open Source Models

1. Microsoft GODIVA

Microsoft Research Asia and Duke University collaborated to develop a machine learning system that can generate videos exclusively from text without using GANs (Generative Adversarial Networks). Revealed in 2021, the project was named GODIVA and builds on approaches similar to those used in OpenAI’s DALL-E image synthesis system.

GODIVA stands for Generating Open-DomaIn Videos from Natural Descriptions and uses the VQ-VAE (Vector Quantised-Variational AutoEncoder) model. This model was first introduced in 2018 by researchers from the Google DeepMind project. VQ-VAE was also used in the DALL-E system. 

VQ-VAE is different from VAE in two ways: 

  • The prior is not static but learned. When the representations are paired with an autoregressive prior, the result is high-quality output. In fact, VQ-VAE delivers quality comparable to VAE-GAN, demonstrating the value of learned representations over static ones.
  • The encoder network outputs discrete codes (latent representations) instead of continuous ones. This helps solve the latent space problem encountered in VAE.

The discrete latent representation comes from Vector Quantisation (VQ), which is mainly used to maneuver around ‘posterior collapse’, a common problem in VAEs in which the learned latent space stops being informative. It typically results from a poorly chosen hyperparameter that makes the output overly smooth, so that the latents are simply ignored when paired with a powerful autoregressive decoder.

However, by pairing the representations with an autoregressive prior, the ML model can generate high-quality images, videos, and audio (speech). It also allows unsupervised learning of phonemes and high-quality speaker conversion. This is further evidence of the value of learned latent representations.
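As a sketch of the discrete-code idea, the nearest-codebook lookup at the heart of VQ can be illustrated in a few lines of NumPy. The codebook and latent values below are toy assumptions; a real VQ-VAE learns the codebook jointly with the encoder.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry,
    returning the quantized vectors and their discrete code indices."""
    # distances has shape (num_latents, num_codes)
    distances = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    codes = distances.argmin(axis=1)        # discrete latent representation
    return codebook[codes], codes

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # 3 codes, dim 2
latents = np.array([[0.1, -0.2], [0.9, 1.2]])               # toy encoder outputs
quantized, codes = quantize(latents, codebook)
```

The decoder then sees only the quantized vectors, which is what makes the codes discrete rather than continuous.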

Dataset Used for GODIVA 

GODIVA takes text from the labels in its datasets. The model was pre-trained on the Howto100M dataset, which contains 136 million captioned clips sourced from YouTube videos spanning fifteen years and covering 23,000 labeled activities.

That means almost every activity one can think of can be found in the dataset, often in numerous clips. General activities have more clips, while more specific ones have a slightly lower count. Still, considering the range available in the dataset, it is a great choice for training the GODIVA model.

Quality of GODIVA model

2. Variational Autoencoder Model (VAE) and a Generative Adversarial Network (GAN)

In simple terms, VAE provides a probabilistic description of observations in a latent space. It’s an unsupervised generative model in which each input image is encoded not as a single point but as a probability distribution, and outputs are decoded from random samples of that distribution. That can lead to blurry images and unrealistic results.

GAN is also a generative algorithm belonging to unsupervised machine learning. A GAN has two neural networks: a generative network and a discriminative network. The first takes noise as input to generate samples, while the second evaluates those samples to distinguish them from training data.
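The two-network setup can be sketched with untrained toy networks in NumPy. The linear maps and dimensions below are arbitrary illustrations of the generator/discriminator roles, not a trainable GAN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generator: a linear map from noise to a "sample" (weights untrained).
G_weights = rng.normal(size=(8, 4))          # noise dim 8 -> sample dim 4

def generator(noise):
    return noise @ G_weights                 # produce a fake sample from noise

# Toy discriminator: a linear score squashed to (0, 1);
# values near 1 mean "looks real", near 0 mean "looks fake".
D_weights = rng.normal(size=(4,))

def discriminator(sample):
    return 1.0 / (1.0 + np.exp(-sample @ D_weights))

noise = rng.normal(size=(1, 8))
fake = generator(noise)                      # generator output
score = discriminator(fake[0])               # discriminator's verdict
```

During training the two networks are optimized against each other: the generator tries to push the score toward 1, the discriminator toward 0 for fakes.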

VAE and GAN were first combined by A. Larsen and co-authors in the paper “Autoencoding Beyond Pixels Using a Learned Similarity Metric”. The aim of using VAE and GAN together is to create a model that outperforms traditional VAEs.

Let’s look at what the neural network video generation model does. 

Data Collection: Videos are downloaded from YouTube for each keyword, along with text such as the descriptions, tags, titles, etc. 

Dataset Preparation: The downloaded data is prepared alongside clean data from the Kinetics Human Action Video Dataset. The two datasets are used together.

Training: At this stage, the model is trained to extract text and generate news videos. The deep neural networks learn how to use keywords extracted from text to compose news videos.

Testing: Entirely different datasets of news articles are used at this stage to test the accuracy of the model.
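The data collection and preparation steps above can be sketched as a small pairing routine. The metadata records and field names here are hypothetical stand-ins for scraped YouTube data, not a real scraping API.

```python
# Hypothetical metadata records, standing in for downloaded YouTube fields
# (titles, tags, descriptions) paired with each clip's id.
downloaded = [
    {"id": "vid1", "title": "Morning news update", "tags": ["news", "politics"]},
    {"id": "vid2", "title": "Cat compilation", "tags": ["cats", "funny"]},
]

def build_pairs(records, keyword):
    """Keep only clips whose text mentions the keyword, and pair
    each clip id with its combined text for training."""
    pairs = []
    for rec in records:
        text = " ".join([rec["title"]] + rec["tags"]).lower()
        if keyword in text:
            pairs.append((rec["id"], text))
    return pairs

news_pairs = build_pairs(downloaded, "news")
```

The resulting (clip, text) pairs are what the training stage consumes alongside the clean Kinetics data.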

3. TFGAN Model

The process of training a neural network includes defining a loss function. This function tells the network about how close or far the model is from the objective. TFGAN is a lightweight open-source library that makes it easy to train GANs. It gives the necessary infrastructure for training and evaluating metrics. The library has examples to show how easy it is to use TFGAN and how flexible the model is. 

Image classification networks are given a loss function that penalizes them for incorrect classifications. But since it’s not easy to define a loss function in every case, machine learning techniques are used to learn one instead.

For example, human perceptions, whether of synthesized speech or compressed images, cannot be defined as crisply as a dog-versus-cat label. Learning the loss improves a range of such applications and offers a better way to generate images from text or teach robots new tasks. Since GANs come with their own issues and challenges that can further complicate a model, TFGAN has been proposed as a better alternative and is used by many Google researchers.
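As an illustration of a hand-defined loss, the standard GAN objective can be written as two binary cross-entropy terms. The discriminator scores below are toy numbers, and this is plain NumPy, not TFGAN’s actual API.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on discriminator scores in (0, 1)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Hypothetical discriminator scores on a batch of real and fake samples.
d_real = np.array([0.9, 0.8])     # discriminator wants these near 1
d_fake = np.array([0.2, 0.1])     # discriminator wants these near 0

# Discriminator loss: push real->1 and fake->0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
# Generator loss: fool the discriminator, i.e. push fake->1.
g_loss = bce(d_fake, np.ones_like(d_fake))
```

Here the discriminator is doing well, so its loss is small while the generator’s loss is large; training alternates updates to drive both down.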


4. T2V Model

T2V combines gist generation with a conditional text filter. T2V models are classified as the following:

DT2V (Direct Text-to-Video Generation): Random sample noise and the concatenated encoded text ψ(t) are fed into the video generator, bypassing the gist-generation step. It includes a reconstruction loss (L_RECONS).

PT2V (Text-to-Video Generation with Pair Information): DT2V is extended with a discriminator that judges whether a video-text pair is real, mismatched, or synthetic. A linear concatenation is used as the discriminator’s input framework.
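The DT2V input construction can be sketched as follows. The text encoder here is a hypothetical stand-in for the learned ψ(t); only the concatenation of noise and encoded text reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_text(text, dim=16):
    """Stand-in for the learned text encoder psi(t): a deterministic
    embedding seeded from the characters (illustration only)."""
    seed = sum(map(ord, text))
    return np.random.default_rng(seed).normal(size=dim)

def generator_input(text, noise_dim=32):
    """DT2V feeds the video generator the concatenation [noise, psi(t)],
    bypassing the gist-generation step."""
    noise = rng.normal(size=noise_dim)
    return np.concatenate([noise, encode_text(text)])

z = generator_input("a dog runs on the beach")   # shape (32 + 16,)
```

PT2V would additionally pass generated videos and their captions through a discriminator that labels each pair real, mismatched, or synthetic.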

Summary of Available Architectures

S. No | Name of Approach | Architecture Used | Dataset Used
1 | Microsoft GODIVA | Vector Quantised-Variational AutoEncoder (VQ-VAE) | Howto100M
2 | VAE + GAN | Variational Autoencoder (VAE) with a Generative Adversarial Network (GAN) | Kinetics Human Action Video Dataset; YouTube
3 | TFGAN Model | GAN-based | ImageNet dataset
4(a) | T2V Model | Direct text-to-video generation (DT2V) | Custom dataset built from publicly available videos
4(b) | T2V Model | Text-to-video generation with pair information (PT2V) | Custom dataset built from publicly available videos

    Different Datasets Available for Text to Video Generation 

    Normal Text to Video Datasets

    • Howto100M

Howto100M has 136 million captioned video clips sourced from 1.2 million videos downloaded from YouTube, spanning over fifteen years of content. It is a large-scale dataset focused on narrated videos, mostly instructional and explanatory, covering 23,000 activities ranging from gardening to self-care and much more. Each video’s narration, available as YouTube subtitles, is downloaded automatically.

    • AVA

The AVA (Atomic Visual Actions) dataset is a collection of 1.62 million action labels covering 80 atomic visual actions in 430 fifteen-minute clips. The actions are localized in time and space, with multiple labels per person occurring frequently. AVA helps in understanding and recognizing human activity, providing spatiotemporally localized audiovisual annotations of videos.

    • Youtube-8M Segments Dataset

The YouTube-8M Segments Dataset is an extension of the YouTube-8M dataset with segment annotations verified by humans. The collection consists of human-verified labels for about 237,000 segments across 1,000 classes. Every video in the dataset has time-localized frame-level features, which enables classifier predictions at the segment level.

Text to Cartoon Videos Datasets

• ToonNet

ToonNet is a cartoon-style image-recognition dataset with 4,000 images categorized into twelve classes. The USP of ToonNet is that the images have been collected from the World Wide Web with little manual filtration. The base dataset has been extended to 10,000 images using methods such as:

• A 2D-3D-2D conversion procedure (cartoon-modeling method)
• Snapshots of 3D models rendered with a cartoon shader
• Hand-drawn stylization filters

    ToonNet also describes how to build an effective neural network for image semantic classification. 

    Text to Video Available Datasets

S. No | Dataset Name | Dataset Size
1 | Howto100M | 136 million clips from 1.2 million YouTube videos with text captions
2 | AVA Dataset | 1.62 million action labels from 430 movie clips
3 | YouTube-8M Segments Dataset | 237,000 human-verified segments across 1,000 classes

    Text to Cartoon Videos Available Dataset

S. No | Dataset Name | Dataset Size
1 | CRAFT | 25,000 videos with text captions

    Text to Cartoon Image Available Dataset

S. No | Dataset Name | Dataset Size
1 | ToonNet | 10,000 images across twelve classes (extended from a 4,000-image base)

    Application Areas for Smart Text to Video Generation 

    Where can deep learning video generation models be used, and how do artificial intelligence experts help with text to video generation? 

    • Short Movie Clip Generation 

The aim of the project is to build a deep learning pipeline that accepts text descriptions and produces videos that are attractive and unique. The short movie clip generation project uses GAN video generation, a deep learning approach that produces unique and realistic video content by pitting two neural networks against each other.

    The model has generative and discriminative neural networks where the generator produces new content, and the discriminator differentiates between real and fake. 

    • News Stories Generation 

The term ‘synthetic media’ refers to news content (image, text, and video) created by a computer program. NIUS.TV is a mobile-first news aggregator that converts text news into video news, with an AI narrator playing the role of an anchor.

The videos are 30-40 seconds long and deliver bite-sized news bulletins to users. The platform is free to use and free of advertisements.

    • Cartoon Animation Videos 

    Cartoon Animation videos can be designed using various online tools. The AI-based tools have made it easy for designers and non-designers to convert text, images, and video clips into animations. These animations can be used for marketing, educational, and training purposes. It takes just a few minutes to generate cartoon animation videos of shorter durations.

    Companies that are Already Using Text to Video Generation 

    Some AI companies have already developed text to video deep learning applications for personal and commercial use. Let’s look at the top five video-generation-from-text software tools in the market. 

    1. GliaCloud 

    GliaCloud was founded by David Chen in 2015. The Taiwanese startup provides apps and solutions for data analytics and machine learning. The company has created GliaStudio, an AI product that generates videos based on text from articles. The videos are a summary of the article that’s entered as input. GliaStudio comes in three price plans or categories-

    • Pro- $300/ month for professional storytellers 
    • Business- $550/ month for business use with custom themes and 1080p resolution 
    • Enterprise- Custom pricing for large enterprises with mass video production 

    The company gives a 14-day free trial to try the video creator. 

Overview of GliaStudio

    • GliaStudio also uses an NLP algorithm to scan the content for keywords and highlights to automatically generate videos that summarize the text.
    • The input can be provided using a URL or by uploading a file. A video script is first generated for the AI engine to search for relevant clips and bind them together for the video. 
    • Creating different test versions for the same input is easy. This helps in choosing the most suitable video generated by the tool. 

    2. Vedia 

    Vedia was launched in 2016 to create AI videos for professional use. The application has been designed to assist publishers, advertisers, and media personnel in automatically creating high-quality content for OTT, CTV, and DOOH platforms. 

    Large datasets can be transformed into text using this software. The automated software is scalable and can work with huge amounts of data in less time. Vedia is known to deliver targeted videos that increase user engagement. It helps increase conversions and acquire more customers. 

    Businesses can request a demo of the software through the website. The company has not publicly listed its price plan. Contact the customer support team for details. 

    How Vedia Works: 

    Vedia follows a simple three-step process- 

    • Analyze: The AI engine analyzes the input data, link, blog, text, or feed to pick the keywords and highlights necessary for the video. 
    • Visualize: Next, it collects relevant video clips for the highlighted portions and places them on the video timeline. It then adds a video narration to it. 
    • Customize: Once the video is automatically generated, it can be reviewed, customized for branding, and published on various platforms. There’s a drag and drop feature to customize the videos. 

    3. Lumen5

    Lumen5 is one of the best AI text-to-video generation platforms and requires no training or experience. It was founded in 2017 and is suitable for beginners. It helps in creating unique videos from the text. Lumen5 uses natural language processing algorithms for this purpose. It takes only a few minutes to convert text into high-quality videos. The video creator tool comes in these categories-

    • Community category- free for use for a lifetime but with limited features
    • Creator category- $11/ month for individual creators 
    • Premium category- $59/ month for professional storytellers 
    • Business category- $149/ month for brands and companies (SMEs)
    • Enterprise category- custom pricing for large enterprises 

    The number of features increases as the category requirements increase. From the premium category onwards, the screen resolution is 1080p, and users can custom brand the tool and the videos. 

    How Lumen5 Works: 

    • Enter the link to the article for which the video has to be made. Lumen5 relies on the NLP algorithm to convert the text in the link to create a storyboard.
    • Then the tool uses computer vision technology to identify relevant visuals and audio for the text. 
    • Create and set up the user profile to generate videos, customize them, and share the videos on the internet. 
    • The tool offers an effective dashboard to the admin to monitor the process, watch the videos, and approve them for promotions. 

    4. NIUS.TV 

    NIUS.TV was founded in 2017 to convert blogs and news articles into short videos for easy consumption. It is a next-gen mobile-first news aggregator that uses AI video generation software to share news bulletins with users. 

    Interested people can get an invite to join the subscriber list of NIUS.TV through the company website. 

    5. Wibbitz 

    Wibbitz was founded in 2011 to transform the role of videos in society. The company uses patented technology for smart text analysis. It automatically finds the highlights in the story/ input source and creates a video for the same. The AI technology pairs the highlights with matching footage/ clips from the media library. Wibbitz offers four price plans suitable for individual and commercial use of the software-

    • Starter Plan- $19/ month for individual content creators 
    • Creator Plan- $39/ month for startups and small businesses (no watermark)
    • Business Plan- $119/ month for team collaborations and mid-sized businesses 
    • Enterprise Plan: $489/ month for large enterprises. Customizations are also available 

Wibbitz comes with a 7-day free trial. The company also offers a demo to explain how the software works.

    How Wibbitz Works: 

    • Users are provided with a self-guided tour or hands-on training, depending on the chosen price plan. 
    • There are readymade templates to turn an article or an input story into an optimized video to drive user engagement. 
• Edit the videos by cutting, cropping, and zooming frames using simple features. There’s no need for any professional training to use the software.
    • Users can add personal voiceover to the videos instead of AI-generated voices. 
    • Customize the videos by changing the color scheme, logos, etc., for business marketing purposes. 

    General Approach to Generate Videos from Text

    The key problem addressed in the blog is categorized into two parts: 

    • NLP Part 
    • CV Part 

Captions describing situations can be used to train deep-learning-based networks like GODIVA, TFGAN, CRAFT, or the other methods specified in this blog.

    Let’s look at the approach to solve both problems: 

    NLP (Natural Language Processing) Part 

    Workflow Steps: 

1: The user provides input text as a Word file, a PDF file, or plain text.

    2: The software divides the input into separate sentences to find an appropriate image/ video clip for each sentence. 

    3: The sentences are passed one by one into the Named Entity Module to send the output to the CV model.

    4: The extracted entities are then collected in a bucket. 

Workflow of NLP Part


    • Input Story- A young girl is walking alone on a dirty street in the evening. There are cats and dogs on the street. 
    • Sentence1: A young girl is walking alone on a dirty street in the evening.
    • Sentence2: There are cats and dogs on the street.
    • Entities for Sentence1- Person: young girl; Action: walking alone; Location: dirty street; Environment: evening 
    • Entities for Sentence2- Person/ beings: cats and dogs; action: none; location: street 
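The steps above can be sketched with a rule-based stand-in for the Named Entity Module. The tiny lexicon is a toy assumption; a production system would use a trained NER model.

```python
import re

# Toy keyword lexicon standing in for a trained Named Entity Module.
LEXICON = {
    "person": ["girl", "boy", "cats", "dogs"],
    "action": ["walking"],
    "location": ["street"],
    "environment": ["evening", "morning"],
}

def extract_entities(story):
    """Split the story into sentences (step 2), tag each sentence (step 3),
    and collect the extracted entities into a bucket (step 4)."""
    sentences = [s.strip() for s in re.split(r"[.!?]", story) if s.strip()]
    bucket = []
    for sentence in sentences:
        words = sentence.lower()
        entities = {label: [term for term in terms if term in words]
                    for label, terms in LEXICON.items()}
        bucket.append({"sentence": sentence, "entities": entities})
    return bucket

story = ("A young girl is walking alone on a dirty street in the evening. "
         "There are cats and dogs on the street.")
bucket = extract_entities(story)
```

Each bucket entry is what gets handed to the CV model to retrieve a matching clip.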

    CV (Computer Vision) Part 

    The following are the steps in the CV workflow-

1: Collect the annotated text and video data required for generation from the existing database

    2: Preprocess data and apply transformation tools and techniques 

    3: Choose a model to complete video generation from text (Model= CRAFT/ TFGAN/ GODIVA/ etc.)

    4: Divide the annotated text to video dataset into training dataset and testing dataset 

    5: Train the model and then test the accuracy 

6: Feed the extracted entities through the trained and tested model

    7: Test the model again and optimize it for accuracy 
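Steps 4 and 5 can be sketched as a simple split-and-evaluate skeleton. The dataset and the "model" here are toy stand-ins; a real pipeline would train CRAFT, TFGAN, or GODIVA on the training split.

```python
import random

def split_dataset(pairs, train_frac=0.8, seed=42):
    """Step 4: shuffle annotated (caption, clip) pairs and split train/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

def evaluate(model, test_set):
    """Steps 5 and 7: measure retrieval accuracy on held-out pairs."""
    correct = sum(model(caption) == clip for caption, clip in test_set)
    return correct / len(test_set)

# Hypothetical annotated dataset: caption -> clip id.
data = [(f"caption {i}", f"clip_{i}") for i in range(10)]
train, test = split_dataset(data)

# Toy stand-in for a trained model: it "retrieves" the clip by parsing
# the caption, so it scores perfectly on this synthetic data.
model = lambda caption: "clip_" + caption.split()[-1]
accuracy = evaluate(model, test)
```

The same `evaluate` call is reused after optimization (step 7) to confirm accuracy has improved on held-out data.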

Workflow of CV Part


    AI solutions can help with adversarial video generation on complex datasets using machine learning algorithms and computer vision models. Enterprises can rely on existing video generation tools or get new applications custom-built for their businesses. 

    Using smart text to video generator software, companies can create unique and attractive videos in less time. The videos can be used for instructional purposes, brand promotions, and sharing news articles with users. Contact AI consultants for guided assistance on the AI automated video generator software using text inputs. 
