
Evaluation Metrics

1. Video Quality and Diversity

Video quality refers to the visual quality of the generated videos, including their clarity and the plausibility of their content. Video diversity refers to whether there is sufficient variation among the generated videos: they should not be nearly identical, otherwise the video generation task loses its meaning.

Video quality and diversity lack direct objective evaluation methods. Therefore, we list evaluation metrics based on pretrained video feature extraction networks, together with their calculation methods.

1.1 Fréchet Video Distance (FVD)

FVD (Fréchet Video Distance) is a metric used to evaluate the quality of generated videos, similar to FID (Fréchet Inception Distance) in the image domain. It measures the realism and coherence of videos by comparing the distribution differences between generated and real videos in a deep feature space. FVD extracts video features using a pretrained 3D convolutional neural network (e.g., I3D) and computes the Fréchet distance between the feature distributions of generated and real videos. A smaller distance indicates higher quality of the generated videos.

Formula:

$$\mathrm{FVD} = \lVert \mu_g - \mu_r \rVert^2 + \mathrm{Tr}\!\left(\Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2}\right)$$

Where:

  • $\mu_g$ and $\mu_r$ are the mean vectors of the generated and real video features, respectively;
  • $\Sigma_g$ and $\Sigma_r$ are the covariance matrices of the generated and real video features, respectively;
  • $\mathrm{Tr}$ denotes the trace of a matrix.

FVD is widely used in the evaluation of video generation models (such as GANs and diffusion models) and effectively reflects both visual quality and temporal consistency.
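The computation can be illustrated with a minimal sketch, assuming the I3D features of the generated and real videos have already been extracted into two arrays (the feature extraction step is omitted here; reference implementations such as the repository linked in Section 1.3 handle the full pipeline):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_gen: np.ndarray, feats_real: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_gen, feats_real: arrays of shape (num_videos, feat_dim),
    e.g. I3D features for FVD.
    """
    mu_g, mu_r = feats_gen.mean(axis=0), feats_real.mean(axis=0)
    sigma_g = np.cov(feats_gen, rowvar=False)
    sigma_r = np.cov(feats_real, rowvar=False)

    # Matrix square root of the product of the covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_g @ sigma_r, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g + sigma_r - 2.0 * covmean))
```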

1.2 Fréchet Inception Distance (FID)

The Inception Score (IS) considers only the generated images; it takes into account neither the distribution of images in the training set nor the diversity of the generated images. The Fréchet Inception Distance therefore evaluates both the quality and diversity of generated images by computing the distance between the feature distributions of real and generated images on the same set of text prompts: the closer the two distributions, the better the result, and vice versa.

The formula for distribution distance is as follows:

$$\mathrm{FID}(P_r, P_g) = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\!\left(C_r + C_g - 2\,(C_r C_g)^{1/2}\right)$$

When calculating FID, the prompt set consists of 30,000 prompts selected from the MS-COCO dataset. The real image set contains the corresponding 30,000 real images, and the generated image set contains the images the evaluated model produces from these prompts.
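Since FID is the same Fréchet distance applied to image features, the frechet_distance function from the FVD sketch above can be reused. The sketch below shows one way to obtain Inception features with torchvision; treat it as an illustration under assumptions, since reference FID implementations fix the exact preprocessing and feature layer:

```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights

@torch.no_grad()
def inception_features(images: torch.Tensor) -> torch.Tensor:
    """2048-d pooled features from a pretrained Inception-v3.

    images: float tensor of shape (N, 3, 299, 299), normalized as the
    pretrained weights expect.
    """
    model = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Identity()  # drop the classifier head, keep pooled features
    model.eval()
    return model(images)  # (N, 2048)

# FID then reuses the Fréchet distance from the FVD sketch:
# fid = frechet_distance(inception_features(gen_imgs).numpy(),
#                        inception_features(real_imgs).numpy())
```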

1.3 Code

Code for calculating IS (Inception Score): inception-score-pytorch

Code for calculating FVD: fvd-comparison


2. Video–Text Semantic Consistency

In text-to-video generation tasks, it is necessary to evaluate not only the quality of the generated videos but also the semantic consistency between the video content and the text prompt. Although FVD can evaluate overall video generation quality, it is not sensitive enough to the semantic alignment between a single prompt and its generated video. We find that current text-to-video models often deviate from the prompt's semantics when handling prompts involving complex scenes or multi-object interactions. In addition, the temporal nature of video requires the model to maintain semantic coherence across time, which places higher demands on evaluation metrics.

2.1 CLIP-SIM

CLIP-SIM quantifies semantic consistency by computing the cosine similarity between the text embedding and the embeddings of sampled video frames. Given a text prompt $t$ and a video $V = \{f_1, f_2, \ldots, f_N\}$ (with $N$ keyframes sampled), the formula is:

$$\mathrm{CLIP\text{-}SIM}(t, V) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_t(t) \cdot E_v(f_i)}{\lVert E_t(t) \rVert \, \lVert E_v(f_i) \rVert}$$

Where $E_t(\cdot)$ and $E_v(\cdot)$ are the text encoder and image encoder of the CLIP model, respectively. This metric effectively captures fine-grained semantic correspondences between text and video content, and it is particularly advantageous for evaluating multi-object interaction scenarios. Compared to metrics based on classification networks, CLIP-SIM benefits from large-scale pretraining and is more robust in semantic understanding for open-domain content.
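A minimal sketch of this computation using the Hugging Face transformers CLIP model is shown below; the checkpoint name openai/clip-vit-base-patch32 and the choice of keyframes are assumptions, and actual evaluations may use other CLIP variants or sampling schemes:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_sim(prompt: str, frames: list[Image.Image]) -> float:
    """Average CLIP cosine similarity between a prompt and sampled frames."""
    # Checkpoint name is an assumption; any CLIP variant can be substituted.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    image_inputs = processor(images=frames, return_tensors="pt")

    text_emb = model.get_text_features(**text_inputs)      # (1, d)
    frame_embs = model.get_image_features(**image_inputs)  # (N, d)

    # Normalize, then average the per-frame cosine similarities.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    frame_embs = frame_embs / frame_embs.norm(dim=-1, keepdim=True)
    return (frame_embs @ text_emb.T).mean().item()
```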

2.2 Code

Code for calculating CLIP-SIM: clip-score