Let’s see how these evaluation metrics work in practice! We adopt a generic setup without using text prompts or discrete labels in the video generation process, and use the TikTok dataset [9] to provide a quantitative comparison of various video evaluation metrics.
Specifically, we generate 50 videos using different checkpoints, named (a) through (e). We do not use CLIP or IS in this comparison, as they are not suitable for our setup. The models (a) to (e) are sorted by human ratings collected through a user study, from worst to best visual quality (model (e) has the best visual quality and model (a) the worst). We can then examine how well the evaluation metrics align with human judgments.
We put together a few videos generated by the different models, which clearly differ in visual quality. Models (a–c) produce videos with incomplete human shapes and unnatural motions. Model (d) produces a video with better visual quality, but the motion is still not smooth, resulting in a lot of flickering. In comparison, model (e) generates a video with both better visual quality and motion consistency.
Disclaimer: these video samples are nowhere near perfect; however, they are sufficient for comparing different evaluation metrics.
Quantitative Results.
In this table, we show the raw scores given by the metrics, where FVD, FID, and FVMD are lower-is-better metrics, while VBench is higher-is-better. We also report each model's rank among the five models based on these quantitative results. In addition, we report the rank correlation between each metric's ranking and the human ratings, where a higher value indicates better alignment with human judgments.
We can see the ambiguity of some evaluation metrics. Model (a), which has the poorest quality, cannot be effectively distinguished from models (b–d) based on the FID or VBench metrics. Moreover, model (c) is mistakenly ranked higher than model (d) by all metrics except FVMD. Notably, VBench gives very close scores to models (a–d) despite their clearly different visual quality, which is inconsistent with human judgments. FVMD, on the other hand, ranks the models correctly according to the human ratings. Furthermore, FVMD assigns distinct scores to video samples of different quality, showing a clearer separation between models. This suggests that FVMD is a promising metric for evaluating video generative models, especially when motion consistency is a concern.
Frame Comparison.
We also present visualizations of video frames from one randomly selected scene to further examine the fidelity of the metrics.