The correct option is "A multimodal foundation model that integrates text, audio, and images to generate video."

This model type can jointly understand the storyboard text, align it with the narrator audio, and incorporate the curated brand images as visual assets or style references. It can maintain temporal coherence across frames while conditioning on multiple inputs so the narration aligns with scene changes and visuals reflect the supplied brand imagery.

A video classification model trained to label scenes is discriminative rather than generative and is meant to tag or segment existing footage. It cannot synthesize a new video or blend text, audio, and images into a cohesive output.

A unimodal large language model works with text only and cannot directly consume audio or images or produce video frames. At best, it could draft a script or prompts for another system, but it would not generate the video itself.

An image-only diffusion model produces still images and lacks native modeling of temporal dynamics and audio alignment. Stitching images into a clip would not address cross-frame consistency or narration timing. The model also does not accept audio as an input modality.
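
The elimination above can be read as a capability check: the chosen model must be generative, accept all three supplied inputs, and produce video. The following is a minimal sketch of that reasoning only. The `Candidate` class, the modality labels, and the capability values assigned to each option are illustrative assumptions drawn from the explanation above, not a description of any specific AWS service or model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one answer option's capabilities."""
    name: str
    inputs: frozenset   # modalities the model can consume
    output: str         # what the model produces
    generative: bool    # True if it synthesizes new content


# Requirements taken from the scenario: storyboard text, narrator audio,
# and brand images go in; a new video comes out.
REQUIRED_INPUTS = frozenset({"text", "audio", "image"})
REQUIRED_OUTPUT = "video"

# Capability values are simplified assumptions based on the explanation.
CANDIDATES = [
    Candidate("multimodal foundation model",
              frozenset({"text", "audio", "image"}), "video", True),
    Candidate("video classification model",
              frozenset({"video"}), "labels", False),
    Candidate("unimodal large language model",
              frozenset({"text"}), "text", True),
    Candidate("image-only diffusion model",
              frozenset({"image"}), "image", True),
]


def fits(candidate: Candidate) -> bool:
    """Return True if the candidate meets every requirement of the scenario."""
    return (
        candidate.generative
        and REQUIRED_INPUTS <= candidate.inputs  # subset check on modalities
        and candidate.output == REQUIRED_OUTPUT
    )


matches = [c.name for c in CANDIDATES if fits(c)]
print(matches)
```

Under these assumptions, only the multimodal foundation model passes all three checks. Each of the other options fails at least one: the classifier is not generative, and the language model and the diffusion model neither accept audio nor output video.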
