A Generative AI Engineer is comparing two LLMs (Model A and Model B) to power a chatbot that assists in regulatory compliance Q&A. The team has created a dataset of prompts and expected responses with ground truth labels. The final model must balance accuracy and cost for production deployment. Which evaluation setup is MOST appropriate to select the right model?
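
To make the trade-off in the question concrete, here is a minimal Python sketch of one possible offline evaluation: both models run on the same labeled dataset, accuracy and cost per query are recorded for each, and the cheapest model that clears an accuracy bar is selected. Everything in it is an assumption for illustration and not taken from the question: the toy dataset, the stub models `model_a` and `model_b`, the prices, exact-match scoring, whitespace token counting, and the helper names `evaluate` and `pick_model`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EvalResult:
    name: str
    accuracy: float        # fraction of prompts answered correctly
    cost_per_query: float  # average cost per prompt, in currency units


def evaluate(name: str,
             model: Callable[[str], str],
             dataset: List[Tuple[str, str]],
             price_per_1k_tokens: float) -> EvalResult:
    """Score one model on the labeled dataset and estimate its cost."""
    correct = 0
    total_tokens = 0
    for prompt, expected in dataset:
        answer = model(prompt)
        # Assumption: exact match against the ground-truth label.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
        # Assumption: whitespace-separated words stand in for tokens.
        total_tokens += len(prompt.split()) + len(answer.split())
    n = len(dataset)
    cost = (total_tokens / 1000) * price_per_1k_tokens / n
    return EvalResult(name, correct / n, cost)


def pick_model(results: List[EvalResult],
               min_accuracy: float) -> Optional[EvalResult]:
    """Return the cheapest model meeting the accuracy bar, or None."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.cost_per_query)


# Hypothetical labeled dataset: (prompt, expected response).
dataset = [
    ("Is record retention required for five years?", "yes"),
    ("May customer data be shared without consent?", "no"),
    ("Is an annual audit mandatory?", "yes"),
    ("Can reports be filed after the deadline?", "no"),
]

# Stub models standing in for real LLM calls (hypothetical behavior).
labels = dict(dataset)


def model_a(prompt: str) -> str:
    return labels[prompt]  # answers every prompt correctly


def model_b(prompt: str) -> str:
    # Gets one prompt wrong.
    return "yes" if prompt.startswith("Can reports") else labels[prompt]


# Hypothetical prices per 1,000 tokens.
res_a = evaluate("Model A", model_a, dataset, price_per_1k_tokens=0.03)
res_b = evaluate("Model B", model_b, dataset, price_per_1k_tokens=0.002)

for r in (res_a, res_b):
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost/query={r.cost_per_query:.6f}")

chosen = pick_model([res_a, res_b], min_accuracy=0.9)
print("Selected:", chosen.name if chosen else "none")
```

In this sketch the accuracy threshold drives the decision: with a bar of 0.9 only the more accurate stub qualifies, while a lower bar would let the cheaper one win. A real setup would swap the stubs for actual model calls and would likely need a scoring method better suited to free-form compliance answers than exact match.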