A Generative AI Engineer has developed a Retrieval-Augmented Generation (RAG) application that helps employees find answers in internal knowledge sources, such as Confluence pages or Google Drive documents. After receiving positive feedback from internal testers, the engineer now wants to formally assess the system’s performance and identify areas for improvement. What is the best approach for the engineer to evaluate the system?