A Generative AI Engineer is developing a RAG application to help clinicians answer complex queries based on clinical research PDFs. These documents include footers, disclaimers, and non-content metadata such as watermarks and legal notices. The model's response accuracy has been inconsistent, especially when irrelevant content appears in the retrieved chunks. What should the engineer do to improve the relevance and quality of the RAG outputs?
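
One approach that fits this scenario is to strip non-content text from the extracted pages before chunking and embedding, so that footers and legal notices never reach the retriever. The following is a minimal sketch, assuming the PDFs have already been extracted into one string per page. The names `clean_pages`, `chunk_text`, `NOISE_PATTERNS`, and the `repeat_ratio` threshold are hypothetical illustrations, not part of any library, and the sample pages are made up.

```python
import re
from collections import Counter

# Hypothetical patterns for non-content lines; tune these to the actual corpus.
NOISE_PATTERNS = [
    re.compile(r"^\s*page \d+( of \d+)?\s*$", re.IGNORECASE),
    re.compile(r"confidential|all rights reserved|disclaimer", re.IGNORECASE),
]


def clean_pages(pages, repeat_ratio=0.6):
    """Drop lines that match a noise pattern or that repeat on most pages."""
    page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    # Count each distinct line once per page to detect running headers/footers.
    counts = Counter(ln for lines in page_lines for ln in set(lines))
    # Assumption: a line seen on at least this many pages is boilerplate.
    threshold = max(2, int(len(pages) * repeat_ratio))
    cleaned = []
    for lines in page_lines:
        kept = [
            ln
            for ln in lines
            if counts[ln] < threshold
            and not any(p.search(ln) for p in NOISE_PATTERNS)
        ]
        cleaned.append(" ".join(kept))
    return cleaned


def chunk_text(text, max_words=50):
    """Split cleaned text into fixed-size word windows (a simple stand-in chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Made-up sample pages, for illustration only.
sample_pages = [
    "Trial results show a reduction in symptoms.\nPage 1 of 2\nACME Research. Do not distribute.",
    "Adverse events were rare.\nPage 2 of 2\nACME Research. Do not distribute.",
]
chunks = [c for page in clean_pages(sample_pages) for c in chunk_text(page)]
```

Filtering before chunking keeps boilerplate out of the embeddings entirely, rather than relying on the retriever or the model to ignore it. The keyword patterns here are deliberately blunt and could also drop a genuine content line that mentions a disclaimer, so they would need review against real documents.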