What is a key challenge when handling multimodal data (text, images, audio) in a RAG system?