An organization requires a scalable solution for preprocessing and transforming big data for machine learning. The data is spread across multiple sources and needs to be processed using a distributed computing framework. Which of the following solutions would you recommend?