A data science team at your company plans to use Amazon SageMaker to train an XGBoost model that predicts customer churn. The dataset comprises millions of rows and requires significant pre-processing to ensure model accuracy. To handle this efficiently, the team has decided to use Apache Spark for its large-scale data processing capabilities. As the lead architect, you are tasked with designing a solution that integrates Apache Spark for data pre-processing while optimizing for simplicity and scalability. What is the simplest architecture that allows the team to pre-process the data at scale with Apache Spark before training the model with XGBoost on SageMaker?
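For context, one plausible way to meet these requirements is to run the Spark pre-processing as a SageMaker Processing job and then launch the XGBoost training job from the same SageMaker Python SDK session, so no separate Spark cluster has to be managed. The sketch below is illustrative only, not the definitive answer to the question; the script name `preprocess.py`, the S3 paths, the IAM role ARN, and the instance types are all placeholder assumptions.

```python
# Illustrative sketch only -- assumes the SageMaker Python SDK v2, a hypothetical
# preprocess.py PySpark script, and placeholder S3 paths / IAM role.
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# 1) Pre-process the raw churn data at scale with a Spark job managed by SageMaker.
spark_processor = PySparkProcessor(
    base_job_name="churn-spark-preprocess",
    framework_version="3.3",
    role=role,
    instance_count=4,
    instance_type="ml.m5.xlarge",
)
spark_processor.run(
    submit_app="preprocess.py",  # hypothetical PySpark pre-processing script
    arguments=[
        "--input", "s3://my-bucket/churn/raw/",        # placeholder S3 locations
        "--output", "s3://my-bucket/churn/processed/",
    ],
)

# 2) Train the built-in SageMaker XGBoost algorithm on the processed output.
xgb_image = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.5-1"
)
xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/churn/model/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)
xgb.fit(
    {"train": TrainingInput("s3://my-bucket/churn/processed/train/", content_type="text/csv")}
)
```

The design choice here is simplicity: SageMaker provisions and tears down the Spark cluster for the Processing job, and the processed data lands in S3 where the XGBoost training job can consume it directly.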