You are pre-training a large language model on Google Cloud. The training loop includes custom TensorFlow operations, uses a large batch size, and is expected to run for several weeks. You need to configure a training architecture that minimizes both training time and compute costs. What should you do?
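For context on what the scenario means by "custom TensorFlow operations in the training loop" and "a large batch size", here is a minimal, illustrative sketch. It is not an answer to the question and does not prescribe any particular accelerator or architecture; the op (`approx_gelu`), model shape, and batch size are all made-up placeholders.

```python
import tensorflow as tf

# Hypothetical stand-in for the "custom TensorFlow operations" in the scenario:
# a GELU approximation with a hand-written gradient. Ops customized this way
# (or via compiled kernels) are what constrain the choice of training hardware.
@tf.custom_gradient
def approx_gelu(x):
    s = tf.math.sigmoid(1.702 * x)
    def grad(dy):
        # d/dx [x * sigmoid(1.702 x)] = s + 1.702 * x * s * (1 - s)
        return dy * (s + 1.702 * x * s * (1.0 - s))
    return x * s, grad

# Illustrative sizes only; real pre-training batches and vocabularies are far larger.
BATCH_SIZE = 1024
VOCAB_SIZE = 8000

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024),
    tf.keras.layers.Lambda(approx_gelu),  # custom op inside the forward pass
    tf.keras.layers.Dense(VOCAB_SIZE),    # vocabulary-sized logits
])
optimizer = tf.keras.optimizers.Adam(1e-4)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(features, labels):
    # One step of the training loop: forward pass through the custom op,
    # loss, backward pass (using the custom gradient), optimizer update.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Smoke test on synthetic data.
features = tf.random.normal([BATCH_SIZE, 512])
labels = tf.random.uniform([BATCH_SIZE], maxval=VOCAB_SIZE, dtype=tf.int32)
print(train_step(features, labels).numpy())
```

The relevant point for the question is simply that the training loop is not made of standard ops alone, which matters when weighing accelerator options against training time and cost.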