You need to use a Kubernetes Engine cluster with specific GPUs to process long-running jobs that cannot be restarted. How should you configure the Kubernetes Engine cluster in this case?
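
As a rough illustration of what such a configuration might involve, the sketch below builds a Pod manifest that targets GPU nodes and asks the cluster autoscaler not to evict the Pod. It is one possible approach under stated assumptions, not a definitive answer: it assumes a dedicated GPU node pool already exists on standard (not preemptible or Spot) VMs, since those can be reclaimed mid-job. The function name `build_gpu_job_pod`, the Pod name, the image, and the GPU type are hypothetical placeholders.

```python
import json


def build_gpu_job_pod(name, image, gpu_type, gpu_count=1):
    """Return a Pod manifest (as a dict) for a long-running GPU job on GKE.

    All argument values are supplied by the caller; nothing here is
    taken from the original question.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Ask the cluster autoscaler not to evict this Pod
                # when it considers scaling the node down.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
            },
        },
        "spec": {
            # Run once; do not restart the container after it exits.
            "restartPolicy": "Never",
            # GKE labels GPU nodes with their accelerator type.
            "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
            "containers": [
                {
                    "name": "job",
                    "image": image,
                    # Request GPUs through the nvidia.com/gpu resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    pod = build_gpu_job_pod(
        "long-job", "example.com/my-job:latest", "nvidia-tesla-t4"
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`. Node-level settings such as automatic node upgrades are configured on the node pool itself and are outside what this manifest controls.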