Question 1

Full Certification Question

You designed a 5-billion-parameter language model in TensorFlow Keras that used autotuned tf.data to load the data into memory. You created a distributed training job in Vertex AI with tf.distribute.MirroredStrategy, and set the large_model_v100 machine for the primary instance. The training job fails with the following error: “The replica 0 ran out of memory with a non-zero status of 9.” You want to fix this error without vertically increasing the memory of the replicas. What should you do?
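
To get a feel for the scale in this scenario, the following is a rough back-of-the-envelope sketch of per-replica memory. It relies on the fact that tf.distribute.MirroredStrategy keeps a full copy of every model variable on each replica. Everything else is an assumption that the question does not state: 32-bit floats, an Adam-style optimizer holding two extra buffers per parameter, and a 16 GB V100 (a 32 GB variant also exists). The function name `training_state_gib` is hypothetical, and activations and input batches are left out, so the real footprint would be higher still.

```python
# Rough per-replica memory estimate for the scenario in the question.
# Assumptions (not stated in the question): float32 values, an Adam-style
# optimizer with two moment buffers per parameter, and a 16 GB V100.

GIB = 2 ** 30


def training_state_gib(num_params, bytes_per_value=4, copies=4):
    """Approximate memory in GiB for `copies` full-size buffers.

    copies=4 stands for weights + gradients + two optimizer moment buffers.
    Activations and input batches are NOT included.
    """
    return num_params * bytes_per_value * copies / GIB


NUM_PARAMS = 5_000_000_000   # 5-billion-parameter model from the question
V100_MEMORY_GIB = 16         # assumed GPU memory per replica

weights_only = training_state_gib(NUM_PARAMS, copies=1)
full_state = training_state_gib(NUM_PARAMS)

print(f"weights only:        {weights_only:.1f} GiB")
print(f"weights+grads+Adam:  {full_state:.1f} GiB")
print(f"fits on one replica: {full_state <= V100_MEMORY_GIB}")
```

Under these assumptions the weights alone already exceed the assumed 16 GB of a single replica, which is consistent with the out-of-memory error on replica 0.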