This is a dedicated watch page for a single video.
Scenario: You want to build a managed Hadoop system as your data lake. The data transformation process involves a sequence of Hadoop jobs. You have opted to use the Cloud Storage connector to store input, output, and intermediary data, separating storage from compute. However, you have observed that a specific Hadoop job runs significantly slower on Cloud Dataproc compared to an on-premises bare-metal Hadoop environment with 8-core nodes and 100-GB RAM. Analysis shows that this particular Hadoop job is disk I/O intensive. Question: How can you address the slow performance of the disk I/O intensive Hadoop job on Cloud Dataproc compared to the on-premises environment?