A data engineering team runs a large-scale ETL pipeline that joins multiple large datasets, each containing hundreds of columns and billions of records. During the join phase, they notice that Spark executors are repeatedly spilling data to disk and that performance degrades significantly due to excessive shuffling. What type of resource optimization should the team prioritize to improve the performance of this job?
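For context, here is a minimal PySpark sketch of the kind of memory- and shuffle-oriented tuning this scenario points toward. The paths, column names, and configuration values are illustrative assumptions, not part of the original question:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-join-etl")
    # Give each executor more heap so shuffle buffers fit in memory
    # instead of spilling to disk (values are illustrative, not prescriptive).
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "4g")
    # Raise shuffle parallelism so each task handles a smaller partition.
    .config("spark.sql.shuffle.partitions", "2000")
    # Let adaptive query execution coalesce shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

facts = spark.read.parquet("s3://bucket/facts/")  # hypothetical path
dims = spark.read.parquet("s3://bucket/dims/")    # hypothetical path

# Prune columns before the join: with hundreds of columns, shuffling only
# the join key and the fields actually needed sharply reduces shuffle volume.
facts_slim = facts.select("join_key", "amount")
dims_slim = dims.select("join_key", "category")

joined = facts_slim.join(dims_slim, on="join_key", how="inner")
joined.write.mode("overwrite").parquet("s3://bucket/output/")
```

If one side of the join were small enough to fit in executor memory, a broadcast join hint would avoid the shuffle entirely, though with billions of records on both sides that is rarely an option.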