The data science team at an email marketing company has created a data lake with raw and refined zones. The raw zone holds the data as it arrives from the source; however, the team wants to de-duplicate the data before it is written into the refined zone. What is the best way to accomplish this with the least development time and infrastructure maintenance effort?
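
For context only (not the answer to the question), the sketch below shows what row-level de-duplication between a raw and a refined zone can look like in a PySpark job. The S3 paths, table name, and Parquet format are hypothetical assumptions, and `dropDuplicates` handles only exact duplicates, not fuzzy matches.

```python
# Minimal sketch, assuming hypothetical S3 paths and Parquet-formatted data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-raw-to-refined").getOrCreate()

# Read raw-zone data as it arrived from the source (format assumed to be Parquet).
raw_df = spark.read.parquet("s3://example-bucket/raw/email_events/")

# Drop exact duplicate rows; pass a column list (e.g. ["message_id"]) to
# de-duplicate on a business key instead of the full row.
deduped_df = raw_df.dropDuplicates()

# Write the de-duplicated data into the refined zone.
deduped_df.write.mode("overwrite").parquet("s3://example-bucket/refined/email_events/")
```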