Full AWS Practitioner Certification Question

You're working on an email spam classifier and want to determine which of your candidate models better separates spam from non-spam across all classification thresholds. Which evaluation approach best shows model performance across the full range of decision thresholds?
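
As an illustration of what a threshold-independent comparison can look like, the sketch below computes the area under the ROC curve (AUC-ROC) for two models. It uses the rank-based formulation: the AUC equals the probability that a randomly chosen spam message gets a higher score than a randomly chosen non-spam message. The helper `roc_auc`, the labels, and the scores for "model A" and "model B" are all hypothetical and not taken from the question.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for spam, 0 for non-spam (hypothetical encoding).
    scores: predicted spam scores, higher means more spam-like.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up evaluation set: three spam and three non-spam messages.
labels = [1, 1, 1, 0, 0, 0]
model_a_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b_scores = [0.6, 0.3, 0.4, 0.5, 0.7, 0.1]

auc_a = roc_auc(labels, model_a_scores)
auc_b = roc_auc(labels, model_b_scores)
print(f"model A AUC: {auc_a:.3f}")
print(f"model B AUC: {auc_b:.3f}")
```

Because the score summarizes ranking quality over every possible threshold, no single cutoff has to be chosen before comparing the models. The pairwise loop is quadratic in the number of examples, which is fine for a sketch but too slow for large evaluation sets.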