A team is formulating guidelines on when to apply various metrics for evaluating classification models. They need to decide under what circumstances the F1 score should be favored over accuracy. The F1 score formula is given as follows:

F1 = 2 * (precision * recall) / (precision + recall)

What recommendations should the team incorporate into their guidelines?
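
To make the F1 formula in the question concrete, here is a minimal sketch that computes accuracy and F1 from confusion-matrix counts. The function names, the counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration; they are not part of the original question.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 = 2 * (precision * recall) / (precision + recall)
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced dataset: 1000 samples, only 10 positives.

# Model A predicts "negative" for every sample.
acc_a = accuracy(tp=0, tn=990, fp=0, fn=10)
_, _, f1_a = precision_recall_f1(tp=0, fp=0, fn=10)

# Model B finds half of the positives and raises 5 false alarms.
acc_b = accuracy(tp=5, tn=985, fp=5, fn=5)
prec_b, rec_b, f1_b = precision_recall_f1(tp=5, fp=5, fn=5)

print(f"Model A: accuracy={acc_a:.2f}, F1={f1_a:.2f}")
print(f"Model B: accuracy={acc_b:.2f}, F1={f1_b:.2f}")
```

With these made-up counts both models have the same accuracy, while F1 separates the model that never detects a positive from the one that detects some. This is the kind of situation (rare positive class, errors on it matter) where the question of preferring F1 over accuracy arises.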