The project begins by collecting a labeled dataset of spam and legitimate (ham) emails. The raw email text is passed through a preprocessing pipeline that converts it to lowercase, removes unnecessary elements such as special characters, numbers, and extra spaces, and structures the result for analysis. Feature extraction techniques such as Bag-of-Words and TF-IDF then convert the text into numerical feature vectors that the machine learning model can process. A Naive Bayes classifier is trained on 70% of the dataset and validated on the remaining 30%. At prediction time, the trained model estimates the probability that an incoming email is spam or ham from the word patterns and statistical distributions learned for each class, and assigns the more probable label.
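The pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual implementation: it uses raw Bag-of-Words counts only (TF-IDF weighting is left out for brevity), it assumes Laplace (add-one) smoothing, and the names `preprocess`, `NaiveBayes`, and the toy `emails` list are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase, strip special characters and numbers, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return text.split()                     # split() also drops repeated spaces


class NaiveBayes:
    """Multinomial Naive Bayes over Bag-of-Words counts (add-one smoothing assumed)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)                    # emails per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(preprocess(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        tokens = [t for t in preprocess(text) if t in self.vocab]  # skip unseen words
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for token in tokens:
                smoothed = (self.word_counts[c][token] + 1) / (total_words + len(self.vocab))
                score += math.log(smoothed)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up toy examples for illustration only; not the project's dataset.
emails = [
    ("WIN a FREE prize now!!!", "spam"),
    ("Meeting moved to 3pm tomorrow", "ham"),
    ("Claim your FREE cash reward", "spam"),
    ("Please review the attached report", "ham"),
    ("Cheap loans, apply now", "spam"),
    ("Lunch tomorrow with the team?", "ham"),
    ("You won $1000 cash prize", "spam"),
    ("Free cash prize, claim now", "spam"),
    ("Team meeting report attached", "ham"),
    ("Review the report tomorrow", "ham"),
]

# 70/30 split; a real dataset should be shuffled (ideally stratified) first.
split = int(0.7 * len(emails))
train, test = emails[:split], emails[split:]

model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
accuracy = sum(model.predict(t) == l for t, l in test) / len(test)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.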
The system achieved high classification accuracy while keeping both error types low: false positives (legitimate emails marked as spam) and false negatives (spam emails marked as legitimate). Overall, the model improves email filtering, strengthens security, and reduces inbox clutter, making it suitable for real-world deployment.
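To make the two error types concrete, the following sketch counts false positives and false negatives from a list of true and predicted labels and derives accuracy, precision, recall, and false positive rate. The function names and the label lists are hypothetical and the labels are made up for illustration; they are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the spam class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # spam correctly flagged
        elif pred == positive:
            fp += 1          # legitimate email marked as spam
        elif truth == positive:
            fn += 1          # spam marked as legitimate
        else:
            tn += 1          # legitimate email correctly passed
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Return common classification metrics as a dict (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Made-up labels for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting precision and recall alongside accuracy matters for spam filtering because the classes are often imbalanced, and a false positive (a lost legitimate email) usually costs more than a false negative.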
