Email Spam Classification using Machine Learning Techniques

Designed and implemented a machine learning–based email spam detection system in Python. The project includes text preprocessing steps such as cleaning, special character removal, and lowercase conversion, followed by feature extraction using Bag-of-Words and TF-IDF. A Naive Bayes classifier was trained using a 70/30 train-test split to classify emails as spam or ham. The system analyzes email headers and content to improve classification accuracy, reduce false positives and false negatives, and enhance overall email security and communication efficiency.

Shruti Parab

Project Timeline

Feb 2024 - May 2024

HIGHLIGHTS

  • Designed and implemented an end-to-end Email Spam Classification system in Python using Machine Learning, incorporating supervised learning techniques for binary classification (spam vs. ham).
  • Built a comprehensive NLP-based text preprocessing pipeline including data cleaning, tokenization, normalization, Bag-of-Words, and TF-IDF feature extraction to improve model performance.
  • Trained and evaluated a Naive Bayes classifier using a 70/30 train-test split, performing model validation, performance evaluation, and optimization to reduce false positives and false negatives.
  • Implemented the solution with Python libraries such as scikit-learn, pandas, and NumPy, keeping the architecture modular, scalable, and deployable in real-world settings.

SKILLS

NumPy
Python
Pandas
Machine Learning
Naive Bayes
Text processing
TF-IDF
Bag of words
Feature extraction
Algorithms

The project begins by collecting a labeled dataset of spam and legitimate (ham) emails. The raw email text is first passed through a preprocessing pipeline where unnecessary elements such as special characters, numbers, and extra spaces are removed. The text is converted to lowercase and structured for analysis. Feature extraction techniques like Bag-of-Words and TF-IDF are then applied to convert textual data into numerical feature vectors that the machine learning model can process. A Naive Bayes classifier is trained using 70% of the dataset and validated on the remaining 30%. During prediction, the trained model evaluates the probability of an incoming email being spam or ham based on learned word patterns and statistical distributions.
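The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `clean_text` helper, the toy emails, and the `random_state` value are assumptions made for the example, and only TF-IDF features are shown (swapping in scikit-learn's `CountVectorizer` would give plain Bag-of-Words counts).

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def clean_text(text: str) -> str:
    """Hypothetical helper: lowercase, drop digits and special characters,
    and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy data invented for illustration only; not the project's dataset.
emails = [
    "WIN a FREE prize now!!! Click here",
    "Congratulations, you won $1000 cash, claim today",
    "Cheap loans approved instantly, apply now",
    "Limited offer: free gift card, click to claim",
    "You have been selected for a free cruise, reply now",
    "Meeting moved to 3pm, agenda attached",
    "Can you review my pull request this afternoon?",
    "Lunch tomorrow? Let me know what time works",
    "Here are the notes from the project call",
    "Reminder: submit your timesheet by Friday",
]
labels = ["spam"] * 5 + ["ham"] * 5

# 70/30 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.30, random_state=42, stratify=labels
)

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(preprocessor=clean_text), MultinomialNB())
model.fit(X_train, y_train)

# Predicted label and class probabilities for a new email.
new_email = ["Claim your free prize, click now"]
print(model.predict(new_email))
print(dict(zip(model.classes_, model.predict_proba(new_email)[0])))
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training portion only, so the held-out 30% stays unseen during fitting.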
The system achieved high classification accuracy while reducing both false positives (legitimate emails marked as spam) and false negatives (spam emails marked as legitimate). The model filters email effectively, improving security and reducing inbox clutter, which makes it suitable for real-world deployment.
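To make the false positive and false negative terminology concrete, here is a small sketch of how these counts and the derived metrics could be computed. The `confusion_counts` helper and the example labels are hypothetical and do not represent the project's actual results.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn), treating `positive` as the spam class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # spam correctly flagged
        elif predicted == positive:
            fp += 1  # legitimate email marked as spam
        elif actual == positive:
            fn += 1  # spam email marked as legitimate
        else:
            tn += 1  # legitimate email correctly passed
    return tp, fp, fn, tn


# Made-up labels for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall alongside accuracy is useful for spam filtering because the two error types have different costs: a false positive hides a legitimate email, while a false negative lets spam through.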