📘 Chapter 2: Machine Learning Classifiers & Data Preprocessing (MCA Notes)
🎯 Objectives
- Understand major classification algorithms
- Learn basics of scikit-learn
- Handle real-world data issues (missing values, categorical data)
- Build effective training datasets
- Perform feature selection & scaling
🔹 2.1 Choosing a Classification Algorithm
✔ What is Classification?
Classification is a supervised learning task where the model predicts class labels (e.g., spam/not spam).
✔ Factors to Consider:
- Size of dataset
- Linearity of data
- Training time
- Model interpretability
- Accuracy requirements
✔ Common Algorithms:
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- K-Nearest Neighbors (KNN)
🔹 2.2 Logistic Regression
✔ Concept:
Used for binary classification; predicts the class probability using the sigmoid function.
👉 Formula:
σ(z) = 1 / (1 + e^(−z)), where z = w·x + b
✔ Key Points:
- Outputs probability (0–1)
- Works well for linearly separable data
- Easy to interpret
✔ Advantages:
- Simple & fast
- Probabilistic output
✔ Disadvantages:
- Not suitable for complex/nonlinear data
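👉 Example (a minimal sketch with scikit-learn; the data points are illustrative values, not from the notes):

```python
# Logistic regression on a tiny one-feature toy dataset.
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one feature
y = np.array([0, 0, 0, 1, 1, 1])                          # binary labels

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid: one probability per class, summing to 1
probs = clf.predict_proba([[2.0]])
pred = clf.predict([[3.8]])
```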
🔹 2.3 Support Vector Machines (SVM)
✔ Concept:
Finds the best boundary (hyperplane) that maximizes margin between classes.
✔ Key Terms:
- Hyperplane
- Margin
- Support Vectors
✔ Advantages:
- High accuracy
- Works well in high dimensions
✔ Disadvantages:
- Slow for large datasets
- Hard to interpret
🔹 2.4 Kernel SVM (Nonlinear Problems)
✔ Concept:
Transforms data into higher dimensions using kernel trick.
✔ Common Kernels:
- Linear
- Polynomial
- RBF (Radial Basis Function)
✔ Use Case:
- When data is not linearly separable
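👉 Example (a sketch using the classic XOR pattern, assumed toy data): no straight line separates XOR labels, but an RBF kernel can:

```python
# RBF-kernel SVM on XOR-labelled points (not linearly separable).
from sklearn.svm import SVC
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)  # XOR labels

clf = SVC(kernel='rbf', gamma='scale')  # kernel trick: implicit higher-dim mapping
clf.fit(X, y)
acc = clf.score(X, y)
```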
🔹 2.5 Decision Tree
✔ Concept:
Tree-like structure where:
- Nodes → Features
- Branches → Decisions
- Leaves → Output
✔ Splitting Criteria:
- Gini Index
- Entropy (Information Gain)
✔ Advantages:
- Easy to understand
- No need for scaling
✔ Disadvantages:
- Overfitting problem
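👉 Example (a sketch on made-up data; note the features are on very different scales, which a tree handles without scaling):

```python
# Decision tree with Gini splitting; max_depth limits growth to curb overfitting.
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[1, 100], [2, 200], [3, 150], [8, 900], [9, 800], [10, 950]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X, y)  # no feature scaling needed
```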
🔹 2.6 K-Nearest Neighbors (KNN)
✔ Concept:
Classifies a point by the majority class among its K nearest neighbors.
✔ Steps:
- Choose K
- Calculate distance
- Assign majority class
✔ Advantages:
- Simple
- No training phase (lazy learning)
✔ Disadvantages:
- Slow for large data
- Sensitive to noise
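👉 Example (the steps above as a sketch; the 2-D points are illustrative):

```python
# KNN with K=3; Euclidean distance is the scikit-learn default.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # choose K
knn.fit(X, y)                              # "lazy": fit only stores the data
pred = knn.predict([[2, 2], [8, 7]])       # majority class of 3 nearest points
```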
🔹 2.7 Data Preprocessing
✔ Importance:
Real-world data is often incomplete, inconsistent, and noisy.
✔ Steps:
- Handle missing values
- Encode categorical data
- Feature scaling
- Split dataset
🔹 2.8 Handling Missing Data
✔ Methods:
- Remove rows/columns with missing values
- Imputation:
  - Mean
  - Median
  - Most frequent
👉 Example (scikit-learn):
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])  # NaN → column mean 4.0
🔹 2.9 Handling Categorical Data
✔ Types:
- Nominal (no order) → Colors
- Ordinal (ordered) → Rank
✔ Techniques:
- Label Encoding
- One-Hot Encoding
👉 Example:
from sklearn.preprocessing import OneHotEncoder
🔹 2.10 Splitting Dataset
✔ Why?
To evaluate model performance
✔ Types:
- Training Set
- Test Set
👉 Example:
from sklearn.model_selection import train_test_split
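👉 Fuller sketch (toy arrays; an 80/20 split is a common convention, not a rule):

```python
# Hold out 20% of the data as a test set for evaluating the model.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random_state makes the split repeatable
```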
🔹 2.11 Feature Scaling
✔ Why?
Distance-based algorithms like SVM and KNN are sensitive to feature magnitude, so features must be on comparable scales.
✔ Methods:
- Normalization (Min-Max Scaling)
- Standardization
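👉 Example (a sketch comparing both methods on a single illustrative column):

```python
# Min-Max scaling maps values to [0, 1]; standardization gives zero mean, unit variance.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min)
X_std = StandardScaler().fit_transform(X)    # (x - mean) / std
```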
🔹 2.12 Feature Selection
✔ Purpose:
Remove irrelevant features → improve accuracy
✔ Methods:
- Filter methods (correlation)
- Wrapper methods
- Embedded methods
✔ Benefits:
- Reduces overfitting
- Faster training
- Better performance
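👉 Example (a filter-method sketch on synthetic data: one informative feature mixed with three noise features; the dataset is invented for illustration):

```python
# Filter-method feature selection: keep the k features with the highest ANOVA F-score.
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (20, 1)),   # class 0 centred at 0
                         rng.normal(3, 1, (20, 1))])  # class 1 centred at 3
noise = rng.normal(0, 1, (40, 3))                     # irrelevant features
X = np.hstack([informative, noise])
y = np.array([0] * 20 + [1] * 20)

selector = SelectKBest(f_classif, k=1)
X_new = selector.fit_transform(X, y)  # only the informative column survives
```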
🔹 2.13 Summary
- Different classifiers have different strengths
- Preprocessing is crucial for model success
- Feature selection improves performance
- scikit-learn simplifies implementation
📘 Self-Assessment Questions (Chapter 2)
🔹 Section A: Very Short Answer Questions (1–2 marks)
Q1. What is classification in machine learning?
Answer:
Classification is a supervised learning technique used to predict categorical class labels based on input data.
Q2. What is logistic regression used for?
Answer:
It is used for binary classification problems and predicts probabilities using a sigmoid function.
Q3. What is a hyperplane in SVM?
Answer:
A hyperplane is a decision boundary that separates different classes in the feature space.
Q4. What is overfitting?
Answer:
Overfitting occurs when a model learns noise instead of patterns, performing well on training data but poorly on test data.
Q5. What is KNN?
Answer:
K-Nearest Neighbors is a lazy learning algorithm that classifies data based on the majority class of nearest neighbors.
🔹 Section B: Short Answer Questions (3–5 marks)
Q6. Explain Logistic Regression briefly.
Answer:
Logistic regression is a classification algorithm that predicts the probability of a class using a sigmoid function. It works best for linearly separable data and outputs values between 0 and 1.
Q7. What are support vectors in SVM?
Answer:
Support vectors are the data points closest to the decision boundary, which influence the position of the hyperplane.
Q8. What is the kernel trick in SVM?
Answer:
The kernel trick transforms data into a higher-dimensional space to make it linearly separable without explicitly computing the transformation.
Q9. What are the advantages of Decision Trees?
Answer:
- Easy to understand and interpret
- No need for feature scaling
- Handles both numerical and categorical data
Q10. What is feature scaling? Why is it needed?
Answer:
Feature scaling standardizes data values to a common range. It is needed because algorithms like SVM and KNN are distance-based and sensitive to feature magnitude.
🔹 Section C: Long Answer Questions (8–10 marks)
Q11. Compare Logistic Regression and SVM.
Answer:
- Logistic Regression models the probability of a class with a sigmoid function; SVM finds the hyperplane that maximizes the margin between classes.
- Logistic Regression gives probabilistic outputs; a standard SVM gives only class labels.
- Logistic Regression is fast and easy to interpret; kernel SVM handles nonlinear, high-dimensional data but is slower to train and harder to interpret.
- Both work well on linearly separable data; for complex boundaries, kernel SVM is preferred.
Q12. Explain Decision Tree learning in detail.
Answer:
A Decision Tree is a supervised learning algorithm that splits data into subsets based on feature values. Each internal node represents a feature, branches represent decisions, and leaves represent class labels.
It uses criteria like:
- Gini Index
- Entropy (Information Gain)
However, it may suffer from overfitting, which can be controlled using pruning.
Q13. Describe K-Nearest Neighbors (KNN) algorithm.
Answer:
KNN is a non-parametric algorithm that classifies a data point based on the majority class of its nearest neighbors.
Steps:
- Choose value of K
- Calculate distance (Euclidean)
- Find nearest neighbors
- Assign class
Disadvantages include high computation time and sensitivity to noise.
Q14. Explain data preprocessing steps.
Answer:
Data preprocessing involves:
- Handling missing values (mean, median)
- Encoding categorical data (One-hot encoding)
- Feature scaling (normalization/standardization)
- Splitting dataset into training and testing
It improves model accuracy and performance.
Q15. Explain feature selection and its importance.
Answer:
Feature selection involves choosing the most relevant features for model building.
Importance:
- Reduces overfitting
- Improves accuracy
- Reduces training time
Methods include:
- Filter methods
- Wrapper methods
- Embedded methods
🔹 Section D: Practical/Conceptual Questions
Q16. Why do we split dataset into training and test sets?
Answer:
To evaluate the model’s performance on unseen data and avoid overfitting.
Q17. Difference between normalization and standardization.
Answer:
- Normalization (Min-Max scaling) rescales values to a fixed range, usually [0, 1]: x' = (x − min) / (max − min).
- Standardization rescales values to zero mean and unit variance: x' = (x − μ) / σ.
- Normalization is sensitive to outliers; standardization is generally preferred when features are roughly Gaussian.
Q18. What are categorical variables?
Answer:
Variables that represent categories (e.g., color, gender). They must be converted into numerical form for ML models.
Q19. What is imputation?
Answer:
Imputation is the process of filling missing values using statistical methods like mean, median, or mode.
Q20. Which algorithms require feature scaling?
Answer:
- SVM
- KNN
- Logistic Regression
