
📘 Chapter 2: Machine Learning Classifiers & Data Preprocessing (MCA Notes)


🎯 Objectives

  • Understand major classification algorithms
  • Learn basics of scikit-learn
  • Handle real-world data issues (missing values, categorical data)
  • Build effective training datasets
  • Perform feature selection & scaling

🔹 2.1 Choosing a Classification Algorithm

✔ What is Classification?

Classification is a supervised learning task where the model predicts class labels (e.g., spam/not spam).

✔ Factors to Consider:

  • Size of dataset
  • Linearity of data
  • Training time
  • Model interpretability
  • Accuracy requirements

✔ Common Algorithms:

  • Logistic Regression
  • Support Vector Machine (SVM)
  • Decision Tree
  • K-Nearest Neighbors (KNN)

🔹 2.2 Logistic Regression

✔ Concept:

Used for binary classification; predicts the probability of a class using the sigmoid function.

👉 Formula:

P(y=1) = \frac{1}{1 + e^{-z}}

where z = w \cdot x + b is the weighted sum of the inputs.

✔ Key Points:

  • Outputs probability (0–1)
  • Works well for linearly separable data
  • Easy to interpret

✔ Advantages:

  • Simple & fast
  • Probabilistic output

✔ Disadvantages:

  • Not suitable for complex/nonlinear data
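👉 Example (a minimal scikit-learn sketch; the toy dataset of study hours vs. pass/fail is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid, returning P(y=0) and P(y=1) per sample
proba = clf.predict_proba([[5]])
pred = clf.predict([[1], [6]])
```

Because this data is linearly separable, the model places the decision boundary between 3 and 4 study hours.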

🔹 2.3 Support Vector Machines (SVM)

✔ Concept:

Finds the best boundary (hyperplane) that maximizes margin between classes.

✔ Key Terms:

  • Hyperplane
  • Margin
  • Support Vectors

✔ Advantages:

  • High accuracy
  • Works well in high dimensions

✔ Disadvantages:

  • Slow for large datasets
  • Hard to interpret
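👉 Example (a sketch with two made-up linearly separable clusters; the fitted model's support_vectors_ attribute exposes the key terms above):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters
X = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X, y)

# Support vectors: the training points closest to the separating hyperplane
sv = clf.support_vectors_
pred = clf.predict([[0.5, 0.5], [4.5, 4.5]])
```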

🔹 2.4 Kernel SVM (Nonlinear Problems)

✔ Concept:

Transforms the data into a higher-dimensional space using the kernel trick, where a separating hyperplane can be found.

✔ Common Kernels:

  • Linear
  • Polynomial
  • RBF (Radial Basis Function)

✔ Use Case:

  • When data is not linearly separable
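👉 Example (XOR is the classic pattern no straight line can separate; an RBF kernel handles it. The values C=10 and gamma=1 are illustrative choices, not prescribed settings):

```python
import numpy as np
from sklearn.svm import SVC

# XOR pattern: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# The RBF kernel implicitly maps the points into a higher-dimensional space
clf = SVC(kernel='rbf', C=10, gamma=1)
clf.fit(X, y)
```

A linear kernel cannot classify all four XOR points correctly; the RBF kernel can.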

🔹 2.5 Decision Tree

✔ Concept:

Tree-like structure where:

  • Nodes → Features
  • Branches → Decisions
  • Leaves → Output

✔ Splitting Criteria:

  • Gini Index
  • Entropy (Information Gain)

✔ Advantages:

  • Easy to understand
  • No need for scaling

✔ Disadvantages:

  • Overfitting problem
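👉 Example (a minimal sketch on invented 1-D data; criterion selects the splitting measure above, and max_depth is one simple guard against overfitting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data with an obvious split point between 3 and 10
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# criterion='gini' (default) or 'entropy'; max_depth limits tree growth
clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
clf.fit(X, y)
pred = clf.predict([[2.5], [10.5]])
```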

🔹 2.6 K-Nearest Neighbors (KNN)

✔ Concept:

Classifies a point by the majority class among its K nearest neighbors.

✔ Steps:

  1. Choose K
  2. Calculate distance
  3. Assign majority class

✔ Advantages:

  • Simple
  • No training phase (lazy learning)

✔ Disadvantages:

  • Slow for large data
  • Sensitive to noise
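👉 Example (a sketch on two invented clusters; only Step 1, choosing K, appears in code, since distance calculation and the majority vote happen inside predict()):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: choose K
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # lazy learning: fit() only stores the training data

# Steps 2-3: distances and majority vote run at prediction time
pred = knn.predict([[1.5, 1.5], [6.5, 6.5]])
```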

🔹 2.7 Data Preprocessing

✔ Importance:

Real-world data is often incomplete, inconsistent, and noisy.

✔ Steps:

  • Handle missing values
  • Encode categorical data
  • Feature scaling
  • Split dataset

🔹 2.8 Handling Missing Data

✔ Methods:

  1. Remove rows/columns
  2. Imputation
    • Mean
    • Median
    • Most frequent

👉 Example (scikit-learn):

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform([[1.0], [np.nan], [3.0]])  # NaN -> 2.0 (the column mean)

🔹 2.9 Handling Categorical Data

✔ Types:

  • Nominal (no order) → Colors
  • Ordinal (ordered) → Rank

✔ Techniques:

  • Label Encoding
  • One-Hot Encoding

👉 Example:

from sklearn.preprocessing import OneHotEncoder
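👉 A fuller sketch (the color column is invented; .toarray() converts the encoder's sparse output to a dense array):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])

# Each category gets its own 0/1 column (categories sorted: Blue, Green, Red)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()
```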

🔹 2.10 Splitting Dataset

✔ Why?

To evaluate model performance

✔ Types:

  • Training Set
  • Test Set

👉 Example:

from sklearn.model_selection import train_test_split
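👉 A fuller sketch (dummy data; the common 80/20 split shown here is a convention, not a requirement):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)

# 80% training / 20% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```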

🔹 2.11 Feature Scaling

✔ Why?

Distance-based algorithms such as SVM and KNN are sensitive to the magnitude of features.

✔ Methods:

  1. Normalization (Min-Max Scaling)

     x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

  2. Standardization

     z = \frac{x - \mu}{\sigma}
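👉 Example (both methods via scikit-learn's built-in scalers, applied to a made-up column):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling: x' = (x - min) / (max - min), result lies in [0, 1]
x_norm = MinMaxScaler().fit_transform(X)

# Standardization: z = (x - mu) / sigma, result has mean 0 and std 1
x_std = StandardScaler().fit_transform(X)
```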

🔹 2.12 Feature Selection

✔ Purpose:

Remove irrelevant features → improve accuracy

✔ Methods:

  • Filter methods (correlation)
  • Wrapper methods
  • Embedded methods

✔ Benefits:

  • Reduces overfitting
  • Faster training
  • Better performance
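👉 Example (a filter-method sketch using SelectKBest with the ANOVA F-score; the dataset is synthetic, with one informative feature and three noise features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0] * 25 + [1] * 25)
informative = y + rng.normal(0, 0.1, 50)   # strongly related to the label
noise = rng.normal(0, 1, (50, 3))          # irrelevant features
X = np.column_stack([informative, noise])

# Filter method: keep the k features with the highest F-score
selector = SelectKBest(score_func=f_classif, k=1)
X_new = selector.fit_transform(X, y)
mask = selector.get_support()   # boolean mask of the kept columns
```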

🔹 2.13 Summary

  • Different classifiers have different strengths
  • Preprocessing is crucial for model success
  • Feature selection improves performance
  • scikit-learn simplifies implementation

📘 Self-Assessment Questions (Chapter 2)

🔹 Section A: Very Short Answer Questions (1–2 marks)

Q1. What is classification in machine learning?

Answer:
Classification is a supervised learning technique used to predict categorical class labels based on input data.


Q2. What is logistic regression used for?

Answer:
It is used for binary classification problems and predicts probabilities using a sigmoid function.


Q3. What is a hyperplane in SVM?

Answer:
A hyperplane is a decision boundary that separates different classes in the feature space.


Q4. What is overfitting?

Answer:
Overfitting occurs when a model learns noise instead of patterns, performing well on training data but poorly on test data.


Q5. What is KNN?

Answer:
K-Nearest Neighbors is a lazy learning algorithm that classifies data based on the majority class of nearest neighbors.



🔹 Section B: Short Answer Questions (3–5 marks)

Q6. Explain Logistic Regression briefly.

Answer:
Logistic regression is a classification algorithm that predicts the probability of a class using a sigmoid function. It works best for linearly separable data and outputs values between 0 and 1.


Q7. What are support vectors in SVM?

Answer:
Support vectors are the data points closest to the decision boundary, which influence the position of the hyperplane.


Q8. What is the kernel trick in SVM?

Answer:
The kernel trick transforms data into a higher-dimensional space to make it linearly separable without explicitly computing the transformation.


Q9. What are the advantages of Decision Trees?

Answer:

  • Easy to understand and interpret
  • No need for feature scaling
  • Handles both numerical and categorical data

Q10. What is feature scaling? Why is it needed?

Answer:
Feature scaling standardizes data values to a common range. It is needed because algorithms like SVM and KNN are distance-based and sensitive to feature magnitude.

🔹 Section C: Long Answer Questions (8–10 marks)

Q11. Compare Logistic Regression and SVM.

Answer:
Logistic Regression is simple, fast, and outputs class probabilities, but it works well only for linearly separable data. SVM finds the hyperplane that maximizes the margin between classes, giving high accuracy and good performance in high dimensions (nonlinear data can be handled with kernels), but it is slower on large datasets and harder to interpret.

Q12. Explain Decision Tree learning in detail.

Answer:
A Decision Tree is a supervised learning algorithm that splits data into subsets based on feature values. Each internal node represents a feature, branches represent decisions, and leaves represent class labels.

It uses criteria like:

  • Gini Index
  • Entropy (Information Gain)

However, it may suffer from overfitting, which can be controlled using pruning.


Q13. Describe K-Nearest Neighbors (KNN) algorithm.

Answer:
KNN is a non-parametric algorithm that classifies a data point based on the majority class of its nearest neighbors.

Steps:

  1. Choose value of K
  2. Calculate distance (Euclidean)
  3. Find nearest neighbors
  4. Assign class

Disadvantages include high computation time and sensitivity to noise.


Q14. Explain data preprocessing steps.

Answer:
Data preprocessing involves:

  • Handling missing values (mean, median)
  • Encoding categorical data (One-hot encoding)
  • Feature scaling (normalization/standardization)
  • Splitting dataset into training and testing

It improves model accuracy and performance.


Q15. Explain feature selection and its importance.

Answer:
Feature selection involves choosing the most relevant features for model building.

Importance:

  • Reduces overfitting
  • Improves accuracy
  • Reduces training time

Methods include:

  • Filter methods
  • Wrapper methods
  • Embedded methods


🔹 Section D: Practical/Conceptual Questions

Q16. Why do we split dataset into training and test sets?

Answer:
To evaluate the model’s performance on unseen data and avoid overfitting.

Q17. Difference between normalization and standardization.

Answer:
Normalization (min-max scaling) rescales values to a fixed range, usually [0, 1], using x' = (x - min)/(max - min). Standardization rescales values to have mean 0 and standard deviation 1 using z = (x - μ)/σ; it is less affected by outliers because it does not depend on the extreme min/max values.

Q18. What are categorical variables?

Answer:
Variables that represent categories (e.g., color, gender). They must be converted into numerical form for ML models.


Q19. What is imputation?

Answer:
Imputation is the process of filling missing values using statistical methods like mean, median, or mode.


Q20. Which algorithms require feature scaling?

Answer:

  • SVM
  • KNN
  • Logistic Regression 

