📘 Chapter 2: Machine Learning Classifiers & Data Preprocessing (MCA Notes)
🎯 Objectives
- Understand major classification algorithms
- Learn basics of scikit-learn
- Handle real-world data issues (missing values, categorical data)
- Build effective training datasets
- Perform feature selection & scaling
🔹 2.1 Choosing a Classification Algorithm
✔ What is Classification?
Classification is a supervised learning task where the model predicts class labels (e.g., spam/not spam).
✔ Factors to Consider:
- Size of dataset
- Linearity of data
- Training time
- Model interpretability
- Accuracy requirements
✔ Common Algorithms:
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- K-Nearest Neighbors (KNN)
🔹 2.2 Logistic Regression
✔ Concept:
Used for binary classification; predicts the class probability using the sigmoid function.
👉 Formula:
σ(z) = 1 / (1 + e^(−z)), where z = w·x + b
✔ Key Points:
- Outputs probability (0–1)
- Works well for linearly separable data
- Easy to interpret
✔ Advantages:
- Simple & fast
- Probabilistic output
✔ Disadvantages:
- Not suitable for complex/nonlinear data
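👉 Example (a minimal sketch with scikit-learn; the data points are illustrative values, not from the notes):

```python
# Logistic regression on a tiny one-feature toy dataset.
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one feature
y = np.array([0, 0, 0, 1, 1, 1])                          # binary labels

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid: one probability per class, summing to 1
probs = clf.predict_proba([[2.0]])
pred = clf.predict([[3.8]])
```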
🔹 2.3 Support Vector Machines (SVM)
✔ Concept:
Finds the best boundary (hyperplane) that maximizes margin between classes.
✔ Key Terms:
- Hyperplane
- Margin
- Support Vectors
✔ Advantages:
- High accuracy
- Works well in high dimensions
✔ Disadvantages:
- Slow for large datasets
- Hard to interpret
🔹 2.4 Kernel SVM (Nonlinear Problems)
✔ Concept:
Transforms data into higher dimensions using kernel trick.
✔ Common Kernels:
- Linear
- Polynomial
- RBF (Radial Basis Function)
✔ Use Case:
- When data is not linearly separable
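👉 Example (a sketch using the classic XOR pattern, assumed toy data): no straight line separates XOR labels, but an RBF kernel can:

```python
# RBF-kernel SVM on XOR-labelled points (not linearly separable).
from sklearn.svm import SVC
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)  # XOR labels

clf = SVC(kernel='rbf', gamma='scale')  # kernel trick: implicit higher-dim mapping
clf.fit(X, y)
acc = clf.score(X, y)
```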
🔹 2.5 Decision Tree
✔ Concept:
Tree-like structure where:
- Nodes → Features
- Branches → Decisions
- Leaves → Output
✔ Splitting Criteria:
- Gini Index
- Entropy (Information Gain)
✔ Advantages:
- Easy to understand
- No need for scaling
✔ Disadvantages:
- Overfitting problem
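👉 Example (a sketch on made-up data; note the features are on very different scales, which a tree handles without scaling):

```python
# Decision tree with Gini splitting; max_depth limits growth to curb overfitting.
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[1, 100], [2, 200], [3, 150], [8, 900], [9, 800], [10, 950]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X, y)  # no feature scaling needed
```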
🔹 2.6 K-Nearest Neighbors (KNN)
✔ Concept:
Classifies a point by the majority class among its K nearest neighbors.
✔ Steps:
- Choose K
- Calculate distance
- Assign majority class
✔ Advantages:
- Simple
- No training phase (lazy learning)
✔ Disadvantages:
- Slow for large data
- Sensitive to noise
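👉 Example (the steps above as a sketch; the 2-D points are illustrative):

```python
# KNN with K=3; Euclidean distance is the scikit-learn default.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # choose K
knn.fit(X, y)                              # "lazy": fit only stores the data
pred = knn.predict([[2, 2], [8, 7]])       # majority class of 3 nearest points
```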
🔹 2.7 Data Preprocessing
✔ Importance:
Real-world data is often incomplete, inconsistent, and noisy.
✔ Steps:
- Handle missing values
- Encode categorical data
- Feature scaling
- Split dataset
🔹 2.8 Handling Missing Data
✔ Methods:
- Remove rows/columns with missing values
- Imputation:
  - Mean
  - Median
  - Most frequent
👉 Example (scikit-learn):
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])  # NaN → column mean 4.0
🔹 2.9 Handling Categorical Data
✔ Types:
- Nominal (no order) → Colors
- Ordinal (ordered) → Rank
✔ Techniques:
- Label Encoding
- One-Hot Encoding
👉 Example:
from sklearn.preprocessing import OneHotEncoder
🔹 2.10 Splitting Dataset
✔ Why?
To evaluate model performance
✔ Types:
- Training Set
- Test Set
👉 Example:
from sklearn.model_selection import train_test_split
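👉 Fuller sketch (toy arrays; an 80/20 split is a common convention, not a rule):

```python
# Hold out 20% of the data as a test set for evaluating the model.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random_state makes the split repeatable
```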
🔹 2.11 Feature Scaling
✔ Why?
Distance-based algorithms like SVM and KNN are sensitive to feature magnitude, so features must be on comparable scales.
✔ Methods:
- Normalization (Min-Max Scaling)
- Standardization
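👉 Example (a sketch comparing both methods on a single illustrative column):

```python
# Min-Max scaling maps values to [0, 1]; standardization gives zero mean, unit variance.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min)
X_std = StandardScaler().fit_transform(X)    # (x - mean) / std
```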
🔹 2.12 Feature Selection
✔ Purpose:
Remove irrelevant features → improve accuracy
✔ Methods:
- Filter methods (correlation)
- Wrapper methods
- Embedded methods
✔ Benefits:
- Reduces overfitting
- Faster training
- Better performance
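👉 Example (a filter-method sketch on synthetic data: one informative feature mixed with three noise features; the dataset is invented for illustration):

```python
# Filter-method feature selection: keep the k features with the highest ANOVA F-score.
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (20, 1)),   # class 0 centred at 0
                         rng.normal(3, 1, (20, 1))])  # class 1 centred at 3
noise = rng.normal(0, 1, (40, 3))                     # irrelevant features
X = np.hstack([informative, noise])
y = np.array([0] * 20 + [1] * 20)

selector = SelectKBest(f_classif, k=1)
X_new = selector.fit_transform(X, y)  # only the informative column survives
```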
🔹 2.13 Summary
- Different classifiers have different strengths
- Preprocessing is crucial for model success
- Feature selection improves performance
- scikit-learn simplifies implementation
📘 Self-Assessment Questions (Chapter 2)
🔹 Section A: Very Short Answer Questions (1–2 marks)
Q1. What is classification in machine learning?
Answer:
Classification is a supervised learning technique used to predict categorical class labels based on input data.
Q2. What is logistic regression used for?
Answer:
It is used for binary classification problems and predicts probabilities using a sigmoid function.
Q3. What is a hyperplane in SVM?
Answer:
A hyperplane is a decision boundary that separates different classes in the feature space.
Q4. What is overfitting?
Answer:
Overfitting occurs when a model learns noise instead of patterns, performing well on training data but poorly on test data.
Q5. What is KNN?
Answer:
K-Nearest Neighbors is a lazy learning algorithm that classifies data based on the majority class of nearest neighbors.
🔹 Section B: Short Answer Questions (3–5 marks)
Q6. Explain Logistic Regression briefly.
Answer:
Logistic regression is a classification algorithm that predicts the probability of a class using a sigmoid function. It works best for linearly separable data and outputs values between 0 and 1.
Q7. What are support vectors in SVM?
Answer:
Support vectors are the data points closest to the decision boundary, which influence the position of the hyperplane.
Q8. What is the kernel trick in SVM?
Answer:
The kernel trick transforms data into a higher-dimensional space to make it linearly separable without explicitly computing the transformation.
Q9. What are the advantages of Decision Trees?
Answer:
- Easy to understand and interpret
- No need for feature scaling
- Handles both numerical and categorical data
Q10. What is feature scaling? Why is it needed?
Answer:
Feature scaling standardizes data values to a common range. It is needed because algorithms like SVM and KNN are distance-based and sensitive to feature magnitude.
🔹 Section C: Long Answer Questions (8–10 marks)
Q11. Compare Logistic Regression and SVM.
Answer:
- Logistic Regression models the probability of a class with a sigmoid function; SVM finds the hyperplane that maximizes the margin between classes.
- Logistic Regression gives probabilistic outputs; a standard SVM gives only class labels.
- Logistic Regression is fast and easy to interpret; kernel SVM handles nonlinear, high-dimensional data but is slower to train and harder to interpret.
- Both work well on linearly separable data; for complex boundaries, kernel SVM is preferred.
Q12. Explain Decision Tree learning in detail.
Answer:
A Decision Tree is a supervised learning algorithm that splits data into subsets based on feature values. Each internal node represents a feature, branches represent decisions, and leaves represent class labels.
It uses criteria like:
- Gini Index
- Entropy (Information Gain)
However, it may suffer from overfitting, which can be controlled using pruning.
Q13. Describe K-Nearest Neighbors (KNN) algorithm.
Answer:
KNN is a non-parametric algorithm that classifies a data point based on the majority class of its nearest neighbors.
Steps:
- Choose value of K
- Calculate distance (Euclidean)
- Find nearest neighbors
- Assign class
Disadvantages include high computation time and sensitivity to noise.
Q14. Explain data preprocessing steps.
Answer:
Data preprocessing involves:
- Handling missing values (mean, median)
- Encoding categorical data (One-hot encoding)
- Feature scaling (normalization/standardization)
- Splitting dataset into training and testing
It improves model accuracy and performance.
Q15. Explain feature selection and its importance.
Answer:
Feature selection involves choosing the most relevant features for model building.
Importance:
- Reduces overfitting
- Improves accuracy
- Reduces training time
Methods include:
- Filter methods
- Wrapper methods
- Embedded methods
🔹 Section D: Practical/Conceptual Questions
Q16. Why do we split dataset into training and test sets?
Answer:
To evaluate the model’s performance on unseen data and avoid overfitting.
Q17. Difference between normalization and standardization.
Answer:
- Normalization (Min-Max scaling) rescales values to a fixed range, usually [0, 1]: x' = (x − min) / (max − min).
- Standardization rescales values to zero mean and unit variance: x' = (x − μ) / σ.
- Normalization is sensitive to outliers; standardization is generally preferred when features are roughly Gaussian.
Q18. What are categorical variables?
Answer:
Variables that represent categories (e.g., color, gender). They must be converted into numerical form for ML models.
Q19. What is imputation?
Answer:
Imputation is the process of filling missing values using statistical methods like mean, median, or mode.
Q20. Which algorithms require feature scaling?
Answer:
- SVM
- KNN
- Logistic Regression
