Machine Learning Basics

Machine Learning (ML) is the core technology driving most modern AI applications. This lesson covers the fundamental concepts, types of learning, and key algorithms that form the foundation of ML systems.

What is Machine Learning?

Machine Learning is a method of data analysis that automates analytical model building. It's based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.

Key Concepts

Data

The fuel of machine learning:

  • Features: Input variables used to make predictions
  • Labels: Output variables you're trying to predict
  • Training Data: Data used to train the model
  • Test Data: Data used to evaluate model performance

Model

A mathematical representation of a real-world process:

  • Algorithm: The method used to find patterns
  • Parameters: Values learned during training
  • Hyperparameters: Configuration settings for the algorithm

Training

The process of teaching the algorithm:

  • Fitting: Finding the best parameters
  • Optimization: Minimizing prediction errors
  • Validation: Checking performance on unseen data
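
A minimal sketch tying these terms together (the house sizes and prices below are invented purely for illustration): the sizes are features, the prices are labels, alpha is a hyperparameter chosen before training, and the coefficients that fit learns are parameters.

# Toy sketch: features, labels, hyperparameters, parameters, and validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1000], [1500], [2000], [2500], [3000], [3500]])   # features: house size (sq ft)
y = np.array([200000, 280000, 390000, 500000, 610000, 700000])   # labels: sale price

# Hold out part of the data to check performance on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = Ridge(alpha=1.0)       # alpha is a hyperparameter, set before training
model.fit(X_train, y_train)    # fitting learns the parameters from the training data

print("Learned parameters:", model.coef_, model.intercept_)
print("Validation R^2 on unseen data:", model.score(X_test, y_test))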

Types of Machine Learning

1. Supervised Learning

Learning with labeled examples.

Characteristics

  • Uses input-output pairs for training
  • Goal is to predict labels for new inputs
  • Performance can be measured against known correct answers

Types of Supervised Learning

Classification: Predicting categories or classes

# Example: Email spam detection
# Input: Email text
# Output: Spam or Not Spam

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
emails = [
    "Win money now! Click here!",
    "Meeting scheduled for tomorrow",
    "Free lottery winner! Claim prize!",
    "Project deadline reminder"
]
labels = ["spam", "not_spam", "spam", "not_spam"]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X, labels)

# Make prediction
new_email = ["Congratulations! You've won $1000!"]
X_new = vectorizer.transform(new_email)
prediction = classifier.predict(X_new)
print(f"Prediction: {prediction[0]}")

Regression: Predicting continuous numerical values

# Example: House price prediction
# Input: House features (size, location, etc.)
# Output: Price

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data: house size (sq ft) and price
house_sizes = np.array([[1000], [1500], [2000], [2500], [3000]])
prices = np.array([200000, 300000, 400000, 500000, 600000])

# Train model
model = LinearRegression()
model.fit(house_sizes, prices)

# Predict price for 1800 sq ft house
new_house = np.array([[1800]])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.2f}")

Common Supervised Learning Algorithms

  1. Linear Regression: For continuous predictions
  2. Logistic Regression: For binary classification
  3. Decision Trees: Interpretable models using if-then rules
  4. Random Forest: Ensemble of decision trees
  5. Support Vector Machines (SVM): Effective for high-dimensional data
  6. Neural Networks: Flexible models for complex patterns
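
As a small illustration of the "if-then rules" behind decision trees, the sketch below fits a shallow tree on the Iris dataset and prints its learned rules with scikit-learn's export_text helper:

# Sketch: inspecting the if-then rules learned by a decision tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned rules as readable if-then statements
print(export_text(tree, feature_names=list(iris.feature_names)))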

2. Unsupervised Learning

Learning patterns from data without labels.

Characteristics

  • No target variable to predict
  • Goal is to discover hidden patterns
  • More exploratory in nature

Types of Unsupervised Learning

Clustering: Grouping similar data points

# Example: Customer segmentation
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Sample customer data: age and income
customers = np.array([
    [25, 30000], [30, 40000], [35, 50000],
    [45, 80000], [50, 90000], [55, 100000],
    [22, 25000], [28, 35000], [32, 45000]
])

# Perform clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(customers)

# Visualize results
plt.scatter(customers[:, 0], customers[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Segmentation')
plt.show()

Dimensionality Reduction: Reducing the number of features

# Example: Data visualization with PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X = iris.data  # 4 features

# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Visualize
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset - PCA Visualization')
plt.show()

Association Rules: Finding relationships between items

# Example: Market basket analysis
# "People who buy bread and milk also buy eggs"

from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# Sample transaction data
transactions = [
    ['bread', 'milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'eggs', 'cheese'],
    ['bread', 'milk', 'butter', 'eggs'],
    ['milk', 'cheese']
]

# Convert to binary matrix
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'confidence']])

3. Reinforcement Learning

Learning through interaction and feedback.

Characteristics

  • Agent learns by taking actions in an environment
  • Receives rewards or penalties for actions
  • Goal is to maximize cumulative reward

Key Components

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • State: Current situation of the agent
  • Action: What the agent can do
  • Reward: Feedback from the environment

# Simple Q-Learning example
import numpy as np
import random

class SimpleQLearning:
    def __init__(self, states, actions, learning_rate=0.1, discount_factor=0.9):
        self.q_table = np.zeros((states, actions))
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = 0.1  # Exploration rate

    def choose_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, self.q_table.shape[1] - 1)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def update_q_table(self, state, action, reward, next_state):
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.lr * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state, action] = new_q

# Example usage in a simple grid world
agent = SimpleQLearning(states=16, actions=4)  # 4x4 grid, 4 actions (up, down, left, right)

The Machine Learning Workflow

1. Problem Definition

  • Define the business problem
  • Determine if it's classification, regression, or clustering
  • Identify success metrics

2. Data Collection and Preparation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
data = pd.read_csv('dataset.csv')

# Explore data
print(data.head())
print(data.info())
print(data.describe())

# Handle missing values
data = data.dropna()  # or impute, e.g. data.fillna(data.mean(numeric_only=True))

# Feature engineering
data['new_feature'] = data['feature1'] * data['feature2']

# Split features and target
X = data.drop('target', axis=1)
y = data['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

3. Model Selection and Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Try different models
models = {
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC()
}

# Compare models using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

4. Model Evaluation

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Train best model
best_model = RandomForestClassifier()
best_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = best_model.predict(X_test_scaled)

# Evaluate performance
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.3f}")
print(f"F1-score: {f1_score(y_test, y_pred, average='weighted'):.3f}")

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Detailed report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

5. Model Deployment and Monitoring

import joblib

# Save the trained model
joblib.dump(best_model, 'trained_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Load and use the model
loaded_model = joblib.load('trained_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')

# Make predictions on new data
new_data = [[1.2, 3.4, 5.6]]  # Example new data point
new_data_scaled = loaded_scaler.transform(new_data)
prediction = loaded_model.predict(new_data_scaled)
print(f"Prediction: {prediction[0]}")

Common Challenges in Machine Learning

1. Overfitting

When a model learns the training data too well and fails to generalize.

Solutions:

  • Use more training data
  • Simplify the model
  • Apply regularization
  • Use cross-validation
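
A sketch of the "simplify the model" and "use cross-validation" points above: comparing an unconstrained decision tree with a depth-limited one on synthetic data (generated here only for illustration). A large gap between training and cross-validation scores signals overfitting.

# Sketch: detecting overfitting by comparing training and cross-validation scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

for depth in [None, 3]:  # unconstrained vs. simplified tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_score = tree.fit(X, y).score(X, y)
    cv_score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: train={train_score:.2f}, cross-val={cv_score:.2f}")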

2. Underfitting

When a model is too simple to capture underlying patterns.

Solutions:

  • Use more complex models
  • Add more features
  • Reduce regularization
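
As a sketch of the "more complex model / more features" idea, adding a squared feature lets a linear model capture a curve it would otherwise underfit (toy quadratic data invented for illustration):

# Sketch: reducing underfitting by adding polynomial features to a linear model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.RandomState(0).normal(0, 0.5, 50)  # quadratic pattern

linear = LinearRegression().fit(X, y)                      # underfits the curve
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
quadratic = LinearRegression().fit(X_poly, y)              # captures the curve

print("Plain linear R^2:     ", round(linear.score(X, y), 2))
print("With squared feature: ", round(quadratic.score(X_poly, y), 2))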

3. Data Quality Issues

  • Missing values
  • Outliers
  • Inconsistent data
  • Biased samples
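
A few pandas one-liners are often enough for a first pass at missing values and outliers; a quick sketch (the column names and values are hypothetical):

# Sketch: a quick data-quality check with pandas (hypothetical columns)
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 45, 120],
                   "income": [30000, 40000, 50000, None, 90000]})

print(df.isnull().sum())                                    # count missing values per column
df["income"] = df["income"].fillna(df["income"].median())   # impute missing income
df = df[df["age"].between(0, 100)]                          # drop implausible ages (NaN rows are dropped too)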

4. Feature Selection

Choosing the right features is crucial:

from sklearn.feature_selection import SelectKBest, f_classif

# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features.tolist())

Evaluation Metrics

Classification Metrics

  • Accuracy: Correct predictions / Total predictions
  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
  • F1-Score: Harmonic mean of precision and recall
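
The same scikit-learn helpers used in the workflow above compute these directly; a toy sketch with made-up binary labels:

# Sketch: classification metrics on toy predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")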

Regression Metrics

  • Mean Absolute Error (MAE): Average absolute difference
  • Mean Squared Error (MSE): Average squared difference
  • Root Mean Squared Error (RMSE): Square root of MSE
  • R-squared: Proportion of variance explained

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# For regression problems
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")

Best Practices

1. Data Preparation

  • Always explore your data first
  • Handle missing values appropriately
  • Scale features when necessary
  • Create meaningful features

2. Model Development

  • Start with simple models
  • Use cross-validation for model selection
  • Don't forget about the business context
  • Document your experiments

3. Validation

  • Always use a separate test set
  • Be aware of data leakage
  • Consider the cost of different types of errors
  • Validate on real-world scenarios
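
To make the data-leakage point above concrete: preprocessing should be fit on the training split only, exactly as in the workflow code earlier. A sketch of the wrong and right patterns:

# Sketch: avoiding data leakage when scaling features
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.RandomState(0).normal(size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler would see test-set statistics before evaluation
# X_scaled = StandardScaler().fit_transform(X)

# Correct: fit on the training split only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)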

Key Takeaways

  • Machine learning automates pattern recognition from data
  • Supervised learning uses labeled data for prediction
  • Unsupervised learning discovers hidden patterns
  • Reinforcement learning learns through trial and error
  • The ML workflow involves data preparation, model training, and evaluation
  • Proper evaluation and validation are crucial for success

Practice Exercise

Choose a simple dataset (like the Iris dataset or the California Housing dataset) and:

  1. Load and explore the data
  2. Prepare the data (handle missing values, scale features)
  3. Split into training and testing sets
  4. Train at least two different models
  5. Evaluate and compare their performance
  6. Write a brief summary of your findings

Next Steps

Now that you understand machine learning basics, we'll dive into neural networks, which form the foundation of deep learning and generative AI.

Continue to: Neural Networks Fundamentals
