Machine Learning Basics

Machine Learning (ML) is the core technology driving most modern AI applications. This lesson covers the fundamental concepts, types of learning, and key algorithms that form the foundation of ML systems.

What is Machine Learning?

Machine Learning is a method of data analysis that automates analytical model building. It's based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.

Key Concepts

Data

The fuel of machine learning:

  • Features: Input variables used to make predictions
  • Labels: Output variables you're trying to predict
  • Training Data: Data used to train the model
  • Test Data: Data used to evaluate model performance

Model

A mathematical representation of a real-world process:

  • Algorithm: The method used to find patterns
  • Parameters: Values learned during training
  • Hyperparameters: Configuration settings for the algorithm

Training

The process of teaching the algorithm:

  • Fitting: Finding the best parameters
  • Optimization: Minimizing prediction errors
  • Validation: Checking performance on unseen data
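
A minimal sketch tying these terms together (the house sizes and prices below are invented purely for illustration): the sizes are features, the prices are labels, alpha is a hyperparameter chosen before training, and the coefficients that fit learns are parameters.

# Toy sketch: features, labels, hyperparameters, parameters, and validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1000], [1500], [2000], [2500], [3000], [3500]])   # features: house size (sq ft)
y = np.array([200000, 280000, 390000, 500000, 610000, 700000])   # labels: sale price

# Hold out part of the data to check performance on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = Ridge(alpha=1.0)       # alpha is a hyperparameter, set before training
model.fit(X_train, y_train)    # fitting learns the parameters from the training data

print("Learned parameters:", model.coef_, model.intercept_)
print("Validation R^2 on unseen data:", model.score(X_test, y_test))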

Types of Machine Learning

1. Supervised Learning

Learning with labeled examples.

Characteristics

  • Uses input-output pairs for training
  • Goal is to predict labels for new inputs
  • Performance can be measured against known correct answers

Types of Supervised Learning

Classification: Predicting categories or classes

# Example: Email spam detection
# Input: Email text
# Output: Spam or Not Spam

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
emails = [
    "Win money now! Click here!",
    "Meeting scheduled for tomorrow",
    "Free lottery winner! Claim prize!",
    "Project deadline reminder"
]
labels = ["spam", "not_spam", "spam", "not_spam"]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X, labels)

# Make prediction
new_email = ["Congratulations! You've won $1000!"]
X_new = vectorizer.transform(new_email)
prediction = classifier.predict(X_new)
print(f"Prediction: {prediction[0]}")

Regression: Predicting continuous numerical values

# Example: House price prediction
# Input: House features (size, location, etc.)
# Output: Price

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data: house size (sq ft) and price
house_sizes = np.array([[1000], [1500], [2000], [2500], [3000]])
prices = np.array([200000, 300000, 400000, 500000, 600000])

# Train model
model = LinearRegression()
model.fit(house_sizes, prices)

# Predict price for 1800 sq ft house
new_house = np.array([[1800]])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.2f}")

Common Supervised Learning Algorithms

  1. Linear Regression: For continuous predictions
  2. Logistic Regression: For binary classification
  3. Decision Trees: Interpretable models using if-then rules
  4. Random Forest: Ensemble of decision trees
  5. Support Vector Machines (SVM): Effective for high-dimensional data
  6. Neural Networks: Flexible models for complex patterns
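
As a small illustration of the "if-then rules" behind decision trees, the sketch below fits a shallow tree on the Iris dataset and prints its learned rules with scikit-learn's export_text helper:

# Sketch: inspecting the if-then rules learned by a decision tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned rules as readable if-then statements
print(export_text(tree, feature_names=list(iris.feature_names)))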

2. Unsupervised Learning

Learning patterns from data without labels.

Characteristics

  • No target variable to predict
  • Goal is to discover hidden patterns
  • More exploratory in nature

Types of Unsupervised Learning

Clustering: Grouping similar data points

# Example: Customer segmentation
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Sample customer data: age and income
customers = np.array([
    [25, 30000], [30, 40000], [35, 50000],
    [45, 80000], [50, 90000], [55, 100000],
    [22, 25000], [28, 35000], [32, 45000]
])

# Perform clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(customers)

# Visualize results
plt.scatter(customers[:, 0], customers[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Segmentation')
plt.show()

Dimensionality Reduction: Reducing the number of features

# Example: Data visualization with PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X = iris.data  # 4 features

# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Visualize
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset - PCA Visualization')
plt.show()

Association Rules: Finding relationships between items

# Example: Market basket analysis
# "People who buy bread and milk also buy eggs"

from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# Sample transaction data
transactions = [
    ['bread', 'milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'eggs', 'cheese'],
    ['bread', 'milk', 'butter', 'eggs'],
    ['milk', 'cheese']
]

# Convert to binary matrix
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'confidence']])

3. Reinforcement Learning

Learning through interaction and feedback.

Characteristics

  • Agent learns by taking actions in an environment
  • Receives rewards or penalties for actions
  • Goal is to maximize cumulative reward

Key Components

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • State: Current situation of the agent
  • Action: What the agent can do
  • Reward: Feedback from the environment

# Simple Q-Learning example
import numpy as np
import random

class SimpleQLearning:
    def __init__(self, states, actions, learning_rate=0.1, discount_factor=0.9):
        self.q_table = np.zeros((states, actions))
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = 0.1  # Exploration rate

    def choose_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, self.q_table.shape[1] - 1)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def update_q_table(self, state, action, reward, next_state):
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.lr * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state, action] = new_q

# Example usage in a simple grid world
agent = SimpleQLearning(states=16, actions=4)  # 4x4 grid, 4 actions (up, down, left, right)

The Machine Learning Workflow

1. Problem Definition

  • Define the business problem
  • Determine if it's classification, regression, or clustering
  • Identify success metrics

2. Data Collection and Preparation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
data = pd.read_csv('dataset.csv')

# Explore data
print(data.head())
print(data.info())
print(data.describe())

# Handle missing values
data = data.dropna()  # or impute, e.g. data.fillna(data.mean(numeric_only=True))

# Feature engineering
data['new_feature'] = data['feature1'] * data['feature2']

# Split features and target
X = data.drop('target', axis=1)
y = data['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

3. Model Selection and Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Try different models
models = {
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC()
}

# Compare models using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

4. Model Evaluation

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Train best model
best_model = RandomForestClassifier()
best_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = best_model.predict(X_test_scaled)

# Evaluate performance
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.3f}")
print(f"F1-score: {f1_score(y_test, y_pred, average='weighted'):.3f}")

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Detailed report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

5. Model Deployment and Monitoring

import joblib

# Save the trained model
joblib.dump(best_model, 'trained_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Load and use the model
loaded_model = joblib.load('trained_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')

# Make predictions on new data
new_data = [[1.2, 3.4, 5.6]]  # Example new data point
new_data_scaled = loaded_scaler.transform(new_data)
prediction = loaded_model.predict(new_data_scaled)
print(f"Prediction: {prediction[0]}")

Common Challenges in Machine Learning

1. Overfitting

When a model learns the training data too well and fails to generalize.

Solutions:

  • Use more training data
  • Simplify the model
  • Apply regularization
  • Use cross-validation
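
A sketch of the "simplify the model" and "use cross-validation" points above: comparing an unconstrained decision tree with a depth-limited one on synthetic data (generated here only for illustration). A large gap between training and cross-validation scores signals overfitting.

# Sketch: detecting overfitting by comparing training and cross-validation scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

for depth in [None, 3]:  # unconstrained vs. simplified tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_score = tree.fit(X, y).score(X, y)
    cv_score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: train={train_score:.2f}, cross-val={cv_score:.2f}")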

2. Underfitting

When a model is too simple to capture underlying patterns.

Solutions:

  • Use more complex models
  • Add more features
  • Reduce regularization
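
As a sketch of the "more complex model / more features" idea, adding a squared feature lets a linear model capture a curve it would otherwise underfit (toy quadratic data invented for illustration):

# Sketch: reducing underfitting by adding polynomial features to a linear model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.RandomState(0).normal(0, 0.5, 50)  # quadratic pattern

linear = LinearRegression().fit(X, y)                      # underfits the curve
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
quadratic = LinearRegression().fit(X_poly, y)              # captures the curve

print("Plain linear R^2:     ", round(linear.score(X, y), 2))
print("With squared feature: ", round(quadratic.score(X_poly, y), 2))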

3. Data Quality Issues

  • Missing values
  • Outliers
  • Inconsistent data
  • Biased samples
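
A few pandas one-liners are often enough for a first pass at missing values and outliers; a quick sketch (the column names and values are hypothetical):

# Sketch: a quick data-quality check with pandas (hypothetical columns)
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 45, 120],
                   "income": [30000, 40000, 50000, None, 90000]})

print(df.isnull().sum())                                    # count missing values per column
df["income"] = df["income"].fillna(df["income"].median())   # impute missing income
df = df[df["age"].between(0, 100)]                          # drop implausible ages (NaN rows are dropped too)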

4. Feature Selection

Choosing the right features is crucial:

from sklearn.feature_selection import SelectKBest, f_classif

# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features.tolist())

Evaluation Metrics

Classification Metrics

  • Accuracy: Correct predictions / Total predictions
  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
  • F1-Score: Harmonic mean of precision and recall
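
The same scikit-learn helpers used in the workflow above compute these directly; a toy sketch with made-up binary labels:

# Sketch: classification metrics on toy predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")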

Regression Metrics

  • Mean Absolute Error (MAE): Average absolute difference
  • Mean Squared Error (MSE): Average squared difference
  • Root Mean Squared Error (RMSE): Square root of MSE
  • R-squared: Proportion of variance explained

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# For regression problems
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")

Best Practices

1. Data Preparation

  • Always explore your data first
  • Handle missing values appropriately
  • Scale features when necessary
  • Create meaningful features

2. Model Development

  • Start with simple models
  • Use cross-validation for model selection
  • Don't forget about the business context
  • Document your experiments

3. Validation

  • Always use a separate test set
  • Be aware of data leakage
  • Consider the cost of different types of errors
  • Validate on real-world scenarios
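
To make the data-leakage point above concrete: preprocessing should be fit on the training split only, exactly as in the workflow code earlier. A sketch of the wrong and right patterns:

# Sketch: avoiding data leakage when scaling features
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.RandomState(0).normal(size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler would see test-set statistics before evaluation
# X_scaled = StandardScaler().fit_transform(X)

# Correct: fit on the training split only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)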

Key Takeaways

  • Machine learning automates pattern recognition from data
  • Supervised learning uses labeled data for prediction
  • Unsupervised learning discovers hidden patterns
  • Reinforcement learning learns through trial and error
  • The ML workflow involves data preparation, model training, and evaluation
  • Proper evaluation and validation are crucial for success

Practice Exercise

Choose a simple dataset (like the Iris dataset or the California Housing dataset) and:

  1. Load and explore the data
  2. Prepare the data (handle missing values, scale features)
  3. Split into training and testing sets
  4. Train at least two different models
  5. Evaluate and compare their performance
  6. Write a brief summary of your findings

Next Steps

Now that you understand machine learning basics, we'll dive into neural networks, which form the foundation of deep learning and generative AI.

Continue to: Neural Networks Fundamentals
