1.2 Activation Functions
Activation functions are essential in neural networks: they bring the "neural" to neural networks. Without them, your model would be limited to learning only linear relationships, which isn't useful for solving real-world problems like image recognition, natural language understanding, or nonlinear decision boundaries.
Activation functions play a crucial role in neural networks by introducing non-linearities into the model, enabling it to learn complex patterns and relationships in the data. Without activation functions, a neural network, regardless of how many layers it has, would essentially behave like a linear model, limiting its ability to capture and model complex structures in data (a product of several weight matrices is still just a single weight matrix). In this sense, activation functions are what enable deep learning.
Activation functions also help control the flow of data and gradients. The choice of activation function and its derivative heavily influences back-propagation, because the function appears in every hidden layer. Certain functions like the Rectified Linear Unit (ReLU) can mitigate issues like the vanishing gradient problem, making it easier to train deep networks.
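To see the "stacked linear layers collapse into one" point concretely, here is a minimal NumPy sketch (the shapes and random values are arbitrary choices for illustration): two linear layers with no activation in between compute exactly the same function as a single linear layer with the combined weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))      # batch of 5 inputs, 3 features
W1 = rng.normal(size=(3, 4))     # first "layer" weights
W2 = rng.normal(size=(4, 2))     # second "layer" weights

# Two linear layers stacked, with no activation in between
out_two_layers = (x @ W1) @ W2

# A single linear layer with the combined weight matrix
out_one_layer = x @ (W1 @ W2)

print(np.allclose(out_two_layers, out_one_layer))  # True: no extra expressive power
```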
1. Controlling the Flow of Data
Activation functions control the flow of data via gradients in neural networks. They do this in two key ways:
1.1 Controlling Gradient Magnitude (Vanishing & Exploding Gradients)
Activation functions affect the gradient during backpropagation, which determines how much the weights are updated.
- Vanishing Gradient Problem (common in sigmoid/tanh):
  - When gradients become too small, learning slows down because weight updates are negligible.
  - This happens because sigmoid and tanh squash inputs into a small output range (e.g., (0, 1) for sigmoid), making their derivatives tiny.
- Exploding Gradient Problem (can happen in deep networks with unbounded activations):
  - When gradients become too large, training becomes unstable.
  - This happens when activations amplify inputs too much (e.g., using very large initial weights with ReLU).
ReLU mitigates vanishing gradients by keeping gradients at 1 for positive values.
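As a rough illustration of the gradient-magnitude point, this plain-NumPy sketch (with arbitrary sample inputs) compares the derivative of sigmoid, which is at most 0.25 and shrinks toward zero for large |x|, with the derivative of ReLU, which is exactly 1 for positive inputs. Multiplying many small derivatives across layers is what shrinks the gradient during backpropagation.

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # peaks at 0.25, shrinks toward 0 for large |x|

relu_grad = (x > 0).astype(float)          # exactly 1 for positive inputs, 0 otherwise

print("sigmoid'(x):", np.round(sigmoid_grad, 4))
print("relu'(x):   ", relu_grad)
```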
1.2 Acting as a Gate for Information Flow
Activation functions decide which neurons fire (i.e., send information forward) and which don't.
- ReLU: Sets negative values to zero, meaning some neurons "turn off." This creates sparse activations, making learning more efficient.
- Leaky ReLU / PReLU: Allow a small gradient for negative inputs to avoid "dying neurons."
- Sigmoid: Outputs values between 0 and 1, often interpreted as a probability (useful for classification).
- Softmax: Ensures output neurons sum to 1, distributing importance among classes.
Activation functions regulate the gradient flow, ensuring the model learns effectively without instability.
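The gating behavior above is easy to see on a small example vector. This sketch uses TensorFlow's built-in tf.nn functions on arbitrary sample values.

```python
import tensorflow as tf

z = tf.constant([-2.0, -0.5, 0.0, 1.0, 3.0])

print(tf.nn.relu(z).numpy())      # negatives are zeroed out: some "neurons" turn off
print(tf.nn.sigmoid(z).numpy())   # every value squashed into (0, 1)
print(tf.nn.softmax(z).numpy())   # non-negative values that sum to 1 (a distribution)
print(tf.reduce_sum(tf.nn.softmax(z)).numpy())  # ~1.0
```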
1.3 Summary: Activation Functions and Gradient Flow
| Activation Function | Prevents Vanishing Gradient? | Prevents Exploding Gradient? | Controls Data Flow? |
|---|---|---|---|
| ReLU (✅ default choice) | ✅ Yes (for positive inputs) | ❌ No (can explode with large weights) | ✅ Turns off neurons (sparsity) |
| Leaky ReLU | ✅ Yes (keeps a small gradient for negatives) | ❌ No | ✅ Allows negative values to pass |
| Sigmoid | ❌ No (vanishing gradient for extreme values) | ✅ Yes (bounded output) | ✅ Squashes values into (0, 1) |
| Tanh | ❌ No (still vanishes, but better than sigmoid) | ✅ Yes | ✅ Maps values to (-1, 1) |
| Softmax | ✅ Yes (for classification tasks) | ✅ Yes | ✅ Controls the output probability distribution |
2. With vs. Without an Activation Function
Let's see the difference in performance with vs. without an activation function. We'll build and train two small neural networks on synthetic data:
- Model A: With ReLU activation.
- Model B: No activation (i.e., linear model).
Both models are built in Python with TensorFlow/Keras.
2.1 Experiment Setup
Problem:
A simple nonlinear function: y = sin(x), sampled on the interval [-2π, 2π].
Train a neural network to learn this function.
Code Demo
```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Generate synthetic data
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
y = np.sin(x)

x_train = x.reshape(-1, 1)
y_train = y.reshape(-1, 1)

# Model A: With ReLU activation
model_relu = Sequential([
    Dense(64, activation='relu', input_shape=(1,)),
    Dense(64, activation='relu'),
    Dense(1)  # Output layer
])
model_relu.compile(optimizer='adam', loss='mse')

# Model B: Without activation (just linear layers)
model_linear = Sequential([
    Dense(64, activation=None, input_shape=(1,)),
    Dense(64, activation=None),
    Dense(1)
])
model_linear.compile(optimizer='adam', loss='mse')

# Train both models
model_relu.fit(x_train, y_train, epochs=100, verbose=0)
model_linear.fit(x_train, y_train, epochs=100, verbose=0)

# Predict
y_pred_relu = model_relu.predict(x_train)
y_pred_linear = model_linear.predict(x_train)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(x, y, label='True Function (sin(x))', color='black', linewidth=2)
plt.plot(x, y_pred_relu, label='Model A (with ReLU)', color='blue')
plt.plot(x, y_pred_linear, label='Model B (no activation)', color='red', linestyle='dashed')
plt.legend()
plt.title("Neural Network with vs. without Activation Function")
plt.grid(True)
plt.show()
```
2.2 What You'll See
- The ReLU model learns the sine wave quite well: it captures the non-linear shape.
- The linear model struggles: no matter how many linear layers it stacks, it can only produce a straight line, completely missing the oscillation of the sine wave (the quick check below puts numbers on this).
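If you want a number to go with the plot, one quick check is to compare the final mean-squared error of the two models. This assumes the model_relu, model_linear, x_train, and y_train objects from the demo above are still in scope; the ReLU model should come out far lower.

```python
# Compare final training MSE of both models (loss is 'mse', so evaluate returns it directly)
mse_relu = model_relu.evaluate(x_train, y_train, verbose=0)
mse_linear = model_linear.evaluate(x_train, y_train, verbose=0)

print(f"MSE with ReLU:        {mse_relu:.4f}")
print(f"MSE without (linear): {mse_linear:.4f}")
```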
3. ReLU Activation
The ReLU (Rectified Linear Unit) activation function has become the default choice in most deep learning architectures, and for good reason. Here's why ReLU is so popular and widely used:
What is ReLU?
ReLU stands for Rectified Linear Unit and is defined as ReLU(x) = max(0, x).
It outputs:
- 0 if the input is negative,
- The input itself if it's positive.
Simple, but super effective.
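In code it is a one-liner. Here is a minimal sketch with a hand-rolled NumPy version next to the built-in Keras activation, using arbitrary sample inputs.

```python
import numpy as np
import tensorflow as tf

def relu(x):
    # max(0, x) applied element-wise
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(x))                                   # [0.  0.  0.  0.1 3. ]
print(tf.keras.activations.relu(x).numpy())      # same result via Keras
```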
Why ReLU is Popular
1. Computational Simplicity
- Very fast to compute: just a comparison with zero.
- No complex math like exponentials or divisions (as in sigmoid/tanh).
- Ideal for large-scale models and real-time applications.
2. Breaks the Linearity
- Introduces non-linearity, which allows the network to learn complex patterns.
- Without this, stacking multiple layers wouldn't be useful (you'd just end up with another linear function).
3. Helps with the Vanishing Gradient Problem
- Sigmoid and tanh squish input values into small ranges, which can make gradients tiny during backpropagation (leading to slow or stalled learning).
- ReLU doesn't squash its input: its gradient is 1 for positive values, so gradients stay healthy and learning is faster.
4. Sparse Activation (Efficient)
- ReLU outputs zero for negative inputs, meaning many neurons are inactive at a given time.
- This sparsity makes the network more efficient and less prone to overfitting (see the short sketch after this list).
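As a quick sanity check of points 3 and 4, this short TensorFlow sketch (arbitrary inputs, standard-normal pre-activations) uses tf.GradientTape to confirm that ReLU's gradient is 1 for positive inputs and 0 otherwise, and then measures how many activations get zeroed out on a random batch.

```python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.5, 2.0])

# Gradient of ReLU: exactly 1 for positive inputs, 0 for negative inputs
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.relu(x)
print(tape.gradient(y, x).numpy())  # [0. 0. 1. 1.]

# Sparsity: a random pre-activation batch ends up roughly half zeros after ReLU
z = tf.random.normal((1000,))
activations = tf.nn.relu(z)
sparsity = tf.reduce_mean(tf.cast(tf.equal(activations, 0.0), tf.float32))
print(f"Fraction of zeroed activations: {sparsity.numpy():.2f}")  # ~0.5
```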
:::caution[⚠️ Caveat: The "Dying ReLU" Problem]
- If too many neurons output zero (never activate), they might stop learning; we say they are "dead".
- Solutions:
- Use Leaky ReLU: Allows a small, non-zero gradient for negative inputs.
- Parametric ReLU (PReLU) or ELU are other alternatives.
:::
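If the demo model from Section 2 showed dying-ReLU symptoms, one way to try these alternatives is to swap the hidden-layer activations. The sketch below is one possible variant of Model A; the 0.01 slope is just a common default, not a tuned value.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

# Same architecture as Model A, but negative inputs keep a small gradient
model_leaky = Sequential([
    Dense(64, input_shape=(1,)),
    LeakyReLU(alpha=0.01),
    Dense(64),
    LeakyReLU(alpha=0.01),
    Dense(1)
])
# ELU is another built-in option: Dense(64, activation='elu')
model_leaky.compile(optimizer='adam', loss='mse')
```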
ReLU vs. Others
| Function | Speed | Handles Vanishing Gradient? | Output Range | Non-zero Mean Output? | Common Use |
|---|---|---|---|---|---|
| ReLU | ✅ Fast | ✅ Yes | [0, ∞) | Yes | Hidden layers |
| Sigmoid | ❌ Slow | ❌ No | (0, 1) | Yes (not zero-centered) | Output layer (binary classification) |
| Tanh | ❌ Slow | ❌ No | (-1, 1) | No (zero-centered) | Rarely in feed-forward hidden layers (still common inside RNN/LSTM cells) |
| Leaky ReLU | ✅ Fast | ✅ Yes | (-∞, ∞) | Yes (slightly) | Hidden layers (alternative to ReLU) |