1. PyTorch
GradScaler in PyTorch is a utility that helps with mixed-precision training. It scales gradients during backpropagation to avoid underflow issues that can occur with low-precision (float16) calculations. It enables models to use less memory and run faster by using half-precision while maintaining stability.
Here’s an example using GradScaler:
import torch
from torch.cuda.amp import GradScaler, autocast

# Model, optimizer, and loss function
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

# Initialize GradScaler
scaler = GradScaler()

for data, target in train_loader:
    data, target = data.cuda(), target.cuda()

    # Mixed precision forward pass
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Scales loss, then does backprop
    scaler.scale(loss).backward()

    # Optimizer step and scaler update
    scaler.step(optimizer)
    scaler.update()

    # Clear gradients
    optimizer.zero_grad()
In this code:
- autocast() enables mixed precision for the forward pass and loss calculation.
- scaler.scale(loss) scales the loss before backprop, so the resulting gradients stay large enough to avoid underflow.
- scaler.update() adjusts the scaling factor dynamically.
This helps you speed up training without loss of numerical stability.
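A common extension of this loop is gradient clipping. If you clip, the gradients should be unscaled first so the clipping threshold applies to their true magnitudes; PyTorch provides scaler.unscale_(optimizer) for exactly this. A sketch of that pattern, using a hypothetical max_norm of 1.0, as the body of the training loop above:

    # Scaled backward pass as before
    scaler.scale(loss).backward()

    # Unscale gradients in place so clipping sees their true magnitudes
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # scaler.step() detects that gradients were already unscaled and does not unscale again
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()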
=====================================
Gradient underflow occurs when the values of gradients become so small that they are rounded down to zero due to the limited precision of the floating-point representation. This is especially a concern in low-precision formats like float16, which have a smaller range of representable numbers compared to float32 or float64.
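You can see this directly by casting a tiny value to float16 and watching it collapse to zero. The snippet below is a small illustration (the exact printed values may vary slightly):

import torch

# float16 cannot represent values much below ~6e-8, so a tiny gradient-like
# value that is fine in float32 rounds to zero in half precision
tiny = torch.tensor(1e-9, dtype=torch.float32)
print(tiny)                    # tensor(1.0000e-09)
print(tiny.to(torch.float16))  # tensor(0., dtype=torch.float16) -- underflow

# Scaling the value up first keeps it representable in float16
scale = 2.0 ** 16
print((tiny * scale).to(torch.float16))  # roughly 6.55e-05 in float16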
In deep learning, gradients are used to update the model’s parameters. If gradients underflow (become too close to zero), they can’t effectively contribute to the model updates, leading to stalled or inefficient learning. This is why tools like GradScaler are used in mixed-precision training to prevent underflow by scaling up gradients temporarily, ensuring they stay within a representable range.
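To make that mechanism concrete, here is a minimal sketch of the idea behind loss scaling. This is not GradScaler’s actual implementation (which also skips steps and lowers the scale when it detects infs/NaNs); it only shows the scale-then-unscale cycle: multiply the loss by a large constant before backward, then divide the gradients back down before the optimizer step.

import torch

def manually_scaled_step(loss, optimizer, scale=2.0 ** 16):
    # Backprop through the scaled loss: every gradient is multiplied by `scale`,
    # so values that would underflow in float16 stay representable.
    (loss * scale).backward()

    # Unscale the gradients so the optimizer applies the true update magnitudes.
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is not None:
                p.grad.div_(scale)

    optimizer.step()
    optimizer.zero_grad()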