Polynomial regression with PyTorch: LBFGS vs Adam


Tags: ml pytorch

This is my second post comparing the LBFGS optimizer with the Adam optimizer for small datasets, and shallow models. In the last post I had discussed linear regression with PyTorch. Polynomial regression is a generalization of that where instead of fitting a line to the data, we fit a polynomial curve. In some sense a polynomial regression can be thought of as a special case of a multidimensional linear regression. A multidimensional linear regression is given by \(y = β_0 + β_1 x_1 + ⋯ + β_n x_n\), where the output variable \(y\) is linear in the predictor variables \(x_1, …, x_n\). For a polynomial regression of degree \(n\): \(x_1 = x, x_{i \in \{2,…,n\}} = x^i\). To be clear this does not mean that a polynomial regression, of degree \(n > 1\) is linear, because it is not. The difference between an \(n\)-dimensional linear regression and an \(n\)-th degree polynomial regression is that for the linear regression we expect that the predictor variables are uncorrelated, whereas for the polynomial regression it is clearly the opposite. 1 The computer is not aware of this difference between the interpretations of the two models. Hence the code for a multidimensional linear regression model, and a polynomial regression model are practically the same.

I will do the benchmark with two functions: \(y = x^3 + 2x^2 - 3x + 5\), and \(y = \sin{(x)}\). 2 The first function is a cubic polynomial, and so the expectation is that after adequate training a cubic polynomial regression model will be able to exactly fit this function. The Taylor series expansion of \(\sin{(x)}\) is \(\sin{(x)} = x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 - \frac{1}{7!}x^7 + ⋯\). At \(x = \frac{\pi}{6}\), with just the first two terms of the series the accuracy is up to the third decimal point, and with just the first three terms the accuracy of up to the fifth decimal point. A cubic or a quintic (degree = 5) polynomial regression model should be able to adequately approximate \(\sin{(x)}\), up to the desired accuracy. The caveat is that the higher the degree the more computations there will be, particularly for the LBFGS optimizer since it computes the function multiple times during each training step. And with more computations there are more chances of error propagation. Also, and more important from the point of view of predictions, higher degree models are more likely to overfit, particularly with small datasets.

PolynomialModel class

The PolynomialModel class is very similar to the LinearModel class, except that the linear layer is now multidimensional. The original LinearModel class is equivalent to PolynomialModel(degree=1).

import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import Adam, LBFGS, SGD
from torch.utils.data import Dataset, DataLoader

class PolynomialModel(nn.Module):
    def __init__(self, degree):
        self._degree = degree
        self.linear = nn.Linear(self._degree, 1)

    def forward(self, x):
        return self.linear(self._polynomial_features(x))

    def _polynomial_features(self, x):
        x = x.unsqueeze(1)
        return torch.cat([x ** i for i in range(1, self._degree + 1)], 1)

Dummy data

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float
x = torch.linspace(-math.pi, math.pi, steps=20, dtype=dtype, device=device)
cubic_y = x ** 3 + 2 * x ** 2 - 3 * x + 5
sine_y = torch.sin(x)

The DummyData class to provide the data during training.

class DummyData(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


It is best to write a function to train the models, to avoid code duplication. I also toyed with the idea to add a fit function to the PolynomialModel class following the scikit-learn API, pm.fit(data) makes it very clear what it is that you are doing. But finally decided against that to keep things simple. 3

def train_step(model, data, optimizer, criterion):
    running_loss = 0.0

    for i, (x, y) in enumerate(data):

        x_ = Variable(x, requires_grad=True)
        y_ = Variable(y.unsqueeze(1)) # unsqueeze to match dimensions with y_pred later

        def closure():
            # Zero gradients

            # Forward pass
            y_pred = model(x_)

            # Compute loss
            loss = criterion(y_pred, y_)

            # Backward pass

            return loss

        # Update weights

        # Update the running loss
        loss = closure()
        running_loss += loss.item()
    return running_loss

First I will train cubic polynomial regression models for the cubic polynomial function - with both Adam and LBFGS. Then I will do the same for the sine function.

cubic_data = DummyData(x.reshape(-1, 1), cubic_y.reshape(-1, 1))
sine_data = DummyData(x.reshape(-1, 1), sine_y.reshape(-1, 1))
criterion = nn.MSELoss()

Cubic polynomial

With Adam

With the linear model it was clear that Adam requires a significantly large number of epochs than LBFGS to converge to a similar accuracy. To do the same comparison, I will train two models with Adam - the first for 20 epochs and the second for 2000 epochs.

pm_cubic_adam_20 = PolynomialModel(degree=3)
optimizer = Adam(pm_cubic_adam_20.parameters(), weight_decay=0.0001)

for epoch in range(20):
    running_loss = train_step(model=pm_cubic_adam_20,
    print(f"Epoch: {epoch + 1:02}/20 Loss: {running_loss:.5e}")
Epoch: 01/20 Loss: 4.85275e+03
Epoch: 02/20 Loss: 4.77171e+03
Epoch: 03/20 Loss: 4.71236e+03
Epoch: 04/20 Loss: 4.65514e+03
Epoch: 05/20 Loss: 4.59920e+03
Epoch: 06/20 Loss: 4.54425e+03
Epoch: 07/20 Loss: 4.49016e+03
Epoch: 08/20 Loss: 4.43686e+03
Epoch: 09/20 Loss: 4.38428e+03
Epoch: 10/20 Loss: 4.33239e+03
Epoch: 11/20 Loss: 4.28116e+03
Epoch: 12/20 Loss: 4.23057e+03
Epoch: 13/20 Loss: 4.18059e+03
Epoch: 14/20 Loss: 4.13121e+03
Epoch: 15/20 Loss: 4.08242e+03
Epoch: 16/20 Loss: 4.03421e+03
Epoch: 17/20 Loss: 3.98655e+03
Epoch: 18/20 Loss: 3.93945e+03
Epoch: 19/20 Loss: 3.89289e+03
Epoch: 20/20 Loss: 3.84687e+03
pm_cubic_adam_2000 = PolynomialModel(degree=3)
optimizer = Adam(pm_cubic_adam_2000.parameters(), weight_decay=0.0001)

for epoch in range(2000):
    running_loss = train_step(model=pm_cubic_adam_2000,
    if epoch % 100 == 99:
        print(f"Epoch: {epoch + 1:04}/2000 Loss: {running_loss:.5e}")
Epoch: 0100/2000 Loss: 1.38251e+03
Epoch: 0200/2000 Loss: 3.20796e+02
Epoch: 0300/2000 Loss: 1.32317e+02
Epoch: 0400/2000 Loss: 8.82815e+01
Epoch: 0500/2000 Loss: 5.17755e+01
Epoch: 0600/2000 Loss: 2.55819e+01
Epoch: 0700/2000 Loss: 9.56498e+00
Epoch: 0800/2000 Loss: 2.03360e+00
Epoch: 0900/2000 Loss: 9.99046e-02
Epoch: 1000/2000 Loss: 5.93547e-05
Epoch: 1100/2000 Loss: 2.03925e-01
Epoch: 1200/2000 Loss: 1.83275e-01
Epoch: 1300/2000 Loss: 1.47286e-03
Epoch: 1400/2000 Loss: 2.99433e-02
Epoch: 1500/2000 Loss: 5.84062e-02
Epoch: 1600/2000 Loss: 4.87064e-02
Epoch: 1700/2000 Loss: 3.86544e-02
Epoch: 1800/2000 Loss: 3.36865e-02
Epoch: 1900/2000 Loss: 3.26679e-02
Epoch: 2000/2000 Loss: 3.38537e-02


pm_cubic_lbfgs_20 = PolynomialModel(degree=3)
optimizer = LBFGS(pm_cubic_lbfgs_20.parameters(), history_size=10, max_iter=4)

for epoch in range(20):
    running_loss = train_step(model=pm_cubic_lbfgs_20,
    print(f"Epoch: {epoch + 1:02}/20 Loss: {running_loss:.5e}")
Epoch: 01/20 Loss: 1.44156e+01
Epoch: 02/20 Loss: 8.98075e+03
Epoch: 03/20 Loss: 1.16285e+03
Epoch: 04/20 Loss: 2.40442e+06
Epoch: 05/20 Loss: 6.01252e+03
Epoch: 06/20 Loss: 5.28824e+02
Epoch: 07/20 Loss: 4.71262e+03
Epoch: 08/20 Loss: 1.65953e+03
Epoch: 09/20 Loss: 8.75217e-01
Epoch: 10/20 Loss: 6.33097e-03
Epoch: 11/20 Loss: 1.11079e-01
Epoch: 12/20 Loss: 3.15055e-02
Epoch: 13/20 Loss: 1.18033e-03
Epoch: 14/20 Loss: 1.33972e-03
Epoch: 15/20 Loss: 9.55270e-04
Epoch: 16/20 Loss: 1.48802e-02
Epoch: 17/20 Loss: 2.16993e-02
Epoch: 18/20 Loss: 4.92419e-04
Epoch: 19/20 Loss: 2.54595e-04
Epoch: 20/20 Loss: 5.30092e-04

Sine function

With Adam

pm_sine_adam_20 = PolynomialModel(degree=3)
optimizer = Adam(pm_sine_adam_20.parameters(), weight_decay=0.0001)

for epoch in range(20):
    running_loss = train_step(model=pm_sine_adam_20,
    print(f"Epoch: {epoch + 1:02}/20 Loss: {running_loss:.5e}")
Epoch: 01/20 Loss: 2.63689e+02
Epoch: 02/20 Loss: 2.43146e+02
Epoch: 03/20 Loss: 2.24795e+02
Epoch: 04/20 Loss: 2.07180e+02
Epoch: 05/20 Loss: 1.90500e+02
Epoch: 06/20 Loss: 1.74870e+02
Epoch: 07/20 Loss: 1.60323e+02
Epoch: 08/20 Loss: 1.46851e+02
Epoch: 09/20 Loss: 1.34422e+02
Epoch: 10/20 Loss: 1.22993e+02
Epoch: 11/20 Loss: 1.12515e+02
Epoch: 12/20 Loss: 1.02935e+02
Epoch: 13/20 Loss: 9.41970e+01
Epoch: 14/20 Loss: 8.62471e+01
Epoch: 15/20 Loss: 7.90301e+01
Epoch: 16/20 Loss: 7.24925e+01
Epoch: 17/20 Loss: 6.65820e+01
Epoch: 18/20 Loss: 6.12486e+01
Epoch: 19/20 Loss: 5.64440e+01
Epoch: 20/20 Loss: 5.21224e+01
pm_sine_adam_2000 = PolynomialModel(degree=3)
optimizer = Adam(pm_sine_adam_2000.parameters(), weight_decay=0.0001)

for epoch in range(2000):
    running_loss = train_step(model=pm_sine_adam_2000,
    if epoch % 100 == 99:
        print(f"Epoch: {epoch + 1:04}/2000 Loss: {running_loss:.5e}")
Epoch: 0100/2000 Loss: 1.51636e+01
Epoch: 0200/2000 Loss: 6.18430e+00
Epoch: 0300/2000 Loss: 1.15440e+00
Epoch: 0400/2000 Loss: 1.41171e-01
Epoch: 0500/2000 Loss: 1.24388e-01
Epoch: 0600/2000 Loss: 1.24824e-01
Epoch: 0700/2000 Loss: 1.24891e-01
Epoch: 0800/2000 Loss: 1.24899e-01
Epoch: 0900/2000 Loss: 1.24901e-01
Epoch: 1000/2000 Loss: 1.24901e-01
Epoch: 1100/2000 Loss: 1.24901e-01
Epoch: 1200/2000 Loss: 1.24901e-01
Epoch: 1300/2000 Loss: 1.24901e-01
Epoch: 1400/2000 Loss: 1.24901e-01
Epoch: 1500/2000 Loss: 1.24901e-01
Epoch: 1600/2000 Loss: 1.24901e-01
Epoch: 1700/2000 Loss: 1.24901e-01
Epoch: 1800/2000 Loss: 1.24901e-01
Epoch: 1900/2000 Loss: 1.24901e-01
Epoch: 2000/2000 Loss: 1.24901e-01


pm_sine_lbfgs_20 = PolynomialModel(degree=3)
optimizer = LBFGS(pm_sine_lbfgs_20.parameters(), history_size=10, max_iter=4)

for epoch in range(20):
    running_loss = train_step(model=pm_sine_lbfgs_20,
    print(f"Epoch: {epoch + 1:02}/20 Loss: {running_loss:.5e}")
Epoch: 01/20 Loss: 6.60091e+00
Epoch: 02/20 Loss: 2.71961e-02
Epoch: 03/20 Loss: 2.06060e-02
Epoch: 04/20 Loss: 1.06985e+02
Epoch: 05/20 Loss: 1.59497e+00
Epoch: 06/20 Loss: 2.08394e+01
Epoch: 07/20 Loss: 3.93797e+00
Epoch: 08/20 Loss: 1.34833e-01
Epoch: 09/20 Loss: 3.58053e-01
Epoch: 10/20 Loss: 1.70789e+00
Epoch: 11/20 Loss: 2.49742e+02
Epoch: 12/20 Loss: 1.63955e+00
Epoch: 13/20 Loss: 1.34608e+00
Epoch: 14/20 Loss: 4.64407e+01
Epoch: 15/20 Loss: 2.57760e-02
Epoch: 16/20 Loss: 1.78422e-01
Epoch: 17/20 Loss: 2.25476e-01
Epoch: 18/20 Loss: 1.61440e-01
Epoch: 19/20 Loss: 1.07994e-01
Epoch: 20/20 Loss: 3.09775e-01


For the cubic function, the performance of the two optimizers followed the same trend as in the case of the linear function - Adam took 900 to 1000 epochs to give the same accuracy that LBFGS gave in 20 epochs. This is not surprising, since the computer practically treats the cubic function as a three dimensional linear function (i.e. it treats \(x^3\), \(x^2\), and \(x\) as independent variables). On the other hand, in the case of the sine function, the verdict is not clear. Both optimizers give similar performance. Surprisingly (at least to me) LBFGS jumps around quite a bit, while Adam is quite stable.

import matplotlib.pyplot as plt

x_for_pred = torch.linspace(-math.pi, math.pi, steps=60,
                            device=device, dtype=dtype)
with torch.no_grad():
    cubic_y_pred_adam_20 = pm_cubic_adam_20(Variable(x_for_pred)).cpu().data.numpy()
    cubic_y_pred_adam_2000 = pm_cubic_adam_2000(Variable(x_for_pred)).cpu().data.numpy()
    cubic_y_pred_lbfgs_20 = pm_cubic_lbfgs_20(Variable(x_for_pred)).cpu().data.numpy()
    sine_y_pred_adam_20 = pm_sine_adam_20(Variable(x_for_pred)).cpu().data.numpy()
    sine_y_pred_adam_2000 = pm_sine_adam_2000(Variable(x_for_pred)).cpu().data.numpy()
    sine_y_pred_lbfgs_20 = pm_sine_lbfgs_20(Variable(x_for_pred)).cpu().data.numpy()

# Training data
num_training_points = x.shape[0]
x_plot = x.reshape(num_training_points,).cpu().data.numpy()
sine_y_plot = sine_y.reshape(num_training_points,).cpu().data.numpy()
cubic_y_plot = cubic_y.reshape(num_training_points,).cpu().data.numpy()

# Prediction data
num_pred_points = x_for_pred.shape[0]
x_for_pred = x_for_pred.reshape(num_pred_points,).cpu().data.numpy()
cubic_y_pred_adam_20 = cubic_y_pred_adam_20.reshape(num_pred_points,)
cubic_y_pred_adam_2000 = cubic_y_pred_adam_2000.reshape(num_pred_points,)
cubic_y_pred_lbfgs_20 = cubic_y_pred_lbfgs_20.reshape(num_pred_points,)
sine_y_pred_adam_20 = sine_y_pred_adam_20.reshape(num_pred_points,)
sine_y_pred_adam_2000 = sine_y_pred_adam_2000.reshape(num_pred_points,)
sine_y_pred_lbfgs_20 = sine_y_pred_lbfgs_20.reshape(num_pred_points,)

fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12, 10), squeeze=True)

ax1.plot(x_for_pred, cubic_y_pred_adam_20, "r", label="Predictions with Adam (20 training epochs)", alpha=0.4, lw=2)
ax1.plot(x_for_pred, cubic_y_pred_adam_2000, "r:", label="Predictions with Adam (2000 training epochs)", alpha=0.8, lw=2)
ax1.plot(x_for_pred, cubic_y_pred_lbfgs_20, "b", label="Predictions with LBFGS (20 training epochs)", alpha=0.4, lw=2)
ax1.plot(x_plot, cubic_y_plot, "ko", label="True data")
ax1.set_ylabel(r"$x^3 + 2x^2 - 3x + 5$", fontsize="xx-large")

ax2.plot(x_for_pred, sine_y_pred_adam_20, "r", label="Predictions with Adam (20 training epochs)", alpha=0.4, lw=2)
ax2.plot(x_for_pred, sine_y_pred_adam_2000, "r:", label="Predictions with Adam (2000 training epochs)", alpha=0.8, lw=2)
ax2.plot(x_for_pred, sine_y_pred_lbfgs_20, "b", label="Predictions with LBFGS (20 training epochs)", alpha=0.4, lw=2)
ax2.plot(x_plot, sine_y_plot, "ko", label="True data")
ax2.set_xlabel(r"$x$", fontsize="xx-large")
ax2.set_ylabel(r"$\sin{(x)}$", fontsize="xx-large")

plt.savefig("static/images/polynomial_regression_with_Adam_LBFGS_2000.png", dpi=90)

The polynomial regression models - one trained with Adam (2000 iterations) and the other with LBFGS underestimates and overestimates the sine function at different places. The one trained with LBFGS, significantly underestimates near \(x = 3\). This suggests that if enough data is available then a better way to fit such highly non-linear data with such simple models would be to do it piecewise - training different models on different segments of the data.

  1. For more details about linear regression and correlation check out An Introduction to Statistical Learning. ↩︎

  2. Some similar examples can be found in the official PyTorch tutorials at https://pytorch.org/tutorials/beginner/pytorch%5Fwith%5Fexamples.html. ↩︎

  3. The skorch library combines the scikit-learn API with PyTorch. ↩︎