How to Apply Ordinal Logistic Regression for prediction of Materials Properties

April, 2024


Photo by Robert Katzki on Unsplash

This tutorial serves as a follow-up to the previous one on binary logistic regression, wherein I provided a comprehensive explanation of the coding steps involved in predicting binary outcomes for corrosion properties of materials.

Now, the question is: How do we apply the logistic regression method if there are more than two categories?

Here comes the role of ordinal logistic regression. The word ‘ordinal’ means sequence or order. Hence, ordinal regression is a type of logistic regression used for modeling and predicting ordinal variables, where the categories have a natural order in a meaningful way. For instance, the Likert scale, or customer satisfaction surveys - the response variable in all of these falls into ordered categories.

In this tutorial, we'll explore how to apply ordinal regression in coding to predict material properties with more than two categories. By the end, you'll gain the skills to model and interpret such data for your own projects effectively. Let's get started!

The Dataset

Data is the precursor to building efficient models in machine learning. Hence, amassing sufficient and necessary data is the very first step. I have used a hypothetically generated dataset on the corrosion properties of 150 metals and alloys.

There are four independent variables (Xs) namely corrosion rate, corrosion current density, polarization resistance, and media.

The target variable (y) is corrosion property categorized as 'poor', 'fair', and excellent.

The Coding

Step1: Importing relevant libaries and loading the dataset

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# Performance metrics 
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
 #Loading data 
data = pd.read_csv('SyntheticCorrData2.csv', encoding='latin') 
# Create a DataFrame
df = pd.DataFrame(data)

Here is a glimpse of the data loaded for machine learning.

Step 2: Defining Xs and y

X = df[['Corrosion Rate (mpy)', 'Corrosion Density Current (µA/cm²)', 'Polarization Resistance (ohms)']]
y = df['Corrosion Resistance']

Step 3: Ordinal Encoding of the target variable

ordinal_encoder = OrdinalEncoder(categories=[["Poor", "Fair", "Excellent"]])
y_encoded = ordinal_encoder.fit_transform(y.values.reshape(-1, 1)).ravel()

The y_encoded variable is the result of applying fit_transform() to the target variable y after reshaping it to a 2-dimensional array with one column using y.values.reshape(-1, 1). This transformation changes the orientation of the array from horizontal to vertical.

ravel( ): Flattens 2D array into a 1D array. This flattened array is then used as the encoded target variable for training the machine learning model.

To recheck the categories after encoding, simply type:

categories = ordinal_encoder.categories_

To view the encoded values of 'y', type:

y_encoded

The corresponding output:

array([1., 1., 0., 0., 2., 1., 1., 1., 0., 2., 2., 1., 2., 1., 0., 0., 1.,
       1., 1., 0., 2., 1., 0., 1., 0., 1., 1., 0., 0., 0., 2., 1., 1., 0.,
       2., 0., 1., 2., 0., 0., 0., 1., 1., 1., 0., 2., 1., 2., 2., 1., 1.,
       1., 0., 0., 2., 2., 0., 1., 2., 1., 1., 0., 2., 0., 1., 1., 1., 2.,
       0., 2., 2., 0., 2., 2., 1., 2., 0., 2., 1., 1., 0., 1., 2., 0., 0.,
       0., 2., 2., 1., 1., 1., 0., 2., 0., 1., 0., 2., 1., 0., 2., 2., 0.,
       0., 2., 1., 1., 0., 2., 2., 2., 1., 0., 2., 0., 1., 2., 2., 2., 0.,
       0., 1., 0., 1., 0., 2., 1., 2., 2., 1., 2., 2., 1., 0., 2., 2., 0.,
       0., 1., 1., 0., 2., 0., 0., 2., 2., 0., 1., 1., 0., 0.])

To view the labels mapped to their corresponding encoded numbers:

  • Convert 1D array of y_encoded to 2D Dataframe.
  • df_y_encoded= pd.DataFrame(y_encoded)
  • Apply inverse_transform() function to retrieve the names of the labels followed by conversion of the same to 2D DataFrame structure.
  •   y_names=ordinal_encoder.inverse_transform(y_encoded.reshape(-1,1))
      df_y_names =pd.DataFrame(y_names)
  • Concatenate the two datadrames using concat() function.
  •  map_data = [df_y_names, df_y_encoded]
    result = pd.concat(map_data, axis=1)
    result

    Step 4: Split data ino training and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
    

    Step 5: Install MORD for Ordinal Regression

     pip install mord
     from mord import OrdinalRidge

    MORD is a Python package that implements ordinal regression methods. The package includes different models suitable for ordinal regression tasks, and it categorizes them into threshold-based, regression-based, and classification-based models. We will implement the OrdinalRidge regression-based model for predicting an ordinal variable, meaning that the variable has a clear ordering or ranking.

    The mathematics behind ordinal ridge regression involves a regularization technique applied to the standard linear regression model to handle multicollinearity, prevent overfitting, and enhance prediction accuracy, especially when dealing with ordinal data.

    Step 6: Data fitting with OrdinalRidge method
    ord_reg = OrdinalRidge()
    ord_reg.fit(X_train, y_train)
    Step 7: Prediction
    y_pred =ord_reg.predict(X_test)

    Step 8: Evaluation of the model
    #Computing accuracy, mean absolute error
    accuracy = accuracy_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    Step 9: Visualization of the performance of the model with Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    print(cm) 
    [[ 6  3  0]
     [ 0 10  0]
     [ 0  5  6]]

    import seaborn as sns
    sns.heatmap(cm/np.sum(cm), annot=True,fmt='.2%', cmap='GnBu')

    Confusion matrix in percentage for multi-class classification** and table showing the percentage of true positive (TP) for each class/ category namely 'poor', 'fair', and 'excellent'. ** Please check out for the next tutorial on reading confusion matrix for multi-class classification.

    Finally, here is the link to the entire code and dataset.