Support Vector Machine (SVM)

Support Vector Machine (SVM)#

This section is a continuation of the Logistic Regression section and serves the purpose of computing SVM pipelines for macro-structural and micro-structural data.

1. What are Support Vector Machines (SVMs) ?#

SVMs are a set of supervised learning methods used that can be used for classifcation, regression and outliers detection (for further information click here). The basic idea is to find an optimal separating line (or hyperplane) as output that separates the data into two classes. The SVM algorithm looks for the data points that are the clostest to the line from both classes. These points are called support vectors. Then, the distance between the support vectors and the hyperplane which is called the margin is computed. To find the best and optimal hyperplane, the margin should be maximized.

Support vector classification (SVC) or Linear Support vector classification (LinearSVC) are methods of SVMs making it feasible to perfom a binary or mulit-class classification on a dataset. For the purpose of this project, LinearSVC is going to be performed.

2. Macro-structural data: Cortical Thickness (CT)#

In this section, a SVM pipeline will be computed for the macro-structual data: CT. Firstly, the data is loaded and prepared for the analysis part. Secondly, the SVM model is built containing the analysis. Finally, we evaluate our model and have a look at how it performs.

2.1 Data preparation#

Relevant modules are imported that are neccessary to perfrom the analyses.

#import relevant modules

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm
from sklearn import metrics
from sklearn.svm import SVC

#load data

CT_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_thickness_Dublin.csv')
CT_Dublin = pd.read_csv(CT_Dublin_path)

The dataframe is adjusted accordingly, removing irrelevant variables for the classification.

#adjust dataframe

CT_Dublin_adj = CT_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)

#label group 1 as 0 and 2 as 1

CT_Dublin_adj['Group'] = CT_Dublin_adj['Group'].replace([1,2],[0, 1])

Here, 0 stands for “Control” and 1 for “Patient”.

2.2 Build the model#

As in the Logistic regression, as a first step the input and output variables should be defined. It is analogous to the Logistic Regression input and output since it is the identical CT data.

#define data and target

X_CT = CT_Dublin_adj.iloc[:,1:309].values
y_CT = CT_Dublin_adj.iloc[:,[0]].values

X_CT

array([[2.18 , 2.382, 2.346, ..., 2.776, 3.282, 3.347],
       [2.394, 1.973, 2.534, ..., 2.654, 3.124, 3.214],
       [2.551, 2.567, 1.954, ..., 2.495, 2.669, 2.886],
       ...,
       [2.273, 2.559, 2.578, ..., 2.294, 2.571, 2.875],
       [1.94 , 2.438, 2.272, ..., 2.51 , 2.759, 2.838],
       [2.108, 2.269, 2.145, ..., 2.551, 2.855, 2.985]])

y_CT

array([[0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])

As in the Logistic Regression we have to use the .ravel function fur further analyses.

y_CT = y_CT.ravel()

In the following, we build the SVM model using GridSearch to identify the best value of C for the best possible accuracy. The C hyperparameter “adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications. If C is large, SVM tries to minimize the number of misclassified examples due to high penalty which results in a decision boundary with a smaller margin” (see here). Also, we are using a 5-fold cross validation as indicated by cv.

# Split dataset into training and testing sets
X_train_CT, X_test_CT, y_train_CT, y_test_CT = train_test_split(X_CT, y_CT, test_size=0.3, random_state=42)

# Create linear SVM classifier
svm = SVC(kernel = 'linear')

# Define  range of values for the C hyperparameter
param_grid = {'C': [0.1, 1, 10, 100]}

# Perform a grid search cross-validation to find the best value of C
grid_search = GridSearchCV(svm, param_grid=param_grid, cv=5)
grid_search.fit(X_train_CT, y_train_CT)

# Print the best value of C found during the grid search
print("Best value of C: ", grid_search.best_params_['C'])

# Evaluate the performance of the SVM on the testing set with the best value of C
svm_best = SVC(C=grid_search.best_params_['C'])
svm_best.fit(X_train_CT, y_train_CT)
accuracy = svm_best.score(X_test_CT, y_test_CT)
print("Accuracy: ", accuracy)

Best value of C:  1
Accuracy:  0.9090909090909091

As we can see, according to GridSearch the best Value for the C parameter is 1. The accuracy of our model is around 91%. However, let us check other performance measures!

2.3 Model evaluation#

As in the Logistic Regression, we can have a look at different measurements to gain insight of how the model performs.

y_pred_CT = svm_best.predict(X_test_CT)

print("Accuracy:",metrics.accuracy_score(y_test_CT, y_pred_CT))
print("Precision:",metrics.precision_score(y_test_CT, y_pred_CT))
print("Recall:",metrics.recall_score(y_test_CT, y_pred_CT))
print("F1-Score:", metrics.f1_score(y_test_CT, y_pred_CT))

Accuracy: 0.9090909090909091
Precision: 0.8333333333333334
Recall: 0.7142857142857143
F1-Score: 0.7692307692307692

As the results reveal, our SVM model for cortical thickness makes 90.91% of the time correct predictions. 83.33% represent the proportion of the model’s prediction of psychosis where psychosis is actually present and 71.43% relate to the proportion of all cases of psychosis that the model accurately predicted.

3. Micro-structural data#

In the following the same routine is applied to the microstructural data. First, we begin with MD.

3.1 Mean Diffusivity#

3.1.1 Data preparation#

First, the data is prepared for the analysis.

#read data

MD_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_MD_cortexAv_mean_Dublin.csv')
MD_Dublin = pd.read_csv(MD_Dublin_path)

#adjust dataframe

MD_Dublin_adj = MD_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)

#label group 1 as 0 and 2 as 1

MD_Dublin_adj['Group'] = MD_Dublin_adj['Group'].replace([1,2],[0, 1])

MD_Dublin_adj

	Group	lh_bankssts_part1_thickness	lh_bankssts_part2_thickness	lh_caudalanteriorcingulate_part1_thickness	lh_caudalmiddlefrontal_part1_thickness	lh_caudalmiddlefrontal_part2_thickness	lh_caudalmiddlefrontal_part3_thickness	lh_caudalmiddlefrontal_part4_thickness	lh_cuneus_part1_thickness	lh_cuneus_part2_thickness	...	rh_supramarginal_part5_thickness	rh_supramarginal_part6_thickness	rh_supramarginal_part7_thickness	rh_frontalpole_part1_thickness	rh_temporalpole_part1_thickness	rh_transversetemporal_part1_thickness	rh_insula_part1_thickness	rh_insula_part2_thickness	rh_insula_part3_thickness	rh_insula_part4_thickness
0	0	0.911	0.931	0.891	1.048	0.881	0.939	1.124	0.986	1.045	...	0.928	1.067	1.096	0.892	1.238	1.021	1.166	0.900	0.907	0.937
1	0	0.861	0.913	0.846	0.927	0.888	0.894	0.924	1.040	1.093	...	0.878	0.985	1.045	1.001	1.196	1.083	1.143	0.917	0.923	0.960
2	0	0.817	0.827	0.828	0.828	0.780	0.843	0.825	0.848	0.838	...	0.847	0.849	0.819	0.952	0.933	0.942	1.059	0.794	0.834	0.860
3	0	0.887	0.905	0.878	0.932	0.820	0.888	0.970	0.918	0.900	...	0.957	0.985	0.989	1.075	1.150	1.017	0.986	0.888	0.916	0.928
4	0	0.887	0.854	0.905	1.011	0.946	0.922	1.034	1.126	1.114	...	0.871	0.952	0.987	1.325	0.996	1.094	1.064	0.966	0.989	0.977
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
110	1	0.843	0.855	0.940	1.017	0.954	0.840	1.128	1.012	0.997	...	0.938	1.062	1.143	0.903	1.364	1.284	1.218	1.017	0.972	1.028
111	1	0.911	0.914	0.926	1.001	0.918	1.115	1.036	1.026	1.001	...	0.957	1.085	1.098	1.059	1.268	1.089	1.173	0.990	1.065	1.021
112	1	0.890	0.899	0.886	0.930	0.883	0.882	0.883	1.190	1.101	...	0.916	1.010	0.974	0.968	1.305	1.168	1.265	0.981	0.975	0.972
113	1	0.920	0.986	0.883	0.879	0.794	0.983	1.029	1.076	1.053	...	0.942	0.985	0.990	1.199	1.353	1.187	1.444	0.947	1.047	1.085
114	1	0.970	0.868	0.940	0.967	0.930	1.022	0.957	1.112	1.004	...	1.074	1.122	1.215	0.891	1.333	1.295	1.284	1.125	1.110	1.139

115 rows × 309 columns

3.1.2 Build the model#

Again, we define the input and output variables. Here, we now have MD for each brain region as our input.

#define input and output

X_MD = MD_Dublin_adj.iloc[:,1:309].values
y_MD = MD_Dublin_adj.iloc[:,[0]].values

y_MD = y_MD.ravel()

We run the same SVM pipeline for the MD data.

# Split the dataset into training and testing sets
X_train_MD, X_test_MD, y_train_MD, y_test_MD = train_test_split(X_MD, y_MD, test_size=0.3,random_state=109)

# Create a SVM classifier
svm_MD = SVC(kernel = 'linear')

# Define the range of values for the C hyperparameter
param_grid_MD = {'C': [0.1, 1, 10, 100]}

# Perform a grid search cross-validation to find the best value of C
grid_search_MD = GridSearchCV(svm_MD, param_grid=param_grid, cv=5)
grid_search_MD.fit(X_train_MD, y_train_MD)

# Print the best value of C found during the grid search
print("Best value of C: ", grid_search_MD.best_params_['C'])

# Evaluate the performance of the SVM on the testing set with the best value of C
svm_best_MD = SVC(C=grid_search_MD.best_params_['C'])
svm_best_MD.fit(X_train_MD, y_train_MD)
accuracy_MD = svm_best_MD.score(X_test_MD, y_test_MD)
print("Accuracy: ", accuracy_MD)

Best value of C:  1
Accuracy:  0.8285714285714286

The model seems to perform less well than the model for CT, but still the accuracy is good. Let’s have a look at the other measures.

y_pred_MD = svm_best_MD.predict(X_test_MD)

3.1.3 Model evaluation#

print("Accuracy:",metrics.accuracy_score(y_test_MD, y_pred_MD))
print("Precision:",metrics.precision_score(y_test_MD, y_pred_MD))
print("Recall:",metrics.recall_score(y_test_MD, y_pred_MD))
print("F1-Score:", metrics.f1_score(y_test_MD, y_pred_MD))

Accuracy: 0.8285714285714286
Precision: 0.8333333333333334
Recall: 0.5
F1-Score: 0.625

As the results reveal, our SVM model for mean diffusivity makes 82.86% of the time correct predictions. 83.33% represent the proportion of the model’s prediction of psychosis where psychosis is actually present and 50% relate to the proportion of all cases of psychosis that the model accurately predicted.

3.2 Fractional Anisotropy#

Let’s see how FA values might perfrom in terms of informativeness for accurate classification.

3.2.1 Data preparation#

First, the data is prepared for the analyis.

#read data

FA_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_FA_cortexAv_mean_Dublin.csv')
FA_Dublin = pd.read_csv(FA_Dublin_path)

#adjust dataframe

FA_Dublin_adj = FA_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)

#label group 1 as 0 and 2 as 1

FA_Dublin_adj['Group'] = FA_Dublin_adj['Group'].replace([1,2],[0, 1])

3.2.2 Build the model#

This time, as input we choose the FA values for the 308 brain regions.

#define input and output

X_FA = FA_Dublin_adj.iloc[:,1:309].values
y_FA = FA_Dublin_adj.iloc[:,[0]].values

y_FA = y_FA.ravel()

We again run the same pipeline for the FA values!

# Split the dataset into training and testing sets
X_train_FA, X_test_FA, y_train_FA, y_test_FA = train_test_split(X_FA, y_FA, test_size=0.3,random_state=109)

# Create a SVM classifier
svm_FA = SVC(kernel = 'linear')

# Define the range of values for the C hyperparameter
param_grid_FA = {'C': [0.1, 1, 10, 100]}

# Perform a grid search cross-validation to find the best value of C
grid_search_FA = GridSearchCV(svm_FA, param_grid=param_grid, cv=5)
grid_search_FA.fit(X_train_FA, y_train_FA)

# Print the best value of C found during the grid search
print("Best value of C: ", grid_search_FA.best_params_['C'])

# Evaluate the performance of the SVM on the testing set with the best value of C
svm_best_FA = SVC(C=grid_search_FA.best_params_['C'])
svm_best_FA.fit(X_train_FA, y_train_FA)
accuracy_FA = svm_best_FA.score(X_test_FA, y_test_FA)
print("Accuracy: ", accuracy_FA)

Best value of C:  10
Accuracy:  0.7714285714285715

y_pred_FA = svm_best_FA.predict(X_test_FA)

3.2.3 Model evaluation#

print("Accuracy:",metrics.accuracy_score(y_test_FA, y_pred_FA))
print("Precision:",metrics.precision_score(y_test_FA, y_pred_FA))
print("Recall:",metrics.recall_score(y_test_FA, y_pred_FA))
print("F1-Score:", metrics.f1_score(y_test_FA, y_pred_FA))

Accuracy: 0.7714285714285715
Precision: 0.75
Recall: 0.3
F1-Score: 0.4285714285714285

As the results reveal, our SVM model for fractional anisotropy makes 77.14% of the time correct predictions. 75% represent the proportion of the model’s prediction of psychosis where psychosis is actually present and 30% relate to the proportion of all cases of psychosis that the model accurately predicted.

4. Summary#

We can again summarize the finding in a table for a better overview. First import tabulate.

from tabulate import tabulate

#create data
data_SVM = [["Accuracy", metrics.accuracy_score(y_test_CT, y_pred_CT), metrics.accuracy_score(y_test_MD, y_pred_MD), metrics.accuracy_score(y_test_FA, y_pred_FA)], 
        ["Precision", metrics.precision_score(y_test_CT, y_pred_CT), metrics.precision_score(y_test_MD, y_pred_MD),metrics.precision_score(y_test_FA, y_pred_FA)], 
        ["Recall", metrics.recall_score(y_test_CT, y_pred_CT), metrics.recall_score(y_test_MD, y_pred_MD),metrics.recall_score(y_test_FA, y_pred_FA)], 
        ["F1-Score", metrics.f1_score(y_test_CT, y_pred_CT), metrics.f1_score(y_test_MD, y_pred_MD), metrics.f1_score(y_test_FA, y_pred_FA)]]
  
#define header names
col_names = ["Measure", "Cortical Thickness", "Mean Diffusivity", "Fractional Anisotropy"]
  
#display table
print(tabulate(data_SVM, headers=col_names))

Measure      Cortical Thickness    Mean Diffusivity    Fractional Anisotropy
---------  --------------------  ------------------  -----------------------
Accuracy               0.909091            0.828571                 0.771429
Precision              0.833333            0.833333                 0.75
Recall                 0.714286            0.5                      0.3
F1-Score               0.769231            0.625                    0.428571

What can be seen for the SVM model is that CT has the highest accuracy and reval score. MD has the highest precision score. Precision and Recall values are inversely related. Again, when we relate to the F1-Score, this time the classifier seems to perform better for CT.