{ "cells": [ { "cell_type": "markdown", "id": "4fb0d430", "metadata": {}, "source": [ "# Logistic Regression" ] }, { "cell_type": "markdown", "id": "7e251389", "metadata": {}, "source": [ "In this chapter, the logistic regression algorithm is applied to the macro- and microstructural data. The procedure is the same for both kinds of data. First, the data is prepared so that it can be used efficiently in the code. Second, the logistic regression model is built: the input and output variables are defined, the relevant variables are scaled, the data set is split into training and testing samples, and the model is trained. Finally, the model is evaluated using several metrics that describe how well it performs. \n", "\n", "We start with **cortical thickness (CT)**." ] }, { "cell_type": "markdown", "id": "623b534c", "metadata": {}, "source": [ "## 1. Macro-structural data: Cortical Thickness (CT)" ] }, { "cell_type": "markdown", "id": "150941c9", "metadata": {}, "source": [ "### 1.1 Data preparation" ] }, { "cell_type": "markdown", "id": "ed303c0c", "metadata": {}, "source": [ "First, all modules needed for the analysis are imported. 
" ] }, { "cell_type": "code", "execution_count": 1, "id": "4c372d6b", "metadata": {}, "outputs": [], "source": [ "#import relevant modules\n", "\n", "import os\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import metrics \n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "id": "dc556abe", "metadata": {}, "source": [ "Again, to read the data, the os.pardir() function is used to make the code reproducible independent of different operating systems. " ] }, { "cell_type": "code", "execution_count": 2, "id": "c724d3cf", "metadata": {}, "outputs": [], "source": [ "#read data\n", "\n", "CT_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_thickness_Dublin.csv')\n", "CT_Dublin = pd.read_csv(CT_Dublin_path)" ] }, { "cell_type": "code", "execution_count": 3, "id": "dab3630c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Subject ID Age Sex Group lh_bankssts_part1_thickness \\\n", "0 CON9225 21 2 1 2.180 \n", "1 CON9229 28 2 1 2.394 \n", "2 CON9231 29 2 1 2.551 \n", "3 GASP3037 61 1 2 2.187 \n", "4 GASP3040 47 1 2 1.862 \n", ".. ... ... ... ... ... \n", "103 RPG9019 31 1 2 2.240 \n", "104 RPG9102 42 2 2 2.269 \n", "105 RPG9119 41 1 2 2.273 \n", "106 RPG9121 51 1 2 1.940 \n", "107 RPG9126 56 1 2 2.108 \n", "\n", " lh_bankssts_part2_thickness lh_caudalanteriorcingulate_part1_thickness \\\n", "0 2.382 2.346 \n", "1 1.973 2.534 \n", "2 2.567 1.954 \n", "3 1.923 2.160 \n", "4 1.750 2.129 \n", ".. ... ... \n", "103 2.150 1.995 \n", "104 2.124 2.531 \n", "105 2.559 2.578 \n", "106 2.438 2.272 \n", "107 2.269 2.145 \n", "\n", " lh_caudalmiddlefrontal_part1_thickness \\\n", "0 2.526 \n", "1 2.439 \n", "2 2.439 \n", "3 2.410 \n", "4 2.516 \n", ".. ... \n", "103 2.254 \n", "104 2.502 \n", "105 2.463 \n", "106 2.272 \n", "107 2.192 \n", "\n", " lh_caudalmiddlefrontal_part2_thickness \\\n", "0 2.747 \n", "1 2.485 \n", "2 2.428 \n", "3 2.381 \n", "4 2.244 \n", ".. ... \n", "103 2.164 \n", "104 2.250 \n", "105 2.463 \n", "106 2.610 \n", "107 2.443 \n", "\n", " lh_caudalmiddlefrontal_part3_thickness ... \\\n", "0 2.544 ... \n", "1 2.435 ... \n", "2 2.190 ... \n", "3 2.277 ... \n", "4 2.169 ... \n", ".. ... ... \n", "103 2.008 ... \n", "104 2.183 ... \n", "105 2.053 ... \n", "106 2.099 ... \n", "107 1.977 ... \n", "\n", " rh_supramarginal_part5_thickness rh_supramarginal_part6_thickness \\\n", "0 2.817 2.325 \n", "1 2.611 2.418 \n", "2 2.777 2.309 \n", "3 2.265 2.306 \n", "4 2.582 2.314 \n", ".. ... ... \n", "103 2.273 2.288 \n", "104 2.302 2.182 \n", "105 2.534 2.604 \n", "106 2.638 2.225 \n", "107 2.013 2.251 \n", "\n", " rh_supramarginal_part7_thickness rh_frontalpole_part1_thickness \\\n", "0 2.430 3.004 \n", "1 2.317 2.794 \n", "2 2.390 2.365 \n", "3 2.129 2.281 \n", "4 2.047 2.389 \n", ".. ... ... 
\n", "103 2.395 2.105 \n", "104 2.182 2.327 \n", "105 2.449 2.370 \n", "106 2.013 2.115 \n", "107 2.021 2.419 \n", "\n", " rh_temporalpole_part1_thickness rh_transversetemporal_part1_thickness \\\n", "0 3.979 2.329 \n", "1 3.851 2.034 \n", "2 4.039 2.337 \n", "3 3.505 2.275 \n", "4 3.272 2.445 \n", ".. ... ... \n", "103 3.267 2.257 \n", "104 2.881 2.124 \n", "105 3.111 2.190 \n", "106 3.853 2.231 \n", "107 3.679 1.970 \n", "\n", " rh_insula_part1_thickness rh_insula_part2_thickness \\\n", "0 3.620 2.776 \n", "1 3.588 2.654 \n", "2 3.657 2.495 \n", "3 3.121 2.333 \n", "4 3.171 2.216 \n", ".. ... ... \n", "103 3.231 2.574 \n", "104 3.159 2.450 \n", "105 3.480 2.294 \n", "106 3.187 2.510 \n", "107 3.192 2.551 \n", "\n", " rh_insula_part3_thickness rh_insula_part4_thickness \n", "0 3.282 3.347 \n", "1 3.124 3.214 \n", "2 2.669 2.886 \n", "3 2.604 2.731 \n", "4 2.659 2.657 \n", ".. ... ... \n", "103 2.920 2.899 \n", "104 2.753 2.791 \n", "105 2.571 2.875 \n", "106 2.759 2.838 \n", "107 2.855 2.985 \n", "\n", "[108 rows x 312 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CT_Dublin" ] }, { "cell_type": "markdown", "id": "fc679ec0", "metadata": {}, "source": [ "The data contains variables such as SubjectID, Age and Sex which are not relevant for the classification. Hence, we adjust the dataframe accordingly." ] }, { "cell_type": "code", "execution_count": 4, "id": "e7eaad40", "metadata": {}, "outputs": [], "source": [ "#adjust dataframe\n", "\n", "CT_Dublin_adj = CT_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)" ] }, { "cell_type": "code", "execution_count": 5, "id": "1277a861", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Group lh_bankssts_part1_thickness lh_bankssts_part2_thickness \\\n", "0 1 2.180 2.382 \n", "1 1 2.394 1.973 \n", "2 1 2.551 2.567 \n", "3 2 2.187 1.923 \n", "4 2 1.862 1.750 \n", ".. ... ... ... \n", "103 2 2.240 2.150 \n", "104 2 2.269 2.124 \n", "105 2 2.273 2.559 \n", "106 2 1.940 2.438 \n", "107 2 2.108 2.269 \n", "\n", " lh_caudalanteriorcingulate_part1_thickness \\\n", "0 2.346 \n", "1 2.534 \n", "2 1.954 \n", "3 2.160 \n", "4 2.129 \n", ".. ... \n", "103 1.995 \n", "104 2.531 \n", "105 2.578 \n", "106 2.272 \n", "107 2.145 \n", "\n", " lh_caudalmiddlefrontal_part1_thickness \\\n", "0 2.526 \n", "1 2.439 \n", "2 2.439 \n", "3 2.410 \n", "4 2.516 \n", ".. ... \n", "103 2.254 \n", "104 2.502 \n", "105 2.463 \n", "106 2.272 \n", "107 2.192 \n", "\n", " lh_caudalmiddlefrontal_part2_thickness \\\n", "0 2.747 \n", "1 2.485 \n", "2 2.428 \n", "3 2.381 \n", "4 2.244 \n", ".. ... \n", "103 2.164 \n", "104 2.250 \n", "105 2.463 \n", "106 2.610 \n", "107 2.443 \n", "\n", " lh_caudalmiddlefrontal_part3_thickness \\\n", "0 2.544 \n", "1 2.435 \n", "2 2.190 \n", "3 2.277 \n", "4 2.169 \n", ".. ... \n", "103 2.008 \n", "104 2.183 \n", "105 2.053 \n", "106 2.099 \n", "107 1.977 \n", "\n", " lh_caudalmiddlefrontal_part4_thickness lh_cuneus_part1_thickness \\\n", "0 2.582 1.816 \n", "1 2.458 1.723 \n", "2 2.377 2.026 \n", "3 2.361 1.585 \n", "4 2.220 1.646 \n", ".. ... ... \n", "103 2.298 1.918 \n", "104 2.408 1.539 \n", "105 2.526 1.733 \n", "106 2.538 1.931 \n", "107 2.453 1.590 \n", "\n", " lh_cuneus_part2_thickness ... rh_supramarginal_part5_thickness \\\n", "0 2.228 ... 2.817 \n", "1 1.821 ... 2.611 \n", "2 1.800 ... 2.777 \n", "3 1.750 ... 2.265 \n", "4 1.717 ... 2.582 \n", ".. ... ... ... \n", "103 1.717 ... 2.273 \n", "104 1.611 ... 2.302 \n", "105 1.859 ... 2.534 \n", "106 1.792 ... 2.638 \n", "107 1.715 ... 
2.013 \n", "\n", " rh_supramarginal_part6_thickness rh_supramarginal_part7_thickness \\\n", "0 2.325 2.430 \n", "1 2.418 2.317 \n", "2 2.309 2.390 \n", "3 2.306 2.129 \n", "4 2.314 2.047 \n", ".. ... ... \n", "103 2.288 2.395 \n", "104 2.182 2.182 \n", "105 2.604 2.449 \n", "106 2.225 2.013 \n", "107 2.251 2.021 \n", "\n", " rh_frontalpole_part1_thickness rh_temporalpole_part1_thickness \\\n", "0 3.004 3.979 \n", "1 2.794 3.851 \n", "2 2.365 4.039 \n", "3 2.281 3.505 \n", "4 2.389 3.272 \n", ".. ... ... \n", "103 2.105 3.267 \n", "104 2.327 2.881 \n", "105 2.370 3.111 \n", "106 2.115 3.853 \n", "107 2.419 3.679 \n", "\n", " rh_transversetemporal_part1_thickness rh_insula_part1_thickness \\\n", "0 2.329 3.620 \n", "1 2.034 3.588 \n", "2 2.337 3.657 \n", "3 2.275 3.121 \n", "4 2.445 3.171 \n", ".. ... ... \n", "103 2.257 3.231 \n", "104 2.124 3.159 \n", "105 2.190 3.480 \n", "106 2.231 3.187 \n", "107 1.970 3.192 \n", "\n", " rh_insula_part2_thickness rh_insula_part3_thickness \\\n", "0 2.776 3.282 \n", "1 2.654 3.124 \n", "2 2.495 2.669 \n", "3 2.333 2.604 \n", "4 2.216 2.659 \n", ".. ... ... \n", "103 2.574 2.920 \n", "104 2.450 2.753 \n", "105 2.294 2.571 \n", "106 2.510 2.759 \n", "107 2.551 2.855 \n", "\n", " rh_insula_part4_thickness \n", "0 3.347 \n", "1 3.214 \n", "2 2.886 \n", "3 2.731 \n", "4 2.657 \n", ".. ... \n", "103 2.899 \n", "104 2.791 \n", "105 2.875 \n", "106 2.838 \n", "107 2.985 \n", "\n", "[108 rows x 309 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CT_Dublin_adj" ] }, { "cell_type": "markdown", "id": "074688be", "metadata": {}, "source": [ "As the dataframe shows, the Group variable contains information of whether the samples belong to control or patient. In this case, 1 indicates control and 2 patient. In order to perform a **Logistic Regression**, the labels of the outputs require to be 0 and 1 since the probability of an instance belonging to a default class is computed." 
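, "\n", "\n", "A minimal sketch of this recoding, using a small made-up frame rather than the Dublin data (the values in `demo` are invented for illustration). It uses `Series.map` with an explicit dictionary, which is one of several ways to recode the labels, and then checks that only 0 and 1 remain.

```python
import pandas as pd

# toy frame with the same coding as the Group column: 1 = control, 2 = patient
demo = pd.DataFrame({'Group': [1, 1, 2, 2, 1]})

# explicit mapping: control -> 0, patient -> 1
demo['Group'] = demo['Group'].map({1: 0, 2: 1})

# sanity check: only the labels 0 and 1 should be left
labels = sorted(demo['Group'].unique().tolist())
print(labels)
```

Spelling out the mapping in a dictionary keeps it visible which original code (1 = control, 2 = patient) ends up as which label."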
] }, { "cell_type": "code", "execution_count": 6, "id": "aca4d140", "metadata": {}, "outputs": [], "source": [ "#label group 1 as 0 and 2 as 1\n", "\n", "CT_Dublin_adj['Group'] = CT_Dublin_adj['Group'].replace([1,2],[0, 1])" ] }, { "cell_type": "code", "execution_count": 7, "id": "408610ce", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Group lh_bankssts_part1_thickness lh_bankssts_part2_thickness \\\n", "0 0 2.180 2.382 \n", "1 0 2.394 1.973 \n", "2 0 2.551 2.567 \n", "3 1 2.187 1.923 \n", "4 1 1.862 1.750 \n", ".. ... ... ... \n", "103 1 2.240 2.150 \n", "104 1 2.269 2.124 \n", "105 1 2.273 2.559 \n", "106 1 1.940 2.438 \n", "107 1 2.108 2.269 \n", "\n", " lh_caudalanteriorcingulate_part1_thickness \\\n", "0 2.346 \n", "1 2.534 \n", "2 1.954 \n", "3 2.160 \n", "4 2.129 \n", ".. ... \n", "103 1.995 \n", "104 2.531 \n", "105 2.578 \n", "106 2.272 \n", "107 2.145 \n", "\n", " lh_caudalmiddlefrontal_part1_thickness \\\n", "0 2.526 \n", "1 2.439 \n", "2 2.439 \n", "3 2.410 \n", "4 2.516 \n", ".. ... \n", "103 2.254 \n", "104 2.502 \n", "105 2.463 \n", "106 2.272 \n", "107 2.192 \n", "\n", " lh_caudalmiddlefrontal_part2_thickness \\\n", "0 2.747 \n", "1 2.485 \n", "2 2.428 \n", "3 2.381 \n", "4 2.244 \n", ".. ... \n", "103 2.164 \n", "104 2.250 \n", "105 2.463 \n", "106 2.610 \n", "107 2.443 \n", "\n", " lh_caudalmiddlefrontal_part3_thickness \\\n", "0 2.544 \n", "1 2.435 \n", "2 2.190 \n", "3 2.277 \n", "4 2.169 \n", ".. ... \n", "103 2.008 \n", "104 2.183 \n", "105 2.053 \n", "106 2.099 \n", "107 1.977 \n", "\n", " lh_caudalmiddlefrontal_part4_thickness lh_cuneus_part1_thickness \\\n", "0 2.582 1.816 \n", "1 2.458 1.723 \n", "2 2.377 2.026 \n", "3 2.361 1.585 \n", "4 2.220 1.646 \n", ".. ... ... \n", "103 2.298 1.918 \n", "104 2.408 1.539 \n", "105 2.526 1.733 \n", "106 2.538 1.931 \n", "107 2.453 1.590 \n", "\n", " lh_cuneus_part2_thickness ... rh_supramarginal_part5_thickness \\\n", "0 2.228 ... 2.817 \n", "1 1.821 ... 2.611 \n", "2 1.800 ... 2.777 \n", "3 1.750 ... 2.265 \n", "4 1.717 ... 2.582 \n", ".. ... ... ... \n", "103 1.717 ... 2.273 \n", "104 1.611 ... 2.302 \n", "105 1.859 ... 2.534 \n", "106 1.792 ... 2.638 \n", "107 1.715 ... 
2.013 \n", "\n", " rh_supramarginal_part6_thickness rh_supramarginal_part7_thickness \\\n", "0 2.325 2.430 \n", "1 2.418 2.317 \n", "2 2.309 2.390 \n", "3 2.306 2.129 \n", "4 2.314 2.047 \n", ".. ... ... \n", "103 2.288 2.395 \n", "104 2.182 2.182 \n", "105 2.604 2.449 \n", "106 2.225 2.013 \n", "107 2.251 2.021 \n", "\n", " rh_frontalpole_part1_thickness rh_temporalpole_part1_thickness \\\n", "0 3.004 3.979 \n", "1 2.794 3.851 \n", "2 2.365 4.039 \n", "3 2.281 3.505 \n", "4 2.389 3.272 \n", ".. ... ... \n", "103 2.105 3.267 \n", "104 2.327 2.881 \n", "105 2.370 3.111 \n", "106 2.115 3.853 \n", "107 2.419 3.679 \n", "\n", " rh_transversetemporal_part1_thickness rh_insula_part1_thickness \\\n", "0 2.329 3.620 \n", "1 2.034 3.588 \n", "2 2.337 3.657 \n", "3 2.275 3.121 \n", "4 2.445 3.171 \n", ".. ... ... \n", "103 2.257 3.231 \n", "104 2.124 3.159 \n", "105 2.190 3.480 \n", "106 2.231 3.187 \n", "107 1.970 3.192 \n", "\n", " rh_insula_part2_thickness rh_insula_part3_thickness \\\n", "0 2.776 3.282 \n", "1 2.654 3.124 \n", "2 2.495 2.669 \n", "3 2.333 2.604 \n", "4 2.216 2.659 \n", ".. ... ... \n", "103 2.574 2.920 \n", "104 2.450 2.753 \n", "105 2.294 2.571 \n", "106 2.510 2.759 \n", "107 2.551 2.855 \n", "\n", " rh_insula_part4_thickness \n", "0 3.347 \n", "1 3.214 \n", "2 2.886 \n", "3 2.731 \n", "4 2.657 \n", ".. ... 
\n", "103 2.899 \n", "104 2.791 \n", "105 2.875 \n", "106 2.838 \n", "107 2.985 \n", "\n", "[108 rows x 309 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CT_Dublin_adj" ] }, { "cell_type": "code", "execution_count": 8, "id": "402f2a08", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108, 309)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#get shape of df_adj\n", "\n", "CT_Dublin_adj.shape" ] }, { "cell_type": "markdown", "id": "0c3996d6", "metadata": {}, "source": [ "The LogisticRegression class from sklearn works on array-like inputs such as numpy arrays. In the following step, the dataframe is therefore displayed as a numpy array; the arrays that are actually passed to the model are extracted in the next section." ] }, { "cell_type": "code", "execution_count": 9, "id": "8c20c482", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0. , 2.18 , 2.382, ..., 2.776, 3.282, 3.347],\n", " [0. , 2.394, 1.973, ..., 2.654, 3.124, 3.214],\n", " [0. , 2.551, 2.567, ..., 2.495, 2.669, 2.886],\n", " ...,\n", " [1. , 2.273, 2.559, ..., 2.294, 2.571, 2.875],\n", " [1. , 1.94 , 2.438, ..., 2.51 , 2.759, 2.838],\n", " [1. , 2.108, 2.269, ..., 2.551, 2.855, 2.985]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#dataframe as numpy array \n", "\n", "CT_Dublin_adj.to_numpy()" ] }, { "cell_type": "markdown", "id": "5998c14a", "metadata": {}, "source": [ "### 1.2 Building the model" ] }, { "cell_type": "markdown", "id": "a3e7df3f", "metadata": {}, "source": [ "In the next steps, the **logistic regression** model is built. First, the input and output are defined. Our input contains the **CT** for all 308 brain regions, so there are n=308 features in total. The output is the Group variable, which contains the label information."
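, "\n", "\n", "The selection of input and output can be sketched on a toy frame as follows. The column names and values below are invented for illustration; selecting by column name instead of by position is an alternative to the positional indexing used in this chapter, not the approach taken here.

```python
import pandas as pd

# toy frame: a Group label column followed by three invented thickness columns
demo = pd.DataFrame({
    'Group': [0, 0, 1, 1],
    'region_a_thickness': [2.18, 2.39, 2.19, 1.86],
    'region_b_thickness': [2.38, 1.97, 1.92, 1.75],
    'region_c_thickness': [2.35, 2.53, 2.16, 2.13],
})

# input: every column except the label
X_demo = demo.drop(columns='Group').to_numpy()

# output: the label column as a one-dimensional array
y_demo = demo['Group'].to_numpy()

print(X_demo.shape, y_demo.shape)
```

Selecting the label column by name returns a one-dimensional array directly, so no further reshaping is needed in this variant."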
] }, { "cell_type": "code", "execution_count": 10, "id": "2cd70850", "metadata": {}, "outputs": [], "source": [ "#define input\n", "\n", "X = CT_Dublin_adj.iloc[:,1:309].values" ] }, { "cell_type": "code", "execution_count": 11, "id": "4a858fdd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2.18 , 2.382, 2.346, ..., 2.776, 3.282, 3.347],\n", " [2.394, 1.973, 2.534, ..., 2.654, 3.124, 3.214],\n", " [2.551, 2.567, 1.954, ..., 2.495, 2.669, 2.886],\n", " ...,\n", " [2.273, 2.559, 2.578, ..., 2.294, 2.571, 2.875],\n", " [1.94 , 2.438, 2.272, ..., 2.51 , 2.759, 2.838],\n", " [2.108, 2.269, 2.145, ..., 2.551, 2.855, 2.985]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 12, "id": "d44d0e78", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108, 308)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 13, "id": "aaa56b67", "metadata": {}, "outputs": [], "source": [ "#output\n", "\n", "y = CT_Dublin_adj.iloc[:,[0]].values" ] }, { "cell_type": "code", "execution_count": 14, "id": "73f341d3", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([[0],\n", " [0],\n", " [0],\n", " [1],\n", " [1],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " 
[0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [0],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1],\n", " [1]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y" ] }, { "cell_type": "code", "execution_count": 15, "id": "c0d23e00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108, 1)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.shape" ] }, { "cell_type": "markdown", "id": "4728103a", "metadata": {}, "source": [ "The numpy.ravel() functions returns contiguous flattened array (1D array with all the input-array elements and with the same type as it). This step is required for the upcoming analyses. " ] }, { "cell_type": "code", "execution_count": 16, "id": "5377250a", "metadata": {}, "outputs": [], "source": [ "y = y.ravel()" ] }, { "cell_type": "code", "execution_count": 17, "id": "6401e42c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108,)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.shape" ] }, { "cell_type": "markdown", "id": "2cce2797", "metadata": {}, "source": [ "Now having defined our input and ouput data, to build the logistic regression model we need to split our data into train and test sets. For this, we use the train_test_split splitter function from Sklearn. The training set is the dataset on which the model is trained. This data is seen and learned by the model. 
The test set is a separate, held-out part of the data that the model does not see during training; it is used for an unbiased evaluation of the final model fit.\n", "With that function, the data gets divided into X_train, X_test, y_train and y_test. X_train and y_train are used for training and fitting the model. The X_test and y_test sets, however, are used for testing whether the model predicts the correct labels. " ] }, { "cell_type": "markdown", "id": "2e75fed4", "metadata": {}, "source": [ "But before splitting the data into training and testing sets, we use StandardScaler() to standardize our data. It standardizes every feature (each column) individually by subtracting the mean and then scaling to unit variance (dividing all values by the standard deviation). As a result, each feature has a mean of 0 and a standard deviation of 1. " ] }, { "cell_type": "markdown", "id": "9c7abf57", "metadata": {}, "source": [ "Also, with such a small sample, the N=27 participants (108 * 25%) in the testing sample could differ considerably from the training sample by chance. To tackle this problem, we repeat the random split into training and testing samples, a form of cross validation, for 5000 iterations."
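, "\n", "\n", "As a minimal sketch of what the standardization and the split do, the example below uses a small synthetic array (not the CT data; all `_demo` names and values are assumptions made for illustration). It checks that `StandardScaler` matches the manual calculation of subtracting the column mean and dividing by the column standard deviation, and that a 25% test split of 108 rows leaves 81 rows for training and 27 for testing. It also shows a variant in which the scaler is fitted on the training rows only, which is an alternative to scaling the full data set before the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the CT matrix: 108 samples, 5 features (invented values)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=2.5, scale=0.3, size=(108, 5))
y_demo = rng.integers(0, 2, size=108)

# StandardScaler subtracts the column mean and divides by the column standard deviation
X_demo_sc = StandardScaler().fit_transform(X_demo)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# split the unscaled demo data: 25% of 108 rows go into the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# alternative: fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_sc = scaler.transform(X_tr)
X_te_sc = scaler.transform(X_te)

print(np.allclose(X_demo_sc, X_manual))
print(X_tr.shape, X_te.shape)
```

In the variant at the end, the test rows do not contribute to the means and standard deviations used for scaling, so their scaled columns will generally not have a mean of exactly 0."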
] }, { "cell_type": "code", "execution_count": 29, "id": "090318e9", "metadata": {}, "outputs": [], "source": [ "n_iter = 5000\n", "y_preds = []\n", "y_tests = []\n", "\n", "\n", "# scale before splitting into test and train samples\n", "X_sc = StandardScaler().fit_transform(X)\n", "\n", "for i in range(n_iter):\n", "\n", "    # take a new testing and training sample\n", "    X_train, X_test, y_train, y_test = train_test_split(X_sc, y, test_size=0.25, random_state=i)\n", "    y_tests.append(y_test)  # store the y_test sample\n", "\n", "    # fit the logistic regression\n", "    classifier = LogisticRegression(random_state=i, solver='liblinear')\n", "    classifier.fit(X_train, y_train)\n", "\n", "    # get the y predictions and store\n", "    y_pred = classifier.predict(X_test)\n", "    y_preds.append(y_pred)" ] }, { "cell_type": "markdown", "id": "8f821651", "metadata": {}, "source": [ "The test_size parameter sets the share of rows that goes into the test subset: about 75% of the rows are sampled randomly without replacement for the training set, and the remaining 25% are put into the test set. The random_state parameter allows you to reproduce the same train/test split each time the code is run. With a different value for random_state, different rows end up in the train and test sets; because random_state is set to the loop index i here, every iteration uses a different but reproducible split. " ] }, { "cell_type": "markdown", "id": "c555676e", "metadata": {}, "source": [ "The following outputs show the first five iterations for our predicted y values and y testing values."
 ] }, { "cell_type": "code", "execution_count": 19, "id": "1ce39e57", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,\n", " 1, 1, 0, 1, 0]),\n", " array([1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,\n", " 0, 1, 0, 1, 0]),\n", " array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,\n", " 1, 1, 0, 1, 1]),\n", " array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 0, 0, 1]),\n", " array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,\n", " 1, 0, 0, 0, 1])]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_preds[:5]" ] }, { "cell_type": "code", "execution_count": 20, "id": "c901050c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[array([1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,\n", " 0, 1, 0, 1, 0]),\n", " array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,\n", " 0, 1, 0, 0, 0]),\n", " array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,\n", " 0, 1, 0, 1, 0]),\n", " array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,\n", " 1, 0, 0, 0, 0]),\n", " array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n", " 1, 0, 0, 0, 1])]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tests[:5]" ] }, { "cell_type": "markdown", "id": "390aa641", "metadata": {}, "source": [ "Next, we concatenate the y_pred and y_test results from all iterations and use them to compute the confusion matrix. 
" ] }, { "cell_type": "code", "execution_count": 40, "id": "1481ef01", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 1, ..., 1, 1, 1])" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_preds = np.concatenate([y_preds])\n", "y_preds" ] }, { "cell_type": "code", "execution_count": 41, "id": "eb27b859", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 0, ..., 1, 0, 1])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tests = np.concatenate([y_tests])\n", "y_tests" ] }, { "cell_type": "markdown", "id": "d487eed3", "metadata": {}, "source": [ "### 1.3 Model evaluation" ] }, { "cell_type": "markdown", "id": "610c1784", "metadata": {}, "source": [ "In the next section, we will have a look at how the logistic regression model performs and evaluate it. To evaluate the model, a look at different measurements such as the **confusion matrix**, **accuracy, precision and recall** is helpful." ] }, { "cell_type": "markdown", "id": "3534e34a", "metadata": {}, "source": [ "#### 1.3.1 Confusion Matrix" ] }, { "cell_type": "markdown", "id": "9f048bc1", "metadata": {}, "source": [ "The **confusion matrix** provides information on the quality of the logistic regression model since it shows the predicted values from the model compared to the actual values from the test dataset. We scale this by the sum of the array to get probabilities for **hits, misses, false positives and true negatives**." 
] }, { "cell_type": "code", "execution_count": 45, "id": "58f882a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix : \n", " [[0.43165926 0.31014074]\n", " [0.00807407 0.25012593]]\n" ] } ], "source": [ "cm_CT = confusion_matrix(y_tests, y_preds)\n", "\n", "cm_CT_f = cm_CT/np.sum(cm_CT)\n", "\n", "print(\"Confusion Matrix : \\n\", cm_CT_f)" ] }, { "cell_type": "markdown", "id": "bc49f766", "metadata": {}, "source": [ "To make the confusion matrix visually more appealing and more informative, we run the following code." ] }, { "cell_type": "code", "execution_count": 46, "id": "1eeac9d6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 257.44, 'Predicted label')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "class_names=[0,1]\n", "\n", "fig, ax = plt.subplots() \n", "tick_marks = np.arange(len(class_names)) \n", "plt.xticks(tick_marks, class_names) \n", "plt.yticks(tick_marks, class_names) \n", "sns.heatmap(pd.DataFrame(cm_CT_f), annot=True, cmap=\"YlGnBu\" ,fmt='g') \n", "ax.xaxis.set_label_position(\"top\") \n", "plt.tight_layout() \n", "plt.title('Confusion Matrix for Cortical Thickness', y=1.1) \n", "plt.ylabel('Actual label') \n", "plt.xlabel('Predicted label')" ] }, { "cell_type": "markdown", "id": "02311028", "metadata": {}, "source": [ "The confusion matrix can be interpreted as follows:\n", "\n", "- Bottom righ square: **True positives** or **hits** -> predicted a subject to be patient and he is\n", "- Upper left square: **True negatives** -> predicted a subject to be control and he is\n", "- Upper right square: **False positive (Type 1 Error)** -> predicted a subject as patient but he is control\n", "- Bottom left square: **False negative (Type 2 Error)** or **misses** -> predicted a subject as control but he is patient" ] }, { "cell_type": "markdown", "id": "d5d2c2dc", "metadata": {}, "source": [ "The results show that the probability for **hits** is around 25%, for **true negatives** around 43%. The probability for **misses** is around 0.8%. The probability for **false positive** cases is around 31%.\n", "As we can see, the model is better at classifying controls than patients. \n", "However, based on these values we can compute other measures such as **accuracy, precision, recall and F1-Score** indicating the qualtiy of the model." 
 ] }, { "cell_type": "markdown", "id": "9103b861", "metadata": {}, "source": [ "#### 1.3.2 Model accuracy, precision, recall and F1-Score" ] }, { "cell_type": "markdown", "id": "c72e03e4", "metadata": {}, "source": [ "The measures indicate the following: \n", "\n", "- **Accuracy**: percentage of correct predictions \n", "- **Precision**: correct positive predictions relative to total positive predictions. In other words: of all the cases predicted as positive, how many are actually positive. \n", "- **Recall**: correct positive predictions relative to total actual positives. In other words: of all the positive cases, how many were predicted correctly. \n", "- **F1-Score**: combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is often used to compare the performance of classifiers." ] }, { "cell_type": "markdown", "id": "ef8b53e9", "metadata": {}, "source": [ "To compute these measures, we can simply use the `sklearn.metrics` module." ] }, { "cell_type": "code", "execution_count": 48, "id": "2eeec99b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.6817851851851852\n", "Precision: 0.446440848273309\n", "Recall: 0.9687293800384428\n", "F1-Score: 0.6112061397554596\n" ] } ], "source": [ "print(\"Accuracy:\",metrics.accuracy_score(y_tests, y_preds)) \n", "\n", "print(\"Precision:\",metrics.precision_score(y_tests, y_preds)) \n", "\n", "print(\"Recall:\",metrics.recall_score(y_tests, y_preds)) \n", "\n", "print(\"F1-Score:\", metrics.f1_score(y_tests, y_preds))" ] }, { "cell_type": "markdown", "id": "ec04efc1", "metadata": {}, "source": [ "As the values show, our logistic regression model for **cortical thickness** makes correct predictions 68.18% of the time. 
The precision of 44.64% is the proportion of the model's psychosis predictions in which psychosis is actually present, and the recall of 96.87% is the proportion of all actual cases of psychosis that the model correctly identified. " ] }, { "cell_type": "markdown", "id": "4c3ab202", "metadata": {}, "source": [ "## 2. Micro-structural data: mean diffusivity (MD) and fractional anisotropy (FA)" ] }, { "cell_type": "markdown", "id": "d4cc3ede", "metadata": {}, "source": [ "In the following section, the logistic regression model is computed for the **micro-structural** data in the same way as for the macro-structural data. We start with **MD**." ] }, { "cell_type": "markdown", "id": "a2ba209f", "metadata": {}, "source": [ "### 2.1 Mean diffusivity" ] }, { "cell_type": "markdown", "id": "c39656f1", "metadata": {}, "source": [ "First, we prepare our data so that it can be used efficiently in the code and adjust it for the logistic regression." ] }, { "cell_type": "markdown", "id": "56a83991", "metadata": {}, "source": [ "#### 2.1.1 Data preparation" ] }, { "cell_type": "code", "execution_count": 50, "id": "d7217d4a", "metadata": {}, "outputs": [], "source": [ "#read data\n", "\n", "MD_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_MD_cortexAv_mean_Dublin.csv')\n", "MD_Dublin = pd.read_csv(MD_Dublin_path)" ] }, { "cell_type": "code", "execution_count": 51, "id": "d8f1126d", "metadata": {}, "outputs": [], "source": [ "#adjust dataframe\n", "\n", "MD_Dublin_adj = MD_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)" ] }, { "cell_type": "code", "execution_count": 52, "id": "6a0c154b", "metadata": {}, "outputs": [], "source": [ "#label group 1 as 0 and 2 as 1\n", "\n", "MD_Dublin_adj['Group'] = MD_Dublin_adj['Group'].replace([1,2],[0, 1])" ] }, { "cell_type": "code", "execution_count": 53, "id": "e12a8836", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<div>\n", "<p>Preview of MD_Dublin_adj with the columns Group, lh_bankssts_part1_thickness, ..., rh_insula_part4_thickness.</p>\n", "<p>115 rows × 309 columns</p>\n", "</div>\n", "
" ], "text/plain": [ " Group lh_bankssts_part1_thickness lh_bankssts_part2_thickness \\\n", "0 0 0.911 0.931 \n", "1 0 0.861 0.913 \n", "2 0 0.817 0.827 \n", "3 0 0.887 0.905 \n", "4 0 0.887 0.854 \n", ".. ... ... ... \n", "110 1 0.843 0.855 \n", "111 1 0.911 0.914 \n", "112 1 0.890 0.899 \n", "113 1 0.920 0.986 \n", "114 1 0.970 0.868 \n", "\n", " lh_caudalanteriorcingulate_part1_thickness \\\n", "0 0.891 \n", "1 0.846 \n", "2 0.828 \n", "3 0.878 \n", "4 0.905 \n", ".. ... \n", "110 0.940 \n", "111 0.926 \n", "112 0.886 \n", "113 0.883 \n", "114 0.940 \n", "\n", " lh_caudalmiddlefrontal_part1_thickness \\\n", "0 1.048 \n", "1 0.927 \n", "2 0.828 \n", "3 0.932 \n", "4 1.011 \n", ".. ... \n", "110 1.017 \n", "111 1.001 \n", "112 0.930 \n", "113 0.879 \n", "114 0.967 \n", "\n", " lh_caudalmiddlefrontal_part2_thickness \\\n", "0 0.881 \n", "1 0.888 \n", "2 0.780 \n", "3 0.820 \n", "4 0.946 \n", ".. ... \n", "110 0.954 \n", "111 0.918 \n", "112 0.883 \n", "113 0.794 \n", "114 0.930 \n", "\n", " lh_caudalmiddlefrontal_part3_thickness \\\n", "0 0.939 \n", "1 0.894 \n", "2 0.843 \n", "3 0.888 \n", "4 0.922 \n", ".. ... \n", "110 0.840 \n", "111 1.115 \n", "112 0.882 \n", "113 0.983 \n", "114 1.022 \n", "\n", " lh_caudalmiddlefrontal_part4_thickness lh_cuneus_part1_thickness \\\n", "0 1.124 0.986 \n", "1 0.924 1.040 \n", "2 0.825 0.848 \n", "3 0.970 0.918 \n", "4 1.034 1.126 \n", ".. ... ... \n", "110 1.128 1.012 \n", "111 1.036 1.026 \n", "112 0.883 1.190 \n", "113 1.029 1.076 \n", "114 0.957 1.112 \n", "\n", " lh_cuneus_part2_thickness ... rh_supramarginal_part5_thickness \\\n", "0 1.045 ... 0.928 \n", "1 1.093 ... 0.878 \n", "2 0.838 ... 0.847 \n", "3 0.900 ... 0.957 \n", "4 1.114 ... 0.871 \n", ".. ... ... ... \n", "110 0.997 ... 0.938 \n", "111 1.001 ... 0.957 \n", "112 1.101 ... 0.916 \n", "113 1.053 ... 0.942 \n", "114 1.004 ... 
1.074 \n", "\n", " rh_supramarginal_part6_thickness rh_supramarginal_part7_thickness \\\n", "0 1.067 1.096 \n", "1 0.985 1.045 \n", "2 0.849 0.819 \n", "3 0.985 0.989 \n", "4 0.952 0.987 \n", ".. ... ... \n", "110 1.062 1.143 \n", "111 1.085 1.098 \n", "112 1.010 0.974 \n", "113 0.985 0.990 \n", "114 1.122 1.215 \n", "\n", " rh_frontalpole_part1_thickness rh_temporalpole_part1_thickness \\\n", "0 0.892 1.238 \n", "1 1.001 1.196 \n", "2 0.952 0.933 \n", "3 1.075 1.150 \n", "4 1.325 0.996 \n", ".. ... ... \n", "110 0.903 1.364 \n", "111 1.059 1.268 \n", "112 0.968 1.305 \n", "113 1.199 1.353 \n", "114 0.891 1.333 \n", "\n", " rh_transversetemporal_part1_thickness rh_insula_part1_thickness \\\n", "0 1.021 1.166 \n", "1 1.083 1.143 \n", "2 0.942 1.059 \n", "3 1.017 0.986 \n", "4 1.094 1.064 \n", ".. ... ... \n", "110 1.284 1.218 \n", "111 1.089 1.173 \n", "112 1.168 1.265 \n", "113 1.187 1.444 \n", "114 1.295 1.284 \n", "\n", " rh_insula_part2_thickness rh_insula_part3_thickness \\\n", "0 0.900 0.907 \n", "1 0.917 0.923 \n", "2 0.794 0.834 \n", "3 0.888 0.916 \n", "4 0.966 0.989 \n", ".. ... ... \n", "110 1.017 0.972 \n", "111 0.990 1.065 \n", "112 0.981 0.975 \n", "113 0.947 1.047 \n", "114 1.125 1.110 \n", "\n", " rh_insula_part4_thickness \n", "0 0.937 \n", "1 0.960 \n", "2 0.860 \n", "3 0.928 \n", "4 0.977 \n", ".. ... \n", "110 1.028 \n", "111 1.021 \n", "112 0.972 \n", "113 1.085 \n", "114 1.139 \n", "\n", "[115 rows x 309 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MD_Dublin_adj" ] }, { "cell_type": "markdown", "id": "a2dba246", "metadata": {}, "source": [ "Very important is, not to forget to convert the dataframe into an numpy array!" ] }, { "cell_type": "code", "execution_count": 54, "id": "aa974bc3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0.911, 0.931, ..., 0.9 , 0.907, 0.937],\n", " [0. , 0.861, 0.913, ..., 0.917, 0.923, 0.96 ],\n", " [0. 
, 0.817, 0.827, ..., 0.794, 0.834, 0.86 ],\n", " ...,\n", " [1. , 0.89 , 0.899, ..., 0.981, 0.975, 0.972],\n", " [1. , 0.92 , 0.986, ..., 0.947, 1.047, 1.085],\n", " [1. , 0.97 , 0.868, ..., 1.125, 1.11 , 1.139]])" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#dataframe as numpy array \n", "\n", "MD_Dublin_adj.to_numpy()" ] }, { "cell_type": "markdown", "id": "9317379a", "metadata": {}, "source": [ "#### 2.1.2 Build the model" ] }, { "cell_type": "markdown", "id": "e57ed474", "metadata": {}, "source": [ "In the next step, the input and output of our model are defined. We use the **MD** of each of the 308 cortical regions as input and the label indicating whether a participant belongs to the control or patient group as output." ] }, { "cell_type": "code", "execution_count": 55, "id": "0decd525", "metadata": {}, "outputs": [], "source": [ "#define input\n", "\n", "X_MD = MD_Dublin_adj.iloc[:,1:309].values" ] }, { "cell_type": "code", "execution_count": 56, "id": "e4fe91ae", "metadata": {}, "outputs": [], "source": [ "#output\n", "\n", "y_MD = MD_Dublin_adj.iloc[:,[0]].values" ] }, { "cell_type": "code", "execution_count": 57, "id": "e7089e2f", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(115, 1)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_MD.shape" ] }, { "cell_type": "markdown", "id": "079848d7", "metadata": {}, "source": [ "Since the further analyses require a flattened 1D array, the .ravel() method is used."
] }, { "cell_type": "code", "execution_count": 58, "id": "d8c37219", "metadata": {}, "outputs": [], "source": [ "y_MD = y_MD.ravel()" ] }, { "cell_type": "code", "execution_count": 59, "id": "0ca72716", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(115,)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_MD.shape" ] }, { "cell_type": "code", "execution_count": 60, "id": "3a491660", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.911, 0.931, 0.891, ..., 0.9 , 0.907, 0.937],\n", " [0.861, 0.913, 0.846, ..., 0.917, 0.923, 0.96 ],\n", " [0.817, 0.827, 0.828, ..., 0.794, 0.834, 0.86 ],\n", " ...,\n", " [0.89 , 0.899, 0.886, ..., 0.981, 0.975, 0.972],\n", " [0.92 , 0.986, 0.883, ..., 0.947, 1.047, 1.085],\n", " [0.97 , 0.868, 0.94 , ..., 1.125, 1.11 , 1.139]])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_MD" ] }, { "cell_type": "markdown", "id": "741bf0a4", "metadata": {}, "source": [ "Again, we build our model with 5000 iterations." 
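, "\n", "\n", "Because the same steps are repeated for CT, MD and FA, they could also be wrapped in one function. The following is only a sketch: the function name `repeated_split_predictions` and the synthetic data from `make_classification` are assumptions for illustration. Unlike the cell below, it fits the scaler inside each iteration on the training rows only (via `make_pipeline`), which is a common way to keep information from the test rows out of the preprocessing.\n", "\n", "
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def repeated_split_predictions(X, y, n_iter=5000, test_size=0.25):
    # hypothetical helper: pool true and predicted labels over repeated random splits
    y_tests, y_preds = [], []
    for i in range(n_iter):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=i)
        # the pipeline fits the scaler on the training rows only
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(random_state=i, solver='liblinear'))
        model.fit(X_train, y_train)
        y_tests.append(y_test)
        y_preds.append(model.predict(X_test))
    # join the per-iteration arrays into two 1D arrays
    return np.concatenate(y_tests), np.concatenate(y_preds)


# synthetic stand-in for X_MD and y_MD (assumption, not the study data)
X_demo, y_demo = make_classification(n_samples=115, n_features=30, random_state=1)

# only 10 iterations to keep the sketch fast
y_tests_demo, y_preds_demo = repeated_split_predictions(X_demo, y_demo, n_iter=10)
print(y_tests_demo.shape, y_preds_demo.shape)
```
"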
 ] }, { "cell_type": "code", "execution_count": 61, "id": "f2c9353a", "metadata": {}, "outputs": [], "source": [ "n_iter_MD = 5000\n", "y_preds_MD = []\n", "y_tests_MD = []\n", "\n", "# scale before splitting into test and train samples\n", "X_sc_MD = StandardScaler().fit_transform(X_MD)\n", "\n", "for i in range(n_iter_MD):\n", "    # take a new testing and training sample\n", "    X_train_MD, X_test_MD, y_train_MD, y_test_MD = train_test_split(X_sc_MD, y_MD, test_size = 0.25, random_state = i)\n", "    y_tests_MD.append(y_test_MD) # store the y_test sample\n", "\n", "    # fit the logistic regression\n", "    classifier_MD = LogisticRegression(random_state = i, solver ='liblinear')\n", "    classifier_MD.fit(X_train_MD, y_train_MD)\n", "\n", "    # get the y predictions and store\n", "    y_pred_MD = classifier_MD.predict(X_test_MD)\n", "    y_preds_MD.append(y_pred_MD)" ] }, { "cell_type": "markdown", "id": "3272350b", "metadata": {}, "source": [ "In the following steps, we again concatenate the values to compute the confusion matrix." ] }, { "cell_type": "code", "execution_count": 72, "id": "1d9afdd9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 0, ..., 1, 0, 1])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# join the per-iteration arrays into one 1D array\n", "y_preds_MD = np.concatenate(y_preds_MD)\n", "y_preds_MD" ] }, { "cell_type": "code", "execution_count": 93, "id": "71323df1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 0, 0, 1])" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tests_MD = np.concatenate(y_tests_MD)\n", "y_tests_MD" ] }, { "cell_type": "markdown", "id": "6ad84688", "metadata": {}, "source": [ "#### 2.1.3 Model evaluation" ] }, { "cell_type": "markdown", "id": "317b6dff", "metadata": {}, "source": [ "Next, we again look at the confusion matrix and then compute the same evaluation measures as for **CT**. 
To get probabilities in the confusion matrix, we again scale the confusion matrix by the sum of the array." ] }, { "cell_type": "markdown", "id": "6f022c85", "metadata": {}, "source": [ "##### 2.1.3.1 Confusion matrix" ] }, { "cell_type": "code", "execution_count": 94, "id": "b92c0047", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix : \n", " [[0.53733103 0.17482759]\n", " [0.07247586 0.21536552]]\n" ] } ], "source": [ "#confusion matrix\n", "\n", "cm_MD = confusion_matrix(y_tests_MD, y_preds_MD)\n", "\n", "cm_MD_f = cm_MD/np.sum(cm_MD)\n", " \n", "print (\"Confusion Matrix : \\n\", cm_MD_f)" ] }, { "cell_type": "markdown", "id": "1ba0fc55", "metadata": {}, "source": [ "To plot the confusion matrix visually more appealing again, the following code can be run." ] }, { "cell_type": "code", "execution_count": 95, "id": "6772de47", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 257.44, 'Predicted label')" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "class_names=[0,1]\n", "\n", "fig, ax = plt.subplots() \n", "tick_marks = np.arange(len(class_names)) \n", "plt.xticks(tick_marks, class_names) \n", "plt.yticks(tick_marks, class_names) \n", "sns.heatmap(pd.DataFrame(cm_MD_f), annot=True, cmap=\"YlGnBu\" ,fmt='g') \n", "ax.xaxis.set_label_position(\"top\") \n", "plt.tight_layout() \n", "plt.title('Confusion matrix for Mean Diffusivity', y=1.1) \n", "plt.ylabel('Actual label') \n", "plt.xlabel('Predicted label')" ] }, { "cell_type": "markdown", "id": "73f431f5", "metadata": {}, "source": [ "The confusion matrix reveals that the probability for **hits** is around 22%, for **true negatives** around 54%. The probability for **misses** is around 7% and for **false positive** cases around 17%. " ] }, { "cell_type": "markdown", "id": "147a9322", "metadata": {}, "source": [ "##### 2.1.3.2 Model accuracy, precision, recall and F1-Score" ] }, { "cell_type": "code", "execution_count": 96, "id": "dc865b19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.752696551724138\n", "Precision: 0.5519459860723249\n", "Recall: 0.7482090231688909\n", "F1-Score: 0.6352642018003356\n" ] } ], "source": [ "#compute accuracy, precision, recall\n", "\n", "print(\"Accuracy:\",metrics.accuracy_score(y_tests_MD, y_preds_MD)) \n", "\n", "print(\"Precision:\",metrics.precision_score(y_tests_MD, y_preds_MD)) \n", "\n", "print(\"Recall:\",metrics.recall_score(y_tests_MD, y_preds_MD)) \n", "\n", "print(\"F1-Score:\", metrics.f1_score(y_tests_MD, y_preds_MD))" ] }, { "cell_type": "markdown", "id": "ea7e839a", "metadata": {}, "source": [ "As the values reveal, our logistic regression model for **mean diffusivity** makes 75.26% of the time correct predictions. 
The precision of 55.19% is the proportion of the model's psychosis predictions in which psychosis is actually present, and the recall of 74.82% is the proportion of all actual cases of psychosis that the model correctly identified. " ] }, { "cell_type": "markdown", "id": "76d6bc86", "metadata": {}, "source": [ "### 2.2 Fractional anisotropy" ] }, { "cell_type": "markdown", "id": "a84e891f", "metadata": {}, "source": [ "Now, the same procedure is applied to **FA**. " ] }, { "cell_type": "markdown", "id": "cd038dc5", "metadata": {}, "source": [ "#### 2.2.1 Data preparation" ] }, { "cell_type": "markdown", "id": "8d9b53c7", "metadata": {}, "source": [ "First, the data is adjusted accordingly. " ] }, { "cell_type": "code", "execution_count": 97, "id": "4ed0d86d", "metadata": {}, "outputs": [], "source": [ "#read data\n", "\n", "FA_Dublin_path = os.path.join(os.pardir, 'data', 'PARC_500.aparc_FA_cortexAv_mean_Dublin.csv')\n", "FA_Dublin = pd.read_csv(FA_Dublin_path)" ] }, { "cell_type": "code", "execution_count": 98, "id": "2245eded", "metadata": {}, "outputs": [], "source": [ "#adjust dataframe\n", "\n", "FA_Dublin_adj = FA_Dublin.drop(['Subject ID','Age', 'Sex'], axis=1)" ] }, { "cell_type": "code", "execution_count": 99, "id": "8e2354f2", "metadata": {}, "outputs": [], "source": [ "#label group 1 as 0 and 2 as 1\n", "\n", "FA_Dublin_adj['Group'] = FA_Dublin_adj['Group'].replace([1,2],[0, 1])" ] }, { "cell_type": "code", "execution_count": 100, "id": "206f4afb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0.322, 0.147, ..., 0.157, 0.147, 0.137],\n", " [0. , 0.302, 0.155, ..., 0.152, 0.148, 0.152],\n", " [0. , 0.324, 0.18 , ..., 0.171, 0.174, 0.143],\n", " ...,\n", " [1. , 0.323, 0.173, ..., 0.181, 0.143, 0.151],\n", " [1. , 0.311, 0.174, ..., 0.162, 0.145, 0.123],\n", " [1. 
, 0.294, 0.164, ..., 0.145, 0.147, 0.127]])" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#dataframe as numpy array \n", "\n", "FA_Dublin_adj.to_numpy()" ] }, { "cell_type": "markdown", "id": "714a804d", "metadata": {}, "source": [ "#### 2.2.2 Build the model" ] }, { "cell_type": "markdown", "id": "1bc4e3c0", "metadata": {}, "source": [ "Again, the input and output variables are defined. Here, we take the **FA** values for every brain region as input." ] }, { "cell_type": "code", "execution_count": 101, "id": "ffa88998", "metadata": {}, "outputs": [], "source": [ "#define input\n", "\n", "X_FA = FA_Dublin_adj.iloc[:,1:309].values" ] }, { "cell_type": "code", "execution_count": 102, "id": "4a6fdc05", "metadata": {}, "outputs": [], "source": [ "#output\n", "\n", "y_FA = FA_Dublin_adj.iloc[:,[0]].values" ] }, { "cell_type": "code", "execution_count": 103, "id": "f7a678ed", "metadata": {}, "outputs": [], "source": [ "y_FA = y_FA.ravel()" ] }, { "cell_type": "markdown", "id": "09e07afb", "metadata": {}, "source": [ "And finally run the model." 
 ] }, { "cell_type": "code", "execution_count": 104, "id": "8417be69", "metadata": {}, "outputs": [], "source": [ "n_iter_FA = 5000\n", "y_preds_FA = []\n", "y_tests_FA = []\n", "\n", "# scale before splitting into test and train samples\n", "X_sc_FA = StandardScaler().fit_transform(X_FA)\n", "\n", "for i in range(n_iter_FA):\n", "    # take a new testing and training sample\n", "    X_train_FA, X_test_FA, y_train_FA, y_test_FA = train_test_split(X_sc_FA, y_FA, test_size = 0.25, random_state = i)\n", "    y_tests_FA.append(y_test_FA) # store the y_test sample\n", "\n", "    # fit the logistic regression\n", "    classifier_FA = LogisticRegression(random_state = i, solver ='liblinear')\n", "    classifier_FA.fit(X_train_FA, y_train_FA)\n", "\n", "    # get the y predictions and store\n", "    y_pred_FA = classifier_FA.predict(X_test_FA)\n", "    y_preds_FA.append(y_pred_FA)" ] }, { "cell_type": "markdown", "id": "7f35473b", "metadata": {}, "source": [ "We then concatenate the values from each iteration to compute our confusion matrix." ] }, { "cell_type": "code", "execution_count": 112, "id": "4db6e2b9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 0, ..., 1, 0, 1])" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# join the per-iteration arrays into one 1D array\n", "y_preds_FA = np.concatenate(y_preds_FA)\n", "y_preds_FA" ] }, { "cell_type": "code", "execution_count": 114, "id": "04ffbc41", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 0, 0, 1])" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tests_FA = np.concatenate(y_tests_FA)\n", "y_tests_FA" ] }, { "cell_type": "markdown", "id": "341d0dfd", "metadata": {}, "source": [ "#### 2.2.3 Model evaluation" ] }, { "cell_type": "markdown", "id": "5afe1404", "metadata": {}, "source": [ "Again, to get probabilities, we divide the confusion matrix by the sum of its entries. 
" ] }, { "cell_type": "markdown", "id": "dced73e0", "metadata": {}, "source": [ "##### 2.2.3.1 Confusion matrix" ] }, { "cell_type": "code", "execution_count": 117, "id": "9fc70ea9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix : \n", " [[0.44015862 0.272 ]\n", " [0.05826897 0.22957241]]\n" ] } ], "source": [ "#confusion matrix\n", "\n", "cm_FA = confusion_matrix(y_tests_FA, y_preds_FA)\n", "\n", "cm_FA_f = cm_FA / np.sum(cm_FA)\n", " \n", " \n", "print (\"Confusion Matrix : \\n\", cm_FA_f)" ] }, { "cell_type": "markdown", "id": "57452435", "metadata": {}, "source": [ "To have a nicer plot, again we run the following code!" ] }, { "cell_type": "code", "execution_count": 118, "id": "6a037656", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 257.44, 'Predicted label')" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "class_names=[0,1]\n", "\n", "fig, ax = plt.subplots() \n", "tick_marks = np.arange(len(class_names)) \n", "plt.xticks(tick_marks, class_names) \n", "plt.yticks(tick_marks, class_names) \n", "sns.heatmap(pd.DataFrame(cm_FA_f), annot=True, cmap=\"YlGnBu\" ,fmt='g') \n", "ax.xaxis.set_label_position(\"top\") \n", "plt.tight_layout() \n", "plt.title('Confusion matrix for Fractional Anisotropy', y=1.1) \n", "plt.ylabel('Actual label') \n", "plt.xlabel('Predicted label')" ] }, { "cell_type": "markdown", "id": "d3b1ac17", "metadata": {}, "source": [ "The confusion matrix shows that the probability for **hits** is around 23%, for **true negatives** around 44%. The probability for **misses** is around 6%. The probability for **false positive** cases is around 27%. " ] }, { "cell_type": "markdown", "id": "7a5f68b1", "metadata": {}, "source": [ "##### 2.2.3.2 Model accuracy, precision, recall and F1-Score" ] }, { "cell_type": "code", "execution_count": 120, "id": "b400b986", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.6697310344827586\n", "Precision: 0.45770542294577055\n", "Recall: 0.7975657090830678\n", "F1-Score: 0.5816275717468222\n" ] } ], "source": [ "#compute accuracy, precision, recall\n", "\n", "print(\"Accuracy:\",metrics.accuracy_score(y_tests_FA, y_preds_FA)) \n", "\n", "print(\"Precision:\",metrics.precision_score(y_tests_FA, y_preds_FA)) \n", "\n", "print(\"Recall:\",metrics.recall_score(y_tests_FA, y_preds_FA)) \n", "\n", "print(\"F1-Score:\", metrics.f1_score(y_tests_FA, y_preds_FA))" ] }, { "cell_type": "markdown", "id": "113adf3d", "metadata": {}, "source": [ "As the values reveal, our logistic regression model for **fractional anisotropy** makes 66.97% of the time correct predictions. 
Of the cases the model predicts as psychosis, 45.77% actually have psychosis (precision), and the model correctly identifies 79.76% of all actual cases of psychosis (recall). " ] }, { "cell_type": "markdown", "id": "5914663f", "metadata": {}, "source": [ "## 3. Summary" ] }, { "cell_type": "markdown", "id": "542475a8", "metadata": {}, "source": [ "To summarize the performance measures of our logistic regression models for **CT**, **MD** and **FA**, we can simply create a table with the relevant data." ] }, { "cell_type": "markdown", "id": "4edf4326", "metadata": {}, "source": [ "For that, we need to import the tabulate function from the tabulate module. If the module is not installed yet, you first have to run the ```pip install tabulate``` command. To do so, remove the hash sign at the start of the first line in the next cell." ] }, { "cell_type": "code", "execution_count": 165, "id": "e03942ec", "metadata": {}, "outputs": [], "source": [ "#!pip install tabulate\n", "\n", "from tabulate import tabulate" ] }, { "cell_type": "markdown", "id": "4b63d003", "metadata": {}, "source": [ "Next, we define our data and the column names and print the table." 
] }, { "cell_type": "code", "execution_count": 184, "id": "a8014e5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Measure Cortical Thickness Mean Diffusivity Fractional Anisotropy\n", "--------- -------------------- ------------------ -----------------------\n", "Accuracy 0.681785 0.752697 0.669731\n", "Precision 0.446441 0.551946 0.457705\n", "Recall 0.968729 0.748209 0.797566\n", "F1-Score 0.611206 0.635264 0.581628\n" ] } ], "source": [ "#create data\n", "data_LR = [[\"Accuracy\", metrics.accuracy_score(y_tests, y_preds), metrics.accuracy_score(y_tests_MD, y_preds_MD), metrics.accuracy_score(y_tests_FA, y_preds_FA)], \n", " [\"Precision\", metrics.precision_score(y_tests, y_preds), metrics.precision_score(y_tests_MD, y_preds_MD),metrics.precision_score(y_tests_FA, y_preds_FA)], \n", " [\"Recall\", metrics.recall_score(y_tests, y_preds), metrics.recall_score(y_tests_MD, y_preds_MD),metrics.recall_score(y_tests_FA, y_preds_FA)], \n", " [\"F1-Score\", metrics.f1_score(y_tests, y_preds), metrics.f1_score(y_tests_MD, y_preds_MD), metrics.f1_score(y_tests_FA, y_preds_FA)]]\n", " \n", "#define header names\n", "col_names = [\"Measure\", \"Cortical Thickness\", \"Mean Diffusivity\", \"Fractional Anisotropy\"]\n", " \n", "#display table\n", "print(tabulate(data_LR, headers=col_names))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7e26df31", "metadata": {}, "source": [ "The table shows that **MD** has the highest **accuracy**, **precision** and **F1-Score**, while **CT** has the highest **recall**. **Precision and recall** typically trade off against each other: as one increases, the other tends to decrease. Judged by the **F1-Score**, which balances the two, **MD** performs best.\n" ] }, { "cell_type": "markdown", "id": "f6fa1c9c", "metadata": {}, "source": [ "On the next pages, a different algorithm is applied to the same data." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "vscode": { "interpreter": { "hash": "578504114d49301275c44c87035f08411733f9928d9347745d7de100c09f7611" } } }, "nbformat": 4, "nbformat_minor": 5 }