
CTS risk of death prediction research project

  1. Introduction

This project is an external research study conducted for the City of Hope by the USC Public Health Data Science department. The dataset analyzed is the California Teachers Study (CTS) data, which combines OSHPD hospitalization records with the CTS survey records. CTS was founded in 1995 with 133,477 teachers, administrators, school nurses, and other members who agreed to provide information on their health and behaviors to CTS researchers. This study uses hospitalizations of CTS participants from 2000 through 2015. The main objective of the project is to predict the short-term risk of death after an in-patient hospitalization, using subject-specific factors such as baseline characteristics from the CTS questionnaires and hospitalization information.

1.1 Project objectives

  1. Develop the best fitting model through machine learning that predicts the probability of death within a certain time window.
  2. Assess whether the time window after hospitalization plays a role in predicting the risk of death.
  3. Assess whether certain co-morbidities play a role in predicting the risk of death.
  4. Determine whether there are any spatial trends in hospitalization-related deaths.

  2. Methods

The initial dataset, provided by the City of Hope, combines the hospitalization data and the survey data from CTS records. To address the specific aims, several analysis tools and prediction methods are used. Data exploration covers all 172 variables of the raw dataset and includes summary statistics, graphical display and plotting, data cleaning, and feature re-creation. Data cleaning is the most important part of the exploration: NAs and invalid observations are removed to build accurate training data for machine learning. The cleaning process also includes mean imputation, grouping values into a moderate number of categories, and re-creating variables in a more analyzable form. Machine learning methods such as logistic regression and Random Forest are used for feature evaluation and selection during exploration. These models check whether each individual variable is an appropriate feature for the final model.
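The cleaning steps named above (dropping rows with a missing target, mean imputation, and regrouping a continuous variable) can be sketched in base R. This is only an illustration: the data frame `toy` and its columns `bmi` and `age` are hypothetical stand-ins, not variables confirmed in the CTS dataset, and the group boundaries are arbitrary.

```r
# Hypothetical toy data; column names and values are illustrative only.
toy <- data.frame(
  deceased = c(0, 1, NA, 1, 0),
  bmi      = c(22.5, NA, 30.1, 27.4, NA),
  age      = c(35, 62, 71, 48, 85)
)

# 1. Drop observations whose target is missing.
clean <- toy[!is.na(toy$deceased), ]

# 2. Mean imputation: replace missing bmi with the mean of the observed values.
bmi_mean <- mean(clean$bmi, na.rm = TRUE)
clean$bmi[is.na(clean$bmi)] <- bmi_mean

# 3. Regroup a continuous variable into a moderate number of levels.
clean$age_group <- cut(clean$age,
                       breaks = c(-Inf, 49, 69, Inf),
                       labels = c("under 50", "50-69", "70 and over"))

print(clean)
```

Imputing after the target-based filter keeps discarded rows from influencing the imputed mean; whether the original analysis used this order is not stated in the report.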

Random Forest is selected as the main machine learning method in this project. Other methods such as logistic regression, Lasso, and ridge regression are also good choices for building a prediction model. However, because the variables differ in their characteristics and the number of observations is much larger than the number of features, Random Forest is believed to be the better method for this study: it is flexible while keeping a moderate bias/variance balance. Its performance is also easy to evaluate, and it can rank the importance of the variables used in the model. Variable importance is measured by the Mean Decrease Gini, which measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting Random Forest. The higher the Mean Decrease Gini score, the more important the variable is in the model.
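To make the Mean Decrease Gini more concrete, the sketch below computes the Gini impurity decrease of a single split, which is the quantity a Random Forest accumulates per variable over all splits and trees. The helpers `gini` and `gini_decrease` and the toy labels are hypothetical; they are not functions of any Random Forest package.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease obtained by splitting y into left/right child nodes.
gini_decrease <- function(y, go_left) {
  n <- length(y)
  left  <- y[go_left]
  right <- y[!go_left]
  gini(y) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Toy node: 4 alive (0) and 4 deceased (1).
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

perfect_split <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
useless_split <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

gini_decrease(y, perfect_split)  # both children are pure
gini_decrease(y, useless_split)  # children are as mixed as the parent
```

A variable that often produces splits like the first one accumulates a large decrease and therefore ranks high in the importance plot.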

The performance of the prediction model is evaluated by the AUC (area under the curve) of the ROC curve, a graph showing the performance of a classification model at all classification thresholds. The confusion matrix shows the number of true/false positive and negative predictions. Prediction accuracy is another important check of model performance. Model performance is determined on the test set, while the training and validation sets are used for model training and confirmation.
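As a minimal sketch of these metrics, the base R code below builds a confusion matrix and accuracy at one threshold and computes the AUC through the rank-sum (Mann-Whitney) identity. The vectors `truth` and `prob` are made-up test-set values and the 0.5 threshold is an assumption; the report does not say which tool it used for these calculations.

```r
# Hypothetical test-set labels and predicted probabilities of death.
truth <- c(0, 0, 1, 1, 0, 1, 0, 1)
prob  <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.55, 0.70)

# Confusion matrix and accuracy at a single threshold (0.5, assumed).
pred <- as.integer(prob >= 0.5)
conf <- table(True = truth, Predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)

# AUC via the rank-sum identity: the probability that a randomly chosen
# deceased case is scored above a randomly chosen alive case.
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(conf)
cat("accuracy:", accuracy, " AUC:", auc, "\n")
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes the ranking quality across all thresholds, which is why the two are reported side by side.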

2.1 Exploratory Data Analysis (EDA)

The initial dataset has a total of 132538 observations with 172 variables. Counting unique participant keys, it contains hospitalization and survey data for only 44494 participants. Although this carries a risk of duplicated data, all 132538 observations are kept in the analyses to avoid omitting important information on certain variables. Errors and invalid data are filtered out during data cleaning. Of the total observations, 64099 were alive and 68439 were deceased at the end of the eligible period from 2000 through 2015. Throughout the exploration, the target variable “deceased” is coded “0” for alive and “1” for deceased so that the analysis and modeling tools operate seamlessly.

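library(dplyr)  # provides %>% and count()
# Number of distinct participants
length(unique(combined_data$participant_key))
## [1] 44494
# Number of observations by outcome (0 = alive, 1 = deceased)
combined_data %>%
  count(deceased)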
##   deceased     n
## 1        0 64099
## 2        1 68439

2.1.1 Key variable analyses

Variable analysis 1: Days till death since admission

One of the primary objectives of the study is to assess whether the time window after hospitalization affects the risk of death. For this analysis, days were counted from admission so that all patients who died during or after hospitalization are included. A new variable (days_after_admission) counts the days by subtracting the admission date from the date of death for each observation.

The results show a clear trend by time window: the death count in the first month after admission is very high for such a short period compared with the counts over longer periods.

To predict the probability of death within a certain time window, a new binary variable (deceased or alive) was created from the days_after_admission variable for each time window (1mo, 6mo, 1yr, 3yr, and 5yr). It was used as the target variable in the prediction model for that time window.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     186     960    1418    2235    6741   64099
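The construction of days_after_admission and the per-window binary targets can be sketched as follows. Only `days_after_admission` is a name taken from the report; `admission_date`, `date_of_death`, the `deceased_*` column names, and the window lengths in days (30-day month, 365-day year) are assumptions for illustration. The sketch also treats anyone without a recorded death inside the window as alive, which ignores differences in follow-up time.

```r
# Hypothetical records; column names other than days_after_admission are assumed.
toy <- data.frame(
  admission_date = as.Date(c("2001-03-01", "2004-06-15", "2010-01-10", "2012-09-30")),
  date_of_death  = as.Date(c("2001-03-20", "2005-02-01", NA, "2019-01-01"))
)

# Days from admission to death; NA for participants still alive.
toy$days_after_admission <- as.numeric(toy$date_of_death - toy$admission_date)

# One binary target per window: 1 if death occurred within the window, else 0.
# Window lengths in days are an assumption (30-day month, 365-day year).
windows <- c(`1mo` = 30, `6mo` = 182, `1yr` = 365, `3yr` = 1095, `5yr` = 1825)
for (w in names(windows)) {
  d <- toy$days_after_admission
  toy[[paste0("deceased_", w)]] <- as.integer(!is.na(d) & d <= windows[[w]])
}

print(toy)
```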

Below are the table and plots showing the number of deaths among all observations in each time window.

##   death_timewindow_after_admission     n
## 1                           in 1mo  7568
## 2                    in 1mo to 6mo  9332
## 3                    in 6mo to 1yr  5412
## 4                    in 1yr to 3yr 14134
## 5                    in 3yr to 5yr 10384
## 6                         over 5yr 21525
## 7                             <NA> 64099

Variable analysis 2: Length of stay

Length of stay is another interesting variable. The mean hospitalization is 4.7 days and the maximum is 1792 days. Although the gap between the mean and the maximum is large, the first and third quartiles are only 2 days and 5 days. To view the survival rate intuitively, the variable was grouped into length-of-stay tiers (0, 1-2, 3-7, 8-30, and over 30 days). The bar plot shows that the survival rate generally decreases as length of stay increases. Interestingly, the 0 group has a lower survival rate than the 1-2 group. Logistic regression showed a significant association between length of stay and the target variable (deceased). This variable was included in the final model as a predictor.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    2.000    3.000    4.713    5.000 1792.000
##    
##         0   1-2   3-7  8-30 over 30
##   0 53.66 59.40 45.51 27.46   29.89
##   1 46.34 40.60 54.49 72.54   70.11
##    
##         0   1-2   3-7  8-30 over 30
##   0  1.17 22.08 21.19  2.92    1.03
##   1  1.01 15.09 25.37  7.72    2.42
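The tiering and the two percentage tables above can be reproduced in form with `cut()` and `prop.table()`. The vectors `los` and `deceased` below are hypothetical values, and the exact break points are an assumption inferred from the tier labels.

```r
# Hypothetical lengths of stay (days) and outcomes.
los      <- c(0, 1, 2, 3, 7, 8, 30, 31, 1792)
deceased <- c(1, 0, 0, 1, 0, 1, 1, 1, 0)

# Right-closed intervals: (-1,0], (0,2], (2,7], (7,30], (30,Inf].
tier <- cut(los,
            breaks = c(-1, 0, 2, 7, 30, Inf),
            labels = c("0", "1-2", "3-7", "8-30", "over 30"))

counts <- table(deceased, tier)

# Column percentages: outcome split within each tier (as in the first table).
round(100 * prop.table(counts, margin = 2), 2)

# Cell percentages: share of all observations (as in the second table).
round(100 * prop.table(counts), 2)
```

The first table answers "what share of a tier died", while the second shows how much of the whole dataset each tier and outcome combination represents.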

## 
## Call:
## glm(formula = deceased ~ length_of_stay_tier, family = binomial, 
##     data = combined_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6078  -1.2547   0.8013   1.1020   1.3427  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -0.14664    0.03735  -3.926 8.63e-05 ***
## length_of_stay_tier1-2     -0.23387    0.03846  -6.081 1.19e-09 ***
## length_of_stay_tier3-7      0.32662    0.03821   8.547  < 2e-16 ***
## length_of_stay_tier8-30     1.11815    0.04185  26.719  < 2e-16 ***
## length_of_stay_tierover 30  0.99895    0.04940  20.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 183483  on 132453  degrees of freedom
## Residual deviance: 177628  on 132449  degrees of freedom
## AIC: 177638
## 
## Number of Fisher Scoring iterations: 4
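The coefficients in this summary are on the log-odds scale. As a worked example, the sketch below converts them into odds ratios and fitted death probabilities per tier; the numbers are copied from the output above, and the variable names are hypothetical.

```r
# Coefficients copied from the glm summary above (reference tier: 0 days).
intercept <- -0.14664
tier_coef <- c(`0` = 0, `1-2` = -0.23387, `3-7` = 0.32662,
               `8-30` = 1.11815, `over 30` = 0.99895)

# Odds ratio of death for each tier relative to the 0-day tier.
odds_ratio <- exp(tier_coef)

# Fitted probability of death per tier: inverse logit of the linear predictor.
p_death <- plogis(intercept + tier_coef)

round(odds_ratio, 2)
round(100 * p_death, 2)  # compare with the "1" row of the column-percentage table
```

Because the model has a single categorical predictor, the fitted probabilities agree with the observed death percentages per tier in the table shown before the regression output.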

Variable analysis 3: Age groups

Age is a major confounder in most predictive analyses of healthcare data. For convenience, observations were grouped into six age groups (22-39, 40-49, 50-59, 60-69, 70-79, and over 80). As expected, the survival rate diminishes as age increases, except that the 70-79 group’s survival rate is lower than that of the over 80 group. Age group was included in the final model as a predictor. In addition, separate machine learning and prediction were conducted for each age group for deeper analysis.

##    
##     22-39 40-49 50-59 60-69 70-79 over 80
##   0  9253 11546 17677 17072  4627    3924
##   1   588  2455  7038 21561 27344    9369
##    
##     22-39 40-49 50-59 60-69 70-79 over 80
##   0 94.02 82.47 71.52 44.19 14.47   29.52
##   1  5.98 17.53 28.48 55.81 85.53   70.48
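The percentage table can be recomputed from the count table directly above, as in this sketch. The counts are copied from the output; the `age` vector and the `cut()` break points are hypothetical and assume that age 80 belongs to the "over 80" group.

```r
# Counts copied from the first table above (rows: deceased 0/1).
age_levels <- c("22-39", "40-49", "50-59", "60-69", "70-79", "over 80")
counts <- matrix(c(9253, 11546, 17677, 17072,  4627, 3924,
                    588,  2455,  7038, 21561, 27344, 9369),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(deceased = c("0", "1"), age_group = age_levels))

# Column percentages reproduce the second table.
pct <- round(100 * prop.table(counts, margin = 2), 2)
print(pct)

# A hypothetical age vector can be binned into the same groups with cut().
age <- c(25, 44, 59, 60, 79, 80, 93)
age_group <- cut(age, breaks = c(21, 39, 49, 59, 69, 79, Inf), labels = age_levels)
table(age_group)
```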

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2000-05-01" "2000-05-10" "2000-05-15" "2000-07-10" "2000-08-15" "2001-11-14"
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2000-05-01" "2000-05-10" "2000-05-16" "2000-07-12" "2000-08-23" "2002-05-20"

Variable analysis 4: Cause of death code

This variable is not suitable as a feature for the prediction model because it only has values for deceased patients. A model built on this variable will predict true positives very well, but at the same time false positives will be very high. This happens because the model is trained on a variable for which every observation with a value is eventually deceased, so the model will be biased. However, it is interesting to see the top 15 causes of death, each of which has at least 1000 observations under its code. Observations with the codes “0000” and “7777” or with NA were all replaced with “Alive” after eliminating errors such as alive observations with a cause of death code or deceased observations with NA values. As a result, a total of 64082 alive and 68355 deceased observations remain in the dataset.

This variable would have been the most important variable for the prediction model, but it has a critical flaw. Because this code is not directly comparable with the ICD-9 codes in other variables of the dataset, it cannot be used alone or converted into a complete predictor. There are two reasons: (1) this variable only has a code for deceased observations, and (2) the primary diagnosis ICD-9 code cannot be compared with the code in this variable because the primary diagnosis codes are more finely subdivided. For instance, Alzheimer’s disease, the second largest cause of death in the dataset, is referred to as “G309” with 4522 cases in the cause of death code. However, in the primary diagnosis ICD-9 code, the same disease is subdivided into “3310”, “G309”, “G301”, “G300”, and “G308”; only 32 cases fall under “G309”, while “3310” has 1034 cases. (Codes of the form “G309” follow the ICD-10 format, whereas “3310” is the ICD-9 form, so part of the mismatch comes from mixed coding systems.) For these reasons, the level of specificity of the codes must be synchronized before the important code-related variables can be used for predictive analytics with machine learning.
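The first reason, that the field is populated only for deceased observations, can be checked with a simple cross-tabulation. The sketch below uses a hypothetical data frame `toy`; the column names follow the tables in this report, but the rows are invented.

```r
# Hypothetical records: cause_of_death_cde is only filled for deceased rows.
toy <- data.frame(
  deceased           = c(0, 0, 1, 1, 0, 1),
  cause_of_death_cde = c(NA, NA, "I251", "G309", NA, "J449"),
  major_diag_cat_cde = c(8, 5, 5, 1, 14, 4)
)

# Cross-tabulate "is the field populated?" against the target.
has_code <- !is.na(toy$cause_of_death_cde)
leak_tab <- table(populated = has_code, deceased = toy$deceased)
print(leak_tab)

# If populated-ness alone reproduces the target, the field is recorded after
# the outcome and carries no information available at prediction time.
leaks_target <- all(has_code == (toy$deceased == 1))
leaks_target
```

A table with zeros off the diagonal is a sign that the variable encodes the outcome itself and should be kept out of the feature set.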

##   deceased     n
## 1        0 64082
## 2        1 68355

##    cause_of_death_cde                                   cause_of_death_dsc    n
## 1                I251                        Atherosclerotic heart disease 5797
## 2                G309                     Alzheimer's disease, unspecified 4522
## 3                J449   Chronic obstructive pulmonary disease, unspecified 3084
## 4                I500                             Congestive heart failure 2546
## 5                C509            Malignant neoplasm of breast, unspecified 2517
## 6                I219             Acute myocardial infarction, unspecified 2446
## 7                 I64    Stroke, not specified as hemorrhage or infarction 2446
## 8                C349  Malignant neoplasm of bronchus or lung, unspecified 2253
## 9                 F03                                 Unspecified dementia 1813
## 10               J189                               Pneumonia, unspecified 1767
## 11                C56                          Malignant neoplasm of ovary 1513
## 12                I48                      Atrial fibrillation and flutter 1151
## 13               C259          Malignant neoplasm of pancreas, unspecified 1133
## 14               I250 Atherosclerotic cardiovascular disease, so described 1110
## 15               C189             Malignant neoplasm of colon, unspecified 1058
##     Predicted
## True     0     1
##    0     0 19189
##    1     2 20541

Variable analysis 5: Cause of death code (deceased) + Primary diagnosis ICD-9 code (alive)

To confirm the above assumption, a new variable was created that combines the cause of death code for deceased patients with the primary diagnosis ICD-9 code for non-deceased patients. A random forest model was used to make a prediction based solely on these codes. The result shows that the accuracy of the model is below 50%.

##   is.na(diag_icd1)      n
## 1            FALSE 132437
##     Predicted
## True     0     1
##    0 19075   114
##    1 20543     0
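The "below 50%" figure follows directly from this confusion matrix. A short sketch of the arithmetic, with the counts copied from the output above and hypothetical variable names:

```r
# Confusion matrix copied from the output above (rows: true, columns: predicted).
conf <- matrix(c(19075,   114,
                 20543,     0),
               nrow = 2, byrow = TRUE,
               dimnames = list(True = c("0", "1"), Predicted = c("0", "1")))

accuracy    <- sum(diag(conf)) / sum(conf)        # share of correct predictions
sensitivity <- conf["1", "1"] / sum(conf["1", ])  # deceased correctly flagged
specificity <- conf["0", "0"] / sum(conf["0", ])  # alive correctly flagged

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

The model labels almost every observation as alive, so it identifies none of the deceased cases even though its specificity is high.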

Variable analysis 6: Major diagnosis category code

This is one of the most important diagnosis variables for the prediction model. The death rate by diagnosis category is shown in the bar plot. Logistic regression shows a significant association with the target variable, and the random forest prediction model gives an accuracy well above 50% (about 64% in the confusion matrix at the end of this analysis).

00 = Ungroupable
01 = Nervous System, Diseases & Disorders
02 = Eye, Diseases & Disorders
03 = Ear, Nose, Mouth, & Throat, Diseases & Disorders
04 = Respiratory System, Diseases & Disorders
05 = Circulatory System, Diseases & Disorders
06 = Digestive System, Diseases & Disorders
07 = Hepatobiliary System & Pancreas, Diseases & Disorders
08 = Musculoskeletal System & Connective Tissue, Diseases & Disorders
09 = Skin, Subcutaneous Tissue & Breast, Diseases & Disorders
10 = Endocrine, Nutritional, and Metabolic, Diseases & Disorders
11 = Kidney and Urinary Tract, Diseases & Disorders
12 = Male Reproductive System, Diseases & Disorders
13 = Female Reproductive System, Diseases & Disorders
14 = Pregnancy, Childbirth, & The Puerperium
15 = Newborns and Neonate Conditions Began in Perinatal Period
16 = Blood, Blood Forming Organs, Immunological, Diseases & Disorders
17 = Myeloproliferative Diseases & Poorly Differentiated Neoplasms
18 = Infectious & Parasitic Diseases
19 = Mental Diseases & Disorders
20 = Alcohol-Drug Use and Alcohol-Drug Induced Organic Mental Diseases
21 = Injuries, Poisonings, and Toxic Effects of Drugs
22 = Burns
23 = Factors on Health Status & Other Contacts with Health Services
24 = Multiple Significant Trauma
25 = Human Immunodeficiency Virus Infections

##    major_diag_cat_cde     n
## 1                   0  3907
## 2                   1  9104
## 3                   2   147
## 4                   3  1049
## 5                   4 10892
## 6                   5 20532
## 7                   6 14235
## 8                   7  3331
## 9                   8 28128
## 10                  9  4546
## 11                 10  4500
## 12                 11  5150
## 13                 13  6409
## 14                 14  3371
## 15                 15     1
## 16                 16  1413
## 17                 17  1593
## 18                 18  5379
## 19                 19  1690
## 20                 20   313
## 21                 21  1296
## 22                 22    23
## 23                 23  5207
## 24                 24   214
## 25                 25     7
##    
##         0     1     2     3     4     5     6     7     8     9    10    11
##   0  1863  3221    79   564  2734  7790  7133  1719 17892  2435  2248  1553
##   1  2044  5883    68   485  8158 12742  7102  1612 10236  2111  2252  3597
##    
##        13    14    15    16    17    18    19    20    21    22    23    24
##   0  5155  3338     0   363   378  1599   955   185   667    12  2097   102
##   1  1254    33     1  1050  1215  3780   735   128   629    11  3110   112
##    
##        25
##   0     0
##   1     7

## 
## Call:
## glm(formula = deceased ~ major_diag_cat_cde, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6962  -1.1174   0.7603   0.9768   3.0419  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.60237    0.02192  27.482  < 2e-16 ***
## major_diag_cat_cde2   -0.75232    0.16687  -4.508 6.53e-06 ***
## major_diag_cat_cde3   -0.75328    0.06569 -11.467  < 2e-16 ***
## major_diag_cat_cde4    0.49086    0.03113  15.770  < 2e-16 ***
## major_diag_cat_cde5   -0.11031    0.02622  -4.208 2.58e-05 ***
## major_diag_cat_cde6   -0.60673    0.02759 -21.988  < 2e-16 ***
## major_diag_cat_cde7   -0.66664    0.04102 -16.252  < 2e-16 ***
## major_diag_cat_cde8   -1.16082    0.02518 -46.101  < 2e-16 ***
## major_diag_cat_cde9   -0.74516    0.03694 -20.170  < 2e-16 ***
## major_diag_cat_cde10  -0.60060    0.03700 -16.230  < 2e-16 ***
## major_diag_cat_cde11   0.23754    0.03745   6.343 2.25e-10 ***
## major_diag_cat_cde13  -2.01600    0.03837 -52.548  < 2e-16 ***
## major_diag_cat_cde14  -5.21899    0.17630 -29.602  < 2e-16 ***
## major_diag_cat_cde15   9.96365  119.46804   0.083 0.933533    
## major_diag_cat_cde16   0.45977    0.06471   7.105 1.20e-12 ***
## major_diag_cat_cde17   0.56523    0.06284   8.995  < 2e-16 ***
## major_diag_cat_cde18   0.25797    0.03702   6.969 3.20e-12 ***
## major_diag_cat_cde19  -0.86422    0.05374 -16.081  < 2e-16 ***
## major_diag_cat_cde20  -0.97070    0.11704  -8.294  < 2e-16 ***
## major_diag_cat_cde21  -0.66103    0.05975 -11.064  < 2e-16 ***
## major_diag_cat_cde22  -0.68939    0.41800  -1.649 0.099095 .  
## major_diag_cat_cde23  -0.20826    0.03576  -5.824 5.76e-09 ***
## major_diag_cat_cde24  -0.50885    0.13861  -3.671 0.000242 ***
## major_diag_cat_cde25   9.96365   45.15468   0.221 0.825360    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178050  on 128529  degrees of freedom
## Residual deviance: 161559  on 128506  degrees of freedom
## AIC: 161607
## 
## Number of Fisher Scoring iterations: 9
##     Predicted
## True     0     1
##    0  9919  8727
##    1  5205 14708

Variable analysis 7: Major diagnosis category code + ccs diagnosis code + cause of death code

To confirm that the cause of death code is not a good feature for the prediction model, two models were compared. One model is trained with all major diagnosis variables plus the cause of death code, and the other is trained without the cause of death code.

Although the model with the cause of death code appears to perform better, the result is flawed: it has a higher false positive rate while showing a very small false negative rate. This is again because the variable only has values for deceased patients. Also, in the variable importance plot, the importance score of a single variable (cause of death code) greatly exceeds that of the other variables, which does not make sense considering that all three variables used in model training are similar diagnosis codes. This result confirms that the cause of death code should not be included in the model.

With cause of death code
##     Predicted
## True     0     1
##    0  8048 10536
##    1    40 19935

Without cause of death code
##     Predicted
## True     0     1
##    0 10689  7895
##    1  4036 15939
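The comparison described above can be made explicit by computing the error rates of both confusion matrices. The counts are copied from the two outputs; the helper `rates` and the object names are hypothetical.

```r
# Confusion matrices copied from the two outputs above (rows: true, cols: predicted).
with_cod    <- matrix(c( 8048, 10536,    40, 19935), nrow = 2, byrow = TRUE)
without_cod <- matrix(c(10689,  7895,  4036, 15939), nrow = 2, byrow = TRUE)

rates <- function(m) {
  c(accuracy            = sum(diag(m)) / sum(m),
    false_positive_rate = m[1, 2] / sum(m[1, ]),  # alive predicted as deceased
    false_negative_rate = m[2, 1] / sum(m[2, ]))  # deceased predicted as alive
}

comparison <- rbind(with_cause_of_death    = rates(with_cod),
                    without_cause_of_death = rates(without_cod))
round(comparison, 3)
```

The model with the cause of death code gains a few points of accuracy, but it does so by flagging more than half of the alive observations as deceased while almost never missing a death.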

Variable analysis 8: Co-morbidity frequency of deceased patients by time window

One of the initial objectives of the project is to assess whether certain co-morbidities play a role in predicting the risk of death. To see the trend, the top 5 codes in each of the second through fifth ccs code variables were counted for the deceased observations in each time window group.

The comparison of co-morbidities among deceased observations shows small differences between time windows, but nothing dramatic. This means that certain co-morbidities in each time window group may play a role in predicting the risk of death if the model is built on these predictors. In this project, however, co-morbidities were left out of the final feature selection because they have a large number of missing values, which could eliminate too many observations and harm predictive capability.
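Counting the most frequent codes per time window can be sketched in base R as shown here. `diag_ccs2` follows the column naming used in the frequency tables of this report; `death_window`, the helper `top_n_codes`, and all rows are hypothetical.

```r
# Hypothetical deceased records; death_window and its values are assumed.
toy <- data.frame(
  death_window = c("1mo", "1mo", "1mo", "1yr", "1yr", "1yr", "1yr"),
  diag_ccs2    = c("Secondary malignancies", "Respiratory failure",
                   "Respiratory failure", "Secondary malignancies",
                   "Secondary malignancies", NA, "Cardiac dysrhythmias"),
  stringsAsFactors = FALSE
)

# Top-n most frequent codes in one column, keeping NA as its own category.
top_n_codes <- function(x, n = 5) {
  freq <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  head(freq, n)
}

# One frequency table per time window.
top_by_window <- lapply(split(toy$diag_ccs2, toy$death_window), top_n_codes)
print(top_by_window)
```

Keeping NA as a category mirrors the frequency tables in this section, where missing codes appear among the top entries.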

Comorbidity frequency of deceased and alive patients in total study period

Excluding missing values, essential hypertension and cardiac dysrhythmias are the most frequent co-morbidities in the whole dataset.

CCS_code_2 frequency
##   diag_ccs_code2                       diag_ccs2    n
## 1             NA                            <NA> 7440
## 2             98          Essential hypertension 6829
## 3            106            Cardiac dysrhythmias 5892
## 4            159        Urinary tract infections 4730
## 5             55 Fluid and electrolyte disorders 4494
CCS_code_3 frequency
##   diag_ccs_code3                       diag_ccs3     n
## 1             NA                            <NA> 13953
## 2             98          Essential hypertension  9733
## 3            106            Cardiac dysrhythmias  5351
## 4             55 Fluid and electrolyte disorders  4304
## 5             53   Disorders of lipid metabolism  3364
CCS_code_4 frequency
##   diag_ccs_code4                       diag_ccs4     n
## 1             NA                            <NA> 22440
## 2             98          Essential hypertension  9278
## 3             53   Disorders of lipid metabolism  4574
## 4            106            Cardiac dysrhythmias  4018
## 5             55 Fluid and electrolyte disorders  3731
CCS_code_5 frequency
##   diag_ccs_code5                     diag_ccs5     n
## 1             NA                          <NA> 32123
## 2             98        Essential hypertension  7661
## 3             53 Disorders of lipid metabolism  4763
## 4             48             Thyroid disorders  3717
## 5            106          Cardiac dysrhythmias  3206

Comorbidity frequency of deceased patients in 1 month after admission

For observations that died within 1 month of admission, respiratory failure, secondary malignancies, pneumonia, and congestive heart failure were the most frequent co-morbidities.

CCS_code_2 frequency
##   diag_ccs_code2                                                                      diag_ccs2   n
## 1            131                             Respiratory failure; insufficiency; arrest (adult) 725
## 2            122 Pneumonia (except that caused by tuberculosis or sexually transmitted disease) 484
## 3             42                                                         Secondary malignancies 439
## 4            108                                      Congestive heart failure; nonhypertensive 391
## 5            157                                            Acute and unspecified renal failure 338
CCS_code_3 frequency
##   diag_ccs_code3                                          diag_ccs3   n
## 1             42                             Secondary malignancies 453
## 2            131 Respiratory failure; insufficiency; arrest (adult) 453
## 3             55                    Fluid and electrolyte disorders 389
## 4            106                               Cardiac dysrhythmias 387
## 5            108          Congestive heart failure; nonhypertensive 360
CCS_code_4 frequency
##   diag_ccs_code4                                 diag_ccs4   n
## 1             55           Fluid and electrolyte disorders 417
## 2             NA                                      <NA> 416
## 3             42                    Secondary malignancies 365
## 4            108 Congestive heart failure; nonhypertensive 358
## 5            106                      Cardiac dysrhythmias 336
CCS_code_5 frequency
##   diag_ccs_code5                                 diag_ccs5   n
## 1             NA                                      <NA> 564
## 2             55           Fluid and electrolyte disorders 426
## 3            106                      Cardiac dysrhythmias 320
## 4            108 Congestive heart failure; nonhypertensive 284
## 5             98                    Essential hypertension 252

Comorbidity frequency of deceased patients in 1 year after admission

For observations that died within 1 year of admission, secondary malignancies, congestive heart failure, respiratory failure, and pneumonia were the most frequent co-morbidities.

CCS_code_2 frequency
##   diag_ccs_code2                                                                      diag_ccs2    n
## 1             42                                                         Secondary malignancies 1421
## 2            108                                      Congestive heart failure; nonhypertensive 1254
## 3            131                             Respiratory failure; insufficiency; arrest (adult) 1177
## 4            122 Pneumonia (except that caused by tuberculosis or sexually transmitted disease) 1135
## 5            106                                                           Cardiac dysrhythmias  988
CCS_code_3 frequency
##   diag_ccs_code3                                 diag_ccs3    n
## 1             42                    Secondary malignancies 1364
## 2            106                      Cardiac dysrhythmias 1153
## 3             55           Fluid and electrolyte disorders 1134
## 4            108 Congestive heart failure; nonhypertensive 1055
## 5             NA                                      <NA>  973
CCS_code_4 frequency
##   diag_ccs_code4                       diag_ccs4    n
## 1             NA                            <NA> 1329
## 2             55 Fluid and electrolyte disorders 1108
## 3             42          Secondary malignancies 1032
## 4            106            Cardiac dysrhythmias 1026
## 5             98          Essential hypertension  955
CCS_code_5 frequency
##   diag_ccs_code5                                 diag_ccs5    n
## 1             NA                                      <NA> 1876
## 2             98                    Essential hypertension 1062
## 3             55           Fluid and electrolyte disorders 1000
## 4            106                      Cardiac dysrhythmias  912
## 5            108 Congestive heart failure; nonhypertensive  753

Comorbidity frequency of deceased patients in 5 years after admission

For observations that died within 5 years of admission, congestive heart failure, cardiac dysrhythmias, and essential hypertension were the most frequent co-morbidities.

CCS_code_2 frequency
##   diag_ccs_code2                                 diag_ccs2    n
## 1            108 Congestive heart failure; nonhypertensive 2619
## 2            106                      Cardiac dysrhythmias 2542
## 3            159                  Urinary tract infections 2303
## 4             42                    Secondary malignancies 2157
## 5             55           Fluid and electrolyte disorders 1946
CCS_code_3 frequency
##   diag_ccs_code3                                 diag_ccs3    n
## 1            106                      Cardiac dysrhythmias 2678
## 2             NA                                      <NA> 2361
## 3             55           Fluid and electrolyte disorders 2233
## 4            108 Congestive heart failure; nonhypertensive 2183
## 5             98                    Essential hypertension 2063
CCS_code_4 frequency
##   diag_ccs_code4                                 diag_ccs4    n
## 1             NA                                      <NA> 3362
## 2             98                    Essential hypertension 2613
## 3            106                      Cardiac dysrhythmias 2203
## 4             55           Fluid and electrolyte disorders 2044
## 5            108 Congestive heart failure; nonhypertensive 1850
CCS_code_5 frequency
##   diag_ccs_code5                                 diag_ccs5    n
## 1             NA                                      <NA> 4875
## 2             98                    Essential hypertension 2748
## 3            106                      Cardiac dysrhythmias 1867
## 4             55           Fluid and electrolyte disorders 1742
## 5            108 Congestive heart failure; nonhypertensive 1421

2.1.2 Hospitalization data exploration and clean up

  1. CCS_code - primary diagnosis, primary procedure

The primary diagnosis ccs code is a good feature for training a model, but the secondary and lower-ranked ccs code variables include too many missing values, which would shrink the eligible data if they were included in the model. The same applies to all procedure ccs codes, as they also have too many missing values.

##                                                                         diag_ccs1     n
## 1                                                                  Osteoarthritis 12131
## 2                                                    Septicemia (except in labor)  4359
## 3                                                            Cardiac dysrhythmias  4317
## 4           Rehabilitation care; fitting of prostheses; and adjustment of devices  4229
## 5  Pneumonia (except that caused by tuberculosis or sexually transmitted disease)  3842
## 6                                       Congestive heart failure; nonhypertensive  3453
## 7                 Spondylosis; intervertebral disc disorders; other back problems  3315
## 8                                                 Fracture of neck of femur (hip)  3293
## 9                                                   Acute cerebrovascular disease  3247
## 10                                                                      Undefined  3159
## 11                                       Complication of device; implant or graft  2695
## 12                                                         Nonspecific chest pain  2674
## 13                                                       Urinary tract infections  2531
## 14                                          Intestinal obstruction without hernia  2322
## 15                               Coronary atherosclerosis and other heart disease  2279
##   is.na(diag_ccs_code1)      n
## 1                 FALSE 128530
##   is.na(diag_ccs_code2)      n
## 1                 FALSE 121090
## 2                  TRUE   7440
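The effect of missing secondary codes on the usable sample can be sketched as follows. The column names follow the output above, while the data frame `toy` and its values are hypothetical.

```r
# Hypothetical slice of the hospitalization data; values are invented.
toy <- data.frame(
  diag_ccs_code1 = c(203, 2, 106, 254, 122, 108),
  diag_ccs_code2 = c(98, NA, 106, 159, NA, 55),
  diag_ccs_code3 = c(NA, NA, 98, NA, NA, 53)
)

# Share of missing values per column.
missing_share <- colMeans(is.na(toy))
round(missing_share, 3)

# Rows that would survive listwise deletion if each column were added in turn.
rows_kept <- sapply(seq_along(toy),
                    function(k) sum(complete.cases(toy[, 1:k, drop = FALSE])))
names(rows_kept) <- names(toy)
rows_kept
```

In the output above, 7440 of 128530 observations (about 5.8%) lack a second ccs code, and the loss compounds with each further code variable required to be complete.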

  2. Admission type

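Not included in the final model: in the logistic regression below, only one admission_typ level (admission_typ1) is significant at the 0.01 level, and the remaining levels show no significant association.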

##   admission_typ     n
## 1             0    10
## 2             1 45832
## 3             2 82615
## 4             3     4
## 5             4    69
## 
## Call:
## glm(formula = deceased ~ admission_typ, family = binomial, data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7941  -0.9014   0.9831   0.9831   1.4812  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      1.3863     0.7906   1.754  0.07951 . 
## admission_typ1  -2.0770     0.7906  -2.627  0.00861 **
## admission_typ2  -0.9104     0.7906  -1.152  0.24951   
## admission_typ3  -1.3863     1.2748  -1.087  0.27682   
## admission_typ4  -0.8210     0.8293  -0.990  0.32219   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178050  on 128529  degrees of freedom
## Residual deviance: 168466  on 128525  degrees of freedom
## AIC: 168476
## 
## Number of Fisher Scoring iterations: 4
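
To read the coefficient table above, it can help to convert the log-odds back to probabilities. The following is a minimal sketch in Python (the report itself was produced in R; Python is used here only for illustration). It assumes the default treatment coding with admission_typ = 0 as the reference level, and the helper name `inv_logit` is ours, not part of the original analysis.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Estimates copied from the glm summary above (reference level: admission_typ = 0).
intercept = 1.3863
offsets = {"1": -2.0770, "2": -0.9104, "3": -1.3863, "4": -0.8210}

# Fitted probability of death for the reference level.
p_ref = inv_logit(intercept)

# Fitted probability for each other level: intercept plus that level's offset.
p_level = {k: inv_logit(intercept + b) for k, b in offsets.items()}

print(f"admission_typ 0: {p_ref:.3f}")
for k, p in p_level.items():
    print(f"admission_typ {k}: {p:.3f}")
```

Under these assumptions the reference level has a fitted probability of about 0.80, estimated from only 10 records, which is consistent with the large standard errors on every offset.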

  1. source admission code

It does not show a consistent significant association: only a handful of source admission codes are significantly associated with death, whereas several codes that cover a large portion of patients are not.

## 
## Call:
## glm(formula = deceased ~ src_admission_cde, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8405  -1.2531   0.4546   1.0041   1.7941  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.680e-10  1.000e+00   0.000 1.000000    
## src_admission_cde010  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde011  1.335e-01  1.126e+00   0.119 0.905600    
## src_admission_cde012  1.157e+01  1.393e+02   0.083 0.933819    
## src_admission_cde013  1.763e-01  1.000e+00   0.176 0.860046    
## src_admission_cde023  4.016e+00  1.120e+00   3.586 0.000336 ***
## src_admission_cde031 -1.811e-01  1.006e+00  -0.180 0.857175    
## src_admission_cde032  1.157e+01  1.393e+02   0.083 0.933819    
## src_admission_cde033  2.469e-01  1.035e+00   0.239 0.811486    
## src_admission_cde041  1.977e+00  1.061e+00   1.863 0.062484 .  
## src_admission_cde042  2.351e+00  1.129e+00   2.083 0.037216 *  
## src_admission_cde043  3.169e+00  1.056e+00   3.000 0.002701 ** 
## src_admission_cde051  1.301e+00  1.002e+00   1.298 0.194451    
## src_admission_cde052  8.849e-01  1.003e+00   0.883 0.377502    
## src_admission_cde053  1.157e+01  9.849e+01   0.117 0.906515    
## src_admission_cde061  8.183e-01  1.047e+00   0.782 0.434436    
## src_admission_cde062  8.473e-01  1.058e+00   0.801 0.423154    
## src_admission_cde063  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde083 -1.157e+01  1.970e+02  -0.059 0.953175    
## src_admission_cde091  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde092  1.157e+01  8.042e+01   0.144 0.885639    
## src_admission_cde093  6.242e-01  1.050e+00   0.594 0.552201    
## src_admission_cde100 -1.157e+01  1.137e+02  -0.102 0.918992    
## src_admission_cde111  1.099e+00  1.528e+00   0.719 0.472011    
## src_admission_cde112 -6.931e-01  1.323e+00  -0.524 0.600299    
## src_admission_cde121  1.946e+00  1.464e+00   1.329 0.183746    
## src_admission_cde122 -2.683e-10  1.155e+00   0.000 1.000000    
## src_admission_cde130  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde131  4.224e-01  1.000e+00   0.422 0.672777    
## src_admission_cde132 -7.991e-01  1.000e+00  -0.799 0.424258    
## src_admission_cde1X3  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde211  1.157e+01  1.137e+02   0.102 0.918992    
## src_admission_cde221  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde222  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde231  2.218e+00  1.003e+00   2.212 0.026970 *  
## src_admission_cde232  1.792e+00  1.016e+00   1.764 0.077705 .  
## src_admission_cde311  2.683e-01  1.033e+00   0.260 0.795173    
## src_admission_cde312 -5.199e-01  1.003e+00  -0.518 0.604157    
## src_admission_cde321 -1.157e+01  1.393e+02  -0.083 0.933819    
## src_admission_cde322 -4.700e-01  1.151e+00  -0.408 0.683044    
## src_admission_cde331 -9.808e-01  1.208e+00  -0.812 0.416675    
## src_admission_cde332 -8.899e-01  1.015e+00  -0.877 0.380748    
## src_admission_cde411  1.897e+00  1.092e+00   1.738 0.082234 .  
## src_admission_cde412  1.145e+00  1.016e+00   1.128 0.259500    
## src_admission_cde421  2.090e+00  1.023e+00   2.043 0.041099 *  
## src_admission_cde422  1.369e+00  1.041e+00   1.316 0.188312    
## src_admission_cde431  2.058e+00  1.003e+00   2.053 0.040099 *  
## src_admission_cde432  1.676e+00  1.011e+00   1.658 0.097405 .  
## src_admission_cde4X3  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde500 -1.157e+01  1.970e+02  -0.059 0.953175    
## src_admission_cde511  5.108e-01  1.125e+00   0.454 0.649915    
## src_admission_cde512  5.240e-01  1.001e+00   0.524 0.600476    
## src_admission_cde521  1.823e-01  1.014e+00   0.180 0.857307    
## src_admission_cde522  3.589e-01  1.001e+00   0.359 0.719828    
## src_admission_cde531  1.157e+01  1.137e+02   0.102 0.918992    
## src_admission_cde532 -5.108e-01  1.125e+00  -0.454 0.649915    
## src_admission_cde611  1.386e+00  1.275e+00   1.087 0.276816    
## src_admission_cde612  8.220e-01  1.018e+00   0.808 0.419334    
## src_admission_cde621  9.886e-01  1.042e+00   0.949 0.342739    
## src_admission_cde622  1.289e-01  1.009e+00   0.128 0.898278    
## src_admission_cde631  1.157e+01  1.393e+02   0.083 0.933819    
## src_admission_cde632 -1.386e+00  1.500e+00  -0.924 0.355384    
## src_admission_cde731  6.931e-01  1.581e+00   0.438 0.661107    
## src_admission_cde831  1.099e+00  1.528e+00   0.719 0.472011    
## src_admission_cde832 -1.157e+01  1.393e+02  -0.083 0.933819    
## src_admission_cde902  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde911  1.157e+01  1.970e+02   0.059 0.953175    
## src_admission_cde912  9.163e-01  1.304e+00   0.703 0.482204    
## src_admission_cde921  5.557e-02  1.009e+00   0.055 0.956089    
## src_admission_cde922  8.552e-02  1.003e+00   0.085 0.932039    
## src_admission_cde931  9.607e-01  1.016e+00   0.946 0.344151    
## src_admission_cde932 -7.020e-02  1.017e+00  -0.069 0.944988    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178050  on 128529  degrees of freedom
## Residual deviance: 164816  on 128458  degrees of freedom
## AIC: 164960
## 
## Number of Fisher Scoring iterations: 10
## 
## Call:
## glm(formula = deceased ~ src_site_cde, family = binomial, data = combined_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.929  -1.026  -1.026   1.337   1.665  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -0.6931     1.2247  -0.566   0.5714  
## src_site_cde1   0.3257     1.2248   0.266   0.7903  
## src_site_cde2   2.3844     1.2271   1.943   0.0520 .
## src_site_cde3  -0.2829     1.2319  -0.230   0.8184  
## src_site_cde4   2.2285     1.2266   1.817   0.0693 .
## src_site_cde5   0.7074     1.2253   0.577   0.5637  
## src_site_cde6   0.8169     1.2307   0.664   0.5068  
## src_site_cde8  -0.4055     1.6832  -0.241   0.8096  
## src_site_cde9   0.7526     1.2265   0.614   0.5395  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66401  on 48377  degrees of freedom
## Residual deviance: 64275  on 48369  degrees of freedom
##   (84076 observations deleted due to missingness)
## AIC: 64293
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = deceased ~ src_licensure_cde, family = binomial, 
##     data = combined_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.205  -1.068  -1.068   1.291   1.893  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -1.609      1.095  -1.469    0.142
## src_licensure_cde1    1.577      1.097   1.438    0.150
## src_licensure_cde2    1.673      1.096   1.527    0.127
## src_licensure_cde3    1.347      1.095   1.229    0.219
## src_licensure_cdeX   11.175     51.251   0.218    0.827
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66401  on 48377  degrees of freedom
## Residual deviance: 66307  on 48373  degrees of freedom
##   (84076 observations deleted due to missingness)
## AIC: 66317
## 
## Number of Fisher Scoring iterations: 8
## 
## Call:
## glm(formula = deceased ~ src_route_cde, family = binomial, data = combined_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2910  -0.8057  -0.8057   1.0679   1.6019  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.91629    0.83666  -1.095    0.273
## src_route_cde1  1.17941    0.83675   1.410    0.159
## src_route_cde2 -0.04211    0.83680  -0.050    0.960
## src_route_cde3 10.48225   51.24581   0.205    0.838
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66401  on 48377  degrees of freedom
## Residual deviance: 62310  on 48374  degrees of freedom
##   (84076 observations deleted due to missingness)
## AIC: 62318
## 
## Number of Fisher Scoring iterations: 8

  1. Payer category code

Only groups 1, 2, 3, and 4 contain a meaningful number of patients. Category 0 (invalid records) is dropped from the data to show the association more clearly. The variable shows a significant association.

##    
##         0     1     2     3     4     5     6     7     8     9
##   0     6 33897   148 26787   589    17   156    14   308   297
##   1     7 57335   135  8240    74    15    69     9   199   228
##    
##         0     1     2     3     4     5     6     7     8     9
##   0 46.15 37.15 52.30 76.48 88.84 53.12 69.33 60.87 60.75 56.57
##   1 53.85 62.85 47.70 23.52 11.16 46.88 30.67 39.13 39.25 43.43

## 
## Call:
## glm(formula = deceased ~ payer_cat_cde, family = binomial, data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4072  -1.4072   0.9638   0.9638   2.0941  
## 
## Coefficients:
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     0.525585   0.006851   76.711  < 2e-16 ***
## payer_cat_cde2 -0.617522   0.119210   -5.180 2.22e-07 ***
## payer_cat_cde3 -1.704501   0.014340 -118.864  < 2e-16 ***
## payer_cat_cde4 -2.599946   0.123521  -21.049  < 2e-16 ***
## payer_cat_cde5 -0.650748   0.354312   -1.837   0.0663 .  
## payer_cat_cde6 -1.341334   0.144741   -9.267  < 2e-16 ***
## payer_cat_cde7 -0.967417   0.427302   -2.264   0.0236 *  
## payer_cat_cde8 -0.962380   0.091208  -10.552  < 2e-16 ***
## payer_cat_cde9 -0.789971   0.088317   -8.945  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178032  on 128516  degrees of freedom
## Residual deviance: 161208  on 128508  degrees of freedom
## AIC: 161226
## 
## Number of Fisher Scoring iterations: 4
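
The percentage table and the glm coefficients above are two views of the same counts. The sketch below, written in Python for illustration only, recomputes both from the cross-tabulation. It assumes that the table rows are deceased = 0 and deceased = 1, and that category 1 is the reference level once category 0 is dropped; the variable and function names are ours.

```python
import math

# Counts copied from the cross-tabulation above:
# first row = not deceased, second row = deceased; columns are payer_cat_cde 0-9.
alive = [6, 33897, 148, 26787, 589, 17, 156, 14, 308, 297]
dead = [7, 57335, 135, 8240, 74, 15, 69, 9, 199, 228]

# Column percentages: share of deceased within each payer category.
pct_dead = [100 * d / (a + d) for a, d in zip(alive, dead)]

def log_odds(category):
    """Log-odds of death within one payer category."""
    return math.log(dead[category] / alive[category])

# With category 0 dropped, category 1 is the reference level, so the glm
# intercept equals its log-odds and each coefficient is a difference of
# log-odds relative to category 1.
intercept = log_odds(1)
coef_cat3 = log_odds(3) - log_odds(1)

print(round(pct_dead[1], 2), round(intercept, 4), round(coef_cat3, 4))
```

Under those assumptions the recomputed values agree with the reported intercept (0.5256) and payer_cat_cde3 coefficient (-1.7045) to within rounding.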

  1. Payer coverage type

It shows a significant association.

##   payer_coverage_typ     n
## 1                  0  1077
## 2                  1 51331
## 3                  2 14401
## 4                  3 61708
## 
## Call:
## glm(formula = deceased ~ payer_coverage_typ, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3833  -1.1393   0.9846   0.9846   1.7089  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.34316    0.06184  -5.549 2.87e-08 ***
## payer_coverage_typ1  0.25290    0.06247   4.048 5.16e-05 ***
## payer_coverage_typ2 -0.85273    0.06491 -13.136  < 2e-16 ***
## payer_coverage_typ3  0.81528    0.06239  13.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178032  on 128516  degrees of freedom
## Residual deviance: 170326  on 128513  degrees of freedom
## AIC: 170334
## 
## Number of Fisher Scoring iterations: 4

  1. Total charges amount

It shows a significant association. Removing patients with 0 total charges leaves the association essentially unchanged (z = 5.28 before cleaning, 5.74 after) while discarding about 26,000 records, presumably many of them from patients who were healthy enough not to receive serious treatment.

##  [1]     0 10988 61866     0 18062  7265 19475 20565 35509 18927 25161 21571
## [13]  8098 17592 37395 73502 35285 18356  8812     0
## [1] 3270166
## 
## Call:
## glm(formula = deceased ~ total_charges_amt, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.685  -1.201   1.124   1.154   1.158  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       4.539e-02  6.567e-03   6.912 4.78e-12 ***
## total_charges_amt 4.191e-07  7.945e-08   5.276 1.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178032  on 128516  degrees of freedom
## Residual deviance: 178004  on 128515  degrees of freedom
## AIC: 178008
## 
## Number of Fisher Scoring iterations: 3
## 
## Call:
## glm(formula = deceased ~ total_charges_amt, family = binomial, 
##     data = charges_cleaned_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.755  -1.200   1.117   1.156   1.162  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.539e-02  7.776e-03   4.551 5.33e-06 ***
## total_charges_amt 4.831e-07  8.421e-08   5.737 9.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141491  on 102134  degrees of freedom
## Residual deviance: 141457  on 102133  degrees of freedom
## AIC: 141461
## 
## Number of Fisher Scoring iterations: 3
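
Because total_charges_amt is measured in dollars, its coefficient is a change in log-odds per single dollar and is hard to read directly. The sketch below, in Python for illustration, rescales it to odds ratios for larger increments. The increments chosen are arbitrary examples and the function name is ours.

```python
import math

# Coefficient copied from the first glm summary above (log-odds per 1 dollar).
beta_per_dollar = 4.191e-07

def odds_ratio(beta, delta):
    """Multiplicative change in the odds of death for a `delta` increase in the predictor."""
    return math.exp(beta * delta)

for delta in (1, 10_000, 100_000):
    print(f"+${delta:>7,}: odds ratio = {odds_ratio(beta_per_dollar, delta):.4f}")
```

Read this way, an additional $100,000 in charges corresponds to roughly a 4% increase in the odds of death, so the association is statistically significant but small in magnitude.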

  1. Patient care type

It shows a significant association.

##   patient_care_typ      n
## 1                0      2
## 2                1 120435
## 3                3   4027
## 4                4   1492
## 5                5     93
## 6                6   2468
## 
## Call:
## glm(formula = deceased ~ patient_care_typ, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5593  -1.1955   0.8387   1.1594   1.8762  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.042602   0.005764   7.391 1.46e-13 ***
## patient_care_typ3  0.821425   0.034982  23.481  < 2e-16 ***
## patient_care_typ4 -0.486760   0.053372  -9.120  < 2e-16 ***
## patient_care_typ5 -1.613819   0.274809  -5.873 4.29e-09 ***
## patient_care_typ6  0.182000   0.040921   4.448 8.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178029  on 128514  degrees of freedom
## Residual deviance: 177272  on 128510  degrees of freedom
## AIC: 177282
## 
## Number of Fisher Scoring iterations: 4

  1. Patient disposition code

It shows a significant association. One of the code categories is “died”, which directly indicates deceased status. Neither the variable nor the “died” category is removed, however, because the rest of the variable provides important information for the prediction and removing the category would eliminate a large part of the deceased population. The “died” category accounts for only about 4% of the total deceased population.

01 = Routine (home)
02 = Acute Care within the admitting hospital
03 = Other Care within the admitting hospital
04 = Skilled Nursing / Intermediate Care (SN/IC) within the admitting hospital
05 = Acute Care at another hospital
06 = Other Care (not SN/IC) at another hospital
07 = Skilled Nursing / Intermediate Care (SN/IC) at another facility
08 = Residential Care Facility
09 = Prison/Jail
10 = Left Against Medical Advice
11 = Died
12 = Home Health Service
13 = Other
00 = Invalid/Blank

##    patient_disposition_cde     n
## 1                        0    12
## 2                        1 72120
## 3                        2   538
## 4                        3  2909
## 5                        4  3334
## 6                        5  2266
## 7                        6  3377
## 8                        7 17678
## 9                        8  1705
## 10                       9     3
## 11                      10   268
## 12                      11  3270
## 13                      12 19916
## 14                      13   382
## 15                      20   228
## 16                      21     1
## 17                      50   194
## 18                      51    45
## 19                      61    10
## 20                      62   148
## 21                      63    48
## 22                      64    12
## 23                      65    11
## 24                      70    13
## 25                      81     1
## 26                      83     3
## 27                      86     1
## 28                      89     1
## 29                      93     1
## 30                      99    20
## 
## Call:
## glm(formula = deceased ~ patient_disposition_cde, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6623  -1.0129   0.0495   1.1084   1.7682  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -0.400095   0.007597 -52.666  < 2e-16 ***
## patient_disposition_cde2    1.341240   0.096251  13.935  < 2e-16 ***
## patient_disposition_cde3    0.562015   0.037971  14.801  < 2e-16 ***
## patient_disposition_cde4    1.315756   0.039077  33.671  < 2e-16 ***
## patient_disposition_cde5    0.984588   0.044475  22.138  < 2e-16 ***
## patient_disposition_cde6    0.179483   0.035449   5.063 4.13e-07 ***
## patient_disposition_cde7    1.419269   0.018655  76.080  < 2e-16 ***
## patient_disposition_cde8    2.357449   0.073935  31.885  < 2e-16 ***
## patient_disposition_cde9   -0.293052   1.224768  -0.239   0.8109    
## patient_disposition_cde10   0.564644   0.122818   4.597 4.28e-06 ***
## patient_disposition_cde11   7.105122   0.500353  14.200  < 2e-16 ***
## patient_disposition_cde12   0.668209   0.016192  41.267  < 2e-16 ***
## patient_disposition_cde13   2.120395   0.142778  14.851  < 2e-16 ***
## patient_disposition_cde20   5.825045   1.002229   5.812 6.17e-09 ***
## patient_disposition_cde21 -10.165933 119.468043  -0.085   0.9322    
## patient_disposition_cde50   2.620442   0.241668  10.843  < 2e-16 ***
## patient_disposition_cde51   4.184285   1.011328   4.137 3.51e-05 ***
## patient_disposition_cde61  -0.447203   0.690107  -0.648   0.5170    
## patient_disposition_cde62  -0.928092   0.202145  -4.591 4.41e-06 ***
## patient_disposition_cde63   1.613118   0.343502   4.696 2.65e-06 ***
## patient_disposition_cde64   0.736567   0.585589   1.258   0.2085    
## patient_disposition_cde65   0.582417   0.605578   0.962   0.3362    
## patient_disposition_cde70  -0.069909   0.570138  -0.123   0.9024    
## patient_disposition_cde81 -10.165933 119.468043  -0.085   0.9322    
## patient_disposition_cde83 -10.165933  68.974907  -0.147   0.8828    
## patient_disposition_cde86  10.966123 119.468043   0.092   0.9269    
## patient_disposition_cde89  10.966123 119.468043   0.092   0.9269    
## patient_disposition_cde93  10.966123 119.468043   0.092   0.9269    
## patient_disposition_cde99   1.019134   0.468869   2.174   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178013  on 128502  degrees of freedom
## Residual deviance: 163569  on 128474  degrees of freedom
## AIC: 163627
## 
## Number of Fisher Scoring iterations: 9

2.1.3 Survey data exploration and clean up

  1. variable importance of baseline characteristics from the CTS questionnaires

Based on variable importance, the variables with a Mean Decrease Gini score of 1000 or above in each feature category are selected and cleaned as candidate features for the final prediction model.

1.1 Background characteristics
##     Predicted
## True     0     1
##    0 29985 17317
##    1 13959 34457

1.2. Reproductive history
##     Predicted
## True     0     1
##    0 26664 13030
##    1  8223 38882

1.3. Health History
##     Predicted
## True     0     1
##    0 47475 12376
##    1 27594 32028

1.4. Physical activity
##     Predicted
## True     0     1
##    0 54212  5545
##    1  8698 47201

1.5. Diet_data
##     Predicted
## True     0     1
##    0 58389     2
##    1    18 57867

1.6. Alcohol & tobacco data
##     Predicted
## True     0     1
##    0 45249 12334
##    1 14288 45846

1.7. Medication data
##     Predicted
## True     0     1
##    0 40497 15262
##    1 26462 23226

  1. Other survey data exploration and clean up

2.1 SES quartile
##   ses_quartile_ind     n
## 1                1  5817
## 2                2 22466
## 3                3 40484
## 4                4 57854
## 5               NA  1882
##    
##         1     2     3     4
##   0 46.97 45.13 48.28 49.97
##   1 53.03 54.87 51.72 50.03

## 
## Call:
## glm(formula = deceased ~ ses_quartile_ind, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.273  -1.210   1.085   1.145   1.176  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.295186   0.021082   14.00   <2e-16 ***
## ses_quartile_ind -0.072935   0.006369  -11.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175410  on 126620  degrees of freedom
## Residual deviance: 175279  on 126619  degrees of freedom
##   (1882 observations deleted due to missingness)
## AIC: 175283
## 
## Number of Fisher Scoring iterations: 3
2.2 Urbanization blockgroup code
##   blockgroup90_urban_cat     n
## 1                     1R 16568
## 2                     2T  4458
## 3                     3C 22337
## 4                     4S 70390
## 5                     5M 12899
## 6                   <NA>  1851
##    
##        1R    2T    3C    4S    5M
##   0  8085  1897 10963 34192  6213
##   1  8483  2561 11374 36198  6686
##    
##        1R    2T    3C    4S    5M
##   0 48.80 42.55 49.08 48.58 48.17
##   1 51.20 57.45 50.92 51.42 51.83

## 
## Call:
## glm(formula = deceased ~ blockgroup90_urban_cat, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.307  -1.202   1.053   1.153   1.162  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.048054   0.015542   3.092  0.00199 ** 
## blockgroup90_urban_cat2T  0.252070   0.034046   7.404 1.32e-13 ***
## blockgroup90_urban_cat3C -0.011250   0.020511  -0.548  0.58337    
## blockgroup90_urban_cat4S  0.008959   0.017275   0.519  0.60406    
## blockgroup90_urban_cat5M  0.025318   0.023497   1.078  0.28124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175454  on 126651  degrees of freedom
## Residual deviance: 175386  on 126647  degrees of freedom
##   (1851 observations deleted due to missingness)
## AIC: 175396
## 
## Number of Fisher Scoring iterations: 3
2.3 Surgery indicators

## 
## Call:
## glm(formula = deceased ~ hysterectomy_ind, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.275  -1.152   1.083   1.203   1.203  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.059377   0.007408  -8.015  1.1e-15 ***
## hysterectomy_ind  0.285397   0.011305  25.245  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178013  on 128502  degrees of freedom
## Residual deviance: 177374  on 128501  degrees of freedom
## AIC: 177378
## 
## Number of Fisher Scoring iterations: 3

## 
## Call:
## glm(formula = deceased ~ bilateral_mastectomy_ind, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.210  -1.210   1.146   1.146   1.524  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.075258   0.005623   13.38   <2e-16 ***
## bilateral_mastectomy_ind -0.861789   0.050975  -16.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178013  on 128502  degrees of freedom
## Residual deviance: 177704  on 128501  degrees of freedom
## AIC: 177708
## 
## Number of Fisher Scoring iterations: 4

## 
## Call:
## glm(formula = deceased ~ bilateral_oophorectomy_ind, family = binomial, 
##     data = modified_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.226  -1.197   1.130   1.158   1.158  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.045094   0.006542   6.893 5.46e-12 ***
## bilateral_oophorectomy_ind 0.068135   0.012549   5.430 5.65e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178013  on 128502  degrees of freedom
## Residual deviance: 177983  on 128501  degrees of freedom
## AIC: 177987
## 
## Number of Fisher Scoring iterations: 3

  1. Result

3.1 Random Forest prediction model by different time window

The final prediction models use Random Forest. Each model is trained on a training set and validated on a held-out test set. Performance is evaluated with the area under the ROC curve (AUC), which summarizes the trade-off between sensitivity and specificity. The error rate and accuracy rate are reported alongside a confusion matrix showing the counts of true and false positives and negatives.

The prediction models were built on target variables defined for different time windows: the total study period, 1 month after admission, 6 months after admission, 1 year after admission, 3 years after admission, and 5 years after admission.

Variable importance was measured with Mean Decrease Gini to see which variables play the most critical role in each prediction.
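
The time-window targets can be derived from an admission date and a date of death. The sketch below shows one way to do this, in Python for illustration only. The function name, the inclusive window boundary, and the example dates are all assumptions made for this sketch; the report does not state how the original target variables were coded.

```python
from datetime import date

# Windows (in days) matching the suffixes of the target names used in the report.
WINDOWS = {"1 month": 30, "6 months": 180, "1 year": 365, "3 years": 1095, "5 years": 1825}

def deceased_within(admission, death, days):
    """Return 1 if death occurred within `days` of admission, else 0.

    `death` is None for participants with no recorded death.
    The boundary is treated as inclusive (an assumption).
    """
    if death is None:
        return 0
    elapsed = (death - admission).days
    return int(0 <= elapsed <= days)

# Hypothetical record, made up for illustration only.
admission = date(2005, 3, 1)
death = date(2005, 7, 15)

labels = {name: deceased_within(admission, death, d) for name, d in WINDOWS.items()}
print(labels)
```

A patient who dies 136 days after admission, as in this made-up record, is a negative case for the 1-month target and a positive case for every longer window, so the class balance shifts toward positives as the window grows.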

  1. Deceased in total study periods

RandomForest model with target variable: deceased

##     Predicted
## True    0    1
##    0 8659 1063
##    1  839 9753
## Error rate
## [1] 0.094
## Accuracy rate
## [1] 0.90637

## Area under the curve: 0.9683
## 95% CI: 0.9673-0.9694 (DeLong)
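
The error rate and accuracy rate follow directly from the confusion matrix, and the same counts also give sensitivity and specificity. The sketch below, in Python for illustration, recomputes them for the total-study-period model. It assumes that class 1 (deceased) is the positive class; the function name is ours.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Summary rates from a 2x2 confusion matrix (rows = true, columns = predicted)."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual deaths predicted as deaths
        "specificity": tn / (tn + fp),  # share of survivors predicted as survivors
    }

# Counts copied from the total-study-period confusion matrix above.
m = confusion_metrics(tn=8659, fp=1063, fn=839, tp=9753)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The recomputed accuracy matches the reported 0.90637, and on these counts sensitivity and specificity are both close to 0.9, so this model is about equally good at identifying deaths and survivors.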

  1. Deceased in 1 month

RandomForest model with target variable: deceased_in30after_admission

##     Predicted
## True     0     1
##    0 19053    36
##    1   646   579
## Error rate
## [1] 0.034
## Accuracy rate
## [1] 0.9664271

## Area under the curve: 0.904
## 95% CI: 0.8992-0.9088 (DeLong)

  1. Deceased in 6 months

RandomForest model with target variable: deceased_in180_after_admission

##     Predicted
## True     0     1
##    0 17472   170
##    1  1805   867
## Error rate
## [1] 0.097
## Accuracy rate
## [1] 0.9027764

## Area under the curve: 0.8857
## 95% CI: 0.8824-0.8889 (DeLong)

  1. Deceased in 1 year

RandomForest model with target variable: deceased_in365_after_admission

##     Predicted
## True     0     1
##    0 16516   292
##    1  2214  1292
## Error rate
## [1] 0.123
## Accuracy rate
## [1] 0.8766368

## Area under the curve: 0.8865
## 95% CI: 0.8836-0.8894 (DeLong)

  1. Deceased in 3 years

RandomForest model with target variable: deceased_in1095_after_admission

##     Predicted
## True     0     1
##    0 13551  1051
##    1  2176  3536
## Error rate
## [1] 0.159
## Accuracy rate
## [1] 0.841144

## Area under the curve: 0.8977
## 95% CI: 0.8954-0.8999 (DeLong)

  1. Deceased in 5 years

RandomForest model with target variable: deceased_in1825_after_admission

##     Predicted
## True     0     1
##    0 13734   481
##    1   720  5378
## Error rate
## [1] 0.059
## Accuracy rate
## [1] 0.9408753

## Area under the curve: 0.9029
## 95% CI: 0.9005-0.9054 (DeLong)

3.2 Short-term (6 months) risk of death prediction without the patient disposition code variable

  1. Deceased in 1 month

RandomForest model with target variable: deceased_in30after_admission

##     Predicted
## True     0     1
##    0 19011    78
##    1  1182    43
## Error rate
## [1] 0.062
## Accuracy rate
## [1] 0.9379738

## Area under the curve: 0.8201
## 95% CI: 0.8144-0.8258 (DeLong)

  1. Deceased in 6 months

RandomForest model with target variable: deceased_in180_after_admission

##     Predicted
## True     0     1
##    0 17381   261
##    1  2239   433
## Error rate
## [1] 0.123
## Accuracy rate
## [1] 0.8769322

## Area under the curve: 0.8472
## 95% CI: 0.8435-0.8508 (DeLong)
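
Accuracy alone can hide how the short-term models change when the patient disposition code is removed. The sketch below, in Python for illustration, compares accuracy and sensitivity for the 1-month and 6-month models with and without that variable, using the confusion matrices reported in sections 3.1 and 3.2. It assumes class 1 (deceased) is the positive class; the dictionary layout and function names are ours.

```python
# (tn, fp, fn, tp) copied from the confusion matrices in sections 3.1 and 3.2.
models = {
    ("1 month", "with disposition code"): (19053, 36, 646, 579),
    ("1 month", "without disposition code"): (19011, 78, 1182, 43),
    ("6 months", "with disposition code"): (17472, 170, 1805, 867),
    ("6 months", "without disposition code"): (17381, 261, 2239, 433),
}

def accuracy(tn, fp, fn, tp):
    """Share of all test records classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)

def sensitivity(tn, fp, fn, tp):
    """Share of actual deaths that the model predicted as deaths."""
    return tp / (tp + fn)

summary = {key: (accuracy(*c), sensitivity(*c)) for key, c in models.items()}
for (window, variant), (acc, sens) in summary.items():
    print(f"{window:9s} {variant:25s} accuracy={acc:.3f} sensitivity={sens:.3f}")
```

On these counts accuracy drops by only a few points when the disposition code is removed, but sensitivity falls from about 47% to about 4% at 1 month and from about 32% to about 16% at 6 months. Because deaths are rare within these windows, a model can reach high accuracy while missing most of them.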

  1. Conclusion

The final prediction models, which target the risk of death in different time windows after hospitalization, provided clear findings. First, the models yielded prediction accuracy between about 84% and 97%, depending on the time window. Because all accuracy and error rates were computed on the test data set, these figures are objective indicators of model performance. Given new patient data of the same type, these models can produce moderately accurate predictions.

Second, the variable importance plots differ by time window and indicate that short-term risk of death depends heavily on one specific variable. In the predictions of risk of death within 1 month, 6 months, and 1 year, the patient disposition code played by far the most important role, with an importance score well above that of any other variable. This is most likely due to the “died” code of the variable. Although the “died” category accounts for only about 4% of the total deceased population, a large portion of the patients who died in the short term are recorded as died in this variable, and this affected the result.

As noted in the hospitalization data exploration, one of the patient disposition code categories is “died”, which directly indicates deceased status. The variable cannot be deleted, however, because its other categories provide important information for the prediction, and removing the “died” records would eliminate a large part of the deceased population. Nevertheless, the prediction models for short-term risk of death (1 month and 6 months), built with the same variable selection except for the patient disposition code, still reached accuracy rates of about 94% and 88%. This result suggests that more sophisticated modeling may require different feature selection for short-term risk of death prediction than for long-term prediction.

On the other hand, in the longer-term predictions, the major diagnosis category code and the primary diagnosis CCS code catch up with the patient disposition code. In the 5-year model specifically, the major diagnosis category code takes the top position. Variables that provide diagnosis code information clearly play an important role in the risk prediction.

Third, the age group variable does not stand out across the predictions, especially in the short-term models. Its importance score rises in the longer-term predictions, and it takes the top position in the prediction of risk of death over the total period. This result indicates that, for predicting short-term risk of death after hospitalization, other hospitalization data and individual baseline information are more important than age. Further research that subdivides the dataset by age block is expected to provide more accurate insight on this issue.

Lastly, and interestingly, in the longest-term risk predictions, menopausal status and a high-carbohydrate diet play as important a role as the diagnosis-related predictors. This suggests that they are influential factors related to death among women. Menopausal status may be related to age, however, and deeper research could reveal the association in detail.

One additional point that should be addressed is the relationship between the cause of death code and the diagnosis ICD-9 codes. This project had difficulty making use of these important variables. However, combining the cause of death code with the diagnosis ICD-9 codes, at a matching level of ICD code detail, is expected to open the possibility of building a more sophisticated machine learning prediction model.