This project is an external research study conducted for City of Hope by the USC Public Health Data Science department. The dataset analyzed combines California Teachers Study (CTS) survey records with OSHPD hospitalization records. CTS was founded in 1995 with 133,477 teachers, administrators, school nurses, and other members who agreed to share information about their health and behaviors with CTS researchers. This study uses hospitalizations of CTS participants from 2000 through 2015. The main objective is to predict the short-term risk of death after an in-patient hospitalization, using subject-specific factors such as baseline characteristics from the CTS questionnaires together with hospitalization information.
The initial dataset combines the hospitalization and survey data from CTS records, provided by City of Hope. To address the specific aims, several analysis tools and prediction methods are used. Data exploration covers all 172 variables of the raw dataset and includes summary statistics, graphical displays and plots, cleaning, and feature re-creation. Data cleaning is the most important part of the exploration: NAs and an appropriate number of invalid observations are removed to build accurate training data for machine learning. The cleaning process also includes mean imputation, grouping values into moderately sized categories, and re-creating variables in a more analyzable form. Logistic regression and Random Forest are used for feature evaluation and selection during exploration, to check whether each individual variable is an appropriate feature for the final model.
Random Forest is selected as the main machine learning method in this project. Regression methods such as logistic regression, Lasso, and ridge regression are also reasonable choices for building a prediction model. However, because the variables differ in their characteristics and the number of observations is much greater than the number of features, Random Forest is believed to be the better method for this study: it is flexible while keeping a moderate bias/variance balance. Its performance is also easy to evaluate, and it can rank the importance of the variables used in the model. Variable importance is measured by Mean Decrease Gini, which measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting Random Forest. The higher the Mean Decrease Gini score, the more important the variable is in the model.
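As a minimal sketch of the quantity behind Mean Decrease Gini, the base R snippet below computes the Gini impurity of a node and the impurity decrease produced by a single split. The function names (`gini_impurity`, `gini_decrease`) and the toy labels are illustrative assumptions, not part of the project code; a Random Forest implementation accumulates such decreases over every split made on a variable and averages them across trees.

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease from splitting a parent node into left/right children,
# with each child weighted by its share of the parent's observations.
gini_decrease <- function(labels, goes_left) {
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  n <- length(labels)
  gini_impurity(labels) -
    (length(left) / n) * gini_impurity(left) -
    (length(right) / n) * gini_impurity(right)
}

# Toy example (hypothetical data): 0 = alive, 1 = deceased.
deceased  <- c(0, 0, 0, 1, 1, 1, 1, 0)
long_stay <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)

parent_gini <- gini_impurity(deceased)
split_gain  <- gini_decrease(deceased, goes_left = long_stay)
```

A split that sends only deceased observations to one side yields a pure child node, so it produces a large decrease; variables that repeatedly produce such splits end up with high importance scores.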
Model performance is evaluated by the AUC (area under the curve) of the ROC curve, a graph showing the performance of a classification model at all classification thresholds. A confusion matrix shows the actual numbers of true/false positive and negative predictions. Prediction accuracy is another important checkpoint for model performance. Final performance is determined on the test set, while the training and validation sets are used for model training and confirmation.
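The three measures named above can be sketched in base R as follows. The labels and scores are hypothetical, the 0.5 threshold is an assumption, and `auc_rank` is an illustrative helper that uses the rank (Mann-Whitney) formulation of AUC rather than any particular package used in the project.

```r
# Hypothetical true labels (1 = deceased) and predicted probabilities.
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
score  <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70)

# Confusion matrix at a 0.5 threshold: rows = true class, columns = predicted.
predicted <- as.integer(score >= 0.5)
conf_mat <- table(True = factor(actual, levels = c(0, 1)),
                  Predicted = factor(predicted, levels = c(0, 1)))
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# AUC via the rank formulation: the probability that a randomly chosen
# positive receives a higher score than a randomly chosen negative.
auc_rank <- function(actual, score) {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(actual, score)
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why both are reported.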
The initial dataset has a total of 132538 observations with 172 variables. Broken down by participant key, it contains hospitalization and survey data for only 44494 unique participants. Although repeated records per participant carry a risk of duplicated data, all 132538 observations are kept in the analyses to avoid omitting important data on certain variables. Errors and invalid data are filtered out during the data cleaning process. Of the total observations, 64099 were alive and 68439 were deceased at the end of the eligible period from 2000 through 2015. Throughout the exploration, the target variable “deceased” is coded as “0” for alive and “1” for deceased so that the analysis and modeling tools operate seamlessly.
length(unique(combined_data$participant_key))
## [1] 44494
combined_data %>%
  count(deceased)
## deceased n
## 1 0 64099
## 2 1 68439
One of the primary objectives of the study is to assess whether the time window after hospitalization affects the risk of death. For this analysis, days were counted from admission so that all patients who died during or after hospitalization are included. A new variable (days_after_admission) was computed for each observation by subtracting the admission date from the date of death.
The results show that the time window after hospitalization follows a clear trend: the death count in the first month after admission is very high relative to the length of that window, compared with observations that have a longer time until death.
To predict the probability of death within a certain time window, a new binary variable (deceased or alive) was created from the days_after_admission variable for each time window (1mo, 6mo, 1yr, 3yr, and 5yr). It was used as the target variable in the prediction model for that time window.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 186 960 1418 2235 6741 64099
Below are the table and plots showing the number of deaths among all observations by time window.
## death_timewindow_after_admission n
## 1 in 1mo 7568
## 2 in 1mo to 6mo 9332
## 3 in 6mo to 1yr 5412
## 4 in 1yr to 3yr 14134
## 5 in 3yr to 5yr 10384
## 6 over 5yr 21525
## 7 <NA> 64099
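The derivation of days_after_admission and the per-window binary targets described above might look like the following base R sketch. The dates are hypothetical, the helper `died_within` is illustrative, and the window lengths in days (30-day months, 365-day years, cumulative windows) are assumptions, since the report does not state the exact cut-offs.

```r
# Hypothetical admission and death dates; an NA death date means still alive.
admission_date <- as.Date(c("2005-01-10", "2006-03-01", "2007-07-15", "2010-02-20"))
date_of_death  <- as.Date(c("2005-01-25", "2006-12-01", NA, "2015-06-30"))

# Days from admission to death (NA for survivors).
days_after_admission <- as.numeric(date_of_death - admission_date)

# Binary target for one window: 1 if death occurred within `window_days`.
died_within <- function(days, window_days) {
  as.integer(!is.na(days) & days <= window_days)
}

# Assumed window lengths in days.
windows <- c("1mo" = 30, "6mo" = 180, "1yr" = 365, "3yr" = 1095, "5yr" = 1825)

# One column of 0/1 targets per time window.
targets <- sapply(windows, function(w) died_within(days_after_admission, w))
```

Survivors and patients who died after the window both receive 0, so each column can serve directly as the target variable for the model of that window.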
Length of stay is another interesting variable. The mean length of stay is 4.7 days and the maximum is 1792 days. Despite the large gap between the mean and the maximum, the 1st and 3rd quartiles are only 2 days and 5 days. To make the survival rate easier to see, the variable was grouped into length-of-stay tiers (0, 1-2, 3-7, 8-30, and over 30 days). The bar plot shows that the survival rate generally decreases as length of stay increases. Interestingly, the 0-day group has a lower survival rate than the 1-2 day group. Logistic regression on the length-of-stay tiers showed a significant association with the target variable (deceased), so this variable was included in the final model as a predictor.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.713 5.000 1792.000
##
## 0 1-2 3-7 8-30 over 30
## 0 53.66 59.40 45.51 27.46 29.89
## 1 46.34 40.60 54.49 72.54 70.11
##
## 0 1-2 3-7 8-30 over 30
## 0 1.17 22.08 21.19 2.92 1.03
## 1 1.01 15.09 25.37 7.72 2.42
##
## Call:
## glm(formula = deceased ~ length_of_stay_tier, family = binomial,
## data = combined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6078 -1.2547 0.8013 1.1020 1.3427
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14664 0.03735 -3.926 8.63e-05 ***
## length_of_stay_tier1-2 -0.23387 0.03846 -6.081 1.19e-09 ***
## length_of_stay_tier3-7 0.32662 0.03821 8.547 < 2e-16 ***
## length_of_stay_tier8-30 1.11815 0.04185 26.719 < 2e-16 ***
## length_of_stay_tierover 30 0.99895 0.04940 20.223 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 183483 on 132453 degrees of freedom
## Residual deviance: 177628 on 132449 degrees of freedom
## AIC: 177638
##
## Number of Fisher Scoring iterations: 4
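Because length_of_stay_tier is the only predictor in this model, the fitted coefficients map directly back to the death proportions per tier. The sketch below copies the estimates from the output above and converts them with base R's `plogis` (the inverse logit); the vector names are illustrative.

```r
# Coefficient estimates copied from the glm output above.
intercept <- -0.14664               # reference tier: length of stay = 0 days
tier_effects <- c("1-2" = -0.23387, "3-7" = 0.32662,
                  "8-30" = 1.11815, "over 30" = 0.99895)

# Log-odds of death per tier, then inverse logit to get probabilities.
log_odds <- c("0" = intercept, intercept + tier_effects)
death_prob <- plogis(log_odds)

# Odds ratios relative to the 0-day reference tier.
odds_ratio <- exp(tier_effects)

# Percentages, comparable to the deceased row of the first table above.
death_pct <- round(100 * death_prob, 2)
```

The negative coefficient for the 1-2 day tier corresponds to an odds ratio below 1, which is the model's way of expressing that the 0-day group has a lower survival rate than the 1-2 day group.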
Age is a major confounding feature in most predictive analyses of healthcare data. For convenience, observations were grouped into six age groups (22-39, 40-49, 50-59, 60-69, 70-79, and over 80). As expected, the survival rate decreases as age increases, except that the 70-79 group’s survival rate was lower than that of the over 80 group. Age group was included in the final model as a predictor. Separate machine learning and prediction runs were also conducted for each age group for deeper analysis.
##
## 22-39 40-49 50-59 60-69 70-79 over 80
## 0 9253 11546 17677 17072 4627 3924
## 1 588 2455 7038 21561 27344 9369
##
## 22-39 40-49 50-59 60-69 70-79 over 80
## 0 94.02 82.47 71.52 44.19 14.47 29.52
## 1 5.98 17.53 28.48 55.81 85.53 70.48
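The percentage table above can be reproduced from the count table with `prop.table`, and raw ages can be binned with `cut`. In the sketch below the counts are copied from the output above, while the example ages and the exact bin edges (left-closed intervals, with 80 falling in "over 80") are assumptions.

```r
# Counts copied from the table above (rows: 0 = alive, 1 = deceased).
age_counts <- matrix(
  c(9253, 11546, 17677, 17072,  4627, 3924,
     588,  2455,  7038, 21561, 27344, 9369),
  nrow = 2, byrow = TRUE,
  dimnames = list(deceased = c("0", "1"),
                  age_group = c("22-39", "40-49", "50-59",
                                "60-69", "70-79", "over 80"))
)

# Column-wise percentages: share alive/deceased within each age group.
age_pct <- round(100 * prop.table(age_counts, margin = 2), 2)

# Binning raw ages into the same groups (hypothetical ages, assumed edges).
age <- c(25, 44, 58, 63, 77, 85)
age_group <- cut(age, breaks = c(22, 40, 50, 60, 70, 80, Inf), right = FALSE,
                 labels = colnames(age_counts))
```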
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2000-05-01" "2000-05-10" "2000-05-15" "2000-07-10" "2000-08-15" "2001-11-14"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2000-05-01" "2000-05-10" "2000-05-16" "2000-07-12" "2000-08-23" "2002-05-20"
The cause of death code is not suitable as a feature for the prediction model because it only has values for deceased patients. A model built on this variable would predict true positives very well, but false positives would also be very high. This happens because the model would be trained on a variable whose coded observations are all eventually deceased, so the model would be biased. It is still interesting to look at the top 15 causes of death, each of which has at least 1000 observations under its code. Observations with “0000”, “7777”, or NA codes were all recoded as “Alive” after eliminating errors such as alive observations with a cause of death code or deceased observations with NA values. As a result, a total of 64082 alive and 68355 deceased observations remain in the dataset.
This would have been the most important variable for the prediction model, but it has a critical flaw. Because this code is not fully comparable with the ICD-9 codes in other variables of the dataset, it cannot be used on its own or converted into a complete predictor. There are two reasons: (1) this variable only has an ICD-9 code for deceased observations, and (2) the primary diagnosis ICD-9 codes cannot be compared with the codes in this variable because the primary diagnosis codes are more finely subdivided. For instance, Alzheimer’s disease, the second largest cause of death in the dataset, is recorded as “G309” with 4522 cases in the cause of death code. In the primary diagnosis ICD-9 code, however, the same disease is subdivided into “3310”, “G309”, “G301”, “G300”, and “G308”; only 32 cases fall under “G309”, while “3310” has 1034 cases. For these reasons, the level of specificity of the ICD-9 codes must be synchronized before the ICD-9 related variables can be used for predictive analytics with machine learning.
## deceased n
## 1 0 64082
## 2 1 68355
## cause_of_death_cde cause_of_death_dsc n
## 1 I251 Atherosclerotic heart disease 5797
## 2 G309 Alzheimer's disease, unspecified 4522
## 3 J449 Chronic obstructive pulmonary disease, unspecified 3084
## 4 I500 Congestive heart failure 2546
## 5 C509 Malignant neoplasm of breast, unspecified 2517
## 6 I219 Acute myocardial infarction, unspecified 2446
## 7 I64 Stroke, not specified as hemorrhage or infarction 2446
## 8 C349 Malignant neoplasm of bronchus or lung, unspecified 2253
## 9 F03 Unspecified dementia 1813
## 10 J189 Pneumonia, unspecified 1767
## 11 C56 Malignant neoplasm of ovary 1513
## 12 I48 Atrial fibrillation and flutter 1151
## 13 C259 Malignant neoplasm of pancreas, unspecified 1133
## 14 I250 Atherosclerotic cardiovascular disease, so described 1110
## 15 C189 Malignant neoplasm of colon, unspecified 1058
## Predicted
## True 0 1
## 0 0 19189
## 1 2 20541
To confirm the above assumption, a new variable was created that combines the cause of death code for deceased patients with the primary diagnosis ICD-9 code for non-deceased patients. A Random Forest model was used to make predictions based solely on these ICD-9 codes. The result shows that the accuracy of the model is below 50%.
## is.na(diag_icd1) n
## 1 FALSE 132437
## Predicted
## True 0 1
## 0 19075 114
## 1 20543 0
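A minimal sketch of how such a combined code variable could be built, and how the accuracy follows from the confusion matrix directly above, is given below. The four records and the name `combined_icd` are hypothetical; the column names follow those appearing in this report, and the matrix counts are copied from the output above.

```r
# Hypothetical records; codes are for illustration only.
records <- data.frame(
  deceased           = c(1, 0, 1, 0),
  cause_of_death_cde = c("G309", NA, "I251", NA),
  diag_icd1          = c("3310", "41401", "4280", "486"),
  stringsAsFactors   = FALSE
)

# Cause of death code for deceased records, primary diagnosis code otherwise.
records$combined_icd <- ifelse(records$deceased == 1,
                               records$cause_of_death_cde,
                               records$diag_icd1)

# Accuracy implied by the confusion matrix above
# (rows = true class, columns = predicted class).
icd_only <- matrix(c(19075, 114, 20543, 0), nrow = 2, byrow = TRUE,
                   dimnames = list(True = c("0", "1"), Predicted = c("0", "1")))
icd_accuracy <- sum(diag(icd_only)) / sum(icd_only)
```

With no deceased observation predicted correctly, the share of correct predictions comes entirely from the alive class, which is why the accuracy stays under one half.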
The major diagnosis category is one of the most important diagnosis variables for the prediction model. The death rate trend by diagnosis category is shown in the bar plot. Logistic regression shows a significant association with the target variable, and the Random Forest prediction model gives an accuracy well above 50%.
00 = Ungroupable
01 = Nervous System, Diseases & Disorders
02 = Eye, Diseases & Disorders
03 = Ear, Nose, Mouth, & Throat, Diseases & Disorders
04 = Respiratory System, Diseases & Disorders
05 = Circulatory System, Diseases & Disorders
06 = Digestive System, Diseases & Disorders
07 = Hepatobiliary System & Pancreas, Diseases & Disorders
08 = Musculoskeletal System & Connective Tissue, Diseases & Disorders
09 = Skin, Subcutaneous Tissue & Breast, Diseases & Disorders
10 = Endocrine, Nutritional, and Metabolic, Diseases & Disorders
11 = Kidney and Urinary Tract, Diseases & Disorders
12 = Male Reproductive System, Diseases & Disorders
13 = Female Reproductive System, Diseases & Disorders
14 = Pregnancy, Childbirth, & The Puerperium
15 = Newborns and Neonate Conditions Began in Perinatal Period
16 = Blood, Blood Forming Organs, Immunological, Diseases & Disorders
17 = Myeloproliferative Diseases & Poorly Differentiated Neoplasms
18 = Infectious & Parasitic Diseases
19 = Mental Diseases & Disorders
20 = Alcohol-Drug Use and Alcohol-Drug Induced Organic Mental Diseases
21 = Injuries, Poisonings, and Toxic Effects of Drugs
22 = Burns
23 = Factors on Health Status & Other Contacts with Health Services
24 = Multiple Significant Trauma
25 = Human Immunodeficiency Virus Infections
## major_diag_cat_cde n
## 1 0 3907
## 2 1 9104
## 3 2 147
## 4 3 1049
## 5 4 10892
## 6 5 20532
## 7 6 14235
## 8 7 3331
## 9 8 28128
## 10 9 4546
## 11 10 4500
## 12 11 5150
## 13 13 6409
## 14 14 3371
## 15 15 1
## 16 16 1413
## 17 17 1593
## 18 18 5379
## 19 19 1690
## 20 20 313
## 21 21 1296
## 22 22 23
## 23 23 5207
## 24 24 214
## 25 25 7
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 0 1863 3221 79 564 2734 7790 7133 1719 17892 2435 2248 1553
## 1 2044 5883 68 485 8158 12742 7102 1612 10236 2111 2252 3597
##
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0 5155 3338 0 363 378 1599 955 185 667 12 2097 102
## 1 1254 33 1 1050 1215 3780 735 128 629 11 3110 112
##
## 25
## 0 0
## 1 7
##
## Call:
## glm(formula = deceased ~ major_diag_cat_cde, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6962 -1.1174 0.7603 0.9768 3.0419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60237 0.02192 27.482 < 2e-16 ***
## major_diag_cat_cde2 -0.75232 0.16687 -4.508 6.53e-06 ***
## major_diag_cat_cde3 -0.75328 0.06569 -11.467 < 2e-16 ***
## major_diag_cat_cde4 0.49086 0.03113 15.770 < 2e-16 ***
## major_diag_cat_cde5 -0.11031 0.02622 -4.208 2.58e-05 ***
## major_diag_cat_cde6 -0.60673 0.02759 -21.988 < 2e-16 ***
## major_diag_cat_cde7 -0.66664 0.04102 -16.252 < 2e-16 ***
## major_diag_cat_cde8 -1.16082 0.02518 -46.101 < 2e-16 ***
## major_diag_cat_cde9 -0.74516 0.03694 -20.170 < 2e-16 ***
## major_diag_cat_cde10 -0.60060 0.03700 -16.230 < 2e-16 ***
## major_diag_cat_cde11 0.23754 0.03745 6.343 2.25e-10 ***
## major_diag_cat_cde13 -2.01600 0.03837 -52.548 < 2e-16 ***
## major_diag_cat_cde14 -5.21899 0.17630 -29.602 < 2e-16 ***
## major_diag_cat_cde15 9.96365 119.46804 0.083 0.933533
## major_diag_cat_cde16 0.45977 0.06471 7.105 1.20e-12 ***
## major_diag_cat_cde17 0.56523 0.06284 8.995 < 2e-16 ***
## major_diag_cat_cde18 0.25797 0.03702 6.969 3.20e-12 ***
## major_diag_cat_cde19 -0.86422 0.05374 -16.081 < 2e-16 ***
## major_diag_cat_cde20 -0.97070 0.11704 -8.294 < 2e-16 ***
## major_diag_cat_cde21 -0.66103 0.05975 -11.064 < 2e-16 ***
## major_diag_cat_cde22 -0.68939 0.41800 -1.649 0.099095 .
## major_diag_cat_cde23 -0.20826 0.03576 -5.824 5.76e-09 ***
## major_diag_cat_cde24 -0.50885 0.13861 -3.671 0.000242 ***
## major_diag_cat_cde25 9.96365 45.15468 0.221 0.825360
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178050 on 128529 degrees of freedom
## Residual deviance: 161559 on 128506 degrees of freedom
## AIC: 161607
##
## Number of Fisher Scoring iterations: 9
## Predicted
## True 0 1
## 0 9919 8727
## 1 5205 14708
To confirm that the cause of death code is not a good feature for the prediction model, two models were compared: one trained with all major diagnosis variables plus the cause of death code, and one trained without the cause of death code.
Although the model with the cause of death code appears to predict better, the result is flawed: it has a higher false positive rate alongside a very small false negative rate. This is again because the variable only has values for deceased patients. In the variable importance plot, the importance score of this single variable (cause of death code) also far exceeds that of the other variables, which does not make sense given that all three variables used in training are similar diagnosis codes. This result confirms that the cause of death code should not be included in the model.
## Predicted
## True 0 1
## 0 8048 10536
## 1 40 19935
## Predicted
## True 0 1
## 0 10689 7895
## 1 4036 15939
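The comparison above can be made concrete by computing accuracy, false positive rate, and false negative rate from the two confusion matrices. The counts below are copied from the output above; treating the first matrix as the model with the cause of death code is an assumption based on its very small false negative count, and `confusion_rates` is an illustrative helper.

```r
# Summary rates from a 2x2 confusion matrix (rows = true, columns = predicted).
confusion_rates <- function(m) {
  tn <- m["0", "0"]; fp <- m["0", "1"]
  fn <- m["1", "0"]; tp <- m["1", "1"]
  c(accuracy = (tn + tp) / sum(m),
    false_positive_rate = fp / (fp + tn),
    false_negative_rate = fn / (fn + tp))
}

labels <- list(True = c("0", "1"), Predicted = c("0", "1"))

# Counts copied from the two confusion matrices above.
with_cod    <- matrix(c(8048, 10536, 40, 19935), nrow = 2, byrow = TRUE,
                      dimnames = labels)
without_cod <- matrix(c(10689, 7895, 4036, 15939), nrow = 2, byrow = TRUE,
                      dimnames = labels)

rates <- round(rbind(with_cod = confusion_rates(with_cod),
                     without_cod = confusion_rates(without_cod)), 3)
```

Looking at the error rates side by side, rather than accuracy alone, shows the pattern described above: the model with the cause of death code gains accuracy mainly by labelling most observations as deceased.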
One of the initial objectives of the project is to assess whether certain co-morbidities play a role in predicting the risk of death. To see the trend, the top 5 CCS codes for the secondary through fifth diagnoses were counted within each time-window group of deceased observations.
Comparing the co-morbidities of deceased observations shows small differences in trend between time windows, but nothing dramatic. Certain co-morbidities in each time-window group may therefore play a role in predicting the risk of death if the model is built on these predictors. In this project, however, co-morbidities were excluded from the final feature selection because they have a large number of missing values, which would eliminate too many observations and harm predictive capability.
Apart from missing values, essential hypertension and cardiac dysrhythmias are the most frequent co-morbidities in the total dataset.
## diag_ccs_code2 diag_ccs2 n
## 1 NA <NA> 7440
## 2 98 Essential hypertension 6829
## 3 106 Cardiac dysrhythmias 5892
## 4 159 Urinary tract infections 4730
## 5 55 Fluid and electrolyte disorders 4494
## diag_ccs_code3 diag_ccs3 n
## 1 NA <NA> 13953
## 2 98 Essential hypertension 9733
## 3 106 Cardiac dysrhythmias 5351
## 4 55 Fluid and electrolyte disorders 4304
## 5 53 Disorders of lipid metabolism 3364
## diag_ccs_code4 diag_ccs4 n
## 1 NA <NA> 22440
## 2 98 Essential hypertension 9278
## 3 53 Disorders of lipid metabolism 4574
## 4 106 Cardiac dysrhythmias 4018
## 5 55 Fluid and electrolyte disorders 3731
## diag_ccs_code5 diag_ccs5 n
## 1 NA <NA> 32123
## 2 98 Essential hypertension 7661
## 3 53 Disorders of lipid metabolism 4763
## 4 48 Thyroid disorders 3717
## 5 106 Cardiac dysrhythmias 3206
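Counting the most frequent values of a diagnosis column, with missing values kept as their own category as in the tables above, can be sketched in base R as follows. The input vector is hypothetical and `top_n_counts` is an illustrative helper, not the code used to produce the tables.

```r
# Hypothetical secondary-diagnosis CCS labels, with NA for missing values.
diag_ccs2 <- c("Essential hypertension", "Cardiac dysrhythmias", NA,
               "Essential hypertension", "Urinary tract infections", NA,
               "Essential hypertension", "Cardiac dysrhythmias", NA, NA)

# Count every value (keeping NA as its own category), sort, and keep the top n.
top_n_counts <- function(x, n = 5) {
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  head(counts, n)
}

top_codes <- top_n_counts(diag_ccs2, n = 3)
```

Keeping NA in the count makes the extent of missingness visible next to the real codes, which matters for the feature selection decision discussed above.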
For observations deceased within 1 month after admission, respiratory failure, secondary malignancies, pneumonia, and congestive heart failure were the most frequent co-morbidities.
## diag_ccs_code2
## 1 131
## 2 122
## 3 42
## 4 108
## 5 157
## diag_ccs2
## 1 Respiratory failure; insufficiency; arrest (adult)
## 2 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 3 Secondary malignancies
## 4 Congestive heart failure; nonhypertensive
## 5 Acute and unspecified renal failure
## n
## 1 725
## 2 484
## 3 439
## 4 391
## 5 338
## diag_ccs_code3 diag_ccs3 n
## 1 42 Secondary malignancies 453
## 2 131 Respiratory failure; insufficiency; arrest (adult) 453
## 3 55 Fluid and electrolyte disorders 389
## 4 106 Cardiac dysrhythmias 387
## 5 108 Congestive heart failure; nonhypertensive 360
## diag_ccs_code4 diag_ccs4 n
## 1 55 Fluid and electrolyte disorders 417
## 2 NA <NA> 416
## 3 42 Secondary malignancies 365
## 4 108 Congestive heart failure; nonhypertensive 358
## 5 106 Cardiac dysrhythmias 336
## diag_ccs_code5 diag_ccs5 n
## 1 NA <NA> 564
## 2 55 Fluid and electrolyte disorders 426
## 3 106 Cardiac dysrhythmias 320
## 4 108 Congestive heart failure; nonhypertensive 284
## 5 98 Essential hypertension 252
For observations deceased within 1 year after admission, respiratory failure, secondary malignancies, pneumonia, and congestive heart failure were the most frequent co-morbidities.
## diag_ccs_code2
## 1 42
## 2 108
## 3 131
## 4 122
## 5 106
## diag_ccs2
## 1 Secondary malignancies
## 2 Congestive heart failure; nonhypertensive
## 3 Respiratory failure; insufficiency; arrest (adult)
## 4 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 5 Cardiac dysrhythmias
## n
## 1 1421
## 2 1254
## 3 1177
## 4 1135
## 5 988
## diag_ccs_code3 diag_ccs3 n
## 1 42 Secondary malignancies 1364
## 2 106 Cardiac dysrhythmias 1153
## 3 55 Fluid and electrolyte disorders 1134
## 4 108 Congestive heart failure; nonhypertensive 1055
## 5 NA <NA> 973
## diag_ccs_code4 diag_ccs4 n
## 1 NA <NA> 1329
## 2 55 Fluid and electrolyte disorders 1108
## 3 42 Secondary malignancies 1032
## 4 106 Cardiac dysrhythmias 1026
## 5 98 Essential hypertension 955
## diag_ccs_code5 diag_ccs5 n
## 1 NA <NA> 1876
## 2 98 Essential hypertension 1062
## 3 55 Fluid and electrolyte disorders 1000
## 4 106 Cardiac dysrhythmias 912
## 5 108 Congestive heart failure; nonhypertensive 753
For observations deceased within 5 years after admission, congestive heart failure, cardiac dysrhythmias, and essential hypertension were the most frequent co-morbidities.
## diag_ccs_code2 diag_ccs2 n
## 1 108 Congestive heart failure; nonhypertensive 2619
## 2 106 Cardiac dysrhythmias 2542
## 3 159 Urinary tract infections 2303
## 4 42 Secondary malignancies 2157
## 5 55 Fluid and electrolyte disorders 1946
## diag_ccs_code3 diag_ccs3 n
## 1 106 Cardiac dysrhythmias 2678
## 2 NA <NA> 2361
## 3 55 Fluid and electrolyte disorders 2233
## 4 108 Congestive heart failure; nonhypertensive 2183
## 5 98 Essential hypertension 2063
## diag_ccs_code4 diag_ccs4 n
## 1 NA <NA> 3362
## 2 98 Essential hypertension 2613
## 3 106 Cardiac dysrhythmias 2203
## 4 55 Fluid and electrolyte disorders 2044
## 5 108 Congestive heart failure; nonhypertensive 1850
## diag_ccs_code5 diag_ccs5 n
## 1 NA <NA> 4875
## 2 98 Essential hypertension 2748
## 3 106 Cardiac dysrhythmias 1867
## 4 55 Fluid and electrolyte disorders 1742
## 5 108 Congestive heart failure; nonhypertensive 1421
The primary diagnosis CCS code is a good feature for training a model, but the secondary and lower-ranked CCS code variables include too many missing values, which would shrink the eligible data if they were included in the model. The same applies to all procedure CCS codes, as they also have too many missing values.
## diag_ccs1
## 1 Osteoarthritis
## 2 Septicemia (except in labor)
## 3 Cardiac dysrhythmias
## 4 Rehabilitation care; fitting of prostheses; and adjustment of devices
## 5 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 6 Congestive heart failure; nonhypertensive
## 7 Spondylosis; intervertebral disc disorders; other back problems
## 8 Fracture of neck of femur (hip)
## 9 Acute cerebrovascular disease
## 10 Undefined
## 11 Complication of device; implant or graft
## 12 Nonspecific chest pain
## 13 Urinary tract infections
## 14 Intestinal obstruction without hernia
## 15 Coronary atherosclerosis and other heart disease
## n
## 1 12131
## 2 4359
## 3 4317
## 4 4229
## 5 3842
## 6 3453
## 7 3315
## 8 3293
## 9 3247
## 10 3159
## 11 2695
## 12 2674
## 13 2531
## 14 2322
## 15 2279
## is.na(diag_ccs_code1) n
## 1 FALSE 128530
## is.na(diag_ccs_code2) n
## 1 FALSE 121090
## 2 TRUE 7440
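The effect of missing values on the amount of usable data can be checked before a variable is admitted to the model. The sketch below uses a small hypothetical data frame (`diag_data`) with the same column naming pattern as the report and an illustrative helper `rows_kept`; it assumes listwise deletion, which is what `glm` does by default.

```r
# Hypothetical slice of diagnosis columns with increasing missingness.
diag_data <- data.frame(
  diag_ccs_code1 = c(203, 2, 106, 254, 122, 108),
  diag_ccs_code2 = c(98, NA, 106, 159, NA, 55),
  diag_ccs_code3 = c(NA, NA, 98, NA, NA, 53)
)

# Share of missing values per column.
missing_share <- colMeans(is.na(diag_data))

# Rows that survive listwise deletion for a given set of predictors.
rows_kept <- function(data, predictors) {
  sum(complete.cases(data[, predictors, drop = FALSE]))
}

kept_primary_only <- rows_kept(diag_data, "diag_ccs_code1")
kept_all_three    <- rows_kept(diag_data, names(diag_data))
```

Each additional sparsely populated column can only reduce the number of complete rows, which is the shrinkage the paragraph above warns about.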
Not included in the final model: admission_typ shows no consistent significant association with the target variable, as only one of its levels (admission_typ1) is significant.
## admission_typ n
## 1 0 10
## 2 1 45832
## 3 2 82615
## 4 3 4
## 5 4 69
##
## Call:
## glm(formula = deceased ~ admission_typ, family = binomial, data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7941 -0.9014 0.9831 0.9831 1.4812
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3863 0.7906 1.754 0.07951 .
## admission_typ1 -2.0770 0.7906 -2.627 0.00861 **
## admission_typ2 -0.9104 0.7906 -1.152 0.24951
## admission_typ3 -1.3863 1.2748 -1.087 0.27682
## admission_typ4 -0.8210 0.8293 -0.990 0.32219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178050 on 128529 degrees of freedom
## Residual deviance: 168466 on 128525 degrees of freedom
## AIC: 168476
##
## Number of Fisher Scoring iterations: 4
The source of admission code (src_admission_cde) does not show a significant association overall: only a handful of codes show a good association, whereas some codes covering a large share of patients do not.
##
## Call:
## glm(formula = deceased ~ src_admission_cde, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8405 -1.2531 0.4546 1.0041 1.7941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.680e-10 1.000e+00 0.000 1.000000
## src_admission_cde010 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde011 1.335e-01 1.126e+00 0.119 0.905600
## src_admission_cde012 1.157e+01 1.393e+02 0.083 0.933819
## src_admission_cde013 1.763e-01 1.000e+00 0.176 0.860046
## src_admission_cde023 4.016e+00 1.120e+00 3.586 0.000336 ***
## src_admission_cde031 -1.811e-01 1.006e+00 -0.180 0.857175
## src_admission_cde032 1.157e+01 1.393e+02 0.083 0.933819
## src_admission_cde033 2.469e-01 1.035e+00 0.239 0.811486
## src_admission_cde041 1.977e+00 1.061e+00 1.863 0.062484 .
## src_admission_cde042 2.351e+00 1.129e+00 2.083 0.037216 *
## src_admission_cde043 3.169e+00 1.056e+00 3.000 0.002701 **
## src_admission_cde051 1.301e+00 1.002e+00 1.298 0.194451
## src_admission_cde052 8.849e-01 1.003e+00 0.883 0.377502
## src_admission_cde053 1.157e+01 9.849e+01 0.117 0.906515
## src_admission_cde061 8.183e-01 1.047e+00 0.782 0.434436
## src_admission_cde062 8.473e-01 1.058e+00 0.801 0.423154
## src_admission_cde063 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde083 -1.157e+01 1.970e+02 -0.059 0.953175
## src_admission_cde091 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde092 1.157e+01 8.042e+01 0.144 0.885639
## src_admission_cde093 6.242e-01 1.050e+00 0.594 0.552201
## src_admission_cde100 -1.157e+01 1.137e+02 -0.102 0.918992
## src_admission_cde111 1.099e+00 1.528e+00 0.719 0.472011
## src_admission_cde112 -6.931e-01 1.323e+00 -0.524 0.600299
## src_admission_cde121 1.946e+00 1.464e+00 1.329 0.183746
## src_admission_cde122 -2.683e-10 1.155e+00 0.000 1.000000
## src_admission_cde130 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde131 4.224e-01 1.000e+00 0.422 0.672777
## src_admission_cde132 -7.991e-01 1.000e+00 -0.799 0.424258
## src_admission_cde1X3 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde211 1.157e+01 1.137e+02 0.102 0.918992
## src_admission_cde221 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde222 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde231 2.218e+00 1.003e+00 2.212 0.026970 *
## src_admission_cde232 1.792e+00 1.016e+00 1.764 0.077705 .
## src_admission_cde311 2.683e-01 1.033e+00 0.260 0.795173
## src_admission_cde312 -5.199e-01 1.003e+00 -0.518 0.604157
## src_admission_cde321 -1.157e+01 1.393e+02 -0.083 0.933819
## src_admission_cde322 -4.700e-01 1.151e+00 -0.408 0.683044
## src_admission_cde331 -9.808e-01 1.208e+00 -0.812 0.416675
## src_admission_cde332 -8.899e-01 1.015e+00 -0.877 0.380748
## src_admission_cde411 1.897e+00 1.092e+00 1.738 0.082234 .
## src_admission_cde412 1.145e+00 1.016e+00 1.128 0.259500
## src_admission_cde421 2.090e+00 1.023e+00 2.043 0.041099 *
## src_admission_cde422 1.369e+00 1.041e+00 1.316 0.188312
## src_admission_cde431 2.058e+00 1.003e+00 2.053 0.040099 *
## src_admission_cde432 1.676e+00 1.011e+00 1.658 0.097405 .
## src_admission_cde4X3 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde500 -1.157e+01 1.970e+02 -0.059 0.953175
## src_admission_cde511 5.108e-01 1.125e+00 0.454 0.649915
## src_admission_cde512 5.240e-01 1.001e+00 0.524 0.600476
## src_admission_cde521 1.823e-01 1.014e+00 0.180 0.857307
## src_admission_cde522 3.589e-01 1.001e+00 0.359 0.719828
## src_admission_cde531 1.157e+01 1.137e+02 0.102 0.918992
## src_admission_cde532 -5.108e-01 1.125e+00 -0.454 0.649915
## src_admission_cde611 1.386e+00 1.275e+00 1.087 0.276816
## src_admission_cde612 8.220e-01 1.018e+00 0.808 0.419334
## src_admission_cde621 9.886e-01 1.042e+00 0.949 0.342739
## src_admission_cde622 1.289e-01 1.009e+00 0.128 0.898278
## src_admission_cde631 1.157e+01 1.393e+02 0.083 0.933819
## src_admission_cde632 -1.386e+00 1.500e+00 -0.924 0.355384
## src_admission_cde731 6.931e-01 1.581e+00 0.438 0.661107
## src_admission_cde831 1.099e+00 1.528e+00 0.719 0.472011
## src_admission_cde832 -1.157e+01 1.393e+02 -0.083 0.933819
## src_admission_cde902 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde911 1.157e+01 1.970e+02 0.059 0.953175
## src_admission_cde912 9.163e-01 1.304e+00 0.703 0.482204
## src_admission_cde921 5.557e-02 1.009e+00 0.055 0.956089
## src_admission_cde922 8.552e-02 1.003e+00 0.085 0.932039
## src_admission_cde931 9.607e-01 1.016e+00 0.946 0.344151
## src_admission_cde932 -7.020e-02 1.017e+00 -0.069 0.944988
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178050 on 128529 degrees of freedom
## Residual deviance: 164816 on 128458 degrees of freedom
## AIC: 164960
##
## Number of Fisher Scoring iterations: 10
##
## Call:
## glm(formula = deceased ~ src_site_cde, family = binomial, data = combined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.929 -1.026 -1.026 1.337 1.665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6931 1.2247 -0.566 0.5714
## src_site_cde1 0.3257 1.2248 0.266 0.7903
## src_site_cde2 2.3844 1.2271 1.943 0.0520 .
## src_site_cde3 -0.2829 1.2319 -0.230 0.8184
## src_site_cde4 2.2285 1.2266 1.817 0.0693 .
## src_site_cde5 0.7074 1.2253 0.577 0.5637
## src_site_cde6 0.8169 1.2307 0.664 0.5068
## src_site_cde8 -0.4055 1.6832 -0.241 0.8096
## src_site_cde9 0.7526 1.2265 0.614 0.5395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66401 on 48377 degrees of freedom
## Residual deviance: 64275 on 48369 degrees of freedom
## (84076 observations deleted due to missingness)
## AIC: 64293
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = deceased ~ src_licensure_cde, family = binomial,
## data = combined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.205 -1.068 -1.068 1.291 1.893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.609 1.095 -1.469 0.142
## src_licensure_cde1 1.577 1.097 1.438 0.150
## src_licensure_cde2 1.673 1.096 1.527 0.127
## src_licensure_cde3 1.347 1.095 1.229 0.219
## src_licensure_cdeX 11.175 51.251 0.218 0.827
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66401 on 48377 degrees of freedom
## Residual deviance: 66307 on 48373 degrees of freedom
## (84076 observations deleted due to missingness)
## AIC: 66317
##
## Number of Fisher Scoring iterations: 8
##
## Call:
## glm(formula = deceased ~ src_route_cde, family = binomial, data = combined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2910 -0.8057 -0.8057 1.0679 1.6019
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.91629 0.83666 -1.095 0.273
## src_route_cde1 1.17941 0.83675 1.410 0.159
## src_route_cde2 -0.04211 0.83680 -0.050 0.960
## src_route_cde3 10.48225 51.24581 0.205 0.838
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66401 on 48377 degrees of freedom
## Residual deviance: 62310 on 48374 degrees of freedom
## (84076 observations deleted due to missingness)
## AIC: 62318
##
## Number of Fisher Scoring iterations: 8
Only groups 1, 2, 3, and 4 have a meaningful number of patients. Group 0 (invalid patients) was dropped from the data to see the association more clearly; the payer category (payer_cat_cde) then shows a significant association.
##
## 0 1 2 3 4 5 6 7 8 9
## 0 6 33897 148 26787 589 17 156 14 308 297
## 1 7 57335 135 8240 74 15 69 9 199 228
##
## 0 1 2 3 4 5 6 7 8 9
## 0 46.15 37.15 52.30 76.48 88.84 53.12 69.33 60.87 60.75 56.57
## 1 53.85 62.85 47.70 23.52 11.16 46.88 30.67 39.13 39.25 43.43
##
## Call:
## glm(formula = deceased ~ payer_cat_cde, family = binomial, data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4072 -1.4072 0.9638 0.9638 2.0941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.525585 0.006851 76.711 < 2e-16 ***
## payer_cat_cde2 -0.617522 0.119210 -5.180 2.22e-07 ***
## payer_cat_cde3 -1.704501 0.014340 -118.864 < 2e-16 ***
## payer_cat_cde4 -2.599946 0.123521 -21.049 < 2e-16 ***
## payer_cat_cde5 -0.650748 0.354312 -1.837 0.0663 .
## payer_cat_cde6 -1.341334 0.144741 -9.267 < 2e-16 ***
## payer_cat_cde7 -0.967417 0.427302 -2.264 0.0236 *
## payer_cat_cde8 -0.962380 0.091208 -10.552 < 2e-16 ***
## payer_cat_cde9 -0.789971 0.088317 -8.945 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178032 on 128516 degrees of freedom
## Residual deviance: 161208 on 128508 degrees of freedom
## AIC: 161226
##
## Number of Fisher Scoring iterations: 4
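The coefficients in the payer_cat_cde summary above are on the log-odds scale, which makes them hard to compare with the percentage table printed before the model. The following is a minimal Python sketch (not part of the original analysis) that converts the intercept and the category 3 coefficient back to probabilities and checks them against the raw counts. The helper name inv_logit is an illustrative choice; the numbers are copied from the output above.

```python
import math


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Coefficients copied from the payer_cat_cde model summary above.
intercept = 0.525585    # reference level: payer category 1
coef_cat3 = -1.704501   # payer category 3, relative to category 1

# Fitted probability of deceased = 1 for each category.
p_cat1 = inv_logit(intercept)
p_cat3 = inv_logit(intercept + coef_cat3)

# Odds of death in category 3 relative to category 1.
odds_ratio_cat3 = math.exp(coef_cat3)

# Observed proportions from the count table above (row 1 = deceased).
obs_cat1 = 57335 / (33897 + 57335)
obs_cat3 = 8240 / (26787 + 8240)

print(f"category 1: fitted {p_cat1:.4f}, observed {obs_cat1:.4f}")
print(f"category 3: fitted {p_cat3:.4f}, observed {obs_cat3:.4f}")
print(f"odds ratio, category 3 vs 1: {odds_ratio_cat3:.3f}")
```

Because the model has a single categorical predictor, the fitted probabilities reproduce the observed column percentages, so the table and the coefficients carry the same information in two different scales.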
It shows a significant association.
## payer_coverage_typ n
## 1 0 1077
## 2 1 51331
## 3 2 14401
## 4 3 61708
##
## Call:
## glm(formula = deceased ~ payer_coverage_typ, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3833 -1.1393 0.9846 0.9846 1.7089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.34316 0.06184 -5.549 2.87e-08 ***
## payer_coverage_typ1 0.25290 0.06247 4.048 5.16e-05 ***
## payer_coverage_typ2 -0.85273 0.06491 -13.136 < 2e-16 ***
## payer_coverage_typ3 0.81528 0.06239 13.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178032 on 128516 degrees of freedom
## Residual deviance: 170326 on 128513 degrees of freedom
## AIC: 170334
##
## Number of Fisher Scoring iterations: 4
It shows a significant association. The cleaned data shows a less significant association, presumably because eliminating patients with 0 total charges removes too many patients who are healthy enough not to receive serious treatment.
## [1] 0 10988 61866 0 18062 7265 19475 20565 35509 18927 25161 21571
## [13] 8098 17592 37395 73502 35285 18356 8812 0
## [1] 3270166
##
## Call:
## glm(formula = deceased ~ total_charges_amt, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.685 -1.201 1.124 1.154 1.158
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.539e-02 6.567e-03 6.912 4.78e-12 ***
## total_charges_amt 4.191e-07 7.945e-08 5.276 1.32e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178032 on 128516 degrees of freedom
## Residual deviance: 178004 on 128515 degrees of freedom
## AIC: 178008
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = deceased ~ total_charges_amt, family = binomial,
## data = charges_cleaned_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.755 -1.200 1.117 1.156 1.162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.539e-02 7.776e-03 4.551 5.33e-06 ***
## total_charges_amt 4.831e-07 8.421e-08 5.737 9.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141491 on 102134 degrees of freedom
## Residual deviance: 141457 on 102133 degrees of freedom
## AIC: 141461
##
## Number of Fisher Scoring iterations: 3
It shows a significant association.
## patient_care_typ n
## 1 0 2
## 2 1 120435
## 3 3 4027
## 4 4 1492
## 5 5 93
## 6 6 2468
##
## Call:
## glm(formula = deceased ~ patient_care_typ, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5593 -1.1955 0.8387 1.1594 1.8762
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.042602 0.005764 7.391 1.46e-13 ***
## patient_care_typ3 0.821425 0.034982 23.481 < 2e-16 ***
## patient_care_typ4 -0.486760 0.053372 -9.120 < 2e-16 ***
## patient_care_typ5 -1.613819 0.274809 -5.873 4.29e-09 ***
## patient_care_typ6 0.182000 0.040921 4.448 8.68e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178029 on 128514 degrees of freedom
## Residual deviance: 177272 on 128510 degrees of freedom
## AIC: 177282
##
## Number of Fisher Scoring iterations: 4
It shows a significant association. Although one of the code categories is “died”, which directly indicates deceased status, neither the variable nor the “died” category is eliminated, because the rest of the variable provides important information for the prediction and a large share of the deceased population should not be removed. Fortunately, the “died” category accounts for only around 4% of the total deceased population.
01 = Routine (home)
02 = Acute Care within the admitting hospital
03 = Other Care within the admitting hospital
04 = Skilled Nursing / Intermediate Care (SN/IC) within the admitting hospital
05 = Acute Care at another hospital
06 = Other Care (not SN/IC) at another hospital
07 = Skilled Nursing / Intermediate Care (SN/IC) at another facility
08 = Residential Care Facility
09 = Prison/Jail
10 = Left Against Medical Advice
11 = Died
12 = Home Health Service
13 = Other
00 = Invalid/Blank
## patient_disposition_cde n
## 1 0 12
## 2 1 72120
## 3 2 538
## 4 3 2909
## 5 4 3334
## 6 5 2266
## 7 6 3377
## 8 7 17678
## 9 8 1705
## 10 9 3
## 11 10 268
## 12 11 3270
## 13 12 19916
## 14 13 382
## 15 20 228
## 16 21 1
## 17 50 194
## 18 51 45
## 19 61 10
## 20 62 148
## 21 63 48
## 22 64 12
## 23 65 11
## 24 70 13
## 25 81 1
## 26 83 3
## 27 86 1
## 28 89 1
## 29 93 1
## 30 99 20
##
## Call:
## glm(formula = deceased ~ patient_disposition_cde, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6623 -1.0129 0.0495 1.1084 1.7682
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.400095 0.007597 -52.666 < 2e-16 ***
## patient_disposition_cde2 1.341240 0.096251 13.935 < 2e-16 ***
## patient_disposition_cde3 0.562015 0.037971 14.801 < 2e-16 ***
## patient_disposition_cde4 1.315756 0.039077 33.671 < 2e-16 ***
## patient_disposition_cde5 0.984588 0.044475 22.138 < 2e-16 ***
## patient_disposition_cde6 0.179483 0.035449 5.063 4.13e-07 ***
## patient_disposition_cde7 1.419269 0.018655 76.080 < 2e-16 ***
## patient_disposition_cde8 2.357449 0.073935 31.885 < 2e-16 ***
## patient_disposition_cde9 -0.293052 1.224768 -0.239 0.8109
## patient_disposition_cde10 0.564644 0.122818 4.597 4.28e-06 ***
## patient_disposition_cde11 7.105122 0.500353 14.200 < 2e-16 ***
## patient_disposition_cde12 0.668209 0.016192 41.267 < 2e-16 ***
## patient_disposition_cde13 2.120395 0.142778 14.851 < 2e-16 ***
## patient_disposition_cde20 5.825045 1.002229 5.812 6.17e-09 ***
## patient_disposition_cde21 -10.165933 119.468043 -0.085 0.9322
## patient_disposition_cde50 2.620442 0.241668 10.843 < 2e-16 ***
## patient_disposition_cde51 4.184285 1.011328 4.137 3.51e-05 ***
## patient_disposition_cde61 -0.447203 0.690107 -0.648 0.5170
## patient_disposition_cde62 -0.928092 0.202145 -4.591 4.41e-06 ***
## patient_disposition_cde63 1.613118 0.343502 4.696 2.65e-06 ***
## patient_disposition_cde64 0.736567 0.585589 1.258 0.2085
## patient_disposition_cde65 0.582417 0.605578 0.962 0.3362
## patient_disposition_cde70 -0.069909 0.570138 -0.123 0.9024
## patient_disposition_cde81 -10.165933 119.468043 -0.085 0.9322
## patient_disposition_cde83 -10.165933 68.974907 -0.147 0.8828
## patient_disposition_cde86 10.966123 119.468043 0.092 0.9269
## patient_disposition_cde89 10.966123 119.468043 0.092 0.9269
## patient_disposition_cde93 10.966123 119.468043 0.092 0.9269
## patient_disposition_cde99 1.019134 0.468869 2.174 0.0297 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178013 on 128502 degrees of freedom
## Residual deviance: 163569 on 128474 degrees of freedom
## AIC: 163627
##
## Number of Fisher Scoring iterations: 9
Based on variable importance, variables with a Mean Decrease Gini score of 1000 or above in each feature category are selected and cleaned for feature selection in the final prediction model.
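To make the Mean Decrease Gini score more concrete, the following Python sketch computes the Gini impurity decrease produced by a single binary split. It is illustrative only: the class counts are made up, the function names are hypothetical, and it is not the code used in this project. Random Forest implementations typically accumulate such decreases over every split that uses a variable and average them across trees, so a variable that often produces purer child nodes receives a higher score.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (deceased, not deceased) in a node


def gini(pos: int, neg: int) -> float:
    """Binary Gini impurity: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent: Counts, left: Counts, right: Counts) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted_children = (
        sum(left) / n * gini(*left) + sum(right) / n * gini(*right)
    )
    return gini(*parent) - weighted_children


# Hypothetical node: 50 deceased and 50 not deceased, split into two children.
parent = (50, 50)
left = (40, 10)    # mostly deceased
right = (10, 40)   # mostly not deceased

print(f"parent impurity: {gini(*parent):.2f}")
print(f"decrease from this split: {gini_decrease(parent, left, right):.2f}")
```

A split that leaves both children as mixed as the parent yields a decrease of zero, which is why uninformative variables accumulate low scores.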
## Predicted
## True 0 1
## 0 29985 17317
## 1 13959 34457
## Predicted
## True 0 1
## 0 26664 13030
## 1 8223 38882
## Predicted
## True 0 1
## 0 47475 12376
## 1 27594 32028
## Predicted
## True 0 1
## 0 54212 5545
## 1 8698 47201
## Predicted
## True 0 1
## 0 58389 2
## 1 18 57867
## Predicted
## True 0 1
## 0 45249 12334
## 1 14288 45846
## Predicted
## True 0 1
## 0 40497 15262
## 1 26462 23226
## ses_quartile_ind n
## 1 1 5817
## 2 2 22466
## 3 3 40484
## 4 4 57854
## 5 NA 1882
##
## 1 2 3 4
## 0 46.97 45.13 48.28 49.97
## 1 53.03 54.87 51.72 50.03
##
## Call:
## glm(formula = deceased ~ ses_quartile_ind, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.273 -1.210 1.085 1.145 1.176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.295186 0.021082 14.00 <2e-16 ***
## ses_quartile_ind -0.072935 0.006369 -11.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175410 on 126620 degrees of freedom
## Residual deviance: 175279 on 126619 degrees of freedom
## (1882 observations deleted due to missingness)
## AIC: 175283
##
## Number of Fisher Scoring iterations: 3
## blockgroup90_urban_cat n
## 1 1R 16568
## 2 2T 4458
## 3 3C 22337
## 4 4S 70390
## 5 5M 12899
## 6 <NA> 1851
##
## 1R 2T 3C 4S 5M
## 0 8085 1897 10963 34192 6213
## 1 8483 2561 11374 36198 6686
##
## 1R 2T 3C 4S 5M
## 0 48.80 42.55 49.08 48.58 48.17
## 1 51.20 57.45 50.92 51.42 51.83
##
## Call:
## glm(formula = deceased ~ blockgroup90_urban_cat, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.307 -1.202 1.053 1.153 1.162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.048054 0.015542 3.092 0.00199 **
## blockgroup90_urban_cat2T 0.252070 0.034046 7.404 1.32e-13 ***
## blockgroup90_urban_cat3C -0.011250 0.020511 -0.548 0.58337
## blockgroup90_urban_cat4S 0.008959 0.017275 0.519 0.60406
## blockgroup90_urban_cat5M 0.025318 0.023497 1.078 0.28124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175454 on 126651 degrees of freedom
## Residual deviance: 175386 on 126647 degrees of freedom
## (1851 observations deleted due to missingness)
## AIC: 175396
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = deceased ~ hysterectomy_ind, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.275 -1.152 1.083 1.203 1.203
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.059377 0.007408 -8.015 1.1e-15 ***
## hysterectomy_ind 0.285397 0.011305 25.245 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178013 on 128502 degrees of freedom
## Residual deviance: 177374 on 128501 degrees of freedom
## AIC: 177378
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = deceased ~ bilateral_mastectomy_ind, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.210 -1.210 1.146 1.146 1.524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.075258 0.005623 13.38 <2e-16 ***
## bilateral_mastectomy_ind -0.861789 0.050975 -16.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178013 on 128502 degrees of freedom
## Residual deviance: 177704 on 128501 degrees of freedom
## AIC: 177708
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = deceased ~ bilateral_oophorectomy_ind, family = binomial,
## data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.226 -1.197 1.130 1.158 1.158
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.045094 0.006542 6.893 5.46e-12 ***
## bilateral_oophorectomy_ind 0.068135 0.012549 5.430 5.65e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178013 on 128502 degrees of freedom
## Residual deviance: 177983 on 128501 degrees of freedom
## AIC: 177987
##
## Number of Fisher Scoring iterations: 3
The final prediction models use Random Forest. Each model is trained on the training set and validated on the test set. Performance is evaluated with AUC, which summarizes how well the model balances sensitivity and specificity. The error rate and accuracy rate are reported alongside a confusion matrix showing the counts of true and false positives and negatives.
The prediction models were built on target variables defined by different time windows: the total study period, and 1 month, 6 months, 1 year, 3 years, and 5 years after admission.
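As a rough illustration of how time-window targets like these can be derived, the Python sketch below flags whether a death falls within a given number of days after admission. It is a sketch under stated assumptions, not the project's code: the function name, the inclusive boundary, and the treatment of missing death dates as "not deceased" are all hypothetical choices, and it ignores censoring for admissions close to the end of follow-up. The window lengths (30, 180, 365, 1095, and 1825 days) are taken from the target variable names in the model output.

```python
from datetime import date
from typing import Dict, Optional

# Window lengths in days, matching the target variable names in the output.
WINDOWS_DAYS = (30, 180, 365, 1095, 1825)


def died_within(admission: date, death: Optional[date], window_days: int) -> int:
    """Return 1 if death occurred 0..window_days days after admission, else 0.

    Assumption: a missing death date means the participant is treated as alive.
    """
    if death is None:
        return 0
    days = (death - admission).days
    return int(0 <= days <= window_days)


def window_targets(admission: date, death: Optional[date]) -> Dict[int, int]:
    """Build one 0/1 target per time window for a single hospitalization."""
    return {w: died_within(admission, death, w) for w in WINDOWS_DAYS}


# Hypothetical record: admitted 2010-01-01, died 59 days later.
targets = window_targets(date(2010, 1, 1), date(2010, 3, 1))
print(targets)
```

For this hypothetical record the 30-day target is 0 and every longer window is 1, which shows why the positive class grows as the window widens.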
Variable importance was measured with Mean Decrease Gini to see which variables play the most critical role in each prediction.
RandomForest model with target variable: deceased
## Predicted
## True 0 1
## 0 8659 1063
## 1 839 9753
## Error rate
## [1] 0.094
## Accuracy rate
## [1] 0.90637
## Area under the curve: 0.9683
## 95% CI: 0.9673-0.9694 (DeLong)
RandomForest model with target variable: deceased_in30after_admission
## Predicted
## True 0 1
## 0 19053 36
## 1 646 579
## Error rate
## [1] 0.034
## Accuracy rate
## [1] 0.9664271
## Area under the curve: 0.904
## 95% CI: 0.8992-0.9088 (DeLong)
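The error and accuracy rates printed with each model follow directly from its confusion matrix. The Python sketch below (an illustration, not the original R code) recomputes them for the 30-day model above and also derives sensitivity and specificity, which the output does not print. The function name summarize is hypothetical; rows are assumed to be the true class and columns the predicted class, as labeled in the output.

```python
from typing import Dict


def summarize(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Derive standard rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual deaths detected
        "specificity": tn / (tn + fp),   # share of survivors correctly cleared
    }


# Cells copied from the deceased_in30after_admission matrix above.
m30 = summarize(tn=19053, fp=36, fn=646, tp=579)

for name, value in m30.items():
    print(f"{name}: {value:.4f}")
```

When the positive class is rare, as it is for the 30-day window, accuracy is dominated by the negatives, so it is worth reading sensitivity and AUC alongside the accuracy rate.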
RandomForest model with target variable: deceased_in180_after_admission
## Predicted
## True 0 1
## 0 17472 170
## 1 1805 867
## Error rate
## [1] 0.097
## Accuracy rate
## [1] 0.9027764
## Area under the curve: 0.8857
## 95% CI: 0.8824-0.8889 (DeLong)
RandomForest model with target variable: deceased_in365_after_admission
## Predicted
## True 0 1
## 0 16516 292
## 1 2214 1292
## Error rate
## [1] 0.123
## Accuracy rate
## [1] 0.8766368
## Area under the curve: 0.8865
## 95% CI: 0.8836-0.8894 (DeLong)
RandomForest model with target variable: deceased_in1095_after_admission
## Predicted
## True 0 1
## 0 13551 1051
## 1 2176 3536
## Error rate
## [1] 0.159
## Accuracy rate
## [1] 0.841144
## Area under the curve: 0.8977
## 95% CI: 0.8954-0.8999 (DeLong)
RandomForest model with target variable: deceased_in1825_after_admission
## Predicted
## True 0 1
## 0 13734 481
## 1 720 5378
## Error rate
## [1] 0.059
## Accuracy rate
## [1] 0.9408753
## Area under the curve: 0.9029
## 95% CI: 0.9005-0.9054 (DeLong)
RandomForest model with target variable: deceased_in30after_admission
## Predicted
## True 0 1
## 0 19011 78
## 1 1182 43
## Error rate
## [1] 0.062
## Accuracy rate
## [1] 0.9379738
## Area under the curve: 0.8201
## 95% CI: 0.8144-0.8258 (DeLong)
RandomForest model with target variable: deceased_in180_after_admission
## Predicted
## True 0 1
## 0 17381 261
## 1 2239 433
## Error rate
## [1] 0.123
## Accuracy rate
## [1] 0.8769322
## Area under the curve: 0.8472
## 95% CI: 0.8435-0.8508 (DeLong)
The final prediction models, which target risk of death in different time windows after hospitalization, provided clear findings. First, test-set accuracy ranged from about 84% to 97% across the time windows. Because all accuracy and error rates were computed on the test set, these figures are objective indicators of model performance. Given new patient data of the same type, these models can be expected to produce moderately accurate predictions.
Secondly, the variable importance plots differ by time window and indicate that short-term risk of death depends heavily on a single variable. In the 1-month, 6-month, and 1-year predictions, patient disposition code had an importance score far above that of any other variable. This is most likely due to the variable's “died” code. Although the “died” category accounts for only around 4% of the total deceased population, a large portion of short-term deaths are recorded as died in this variable, which affected the result.
As discussed in the hospitalization data exploration, although one of the patient disposition code categories is “died”, which directly indicates deceased status, the variable cannot be removed, because the rest of it provides important information for the prediction and a large share of the deceased population should not be eliminated. However, the short-term (6-month) model built with the same variable selection minus the patient disposition code also achieved a decent accuracy rate. This result suggests that more sophisticated modeling may require different feature selection for short-term than for long-term risk of death prediction.
On the other hand, in the longer-term predictions, major diagnosis category code and primary diagnosis CCS code catch up with patient disposition code. In the 5-year model specifically, major diagnosis category code takes the top position. Variables carrying diagnosis code information clearly play an important role in risk prediction.
Thirdly, the age group variable does not stand out across the predictions, especially in the short-term models. It scores higher in the longer-term predictions and takes the top position in the prediction of risk of death over the total period. This result indicates that, for short-term risk of death after hospitalization, other hospitalization data and individual baseline information are more important than age. Further research that subdivides the dataset by age block is expected to provide more accurate insight on this issue.
Lastly, and interestingly, in the very long-term risk prediction, menopausal status and high-carbohydrate diet play as important a role as the diagnosis-related predictors. This outcome suggests that they are influential factors in mortality among women. Menopausal status may be related to age; deeper research could reveal the association in detail.
One additional point to address is the relationship between cause of death codes and diagnosis ICD-9 codes. This project had difficulty making use of these important variables. However, combining cause of death codes with diagnosis ICD-9 codes, matched at a consistent level of ICD code detail, is believed to open the possibility of building a more sophisticated machine learning prediction model.