A. Background
My interest in data science comes from years of analyzing small, self-generated datasets. The insights I eked out of those experiments were exciting, but I quickly realized that self-generated data doesn't scale. That, together with the practically limitless supply of public data waiting to be analyzed, convinced me to learn data science. I am personally interested in drug repurposing, genetics, and public health, so I started there. I recently developed a model to investigate the demographic factors contributing to global maternal mortality rates and decided to continue the project for the United States.
In the literature, the focus is usually on tackling low-hanging fruit to reduce the maternal mortality rate (MMR). The suggestions usually aim to increase access to clinicians, emergency services, and prenatal education. Most of these interventions help substantially in countries with poor medical access and infrastructure. However, the United States, which has one of the most expensive health care systems in the world, has an abysmally high maternal mortality rate. By my calculations, our average MMR of 14 deaths per 100,000 live births puts us in 46th place out of 188 countries. That is right behind Qatar (45th) and just ahead of Uruguay (47th).
This poor performance is all the more striking because these deaths are almost entirely preventable. It should also be noted that in some states the MMR is at or below the global minimum. Unfortunately, other states have MMRs as high as 20 deaths per 100,000 live births. So why is this still a problem in the U.S.? Why are we behind other rich, post-industrialized countries? I took a stab at developing a model to shed some light on the problem. You can see the full project write-up as an IPython notebook on my GitHub or read the brief summary below.
B. Selecting Features
Correlation Matrix
Highly correlated features are redundant and were pared down to simplify the model and improve its accuracy. In this case, removing the features that correlate with teen birth rate improved the model accuracy. The following features were included in the models (for detailed descriptions of the feature variables, please see the project documentation on GitHub):
Feature Choropleths and Descriptions
- 'MMR' - Maternal mortality rate (MMR) per 100,000 live births.
- 'MedianIncome' - Median household income in US dollars.
- 'Medicaid_Extend_Pregnancy' - Does Medicaid cover pregnancy costs? (yes: 1, no: 0)
- 'Economic Distress' - Rank of economic distress across all factors (based on foreclosures, unemployment, and food stamp usage)
- 'Teen Birth Rate per 1,000' - The number of births (per 1,000 live births) to teenagers aged 15-19
- 'PPR_White' - Proportional poverty rate of Caucasians (poverty of Caucasians compared to the proportion of Caucasians in the general population)
- 'PPR Non-White' - Proportional poverty rate of non-Caucasians (poverty of non-Caucasians compared to the proportion of non-Caucasians in the general population)
- 'Abortion_Policy_Rank' - Rank of abortion policy across all factors (1 - fewest barriers to abortion, 50 - most barriers to abortion)
- 'Pill_InsurePol' - State legislation for insurers to cover contraceptives (0 - no coverage policy, 3 - meets coverage policy)
- 'EC_Access' - State legislation for accessibility to emergency contraceptives (0 - no policy, 3 - pro-accessibility policy)
- 'State Taxes Per Capita' - Includes property taxes, income taxes, sales tax, and other taxes per capita in US dollars ($)
- 'Total Exports' - Measure of total agricultural exports
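The pruning step described at the top of this section could look something like the sketch below. It is not the code from the notebook: the 0.8 cutoff, the `drop_correlated` helper, and the two `hypothetical_*` columns are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, keep: str, threshold: float = 0.8) -> pd.DataFrame:
    """Drop every column whose absolute Pearson correlation with `keep` exceeds `threshold`."""
    corr = features.corr().abs()
    redundant = [
        col for col in features.columns
        if col != keep and corr.loc[keep, col] > threshold
    ]
    return features.drop(columns=redundant)


# Synthetic stand-in for the state-level table (one row per state).
rng = np.random.default_rng(0)
teen = rng.normal(25, 8, size=50)
demo = pd.DataFrame({
    "Teen Birth Rate per 1,000": teen,
    # Hypothetical feature that is almost a rescaled copy of the teen birth rate.
    "hypothetical_redundant": teen * 2.0 + rng.normal(0, 1, size=50),
    # Hypothetical feature that is unrelated to the teen birth rate.
    "hypothetical_independent": rng.normal(0, 1, size=50),
})

pruned = drop_correlated(demo, keep="Teen Birth Rate per 1,000")
print(list(pruned.columns))
```

Keeping the teen birth rate and dropping its near-duplicates mirrors the choice described above; a different cutoff would keep or drop a different set of columns.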
C. Evaluating Model
Response Variables:
- MMR - The continuous MMR variable is the rate of maternal deaths per 100,000 live births. The random forest regressor model was used with this response variable.
- MMR Classifier - In this response variable, the MMRs were mapped to a scale from 1 to 4 based on quartile rank (highest MMR - 4, lowest MMR - 1). The random forest classifier model was used with this response variable.
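As a rough illustration of the quartile mapping, pandas' `qcut` can bin a continuous column into four equally sized groups. The MMR values below are made up for the example and are not the project's data; whether the notebook used `qcut` is also an assumption.

```python
import pandas as pd

# Illustrative MMR values (deaths per 100,000 live births), not real state data.
mmr = pd.Series([5.0, 8.0, 11.0, 12.5, 14.0, 16.0, 18.5, 20.0], name="MMR")

# Quartile rank: 1 = lowest MMR quartile, 4 = highest MMR quartile.
mmr_class = pd.qcut(mmr, q=4, labels=[1, 2, 3, 4]).astype(int)

print(mmr_class.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```

Because each class holds about a quarter of the rows, always guessing a single class is right about 25% of the time, which is the baseline the classifier has to beat.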
Model Evaluation
Random Forest Regressor:
- Null RMSE (comparing average MMR and MMR100K): 4.921
- RMSE for all demographics (max_features=9, cv=20, n_estimators=170): 4.200
- Improvement of ~15% over the null RMSE
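For readers who want to check the arithmetic, here is a minimal sketch of a null RMSE and of the relative improvement quoted above. It assumes the null model predicts the mean MMR for every state; the helper names are mine, not the notebook's.

```python
import numpy as np


def null_rmse(y):
    """RMSE of a baseline that predicts the mean of y for every observation."""
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((y - y.mean()) ** 2)))


def relative_improvement(null_score, model_score):
    """Fraction by which the model error is lower than the null error."""
    return (null_score - model_score) / null_score


# The two RMSE values reported above.
gain = relative_improvement(4.921, 4.200)
print(f"{gain:.1%}")  # 14.7%
```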
Random Forest Classifier:
- Null Accuracy: 0.26
- Random Forest Classifier Accuracy (n_estimators=300, max_features=1, cv=10): 0.43
- Improvement of 17 percentage points over null accuracy (~65% relative to the null, or ~40% of the final score)
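The classifier's gain can be expressed in more than one way, so a small sketch of the arithmetic may help. It assumes the null accuracy is the share of the most common class; the function names are illustrative only.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def gains(null_acc, model_acc):
    """Express the same accuracy gain three different ways."""
    absolute = model_acc - null_acc
    return {
        "absolute": absolute,                     # difference in accuracy
        "relative_to_null": absolute / null_acc,  # gain as a fraction of the null
        "share_of_final": absolute / model_acc,   # gain as a fraction of the final score
    }


# The two accuracy values reported above.
reported = gains(0.26, 0.43)
for name, value in reported.items():
    print(f"{name}: {value:.2f}")
```

With the reported numbers this gives 0.17 absolute, about 0.65 relative to the null, and about 0.40 as a share of the final accuracy.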
D. Conclusions
"Teen birth rates per 1,000 live births had the highest feature importance."
Feature importance ranks for the random forest classifier are shown below. The teen birth rate per 1,000 births has the highest feature importance: it accounts for more of the variation in the MMR response variable than any other feature, although the top seven features are closely spaced (0.107 to 0.133). Proportional poverty for people of color is also high on the list. Total agricultural exports, state taxes per capita, and abortion policy rank are other important features of the model.
Feature importance:
* Teen Birth Rate per 1,000 0.133484
* PPR Non-White 0.126148
* Total Exports 0.123436
* State Taxes Per Capita 0.121812
* Abortion_Policy_Rank 0.120240
* Economic Distress 0.113177
* PPR_White 0.107429
* Pill_InsurePol 0.064076
* EC_Access 0.051818
* Medicaid_Extend_Pregnancy 0.038380
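A ranking like the one above can be produced along the following lines. This is a sketch, not the notebook's code: it assumes scikit-learn (the post does not name the library, though the parameter names suggest it), uses synthetic data with only three of the features, and borrows the n_estimators, max_features, and cross-validation settings quoted in the evaluation section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 52 rows, three of the feature names from the list above.
rng = np.random.default_rng(42)
feature_names = ["Teen Birth Rate per 1,000", "PPR Non-White", "Total Exports"]
X = rng.normal(size=(52, len(feature_names)))

# Synthetic response: quartile class (1-4) driven by the first column only.
cuts = np.quantile(X[:, 0], [0.25, 0.5, 0.75])
y = np.digitize(X[:, 0], cuts) + 1

clf = RandomForestClassifier(n_estimators=300, max_features=1, random_state=0)

# 10-fold cross-validated accuracy, as in the evaluation settings quoted earlier.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.2f}")

# Fit on all rows, then rank features by impurity-based importance.
clf.fit(X, y)
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name} {importance:.6f}")
```

The importances from a fitted forest sum to 1, so each value can be read as a feature's share of the total impurity reduction across the trees.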
Discussion:
Given the small amount of variation in MMR across the states, the Random Forest Regressor improved only modestly on the null model (~15% lower RMSE). Once the MMRs were classified into quartiles, the random forest model performed better: accuracy rose from a null of 0.26 to 0.43, a gain of 17 percentage points. The overall accuracy was still fairly low (43%) but was a clear improvement over the null.
It is interesting to note that total agricultural exports rank quite high in feature importance. This finding is difficult to interpret, given that exports do not correlate with any of the other measures of economic prosperity.
If you have feedback or ideas, please feel free to leave a comment below.