A Country's Secret to Happiness
The World Happiness Report was recommended as a good starting point for gauging worldwide bliss. I performed this analysis to understand which factors really make a country happy, and the results turned out to be quite interesting.
- Introduction
- Problem Statement
- Part A
- Exploratory Data Analysis
- A look into Correlation
- Performing ANOVA test between predictors and response variable to gauge how significantly it affects the scoring
- Looking at all countries and their ranks in Happiness Index Score
- Happiness with regards to Generosity and Economy
- Happiness with regards to Health and Economy
- Happiness with regards to Family and Economy
- Happiness with regards to Govt Trust and Economy
- World-wide View of Countries with regards to Generosity
- Trend of Happiness Over Time
- Part B
- Processing the Datasets
- Exploratory Data Analysis on Combined Dataset with Terrorism
- Predicting Happiness Index
- Predicting Terrorist attacks
- Conclusions
- Improvements That Can Be Done
I set out to use Machine Learning to try to answer the age-old question: what exactly makes us happy?
The World Happiness Report was recommended as a good starting point for gauging worldwide bliss. Its data points certainly helped throughout our analysis, but by the end we understood that the variables in this report alone are probably not sufficient to measure a country's happiness accurately, since "happiness" is very relative in nature.
Six factors are measured per country for gauging the World Happiness Index, and two Dystopia-related values are reported alongside them:
- GDP per Capita - Gross Domestic Product per capita of the country
- Family - Satisfaction rank of family
- Life Expectancy - Average expected years to live
- Freedom - Quantified perception of freedom
- Generosity - Numerical value estimated from the generosity perceived by poll takers in their country
- Trust/Government Corruption - A quantification of people's perceived trust in their government
- Dystopia Score - Score based on comparison with a hypothetical saddest country in the world
- Dystopia Residual - Rank of any country in a particular year
The Happiness Score calculated in the report is actually the average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril Ladder.
The Cantril Ladder asks respondents to imagine a ladder whose top step represents the most excellent life they can think of and, with that as the benchmark, to score their current life.
Credits:
- Univ.Ai Professor Pavlos Protopapas
- Kaggle Datasets
- Aashita Kesarwani - https://www.kaggle.com/aashita/guide-to-animated-bubble-charts-using-plotly - for demonstrating beautiful ways to plot bubble charts
- Jesper Sören Dramsch - https://www.kaggle.com/jesperdramsch/the-reason-we-re-happy - for demonstrating wonderful means of doing data analysis
- Jamaç Eren Ay - https://www.kaggle.com/yamaerenay/world-happiness-report-preprocessed - for preparing preprocessed datasets and making them free for all to use
Given the data available per country to gauge the Happiness Index, our aim is to:
- Part A - Analyze and understand which factors affect the Happiness Index Score of countries
- Part B - Analyze and understand the relationship between Terror Attacks and Happiness Index
- Part C - Create a Model to predict the Happiness Index of a Country
- Part D - See how much Health contributes to the Happiness Index and, with the current pandemic at hand, predict COVID-19 cases in the coming days for countries
- Part E - Create a Dashboard for viewing COVID-19 Predictions
The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data.
- The Spearman rank correlation coefficient, ρ, considers the ranks of the values for the two variables. ρ will always be a value between -1 and 1.
- The further away ρ is from zero, the stronger the relationship between the two variables. The sign of ρ corresponds to the direction of the relationship. If it is positive, then as one variable increases, the other tends to increase. If it is negative, then as one variable increases, the other tends to decrease.
- You use Spearman's correlation if your data have a non-linear relationship (like an exponential relationship) or you have one or more outliers. However, Spearman's correlation is only appropriate if the relationship between your variables is monotonic.
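As a small illustration of the points above, the sketch below compares Spearman's ρ with Pearson's r on a monotonic but exponential relationship. The arrays `gdp` and `score` are made-up values, not taken from the report; only `scipy.stats.spearmanr` and `scipy.stats.pearsonr` are real library calls.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: a monotonic but strongly non-linear (exponential) relationship.
gdp = np.linspace(0.0, 5.0, 30)
score = np.exp(gdp)

rho, _ = spearmanr(gdp, score)  # works on ranks, so a perfectly monotonic link gives rho = 1
r, _ = pearsonr(gdp, score)     # measures linear association only, so it stays below 1

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

On a pandas DataFrame, the same rank-based matrix can be obtained with `DataFrame.corr(method="spearman")`.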
Inference: From the above matrices, Health, GDP per Capita and Freedom appear to be the top 3 factors that correlate with the happiness index.
Inference:
From the above plot, we can infer that there seems to be a:
Linear Relationship: happiness_score v/s gdp_per_capita, happiness_score v/s health, happiness_score v/s freedom
Non-Linear Relationship: happiness_score v/s generosity, happiness_score v/s government_trust
Performing ANOVA test between predictors and response variable to gauge how significantly it affects the scoring
Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups are significantly different from each other. Its hypotheses are:
H0: The means of all groups are equal.
H1: At least one group mean is different.
- If the distributions overlap or are close together, the grand mean will be similar to the individual means, whereas if the distributions are far apart, the grand mean and the individual means differ by a larger distance.
- In ANOVA, we compare the between-group variability to the within-group variability through an F-test.
- If there is no significant difference between the groups, ANOVA's F-ratio will be close to 1.
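The F-ratio described above can be worked through by hand. The following is a minimal sketch using NumPy only; the helper `f_ratio` and the three groups of scores are hypothetical and are not the notebook's own code or data.

```python
import numpy as np

def f_ratio(groups):
    """One-way ANOVA F-ratio: between-group variance over within-group variance."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical happiness scores for three made-up groups of countries.
low = [1.0, 2.0, 3.0]
mid = [2.0, 3.0, 4.0]
high = [3.0, 4.0, 5.0]
f_value = f_ratio([low, mid, high])
print(f_value)  # 3.0
```

In practice `scipy.stats.f_oneway(low, mid, high)` computes the same statistic together with its p-value, which is what decides whether H0 is rejected.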
Two of the factors coming out of the ANOVA test also appear in our correlation inference, i.e. GDP per capita and health. Apart from those, government trust and family also seem to play quite a significant role in determining the happiness score.
Inference: Norway is clearly the top-scoring country in the Happiness Index. This is not surprising, since European countries tend to have better living conditions.
Inference: The bubbles farthest to the right are mostly countries in the European continent. Clearly they have a better GDP per capita. Surprisingly, European countries score only average on Generosity (Asian countries have the highest generosity) but have the highest Happiness Score rankings.
Inference: The bubbles farthest to the right are mostly countries in the European continent. Clearly they have better Health scores as well, since they sit at the top. The lowest health scores mostly belong to African and Asian countries.
Inference:
The bubbles farthest to the right are mostly countries in the European continent. Clearly they have better Family ratings. The lowest family rankings are actually a mixture of mostly African, South American and Asian countries, plus a few European and North American countries.
Inference:
Most countries rank low on government trust, which suggests that much of the world's population doesn't necessarily trust its governments despite the overarching push for democracy to be adopted. Countries with high government trust include Rwanda and, less surprisingly, Singapore, New Zealand and Finland.
From the chart we can notice that the continent of Europe has a good GDP per capita score compared to the others. Australian countries contribute the least to global GDP.
Part B
Analyze and understand the relationship between Terror Attacks and Happiness Index
Thoughts/ Motive
One of the things that intrigued us was terrorism across the world. With wars and conflicts happening on a day-to-day basis, we really wanted to understand to what extent terrorism plays a role in the happiness index. For this we combined two datasets: the Happiness Datasets and the World Terrorism dataset from the Global Terrorism Database (GTD).
In our datasets, we only took the count of terror attacks and not other information, such as text-based data on the context of what happened or the names of the weapons used, since that would delve into NLP. Our future work in scope is to use NLP to analyse the datasets as well, in order to better gauge the relationship between happiness and terrorism.
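A minimal sketch of how attack counts could be combined with happiness scores in pandas is shown here. The two small frames and all column names (`country`, `year`, `happiness_score`, `terror_attacks`) are assumptions made for illustration; the real datasets use their own column names.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; column names are assumptions.
happiness = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2016, 2016, 2016],
    "happiness_score": [7.5, 5.1, 3.4],
})
attacks = pd.DataFrame({
    "country": ["B", "C", "C", "C"],  # one row per recorded attack
    "year": [2016, 2016, 2016, 2016],
})

# Collapse one-row-per-attack into the number of attacks per country and year.
attack_counts = (
    attacks.groupby(["country", "year"])
    .size()
    .reset_index(name="terror_attacks")
)

# A left join keeps countries with no recorded attacks; their count becomes 0.
combined = happiness.merge(attack_counts, on=["country", "year"], how="left")
combined["terror_attacks"] = combined["terror_attacks"].fillna(0).astype(int)
print(combined)
```

Using a left join rather than an inner join is a design choice in this sketch: it avoids silently dropping peaceful countries, which would bias any correlation with the happiness score.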
Processing the Datasets
Having done the EDA on the Happiness Index, we wondered: what about terror attacks? Clearly the factors mentioned above are not sufficient to explain true happiness. So we decided to see how terror attacks combine with the happiness index and to answer the question of whether a correlation is present.
We can see that there are some countries which go through a lot of terrorist attacks.
There seems to be a:
Linear Relationship: happiness_score v/s gdp_per_capita, happiness_score v/s health, happiness_score v/s freedom
Non-Linear Relationship: happiness_score v/s generosity, happiness_score v/s government_trust
Inference:
With the data that we have, there doesn't seem to be much correlation between terror attacks and the happiness index. We would need more data to come to a significant conclusion as to how terrorism really affects the happiness index. Another factor that might help us understand the happiness index further is war conditions. Countries like Syria and Palestine are in critical war zones, which makes their living conditions poor and hence affects the happiness index.
We used Polynomial Lasso Regression, with polynomial features of degree 6, to predict the Happiness Score.
Our MSE value for Lasso Regression is 0.25 and our R2 score is 0.82, which is pretty satisfactory.
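A polynomial Lasso regression of this kind could be sketched with scikit-learn as follows. This is not the notebook's code: the data are synthetic, and `alpha`, the scaling step and the train/test split are assumptions, so the printed metrics will not match the values reported above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the predictors and the happiness score.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Expand to degree-6 polynomial terms, scale them, then fit an L1-penalised model.
model = make_pipeline(
    PolynomialFeatures(degree=6, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.01, max_iter=50_000),  # alpha is an assumed value
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE = {mse:.3f}, R2 = {r2:.3f}")
```

Scaling before the Lasso step matters here because high-degree polynomial terms live on very different scales, and the L1 penalty treats all coefficients alike.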
Why did we use Lasso Regression?
- We understood that Lasso tends to do well when a small number of parameters are significant and the others are close to zero (i.e. when only a few predictors actually influence the response). This was our case, since our number of parameters was relatively small, so it seemed like a good approach to take. Ridge works well if there are many large parameters of about the same value (i.e. when most predictors impact the response).
- Lasso, or Least Absolute Shrinkage and Selection Operator, is conceptually quite similar to ridge regression. It adds a penalty for non-zero coefficients. However, unlike ridge regression, which penalizes the sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (the L1 penalty). As a result, for high values of λ, many coefficients are exactly zeroed under lasso.
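The difference between the two penalties can be seen on a toy problem. In this sketch (synthetic data, assumed `alpha` values) only two of ten predictors influence the response, and we count how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two of ten predictors influence the response.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
print(f"coefficients set exactly to zero: lasso={lasso_zeros}, ridge={ridge_zeros}")
```

Ridge shrinks the irrelevant coefficients towards zero but leaves them non-zero, whereas lasso removes them from the model, which is why lasso doubles as a feature selector.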
What did we do in MLP Regressor?
- We chose multiple layers to capture non-linearity in the model. Multiple layers add non-linearity, but too many layers may lead to overfitting.
- After experimenting with multiple combinations of layers and neurons, three layers with the depicted neuron counts turned out to be suitable for our model.
- We also used the default activation function, ReLU, for the hidden layers. Since this is a regression problem, the output layer stays linear while ReLU supplies the non-linearity.
Our MSE value for MLP Regressor is 0.26 and our R2 score is 0.82, which is pretty much the same as Lasso Regression.
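A three-hidden-layer regressor of this kind might look like the sketch below, assuming scikit-learn's `MLPRegressor`. The layer sizes `(64, 32, 16)`, the scaling step and the synthetic data are assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and the happiness score.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 2.0 + 3.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(0.0, 0.05, size=300)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(
        hidden_layer_sizes=(64, 32, 16),  # three hidden layers; sizes are assumptions
        activation="relu",                # scikit-learn's default hidden activation
        max_iter=2000,
        random_state=0,
    ),
)
model.fit(X, y)

predictions = model.predict(X)
fitted_mlp = model[-1]
# Three hidden layers plus the output layer give four weight matrices.
print(len(fitted_mlp.coefs_), fitted_mlp.out_activation_)
```

The `out_activation_` attribute shows that the output layer uses the identity function, so ReLU only shapes the hidden layers.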
Predicting Terrorist attacks
We also experimented with the variables we have from the happiness dataset to see if we could satisfactorily predict the number of terrorist attacks likely to happen.
Of course, the model does not have the best performance, because we understand that there are more factors that affect the outcome.
Our future work here is to gather more external factors relating to what sparks terrorism attacks and to create a model that allows for better risk handling.
Clearly our model is not performing well here.
Part D
See how much Health contributes to the Happiness Index and, with the current pandemic at hand, predict COVID-19 cases in the coming days for countries
Thoughts
From Part A, we realized that Health does play a major role in a country's happiness score. With the current pandemic at hand, we were motivated to look at COVID cases and forecast the upcoming cases. We wanted to compare the COVID data with the happiness index data. However, we felt that it would not give the right results, since the 2020 happiness index data is from January-February, when the COVID health crisis had not yet taken hold.
However, in pursuit of excitement and interest, we decided to go ahead and build a basic forecasting model on a COVID-19 dataset using fbprophet.
What and Why Prophet?
Prophet is Facebook's open source time series prediction library. It decomposes a time series into trend, seasonality and holiday components, and it has intuitive hyperparameters which are easy to tune.
Prophet time series = Trend + Seasonality + Holiday + Error
Trend models non-periodic changes in the value of the time series. Seasonality captures periodic changes such as daily, weekly or yearly patterns. Holiday effects occur on irregular schedules over a day or a period of days. The error term is whatever the model does not explain.
We believe that the advantages of using Prophet are:
- Accommodates seasonality with multiple periods
- Prophet is resilient to missing values
- Outliers can be handled simply by removing them, since Prophet copes with the resulting gaps
- Fitting of the model is fast
- Intuitive hyperparameters which are easy to tune
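A basic Prophet workflow could be sketched as below. The helper names `to_prophet_frame` and `forecast_cases`, the column names `date` and `confirmed`, and the sample counts are all hypothetical. Prophet itself only requires an input frame with a `ds` (date) column and a `y` (value) column. The package was published as `fbprophet` in older releases and as `prophet` in newer ones, so the import line is an assumption about the installed version.

```python
import pandas as pd

def to_prophet_frame(cases: pd.DataFrame, date_col: str, value_col: str) -> pd.DataFrame:
    """Rename columns to the `ds` / `y` names that Prophet expects."""
    frame = cases[[date_col, value_col]].rename(
        columns={date_col: "ds", value_col: "y"}
    )
    frame["ds"] = pd.to_datetime(frame["ds"])
    return frame.sort_values("ds").reset_index(drop=True)

def forecast_cases(history: pd.DataFrame, periods: int = 14) -> pd.DataFrame:
    """Fit Prophet on a `ds` / `y` frame and forecast `periods` days ahead."""
    # Imported lazily so the helper above works without Prophet installed.
    from prophet import Prophet  # `from fbprophet import Prophet` on older installs

    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Hypothetical daily case counts for one country, deliberately out of order.
raw = pd.DataFrame({
    "date": ["2020-03-03", "2020-03-01", "2020-03-02"],
    "confirmed": [30, 10, 18],
})
history = to_prophet_frame(raw, "date", "confirmed")
print(history)
```

With Prophet installed and a longer history, `forecast_cases(history, periods=14)` would return the predicted values (`yhat`) with their uncertainty bounds for the historical dates plus the next 14 days.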
Credits to https://towardsdatascience.com/time-series-prediction-using-prophet-in-python-35d65f626236 for information on Prophet.
Part E
Creating a Dashboard for viewing COVID-19 Predictions
Our very own COVID-19 Forecasting Dashboard
Using the model that we built, we created a COVID-19 Forecasting Dashboard. You can view it at this link:
https://covid-prediction.herokuapp.com/
Our main motivation here was to learn how best to present the model outcomes to an audience.
You can see the code in our file under the name: Covid-pred
Conclusions
- The factors used to calculate the Happiness Index of countries are not holistic and inclusive. There are other factors to be considered as well. GDP per capita seems to be a skewed figure in itself, and the limitations that GDP poses are highly likely to bias the happiness score.
- We did not find much correlation between the number of terror attacks and the happiness index of a country. However, we believe we need to consider more factors and influences pertaining to terrorism to properly see the relationship.
- For COVID-19 forecasts, we performed univariate analysis on our historical data, which made us realize that historical data alone might not be sufficient for prediction. It is certainly one of the main predictors, though, and it can be combined with other predictors to create a more powerful model.
Improvements That Can Be Done
Improvement: Figure out another way to calculate the Happiness Index of a country that includes more holistic and inclusive factors
Based on our observations, we believe that factors beyond the 6 selected need to be considered in order to score the happiness index accurately. A possible improvement would be to research an alternative way to calculate the index without using GDP per capita as a score.
Improvement: Move to using NLP & Decision Trees for analyzing Terrorism Data
Most of the factors in the Terrorism Dataset are text based. Hence, NLP would be the best way for us to understand the influence of these predictors on the response. To improve model prediction, we believe models based on Decision Trees will help.
Improvement: Move to Multivariate Analysis
We forecasted COVID-19 cases using only past data. However, we are aware that historical data alone is not enough to make accurate forecasts, as there are many other external factors. Our intention was more or less to look at the trend and observe how it will move in the future.