The dataset is available on Kaggle and is based on the Gallup World Poll. It ranks 149 countries by happiness and also provides scores for several factors that may influence a country’s happiness score, including ‘Freedom to make life choices’, ‘Life expectancy’, ‘Generosity’ and ‘GDP per capita’.
The data is straightforward to load into Python and contains no missing values, with all necessary columns already in their required data types.
import pandas as pd

report = pd.read_csv('world-happiness-report-2021.csv')
# count missing values per column (isnull on the dataframe, not on the column labels)
print(report.isnull().sum())
# there are no null values in the dataset so we can continue
print(report.dtypes)
# all columns are in order in terms of variable types
Let’s dive straight into the most important variable in this dataset, the happiness score.
Top 20 Countries by happiness rank
report.groupby('Country name')['Ladder score'].sum().sort_values(ascending=False)[0:20].plot(kind='barh')
Finland tops the ranking with the highest score, followed by Denmark and Switzerland in silver and bronze. In fact, all five culturally Scandinavian (Nordic) countries are in the top 7. Western European countries make up 65% of the top 20, as the count by region shows:
Western Europe 13
North America and ANZ 4
Latin America and Caribbean 1
Central and Eastern Europe 1
Middle East and North Africa 1
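A count like the one above can be produced in a couple of lines. The following is a minimal sketch rather than the code originally used: the helper name `top_n_region_counts` and the toy dataframe are illustrative assumptions, and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the report dataframe (made-up values, not the real data).
toy = pd.DataFrame({
    'Country name': ['A', 'B', 'C', 'D', 'E'],
    'Regional indicator': ['Western Europe', 'Western Europe', 'South Asia',
                           'North America and ANZ', 'South Asia'],
    'Ladder score': [7.8, 7.6, 4.0, 7.2, 3.5],
})

def top_n_region_counts(df, n=20):
    """Count how many of the n highest-scoring countries fall in each region."""
    top = df.nlargest(n, 'Ladder score')
    return top['Regional indicator'].value_counts()

# With the toy data we look at the top 3 instead of the top 20.
counts = top_n_region_counts(toy, n=3)
print(counts)
```

On the real data, calling `top_n_region_counts(report)` should give the regional breakdown of the top 20.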
Effect of GDP on happiness score
It is very easy to see a strong positive correlation between the Logged GDP per capita and the happiness score across countries. In fact, there are similar positive correlations with several of our other factor variables, which we’ll look into soon.
Our GDP variable has the strongest correlation with our happiness variable (0.79), followed by life expectancy (0.77) and social support (0.76). The generosity variable has the weakest correlation (-0.018), which can effectively be treated as 0.
The strongest correlation in our correlation matrix (other than the 1.00 for variables with themselves) is between GDP and life expectancy. A larger GDP per capita goes hand in hand with a longer life expectancy, and vice versa, although the correlation alone does not tell us which drives which.
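For readers who want to reproduce figures like these, here is a minimal sketch of how the correlations with the happiness score could be computed. The helper name `correlations_with_score` and the toy values are assumptions for illustration; the real numbers quoted above come from the full dataset.

```python
import pandas as pd

# Toy stand-in for the report dataframe (made-up values, not the real data).
toy = pd.DataFrame({
    'Ladder score': [7.8, 6.9, 5.5, 4.7, 3.6],
    'Logged GDP per capita': [10.8, 10.5, 9.6, 8.9, 7.7],
    'Generosity': [0.10, -0.15, 0.20, -0.05, 0.00],
})

def correlations_with_score(df, target='Ladder score'):
    """Pearson correlation of every numeric column with the target column."""
    corr = df.select_dtypes('number').corr()
    # Drop the trivial 1.00 of the target with itself and sort strongest first.
    return corr[target].drop(target).sort_values(ascending=False)

result = correlations_with_score(toy)
print(result)
```

The full matrix is simply `report.select_dtypes('number').corr()`, which can then be passed to a heatmap function for plotting.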
I’m interested to find out how our scatterplot above would look if we coloured the points using the regions the countries are in. I hypothesise that we will see clusters of the same colour representing the regions.
As predicted, the regions form clusters. Western Europe sits mainly in the top right, with the highest happiness and GDP.
Let’s look at this with the regression line:
Most of our Latin America and Caribbean countries have above-average happiness scores despite a median GDP, perhaps suggesting that GDP is less of a factor in these countries (although still important).
Afghanistan sits far from most of the other South Asian countries, which made me curious about which countries deviate most from their region’s mean happiness.
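One way to put a number on "above the line" is to fit the regression line and average the residuals per region. This is a rough sketch under assumptions: it uses `np.polyfit` for a straight-line fit, the helper name `mean_residual_by_region` is made up, and the toy values are not from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the report dataframe (made-up values, not the real data).
toy = pd.DataFrame({
    'Regional indicator': ['R1', 'R1', 'R2', 'R2'],
    'Logged GDP per capita': [8.0, 9.0, 10.0, 11.0],
    'Ladder score': [5.0, 5.5, 5.0, 5.5],
})

def mean_residual_by_region(df, x='Logged GDP per capita', y='Ladder score'):
    """Fit y = slope * x + intercept and average the residuals within each region."""
    slope, intercept = np.polyfit(df[x], df[y], deg=1)
    residual = df[y] - (slope * df[x] + intercept)
    # Positive mean residual: the region is happier than its GDP alone would predict.
    return residual.groupby(df['Regional indicator']).mean()

res = mean_residual_by_region(toy)
print(res)
```

A positive value for a region means its countries sit, on average, above the fitted line.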
Biggest outliers from each region
regional_avg_gdp = report.groupby('Regional indicator')[['Ladder score','Logged GDP per capita']].mean().reset_index()
regional_score_avg = regional_avg_gdp.copy()
regional_score_avg.columns = ['Regional indicator','avg_ladder_score','avg_gdp']
report2 = report.merge(regional_score_avg, left_on='Regional indicator', right_on='Regional indicator')
report2['avg_diff'] = report2['Ladder score']-report2['avg_ladder_score']
report2['avg_diff_abs'] = abs(report2['Ladder score']-report2['avg_ladder_score'])
# top 10 differences
report2.sort_values(['avg_diff_abs'], ascending=False)[['Country name','Ladder score','avg_ladder_score','avg_diff']][0:10]
# top 10 best performers
report2.sort_values(['avg_diff'], ascending=False)[['Country name','Ladder score','avg_ladder_score','avg_diff']][0:10]
# top 10 worst performers
report2.sort_values(['avg_diff'], ascending=True)[['Country name','Ladder score','avg_ladder_score','avg_diff']][0:10]
# biggest differences df
diff_df = pd.concat([report2.sort_values(['avg_diff'], ascending=False)[['Country name','Ladder score','avg_ladder_score','avg_diff']][0:10],
                     report2.sort_values(['avg_diff'], ascending=True)[['Country name', 'Ladder score', 'avg_ladder_score', 'avg_diff']][0:10]],
                    join='inner').sort_values('avg_diff')
diff_df.groupby('Country name')['avg_diff'].sum().sort_values().plot(kind='barh')
To do this we create a dataframe that holds the averages for each region, then merge it with the original dataframe so that each country’s row carries its regional average as a column. From there we create a column that subtracts the regional average happiness score from each country’s score. We do this twice: once as a signed difference and once as an absolute value.
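As an aside, the same regional-average column can be built without a merge by using `groupby(...).transform('mean')`. The sketch below is an alternative to the approach described, not the code used for the results here; the helper name `add_regional_diff` and the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in for the report dataframe (made-up values, not the real data).
toy = pd.DataFrame({
    'Country name': ['A', 'B', 'C', 'D'],
    'Regional indicator': ['R1', 'R1', 'R2', 'R2'],
    'Ladder score': [6.0, 4.0, 7.0, 7.5],
})

def add_regional_diff(df):
    """Attach each country's regional mean score and its signed/absolute gap to it."""
    out = df.copy()
    # transform('mean') returns one value per row, aligned with the original index.
    out['avg_ladder_score'] = out.groupby('Regional indicator')['Ladder score'].transform('mean')
    out['avg_diff'] = out['Ladder score'] - out['avg_ladder_score']
    out['avg_diff_abs'] = out['avg_diff'].abs()
    return out

result = add_regional_diff(toy)
print(result)
```

Because `transform` keeps the original row order and index, no join key is needed, which removes one place where a merge could silently duplicate or drop rows.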
Haiti and Israel are the two extremes with the former having a lower happiness score than the average of its region and the latter performing much better than its respective region.
The Middle East and North African region has the greatest disparity between the happiness of its countries whilst North America and ANZ have the smallest disparity.
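The regional disparity can be checked numerically. How disparity was measured is not stated here, so this sketch assumes the standard deviation of the happiness score within each region, with the range shown alongside; the helper name `regional_spread` and the toy values are illustrative.

```python
import pandas as pd

# Toy stand-in for the report dataframe (made-up values, not the real data).
toy = pd.DataFrame({
    'Regional indicator': ['R1', 'R1', 'R2', 'R2'],
    'Ladder score': [6.0, 4.0, 7.0, 7.5],
})

def regional_spread(df):
    """Standard deviation and range of the happiness score within each region."""
    spread = df.groupby('Regional indicator')['Ladder score'].agg(['std', 'min', 'max'])
    spread['range'] = spread['max'] - spread['min']
    # Regions with the most internal disparity come first.
    return spread.sort_values('std', ascending=False)

spread = regional_spread(toy)
print(spread)
```

Note that regions with very few countries, such as North America and ANZ, will tend to show a small spread partly because of their size.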
Looking at the remaining 5 factors
As discussed earlier, the correlations of the remaining 5 factors are seen above, with Generosity showing no correlation and ‘Perceptions of corruption’ showing a negative correlation.
The regional clusters are once again apparent in these plots.
Using Linear Regression to predict happiness
We can use Linear Regression to try to predict the happiness of our countries from the region and the 6 factors. We first need to isolate the columns of interest and apply one-hot (dummy) encoding to the region column.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = report.iloc[:,[1,2,6,7,8,9,10,11]]
data = pd.get_dummies(data)
X, y = data.iloc[:,1:16], data.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))
Training set score: 0.83
Test set score: 0.72
Using different scalers we arrive at the same result. With only 149 rows it is difficult to improve our score. A larger sample, such as the 2008–2019 version of this dataset, would be needed, but we will put that on the list for another time.
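The scaler result is expected rather than a coincidence: ordinary least squares with an intercept gives the same predictions, and therefore the same R² score, when features are shifted and rescaled. A small sketch on synthetic data (random numbers, not the happiness dataset) illustrates this; the feature scales and coefficients are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3)) * [1.0, 10.0, 100.0]
y = X @ [0.5, 0.05, 0.005] + rng.normal(scale=0.3, size=60)

X_train, X_test = X[:45], X[45:]
y_train, y_test = y[:45], y[45:]

# Fit on the raw features.
raw_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Fit on standardized features; the scaler is fitted on the training split only.
scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)
scaled_score = scaled_model.score(scaler.transform(X_test), y_test)

print("raw: {:.6f}  scaled: {:.6f}".format(raw_score, scaled_score))
```

Scaling would start to matter with regularised models such as Ridge or Lasso, where the penalty depends on the size of the coefficients.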