Where do the internet users come from?

The blog is now on the internet, and everybody on the wild world of web can access it. But what does that mean: where do the people on the internet come from? How widespread is internet access in the world, really?

Yes, there are plenty of fine and detailed studies on this. But I wanted to do it all by myself, powered by open data and freely available open source tools.

The plan was to have it all wrapped up into one single blog entry, but there was too much to cover. So, there will most likely be part two – or more. Sorry about that.

The Tools

The main tool here is the Jupyter Notebook with Python in the Anaconda Distribution. All the twists and turns of the exploration are visible through Python code snippets, showing all the details and revealing the cruel fact that Python is not (yet) my strongest language. I’m more comfortable in R but learning more and more Pythonese every day.

The blog is hosted on WordPress, which is not the most comfortable choice for publishing Jupyter-based content, especially when one is not willing to pay for a subscription that allows choosing and installing plugins, some of which might make this a lot easier. So there is quite some manual post-processing on WordPress involved, at least for now.

Still, the blog comes from a Jupyter notebook, and there we start by importing key libraries and setting some common parameters as follows:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns

plt.rcParams['figure.figsize'] = [12.0, 9.0]
plt.rcParams['font.size'] = 8

NumPy makes matrix mathematics work almost like in R, matplotlib draws most of the graphs, pandas handles reading, writing and general massaging and analysis of data, and statsmodels gives access to a near R-like library of regression models and their diagnostics, supported by seaborn.

The Data

For internet usage data my first stop was ITU, as it was the natural choice. And yes, they do have data on telecommunications and on the internet, but even the quite superficial time series data I needed here was behind a paywall.

Next stop was United Nations, but there the issue for me was poor access ergonomics where each indicator had to be retrieved separately. Furthermore, internet usage data extended only to 2014.

Fortunately, World Bank has a very nice interactive portal of data covering an amazing variety of indicators on human activity – and other things. Indicators originating from several sources have been sanitized to common terminology and supplemented with quite detailed descriptions.

They have bundled a lot of data into an easily downloadable World Development Indicators Excel file (link to download page), which is licensed under CC-BY 4.0, so there are no issues in using the data publicly, even when it includes data that would otherwise be behind the ITU paywall. Very nice.

The file is relatively large (~120 MB) and includes about 1600 indicator value series for different countries and regions starting 1960. Not all indicators are available for all countries and for all years, but there is plenty of data to play with.

The Excel file contains a separate sheet with the following metadata on each indicator:

  • Series Code
  • Topic
  • Indicator Name
  • Short definition
  • Long definition
  • Unit of measure
  • Periodicity
  • Base Period
  • Other notes
  • Aggregation method
  • Limitations and exceptions
  • Notes from original source
  • General comments
  • Source
  • Statistical concept and methodology
  • Development relevance
  • Related source links
  • Other web links
  • Related indicators
  • License Type

So, it is a very good source of data for playing with statistics.

The indicator we are investigating is IT_NET_USER_ZS (IT.NET.USER.ZS in the World Bank naming; the dots are replaced with underscores when the data is read), which means the percentage of the population who have used the internet during the past three months. As an indicator it feels quite weak, but based on some published studies (check the links at the end of this post) it typically reflects much more intensive use in reality: once people start to use the internet they get hooked, I guess.

Another useful downloadable Excel file from the World Bank is their classification of economies (World Bank Classification), which gives additional perspective on individual countries or economies. I used the version dated summer 2017.

Reading the Data

Reading the large Excel file took quite some time on my laptop. Therefore, I decided to save a copy of the pre-processed data as a pickle, which is far faster to read with Python than Excel.

createPickle = False    # create or simply use the pickle file
if createPickle:
    # Importing the big World Development Indices dataset
    datasetwdi = pd.read_excel('../Data/WDIEXCEL.xlsx')
    # Getting country list, i.e. economies
    economies = pd.read_excel('../Data/WB_CLASS_2017_06.xls', skiprows=4)
    # exclude aggregates, i.e. entities which do not have a region
    economies = economies[economies['Region'].notnull()]
    # some additional cleanup
    economies = economies[economies['Region'] != 'x']
    del economies['x']
    del economies['x.1']
    del economies['X']
    # merge by economy classifications to the main data set
    dataset = datasetwdi.set_index('Country Code').join(economies.set_index('Code'), how='left')
    # using statsmodels but dots in variable names were not accepted by patsy as used by statsmodels.. Renaming
    # regex=False so that '.' is treated as a literal dot and not as the regex wildcard
    dataset['Indicator Code'] = dataset['Indicator Code'].str.replace('.', '_', regex=False)
    # store for future use
    economies.to_pickle('economies.pkl')
    dataset.to_pickle('wdi.pkl')
else:
    # reading data from pre-processed pickle file to save time 
    dataset = pd.read_pickle('wdi.pkl')
    economies = pd.read_pickle('economies.pkl')
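
As a side note, the manual createPickle flag could be replaced by simply checking whether the cache file already exists. The following is only a minimal sketch of that idea: the helper load_cached and the toy slow_build function are made up for this illustration and stand in for the slow pd.read_excel step.

```python
import os
import tempfile

import pandas as pd


def load_cached(cache_path, build):
    """Return the DataFrame pickled at cache_path, building and caching it on first use."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    frame = build()
    frame.to_pickle(cache_path)
    return frame


def slow_build():
    # Hypothetical stand-in for the slow Excel import; the values are made up
    return pd.DataFrame({'Country Code': ['AAA', 'BBB'], '2016': [1.0, 2.0]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    first = load_cached(path, slow_build)    # builds the frame and writes the pickle
    second = load_cached(path, slow_build)   # reads the pickle back
    print(first.equals(second))
```

The trade-off is that a stale pickle is never rebuilt automatically, so the file has to be deleted by hand whenever the source Excel changes.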

Big Picture

World Bank data contains indicators for individual economies which quite often are countries, but not always. Indicators have been prepared for various aggregations of economies like European Union, geographical regions of the world and such.

Using those aggregations one can get a head start in the exploration. So we will check internet usage IT_NET_USER_ZS for 2016 using them.

# Aggregates do not have  Region in this merged data set
intusage2016 = dataset.loc[
    (dataset['Region'].isnull()) & 
    (dataset['2016'].notnull()) & 
    (dataset['Indicator Code'] == 'IT_NET_USER_ZS')].sort_values(by=['2016'])[['Country Name', '2016']]
# make a plot
fig=plt.figure(figsize=(8, 9), dpi= 80, facecolor='w', edgecolor='k')
y_pos = np.arange(intusage2016.shape[0])
plt.barh(y_pos, intusage2016['2016'], align='center', alpha=0.5)
plt.yticks(y_pos, intusage2016['Country Name'])
plt.ylabel('')
plt.title('Internet Usage by 100 persons')
plt.show()
# and the numbers
print(intusage2016 )
[Figure: internet_usage_2018_aggregates]
                                          Country Name       2016
LIC                                         Low income  12.439754
HPC             Heavily indebted poor countries (HIPC)  15.568841
LDC       Least developed countries: UN classification  15.612042
IDX                                           IDA only  17.078794
PRE                           Pre-demographic dividend  17.197630
FCS           Fragile and conflict affected situations  17.311083
IDA                                          IDA total  19.062529
SSA         Sub-Saharan Africa (excluding high income)  19.919026
TSS          Sub-Saharan Africa (IDA & IBRD countries)  19.922363
SSF                                 Sub-Saharan Africa  19.922363
IDB                                          IDA blend  23.063195
SAS                                         South Asia  26.477736
TSA                            South Asia (IDA & IBRD)  26.477736
LMC                                Lower middle income  29.975606
PSS                        Pacific island small states  30.607213
EAR                         Early-demographic dividend  33.595687
LMY                                Low & middle income  38.883044
IBT                                   IDA & IBRD total  39.176190
MIC                                      Middle income  41.868573
ARB                                         Arab World  42.525040
TMN  Middle East & North Africa (IDA & IBRD countries)  42.870442
MNA  Middle East & North Africa (excluding high inc...  43.104981
OSS                                 Other small states  45.729935
WLD                                              World  45.784503
IBD                                          IBRD only  45.876003
SST                                       Small states  46.314677
MEA                         Middle East & North Africa  48.177897
TEA         East Asia & Pacific (IDA & IBRD countries)  48.416128
EAP        East Asia & Pacific (excluding high income)  48.416128
EAS                                East Asia & Pacific  52.932045
CSS                             Caribbean small states  53.809989
UMC                                Upper middle income  55.612175
LAC  Latin America & Caribbean (excluding high income)  56.300038
LTE                          Late-demographic dividend  56.492530
LCN                          Latin America & Caribbean  56.841310
TLA  Latin America & the Caribbean (IDA & IBRD coun...  57.001399
ECA      Europe & Central Asia (excluding high income)  62.454986
TEC       Europe & Central Asia (IDA & IBRD countries)  63.374588
CEB                     Central Europe and the Baltics  71.350411
ECS                              Europe & Central Asia  73.354524
NAC                                      North America  77.563459
OED                                       OECD members  78.731132
EMU                                          Euro area  80.579971
EUU                                     European Union  80.942428
PST                          Post-demographic dividend  81.127797
HIC                                        High income  82.181848
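
The filter used above relies on the left join leaving Region empty for every aggregate, because aggregates have no region in the classification file. A toy sketch of that mechanism, with made-up miniature tables rather than the real World Bank files:

```python
import pandas as pd

# Made-up miniature versions of the two tables (values are placeholders)
wdi = pd.DataFrame({
    'Country Code': ['FIN', 'WLD', 'EUU'],
    'Country Name': ['Finland', 'World', 'European Union'],
    '2016': [1.0, 2.0, 3.0],
})
classes = pd.DataFrame({
    'Code': ['FIN'],
    'Region': ['Europe & Central Asia'],
})

# Same join pattern as in the data reading step
merged = wdi.set_index('Country Code').join(classes.set_index('Code'), how='left')

aggregates = merged[merged['Region'].isnull()]    # rows without a region
countries = merged[merged['Region'].notnull()]    # rows with a region
print(aggregates[['Country Name', 'Region']])
```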

So it looks like, in general, the economically better-off countries have a higher percentage of individuals using the internet. A no-brainer and not a big surprise.

Interestingly, according to World Bank (or ITU where the data actually comes from) 46% of all people are using internet. Based on Worldometers in 2016 that was 46% out of 7,466,964,280, meaning 3,434,803,569 – and that is quite some crowd and very precisely measured ;).
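
The head count is plain arithmetic, and it is worth seeing how much the rounding to 46% matters. A small sketch using the population figure quoted above and the World row from the table; the variable names are mine:

```python
world_population_2016 = 7_466_964_280   # Worldometers figure quoted above
share_rounded = 46                       # percent, as rounded in the text
share_reported = 45.784503               # percent, the World (WLD) row in the table

users_rounded = round(world_population_2016 * share_rounded / 100)
users_reported = round(world_population_2016 * share_reported / 100)

print(f"{users_rounded:,}")                    # head count with the rounded share
print(f"{users_reported:,}")                   # head count with the reported share
print(f"{users_rounded - users_reported:,}")   # effect of the rounding
```

The rounding alone shifts the result by something on the order of 16 million people, so the last digits of the head count should not be taken too seriously.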

But that crowd is not evenly distributed between the countries around the world.

Individual Countries

Since the aggregate data suggests that the high income / low income aspect of countries is likely to play a role here, we take the indicator GDP per capita for further scrutiny:

# getting dataset where we have both internet usage and GDP per capita for all included countries
intusage2016 = dataset.loc[(dataset['Region'].notnull()) & 
                           (dataset['2016'].notnull()) & 
                           (dataset['Indicator Code'].isin(['IT_NET_USER_ZS','NY_GDP_PCAP_CD']))]
intusage2016 = intusage2016.pivot(columns='Indicator Code', values='2016').dropna()
intusage2016.describe()
Indicator Code  IT_NET_USER_ZS  NY_GDP_PCAP_CD
count               185.000000      185.000000
mean                 50.489357    13231.905621
std                  28.043551    18110.816202
min                   1.880000      285.727442
25%                  25.246250     1714.680184
50%                  53.200000     4989.427763
75%                  75.498504    15891.626549
max                  98.240016   100738.684223

So we have this data for 185 economies or countries. Comparing the mean with the median (the 50% value), internet usage is fairly symmetric (mean 50.5 vs median 53.2), whereas GDP per capita is highly skewed (mean 13,232 vs median 4,989).
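
One quick way to put a number on that comparison is the ratio of mean to median, which is close to 1 for a symmetric distribution and well above 1 for a right-skewed one. A minimal sketch with the values copied from the describe() output above:

```python
# Values copied from the describe() output above
stats = {
    'IT_NET_USER_ZS': {'mean': 50.489357, 'median': 53.200000},
    'NY_GDP_PCAP_CD': {'mean': 13231.905621, 'median': 4989.427763},
}

ratios = {name: s['mean'] / s['median'] for name, s in stats.items()}
for name, ratio in ratios.items():
    print(f"{name}: mean/median = {ratio:.2f}")
```

This is only a rough rule of thumb and not a proper skewness statistic.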

This is visible with more nuance when we convert the above variables to the same scale and show the distributions as histograms:

intusage2016s = intusage2016.copy()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
intusage2016s[['IT_NET_USER_ZS', 'NY_GDP_PCAP_CD']] = scaler.fit_transform(intusage2016s[['IT_NET_USER_ZS', 'NY_GDP_PCAP_CD']])
ax = intusage2016s.plot.hist(bins=12, alpha=0.5)
[Figure: internet_usage_2018_density]
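
For reference, min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A hand-rolled sketch of the same transformation; the helper min_max_scale is my own illustration and not part of scikit-learn:

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


# Minimum, median and maximum of GDP per capita from the describe() output above
gdp = [285.727442, 4989.427763, 100738.684223]
scaled = min_max_scale(gdp)
print(scaled)
```

The median economy lands below 5% of the full GDP per capita range, which is the same skew that the histogram shows.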

Internet usage is fairly evenly distributed across the world. However, the distribution is not uniform but bi-modal with peaks at around 20% and 80% of population using internet – and with a clear dip at around 40% of population.

The story of GDP per capita is quite different: a clear majority of economies / countries fall in the lowest 20% of the full range.

When thinking about the distributions, one must bear in mind that the entities compared, i.e. economies / countries, are vastly different in size and population, ranging from places like Monaco and Luxembourg to India and China, and then to Malawi, Senegal and others. Each of them carries equal weight in the distributions above.
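
To illustrate how much the equal weighting can matter, here is a small sketch comparing a plain average with a population-weighted one. The three "countries" and all their numbers are made up for the illustration and are not taken from the World Bank data.

```python
import numpy as np

# Made-up illustration values, NOT taken from the World Bank data
usage_pct = np.array([90.0, 30.0, 20.0])     # internet users, % of population
population = np.array([1e6, 1e9, 2e7])       # inhabitants

unweighted = usage_pct.mean()                          # every country counts the same
weighted = np.average(usage_pct, weights=population)   # every person counts the same

print(f"unweighted: {unweighted:.1f}%")
print(f"weighted:   {weighted:.1f}%")
```

The weighted figure is dominated by the most populous entry, which is why world-level aggregates can look quite different from a typical country.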

Further details are available through a world map on internet usage. Unfortunately, WordPress does not allow interactive graphics. For interactions one must follow the link in the static graph which opens a new tab/window on another site.

# creating an interactive choropleth map with plotly
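# note: with plotly >= 4 this module lives in the separate chart-studio package
# (import chart_studio.plotly as py)
import plotly.plotly as py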
df = intusage2016.join(economies.set_index('Code'), how='left')
data = [ dict(
        type = 'choropleth',
        locations = df.index,
        z = df['IT_NET_USER_ZS'],
        text = df['Economy'],
        autocolorscale = True,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            title = '% of popul.'),
      ) ]

layout = dict(
    title = 'Individuals using the Internet 2016 (% of population)<br>Source: World Bank & ITU',
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(
            type = 'mercator'
        )
    )
)
fig = dict( data=data, layout=layout )
py.iplot(fig, filename='d3-cloropleth-map' )
[Figure: 20180529-intuseworldmap-2016]

Relationships

A plot of internet usage vs. GDP per capita should show some trend as suggested by aggregate information and the world map above.

# one figure is enough; a separate plt.figure() call would leave an empty extra figure
fig = plt.figure(figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(intusage2016['NY_GDP_PCAP_CD'], intusage2016['IT_NET_USER_ZS'], c = 'g')
plt.xlabel('GDP per capita')
plt.ylabel('Internet usage per 100')
plt.title("Internet Usage vs Income")
plt.show()
[Figure: internet_usage_2018_gdp_pcap]

The plot is strongly skewed. A log transformation could help to turn it into a linear(ish) relationship.

# one figure is enough; a separate plt.figure() call would leave an empty extra figure
fig = plt.figure(figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(np.log(intusage2016['NY_GDP_PCAP_CD']), intusage2016['IT_NET_USER_ZS'], c = 'g')
plt.xlabel('log GDP per capita')
plt.ylabel('Internet usage per 100')
plt.title("Internet Usage vs Income")
plt.show()
[Figure: internet_usage_2018_gdp_logpcap]

Indeed, the dots are better organised along a diagonal now. But the high variance between individual countries in the same mid-range of the log GDP per capita axis remains.

For countries at the very same log GDP per capita, internet usage can vary from 15% to 80%, so the prospect of getting a really useful model with these elements looks grim. More indicators would be needed to squeeze down the variation and learn more about the factors associated with the internet usage rate.

But that will have to wait for another blog entry as it opens a new can of worms and would explode the length of this entry far too much. We will soldier on with these two indicators now and see what we can learn.

Making a Fit

Next we will turn that scatter plot into an equation with a simple linear model using the statsmodels package.

intusage2016['log_NY_GDP_PCAP_CD'] = np.log( intusage2016['NY_GDP_PCAP_CD'])
lm = smf.ols(formula = 'IT_NET_USER_ZS ~ log_NY_GDP_PCAP_CD', data = intusage2016).fit()
print(lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         IT_NET_USER_ZS   R-squared:                       0.833
Model:                            OLS   Adj. R-squared:                  0.832
Method:                 Least Squares   F-statistic:                     914.3
Date:                Mon, 28 May 2018   Prob (F-statistic):           4.30e-73
Time:                        21:17:55   Log-Likelihood:                -713.07
No. Observations:                 185   AIC:                             1430.
Df Residuals:                     183   BIC:                             1437.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept           -100.5839      5.067    -19.851      0.000    -110.581     -90.587
log_NY_GDP_PCAP_CD    17.6169      0.583     30.238      0.000      16.467      18.766
==============================================================================
Omnibus:                        6.263   Durbin-Watson:                   2.078
Prob(Omnibus):                  0.044   Jarque-Bera (JB):                9.530
Skew:                          -0.104   Prob(JB):                      0.00852
Kurtosis:                       4.092   Cond. No.                         52.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The results look fairly good: the p-value is essentially zero, and R-squared and adjusted R-squared are both over 0.8.
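
To make the coefficients a bit more tangible, here is a minimal sketch that plugs the fitted values from the summary above into the model equation, usage = intercept + slope * log(GDP per capita). The helper predicted_usage is made up for this illustration; the natural log is used, as with np.log in the fit.

```python
import numpy as np

intercept = -100.5839   # Intercept from the summary above
slope = 17.6169         # coefficient of log_NY_GDP_PCAP_CD from the summary above


def predicted_usage(gdp_per_capita):
    """Internet usage (% of population) predicted by the fitted line."""
    return intercept + slope * np.log(gdp_per_capita)


# Change in predicted usage when GDP per capita doubles
per_doubling = slope * np.log(2)
# Prediction at the median GDP per capita from the describe() output
at_median_gdp = predicted_usage(4989.427763)

print(f"per doubling of GDP per capita: {per_doubling:.1f} percentage points")
print(f"at median GDP per capita: {at_median_gdp:.1f}%")
print(f"at minimum GDP per capita: {predicted_usage(285.727442):.1f}%")
print(f"at maximum GDP per capita: {predicted_usage(100738.684223):.1f}%")
```

At the extremes of the GDP range the straight line predicts values below 0% and above 100%, which is one more reminder that this simple model is only a rough description.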

Diagnostic graphics help to get a better grip on the limitations of the model.

res = lm.resid
fig = sm.qqplot(res, line='s', fit=True)
plt.show()
[Figure: internet_usage_2016_fit_qq]

The observed distribution is almost normal, apart from a few countries at the ends of the distribution range. The same is visible in a histogram a bit further down.

Next we plot the data overlaid with the linear model, along with the 95% confidence range for the predictions created for each country.

fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(lm, "log_NY_GDP_PCAP_CD", ax=ax)
plt.show()
[Figure: internet_usage_2016_fit]

Apart from the three countries in the upper and five in the lower part of the graph, all observations fall within the prediction confidence intervals. Then again, that interval is quite large.

Normality of the residuals can be explored in the following histogram:

num_bins = 20
# density=True normalises the bars so that their total area is 1
n, bins, patches = plt.hist(res, num_bins, density=True, facecolor='blue', alpha=0.5)
plt.xlabel('Residual')
plt.ylabel('Probability density')
plt.title('Histogram of residuals')
plt.subplots_adjust(left=0.15)
plt.show()
[Figure: internet_usage_2016_res_hist]

It is not perfect but relatively nice anyways.

Let’s have a closer look at the shape of the variance and highlight outlier countries where the residual is beyond two standard deviations.

res = np.asarray(lm.resid_pearson)   # plain array so that res[i] below is positional
eco = pd.DataFrame(lm.resid).join(economies.set_index('Code'), how='left')
pred = lm.predict()
plt.scatter(pred, abs(res))
anno_offset=1.025
for i, txt in enumerate(eco['Economy']):
    if abs(res[i]) >= 2:
        plt.annotate(txt, (anno_offset*pred[i],anno_offset*abs(res[i])))
plt.axhline(y=0.0,  linestyle='-')
plt.axhline(y=2.0,  linestyle=':')
plt.axhline(y=3.0,  linestyle=':', c='r')
plt.show()
[Figure: internet_usage_2016_res_dist_abs]

As already noted, in the mid area of log GDP per capita there is significantly higher variation than at either end. The log transformation has pushed the high variance among the lowest-income countries into the very center of the graph.

res = np.asarray(lm.resid_pearson)   # plain array so that res[i] below is positional
eco = pd.DataFrame(lm.resid).join(economies.set_index('Code'), how='left')
pred = lm.predict()
plt.scatter(pred, res)
anno_offset=1.025
for i, txt in enumerate(eco['Economy']):
    if abs(res[i]) >= 2:
        plt.annotate(txt, (anno_offset*pred[i],anno_offset*res[i]))
plt.axhline(y=0.0,  linestyle='-')
plt.axhline(y=2.0,  linestyle=':')
plt.axhline(y=-2.0,  linestyle=':')
plt.axhline(y=3.0,  linestyle=':', c='r')
plt.axhline(y=-3.0,  linestyle=':', c='r')
plt.show()
[Figure: internet_usage_2016_res_dist]

So there could be something special about the countries

  • Iraq
  • Papua New Guinea
  • Angola
  • Turkmenistan
  • Equatorial Guinea

which makes them have a significantly lower share of internet users than expected by our simple model. And correspondingly, there could be something special which makes internet usage unexpectedly high for the following countries

  • Moldova
  • Azerbaijan
  • Armenia

But this may simply be a result of missing indicators in the model that would be needed to properly describe the state of a country in this respect. As noted earlier, I plan to come back to this in a future post.
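
The outlier rule used above, i.e. a residual beyond two standard deviations, can be sketched without statsmodels. For a plain OLS fit the Pearson residual is the raw residual divided by the estimated residual standard deviation. The data below is synthetic and only stands in for the real indicators; the planted outlier is deliberate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(6, 11, size=n)                  # synthetic stand-in for log GDP per capita
y = -100 + 17 * x + rng.normal(0, 10, size=n)   # synthetic stand-in for internet usage
y[0] = -100 + 17 * x[0] + 80                    # plant one obvious outlier

slope, intercept = np.polyfit(x, y, 1)          # least-squares straight line
resid = y - (intercept + slope * x)

# Residual standard error: two parameters were fitted, hence n - 2
sigma = np.sqrt((resid ** 2).sum() / (n - 2))
pearson = resid / sigma

outliers = np.flatnonzero(np.abs(pearson) >= 2)
print(outliers)
```

With roughly normal residuals about one observation in twenty is expected to cross the two standard deviation line by chance alone, so a handful of flagged countries is not alarming as such.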

Residuals on map

Out of curiosity, let’s have a look at how the expected (as per our simple model) and reported internet use differ country by country on a map through standardised residuals. And again, to get to the interactive map you have to follow the link in the static graph.

data = [ dict(
        type = 'choropleth',
        locations = lm.resid.index,
        z = res,
        text = lm.resid.index,
        autocolorscale = True,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            title = 'Residual'),
      ) ]
layout = dict(
    title = 'Internet Usage vs GDP per capita deviation from linear model - Pearson residuals',
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(
            type = 'mercator'
        )
    )
)
fig = dict( data=data, layout=layout )
py.iplot(fig, filename='resid-cloropleth-map' )
[Figure: 20180529-intuseworldmappears-2016]

The outlier countries identified earlier can fairly easily be found on the map by their deeper colours.

What about history

The stories of countries are partially visible in the graph tracking the path of how internet usage and GDP per capita have evolved over time.

This creates the mother of all spaghetti diagrams. Fortunately, the plotly library used allows zooming and panning into the maze for exploring different histories. But once again, only by following the link in the static picture below.

# morphing the data into a more usable format
dataset_b = dataset[dataset['Indicator Code'].isin(['IT_NET_USER_ZS','NY_GDP_PCAP_CD']) &
                    dataset['Region'].notnull()].drop(['Indicator Name', 
                           'Economy', 
                           'Region', 
                           'Income group', 
                           'Lending category',
                           'Other'], axis=1)        
dataset_c =  (dataset_b.set_index(['Country Name', 'Indicator Code'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Indicator Code')
   .dropna()
   .sort_values('Year')
   .reset_index())
dataset_c['UseByLogGdpCap'] = dataset_c['IT_NET_USER_ZS'] /np.log(dataset_c['NY_GDP_PCAP_CD'] )
dataset_c['log_NY_GDP_PCAP_CD'] = np.log(dataset_c['NY_GDP_PCAP_CD'] )
dataset_c['Year'] = dataset_c['Year'].astype('int')
dataset_c1 = dataset_c.sort_values('Year').copy()
dataset_c1 = dataset_c1[dataset_c1['Year'] >= 1990]
dataset_c1.columns = dataset_c1.columns.str.replace(' ', '')
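
The stack / unstack chain above is easier to follow on a tiny table. Here is a toy sketch with made-up rows in the same wide layout, one row per country and indicator and one column per year; the name tidy is mine.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same wide layout as the World Bank data
wide = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Indicator Code': ['IT_NET_USER_ZS', 'NY_GDP_PCAP_CD',
                       'IT_NET_USER_ZS', 'NY_GDP_PCAP_CD'],
    '2015': [10.0, 1000.0, 50.0, 20000.0],
    '2016': [12.0, 1100.0, np.nan, 21000.0],
})

tidy = (wide.set_index(['Country Name', 'Indicator Code'])
        .rename_axis('Year', axis=1)   # name the column axis so that stack() produces a 'Year' level
        .stack()                       # years move from columns into the index
        .unstack('Indicator Code')     # indicators move from the index into columns
        .dropna()                      # keep only country-years where both indicators exist
        .reset_index())
print(tidy)
```

The result has one row per country and year with both indicators side by side, and the country-year with a missing value is dropped.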

And then creating the graph:

traces = []
for cls in dataset_c1['CountryName'].unique():
    traces.append({
        'type' : 'scatter',
        'mode' : 'lines+markers',
        'x' : dataset_c1.log_NY_GDP_PCAP_CD[dataset_c1['CountryName'] == cls],
        'y' : dataset_c1.IT_NET_USER_ZS[dataset_c1['CountryName'] == cls],
        'name' : cls,
        'text' : dataset_c1.Year[dataset_c1['CountryName'] == cls]
    })    
fig = {
    'data' : traces,
    'layout' : {
        'title' : 'Internet Usage by log GDP per capita by country starting 1990',
        'xaxis' : {
            'title' : 'log GDP per capita',
        },
        'yaxis' : {
            'title' : 'Internet users '
        }
    }
}
py.iplot(fig, filename='UseByLogGdpCap-hist' )
[Figure: 20180529-intusehistspaghetti]

The spaghetti plot shows a general trend of increasing GDP per capita and increasing usage of internet over the years.

But there are some interesting exceptions and stories here and there, like Azerbaijan in the center of the graph (one of the outliers identified earlier), where GDP per capita has strongly decreased over the last couple of years (the graph has log of GDP per capita on the x-axis) while internet usage has still increased.

Evolution of relationships

Looking into the overall relationship again, stepping between five-year snapshots.

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
year = 1991
for i in range(2):
    for j in range(3):
        dataset_c2 = dataset_c1[dataset_c1['Year'] == year]
        #ax[i, j].scatter(x=myx, y=myy)
        ax[i, j].text(0.05, 0.85, str(year), fontsize=18, ha='left', transform=ax[i,j].transAxes)
        ax[i, j].scatter(x=dataset_c2['log_NY_GDP_PCAP_CD'], y=dataset_c2['IT_NET_USER_ZS'])
        year = year + 5
[Figure: internet_usage_scatter_plots]

The association between internet usage and log GDP per capita has evolved over the years, and only recently has it reached a linear(ish) relationship.
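
One way to put numbers on that evolution would be the Pearson correlation for each snapshot year. A minimal sketch on made-up rows that mimic the column layout of dataset_c1; the values are invented for the illustration only.

```python
import pandas as pd

# Made-up rows in the same layout as dataset_c1
toy = pd.DataFrame({
    'Year': [1991, 1991, 1991, 2016, 2016, 2016],
    'log_NY_GDP_PCAP_CD': [6.0, 8.0, 10.0, 6.0, 8.0, 10.0],
    'IT_NET_USER_ZS': [0.0, 0.0, 1.0, 20.0, 50.0, 80.0],
})

# Pearson correlation between the two indicators, separately for each year
corr_by_year = pd.Series({
    year: group['log_NY_GDP_PCAP_CD'].corr(group['IT_NET_USER_ZS'])
    for year, group in toy.groupby('Year')
})
print(corr_by_year)
```

On the real data the same loop could be run over dataset_c1, bearing in mind that a correlation coefficient only measures how linear a relationship is and not how steep it is.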

But always – I guess understandably – there has been a positive correlation between GDP per capita and internet usage. Wealth makes it possible to have more expensive toys earlier.

Summary

We have found that there is an association between internet usage and GDP per capita which in 2016 seems to be near linear (okay, against log of GDP per capita, but anyways).

The relationship has evolved to that over the years of overall adoption of the internet into world economies and ways of life since its de facto birth in the early 1990s.

The basic truth still is, with a few interesting exceptions, that a strong positive correlation between the two exists.

Finding out more about the exceptions and getting more understanding of the nuances requires more indicators and additional techniques, which are beyond the scope of this post but topics of future ones.

Postscript

Yes, I know that there are plenty of studies about this topic around, well written and easily accessible, like InternetWorldStats, Wikipedia: Global Internet Usage, Pew: Smartphone Ownership and Internet Usage, Pew 2016 and others.

And in case you find problems / mistakes / issues in the post, please let me know.
Thank you for reading to the end.

 
