Google Trends is a fantastic tool that lets you explore the popularity of different words or word associations within Google search. Data science is booming, and so is the popularity of the keyword “data scientist”. In this post I guide you through a time series analysis in Python.

Note that this post is longer and more technical than the previous ones. So now might be a good time to grab a drink.

# Getting the data

To begin with, we need to scrape the popularity of “data scientist” as a function of time from Google Trends. To do so we use the nice Python package pytrends, available on GitHub.

```python
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US')
kw_list = ["data scientist"]
pytrends.build_payload(kw_list, timeframe='2004-01-01 2018-03-31')
df = pytrends.interest_over_time()
```

As you can see, we initialise pytrends to use US English ('en-US'). Then we set a list of keywords of interest as kw_list (this list can contain up to 5 different keywords) and pass it to the build_payload method. We also need to specify a timeframe for which we want the data. Data is available from January 1st 2004 to the present.

Sadly, and this is the first shortcoming of Google Trends, you cannot specify the binning of the data. As a rule of thumb, the data will be binned by day if you look at a time scale of a few months, by week if you look at several months to a few years, and by month if you look at several years. So in our case the data will be binned by month.
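That rule of thumb can be sketched as a small helper; note the exact thresholds below are my own guesses, not documented behaviour:

```python
from datetime import date

def expected_resolution(start, end):
    # Rule-of-thumb only: the actual granularity is decided by Google
    # and is neither documented nor configurable
    span_days = (end - start).days
    if span_days <= 90:          # a few months or less -> daily bins
        return 'daily'
    if span_days <= 5 * 365:     # several months to a few years -> weekly bins
        return 'weekly'
    return 'monthly'             # several years -> monthly bins

# Our 2004-2018 query spans about 14 years, so we expect monthly bins
print(expected_resolution(date(2004, 1, 1), date(2018, 3, 31)))  # monthly
```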

The interest_over_time method returns a pandas DataFrame containing the desired time series, which we can plot using matplotlib:

```python
from matplotlib import pyplot as plt

df = df.reset_index()
df = df.rename(columns={kw_list[0]: 'data'})
plt.plot(df.date, df.data, label=kw_list[0])
plt.xlabel('Year')
plt.ylabel('Interest')
plt.legend()
plt.savefig('data.png')
```

Worldwide interest over time for the keyword “data scientist” in Google search.

The value that Google Trends returns as a measure of the popularity of the keyword is called the interest. Here we reach the second major problem of Google Trends. The interest is a normalised quantity that has a value of 100 at the time when people used the keyword the most on Google search over the timeframe considered. As Google does not provide the normalisation, we have no indication of the absolute number of Google searches the data represents (they are not crazy enough to give it away, as they make money out of it). The bright side is that it is always good to normalise your data, and Google already did it for us. We just have to hope they did it well.
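To make the normalisation concrete, here is a toy example with hypothetical raw search volumes (Google never exposes these): the busiest point maps to 100 and everything else is scaled accordingly.

```python
# Hypothetical raw search volumes -- Google does not expose these
raw = [12, 30, 48, 60]

# Google-style normalisation: the busiest point maps to 100
interest = [round(100 * v / max(raw)) for v in raw]
print(interest)  # [20, 50, 80, 100]
```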

# Interest by country

Another cool feature of Google Trends is that you can look at the interest by region. We get the interest by country using the pytrends package:

```python
countries = pytrends.interest_by_region(resolution='COUNTRY')
```

Then, I use the code of my previous post to plot the data on a map.

We can see that India has the most interest in “data scientist”, followed by the US. In Europe there is a mild interest, with Ireland at the top, but it is still confined to a few countries.

# Simple modelling

Alright, let's go back to the time series. The curve looks nice! We can already see that the term “data scientist” started to boom in late 2010 to early 2011. Before that, we only see small fluctuations around a more or less constant average. Interestingly, since 2015 the increase in interest seems to be roughly linear. Let's fit a constant ($f(x)=k$) to the period before 2010 and a straight line ($f(x)=ax+b$) after 2015 to emphasise this point. We also fit an exponential model of the form $f(x)=Ne^{\lambda x}+c$ to show that such a model is a poor fit to the data. We use the scipy.optimize package to fit the curves:

```python
import numpy as np
from scipy.optimize import curve_fit

def constant(x, k):
    return k

def straight_line(x, a, b):
    return a * x + b

def exponential(x, N, lamb, c):
    return N * np.e ** (lamb * x) + c

# Create a new column "days" that will be used as x-axis for the fit
days = (df.date - df.date.min())
df['days'] = days.dt.total_seconds() / 60 / 60 / 24

# Normalise X and Y axes
x = df.days / df.days.max()
y = df.data / df.data.max()

# Fit the constant
x_k = x[x < 0.4]
y_k = y[x < 0.4]
fit_k = curve_fit(constant, x_k, y_k)
k = fit_k[0][0]

# Fit the straight line
x_sl = x[x > 0.75]
y_sl = y[x > 0.75]
fit_sl = curve_fit(straight_line, x_sl, y_sl)
a = fit_sl[0][0]
b = fit_sl[0][1]

# Fit the exponential model
fit_exp = curve_fit(exponential, x, y)
N_exp = fit_exp[0][0]
lamb_exp = fit_exp[0][1]
c_exp = fit_exp[0][2]

# Plot
plt.plot(df.date, df.data, label=kw_list[0])
plt.plot(df.date, exponential(x, N_exp, lamb_exp, c_exp) * df.data.max(),
         label=r'$Ne^{\lambda x}+c$', ls=":", color='k', lw=1)
plt.plot(df.date, straight_line(x, a, b) * df.data.max(), label=r'$ax+b$', ls="-.")
plt.plot(df.date, np.ones(len(df)) * k * df.data.max(),
         label=r'$k=$' + str(round(k * df.data.max(), 1)), ls="-.")
plt.legend()
plt.xlabel('Year')
plt.ylabel('Interest')
plt.ylim([-5, 100])
plt.savefig('data_scientist_straight_lines.png')
```

Fitting a constant line before 2010 and a straight line after 2015.

The constant fit gives $k=1.7$. This tells us that the interest in the keyword “data scientist” pre-2010 was less than 2% of the current one. Perhaps more interesting is the slope of the straight line (the parameter $a$ in the equation $f(x)=ax+b$), which is 0.05% per day, or 18.8% per year. This means that since 2015 there has been, on average, an increase in searches for the keyword “data scientist” of about 20% of the current value per year. Assuming this model holds in the future (here I am extrapolating, and you should keep in mind that extrapolating often has more to do with crystal balls than science), the number of searches for “data scientist” will have doubled in about 5 years.
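A quick back-of-the-envelope check of that doubling claim, using the fitted yearly slope quoted above:

```python
# Fitted slope of the straight-line model, in percent of the
# current interest per year (value quoted above)
slope_per_year = 18.8

# Linear growth adds a fixed increment each year, so doubling the
# current (100%) level takes 100 / slope_per_year years
years_to_double = 100 / slope_per_year
print(round(years_to_double, 1))  # 5.3
```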

So we seem to have decent models for the early and late parts of the curve, but we are missing one for the intermediate part, i.e. from 2010 to 2015. Interestingly, the curve looks simple enough that we should be able to find a full analytical model (i.e. a mathematical equation describing the full trend). What we want is a model that transitions smoothly from a constant pre-2010 to a straight line post-2015. We can obtain such a model using a Sigmoid function, which transitions from 0 to 1 in an “S” shape. The “0” part of the Sigmoid will correspond to the constant and the “1” part to the straight line. Thus we can write the following model:

Model: $f(x)=\frac{ax+b}{1+e^{-\lambda (x-x_0)}}+c$

When $x\ll x_0$ we have $e^{-\lambda (x-x_0)}\gg 1$ and thus $f(x)\approx c$. Conversely, when $x\gg x_0$ we have $e^{-\lambda (x-x_0)}\approx 0$ and thus $f(x)\approx ax+b+c$. Here is the code to fit and plot the model:

```python
def sigmoid(x, lamb, x0):
    return 1 / (1 + np.exp(-lamb * (x - x0)))

def model(x, a, b, c, lamb, x0):
    s = sigmoid(x, lamb, x0)
    return (a * x + b) * s + c

# Fit
fit = curve_fit(model, x, y)
a, b, c, lamb, x0 = fit[0]

# Plot
df['model'] = model(x, a, b, c, lamb, x0) * df.data.max()
df['residuals'] = df.data - df.model
fig, (ax1, ax2) = plt.subplots(2, sharex=True,
                               gridspec_kw={'height_ratios': [3, 1]})
ax1.plot(df.date, df.data, label=kw_list[0])
ax1.plot(df.date, df.model, label='model', ls="-.")
ax1.set_ylabel('Interest')
ax1.legend()
ax2.plot(df.date, df.residuals)
ax2.set_ylabel('Residuals')
ax2.spines['top'].set_visible(True)
ax2.tick_params(which='both', bottom=True, top=True)
plt.xlabel('Year')
fig.subplots_adjust(hspace=0)
```

Model going from a constant to a straight line thanks to a Sigmoid.
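As a quick sanity check of the two limits derived above, we can evaluate the model far on either side of $x_0$, using illustrative parameter values rather than the fitted ones:

```python
import numpy as np

def model(x, a, b, c, lamb, x0):
    # Sigmoid-gated straight line: ~c for x << x0, ~a*x + b + c for x >> x0
    return (a * x + b) / (1 + np.exp(-lamb * (x - x0))) + c

# Illustrative parameters (not the fitted values)
a, b, c, lamb, x0 = 1.0, 0.5, 0.02, 30.0, 0.6
print(model(0.0, a, b, c, lamb, x0))  # ~0.02, i.e. the constant c
print(model(1.0, a, b, c, lamb, x0))  # ~1.52, i.e. a*1 + b + c
```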

Alright! We now have a model over the full timeframe that seems to be doing a good job. On this plot I have added the residuals (residuals = data − model) to get a sense of how well the fit performs. Broadly speaking, what you want to see in a residual plot is an equal amount of points above and below 0. At first sight, this seems to be the case here. At this point I would usually assume that the residuals are Gaussian distributed to make a quantitative estimate of the goodness of fit. In this case, however, the residuals are too far from Gaussian to make this kind of assumption. This can be an indication that there is some structure in the data not picked up by the model. This is actually likely, as the model is intended to explain the long-term trend, definitely not short-term structures. So let's identify those small structures.

# Looking for anomalies

To look for small structures, which we will call “anomalies”, i.e. significant deviations from the model, we compute the Rolling Standard Deviation (RSTD) of the residuals, defined as:

$\sigma_i^2=\frac{1}{w-1}\sum_{j=i-w/2}^{i+w/2} (d_j - m_j)^2$

where $d_j$ and $m_j$ are the values of the time series and of the model at point $j$, and the sum runs over a window of $w$ points centred on point $i$. If things were Gaussian, we could use the RSTD to estimate the probability of a given residual being due to chance. As this is not the case, we will simply flag as anomalies the points whose residuals are more than 1.5 RSTD away from the model, without turning this into a probability.
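A tiny illustration of the centred rolling window, with made-up residuals (note that pandas' rolling std subtracts the window mean, a reasonable proxy here since the residuals should average near zero):

```python
import pandas as pd

# Made-up residuals; with window=3 and center=True, the first and
# last points have no complete window and come out as NaN
res = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rstd = res.rolling(window=3, center=True).std()
print(rstd.tolist())  # [nan, 1.0, 1.0, 1.0, nan]
```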

The code to compute and plot the RSTD and the anomalies is the following:

```python
window = 30

# Rolling standard deviation
df['rolling_std'] = df.residuals.rolling(window=window, center=True).std()

# Outliers
outliers = df[df.residuals > 1.5 * df.rolling_std]

# Plot
fig, (ax1, ax2) = plt.subplots(2, sharex=True,
                               gridspec_kw={'height_ratios': [3, 1]})
ax1.plot(df.date, df.data, label=kw_list[0])
ax1.plot(df.date, df.model, label='model', ls="-.")
ax1.fill_between([d.date() for d in df.date],
                 df.model - 1.5 * df.rolling_std,
                 df.model + 1.5 * df.rolling_std,
                 color='orange', alpha=0.4, label='RSTD')
ax1.plot(outliers.date, outliers.residuals + outliers.model, 'x',
         color='r', label="Outliers")
ax2.plot(df.date, df.residuals)
ax2.fill_between([d.date() for d in df.date],
                 -1.5 * df.rolling_std,
                 +1.5 * df.rolling_std,
                 color='orange', alpha=0.4)
ax2.plot(outliers.date, outliers.residuals, 'x', color='r')
ax1.legend()
plt.xlabel('Year')
ax1.set_ylabel('Interest')
ax2.set_ylabel('Residuals')
ax2.spines['top'].set_visible(True)
ax2.tick_params(which='both', bottom=True, top=True)
fig.subplots_adjust(hspace=0)
plt.savefig('data_scientist_outliers.png')
```

Anomalies detected as being more than 1.5 RSTD away from the model.

Here we have our anomalies! As you can see in the table below, 12 data points are more than 1.5 RSTD away from the model. Interestingly, there is a clear seasonal effect, with 10 out of the 12 points falling in the August to October timeframe. This corresponds to the beginning of the school year. Are students looking into data scientist careers?

Table of anomalies

| date | data | model | residuals | rolling_std |
|------------|------|-----------|-----------|-------------|
| 2007-09-01 | 3 | 1.278864 | 1.721136 | 0.769550 |
| 2008-09-01 | 3 | 1.147793 | 1.852207 | 0.886981 |
| 2009-09-01 | 5 | 1.370772 | 3.629228 | 0.934489 |
| 2009-10-01 | 3 | 1.416695 | 1.583305 | 0.970792 |
| 2010-09-01 | 5 | 2.392310 | 2.607690 | 1.458642 |
| 2011-09-01 | 8 | 4.942030 | 3.057970 | 1.527250 |
| 2012-09-01 | 14 | 10.038149 | 3.961851 | 2.068268 |
| 2013-04-01 | 22 | 14.573969 | 7.426031 | 2.117917 |
| 2014-08-01 | 39 | 30.220584 | 8.779416 | 3.467941 |
| 2014-09-01 | 40 | 31.464383 | 8.535617 | 3.561823 |
| 2015-10-01 | 59 | 49.506820 | 9.493180 | 5.254111 |
| 2016-05-01 | 74 | 60.531277 | 13.468723 | 5.084801 |
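The 10-out-of-12 seasonal count can be double-checked directly from the anomaly dates in the table:

```python
import pandas as pd

# Anomaly dates copied from the table above
dates = pd.to_datetime([
    '2007-09-01', '2008-09-01', '2009-09-01', '2009-10-01',
    '2010-09-01', '2011-09-01', '2012-09-01', '2013-04-01',
    '2014-08-01', '2014-09-01', '2015-10-01', '2016-05-01'])

# How many anomalies fall in the August-to-October window?
seasonal = dates.month.isin([8, 9, 10]).sum()
print(seasonal)  # 10
```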

# Mysterious spikes

Two anomalies do not seem to be seasonal: April 2013 and May 2016. Let's have a deeper look at these two. To better understand them we need a finer sampling of the time series. We can obtain a daily sampling by querying Google Trends for a few weeks of data instead of several years. Here is what the anomalies look like with a daily sampling:

```python
pytrends.build_payload(kw_list, timeframe='2013-04-01 2013-04-30')
df = pytrends.interest_over_time()
df = df.reset_index()
df = df.rename(columns={kw_list[0]: 'data'})
plt.figure()
plt.plot(df.date, df.data, 'o', label=kw_list[0] + ' - April 2013', ls='-')
plt.xticks(rotation=-30)
plt.xlabel('Date')
plt.ylabel('Interest')
plt.legend()
plt.savefig('data_scientist_2013-04.png')

pytrends.build_payload(kw_list, timeframe='2016-05-01 2016-05-31')
df = pytrends.interest_over_time()
df = df.reset_index()
df = df.rename(columns={kw_list[0]: 'data'})
plt.figure()
plt.plot(df.date, df.data, 'o', label=kw_list[0] + ' - May 2016', ls='-')
plt.xticks(rotation=-30)
plt.xlabel('Date')
plt.ylabel('Interest')
plt.legend()
plt.savefig('data_scientist_2016-05.png')
```

Time series of the two anomalies with a daily sampling.

We clearly see that both anomalies happen on a timeframe of a day. The first occurs on April 4, 2013, while the second occurs on May 8, 2016. What causes these spikes? Let's have a look at the geographic map of interest on May 8, 2016:

We see that India completely dominates the interest by country. Looking at the news in India on that date, I found an article called “Be a ‘data scientist’, earn upto Rs 75 lakhs a year” published on india.com. So the spike we see is due to Indian users looking into what this promising job is about.

I could not find a news item that can be linked with certainty to the April 2013 peak of interest. (Bro fist if you find it and post it below.)

# You’ve reached the end

It is time to close this lengthy post. I hope I have given you a sense of what can be done with Google Trends and time series analyses. There is an insane amount of information in there, and it is always fun to see how the interest in different keywords behaves.