Predicting subscriber growth for my YouTube channel using Time Series methods in R

In this article I will show how to use basic time series forecasting methods to forecast subscriber growth on my YouTube channel, Data Science with Raghav. You can also subscribe! 🙂

CODE REPO

The complete code is in the GitHub repo below.

https://github.com/raaga500/YoutubeSubsGrowth

YouTube Video

Here is the YouTube video if you are more of a visual learner.

https://youtu.be/eK7YZjHA3CA

Getting the Data

To forecast the future you need data from the past. To get daily subscriber growth data, log in to YouTube Studio and go to the Analytics tab. Click See More, which opens a full-page popup window in the browser. In the first dropdown menu above the line chart (see the screenshot below), select the Subscribers option, then check the Total box in the list of videos in the lower section.

This gives you a line chart of subscriber growth for the last 28 days. I needed data for the last 365 days, so I selected the last 365 days option in the date filter dropdown. Once the line chart is displayed, click the Export current view button in the top right corner of the view and select the (.csv) option. This downloads a zip file containing three files. We are interested in the one named Totals.csv, which contains the daily subscriber additions or subtractions for the last 365 days. We also need the number of subscribers the channel had 365 days ago. For me that was easy: it was just 12. I will therefore use 12 as the starting value when generating a daily running sum of subscribers.

Below is a screenshot of the YouTube Studio analytics page, which you can use as a guide to the steps above.

Below is a screenshot of the downloaded Totals.csv file, which contains the daily subscriber change. Sadly, it was zero for many days in the beginning.

Now that we have the data, let's load it into R and get ready to forecast the subscriber growth.

Why use R?

Although my go-to programming language is Python, I am using R for this analysis because R has some of the best time series forecasting libraries, some of which have no direct equivalent in Python. Also, my professor at the university preferred R over Python for time series analysis. Having said that, I would also like to do the same analysis in Python, so later in the blog you will see a similar analysis done in Python. But first, let's fire up RStudio.

Setting up your tools

The first tools you need are the R language and RStudio. Install R first and then RStudio. You can search online for how to install both on your operating system.

Next, install the R libraries we will use in the analysis. You can install them with the commands below from the RStudio console.

install.packages("lubridate")
install.packages("zoo")
install.packages("ggplot2")
install.packages("forecast")
install.packages("ggrepel")
install.packages("ggthemes")


Data Preprocessing

The first step is to import the downloaded CSV file into an R data frame. Below is the command to do this.

dat = read.csv('Totals.csv')
head(dat)
tail(dat)

Below is the output of the head and tail commands, which show the first 6 and last 6 rows of the dataset respectively.

The Subscribers column (column B) in the downloaded file holds the daily change in the number of subscribers. To plot cumulative subscriber growth we need a new column that stores the cumulative sum of subscribers up to each day.

Below is the R code that creates this new column.

#Set the first observation to 12, my starting subscriber count
dat[1,2] <- 12
#Generate a cumulative sum and store it in a new column
dat['TotalSubscribers'] <- cumsum(dat['Subscribers'])
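
As a small illustration of the running-sum idea, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical, and this variant adds the starting count to the cumulative sum instead of overwriting the first day's change, so treat it as one possible way to do it rather than the code used for the channel.

```r
# Toy daily changes (hypothetical values, not the channel's real data)
toy <- data.frame(
  Date = c("2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07"),
  Subscribers = c(0, 2, -1, 3)
)

start_count <- 12  # subscribers before the first day in the export

# Running total = starting count + cumulative sum of the daily changes
toy$TotalSubscribers <- start_count + cumsum(toy$Subscribers)

print(toy)
```

With this variant the first day's own change is kept, which matters if the first day in the export is not zero.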

Here is the screenshot of the last 6 rows of the new column.


The next step is to convert the new column into a time series object. Before that, we need to convert the Date column, which is stored as character, to the Date type. Then we can convert the TotalSubscribers column to a time series object. As this is daily data, the frequency is set to 1. Below is the code to convert the data into a time series object using the zoo library.

#Change the format of date column to date
dat$Date <- as.Date(dat$Date,"%Y-%m-%d")

#Convert to time series object
library(lubridate)
library(zoo)
library(forecast)
library(ggrepel)
library(ggthemes)
library(ggplot2) 
date_range = seq(from = as.Date("2020-01-04"), to = as.Date("2021-01-02"), by = 1)

y<-zoo(dat['TotalSubscribers'], date_range)
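
To show what a zoo object gives you, here is a minimal sketch on a five-day toy series. The names `toy_dates`, `toy_totals` and `toy_series` and their values are hypothetical; `zoo()`, `index()`, `coredata()` and `window()` are functions from the zoo package.

```r
library(zoo)

# Hypothetical running totals for five consecutive days
toy_dates  <- seq(from = as.Date("2020-01-04"), by = "day", length.out = 5)
toy_totals <- c(12, 12, 14, 13, 16)

# Pair each value with its date
toy_series <- zoo(toy_totals, order.by = toy_dates)

# index() returns the dates, coredata() returns the raw values
print(index(toy_series))
print(coredata(toy_series))

# window() selects a date range
print(window(toy_series, start = as.Date("2020-01-06")))
```

If you would rather not hard-code the date range, something like `seq(min(dat$Date), max(dat$Date), by = "day")` should give the same sequence, assuming the Date column has already been converted and has no gaps.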

Visualizing past Subscriber growth


Now we are ready to create a time series plot from the newly created time series. Below is the code that creates the plot shown underneath.

#Store last day’s Subscriber count in a variable so that we can use it to show in the plot in red
dat_ends <- tail(dat,n=1)
dat_ends

#Plot
ggplot(data = y) +
geom_line(mapping = aes(x=date_range, y=TotalSubscribers),color='black',size=2) +
  ggtitle("Total Subscribers for Data Science With Raghav") +
  theme_economist() +
  theme(axis.title.x=element_blank(),axis.title.y=element_blank()) +
  geom_text_repel(aes(x=Date,y=TotalSubscribers,label=TotalSubscribers),data=dat_ends,fontface="plain",color="red",size=6) 

Here is the subscriber growth plotted as a line chart, with the final value of 219 in red.

Basic Forecasting methods

Now we can apply some basic forecasting methods to project the future subscribers for my YouTube channel.

Mean

The first method simply uses the mean of the historical data as the forecast for every future period. In some scenarios this is the best we can do. This method is often used as a baseline benchmark against which more advanced forecasting methods are compared.
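
Here is a minimal base R sketch of the mean method on made-up numbers. The `totals` vector is hypothetical; the forecast package offers the same idea through its `meanf()` function.

```r
# Hypothetical running totals (not the real channel data)
totals <- c(12, 12, 14, 13, 16, 18)
h <- 3  # number of future days to forecast

# Mean method: every future value equals the historical average
mean_forecast <- rep(mean(totals), h)
print(mean_forecast)
```

For a series that keeps growing, like a subscriber total, the mean sits below the latest value, which is why this method mostly serves as a benchmark.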

Naive

The naive method, as the name suggests, is a very basic method that uses the last observation as the forecast for all future values.
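
The naive method is just as short in base R. Again the `totals` vector is hypothetical; the forecast package provides this method as `naive()`.

```r
# Hypothetical running totals (not the real channel data)
totals <- c(12, 12, 14, 13, 16, 18)
h <- 3  # number of future days to forecast

# Naive method: repeat the last observed value for every future day
naive_forecast <- rep(tail(totals, 1), h)
print(naive_forecast)
```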

Rwf with Drift

Random walk forecast (rwf) with drift is the naive method plus a drift term: the average change per period over the historical data, which is added to the last observation once for each step into the future. Below is the code to forecast with rwf, along with the forecast plot. The red line is the forecast for the next 30 days.

#MODEL1:- Using method rwf - Random walk with drift model - Equivalent to ARIMA(0,1,0) with drift
z <- ts(dat['TotalSubscribers'],start=min(dat$Date),end=max(dat$Date),frequency=1)
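
To make the drift idea concrete, the point forecasts can be worked out by hand. This is a minimal sketch on a hypothetical `totals` vector; it assumes no Box-Cox transformation, in which case the drift is the average of the day-to-day changes, which simplifies to (last value - first value) / (number of observations - 1).

```r
# Hypothetical running totals (not the real channel data)
totals <- c(12, 12, 14, 13, 16, 18)
h <- 3                      # number of future days to forecast
n <- length(totals)

# Drift = average change per day over the history
drift <- (totals[n] - totals[1]) / (n - 1)

# Forecast for day k ahead = last value + k * drift
drift_forecast <- totals[n] + drift * seq_len(h)
print(drift_forecast)
```

The same drift value comes out of `mean(diff(totals))`, which is a handy way to check the arithmetic.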
autoplot(z) + aes(x=date_range) +
  autolayer(rwf(z, h=30,drift=TRUE),
            series="Naïve with drift", PI=FALSE) +
  ggtitle("Forecasts for Youtube Subscriber growth") +
  xlab("Date") + ylab("Subscribers") +
  guides(colour=guide_legend(title="Forecast")) +
  theme_economist()

#Store rwf point forecasts in a variable 
point_forecasts <- rwf(z, h=30,drift=TRUE)$mean

Auto.Arima

All the methods above are basic methods that you can use as baselines against which to compare the more complex models you build. ARIMA, which stands for Auto Regressive Integrated Moving Average, is a more sophisticated model. It has three parameters: the autoregressive order (p), the degree of differencing (d) and the moving average order (q). The forecast package in R makes ARIMA modelling very easy with the auto.arima function, which selects these orders automatically. Below is the code to fit an ARIMA model to our subscriber data. Also shown is the forecast for the next 30 days as predicted by the ARIMA model.

#MODEL2:- ARIMA modeling (Auto Regressive Integrated Moving Average)
lambda_subs <- BoxCox.lambda(y)
lambda_subs
fit_subs <- auto.arima(y,lambda=lambda_subs,seasonal = FALSE, approximation = FALSE,
                       stepwise = FALSE)
summary(fit_subs)

#Plot forecasts
fit_subs %>% forecast(h=30) %>% autoplot() +
    ggtitle("Forecasts for Youtube channel growth") +
    ylab('Subscriber Count') +
    theme_economist()

#Store ARIMA forecasts in a variable
fcast <- forecast(fit_subs,h=30)
fcast_df <- as.data.frame(fcast$mean)
colnames(fcast_df) <- "a"

And here is the model fit summary.

Series: y 
ARIMA(2,2,3) 
Box Cox transformation: lambda= 0.7730962 

Coefficients:
         ar1      ar2      ma1     ma2      ma3
      0.0935  -0.8620  -1.0936  1.1067  -0.9058
s.e.  0.0459   0.0459   0.0310  0.0440   0.0333

sigma^2 estimated as 0.0721:  log likelihood=-36.84
AIC=85.68   AICc=85.91   BIC=109.04

Training set error measures:
                     ME      RMSE       MAE       MPE      MAPE      MASE        ACF1
Training set 0.04026656 0.7132815 0.4595021 0.1014478 0.9284956 0.7852524 0.003670371
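
Two items in this summary are worth unpacking: the Box-Cox lambda and the middle 2 in ARIMA(2,2,3). The sketch below is only an illustration in base R on a hypothetical `totals` vector. The helper names `box_cox` and `inv_box_cox` are made up for this example (the forecast package has its own `BoxCox()` and `InvBoxCox()`), and the formula shown is the standard one for a non-zero lambda and positive data.

```r
# Lambda as reported in the model summary
lambda <- 0.7730962

# Hypothetical running totals (not the real channel data)
totals <- c(12, 12, 14, 13, 16, 18, 21)

# Box-Cox transform for lambda != 0: (y^lambda - 1) / lambda
box_cox <- function(y, lambda) (y^lambda - 1) / lambda

# Inverse transform, used to bring forecasts back to subscriber counts
inv_box_cox <- function(w, lambda) (lambda * w + 1)^(1 / lambda)

transformed <- box_cox(totals, lambda)

# d = 2 in ARIMA(2,2,3): the transformed series is differenced twice
twice_differenced <- diff(transformed, differences = 2)

print(round(inv_box_cox(transformed, lambda), 6))  # recovers the original totals
print(length(twice_differenced))                   # two observations shorter
```

The AR and MA terms are then fitted to the twice-differenced, transformed series, and forecasts are transformed back to the original scale.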

Method Evaluation

We will evaluate the forecasts using the sum of squared errors (SSE). I have kept the actual data for the next 30 days in a separate data frame, which I compare against the forecasts generated by each method. Below is the code for evaluating both the rwf and ARIMA models.

#Model Evaluation:- Comparison with Actual Data for 10 day and 30 day forecasts
dat_actual <- read.csv('Totals_Actual.csv')
head(dat_actual)
tail(dat_actual)

dat_actual[1,2] <- 219 #Set first day in New data to last day's total subscriber count
dat_actual['TotalSubscribers'] <- cumsum(dat_actual['Subscribers'])

#Sum of Square Errors over 10 days - RWF with drift model
sum((dat_actual[1:10,'TotalSubscribers'] - point_forecasts[1:10])**2)

#Sum of Square Errors over 10 days - ARIMA (2,2,3) model
sum((dat_actual[1:10,'TotalSubscribers'] - fcast_df[1:10,'a'])**2)

#Sum of Square Errors over 30 days - RWF with drift model
sum((dat_actual[1:30,'TotalSubscribers'] - point_forecasts[1:30])**2)

#Sum of Square Errors over 30 days - ARIMA (2,2,3) model
sum((dat_actual[1:30,'TotalSubscribers'] - fcast_df[1:30,'a'])**2)

Here is the console output.

> #Sum of Square Errors - RWF with drift model
> sum((dat_actual[1:10,'TotalSubscribers'] - point_forecasts[1:10])**2)
[1] 558.8491
> #Sum of Square Errors - ARIMA (2,2,3) model
> sum((dat_actual[1:10,'TotalSubscribers'] - fcast_df[1:10,'a'])**2)
[1] 89.25826
> #Sum of Square Errors - RWF with drift model
> sum((dat_actual[1:30,'TotalSubscribers'] - point_forecasts[1:30])**2)
[1] 34901.45
> #Sum of Square Errors - ARIMA (2,2,3) model
> sum((dat_actual[1:30,'TotalSubscribers'] - fcast_df[1:30,'a'])**2)
[1] 11274.83

As you can see, the SSE for the ARIMA model is much lower than the SSE for the rwf model: about six times lower over 10 days (89.26 vs 558.85) and about three times lower over 30 days (11274.83 vs 34901.45). On this data, ARIMA produced considerably better forecasts than rwf.

The output above also lets us compare the 10 day and 30 day forecasts. The horizon is only three times longer, yet the SSE is roughly 60 times larger for rwf and over 120 times larger for ARIMA, which shows that forecasts beyond 10 days are not reliable. We should therefore forecast only a limited number of future periods.
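
Because an SSE over 30 days sums three times as many terms as an SSE over 10 days, it can help to put the numbers on a per-day scale. The sketch below takes the four SSE values printed in the output above and converts them to mean squared error and root mean squared error. The `sse` data frame and its column names are made up for this illustration.

```r
# SSE values as reported in the console output above
sse <- data.frame(
  model   = c("rwf_drift", "arima", "rwf_drift", "arima"),
  horizon = c(10, 10, 30, 30),
  sse     = c(558.8491, 89.25826, 34901.45, 11274.83)
)

# Mean squared error per forecast day
sse$mse <- sse$sse / sse$horizon

# Root mean squared error, in units of subscribers
sse$rmse <- sqrt(sse$mse)

print(sse)
```

On this scale the ARIMA forecasts are off by roughly 3 subscribers on a typical day over the 10 day horizon and by roughly 19 over the 30 day horizon, so the loss of accuracy is not just an effect of summing more days.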

Other Advanced methods

There are other advanced methods you can use as well, for example neural networks. You can go through the forecasting textbook by Rob Hyndman (Forecasting: Principles and Practice, written with George Athanasopoulos) to learn more.

Conclusion

I performed this exercise to get an idea of how long my channel will take to reach 1000 subscribers. Although the plot below is completely wrong, it gives me an indication of when I can expect my channel to reach 1000 subs. PLEASE DO NOT USE FORECASTING METHODS TO PROJECT SO FAR INTO THE FUTURE! This was just a fun experiment. Hopefully you got some idea of how forecasting works in R. If you like my blog, please do subscribe to my YouTube channel.
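
For the curious, here is the kind of back-of-the-envelope arithmetic behind a "days to 1000" estimate. It is a rough sketch that assumes the simple drift method (not the ARIMA model) and uses only numbers from this post: 12 subscribers at the start, 219 at the end, and 365 daily observations. It is exactly the sort of long-range extrapolation that should not be trusted.

```r
start_count <- 12    # subscribers on the first day of the data
last_count  <- 219   # subscribers on the last day of the data
n_days      <- 365   # number of daily observations
target      <- 1000  # subscriber goal

# Average daily growth over the observed year (the drift)
drift <- (last_count - start_count) / (n_days - 1)

# Smallest number of days h with last_count + h * drift >= target
days_needed <- ceiling((target - last_count) / drift)
print(days_needed)
```

A constant daily growth rate is a strong assumption, so read the result as a fun number and not as a prediction.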
