In this article I will show how to use basic Time Series Forecasting methods to forecast the subscriber growth on my YouTube channel Data Science with Raghav. You can also Subscribe! 🙂
CODE REPO
The complete code is in the GitHub repo below:
https://github.com/raaga500/YoutubeSubsGrowth
YouTube Video
Here is the YouTube video if you are more of a visual learner.
Getting the Data
To forecast the future you need data from the past. To get daily subscriber growth data for the past year, log in to YouTube Studio and go to the Analytics tab. Click See More, which opens a full-page popup window in the browser. In the first dropdown menu where the line chart is displayed (see the screenshot below), select the Subscribers option, then check the Total box in the list of videos in the lower section.
This gives you a line chart of subscriber growth for the last 28 days. I needed data for the last 365 days, so I selected the Last 365 days option in the date filter dropdown. Once the line chart is showing, click the Export current view button in the top right corner of the view and select the (.csv) option. This downloads a zip file containing three files. The one we want is Totals.csv, which contains the daily subscriber additions or subtractions for the last 365 days. We also need the subscriber count from 365 days ago. For me that was easy: it was just 12. I will therefore use 12 as the starting value to generate a daily running sum of subscribers.
Below is a screenshot of the YouTube Studio analytics page, which you can use as a guide to the steps above.
Below is a screenshot of the downloaded Totals.csv file, which contains the daily subscriber change. Sadly, it was zero for many days in the beginning.
Now that we have the data, let's load it into R and get ready to forecast the subscriber growth.
Why use R?
Although my go-to programming language is Python, I am using R for this analysis because R has some of the best time series forecasting libraries, some of which are not available in Python. Also, my professor at the university preferred R over Python for time series analysis. Having said that, I would also like to do the same in Python, so later in the blog you will see a similar analysis done using Python. But first, let's fire up RStudio.
Setting up your tools
The first tools you need are the R language and RStudio. Install R first and then RStudio. You can search online for how to install both on your favorite operating system.
Next you will need to install some very useful R libraries which we will use in our analysis. You can install them with the commands below from the RStudio console.
install.packages("lubridate")
install.packages("zoo")
install.packages("ggplot2")
install.packages("forecast")
install.packages("ggrepel")
install.packages("ggthemes")
Data Preprocessing
The first step is to import the downloaded CSV file into an R data frame. Below is the command to do this:
dat = read.csv('Totals.csv')
head(dat)
tail(dat)
Below is the output of the head and tail commands, which show the first 6 and last 6 rows of the dataset respectively.
The Subscribers column (column B) in the downloaded file is the daily change in the number of subscribers. To plot the cumulative subscriber growth we need to create a new column that stores the cumulative sum of subscribers up to each day.
Below is the R code that creates this new column:
#Make the first observation as 12 as my starting subscriber count
dat[1,2] <- 12
#Generate a cumulative sum and store it in new column
dat['TotalSubscribers'] <- cumsum(dat['Subscribers'])
Here is the screenshot of the last 6 rows of the new column.
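To make the running-sum step concrete, here is a minimal sketch on a five-day sample. The dates and daily changes are made up for illustration; only the column names Date and Subscribers and the starting count of 12 come from the steps above.

```r
# Toy stand-in for Totals.csv: the values are made up for illustration
csv_text <- "Date,Subscribers
2020-01-04,0
2020-01-05,1
2020-01-06,0
2020-01-07,-1
2020-01-08,2"
toy <- read.csv(text = csv_text)

# Seed the first row with the starting subscriber count (12 in my case)
toy[1, "Subscribers"] <- 12

# Running total of subscribers up to and including each day
toy$TotalSubscribers <- cumsum(toy$Subscribers)
print(toy)
```

Note that seeding the first row replaces that day's own change with the starting count. If the first day's change matters to you, adding the starting count to the cumulative sum of the untouched column would be an alternative.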
Next step is to convert the new column into a Time Series object. Before that we also need to convert the Date column, which is in char format, to Date. Then we can convert the TotalSubscribers column to a time series object. As this is daily data, the frequency is set to 1. Below is the code to convert the data into a Time Series object using the zoo library.
#Change the format of date column to date
dat$Date <- as.Date(dat$Date,"%Y-%m-%d")
#Convert to time series object
library(lubridate)
library(zoo)
library(forecast)
library(ggrepel)
library(ggthemes)
library(ggplot2)
date_range = seq(from = as.Date("2020-01-04"), to = as.Date("2021-01-02"), by = 1)
y<-zoo(dat['TotalSubscribers'], date_range)
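The daily index only lines up with the data if the file has exactly one row per day. Here is a small sanity-check sketch in base R. The three sample date strings are made up; the 2020-01-04 to 2021-01-02 range is the one used above.

```r
# Made-up stand-in for the Date column as read.csv returns it (character strings)
date_chr <- c("2020-01-04", "2020-01-05", "2020-01-06")

# Parse the strings into Date objects
dates <- as.Date(date_chr, format = "%Y-%m-%d")

# Build a daily index from the first to the last date
idx <- seq(from = min(dates), to = max(dates), by = 1)

# TRUE only if there are no missing or duplicated days
index_matches <- length(idx) == length(dates) && all(idx == dates)
print(index_matches)

# The full range used in this article spans 365 daily observations
n_days <- length(seq(from = as.Date("2020-01-04"), to = as.Date("2021-01-02"), by = 1))
print(n_days)
```

If the check fails on your own export, the zoo index and the data would be misaligned, so it is worth running before building the time series.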
Visualizing past Subscriber growth
Now we are ready to plot our newly created time series. Below is the code that creates the time series plot shown underneath:
#Store last day's Subscriber count in a variable so that we can use it to show in the plot in red
dat_ends <- tail(dat,n=1)
dat_ends
#Plot
ggplot(data = y) +
  geom_line(mapping = aes(x=date_range, y=TotalSubscribers),color='black',size=2) +
  ggtitle("Total Subscribers for Data Science With Raghav") +
  theme_economist() +
  theme(axis.title.x=element_blank(),axis.title.y=element_blank()) +
  geom_text_repel(aes(x=Date,y=TotalSubscribers,label=TotalSubscribers),data=dat_ends,fontface="plain",color="red",size=6)
Here is the graph for Subscriber growth plotted on a line chart with the final value of 219 in red
Basic Forecasting methods
Now we can apply some basic forecasting methods to project the future subscribers for my YouTube channel.
Mean
The first method simply uses the mean of the past observations as the forecast for every future period. In some scenarios this is the best we can do. This method is often used as a baseline against which more advanced forecasting methods are compared.
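As a minimal sketch of the mean method in base R, the snippet below uses made-up cumulative subscriber counts and a hypothetical 3-day horizon. The forecast package offers meanf() for the same purpose; the hand-rolled version here is only meant to show the idea.

```r
# Made-up cumulative subscriber counts for illustration
subs <- c(12, 13, 13, 15, 18, 22)
h <- 3  # forecast horizon in days (hypothetical)

# Mean method: every future value is forecast as the historical average
mean_forecast <- rep(mean(subs), h)
print(mean_forecast)
```

For a steadily growing series like a subscriber count, the mean sits below the most recent value, which is why it serves as a baseline rather than a serious forecast.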
Naive
The naive method, as the name suggests, is a very basic method that uses the last observation as the forecast for every future period.
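Here is the same kind of sketch for the naive method, again with made-up counts and a hypothetical 3-day horizon. The forecast package has naive() for this; the base R version below just repeats the last value.

```r
# Made-up cumulative subscriber counts for illustration
subs <- c(12, 13, 13, 15, 18, 22)
h <- 3  # forecast horizon in days (hypothetical)

# Naive method: repeat the last observed value for every future day
naive_forecast <- rep(tail(subs, 1), h)
print(naive_forecast)
```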
Rwf with Drift
Random walk forecast (rwf) with drift is the naive method plus a drift term: it estimates the average change per period in the data and adds it to the last observation for each step into the future. Below is the code to forecast with rwf along with the forecast plot. The red line is the forecast for the next 30 days.
#MODEL1:- Using method rwf - Random walk with drift model - Equivalent to ARIMA(0,1,0)
z <- ts(dat['TotalSubscribers'],start=min(dat$Date),frequency=1)
autoplot(z) + aes(x=date_range) +
  autolayer(rwf(z, h=30,drift=TRUE), series="Naïve with drift", PI=FALSE) +
  ggtitle("Forecasts for Youtube Subscriber growth") +
  xlab("Date") + ylab("Subscribers") +
  guides(colour=guide_legend(title="Forecast")) +
  theme_economist()
#Store rwf point forecasts in a variable
point_forecasts <- rwf(z, h=30,drift=TRUE)$mean
#... (the source is truncated here; the missing code fits the ARIMA model and stores its point forecasts in fcast_df) ...
colnames(fcast_df) <- "a"
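To show what the drift term does, here is a hand-rolled sketch in base R with made-up counts and a hypothetical 3-day horizon. It assumes the usual definition of the drift as the average change per period, (last - first) / (n - 1), and is only an illustration of the point forecasts, not a replacement for rwf().

```r
# Made-up cumulative subscriber counts for illustration
subs <- c(12, 13, 13, 15, 18, 22)
h <- 3  # forecast horizon in days (hypothetical)
n <- length(subs)

# Drift: average change per day over the history, equal to mean(diff(subs))
drift <- (subs[n] - subs[1]) / (n - 1)

# Forecast k days ahead = last observation + k * drift
drift_forecast <- subs[n] + drift * seq_len(h)
print(drift)
print(drift_forecast)
```

Because the drift is a single straight-line slope, the forecast keeps growing at the same rate forever, which is worth keeping in mind for long horizons.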
And here is the ARIMA model fit summary.
Series: y
ARIMA(2,2,3)
Box Cox transformation: lambda= 0.7730962

Coefficients:
         ar1      ar2      ma1     ma2      ma3
      0.0935  -0.8620  -1.0936  1.1067  -0.9058
s.e.  0.0459   0.0459   0.0310  0.0440   0.0333

sigma^2 estimated as 0.0721:  log likelihood=-36.84
AIC=85.68   AICc=85.91   BIC=109.04

Training set error measures:
                     ME      RMSE       MAE       MPE      MAPE      MASE        ACF1
Training set 0.04026656 0.7132815 0.4595021 0.1014478 0.9284956 0.7852524 0.003670371
Method Evaluation
We will evaluate our forecasts using the sum of squared errors (SSE) metric. For this I set aside the actual data for the next 30 days in a separate file, Totals_Actual.csv, which I compare against the forecasts generated by each method. Below is the code for evaluating both the rwf and ARIMA models.
#Model Evaluation:- Comparison with Actual Data for next 10 day and 30 day forecasts
dat_actual <- read.csv('Totals_Actual.csv')
head(dat_actual)
tail(dat_actual)
dat_actual[1,2] <- 219 #Set first day in New data to last day's total subscriber count
dat_actual['TotalSubscribers'] <- cumsum(dat_actual['Subscribers'])
#Sum of Square Errors - RWF with drift model (10 days)
sum((dat_actual[1:10,'TotalSubscribers'] - point_forecasts[1:10])**2)
#Sum of Square Errors - ARIMA (2,2,3) model (10 days)
sum((dat_actual[1:10,'TotalSubscribers'] - fcast_df[1:10,'a'])**2)
#Sum of Square Errors - RWF with drift model (30 days)
sum((dat_actual[1:30,'TotalSubscribers'] - point_forecasts[1:30])**2)
#Sum of Square Errors - ARIMA (2,2,3) model (30 days)
sum((dat_actual[1:30,'TotalSubscribers'] - fcast_df[1:30,'a'])**2)
#Sum of Square Errors - RWF with drift model
> sum((dat_actual[1:10,'TotalSubscribers'] - point_forecasts[1:10])**2)
[1] 558.8491
> #Sum of Square Errors - ARIMA (2,2,3) model
> sum((dat_actual[1:10,'TotalSubscribers'] - fcast_df[1:10,'a'])**2)
[1] 89.25826
> #Sum of Square Errors - RWF with drift model
> sum((dat_actual[1:30,'TotalSubscribers'] - point_forecasts[1:30])**2)
[1] 34901.45
> #Sum of Square Errors - ARIMA (2,2,3) model
> sum((dat_actual[1:30,'TotalSubscribers'] - fcast_df[1:30,'a'])**2)
[1] 11274.83
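As you can see, the SSE (sum of squared errors) for the ARIMA model is much lower than the SSE for the rwf model: about 89 vs 559 over 10 days, and about 11,275 vs 34,901 over 30 days. On this data, the ARIMA model produced much more accurate forecasts than the rwf model.
Also shown above is a comparison of the SSE for the 10-day forecast vs the 30-day forecast. The 30-day values are far larger than the extra 20 days alone would explain: for ARIMA the average squared error per day rises from about 9 to about 376. This shows that forecasts beyond 10 days are not at all reliable. Therefore we should forecast only a limited number of future periods.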
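Since SSE grows simply because more days are summed, a per-day average is handy when comparing horizons of different lengths. Below is a minimal sketch with two small helper functions; the names sse and mse and the sample numbers are my own for illustration and are not part of the analysis above.

```r
# Sum of squared errors between actual values and forecasts
sse <- function(actual, forecast) sum((actual - forecast)^2)

# Mean squared error: SSE divided by the number of forecast points
mse <- function(actual, forecast) mean((actual - forecast)^2)

# Made-up actuals and forecasts for illustration
actual   <- c(220, 222, 225, 229)
forecast <- c(221, 222, 223, 226)

sse_value <- sse(actual, forecast)  # 1 + 0 + 4 + 9 = 14
mse_value <- mse(actual, forecast)  # 14 / 4 = 3.5
print(c(SSE = sse_value, MSE = mse_value))
```

With helpers like these, the 10-day and 30-day results can be compared on the same per-day scale.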
Other Advanced methods
There are other advanced methods you can use as well, for example neural networks. You can go through the book by Hyndman to learn more.
Conclusion
I performed this exercise to get some insight into the time needed for my channel to reach 1000 subscribers. Although the plot below is completely wrong, it gives me an indication of when I can expect my channel to reach 1000 subs. PLEASE DO NOT USE FORECASTING METHODS TO PROJECT SO FAR INTO THE FUTURE! This was just a fun experiment. Hopefully you got some idea of how forecasting works in R. If you like my blog, please do subscribe to my YouTube channel.