Stock Market Prediction with R
Time Series Data, QuantMod, ARIMA
We hope our last blog piqued your interest in the field of analytics, and this post aims to take that a bit further. In today’s blog we’ll deal with stock market prediction using data mining techniques in R. People who have worked in this field may already know about this, as it is one of the most basic applications of data mining. Newcomers, do not worry: go through the blog and get an idea of how powerful data analytics is. We are going to start with our course in the very next blog.
Additional references for each topic are also provided at the end.
Although the project is small, there is a lot of ground to be covered before we actually proceed with the code. Let’s go!
Introduction
The standard Wikipedia definition of a stock market is “A stock market, equity market or share market is the aggregation of buyers and sellers (a loose network of economic transactions, not a physical facility or discrete entity) of stocks (also called shares), which represent ownership claims on businesses; these may include securities listed on a public stock exchange as well as those only traded privately.”
Every stock market has stock exchanges where the stocks of companies are listed. A stock is nothing but a piece of the company, also known as a share. The price of a share keeps fluctuating and depends on the perceived value of the company. An investor or trader must know the right time to buy or sell shares to maximize gains. The uncertain nature of these fluctuations is what led analysts to devise methods for predicting stock prices. There have been many instances in the past where the stock market collapsed within days, forcing the world into an economic freeze. Take, for instance, the Wall Street crash of 2007-08 (watch the movie The Big Short).
Below: Margot Robbie explains finance terms.
The highly publicized Enron scandal, which eventually led to the bankruptcy of Enron Corporation, is a prime example of how stock prices change with the public’s perception. After reaching an all-time high in mid-2000, the share price plummeted to less than $1 by the end of 2001. Predicting stock market prices has therefore proved quite difficult, although many theories have been devised. We will use the ARIMA model to analyse historical stock data. The results will be visualized using R.
Quick fact: The Bombay Stock Exchange is Asia’s first stock exchange. It was established in 1875.
That is all we need to know about stock markets. If you are interested in understanding how a stock market works, watch the video below.
The basic assumption made while forecasting stock data is that it has some relation with its historical values. This forms the basis of technical analysis. There is another theory that contests this belief. The Random Walk Theory states that successive price changes in an individual security (share) are independent, i.e. the next price change of the stock does not depend on its previous price changes. The theory also holds that price changes conform to some probability distribution, which can be used to describe the likely range of future prices.
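To make the Random Walk Theory a little more concrete, here is a minimal sketch in base R. It simulates a price path whose daily changes are independent draws from a normal distribution and then measures how strongly successive values are correlated. The normal distribution, the seed, the starting price of 100 and all variable names are assumptions chosen purely for illustration.

```r
# Simulate a random walk: each price change is an independent draw.
set.seed(42)                                # arbitrary seed for reproducibility
n_days  <- 1000                             # hypothetical number of trading days
changes <- rnorm(n_days, mean = 0, sd = 1)  # independent daily price changes
price   <- 100 + cumsum(changes)            # price path starting near 100

# Lag-1 autocorrelation of the changes: close to 0 for a random walk.
lag1_changes <- acf(changes, lag.max = 1, plot = FALSE)$acf[2]

# Lag-1 autocorrelation of the price level: close to 1, because each
# price is the previous price plus a small change.
lag1_price <- acf(price, lag.max = 1, plot = FALSE)$acf[2]

print(c(changes = lag1_changes, price = lag1_price))
```

The price level looks highly predictable from yesterday's price, yet the changes themselves carry almost no information about the next change, which is exactly the point the theory makes.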
In this project we will be working with time series data, i.e. a sequence of observations collected at regular intervals over a period of time. Data can be collected daily, weekly, monthly, yearly, etc. Technical analysis relies heavily on time series data for statistical inferences and predictions. As mentioned before, we will be using the R programming language for forecasting.
Why R? The foremost reason is the availability of numerous statistical packages for forecasting, such as forecast, stats and quantmod. The syntax is easy, and the functionality is extensive.
The code
Here is the entire code. The explanation follows after this.
1. library(quantmod)
2. getSymbols("MSFT")
3. summary(MSFT)
4. chartSeries(MSFT, subset = 'last 12 months', type = 1)
5. addBBands()
6. library(tseries, quietly = TRUE)
7. adf.test(MSFT$MSFT.Adjusted)
8. ret_MSFT <- na.omit(100*diff(log(MSFT$MSFT.Adjusted[2274:2638])))  # daily log returns in percent; drop the leading NA
9. library(forecast, quietly = TRUE)
10. MSFT_ret_train <- ret_MSFT[1:floor(0.9*length(ret_MSFT))]  # first 90% for fitting
11. MSFT_ret_test <- ret_MSFT[(floor(0.9*length(ret_MSFT))+1):length(ret_MSFT)]  # last 10% for testing
12. fit <- Arima(MSFT_ret_train, order = c(2,0,2))
13. preds <- predict(fit, n.ahead = length(MSFT_ret_test))$pred
14. test_forecast <- forecast(fit,h = 25)
15. plot(test_forecast, main = "Arima forecast for Microsoft Stock")
Lines 1, 6 and 9 contain the library() function, which is used to load and attach an already installed package. Packages are collections of R functions, data and code. Most packages are designed for specific tasks, but many are general enough to be used in almost all programs.
It is clear that we have used 3 packages in the code. These packages are quantmod, tseries and forecast.
The QuantMod package:
QuantMod stands for Quantitative Financial Modelling and Trading Framework for R. We load this package to make fetching data a whole lot easier. The package contains the getSymbols() function used in line 2.
This function is used to load data from different sources, remote or local. Here we have used only one parameter inside the function, i.e. the stock ticker symbol of the firm whose stock prices we want to forecast. We chose Microsoft Corporation for this example. Its stock symbol is MSFT.
When we run getSymbols("MSFT"), an object named MSFT is created in the workspace. Viewing it gives a table like this:
This is known as the OHLC data or Open, High, Low, Close data that denotes the opening price, highest price, lowest price and the closing price for a date.
You can go for other companies as well. You just need to change the ticker symbol.
Another parameter that we can define in the getSymbols() function is src, which specifies the data source. Unless specified, the default source is www.finance.yahoo.com.
Line 3 contains the summary() function. It takes the name of the table (here it is MSFT) as the parameter. The name of the function itself describes its purpose.
The function gives us the general statistical information like the minimum value, maximum value, mean, median etc. for every column of data in the table.
Now this might not look very appealing, so we use another function available in the quantmod package, called chartSeries() in line 4. It is one of the most important functions used to plot Time Series data. The function helps us see the variations in the data over the past year easily.
Note that we have specified 3 parameters in the function i.e.
chartSeries(MSFT, subset = 'last 12 months', type = 1)
The first parameter is the name of the data table. The second parameter, subset, allows us to specify the span of data we want to analyze. The third, type, allows us to specify the type of plot we want (line, candlesticks, matchsticks, bars). Try experimenting with different subsets and plot types.
When we enter the code we get a plot like this
You can also note the volume of stock traded below.
Bollinger Bands can also be added to the plot using the function addBBands(), as we have done in line 5. The function adds 3 bands to the existing plot.
Without going into details, the basic purpose of Bollinger Bands is to provide a relative definition of high and low. By definition, prices are high at the upper band and low at the lower band.
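For readers who want to see what the three bands are made of, here is a rough sketch in base R. It assumes the common convention of a 20-period simple moving average for the middle band, with the upper and lower bands placed 2 standard deviations away. The prices are simulated, the variable names are made up for this example, and the exact calculation inside addBBands() may differ in detail.

```r
# Bollinger Bands by hand on a simulated closing-price series.
set.seed(1)
close <- 50 + cumsum(rnorm(120, sd = 0.5))  # made-up prices, not real data

n <- 20   # window length (assumed convention)
k <- 2    # number of standard deviations (assumed convention)

# Trailing simple moving average; the first n - 1 values are NA.
middle <- as.numeric(stats::filter(close, rep(1 / n, n), sides = 1))

# Trailing standard deviation over the same window.
roll_sd <- rep(NA_real_, length(close))
for (i in n:length(close)) {
  roll_sd[i] <- sd(close[(i - n + 1):i])
}

upper <- middle + k * roll_sd
lower <- middle - k * roll_sd

bands <- data.frame(close, lower, middle, upper)
print(tail(bands, 3))
```

A close near the upper band is "high" relative to the last 20 observations, and a close near the lower band is "low" in the same relative sense.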
In chartSeries() you can specify the subset down to the second. The code looks like this:
From 9 a.m. on the 1st through 3 p.m. on the 9th
subset = "2012-07-01 9:00/2012-07-09 15:00"
The tseries package:
tseries stands for Time Series Analysis and Computational Finance. We load this package since it contains the function adf.test() used in line 7.
adf.test() is the function that allows us to perform the Augmented Dickey-Fuller test.
The Augmented Dickey-Fuller (ADF) test is a unit root test for stationarity. In simpler words, the test checks whether the series behaves like a random walk (has a unit root) or fluctuates around a stable level.
A time series is said to be stationary when its statistical properties, such as mean and variance, do not change with a shift in time. If they do change with a shift in time, the distribution itself changes. Unit roots present in the data make the distribution unstable, and forecasting becomes difficult. (Why they are called unit roots has to do with mathematics that is beyond the scope of this blog. However, if you are interested, additional links are provided below.)
The null hypothesis for this test is that there exists a unit root.
We use the adjusted closing prices of the stock as our time series data. The adjusted closing price is often used when examining historical returns or performing a detailed analysis of historical returns.
So when we enter the code, we obtain the output like this.
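Our approximate p-value is 0.9627, so we fail to reject the null hypothesis, but that does not imply that the null hypothesis is true. The data are merely consistent with it. In other words, the price series looks non-stationary, which is one reason we model returns rather than prices in the next step.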
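The idea behind the test can be sketched with a plain regression in base R. This is the simple (non-augmented) Dickey-Fuller form with a constant only, not what adf.test() actually runs, and the function name df_stat and the simulated series are assumptions made for illustration. The statistic does not follow the usual t distribution; for large samples the 5% critical value in this form is roughly -2.86, so values well below that point away from a unit root.

```r
# Dickey-Fuller style regression: diff(y)[t] = a + g * y[t-1] + error.
# A unit root corresponds to g = 0; a clearly negative t-statistic for g
# is evidence against a unit root.
df_stat <- function(y) {
  dy   <- diff(y)              # y[t] - y[t-1]
  ylag <- y[-length(y)]        # y[t-1]
  fit  <- lm(dy ~ ylag)
  summary(fit)$coefficients["ylag", "t value"]
}

set.seed(7)
random_walk <- cumsum(rnorm(500))                                  # has a unit root
stationary  <- as.numeric(arima.sim(list(ar = 0.5), n = 500))      # no unit root

stat_rw <- df_stat(random_walk)
stat_st <- df_stat(stationary)
print(c(random_walk = stat_rw, stationary = stat_st))
```

The stationary series produces a strongly negative statistic, while the random walk does not, which mirrors the large p-value we saw for the price series.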
There is a similar function, adfTest(), available in the fUnitRoots package. A comparison between them is available here
Next we calculate the return for each day in our sample. (When you invest in stocks, you do get something in "return".) Simple returns do not add up across periods; the yearly return, for example, is obtained from the geometric average of the monthly returns. Continuously compounded (log) returns simply add up over time, so in practice we will often use them.
A little bit of maths
If you denote by $P_t$ the stock price at the end of period $t$ (a month, say, or a day as in our code), the simple return is given by:
$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$, the relative price change.
Now, $r_t = \ln(1 + R_t)$, with $R_t$ the simple return and $r_t$ the continuously compounded return at time $t$.
From the above two equations we get $r_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln P_t - \ln P_{t-1}$.
This is the mathematics behind the entire calculation being done in line 8.
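As we can see, to calculate the returns we need the element-wise difference of the log prices. The diff() function allows us to do that. We first take the logarithm of the prices with log() and then use diff() to subtract each value from the one that follows it, which gives the continuously compounded return. Multiplying by 100 expresses it in percent.
Note that we have created a subset MSFT$MSFT.Adjusted[2274:2638], i.e. 365 daily prices, which yield 364 returns. These row indices are tied to the data as it was downloaded when this post was written, so adjust them if your download covers a different span.
On running the code you will obtain a table that gives you the daily returns for the selected period.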
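A tiny worked example may help connect the formulas to the code. The four prices below are made up, and the variable names are only for illustration; the calculation mirrors the diff(log(...)) pattern used in line 8, without the multiplication by 100.

```r
# Hypothetical closing prices for four consecutive days.
price <- c(100, 105, 102.9, 102.9)

# Simple returns R_t = (P_t - P_{t-1}) / P_{t-1}
simple_ret <- diff(price) / head(price, -1)

# Continuously compounded returns r_t = ln(P_t / P_{t-1}), computed by
# differencing the log prices.
log_ret <- diff(log(price))

# The two are linked by r_t = ln(1 + R_t).
print(round(cbind(simple_ret, log_ret, check = log(1 + simple_ret)), 4))

# Log returns add up over time: their sum is the log return of the whole period.
total <- sum(log_ret)
print(total)
```

Four prices give three returns, just as 365 prices give 364, and for small moves the simple and log returns are almost the same number.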
Now we move on to the forecast package.
The Forecast Package
It contains forecasting functions for time series and linear models. This is the holy grail of this project.
We load the forecast package in line 9.
In line 10 and line 11 we split our returns data into two parts. One of them is the training data (MSFT_ret_train) and the other is the testing data (MSFT_ret_test).
The difference between the two? In simple terms, the training set is the data the model learns from: its parameters are estimated so that it fits this data well.
The test set is data the model has not seen. We apply the fitted model to it and check whether the forecasts are close to the actual values. The test set is like an exam for your model.
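Here is a small sketch of such a split in base R, using simulated numbers as a stand-in for the returns. The 90/10 proportion follows the code above; the use of floor() and the variable names are choices made for this illustration. Note that for time series the split is chronological: the earliest observations train the model and the most recent ones test it.

```r
# Chronological 90/10 split of a return series (simulated stand-in data).
set.seed(123)
returns <- rnorm(364)                     # stand-in for 364 daily returns

n_train <- floor(0.9 * length(returns))   # whole number of training points
train   <- returns[1:n_train]                        # earliest 90%
test    <- returns[(n_train + 1):length(returns)]    # most recent 10%

print(c(train = length(train), test = length(test)))
```

Rounding the cut point down to a whole number keeps the two pieces from overlapping or dropping an observation.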
Line 12 shows us the use of the Arima() function from the forecast package.
ARIMA
ARIMA stands for auto-regressive integrated moving average and is specified by these three order parameters: (p, d, q).
Auto Regression AR(p)
An autoregressive model estimates future values from previous values of the same series. It is denoted by AR(p), where p represents the order of the model. AR(0), the simplest process, involves no dependence between terms, preceding or current. In a first-order autoregressive model, AR(1), the output is a fraction of the preceding term plus a noise (error) term. An AR(2) model takes into account the 2 preceding values and noise to predict the output.
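As a rough illustration of an AR(2) model, the sketch below simulates one with known coefficients, fits it with arima() from the stats package, and reproduces the one-step-ahead forecast by hand. The coefficients 0.5 and -0.3, the series length and the variable names are assumptions picked for this example.

```r
# Simulate an AR(2) process and recover its coefficients.
set.seed(2024)
phi <- c(0.5, -0.3)                           # assumed true AR coefficients
x   <- arima.sim(model = list(ar = phi), n = 2000)

# Fit an AR(2) model, i.e. ARIMA(2,0,0), with stats::arima().
fit_ar <- arima(x, order = c(2, 0, 0))
est    <- coef(fit_ar)
print(round(est, 3))

# One-step-ahead forecast by hand: the mean plus weighted deviations of
# the last two observations from the mean.
mu    <- unname(est["intercept"])             # arima() reports the mean here
last2 <- tail(as.numeric(x), 2)               # x[T-1], x[T]
by_hand <- mu + unname(est["ar1"]) * (last2[2] - mu) +
                unname(est["ar2"]) * (last2[1] - mu)

from_predict <- as.numeric(predict(fit_ar, n.ahead = 1)$pred)
print(c(by_hand = by_hand, from_predict = from_predict))
```

The hand calculation and predict() agree, which shows that an AR(2) forecast really is just a weighted combination of the two most recent values.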
Moving Average MA(q)
A moving average is a technique for smoothing a series in order to find its trend from previous values that do not follow a definitive pattern. The two commonly used moving average techniques are the exponential moving average (EMA) and the simple moving average (SMA). Note that the MA part of ARIMA is a different idea despite the similar name.
A moving average (MA(q)) component represents the current value of the series as a combination of the current error term and the previous q error terms.
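One way to see what "a combination of previous error terms" implies is to look at the autocorrelations of an MA(1) process. The sketch below uses ARMAacf() from the stats package; the coefficient 0.6 is an arbitrary assumption for illustration.

```r
# Theoretical autocorrelations of an MA(1) process x[t] = e[t] + theta * e[t-1].
theta <- 0.6                                   # assumed MA coefficient
rho   <- ARMAacf(ma = theta, lag.max = 3)      # lags 0, 1, 2, 3
print(round(rho, 4))

# Lag 1 equals theta / (1 + theta^2); all higher lags are zero,
# because only one past error term enters the model.
lag1_formula <- theta / (1 + theta^2)
print(lag1_formula)
```

This "cut-off" after lag q is what makes the autocorrelation plot useful for guessing the order of the MA part.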
The Integrated I(d) part of the model is used to stabilize the time series (make it stationary) by differencing. Differencing a series simply means subtracting the previous value from the current value; this is repeated d times.
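A short example of differencing in base R, using a made-up series with a quadratic trend. Here d = 2 happens to be enough to remove the trend completely; real data will rarely be this tidy.

```r
# A series with a quadratic trend needs two rounds of differencing.
x <- (1:8)^2                     # 1 4 9 16 25 36 49 64

d1 <- diff(x)                    # first difference: still trending upwards
d2 <- diff(x, differences = 2)   # second difference: constant

print(d1)
print(d2)
```

Each round of differencing shortens the series by one observation, so d2 has two fewer values than x.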
Order of ARIMA
The order of an ARIMA model is generally represented as ARIMA(p,d,q), where:
p = order of the autoregressive part.
d = degree of first differencing involved.
q = order of the moving average part.
To make our predictions accurate we have to choose suitable values of p, d and q.
There are 2 ways to do this: either by inspecting correlations (the ACF and PACF plots) or by using the auto.arima() function available in the forecast package. In the above code we could also have used the auto.arima() function to estimate the values of p, d and q.
The code looks something like this:
auto.arima(MSFT_ret_train)
The output of the code:
This function uses a variation of the Hyndman-Khandakar algorithm. The main steps in the algorithm are:
1. The number of differences d is determined using repeated KPSS (Kwiatkowski–Phillips–Schmidt–Shin) tests.
2. The values of p and q are then chosen by minimizing the AICc (the corrected Akaike’s information criterion) after differencing the data d times.
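The second step can be imitated, very roughly, with a brute-force search in base R. This is not the stepwise search that auto.arima() performs: it simply tries every (p, q) on a small grid with stats::arima(), keeps d fixed at 0 because the simulated series is already stationary, and compares plain AIC values. The grid limits, the simulated data and the variable names are all assumptions for illustration.

```r
# Pick (p, q) for an ARMA model by minimising AIC over a small grid.
set.seed(99)
x <- arima.sim(model = list(ar = 0.6), n = 300)   # simulated stand-in series

grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit_i <- tryCatch(
    arima(x, order = c(grid$p[i], 0, grid$q[i])),  # d = 0: no differencing
    error = function(e) NULL                       # skip orders that fail to fit
  )
  if (!is.null(fit_i)) grid$aic[i] <- AIC(fit_i)
}

best <- grid[which.min(grid$aic), ]
print(grid[order(grid$aic), ])
print(best)
```

The row with the smallest AIC is the order this naive search would choose. Several neighbouring orders often have nearly identical AIC values, which is a reminder that the "best" order is an estimate rather than a certainty.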
Watch the video below to learn about ACF and PACF.
Let’s talk about the predict() function used in line 13 and the forecast() function used in line 14.
The predict() function (from the stats package) is used to forecast from a fitted ARIMA model. As seen in line 12, we have fit an ARIMA(2,0,2) model to MSFT_ret_train. The predict() function in the next line takes two parameters, object and n.ahead. The object here is fit, the fitted model. The n.ahead parameter specifies the number of steps ahead for which prediction is required (in simpler words, it allows us to specify the duration of the forecast). Note that we have extracted the predicted values by adding $pred at the end of the call.
The forecast() function is used in line 14. Although predict() and forecast() give the same point forecasts, the combination of Arima() and forecast() from the forecast package provides additional functionality. The function returns a forecast object, like the one shown below, that is easier to plot and analyze. The parameter h allows us to specify the number of periods to forecast.
Line 15 uses the plot() function to plot the forecasts, which gives us the plot below.
Above is the result that we obtain from a simple ARIMA(2,0,2) model. The darkly shaded region is the 80% prediction interval and the lightly shaded region is the 95% prediction interval. The basic interpretation of the 95% interval is that, if the model is adequate, each future return is expected to fall inside it about 95% of the time. In the plot above this band is roughly +/- 2 around the forecast, which gives us a fair idea of the range of future daily returns.
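To show where such shaded bands can come from, here is a sketch that builds 80% and 95% intervals from the output of predict() in base R. It assumes normally distributed forecast errors and uses a simulated AR(1) series in place of the Microsoft returns; the variable names are made up for this example, and the forecast package may compute its intervals differently in detail.

```r
# Point forecasts and prediction intervals from stats::predict().
set.seed(11)
x   <- arima.sim(model = list(ar = 0.4), n = 300)  # simulated stand-in returns
fit <- arima(x, order = c(1, 0, 0))

fc <- predict(fit, n.ahead = 10)   # list with $pred (forecasts) and $se

# Normal-theory intervals: forecast +/- z * standard error.
z80 <- qnorm(0.90)                 # about 1.28 for an 80% interval
z95 <- qnorm(0.975)                # about 1.96 for a 95% interval

intervals <- data.frame(
  forecast = as.numeric(fc$pred),
  lo95 = as.numeric(fc$pred - z95 * fc$se),
  lo80 = as.numeric(fc$pred - z80 * fc$se),
  hi80 = as.numeric(fc$pred + z80 * fc$se),
  hi95 = as.numeric(fc$pred + z95 * fc$se)
)
print(round(intervals, 3))
```

The 95% band is always wider than the 80% band because it has to capture more of the possible outcomes.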
We can also calculate the accuracy of the forecasts. The code will look like
accuracy(preds, MSFT_ret_test)
and the output looks like this
The lower the value of RMSE (Root Mean Square Error), the better the accuracy of the model.
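RMSE is easy to compute by hand, which helps when reading the accuracy() output. The numbers below are made up for illustration, and the mean absolute error (MAE) is included only for comparison.

```r
# Forecast errors on a tiny made-up example.
actual    <- c(1.0, 2.0, 3.0)
predicted <- c(1.5, 2.0, 2.0)

errors <- actual - predicted          # -0.5  0.0  1.0
rmse   <- sqrt(mean(errors^2))        # root mean square error
mae    <- mean(abs(errors))           # mean absolute error

print(round(c(RMSE = rmse, MAE = mae), 4))
```

Because the errors are squared before averaging, RMSE penalizes one large miss more heavily than several small ones.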
An intrinsic shortcoming of ARIMA models, which is evident from the plot above, is the assumption of mean reversion of the series. What this means is that beyond a few steps ahead, the forecasts tend to the mean of the time series's historical values, making it a poor model for long-term predictions.
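The mean reversion can be seen directly in a small experiment. The sketch below fits an AR(1) model to a simulated series (not the Microsoft data) and measures how far the forecast is from the estimated mean at several horizons; the coefficient 0.5, the mean 0.1 and the variable names are assumptions for illustration.

```r
# Long-horizon ARIMA forecasts converge to the estimated mean.
set.seed(5)
x   <- 0.1 + arima.sim(model = list(ar = 0.5), n = 500)  # simulated series
fit <- arima(x, order = c(1, 0, 0))

mu <- unname(coef(fit)["intercept"])        # estimated mean of the series
fc <- as.numeric(predict(fit, n.ahead = 50)$pred)

# Distance of the forecast from the mean at horizons 1, 5, 10 and 50.
gap <- abs(fc[c(1, 5, 10, 50)] - mu)
print(signif(gap, 3))
```

The gap shrinks geometrically with the horizon, so after a few dozen steps the model is effectively forecasting the historical mean and nothing more.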
Note: We can also forecast stock prices for Indian companies by using their stock symbols. We only need to add .NS (for companies that trade on the NSE) or .BO (for companies that trade on the BSE) at the end of the symbol.
Additional References
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
http://www.yourarticlelibrary.com/investment/market-theory/random-walk-theory-concept-and-hypothesis/82675/
https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test
https://rstudio-pubs-static.s3.amazonaws.com/78839_afca73ae18194eaf8f1b86d399dde969.html
https://www.otexts.org/fpp/8