NK Labs: How New Knowledge Uses Classic Time Series Models to Predict the Future
Can you predict the future? This age-old question has baffled philosophers and scientists for generations. However, machine learning and artificial intelligence have made it possible to build algorithms and technology that can, to some extent, do just that.
In last month’s NK Labs blog post, we presented our work on Sent2Vec, an approach that uses vector embeddings to represent sentences and small paragraphs. Our work on Sent2Vec was part of our participation in DARPA’s Data-Driven Discovery of Models (D3M) program, introduced in the first NK Labs blog post, which aims to develop tools that automate various steps of the data science workflow.
This month we’re introducing another tool that we’re working on for D3M: an ARIMA time series prediction model. This blog post will cover what an ARIMA model is, why simple models are sometimes better, and how we’ll use these tools to fight disinformation at New Knowledge in the future.
What is ARIMA?
ARIMA stands for AutoRegressive Integrated Moving Average and is one of the most widely used statistical models for time series prediction. Three components parameterize the ARIMA model: autoregression (p), integration (d), and moving average (q).
- Autoregression (p) - uses the past observations of the time series to predict the future observations.
- Integration (d) - differences the time series d times to remove any trend and make it stationary (a seasonal component is handled by a separate seasonal differencing term).
- Moving average (q) - uses the past forecast errors of the time series to predict the future observations.
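As a rough illustration of how the three components fit together, the sketch below differences a series and then forms a one-step forecast from past values and past forecast errors. The function names (`difference`, `arma_one_step`) and the coefficient values are hypothetical and chosen only for illustration; this is not NK Labs’ implementation.

```python
def difference(series, d=1):
    """Integration step: apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def arma_one_step(history, errors, phi, theta, const=0.0):
    """One-step forecast on the (already differenced) series.

    phi   -- autoregressive weights, phi[0] applies to the most recent value
    theta -- moving-average weights, theta[0] applies to the most recent error
    """
    ar_part = sum(w * y for w, y in zip(phi, reversed(history)))
    ma_part = sum(w * e for w, e in zip(theta, reversed(errors)))
    return const + ar_part + ma_part


# A linear trend becomes constant after one round of differencing (d = 1).
trend = [2, 4, 6, 8, 10]
detrended = difference(trend, d=1)
print(detrended)  # [2, 2, 2, 2]

# Hypothetical ARMA(1, 1) forecast: 0.5 * last value + 0.4 * last error.
forecast = arma_one_step(history=[1.0, 2.0], errors=[0.5], phi=[0.5], theta=[0.4])
print(forecast)
```

Here p and q are simply the lengths of `phi` and `theta`, and d is the number of differencing passes applied before forecasting.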
Additionally, the implementation of ARIMA that NK Labs uses automatically selects the optimal order of the model. This is done by first selecting the integration parameter that makes the time series stationary and then by minimizing an information criterion, a measure of how much information is lost by the model [ref:1,2].
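The two-step idea can be sketched in plain Python under some loud simplifications. A crude variance rule stands in for a formal stationarity test (it can over-difference strongly autocorrelated series), and only the autoregressive order p is searched, with q fixed at 0, using the AIC computed from a Levinson-Durbin recursion. All function names and the simulated data are assumptions for illustration, not the implementation NK Labs uses.

```python
import math
import random
from statistics import pvariance


def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def choose_d(series, max_d=2):
    """Heuristic stand-in for a unit-root test: difference while variance keeps falling."""
    d, current = 0, list(series)
    while d < max_d:
        candidate = difference(current)
        if pvariance(candidate) >= pvariance(current):
            break
        d, current = d + 1, candidate
    return d, current


def autocovariance(x, max_lag):
    """Sample autocovariances at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[t] * c[t + k] for t in range(n - k)) / n for k in range(max_lag + 1)]


def select_ar_order(x, max_p=5):
    """Return the AR order p in 0..max_p that minimises AIC = n*ln(sigma^2) + 2p."""
    n = len(x)
    gamma = autocovariance(x, max_p)
    sigma2 = gamma[0]  # innovation variance of the order-0 model
    phi = []           # AR coefficients of the current order
    best_p, best_aic = 0, n * math.log(sigma2)
    for p in range(1, max_p + 1):
        # Levinson-Durbin update: k is the partial autocorrelation at lag p.
        k = (gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))) / sigma2
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2 *= 1 - k * k
        score = n * math.log(sigma2) + 2 * p
        if score < best_aic:
            best_p, best_aic = p, score
    return best_p


# Simulated ARIMA(1, 1, 0) series: cumulative sum of an AR(1) process (phi = 0.3).
random.seed(0)
w, level, series = 0.0, 0.0, []
for _ in range(1000):
    w = 0.3 * w + random.gauss(0.0, 1.0)
    level += w
    series.append(level)

d, stationary = choose_d(series)
p = select_ar_order(stationary)
print("selected d:", d, "selected p:", p)
```

A fuller search would also vary q and could swap the AIC for another information criterion; the structure (fix d first, then minimize the criterion over the remaining orders) stays the same.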
For readers who are especially interested in time series, another nice feature of the ARIMA model is that it covers a number of simple special cases. For example, an ARIMA model with p = 0, d = 1, and q = 0 is a simple random walk, an ARIMA model with p = 1, d = 0, and q = 0 is a first-order autoregression, and an ARIMA model with p = 0, d = 0, and q = 1 is a first-order moving average.
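These special cases can be checked with a small simulator. The sketch below is a minimal, zero-mean ARIMA(p, d, q) generator driven by a fixed noise sequence; the function name `simulate_arima` and the coefficient values are hypothetical.

```python
def simulate_arima(noise, phi=(), theta=(), d=0):
    """Generate an ARIMA(p, d, q) path from a given noise sequence.

    p = len(phi), q = len(theta); no constant term is included.
    """
    p, q = len(phi), len(theta)
    w = []  # the ARMA part, i.e. the series after d rounds of differencing
    for t, e in enumerate(noise):
        ar = sum(phi[i] * w[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * noise[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        w.append(ar + ma + e)
    # Undo the differencing by taking cumulative sums d times.
    for _ in range(d):
        total, integrated = 0.0, []
        for value in w:
            total += value
            integrated.append(total)
        w = integrated
    return w


noise = [1.0, -1.0, 2.0, 0.5]

random_walk = simulate_arima(noise, d=1)        # p = 0, d = 1, q = 0
ar1 = simulate_arima(noise, phi=(0.5,))         # p = 1, d = 0, q = 0
ma1 = simulate_arima(noise, theta=(0.5,))       # p = 0, d = 0, q = 1

print(random_walk)  # [1.0, 0.0, 2.0, 2.5]
print(ar1)          # [1.0, -0.5, 1.75, 1.375]
print(ma1)          # [1.0, -0.5, 1.5, 1.5]
```

The random walk is just the running sum of the noise, the AR(1) path carries half of its previous value forward, and the MA(1) path carries half of the previous noise term forward.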
To see our ARIMA time series prediction model in action, let’s walk through an example. We’ll use one of the test datasets from the D3M program: the number of annual sunspot observations from 1700 to 1989. For context, sunspots are dark spots on the surface of the sun caused by strong magnetic fields rising from the sun’s interior.
Figure 1 shows the number of annual sunspot observations from 1700 to 1989. The annual observations appear to have a cyclical component, and this hunch is confirmed by the seasonal decomposition in Figure 2. In fact, the number of annual sunspot observations repeats on a roughly 11-year cycle [ref:5], which we can use to specify a seasonal parameter in the ARIMA model. Figure 3 compares the forecast from the ARIMA model against the actual number of sunspot observations for 1961-1989, with 1700-1960 used as the training set.
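The evaluation setup (train on 1700-1960, score on 1961-1989, with a seasonal period of 11) can be sketched as follows. Two assumptions to flag: the series here is a synthetic sine wave standing in for the sunspot data, which is not reproduced in this post, and the forecaster is a seasonal-naive baseline (repeat the value from 11 years earlier) rather than the fitted ARIMA model behind Figure 3. All names are hypothetical.

```python
import math

SEASON = 11  # approximate sunspot cycle length in years, per the text


def seasonal_naive_forecast(train, horizon, season=SEASON):
    """Forecast by repeating the last observed cycle (a seasonal random walk)."""
    last_cycle = train[-season:]
    return [last_cycle[h % season] for h in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between held-out values and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic stand-in for the annual series (NOT the real sunspot counts).
years = list(range(1700, 1990))
series = [50 + 40 * math.sin(2 * math.pi * (y - 1700) / SEASON) for y in years]

# Split by year: everything before 1961 is training data.
split = years.index(1961)
train, test = series[:split], series[split:]

forecast = seasonal_naive_forecast(train, horizon=len(test))
mae = mean_absolute_error(test, forecast)
print(len(train), "training years,", len(test), "test years")
```

On perfectly periodic synthetic data this baseline is essentially exact, which is the point of the exercise: it shows the split and scoring mechanics, not the difficulty of the real series, whose cycles vary in length and amplitude.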
Even though recent research often favors machine learning techniques such as LSTMs, ANNs, and SVMs over ARIMA models, there are still a number of reasons to prefer ARIMA. First, compared to their deep learning counterparts, ARIMA models are more interpretable, which means that their forecasts can be explained more intuitively. Second, because ARIMA is a statistical model with an explicit error term, its forecasts naturally come with confidence intervals, whereas most deep learning techniques produce only point forecasts by default. Finally, because ARIMA models handle linear patterns well, several recent approaches combine ARIMA models and neural networks into hybrid techniques. TL;DR, the classic ARIMA model still has value.
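To make the confidence-interval point concrete, here is a minimal sketch for the simplest case, a zero-mean AR(1) model with known coefficient and Gaussian errors. The h-step forecast error variance is sigma^2 * (1 + phi^2 + ... + phi^(2(h-1))), so the interval widens with the horizon. The function name and parameter values are assumptions for illustration.

```python
import math


def ar1_forecast_with_interval(last_value, phi, sigma, horizon, z=1.96):
    """Return (point, lower, upper) for horizons 1..horizon of a zero-mean AR(1).

    z = 1.96 gives an approximate 95% interval under Gaussian errors.
    """
    results = []
    variance = 0.0
    for h in range(1, horizon + 1):
        point = (phi ** h) * last_value
        # Each extra step adds one more discounted noise term to the error.
        variance += sigma ** 2 * phi ** (2 * (h - 1))
        half_width = z * math.sqrt(variance)
        results.append((point, point - half_width, point + half_width))
    return results


forecasts = ar1_forecast_with_interval(last_value=2.0, phi=0.5, sigma=1.0, horizon=3)
for h, (point, lower, upper) in enumerate(forecasts, start=1):
    print(f"h={h}: {point:.3f} in [{lower:.3f}, {upper:.3f}]")
```

The point forecast decays toward the mean while the band grows, which is the kind of uncertainty statement a plain point-forecasting network does not provide without extra machinery.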
How Does New Knowledge Use ARIMA?
NK Labs’ ARIMA prediction model is one of the computational techniques that New Knowledge uses to identify and counter disinformation. Specifically, ARIMA can be used to forecast time series data such as a user’s post frequency or retweet frequency on social media. These forecasts can then serve as one module in a larger computational strategy to anticipate anomalous activity, alert organizations about the findings, and develop early response strategies. Thanks to tools like ARIMA, brands using New Knowledge will be able to proactively detect disinformation before it can impact their reputation.
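One simple way a forecast could feed such a module is to flag observations that fall outside the forecast interval. The sketch below is purely illustrative: the function name, the hourly retweet counts, and the interval bounds are all made up, and it does not describe New Knowledge’s actual detection pipeline.

```python
def flag_anomalies(observed, lower, upper):
    """Return the indices where an observed value falls outside its forecast interval."""
    return [
        i
        for i, (value, lo, hi) in enumerate(zip(observed, lower, upper))
        if value < lo or value > hi
    ]


# Hypothetical hourly retweet counts for one account, with forecast bounds
# that would come from a time series model such as ARIMA.
observed = [12, 15, 90, 14]
lower = [5, 5, 5, 5]
upper = [25, 25, 25, 25]

anomalous_hours = flag_anomalies(observed, lower, upper)
print(anomalous_hours)  # [2]
```

In practice the flagged indices would be only one signal among several, passed on to whatever alerting and review steps sit downstream.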