A Study on the Methodology of Visitor Forecasting through the Combination of ARIMA Analysis and Search Frequency Analysis.

Author:

Eoncg

Date:

December 17, 2018

Namjae Cho, Seokwan Mun (2018), "Agent Licensing Effect of Cause-Related Activity on Product Choice Justification," Journal of IT Applications & Management 19th, pp. 59-64

Seokwan Mun, Consultant

Abstract

In the age of the Fourth Industrial Revolution, large amounts of data are generated, the importance of data analysis is emphasized, and research on data analysis for prediction is increasing. In addition, predictions are increasingly made by combining multiple methodologies.

This study aims to verify whether combining estimates from different methodologies improves forecasting accuracy. Specifically, it tests the effect of combining the ARIMA (Autoregressive Integrated Moving Average) model with multiple regression analysis of search frequency data.

1. Introduction

Big data analysis is now used in many industries, driven by falling semiconductor prices and the appearance of various open-source analysis tools. In particular, the importance of forecasting is increasing, and many kinds of data (time series data, SNS data, search frequency data, GPS data, etc.) are analyzed with many kinds of methods (time series analysis, regression analysis, deep learning, etc.).

This study aims to validate that a combination of two analysis methods forecasts more effectively than either method alone. The target industry is tourism, one of the fastest growing industries in the world today, where effective forecasting of tourist numbers plays a major role in decision making. We forecast the number of tourists with several forecasting methods and compare them to find the most effective one.

2. Data

2.1 Time series Data

This study collects the data of Chinese visitor statistics from January 2011 to March 2017 provided by TourGo (https://www.tour.go.kr/) of Korea Culture & Tourism Institute.

Figure 1 – Time series data (from China to Korea)

Figure 1 shows a rapid decline in the number of visitors from June 2015 to August 2015 due to MERS (Middle East Respiratory Syndrome). Since the average residual during these three months is below -3, the data for this period are treated as outliers and removed.

2.2 Search Frequency Data

Baidu, the largest portal in China, provides search frequency data through Baidu Index. On Baidu Index, we select search words related to ‘Korean tour’, the term most often used in existing research, from September 2016 to February 2017. After excluding duplicates, we select five search words: ‘Korean tour’, ‘Korean tour tip’, ‘Seoul’, ‘Korean Visa’, and ‘famous Korean food’. Since Baidu Index provides data on a weekly basis, we convert it to a monthly basis using the weighting technique suggested by Christian and Kristoffer [10] so that it can be combined with the time series data.
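
The weekly-to-monthly conversion can be illustrated with a short sketch. The exact weighting scheme of [10] is not reproduced in this text, so the code assumes one plausible reading: each weekly value is spread over its seven days, and a month's value is the average over the days that fall in that month. The function name `weekly_to_monthly` and the sample numbers are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_to_monthly(weekly):
    """Convert (week_start, value) pairs to {(year, month): value}.

    Assumption: a week that straddles two months contributes to each
    month in proportion to the number of its days inside that month.
    """
    totals = defaultdict(float)
    days = defaultdict(int)
    for week_start, value in weekly:
        for offset in range(7):
            day = week_start + timedelta(days=offset)
            key = (day.year, day.month)
            totals[key] += value
            days[key] += 1
    return {key: totals[key] / days[key] for key in totals}


# Hypothetical search index values for two consecutive weeks.
sample = [(date(2016, 9, 26), 70.0), (date(2016, 10, 3), 140.0)]
monthly = weekly_to_monthly(sample)
```

In this example the first week contributes five days to September and two days to October, so the October value is weighted toward the second week.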

2.3 Sliding Window

Because the purpose of this study is to validate that the combination of two methods is more powerful than either single method, a large amount of data is required. Because the data source is limited, however, this study instead uses a sliding window to divide the available data into multiple samples.
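
A minimal sketch of the sliding-window split follows. The window length, the forecast horizon, and the one-step advance are parameters assumed for illustration; the function name `sliding_windows` is hypothetical and the toy series stands in for the monthly visitor data.

```python
def sliding_windows(series, window, horizon=1):
    """Yield (training_window, target) pairs, advancing one step at a time.

    `horizon` is how many steps after the end of the window the
    target observation lies.
    """
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        train = series[start:start + window]
        target = series[start + window + horizon - 1]
        yield train, target


# Toy series of 10 observations; each sample uses 4 points to predict
# the value 2 steps after the window ends.
toy_series = list(range(10))
samples = list(sliding_windows(toy_series, window=4, horizon=2))
```

Each shift of the window produces a new training set and target, which is how a single short series can yield many evaluation samples.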

3. Research Procedure

This study validates that combining ARIMA analysis of time series data with multiple regression analysis of search frequency data performs better than either single method. The research procedure is shown in Figure 2 below.

Figure 2 – Research Procedure

This study collects and preprocesses the data according to the research procedure in Figure 2. Search frequency data are used for multiple regression analysis, and time series data are used for ARIMA analysis. The forecasts of the two methods are combined by SA (Simple Average), as suggested by Aksu and Gunter [11]. To compare the performance of the combination with that of each single method, we calculate MAE (Mean Absolute Error) as the accuracy measure and use a t-test to validate the difference in MAE.

4. Methods

4.1 ARIMA

The ARIMA model, developed by Box and Jenkins, is a stochastic model for time series that behave like a random walk. Because the AR (Autoregressive) model (1) and the MA (Moving Average) model (2) alone are not suited to non-stationary data such as a random walk and most real-world time series, ARIMA integrates the two with differencing (3).

φ_1, φ_2, φ_3, …, φ_p in (1) are the parameters of the AR model, and ε_t is a white noise error term with mean zero and variance σ^2. In (2), ε_t is the error term, and θ_1, θ_2, θ_3, …, θ_q are the parameters of the MA model. (3) is the differencing process using the backshift operator B, with ∇ = (1 - B). (4) expresses the differenced series of {Y_t} as a new variable w_t.

w_t is the observed value of the stationary (differenced) series, and d is the order of differencing applied to the non-stationary series. The ARIMA model is written ARIMA(p,d,q).
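
The equations referred to as (1) to (4) are not reproduced in this text. For reference, the standard textbook forms consistent with the symbol definitions above are sketched below, with \phi for the AR parameters. The minus signs in the MA model follow the Box-Jenkins convention, which is an assumption; the original equations may use a different sign convention or layout.

```latex
\begin{align*}
% AR(p) model, referred to as (1) in the text
Y_t &= \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t \\
% MA(q) model, referred to as (2) in the text
Y_t &= \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_q \varepsilon_{t-q} \\
% First difference via the backshift operator, B Y_t = Y_{t-1}
\nabla Y_t &= (1 - B) Y_t = Y_t - Y_{t-1} \\
% d-th order differencing gives the stationary series w_t
w_t &= \nabla^d Y_t = (1 - B)^d Y_t \\
% ARIMA(p,d,q): an ARMA(p,q) model applied to w_t
w_t &= \phi_1 w_{t-1} + \dots + \phi_p w_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q}
\end{align*}
```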

4.2 Multiple regression

Multiple regression builds a model of the functional relationship between independent variables and a dependent variable in order to express the causal relationship between phenomena and variables. Simple linear regression is used when there is a single independent variable, and multiple regression when there are several.

4.3 Combined Forecasting (SA, Simple Average)

SA is one of the simplest combining methods: it takes the average of the estimates from the two methods. The SA equation is given in (5).

F^A is the estimate from ARIMA analysis of the time series data, the second term is the estimate from multiple regression analysis of the search frequency data, and F^S.A is the combination of the two estimates by SA. As shown in (6), we also search for the most accurate weight by varying the weight (a) instead of fixing it at 0.5 for each method.
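
The combination step can be sketched as follows, assuming the weighted form a * F^A + (1 - a) * (regression estimate), which reduces to SA at a = 0.5. The function names, the grid of candidate weights, and the sample numbers are hypothetical; the study's own equations (5) and (6) are not reproduced in this text.

```python
def mae(actual, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def combine(arima_forecasts, regression_forecasts, a=0.5):
    """Weighted combination; a = 0.5 gives the Simple Average (SA)."""
    return [a * fa + (1 - a) * fr
            for fa, fr in zip(arima_forecasts, regression_forecasts)]


def best_weight(actual, arima_forecasts, regression_forecasts, steps=10):
    """Grid-search the weight a in [0, 1] that minimises MAE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda a: mae(actual, combine(arima_forecasts, regression_forecasts, a)),
    )


# Hypothetical forecasts whose errors point in opposite directions.
actual = [100, 200]
arima = [110, 210]
regression = [90, 190]
sa = combine(arima, regression)          # simple average
a_star = best_weight(actual, arima, regression)
```

When the two methods err in opposite directions, as in this toy case, averaging cancels part of the error, which is the intuition behind combined forecasting.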

5. Forecast

5.1 Forecast by Time series data

Figure 3 – Chinese visitors (left) and the series after one non-seasonal difference (right)

As shown in Figure 3, the mean and variance of the series on the left do not look constant, so the series is converted into a stationary time series by one non-seasonal difference.

The ACF shows spikes outside the 95% confidence bounds at lags 12 and 24, which indicates seasonality, so one seasonal difference is necessary. When seasonal and non-seasonal differencing are applied at the same time, over-differencing must be considered. Using both differences increases the standard error, so this study uses only the seasonal difference.
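
The seasonality check described above can be sketched in a few lines. The code computes the sample ACF, compares it with the approximate 95% bound of 1.96 / sqrt(n), and applies a lag-12 difference. The helper names and the synthetic series are hypothetical, and the bound is the usual large-sample approximation rather than the exact limits of any particular statistics package.

```python
import math


def difference(series, lag=1):
    """Lagged difference: lag=1 is non-seasonal, lag=12 is seasonal (monthly)."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def acf(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


def exceeds_bound(series, lag):
    """True if the ACF at this lag lies outside the approximate 95% bounds."""
    return abs(acf(series, lag)) > 1.96 / math.sqrt(len(series))


# Synthetic example: six years of monthly values with a pure 12-month cycle.
seasonal = [10 + 5 * math.sin(2 * math.pi * t / 12) for t in range(72)]
spike_at_12 = exceeds_bound(seasonal, 12)
deseasonalised = difference(seasonal, lag=12)
```

For this synthetic series the ACF at lag 12 is well outside the bound, and the lag-12 difference removes the cycle entirely.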

In the resulting ACF and PACF, the values decline after lag 1. Considering MA(1) and AR(1) terms, we compare the following three models: ARIMA(1,0,1)(0,1,0)12, ARIMA(1,0,0)(0,1,0)12, and ARIMA(0,0,1)(0,1,0)12.

First, we check the t-statistics and p-values to validate the suitability and statistical significance of the parameters. Table 2 indicates that the significance of ARIMA(1,0,1)(0,1,0)12 is greater than 0.05 and the absolute value of its t-statistic is less than 2. Comparing Normalized BIC and MAPE in Table 1, ARIMA(0,0,1)(0,1,0)12 is the most accurate model, with the lowest Normalized BIC and MAPE. Since the ACF and PACF of the residuals of ARIMA(0,0,1)(0,1,0)12 lie within the 95% confidence bounds, ARIMA(0,0,1)(0,1,0)12 is selected as the forecasting model.

In the same way, we select and forecast with models for the remaining 33 sets.

Table 1 – Normalized BIC and MAPE

Model	Normalized BIC	MAPE
ARIMA(1,0,1)(0,1,0)12	21.278	12.6
ARIMA(1,0,0)(0,1,0)12	21.184	12.192
ARIMA(0,0,1)(0,1,0)12	20.935	11.829
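
The selection rule applied to this table (lowest Normalized BIC and lowest MAPE) can be expressed as a small sketch. The numbers are copied from the table; the dictionary layout and variable names are illustrative only.

```python
# (Normalized BIC, MAPE) per candidate model, copied from the table above.
candidates = {
    "ARIMA(1,0,1)(0,1,0)12": (21.278, 12.6),
    "ARIMA(1,0,0)(0,1,0)12": (21.184, 12.192),
    "ARIMA(0,0,1)(0,1,0)12": (20.935, 11.829),
}

# Lower is better for both criteria.
best_by_bic = min(candidates, key=lambda model: candidates[model][0])
best_by_mape = min(candidates, key=lambda model: candidates[model][1])
```

Both criteria point to the same model here, so no trade-off between fit and complexity has to be resolved.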

Figure 4 – ACF and PACF of the residual of ARIMA(0,0,1)(0,1,0)12

5.2 Forecast by Search Frequency data

In the same way as the ARIMA analysis of the time series data, we forecast the number of Chinese visitors in February 2014 with data set 1 (January 2011 to December 2013). The independent variables are the search frequencies (‘Korean tour’, ‘Korean tour tip’, ‘Seoul’, ‘Korean Visa’, ‘famous Korean food’) in month Y provided by Baidu Index. The dependent variable is the number of Chinese visitors in month Y+1. With these variables, we build a regression model.

The regression equation is (8). Its five independent variables are, in order, the search frequencies of ‘Korean tour’, ‘Korean tour tip’, ‘Seoul’, ‘Korean Visa’, and ‘famous Korean food’. The same equation is used for forecasting on the remaining 33 sets.
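
The lagged regression setup (search frequencies in month Y, visitors in month Y+1) can be sketched as below. This is a plain least-squares fit via the normal equations, written without external libraries; it is not the study's fitted equation, and the function names and the two-variable toy data are hypothetical.

```python
def lagged_pairs(search_rows, visitors):
    """Pair search frequencies at month Y with visitor counts at month Y+1."""
    return search_rows[:-1], visitors[1:]


def fit_ols(rows, targets):
    """Ordinary least squares with intercept.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan
    elimination and returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefficients, row):
    """Apply the fitted equation to one row of independent variables."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Toy data with two search terms; visitors[t + 1] = 1 + 2*s1[t] + 3*s2[t].
search = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3], [4, 4]]
visitors = [0, 3, 4, 6, 8, 12]
X, y = lagged_pairs(search, visitors)
coefficients = fit_ols(X, y)
next_month = predict(coefficients, search[-1])
```

The last row of search data has no matching visitor count yet, so it is the input used to forecast the following month.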

5.3 Combination of methods

To combine the estimates of the two methods (ARIMA analysis of time series data and multiple regression analysis of search frequency data), we use both SA (Simple Average) and a range of weightings to find the most accurate forecasting method.

6. Result

Across the 34 forecast sets, the MAE of ARIMA analysis with time series data is 72,682 and the MAE of multiple regression analysis with search frequency data is 72,663, so the difference between the two single methods is not significant. In contrast, the MAE of the combination by SA (a weight of 0.5 for each method) is 61,946, roughly 15% lower than either single method. To validate the difference in MAE, we use a t-test. The p-values for the comparison of SA with ARIMA and with multiple regression analysis are 0.044 and 0.048, respectively. Therefore, we conclude that the difference between SA and each single method is significant.
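
The comparison of errors can be illustrated with a sketch that computes absolute errors per forecast set and a paired t statistic on their differences. Whether the study used a paired or an independent t-test is not stated, so the paired form is an assumption, and the numbers below are invented for illustration rather than taken from the study. The p-value would then be read from a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def absolute_errors(actual, predicted):
    """Absolute error of each forecast; their mean is the MAE."""
    return [abs(a - p) for a, p in zip(actual, predicted)]


def paired_t_statistic(errors_a, errors_b):
    """t statistic for the mean of the paired differences errors_a - errors_b."""
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Invented values for four forecast sets.
actual = [100, 120, 130, 150]
single_method = [90, 135, 120, 170]
combined = [98, 125, 128, 158]

err_single = absolute_errors(actual, single_method)
err_combined = absolute_errors(actual, combined)
mae_single = statistics.mean(err_single)
mae_combined = statistics.mean(err_combined)
t_stat = paired_t_statistic(err_single, err_combined)
```

A large positive t statistic indicates that the single method's errors are consistently larger than those of the combination across the same forecast sets.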

7. Conclusion

The purpose of this study is to validate that the combination of two methods (ARIMA analysis with time series data and multiple regression analysis with search frequency data) forecasts more accurately than either single method.

To validate this, we use Chinese visitor data and Baidu Index data. The results show that the combination of the two methods forecasts more accurately than either single method.

In conclusion, this study indicates that combining multiple methods can be more effective in forecasting than using a single method.

When analyzing data, not only analytical and statistical methods but also domain knowledge of the data is necessary to understand the characteristics of the data and draw better insight.

EON Consulting Group

A 10F, Room 1001, 331 Gangnam-daero, Seocho-gu, Seoul 06627 (Gwangil Building, Seocho-dong)

T 02-556-8088

E eoncg@eoncg.com