Regional Training on Data Estimation and Imputation

December 16, 2021

An ASEAN Regional Training on Data Estimation and Imputation (DEI), supported by the ARISE+ project, took place from 6 to 9 December and was followed by a dedicated training session for ASEANStats on 10 December 2020. More than 30 participants attended the regional training.

The training provided participants with methodologies, tools, procedures and examples for estimating data and handling missing values in any dataset. Missing values are a fact of life in developing and developed countries alike, and they limit the scope for meaningful analysis and dissemination. Yet many statistical offices do little or nothing about them for lack of knowledge of the proper methods and tools.

International organisations such as the World Bank, the IMF and the United Nations, as well as the ASEAN Secretariat, receive data from countries with some values missing. They face an additional challenge because they also need to aggregate the data across countries. The World Bank has developed a methodology for data aggregation and estimation, which ASEANStats adopted in support of the publication of ASEAN@50 in 2017, a publication that required many long time series covering 1967 to 2015.

The DEI training provided participants from the Working Group on SDG Indicators and the Working Group on System of National Accounts with proper methodologies, tools and procedures for data estimation and data imputation. It covered the most widely used tools, such as multiple regression analysis and automatic Autoregressive Integrated Moving Average (Auto-ARIMA) modelling. These tools allow values within the dataset to be predicted and the series to be forecast some periods ahead.
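To illustrate how a regression can predict a value inside a dataset, the following is a minimal sketch in Python rather than the R scripts used in the training, which are not reproduced here. It assumes a single predictor and a made-up five-point series; the function names `fit_line` and `regression_fill` are hypothetical, and a real application would use multiple predictors and check the model diagnostics first.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return a, b


def regression_fill(xs, ys):
    """Replace None entries in ys with predictions from the fitted line."""
    # Fit only on the cases where the target value is observed.
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Keep observed values as they are; predict only the gaps.
    return [y if y is not None else a + b * x for x, y in zip(xs, ys)]


# Made-up illustrative data: a period index and an indicator with one gap.
periods = [1, 2, 3, 4, 5]
values = [10.0, 12.0, None, 16.0, 18.0]
filled = regression_fill(periods, values)
print(filled)
```

The same fitted line could be evaluated at periods beyond the last observation to extrapolate, although a time series model such as ARIMA is usually the better choice for forecasting.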

Another very important set of tools, meant specifically for data imputation, was also covered. Imputation handles missing values by replacing them with the best possible values, taking into account relevant variables and their behavioural relations. The techniques ranged from ad hoc imputation to Multiple Imputation by Chained Equations (MICE), along with more sophisticated estimation methods such as predictive mean matching (PMM).
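The idea behind predictive mean matching can be sketched as follows. This is a simplified, single-imputation illustration in Python, not the MICE procedure taught in the training: it assumes one predictor, uses made-up data, and skips the step in which full PMM draws the regression coefficients at random before matching. The names `fit_line` and `pmm_impute` are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries in ys with an observed value borrowed from a donor.

    Donors are the k observed cases whose predicted values lie closest to
    the predicted value of the missing case; one donor is picked at random.
    """
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    result = []
    for x, y in zip(xs, ys):
        if y is not None:
            result.append(y)
            continue
        target = a + b * x  # predicted mean for the missing case
        donors = sorted(observed, key=lambda d: abs((a + b * d[0]) - target))[:k]
        result.append(rng.choice(donors)[1])  # borrow a real observed value
    return result


# Made-up illustrative data with one missing value.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, 9.8, 12.1]
imputed = pmm_impute(xs, ys)
print(imputed)
```

Because the imputed value is always borrowed from an observed case, it stays within the range of plausible values, which is one reason PMM is popular for skewed or bounded variables.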

The training emphasized the prerequisites for a good regression model, time series model and imputation model, as well as the cautions needed when dealing with a model's flaws and limitations. Step-by-step procedures and examples with real data were provided, including an exercise on data imputation using the ASEAN SDG dataset, and the results were reviewed and discussed.

All programming work was carried out in R using RStudio. R is free, widely used statistical software distributed through CRAN (https://cran.r-project.org/mirrors.html). All scripts developed for the training, including those for data manipulation, data visualization, regression and time series analysis, and data imputation, were also provided to the participants.

Participants now know the methodology, tools and procedures, and have seen how these tools are used to address missing values. ASEANStats encouraged participants to use what they learned in this training to improve data availability. Indeed, data forecasting and imputation are best carried out at the country level, because countries know their own data best and have the supporting data and knowledge needed for proper estimation and imputation.