Again about data pre-processing
Institute of Chemistry, University of Silesia, 40-006 Katowice, Szkolna 9, Poland; firstname.lastname@example.org
Every experimental study involves many steps, and each step substantially influences the final result. Steps such as experimental design, data analysis and validation are supported by established knowledge and well-defined rules. There are also rules concerning sample collection and preparation, as well as ‘good practice’ for measurements. Unfortunately, the situation is quite different when data pre-processing is considered. This step is data- and goal-dependent and, as such, requires a unique treatment well suited to the data at hand. Nowadays, pre-processing is the most time-consuming and challenging step. It aims to eliminate undesired data variance associated with noise, background and shifts, as well as the size effect (present due to an unknown overall sample concentration) and differences in the contributions of the studied variables to the total data variance. There are many available methods of data pre-processing, but they cannot be treated as interchangeable alternatives. Their assumptions and principles should be taken into account, and their choice should be based on the data characteristics.
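As a minimal illustration of two of the effects named above, the following Python sketch simulates a size effect (an unknown dilution factor per sample) on variables of very different magnitude and shows how a single large variable dominates the total variance until the variables are autoscaled. The data, the dilution range and the helper names `variance_share` and `autoscale` are illustrative assumptions, not material from the presented study.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_vars = 20, 6

# Hypothetical profiles: variables measured on very different scales.
scales = np.array([1.0, 5.0, 10.0, 50.0, 100.0, 500.0])
profiles = scales * rng.lognormal(sigma=0.1, size=(n_samples, n_vars))

# Size effect: each sample is multiplied by an unknown overall factor.
dilution = rng.uniform(0.5, 2.0, size=(n_samples, 1))
X = profiles * dilution


def variance_share(data):
    """Fraction of the total data variance contributed by each variable."""
    v = data.var(axis=0, ddof=1)
    return v / v.sum()


def autoscale(data):
    """Center each variable and scale it to unit standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)


print("variance share, raw data:       ", np.round(variance_share(X), 3))
print("variance share, autoscaled data:", np.round(variance_share(autoscale(X)), 3))
```

Autoscaling equalizes the contributions of the variables but does not remove the size effect; that is the role of the normalization step.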
This presentation aims to outline important problems of data pre-processing and to illustrate them on real and simulated data sets. Some rules for data pre-processing will be discussed and a few diagnostic tools for data characterization will be demonstrated.
Special attention will be given to the data normalization step, which is particularly important in metabolomics studies. Recently, a new approach to the normalization of metabolomics data, based on Compositional Data Analysis (CODA), has been popularized (1). It rests on the attractive idea of working with log ratios, thus eliminating the data normalization step. CODA was tested in our previous simulation study, where its performance was compared with that of Total Sum Normalization, Probabilistic Quotient Normalization and Pair-wise Log Ratios; the study showed that the CODA transformations should not be applied to identify biomarkers (2). These conclusions do not coincide with those presented in (1). As stated in (1), the discrepancy was probably caused by the limited number of variables considered in our study. We therefore undertook a new simulation study, working with data sets containing much higher numbers of variables. An additional motivation appeared when we realized that there was another question to be answered: the Pair-wise Log Ratio (plr) method was observed to perform well (2), whereas CODA, based on a similar concept, did not. The presented results explain this apparent contradiction in terms of the local and global characteristics of the features transformed by plr and by the centered log-ratio (clr) transformation of CODA, respectively.
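For readers unfamiliar with the methods being compared, the sketch below gives textbook-style Python versions of the four transformations. It is a simplified illustration under stated assumptions, not the code used in (1) or (2): the function names are hypothetical, the data must be strictly positive, and the Probabilistic Quotient Normalization shown here takes the median spectrum as the default reference and omits any preliminary scaling. It also makes the local versus global distinction concrete: each plr feature involves only two variables, whereas each clr feature is referenced to the geometric mean of all variables in the sample.

```python
import numpy as np


def total_sum_normalization(X):
    """Divide each sample (row) by its total signal."""
    return X / X.sum(axis=1, keepdims=True)


def pqn(X, reference=None):
    """Simplified Probabilistic Quotient Normalization.

    Each sample is divided by the median of its quotients to a
    reference spectrum (by default, the median spectrum of X).
    """
    if reference is None:
        reference = np.median(X, axis=0)
    quotients = X / reference
    factors = np.median(quotients, axis=1, keepdims=True)
    return X / factors


def clr(X):
    """Centered log-ratio: log of each value relative to the
    geometric mean of all variables in the same sample (global)."""
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)


def pairwise_log_ratios(X):
    """All pairwise log ratios log(x_i / x_j) for i < j (local)."""
    log_x = np.log(X)
    i, j = np.triu_indices(X.shape[1], k=1)
    return log_x[:, i] - log_x[:, j]


# Hypothetical data: 5 samples, 8 variables, with a simulated size effect.
rng = np.random.default_rng(0)
X_true = rng.lognormal(size=(5, 8))
X_diluted = X_true * rng.uniform(0.5, 2.0, size=(5, 1))

# Both log-ratio transformations are unaffected by the size effect.
print(np.allclose(clr(X_true), clr(X_diluted)))
print(np.allclose(pairwise_log_ratios(X_true), pairwise_log_ratios(X_diluted)))
```

Invariance to the size effect is shared by plr and clr; the sketch does not address their suitability for biomarker identification, which is the subject of the simulation study.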
Acknowledgment: The author acknowledges the financial support of the project PL-RPA/ROOIBOS/05/2016, carried out within the framework of the bilateral agreement co-financed by the National Research Foundation (NRF), South Africa, and the National Centre for Research and Development (NCBR), Poland.
References: 1. A. Gardlo, A. K. Smilde, K. Hron, M. Hrda, R. Karlikova, D. Friedecky, T. Adam, Normalization techniques for PARAFAC modeling of urine metabolomic data, Metabolomics 12, 117 (2016); 2. P. Filzmoser, B. Walczak, What can go wrong at the data normalization step for identification of biomarkers?, Journal of Chromatography A 1362, 194-205 (2014).