I4: Johan Westerhuis

VARIABLE SELECTION AND CLASSIFICATION IN THE PRESENCE OF OBSERVATIONS BELOW THE DETECTION LIMIT USING ERROR RATE P-VALUES

Mari van Reenen3, Johan A Westerhuis1,3, Carolus J Reinecke3, J Hendrik Venter2

1Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

2Centre for Business Mathematics and Informatics, Faculty of Natural Sciences, North-West University (Potchefstroom Campus), Private Bag X6001, Potchefstroom, South Africa

3Centre for Human Metabolomics, Faculty of Natural Sciences, North-West University (Potchefstroom Campus), Private Bag X6001, Potchefstroom, South Africa

We introduce an approach that uses minimum classification error rates as test statistics to find discriminatory variables. The thresholds that yield the minimum error rates can then be used to classify new subjects. Because the approach transforms error rates into p-values, it is referred to as ERp. ERp can handle unequal and small group sizes and can account for the cost of misclassification. Metabolomics studies often contain many values below the detection limit, indicated by zeros in the data table. We extended ERp (to XERp) to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. XERp identifies variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and from shifts in the distributions of the non-zero observations. To demonstrate its utility, we apply XERp to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp provides an informative shortlist of discriminatory variables while attaining satisfactory classification accuracy for new subjects under leave-one-out cross-validation.
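The core idea of using a minimum error rate as a test statistic can be illustrated with a minimal sketch for a single variable. The abstract does not state how error rates are converted to p-values, so the label-permutation null used here is an assumption, as are the function names `min_error_rate` and `permutation_p_value` and the toy data. The sketch uses an unweighted error rate, ignores misclassification costs, and treats zeros as ordinary low values, so it does not reproduce the XERp handling of the two kinds of zeros.

```python
import random


def min_error_rate(values, labels):
    """Return (lowest error rate, threshold, direction) over single-threshold rules.

    direction "above" means: assign group 1 when the value exceeds the threshold.
    """
    n = len(values)
    distinct = sorted(set(values))
    # Candidate thresholds: one below the data, midpoints between
    # consecutive distinct values, and one above the data.
    cuts = [distinct[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    cuts.append(distinct[-1] + 1.0)

    best = (1.0, None, None)
    for c in cuts:
        # Misclassifications under the rule "group 1 when value > c".
        errors = sum((v > c) != (y == 1) for v, y in zip(values, labels))
        # The mirrored rule ("group 1 when value <= c") makes n - errors mistakes.
        for direction, e in (("above", errors), ("below", n - errors)):
            if e / n < best[0]:
                best = (e / n, c, direction)
    return best


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Assumed null: group labels are exchangeable (label permutation)."""
    rng = random.Random(seed)
    observed = min_error_rate(values, labels)[0]
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations that separate the groups at least as well.
        if min_error_rate(values, shuffled)[0] <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: six controls (0) and six patients (1), with zeros in both.
values = [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0, 1.2, 1.5, 1.8, 2.1, 2.4]
labels = [0] * 6 + [1] * 6

rate, cut, direction = min_error_rate(values, labels)
p = permutation_p_value(values, labels)
print("minimum error rate:", rate)
print("threshold:", cut, "direction:", direction)
print("permutation p-value:", p)
```

The threshold returned for a selected variable is the same one that would be applied to classify a new subject, which is how variable selection and classification are tied together in this kind of approach.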