Is PCA for statistical fingerprinting still relevant in the age of machine learning?
November 16, 2022
Michael Bock (mbock@intell-group.com) (TIG Environmental, Portland, ME, USA) and Nicholas Rose (nrose@intell-group.com) (TIG Environmental, New York, NY, USA)
In the age of big data, machine learning (ML) methods such as t-SNE and UMAP have been developed to analyze and classify high-dimensional data. These methods are increasingly used for statistical fingerprinting in environmental forensics. t-SNE and UMAP are closely related methods that use non-linear dimensionality reduction to preserve as much of the data structure as possible, chiefly local neighborhood relationships, in a low-dimensional embedding. PCA, by contrast, applies an orthogonal linear transformation that concentrates as much of the variance as possible in the lowest-numbered principal components. Typically only the first two or three principal components are plotted for analysis, although the higher-numbered components remain available and can be analyzed. We explored the trade-offs between the non-linear ML methods (t-SNE and UMAP) and PCA: how much of the data structure each preserves, the stochastic nature of the ML methods versus the deterministic nature of PCA, and the ramifications of non-linear versus linear methods. We applied these methods to datasets that represent different environmental sources and typical environmental alteration processes that can influence chemical profiles, including (1) simple mixing of sources, (2) physical weathering processes such as chromatographic separation or differential solubility, and (3) transformation processes such as the conversion of PFAS precursors to PFAS compounds.
The analyses demonstrate critical differences in what these methods reveal about sources and processes. We found that the ML methods are able to differentiate a larger number of unique sources. When many sources are present, PCA can easily be dominated by the most extreme sources, which masks more subtle differences. However, the ML methods were often ineffective at depicting mixing between sources in a meaningful way. In contrast, mixed sources typically manifest as a straight line connecting the sources in PCA space. The ML methods were also ineffective at meaningfully depicting weathering and transformation processes, which typically manifest as a curved trajectory in PCA space. These results show that while the newer methods are powerful tools, they often fail to provide meaningful insight into important processes such as source mixing and transformation. PCA continues to be a powerful tool for understanding processes that are critical to a forensic analysis.
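
The straight-line behavior of mixtures follows from the linearity of PCA: a sample that is a fraction f of one source and (1 - f) of another is projected to the same fraction of the way between the two sources' scores. The sketch below illustrates this with NumPy. It is not the analysis performed in this study: the three source profiles, the five analytes, and the helper name pca_scores are hypothetical values chosen only for illustration, and the data are noise-free.

```python
import numpy as np

# Hypothetical source profiles: fractional composition over five analytes.
# These values are illustrative assumptions, not data from the study.
source_a = np.array([0.50, 0.30, 0.10, 0.05, 0.05])
source_b = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
source_c = np.array([0.10, 0.60, 0.10, 0.10, 0.10])

# Samples: mixtures of A and B at several fractions of A, plus unmixed source C.
fractions = np.linspace(0.0, 1.0, 6)
mixtures = np.array([f * source_a + (1.0 - f) * source_b for f in fractions])
samples = np.vstack([mixtures, source_c])


def pca_scores(data, n_components=2):
    """Project mean-centered data onto its leading principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


scores = pca_scores(samples)
mix_scores = scores[: len(fractions)]

# Offsets of each mixture from pure B (f = 0), and the B-to-A vector (f = 1).
end_to_end = mix_scores[-1] - mix_scores[0]
offsets = mix_scores - mix_scores[0]

# Collinearity check: the 2-D cross product of each offset with the
# end-member vector should be zero up to floating-point error.
cross = offsets[:, 0] * end_to_end[1] - offsets[:, 1] * end_to_end[0]
mixtures_are_collinear = bool(np.allclose(cross, 0.0, atol=1e-12))

# Position along the line, as a fraction of the B-to-A distance.
recovered = offsets @ end_to_end / (end_to_end @ end_to_end)

print("A-B mixtures collinear in PC space:", mixtures_are_collinear)
print("Position along mixing line:", np.round(recovered, 3))
```

In this idealized case the position of each mixture along the line equals its mixing fraction. Real datasets add measurement noise and additional sources, so mixtures would be expected to scatter around the line rather than fall exactly on it.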