Two methods dominate routine chemometric work on vibrational spectra: principal component analysis and partial least squares regression. They look superficially similar. Both take a matrix of spectra, both decompose it into a small number of latent variables, both let an analyst replace 1,000 wavelength channels with a handful of scores. In practice the two methods answer different questions, and the choice between them shapes everything downstream - what you can validate, what regulators will accept, and what the model will do when the process drifts.

The distinction is not new. The form of PCA used in chemometrics was codified by Wold and colleagues in 1987; the modern formulation of PLS regression by the same group was published in 2001. The questions the two methods address have not changed in the intervening decades, and neither has the rule for choosing between them. What has changed is the volume of poorly chosen models in production, often because someone reached for PLS when PCA would have done the job, or the reverse.

This piece sets out the decision rule, the diagnostics that confirm the choice, and the failure modes to watch for. It is a companion to our chemometrics primer, which covers the underlying matrix algebra in more detail.

What each method actually does

PCA decomposes a spectral matrix X into scores T and loadings P such that X = T P' + E. The decomposition is driven by one criterion only: maximise the variance captured in each successive component. PCA has no knowledge of any reference value. It sees only the spectra. The first principal component is the direction in spectral space along which the samples spread the most; the second is the direction of greatest remaining spread orthogonal to the first; and so on.
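The decomposition can be sketched in a few lines of NumPy via the SVD. This is a minimal illustration on synthetic data, not a real instrument pipeline; the dimensions and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))      # 30 synthetic "spectra" x 200 wavelength channels
Xc = X - X.mean(axis=0)             # mean-centre first, as is standard before PCA

# The SVD of the centred matrix gives the PCA decomposition Xc = T P' + E
# for any truncation rank k
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
T = U[:, :k] * s[:k]                # scores, one row per sample
P = Vt[:k].T                        # loadings, one column per component
E = Xc - T @ P.T                    # residual matrix

# Each component captures s_i^2 / sum(s^2) of the total variance,
# in decreasing order - the "maximise variance" criterion in action
explained = s[:k] ** 2 / np.sum(s ** 2)
```

Note that the reference values never appear anywhere in this computation: PCA sees only the spectra.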

PLS, in contrast, decomposes X and the reference matrix Y simultaneously. It seeks latent variables in X that maximise the covariance with Y. The first PLS component is not the direction of greatest variance in the spectra - it is the direction of greatest covariance between the spectra and the property you want to predict. Every subsequent component is constrained by the same criterion. PLS knows about Y; PCA does not.

The consequence: PCA is descriptive, PLS is predictive. PCA tells you how samples differ from each other in spectral space. PLS tells you how to map a spectrum to a numerical answer.

The decision rule

If the task is to predict a continuous property (concentration, density, viscosity, an octane number, a tablet hardness) from a spectrum, use PLS. If the task is to describe the structure of a dataset - to look for outliers, to see whether samples cluster by batch or by operator or by season, to detect drift in a stream of routine measurements - use PCA. There is no scenario in which PCA produces a numerical prediction of a chemical property, and there is no scenario in which a PLS model substitutes for the descriptive overview that PCA scores plots provide.

Two corollaries follow:

  • A PAT model that goes into a regulated production line and predicts a release-relevant attribute is, with rare exceptions, a PLS model (or one of its variants - PLS2 when Y has several columns, PLS-DA when the y is a class label). The 2017 revision of ASTM E1655, which governs infrared multivariate calibration, is built around exactly this assumption.
  • The model-monitoring layer sitting on top of the predictive model is almost always a PCA model. PCA scores and Hotelling’s T-squared statistic are the standard tools for detecting that a current measurement falls outside the spectral space the PLS model was trained on. The two methods coexist on the same analyser, doing different jobs.

Diagnostics that confirm the choice

When the decision rule is followed, the diagnostics fall out cleanly:

  • For a PCA model the relevant metrics are the variance explained per component, the cumulative variance, the Hotelling’s T-squared limit, and the Q-residual (or SPE) limit. A useful PCA model for routine monitoring typically captures more than 95% of the variance in three to six components on a well-behaved process.
  • For a PLS model the relevant metrics are the root mean square error of cross-validation (RMSECV) and of prediction on an independent test set (RMSEP), the number of latent variables selected, the slope and bias of predicted-vs-reference, and the coefficient of determination on test data. ASTM E1655 specifies the form of these statistics in detail.

The number of components rarely matches across the two methods on the same data. A PLS model that uses four latent variables to predict a concentration will often coexist with a PCA monitor that needs eight components to span the spectral space adequately, because PCA must capture variance that has nothing to do with the analyte (temperature drift, fibre alignment, particle scatter) and PLS can ignore it.

When PCA looks like it predicts (principal component regression)

There is a hybrid called principal component regression, in which a PCA is run on X, the scores are kept, and a linear regression of Y on the scores is fitted afterwards. PCR was historically the alternative to PLS, before PLS became the default. It still works, and it is occasionally preferred when the analyst wants the spectral decomposition to be independent of the reference values - for example in robustness studies where the same decomposition must be used to predict several different properties.

PCR is not a reason to use PCA where PLS belongs. The decomposition step in PCR is still PCA in the strict sense - it ignores Y. The prediction is bolted on. In practice PLS almost always needs fewer latent variables than PCR for equal predictive performance, because PLS uses the available Y information to orient the decomposition. Martens and Naes’s 1989 monograph remains the most thorough treatment of the comparison.

Failure modes

Two failure modes are common enough to be worth flagging.

The first is fitting PLS to a dataset where the reference values are dominated by a single source of variation, and that source happens to be correlated with a non-analyte spectral feature (a temperature gradient across batches, for instance). PLS will happily build a model that uses the non-analyte feature to predict the reference, because the covariance is real. The model will fail the first time the process runs at a different temperature. The fix is to design the calibration set so the analyte and the nuisance variable are decorrelated, and to confirm with PCA scores that the calibration spans the operational temperature range.
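The decorrelation check is cheap enough to run before any model is fitted. The sketch below assumes a crossed calibration design in which every analyte level is measured at every temperature; the variable names and levels are hypothetical.

```python
import numpy as np

def design_correlation(analyte, nuisance):
    """Pearson correlation between the reference values and a nuisance
    variable across the calibration set; should be near zero."""
    return np.corrcoef(analyte, nuisance)[0, 1]

# Crossed design: three analyte levels, each measured at four temperatures
analyte = np.tile([1.0, 2.0, 3.0], 4)
temperature = np.repeat([20.0, 25.0, 30.0, 35.0], 3)

r = design_correlation(analyte, temperature)   # zero by construction
```

A calibration set collected opportunistically from routine production, where concentration and temperature drift together, would show |r| well away from zero - exactly the condition under which PLS learns the nuisance feature.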

The second is using PCA scores to make decisions that should rest on a quantitative prediction. A drifting cluster on a PCA scores plot is not the same thing as a concentration excursion. It is a signal that something has changed in spectral space, which is useful, but the only way to know whether the change matters for release is to predict the release attribute with the PLS model and compare against specification. Conflating the two leads to either false alarms (PCA drifted but the property is in spec) or missed excursions (the property drifted but stayed within the historical spectral envelope). Our piece on validating a chemometric model under GMP covers the procedural side of this distinction.

Where deep learning fits

A growing minority of process applications replace PLS with a convolutional or transformer-style regressor. The decision rule above does not change. A deep regressor still needs a separate spectral-space monitor, and PCA remains the unglamorous choice for that role. Recent literature on deep-learning chemometrics for process Raman shows the predictive layer evolving without disturbing the descriptive one. PCA is forty years old and still works.

Closing

PCA and PLS are not competitors. They are complementary tools for different parts of the same workflow: PLS predicts the number that goes into a control loop or a release decision, and PCA watches whether the spectrum still looks like one the predictor was trained on. Mixing up the two is the most common chemometric error we see in production, and almost always traces back to skipping the basic question of what the model is supposed to answer.