Planning a calibration set for chemometric models: a working checklist

A chemometric model is a regression. The dataset it is regressed on - the calibration set - constrains every claim that model is later allowed to make. Choose that dataset badly and validation either fails outright, or worse, passes on paper while the model misbehaves in production.

Five decisions, taken before any reference assay or any spectrum is collected, do most of the work. The rest of the project - PLS or PCR, the wavelength region, the preprocessing, the number of latent variables - is downstream tuning. The standards that govern multivariate calibration (ASTM E1655 for vibrational spectroscopy, USP general chapter 1039 on chemometrics, USP general chapter 1119 on NIR, ICH Q14 on procedure development) all agree on this ordering: design first, model second.

This article is the planning checklist. For the modelling itself, see our chemometrics primer and the PCA-versus-PLS decision guide. For what an inspector then expects in the validation file, see validating a chemometric model for GMP.

What “calibration set” actually means

The calibration set is the set of samples for which both the spectrum and an independent reference value are known, and which is used to fit the regression coefficients. Two things it is not: it is not the validation set used to estimate prediction error, and it is not the routine production samples the model will later predict.

ASTM E1655 distinguishes calibration samples (used to build the model), validation samples (used to estimate the standard error of prediction, SEP), and prediction samples (the unknowns in routine use). Mixing these categories is the single most common failure mode in chemometric work. A model that has seen its validation samples during training will report an optimistic SEP, and any confidence interval derived from that SEP will be too narrow for production.

Decision 1: define the design space before sampling

The design space is the set of conditions over which the model is expected to predict reliably. For a chemometric model, that means at minimum: the concentration range of the analyte, the concentration ranges of every chemical interferent that varies independently of the analyte, the physical states (temperature, pressure, particle size, moisture, viscosity) that affect the spectrum, and the instrument and probe configuration.

Two practical rules follow from ICH Q14. First, the model must never be asked to extrapolate. If routine concentrations occasionally drift to 1.5 times the production setpoint, the calibration set must cover up to 1.5 times the setpoint - not just plus-or-minus 10 percent. Second, every factor that varies in production must vary in the calibration set, ideally orthogonally. In vibrational spectroscopy, a model trained at one temperature and deployed across a 15 K range will generally fail.
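The first rule can be checked mechanically before any sample is drawn. The sketch below is one hedged way to do it: the factor names, the ranges, and the helper coverage_gaps are hypothetical and invented for illustration, not taken from any standard.

```python
# Minimal sketch: flag factors whose production range is not enclosed
# by the planned calibration range. All names and numbers are hypothetical.

def coverage_gaps(calibration, production):
    """Return {factor: reason} for every production factor the calibration plan does not cover."""
    gaps = {}
    for factor, (p_lo, p_hi) in production.items():
        if factor not in calibration:
            gaps[factor] = "not varied in calibration set"
            continue
        c_lo, c_hi = calibration[factor]
        if p_lo < c_lo or p_hi > c_hi:
            gaps[factor] = (
                f"production {p_lo}-{p_hi} exceeds calibration {c_lo}-{c_hi}"
            )
    return gaps


# Planned calibration ranges (min, max) per factor.
calibration_plan = {
    "analyte_pct": (9.0, 11.0),
    "temperature_K": (298.0, 298.0),  # single temperature only
}

# Ranges expected in routine production.
production_ranges = {
    "analyte_pct": (8.0, 15.0),
    "temperature_K": (290.0, 305.0),
    "moisture_pct": (0.5, 2.0),
}

gaps = coverage_gaps(calibration_plan, production_ranges)
for factor, reason in gaps.items():
    print(f"{factor}: {reason}")
```

An empty result means every production range sits inside the calibration range. It says nothing about whether the factors were varied orthogonally, which needs a separate check.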

Decision 2: sampling strategy - factorial, mixture, or augmented

The calibration design is not a list of “representative samples”. It is a structured experiment. Three patterns cover most cases.

For independent factors (analyte concentration, temperature, an unrelated interferent), a full or fractional factorial design with corner and centre points gives the cleanest model. For pure mixtures where components must sum to 100 percent, a simplex-lattice or simplex-centroid mixture design is the textbook answer; Cornell’s monograph remains the standard reference. For matrices where some samples come from a process and others must be spiked into laboratory standards, an augmented design - process samples plus spiked standards covering the edges - is usually the only feasible option.
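The first two patterns can be enumerated in a few lines. This is a minimal sketch with hypothetical factor names and ranges. It generates a full two-level factorial with one centre point (a fractional design would keep only a chosen subset of the corners) and a {q, m} simplex-lattice, in which every proportion is a multiple of 1/m and each blend sums to 1.

```python
from itertools import product


def factorial_with_centre(ranges):
    """Full two-level factorial (all corners) plus one centre point.

    ranges maps factor name -> (low, high).
    """
    names = list(ranges)
    corners = [
        dict(zip(names, levels))
        for levels in product(*(ranges[name] for name in names))
    ]
    centre = {name: (lo + hi) / 2 for name, (lo, hi) in ranges.items()}
    return corners + [centre]


def simplex_lattice(n_components, m):
    """{q, m} simplex-lattice: proportions are multiples of 1/m that sum to 1."""
    return [
        tuple(count / m for count in counts)
        for counts in product(range(m + 1), repeat=n_components)
        if sum(counts) == m
    ]


# Hypothetical independent factors for illustration.
design = factorial_with_centre({
    "analyte_pct": (5.0, 15.0),
    "temperature_K": (290.0, 305.0),
    "interferent_pct": (0.0, 2.0),
})
print(len(design), "factorial runs")   # 2**3 corners + 1 centre

# Three-component mixture, proportions in steps of 1/2.
blends = simplex_lattice(3, 2)
print(len(blends), "mixture blends")
```

Replicates, run-order randomisation, and the spiked edge samples of an augmented design are left out. In practice they would be added to the planned list before it goes into the protocol.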

The pitfall is the unstructured “grab whatever the plant gives us this week” approach. The model fits the joint distribution of whatever was sampled. If analyte and a confounding solvent moved together during the campaign, the regression cannot tell them apart, and the model collapses the first time the confounder moves independently in routine production.
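A quick screen for this problem is the correlation between the planned (or already collected) levels of the analyte and each suspected confounder. The sketch below uses invented numbers and a hand-written Pearson coefficient. It assumes that pairwise linear correlation is an adequate first check, and it will not catch confounding that involves three or more factors jointly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a factor with zero variance cannot be assessed")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical "grab" campaign: solvent tracked the analyte all week.
grab_analyte = [1.0, 2.0, 3.0, 4.0, 5.0]
grab_solvent = [2.1, 3.9, 6.2, 8.0, 9.9]

# Hypothetical designed set: corners plus centre, factors varied independently.
designed_analyte = [1.0, 1.0, 5.0, 5.0, 3.0]
designed_solvent = [2.0, 10.0, 2.0, 10.0, 6.0]

r_grab = pearson_r(grab_analyte, grab_solvent)
r_designed = pearson_r(designed_analyte, designed_solvent)
print(f"grab campaign r = {r_grab:.3f}")
print(f"designed set  r = {r_designed:.3f}")
```

A coefficient close to plus or minus 1 is the warning sign: the regression has no basis for separating the two factors.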

Decision 3: how many samples

ASTM E1655 gives a working formula for the minimum calibration set size: roughly six samples per latent variable retained in the model, with a hard floor of 24 samples for non-trivial models, and an explicit recommendation to plan for substantially more when the spectrum is noisy or the chemistry has many interferents. USP chapter 1119 adds that the calibration set must adequately span the matrix and the concentration range; the exact n is a function of how many sources of variance the model must absorb.
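As a planning aid, the rule as summarised in the paragraph above can be written down directly. The function below is a hypothetical helper that encodes only that summary (six samples per latent variable, floor of 24). It is not a transcription of the standard's text, and it returns a minimum, not a target.

```python
def min_calibration_samples(n_latent_variables, per_variable=6, floor=24):
    """Minimum calibration set size under the rule of thumb stated in the text.

    per_variable and floor default to the figures quoted above; both are
    parameters so a project can substitute its own, stricter numbers.
    """
    if n_latent_variables < 1:
        raise ValueError("model needs at least one latent variable")
    return max(floor, per_variable * n_latent_variables)


for k in (2, 5, 10):
    print(k, "latent variables ->", min_calibration_samples(k), "samples minimum")
```

Because the number of latent variables is not known until the model is built, the plan has to assume a plausible upper bound for it and size the set for that case.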

Practically, project teams in vibrational spectroscopy plan for 40-80 calibration samples for a clean, well-controlled analyte and 100-200 when interferents and physical variation are heavy. The often-quoted “more is always better” is not quite right: samples clustered near one corner of the design space inflate n without adding information. Geometry matters more than count.

Decision 4: reference values - error budget and traceability

The model cannot be shown to be more accurate than the reference method that supplied its calibration values. ASTM E1655 frames this explicitly: the standard error of the laboratory reference (SEL) propagates into the SEP, and a multivariate model is generally judged usable when SEP is no worse than about 1.5 times SEL. If the reference assay has a 0.3 percent absolute uncertainty, the chemometric model will not deliver 0.1 percent in production - that limit is set by the reference data, and no modelling choice removes it.

Two practical points. First, the reference method must be the same one validated under ICH Q2(R2) for that matrix, with documented intermediate precision and a traceable standard. Second, every calibration sample should ideally have a duplicate reference value; the within-sample reference replicate variance is what feeds SEL, and pooling it across the set is how you defend the SEL number in a regulatory file.
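For duplicate reference values, the pooled within-sample standard deviation is sqrt(sum(d_i^2) / (2n)), where d_i is the difference between the two results for sample i and n is the number of samples. The sketch below applies that formula to invented data and compares a hypothetical SEP against the "about 1.5 times SEL" guideline quoted above. The numbers and the names pooled_sel and sep are illustrative assumptions.

```python
import math


def pooled_sel(duplicates):
    """Pooled standard deviation from duplicate reference results.

    duplicates is a list of (first_result, second_result) pairs, one per sample.
    Each pair contributes one degree of freedom, hence the divisor 2n.
    """
    if not duplicates:
        raise ValueError("need at least one duplicate pair")
    sum_sq_diff = sum((a - b) ** 2 for a, b in duplicates)
    return math.sqrt(sum_sq_diff / (2 * len(duplicates)))


# Hypothetical duplicate assay results (percent w/w).
reference_pairs = [
    (10.0, 10.2),
    (12.1, 11.9),
    (9.8, 10.0),
    (11.0, 11.2),
]

sel = pooled_sel(reference_pairs)

# Hypothetical SEP obtained later from the independent test set.
sep = 0.19
ratio = sep / sel
ratio_ok = ratio <= 1.5

print(f"SEL = {sel:.3f}, SEP/SEL = {ratio:.2f}, within guideline: {ratio_ok}")
```

With more than two replicates per sample the same idea applies, but the pooled variance is then the sum of within-sample squared deviations divided by the total within-sample degrees of freedom.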

Decision 5: hold back a real test set, not a cross-validation alias

Cross-validation (leave-one-out, k-fold, venetian-blind) estimates how the model performs on resampled versions of the training data. It is not independent validation. ASTM E1655, USP 1039, and ICH Q2(R2) all require an independent test set - samples collected separately from the calibration set, ideally on different days, different production lots, and where feasible different instruments or probes of the same configuration. The test set sample count should be planned at the same time as the calibration set, not improvised at the end.
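One way to keep the test set honest is to split by production lot, not by individual sample, and then verify that nothing is shared between the two sets. The sketch below assumes each sample record carries hypothetical "lot" and "day" fields. The field names, the records, and the helper functions are illustrative.

```python
def split_by_lot(samples, test_lots):
    """Assign whole lots to the test set so no lot appears on both sides."""
    calibration = [s for s in samples if s["lot"] not in test_lots]
    test = [s for s in samples if s["lot"] in test_lots]
    return calibration, test


def shared_values(calibration, test, key):
    """Values of `key` that occur in both sets (should be empty for a clean split)."""
    return {s[key] for s in calibration} & {s[key] for s in test}


# Hypothetical sample records.
samples = [
    {"id": "S01", "lot": "A", "day": "2024-03-01"},
    {"id": "S02", "lot": "A", "day": "2024-03-01"},
    {"id": "S03", "lot": "B", "day": "2024-03-04"},
    {"id": "S04", "lot": "B", "day": "2024-03-05"},
    {"id": "S05", "lot": "C", "day": "2024-03-11"},
    {"id": "S06", "lot": "C", "day": "2024-03-12"},
]

calibration_set, test_set = split_by_lot(samples, test_lots={"C"})

lot_overlap = shared_values(calibration_set, test_set, "lot")
day_overlap = shared_values(calibration_set, test_set, "day")
print("lots in both sets:", lot_overlap)
print("days in both sets:", day_overlap)
```

Deciding which lots go to the test set before any modelling starts, and recording that decision in the protocol, is what keeps the split from being tuned after the fact.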

A separate concern is calibration transfer. If the model will ever be deployed on a second instrument, the test set should include samples measured on both. The literature on calibration transfer between instruments is consistent: a transferred model that was never tested against a second instrument during development almost always needs more remediation than one designed for transfer from the start.

Common failure modes

A few patterns recur across project audits. Collinear factors (the analyte and a process variable that always move together) make individual coefficients meaningless even when SEP looks acceptable. Sampling that ignores the time dimension (all calibration samples drawn in one week, then the model deployed for a year) misses long-term drift in raw materials or instrument response. Reference values rounded to fewer decimal places than the assay actually delivers throw away information that PLS could have used. Spiked standards prepared in pure solvent when production runs in a complex matrix produce models that work beautifully on the bench and fail at the plant.

None of these is a model problem. They are calibration-set problems: fixable at planning time and unfixable later.

A short pre-flight checklist

Before any spectrum is collected:

The analyte range, every varying interferent, and the physical conditions of routine production are written down with min, max, and routine setpoint.
The sampling design (factorial, mixture, or augmented) is drawn on paper, and every cell of the design has at least one planned sample.
The reference method is the one already validated under ICH Q2(R2) for this matrix; its SEL is known and documented.
A separate test set, with its own collection plan and its own sample count, is part of the protocol from day one.
If a second instrument is in the deployment plan, at least a subset of test-set samples will be measured on both.

The checklist is short, but each item closes a class of failure mode that no amount of clever modelling later can recover from. Chemometric models are sensitive to their training data in the same ordinary way every regression is. The design of that training data is the analytical procedure, in every sense the regulators use the term.