Laboratory QA/QC protocols exist for good reason. Chain of custody documentation, blank analysis, spike recovery, duplicate samples—these checks catch most data quality issues before reports leave the lab.
But “most” is doing heavy lifting in that sentence.
We’ve encountered two cases where data passed all QA/QC requirements, underwent external validation, and still turned out to be wrong. In one case, fictitious PCB data was used in site analysis for over two years before machine learning tools identified the problem.
Traditional validation checks individual samples against acceptance criteria. Machine learning analyzes patterns across entire datasets. When you look at the full chemical fingerprint rather than individual detections, problems that pass conventional QA/QC become visible.
Case Study 1: Fictitious PCB Data
A large environmental investigation analyzed PCB congeners across multiple sampling events over several years. Every batch of data met the QA/QC requirements of EPA Method 1668C, the standard protocol for PCB congener analysis. Every batch passed external data validation review. The data went into site analysis, informed remediation decisions, and supported regulatory reporting.
One batch was fiction.
When we analyzed the full dataset using UMAP (Uniform Manifold Approximation and Projection) and hierarchical cluster analysis, one batch stood apart. The pattern of PCB congeners didn’t match contamination at the site. It didn’t match the batch collected immediately before. It didn’t match the batch collected immediately after. It was statistically and chemically incongruent with the rest of the investigation.
The laboratory had reported results that looked plausible in isolation—congener ratios within expected ranges, detection frequencies reasonable for contaminated soil—but the overall fingerprint was wrong. When you see PCB-153 consistently dominant across 200 samples and then suddenly PCB-138 becomes the primary congener in a single batch before reverting to PCB-153 dominance afterward, you’re not looking at spatial variability. You’re looking at a reporting error.
The batch passed EPA Method 1668C QA/QC requirements because individual samples met acceptance criteria. Blanks were clean. Duplicates agreed. Spike recoveries fell within control limits. But the congener distribution pattern—the chemical signature that should remain relatively consistent within a site’s source material—was fundamentally different.
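The dominance check described above can be sketched in a few lines. This is a minimal illustration with made-up concentrations and batch labels, not the actual site data: each sample's congener profile is normalized to shares, and any batch whose dominant congener differs from the site-wide norm is flagged.

```python
import numpy as np

# Hypothetical congener concentrations (ug/kg) for three batches of samples.
# Congener names and values are illustrative only.
congeners = ["PCB-138", "PCB-153", "PCB-180"]
batches = {
    "2021-03": np.array([[120.0, 310.0, 90.0],
                         [110.0, 290.0, 85.0]]),
    "2021-07": np.array([[260.0, 130.0, 70.0],   # dominance flips to PCB-138
                         [240.0, 120.0, 65.0]]),
    "2021-11": np.array([[130.0, 320.0, 95.0],
                         [125.0, 300.0, 92.0]]),
}

def dominant_congener(batch: np.ndarray) -> str:
    """Return the congener with the highest mean share of the total profile."""
    shares = batch / batch.sum(axis=1, keepdims=True)  # normalize each sample
    return congeners[int(shares.mean(axis=0).argmax())]

site_norm = dominant_congener(np.vstack(list(batches.values())))
flags = {name: dominant_congener(arr) for name, arr in batches.items()
         if dominant_congener(arr) != site_norm}
print(flags)  # batches whose fingerprint departs from the site-wide norm
```

Each individual concentration here would look plausible on its own; only the shift in which congener dominates, relative to the rest of the dataset, marks the middle batch as suspect.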
This data had been used for over two years. Risk calculations, remediation boundary delineation, and regulatory compliance decisions all incorporated these results. Excluding the batch from reanalysis changed site characterization in that investigation area.
Case Study 2: Laboratory Coelution Differences
The second case involved PAH analysis from two different laboratories. Both labs reported the 16 priority pollutant PAHs. Both met EPA Method 8270 requirements. Both passed QA/QC. When the datasets were combined for site-wide analysis, hierarchical cluster analysis revealed a problem.
The issue centered on three specific compounds: benzo(b)fluoranthene, benzo(j)fluoranthene, and benzo(k)fluoranthene. These compounds are notoriously difficult to separate chromatographically. Depending on the gas chromatography column type and operating conditions, they may coelute partially or completely.
Different laboratories use different columns. Lab A’s column separated benzo(b)fluoranthene from benzo(k)fluoranthene but coeluted benzo(j)fluoranthene with one of them. Lab B’s column had a different coelution pattern. When reporting results, each lab made assumptions about which compounds were present based on their specific coelution behavior.
The result: systematic differences between laboratories that had nothing to do with actual site conditions. When the samples were clustered by chemical composition, they grouped by laboratory rather than by location or sampling event. The pattern revealed the measurement artifact.
This isn’t a QA/QC failure in the traditional sense. Both laboratories followed proper analytical procedures. But combining datasets from different laboratories requires understanding their methodological differences. Without ML pattern analysis, these coelution differences would likely have gone undetected; simple histograms of individual compounds would not have revealed that the two laboratories reported their PAHs differently.
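One common way to reconcile such datasets, not described in the cases above but a standard practice when benzofluoranthenes coelute, is to compare laboratories on the summed b+j+k value, which is invariant to how each lab apportioned the overlapping peaks. The numbers below are hypothetical:

```python
# Illustrative only: two labs split the coeluting benzofluoranthene peaks
# differently, but the summed b+j+k value is comparable across labs.
lab_a = {"benzo(b)fluoranthene": 410.0,   # hypothetical: b+j reported together
         "benzo(j)fluoranthene": 0.0,
         "benzo(k)fluoranthene": 180.0}
lab_b = {"benzo(b)fluoranthene": 260.0,   # hypothetical: isomers split out
         "benzo(j)fluoranthene": 150.0,
         "benzo(k)fluoranthene": 180.0}

def summed_bjk(result: dict) -> float:
    """Collapse the three coeluting isomers into one comparable value."""
    return sum(result[k] for k in ("benzo(b)fluoranthene",
                                   "benzo(j)fluoranthene",
                                   "benzo(k)fluoranthene"))

print(summed_bjk(lab_a), summed_bjk(lab_b))  # 590.0 590.0
```

The summed value agrees even though the individual isomer results differ, which is exactly why clustering on the individual compounds separates the labs while the underlying chemistry is the same.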
Why Pattern Analysis Catches What QA/QC Misses
Traditional QA/QC validates samples individually or in small batches. It checks whether blanks are clean, whether duplicates agree, whether known standards recover correctly. These are essential checks. But they operate at the wrong scale to detect certain types of errors.
Machine learning tools like UMAP, hierarchical cluster analysis (HCA), and principal component analysis (PCA) evaluate patterns across entire datasets. They examine relationships between dozens or hundreds of chemical analytes simultaneously and identify samples that don’t fit the chemical patterns within the dataset.
For PCB congeners, this means comparing the ratios of all 209 congeners (or the subset analyzed) rather than checking individual detections. For PAHs, it means examining the full 16-compound fingerprint to identify systematic shifts that suggest analytical artifacts rather than environmental processes.
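The fingerprint comparison can be made concrete with a toy example. Here, synthetic 16-compound profiles (the compound pattern and numbers are invented) are normalized so that only composition matters, not total concentration, and each sample's cosine distance from the site centroid measures how well it fits the shared fingerprint:

```python
import numpy as np

# Toy 16-compound PAH fingerprints (units cancel after normalization).
# The "site" rows share one pattern at different totals; the "suspect"
# row reverses that pattern. All values are illustrative only.
pattern = np.linspace(2.0, 0.5, 16)
site = np.array([pattern * scale for scale in (1.0, 2.5, 0.8)])
suspect = pattern[::-1] * 1.7                 # reversed relative composition

profiles = np.vstack([site, suspect])
profiles = profiles / profiles.sum(axis=1, keepdims=True)  # shape, not magnitude
centroid = profiles[:3].mean(axis=0)          # site-wide mean fingerprint

def cosine_dist(p, q):
    return 1.0 - (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))

dists = [round(cosine_dist(p, centroid), 3) for p in profiles]
print(dists)  # site rows sit at ~0; the reversed fingerprint stands apart
```

Note that the three site samples have very different total concentrations but identical relative compositions, so they collapse to the same point; the suspect sample is flagged purely by the shape of its profile.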
These tools don’t replace QA/QC. They complement it. A sample can have excellent precision and accuracy—meeting all acceptance criteria—while still being wrong because it was mislabeled, transposed in reporting, or analyzed under non-standard conditions that produced valid-looking but incorrect results.
The Dimensionality Problem
Environmental datasets are inherently high-dimensional. A single soil sample might be analyzed for 50+ PAH compounds, 100+ PCB congeners, or dozens of metals and volatile organics. Evaluating these manually means looking at compounds one at a time, maybe comparing a few key ratios.
UMAP and PCA reduce this dimensionality while preserving the structure of the data. They collapse 50 dimensions into 2 or 3 that can be visualized and interpreted. Samples that cluster together have similar chemical compositions. Samples that stand apart are chemically distinct—which might indicate different source material, different contamination history, or different (potentially erroneous) analytical procedures.
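For PCA, the reduction can be sketched directly with a singular value decomposition on synthetic data. The latent "source patterns" below are randomly generated stand-ins, not real congener loadings; the point is that six 16-dimensional samples collapse to two coordinates that retain nearly all of the variance:

```python
import numpy as np

# PCA via SVD: collapse a 16-dimensional fingerprint matrix to 2 coordinates.
# Synthetic data: 6 samples that vary along two underlying directions.
rng = np.random.default_rng(42)
loadings = rng.normal(size=(2, 16))          # two latent source patterns
scores = rng.normal(size=(6, 2))             # each sample's mix of the two
X = scores @ loadings + 0.01 * rng.normal(size=(6, 16))  # small noise

Xc = X - X.mean(axis=0)                      # center before decomposing
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :2] * S[:2]                    # 2-D coordinates per sample

explained = (S**2 / (S**2).sum())[:2].sum()  # variance kept by 2 components
print(coords.shape, round(explained, 3))     # (6, 2) and nearly 1.0
```

Real environmental data is noisier and rarely this close to rank two, but the mechanics are the same: samples with similar fingerprints land near each other in the reduced coordinates, and outliers are immediately visible.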
HCA builds hierarchical relationships between samples based on their full chemical profiles. When you see samples grouping by batch submission date rather than by spatial proximity, you’re seeing batch effects. When samples group by laboratory rather than by site area, you’re seeing systematic analytical differences.
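The laboratory-grouping effect from the PAH case can be reproduced on toy data. In this sketch (values invented, mirroring the benzofluoranthene coelution scenario), one lab reports the b and j isomers combined while the other splits them; cutting the hierarchical tree into two clusters separates the samples by lab, not by location:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy example: 4 samples from Lab A and 4 from Lab B, interleaved locations.
# Columns: benzo(b)F, benzo(j)F, benzo(k)F shares. Values are illustrative.
lab_a = np.array([[0.40, 0.00, 0.20],     # hypothetical: b+j reported together
                  [0.42, 0.00, 0.19],
                  [0.38, 0.00, 0.21],
                  [0.41, 0.00, 0.20]])
lab_b = np.array([[0.25, 0.15, 0.20],     # hypothetical: isomers split out
                  [0.26, 0.14, 0.20],
                  [0.24, 0.16, 0.19],
                  [0.25, 0.15, 0.21]])
X = np.vstack([lab_a, lab_b])

Z = linkage(X, method="ward")             # build the hierarchical tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)  # samples split cleanly by laboratory, not by location
```

In a real dataset the cluster labels would be cross-tabulated against metadata (lab, batch date, location) to confirm which factor is driving the grouping.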
This isn’t subjective interpretation. These are quantitative assessments of chemical similarity across hundreds of measurements simultaneously.
Integration with Document Intelligence
Pattern analysis identifies the problem. Understanding the cause requires going back to the documentation.
In the PCB case, identifying the aberrant batch prompted review of chain-of-custody records, laboratory notebooks, and correspondence. The trail led to a reporting error during a period when the laboratory was transitioning between LIMS systems. Data had been transcribed incorrectly and validated against the wrong quality control batch.
In the PAH case, recognizing the coelution pattern required reviewing analytical methods from both laboratories, understanding their column specifications, and determining which reported values were directly measured versus inferred from coeluted peaks.
Statvis integrates both capabilities. The Explore module includes UMAP, HCA, and PCA tools for pattern analysis across chemical datasets. The document processing pipeline extracts laboratory methods, QA/QC documentation, and analytical specifications from PDFs. When pattern analysis flags potential issues, the supporting documentation is already indexed and searchable.
This matters because data validation isn’t a one-time check. It’s an ongoing process of comparing analytical results against expectations derived from site conceptual models, spatial trends, temporal trends, and known analytical limitations. Machine learning identifies anomalies. Document review explains them.
When Good Data Goes Bad
The unsettling reality is that bad data often looks fine in isolation. A PCB congener concentration of 2,300 µg/kg doesn’t trigger red flags. It’s within the range you’d expect at a contaminated industrial site. The laboratory reported it with appropriate precision. It passed validation.
The problem only becomes apparent when you see 200 other samples with PCB-153 as the dominant congener and this one sample, uniquely, shows PCB-138 dominance instead. Context reveals the error.
Traditional QA/QC focuses on precision and accuracy—whether measurements are reproducible and whether they’re correct relative to known standards. Pattern analysis adds a third dimension: consistency. Does this result fit the chemical fingerprint established by the rest of the investigation?
For liability determinations, remediation planning, and regulatory compliance, using incorrect data has consequences. A remediation boundary drawn based on fictitious concentrations wastes resources excavating clean soil or, worse, leaves contamination in place. Risk calculations based on systematic analytical biases might underestimate or overestimate exposure.
What We’re Building Toward
Machine learning for data quality isn’t a research curiosity. It’s a practical necessity when investigations accumulate thousands of analytical results across years or decades. Manual review at that scale becomes impractical. Pattern-based anomaly detection becomes essential.
We’re extending these tools to operate automatically as data enters the platform. Upload a new batch of laboratory results, and the system flags samples that deviate from the established site fingerprint. Combine datasets from multiple laboratories, and the system identifies systematic differences that need methodological review.
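One simple form such an upload-time check could take, sketched here with synthetic data and a hypothetical `flag_batch` helper rather than the platform's actual implementation, is a per-analyte z-score comparison of a new batch's mean fingerprint against the established baseline:

```python
import numpy as np

# Sketch of an automated check at upload time (hypothetical workflow):
# compare a new batch's mean fingerprint against the site baseline.
def flag_batch(baseline: np.ndarray, new_batch: np.ndarray,
               z_max: float = 3.0) -> bool:
    """True if the new batch's mean profile deviates beyond z_max
    standard deviations (per analyte share) from the baseline."""
    norm = lambda a: a / a.sum(axis=1, keepdims=True)
    base, new = norm(baseline), norm(new_batch)
    mu, sd = base.mean(axis=0), base.std(axis=0) + 1e-12
    z = np.abs((new.mean(axis=0) - mu) / sd)
    return bool(z.max() > z_max)

# Synthetic data: the "bad" batch reverses the site's analyte weighting.
rng = np.random.default_rng(7)
weights = np.linspace(2.0, 0.5, 16)
baseline = rng.lognormal(sigma=0.2, size=(100, 16)) * weights
ok_batch = rng.lognormal(sigma=0.2, size=(10, 16)) * weights
bad_batch = rng.lognormal(sigma=0.2, size=(10, 16)) * weights[::-1]
print(flag_batch(baseline, ok_batch), flag_batch(baseline, bad_batch))
```

A production version would need robust statistics, censored-data handling, and tunable thresholds, but the principle matches the case studies: the check operates on the whole fingerprint, not on individual acceptance criteria.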
This doesn’t eliminate the need for professional judgment. It focuses that judgment on the samples and patterns that warrant closer examination. When everything passes QA/QC, knowing which results to scrutinize further requires looking beyond individual sample acceptance criteria to the broader dataset structure.
Site characterization depends on data quality. QA/QC catches most problems. Pattern analysis catches the rest.