
How to measure product data quality

"Our data quality is fine" is the most expensive sentence in a catalog org, because nobody can produce the number behind it. Product data quality is measurable — not with a vibe, an audit anecdote, or a single "completeness" percentage, but with a small set of dimensions, each defined precisely enough that two analysts running the same query get the same answer. If you can't compute it twice and get the same result, you don't have a metric; you have an opinion.

This guide gives you the dimensions that actually matter for a product catalog, the exact formulas for each, how to sample so your numbers are honest, and how to roll them into a weighted scorecard that ties back to revenue rather than to a tidiness score. It's written for the people who own the catalog at distributors, retailers, brands, and manufacturers — the ones who get asked "is the data good enough to launch?" and need a defensible answer.

One framing to hold throughout: measure fitness for use, not abstract perfection. A spec sheet that's 100% complete for engineering can be useless for ecommerce search. The right question is always "good enough for which job?" — and the job, for most of you, is getting a SKU found, compared, and chosen across your own site, marketplaces, distributor feeds, and AI answer engines.

Start with the job, not the field list

Before you measure anything, write down what the data has to do. A measurement that isn't anchored to a use case produces numbers nobody trusts and nobody acts on.

For product catalogs, there are usually three or four jobs running at once, and they demand different fields:

  • Findability — the buyer searches and filters. Needs category, the attributes your facets are built on, GTIN/MPN, synonyms, and search keywords.
  • Comparison — the buyer evaluates against alternatives. Needs the full attribute set for that category, units, compatibility, and normalized values.
  • Decision/conversion — the buyer commits. Needs a real description, imagery, compliance and certification flags, dimensions/weight for shipping, and accurate price/availability.
  • Channel syndication — the listing has to pass another system's required-field gate (Amazon, Google Shopping, a distributor's portal, a marketplace category spec).

Define the required attribute set per category for each job. "Required" is not one global list — a circuit breaker and a safety glove have almost no overlapping mandatory fields. This per-category, per-use-case requirement map is the denominator for almost every metric that follows. Skip this step and your completeness score measures the wrong thing precisely.
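One way to make the requirement map concrete is a plain lookup keyed by category and job. The sketch below is illustrative only: the category names, attribute names, and the `required_attributes` helper are assumptions, not a standard schema.

```python
# Hypothetical requirement map: category -> job -> required attribute names.
REQUIREMENTS = {
    "circuit_breaker": {
        "findability": {"category", "gtin", "mpn", "amperage", "poles"},
        "comparison": {"amperage", "poles", "voltage", "interrupt_rating"},
        "conversion": {"description", "image_url", "certifications", "weight_kg"},
    },
    "safety_glove": {
        "findability": {"category", "gtin", "mpn", "size", "material"},
        "comparison": {"size", "material", "cut_level"},
        "conversion": {"description", "image_url", "certifications"},
    },
}


def required_attributes(category, jobs=None):
    """Union of required attributes for a category across the given jobs.

    With jobs=None, every job defined for the category is included.
    """
    by_job = REQUIREMENTS[category]
    selected = list(by_job) if jobs is None else jobs
    required = set()
    for job in selected:
        required |= by_job[job]
    return required


# The denominator for completeness differs by category and by job.
print(len(required_attributes("circuit_breaker")))
print(sorted(required_attributes("safety_glove", ["comparison"])))
```

Keeping the map as data rather than as logic buried in queries means it can be reviewed and versioned alongside the metric definitions.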

The six dimensions of product data quality

Most credible data-quality frameworks converge on the same core dimensions. Here they are, scoped specifically to a product catalog. Measure all six — picking only completeness is the most common way teams fool themselves.

  1. Completeness — is the value present? The percentage of required attributes (per the category requirement map) that are populated and non-trivial. The hard part is rejecting junk fills: "N/A", "TBD", "-", "see description", a lone space, or a description that's just the title repeated. Those are blanks wearing a costume.

  2. Accuracy — is the value correct? Does the stated value match reality — the manufacturer spec sheet, the physical product, the authoritative source? Accuracy is the only dimension you cannot measure by querying your own database; it requires comparison to an external truth. This is why it's the most skipped and the most important.

  3. Consistency / Standardization — is the same thing expressed the same way everywhere? "in" vs "inch" vs '"', "Red" vs "RED" vs "#FF0000", brand "3M" vs "3-M" vs "Three M". Measured as conformance to a controlled vocabulary or unit standard.

  4. Validity / Conformity — does the value fit its format rules? GTIN passes check-digit validation and is 8/12/13/14 digits; weight is a positive number; voltage is from the allowed enum; a URL resolves. This is machine-checkable and cheap — automate it fully.

  5. Uniqueness — one real product, one record. Duplicate SKUs, the same item under two MPNs, and variant explosions that should be a single parent. Measured as duplicate rate against a defined match key.

  6. Timeliness / Freshness — is the value current? Price, availability, discontinued status, and spec revisions decay. Measured as the share of records updated within the acceptable staleness window for that field (price might be hours; physical dimensions, years).

A seventh, discoverability fitness, is worth tracking separately for ecommerce: does the record carry the language and structured signal that search engines, marketplaces, and AI answer engines actually rank on? A record can score well on the first six and still be invisible because it lacks the buyer's vocabulary.

Exact formulas you can implement

Vague dimensions become real metrics only when you write the formula. Here are implementable definitions — run them as SQL or in a notebook against an export.

Field-level completeness (per attribute a, per category c):

  completeness(a,c) = non_empty_valid_count(a,c) / record_count(c)

where non_empty_valid_count excludes nulls, empty strings, and anything on your junk-fill blocklist.
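A minimal sketch of field-level completeness, assuming records are plain dictionaries already filtered to one category. The blocklist contents and the helper names `is_populated` and `field_completeness` are illustrative assumptions.

```python
# Assumed junk-fill blocklist; extend it with whatever your suppliers actually send.
JUNK_FILLS = {"n/a", "na", "tbd", "-", "see description"}


def is_populated(value, title=None):
    """True if value is present and is not a junk fill or a copy of the title."""
    if value is None:
        return False
    text = str(value).strip()
    if not text or text.casefold() in JUNK_FILLS:
        return False
    if title is not None and text.casefold() == str(title).strip().casefold():
        return False
    return True


def field_completeness(records, attribute):
    """completeness(a, c) for records that all belong to one category c."""
    if not records:
        return None  # no records, no metric
    populated = 0
    for record in records:
        # Only descriptions are compared against the title.
        title = record.get("title") if attribute == "description" else None
        if is_populated(record.get(attribute), title):
            populated += 1
    return populated / len(records)


gloves = [
    {"title": "Nitrile Glove L", "description": "Nitrile Glove L", "material": "nitrile"},
    {"title": "Cut Glove M", "description": "ANSI A4 cut-resistant knit glove", "material": "N/A"},
    {"title": "Leather Glove", "description": " ", "material": "leather"},
    {"title": "Latex Glove S", "description": "Powder-free latex glove", "material": None},
]
print(field_completeness(gloves, "description"))  # 0.5
print(field_completeness(gloves, "material"))     # 0.5
```

A naive non-null count would report 100% for description and 75% for material on the same four records.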

Record-level completeness (per SKU):

  record_completeness = required_attributes_populated / required_attributes_total

Report the distribution, not just the mean. "Average 86% complete" hides that 30% of SKUs are below 60%. Track the percentage of SKUs at or above a launch threshold (e.g., "% of SKUs ≥ 95% complete").
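To illustrate why the mean alone misleads, here is a small sketch that reports the mean next to the two shares described above. The function name `completeness_report` and both threshold defaults are assumptions chosen for the example.

```python
from statistics import mean


def completeness_report(scores, launch_threshold=0.95, low_threshold=0.60):
    """Summarize per-SKU completeness scores (each between 0 and 1)."""
    n = len(scores)
    return {
        "mean": mean(scores),
        "share_launch_ready": sum(1 for s in scores if s >= launch_threshold) / n,
        "share_below_low": sum(1 for s in scores if s < low_threshold) / n,
    }


# Seven fully complete SKUs and three half-empty ones.
scores = [1.0] * 7 + [0.5] * 3
report = completeness_report(scores)
print(report)
```

Here the mean comes out at 85%, which sounds healthy, while 30% of the SKUs sit below the 60% line.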

Validity rate (per attribute):

  validity(a) = records_passing_format_rules(a) / records_with_a_value(a)

Report validity and completeness as separate numbers, so that a populated-but-invalid GTIN is never mistaken for a win on either.
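As one example of a machine-checkable format rule, the sketch below validates GTIN length and check digit using the standard mod-10 scheme (weights of 3 and 1 alternating from the rightmost data digit), then computes a validity rate over populated values only. The function names are assumptions for illustration.

```python
def gtin_is_valid(value):
    """True for an 8/12/13/14-digit GTIN whose check digit is correct."""
    digits = str(value).strip()
    if not (digits.isascii() and digits.isdigit()):
        return False
    if len(digits) not in (8, 12, 13, 14):
        return False
    body, check = digits[:-1], int(digits[-1])
    # Starting from the rightmost data digit, weights alternate 3, 1, 3, 1, ...
    total = sum(
        int(d) * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10 == check


def validity_rate(values, is_valid):
    """Share of populated values that pass the format rule."""
    present = [v for v in values if v is not None and str(v).strip()]
    if not present:
        return None
    return sum(1 for v in present if is_valid(v)) / len(present)


gtins = ["4006381333931", "4006381333932", "036000291452", "12345", None, ""]
print(validity_rate(gtins, gtin_is_valid))  # 0.5
```

Blank values are left out of the denominator here because they are already counted against completeness.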

Consistency rate:

  consistency(a) = values_matching_controlled_vocab(a) / records_with_a_value(a)
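A sketch of the consistency check as exact-match conformance to a controlled vocabulary. The allowed unit list is an assumption; the `nonconforming` helper is an extra added here because the variants themselves tell you what to normalize first.

```python
from collections import Counter

# Assumed controlled vocabulary for a length-unit attribute.
ALLOWED_LENGTH_UNITS = {"in", "cm", "mm"}


def _present(values):
    """Values that are populated at all (blanks belong to completeness)."""
    return [v for v in values if v is not None and str(v).strip()]


def consistency_rate(values, allowed):
    """Share of populated values that exactly match the controlled vocabulary."""
    present = _present(values)
    if not present:
        return None
    return sum(1 for v in present if v in allowed) / len(present)


def nonconforming(values, allowed):
    """Count each off-vocabulary variant."""
    return Counter(v for v in _present(values) if v not in allowed)


units = ["in", "inch", '"', "in", "IN", None]
print(consistency_rate(units, ALLOWED_LENGTH_UNITS))  # 0.4
print(nonconforming(units, ALLOWED_LENGTH_UNITS))
```

The match is deliberately case-sensitive, so "IN" counts as a variant rather than as conforming.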

Uniqueness / duplicate rate:

  duplicate_rate = duplicate_records / total_records

measured against an explicit match key (e.g., normalized brand + MPN). Publish the match key: with an unstated key, nobody can reproduce or challenge the number, which makes it useless.
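A sketch of a duplicate rate built on a normalized brand + MPN match key. Two things here are assumptions rather than the guide's definition: the normalization rule (lowercase, strip everything except letters and digits) and counting "duplicate records" as the surplus records in each group beyond the first.

```python
import re
from collections import Counter


def _normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(text).casefold())


def match_key(record):
    """Published match key: normalized brand + normalized MPN."""
    return (_normalize(record["brand"]), _normalize(record["mpn"]))


def duplicate_rate(records):
    """Surplus records (beyond the first per match key) over total records."""
    if not records:
        return None
    counts = Counter(match_key(r) for r in records)
    surplus = sum(c - 1 for c in counts.values())
    return surplus / len(records)


catalog = [
    {"sku": "A1", "brand": "3M", "mpn": "AB-100"},
    {"sku": "A2", "brand": "3-M", "mpn": "ab100"},
    {"sku": "A3", "brand": "3M", "mpn": "AB-200"},
    {"sku": "A4", "brand": "Three M", "mpn": "AB-200"},
]
print(duplicate_rate(catalog))  # 0.25
```

This key catches "3M" versus "3-M" but not "Three M"; spelled-out brand variants need an alias table in front of the key, which is another reason to publish exactly what the key does.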

Freshness:

  freshness(field) = records_updated_within_window(field) / total_records
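A sketch of the freshness calculation, assuming each record carries a per-field `updated_at` mapping of timezone-aware timestamps. That record shape and the specific windows below are assumptions; a record with no timestamp for the field is treated as stale.

```python
from datetime import datetime, timedelta, timezone

# Assumed staleness windows per field; set these from your own tolerance.
STALENESS_WINDOWS = {
    "price": timedelta(hours=24),
    "dimensions": timedelta(days=3 * 365),
}


def freshness(records, field, now, windows=STALENESS_WINDOWS):
    """Share of all records whose field was updated within its window."""
    if not records:
        return None
    cutoff = now - windows[field]
    fresh = 0
    for record in records:
        updated = record.get("updated_at", {}).get(field)
        if updated is not None and updated >= cutoff:
            fresh += 1
    return fresh / len(records)


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
skus = [
    {"updated_at": {"price": now - timedelta(hours=2),
                    "dimensions": now - timedelta(days=365)}},
    {"updated_at": {"price": now - timedelta(days=3),
                    "dimensions": now - timedelta(days=5 * 365)}},
    {"updated_at": {}},  # never timestamped: counted as stale
]
print(freshness(skus, "price", now))
print(freshness(skus, "dimensions", now))
```

Passing `now` in explicitly keeps the metric reproducible: rerunning the calculation against the same export and the same reference time gives the same answer.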

Accuracy is a sampled metric, not a full-population query (see the sampling section):

  accuracy(a) = correct_values_in_sample(a) / sampled_values_checked(a)

with a confidence interval attached. An accuracy number without an interval and a sample size is theater.
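One common way to attach that interval is the Wilson score interval for a proportion, sketched below. The choice of Wilson over other interval methods is an assumption made here, not something the guide prescribes, and z = 1.96 corresponds to 95% confidence.

```python
from math import sqrt


def accuracy_with_interval(correct, checked, z=1.96):
    """Point estimate plus Wilson score interval for a sampled accuracy."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    p = correct / checked
    denom = 1 + z * z / checked
    center = (p + z * z / (2 * checked)) / denom
    half = z * sqrt(p * (1 - p) / checked + z * z / (4 * checked * checked)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


estimate, low, high = accuracy_with_interval(correct=360, checked=384)
print(f"accuracy {estimate:.1%}, 95% interval {low:.1%} to {high:.1%}, n=384")
```

Reporting all three figures together (estimate, interval, sample size) is what lets someone else judge how much weight the number can bear.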

Measure accuracy with sampling, not faith

You cannot check a million SKUs against reality by hand, and you cannot check accuracy by querying your own database — the database is what you're trying to verify. So you sample, and you do it honestly.

Pull a stratified random sample. Don't eyeball your "important" SKUs; that biases the result upward. Stratify by category and by source/supplier, because error rates cluster there. A practical starting point: 300–400 randomly drawn SKUs give you a catalog-wide accuracy estimate within roughly ±5 percentage points at 95% confidence, largely regardless of catalog size. For per-category accuracy you need that sample size within each category that matters.
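The 300–400 figure follows from the standard sample-size formula for a proportion, n = z² · p(1 − p) / e², evaluated at the worst case p = 0.5. The sketch below computes it, with an optional finite population correction, and then draws a proportionally allocated stratified sample. Proportional allocation and the function names are assumptions; other allocation schemes are equally defensible.

```python
import math
import random
from collections import defaultdict


def sample_size(margin=0.05, z=1.96, p=0.5, population=None):
    """SKUs needed for a given margin of error (p=0.5 is the worst case)."""
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)


def stratified_sample(skus, stratum_of, total, seed=0):
    """Random sample with proportional allocation, at least one per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sku in skus:
        strata[stratum_of(sku)].append(sku)
    sample = []
    for members in strata.values():
        share = round(total * len(members) / len(skus))
        k = min(len(members), max(1, share))
        sample.extend(rng.sample(members, k))
    return sample


print(sample_size())                       # 385
print(sample_size(population=10_000))      # 370
print(sample_size(population=10_000_000))  # 385

# Hypothetical catalog: (supplier, sku_number) pairs, stratified by supplier.
catalog = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]
drawn = stratified_sample(catalog, lambda sku: sku[0], total=100)
print(len(drawn))  # 100
```

The required sample barely changes between a catalog of ten thousand SKUs and one of ten million, which is why the margin of error, not catalog size, drives the sampling budget.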

Define the source of truth before you look. Manufacturer spec sheet, official datasheet, the physical product, a regulatory database. Write it down per attribute so reviewers aren't improvising.

Check value-by-value, not record-by-record. Record-level "looks right" hides field-level errors. Log each checked attribute as correct / incorrect / unverifiable, and treat unverifiable as its own bucket — a high unverifiable rate is itself a finding (your source of truth is missing too).
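A sketch of a value-level review log and its tally. The tuple layout and the `tally_checks` name are assumptions, and so is one modeling choice: accuracy is computed over verifiable checks only, with the unverifiable share reported beside it rather than folded in.

```python
from collections import Counter, defaultdict

OUTCOMES = {"correct", "incorrect", "unverifiable"}


def tally_checks(checks):
    """Summarize (sku, attribute, outcome) review rows per attribute."""
    by_attribute = defaultdict(Counter)
    for _sku, attribute, outcome in checks:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        by_attribute[attribute][outcome] += 1

    summary = {}
    for attribute, counts in by_attribute.items():
        verifiable = counts["correct"] + counts["incorrect"]
        total = verifiable + counts["unverifiable"]
        summary[attribute] = {
            "checked": total,
            # Accuracy over verifiable checks only (an assumption of this sketch).
            "accuracy": counts["correct"] / verifiable if verifiable else None,
            "unverifiable_rate": counts["unverifiable"] / total,
        }
    return summary


review_log = [
    ("S1", "voltage", "correct"),
    ("S2", "voltage", "incorrect"),
    ("S3", "voltage", "unverifiable"),
    ("S4", "voltage", "correct"),
    ("S1", "weight", "unverifiable"),
]
summary = tally_checks(review_log)
print(summary["voltage"])
print(summary["weight"])
```

An attribute whose checks are all unverifiable gets no accuracy figure at all, which is the honest result when the source of truth is missing.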

Re-sample on a cadence. Accuracy decays as suppliers change specs and staff fat-finger updates. A one-time accuracy number is a snapshot of a moving target. Quarterly re-sampling turns it into a trend, which is the thing that actually tells you whether you're winning.

Roll it into a weighted scorecard

Six dimension scores per category is a lot of numbers. Leadership wants one, and reasonable people want it built from defensible parts. Build a weighted composite — but weight by business impact, not by what's easy to measure.

Steps:

  1. Normalize every dimension to 0–100.
  2. Assign weights that reflect what the data is for. For a discovery-driven ecommerce catalog, a defensible split is: Accuracy 30, Completeness 25, Consistency 15, Validity 10, Uniqueness 10, Freshness 10. A price-volatile distributor weights Freshness far higher. The weights are a strategy statement — make them on purpose.
  3. Compute a composite per SKU, roll up per category, then per catalog.
  4. Gate, don't just average. Some failures are disqualifying regardless of the composite: an invalid GTIN, a missing safety certification, a wrong voltage. A SKU that fails a gate is "not launch-ready" even at a 92 composite. Averages hide the landmines; gates surface them.
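The steps above can be sketched as a weighted average plus a separate gate check. The weights are the example split from step 2; the launch threshold of 90 and the gate names are assumptions for illustration.

```python
# Example weights from the discovery-driven ecommerce split (they sum to 100).
WEIGHTS = {
    "accuracy": 30,
    "completeness": 25,
    "consistency": 15,
    "validity": 10,
    "uniqueness": 10,
    "freshness": 10,
}


def composite(scores, weights=WEIGHTS):
    """Weighted average of dimension scores, each already normalized to 0-100."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def launch_ready(scores, gate_failures, threshold=90, weights=WEIGHTS):
    """A SKU is launch-ready only if no gate failed AND the composite clears the bar."""
    return not gate_failures and composite(scores, weights) >= threshold


sku_scores = {
    "accuracy": 95,
    "completeness": 90,
    "consistency": 90,
    "validity": 100,
    "uniqueness": 90,
    "freshness": 85,
}
print(composite(sku_scores))                       # 92.0
print(launch_ready(sku_scores, []))                # True
print(launch_ready(sku_scores, ["invalid_gtin"]))  # False
```

Keeping the gate outside the average is the point: a 92 composite cannot buy back a disqualifying failure.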

Report three views: the headline composite (for the exec), the dimension breakdown (for the program owner), and the worst-offending categories and suppliers (for whoever does the work this week). A single number with no drill-down gets celebrated and ignored in equal measure.

Watch the failure mode: optimizing the score instead of the outcome. If filling fields with low-value boilerplate lifts completeness without helping a buyer, the metric is now lying to you. Tie at least one dimension weight to a downstream signal — search conversion, marketplace rejection rate, return rate from wrong specs — so the scorecard stays honest.

Connect data quality to outcomes that get budget

Quality scores that float free of money get cut in the next budget cycle. Pair each dimension with a downstream business metric so the score earns its keep.

  • Completeness / discoverability fitness → on-site search exit rate, zero-result searches, facet coverage. Thin attributes show up as filters buyers can't use.
  • Validity (GTIN/MPN) → marketplace and feed rejection rate. Missing or bad GTINs get listings suppressed or de-ranked on Google Shopping and Amazon; this is a directly countable loss.
  • Accuracy → return rate and "item not as described" claims. Wrong dimensions, voltage, or compatibility convert into freight, restocking, and chargebacks.
  • Freshness → price-error margin leakage and out-of-stock orders.
  • Consistency → comparison abandonment. Buyers can't compare what isn't normalized.

The most persuasive version of a data-quality report isn't "completeness rose from 78% to 91%." It's "the 1,200 SKUs we enriched last quarter moved from 4% to 22% add-to-cart, and feed rejections dropped 60%." Measure the data, but report the consequence.

Make it continuous, not a one-time audit

A quarterly audit tells you where you were. Catalogs drift continuously: new SKUs arrive thin, suppliers change file formats, channels add required fields, and prices move. Measured once, quality looks like an event; in reality it's a leak.

Operationalize it:

  • Baseline now across all six dimensions and freeze the definitions in writing so future numbers are comparable.
  • Automate the cheap dimensions (validity, consistency, uniqueness, completeness) to run on every export or on a schedule — these are pure database queries and should never be done by hand twice.
  • Sample accuracy on a cadence (quarterly is a sane default; monthly for fast-moving categories).
  • Instrument intake. Score new and updated SKUs at the door, not months later. The cheapest data to fix is the data that hasn't shipped yet.
  • Trend, don't snapshot. A dashboard showing each dimension over time, split by category and supplier, surfaces the supplier whose feed quietly degraded — the single most common source of a sudden quality drop.
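As a rough illustration of trending by supplier, the sketch below flags any supplier whose latest score fell by at least a chosen amount against the previous period. The history shape, the five-point threshold, and the function name are assumptions; a real dashboard would likely look at more than two periods.

```python
def degraded_suppliers(history, min_drop=5.0):
    """Suppliers whose latest score dropped by min_drop or more.

    history maps supplier -> list of scores ordered oldest to latest.
    """
    flagged = {}
    for supplier, scores in history.items():
        if len(scores) < 2:
            continue  # no trend without at least two periods
        drop = scores[-2] - scores[-1]
        if drop >= min_drop:
            flagged[supplier] = drop
    return flagged


# Hypothetical composite scores per supplier over three periods.
history = {
    "acme": [91.0, 90.5, 82.0],
    "bolt": [88.0, 88.5, 87.9],
    "newco": [75.0],
}
print(degraded_suppliers(history))  # {'acme': 8.5}
```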

The measurement is the easy half. The hard half is closing the gaps the measurement reveals — filling missing attributes, correcting wrong values, and normalizing the inconsistent ones, at catalog scale, and keeping them filled as the catalog drifts. That continuous fill-and-verify loop, scored against your standards and written back to your PIM or source of truth, is the work Anglera does. Your PIM stores the data and your scorecard tells you where it's thin; closing the gap on every SKU, every week, is the job that doesn't end.

Step-by-step checklist

  • Write the per-category required-attribute map for each job (findability, comparison, conversion, syndication) before measuring anything — this is your denominator
  • Measure all six dimensions, not just completeness: completeness, accuracy, consistency, validity, uniqueness, freshness
  • Build a junk-fill blocklist (N/A, TBD, '-', title repeated as description) and exclude it from completeness counts
  • Report completeness as a distribution and as '% of SKUs above launch threshold,' never as a lone average
  • Validate format rules automatically: GTIN check digits, numeric weights, enum values, resolving URLs
  • Measure accuracy on a stratified random sample of 300–400 SKUs against a named source of truth, with a confidence interval attached
  • Define and publish the match key you use for duplicate detection
  • Set a freshness window per field (price in hours, dimensions in years) and measure against it
  • Build a weighted composite scorecard, weighting by business impact, and add disqualifying gates for safety/GTIN/critical-spec failures
  • Pair each dimension with a downstream metric (feed rejection rate, return rate, search exit rate) so the score ties to revenue
  • Automate the cheap dimensions on every export; re-sample accuracy quarterly; score new SKUs at intake, not months later

Frequently asked questions

What is a good product data quality score?

There's no universal pass mark, because the target depends on the job. For a discovery-driven ecommerce catalog, a reasonable bar is 95%+ validity, 90%+ on completeness of required attributes for launch-ready categories, sub-1% duplicate rate, and accuracy above 95% on sampled fields. More useful than a single threshold is a trend and a launch gate: define 'launch-ready' per category, then track the percentage of SKUs that clear it. A flat 'we're at 88%' tells you far less than 'we moved 1,200 SKUs over the launch gate this quarter.'

How is data accuracy different from data completeness?

Completeness asks whether a value is present; accuracy asks whether it's correct. They're independent — a field can be 100% populated and 40% wrong. Critically, completeness is measurable by querying your own database, while accuracy requires comparison to an external source of truth (the spec sheet, the physical product, a regulatory database) and is therefore measured by sampling. Teams skip accuracy because it's harder, then wonder why returns and 'not as described' claims stay high.

How big a sample do I need to measure accuracy?

For a catalog-wide accuracy estimate, a random sample of roughly 300–400 SKUs gives you about ±5 percentage points at 95% confidence, and that holds whether your catalog is 10,000 or 10 million SKUs, because sample size is driven by the margin of error and confidence level you want, not by catalog size. If you need per-category accuracy, you need that sample size within each category that matters. Always stratify by category and supplier, because error rates cluster there, and report the interval and sample size alongside the number.

Can I measure product data quality automatically?

Four of the six dimensions — completeness, validity, consistency, and uniqueness — are pure database checks and should be fully automated on every export or on a schedule. Freshness is automatable if you track update timestamps per field. Accuracy is the one dimension you can't fully automate, because it requires comparison to external reality; it's measured by sampling and human or source-backed verification. So the honest answer is: automate roughly 80% of the measurement, and sample the rest on a cadence.

Why not just use the completeness percentage my PIM reports?

PIM completeness usually counts any non-null value as 'complete,' which means junk fills (N/A, TBD, a description that just repeats the title) inflate the score, and it typically measures against a global field list rather than a per-category requirement map. It also tells you nothing about accuracy, consistency, or freshness. Use it as a rough starting signal, but a defensible quality program redefines completeness against required-attribute maps, strips junk fills, and measures the other five dimensions the PIM doesn't.

How often should I re-measure?

Automate the cheap dimensions to run continuously or on every catalog export, and instrument intake so new and updated SKUs are scored at the door. Re-sample accuracy quarterly for most catalogs, monthly for fast-moving categories or volatile suppliers. The point is to turn quality from a once-a-year audit snapshot into a trend line split by category and supplier — that's what surfaces the supplier feed that quietly degraded, which is the most common cause of a sudden drop.
