How to fix miscategorized SKUs across a large catalog

Miscategorization is the catalog problem nobody schedules time for. A SKU lands in a roughly-right bucket during onboarding, an analyst moves on, and the cost shows up months later as a product that never appears in the filtered view a buyer is actually looking at. At ten thousand SKUs you can argue it's tolerable. At two hundred thousand, spread across a primary taxonomy plus a different tree for every marketplace and the buyer's own internal categories, it's a structural drag on discoverability, on-site search, and increasingly on whether an AI agent ever surfaces you at all.

The instinct is to "clean it up": open a spreadsheet, sort by category, and start fixing. That works for a few hundred rows and collapses past a few thousand, because the hard part isn't editing values. It's deciding the correct value for each SKU against a tree you may not fully control, in a way that stays correct as the tree changes. This guide treats the cleanup as a repeatable process rather than a one-time session of spreadsheet heroics.

What follows is the sequence we'd actually run: quantify the problem before touching it, fix the taxonomy before the data, classify in bulk with the right mix of rules and judgment, validate against signals you can measure, write the result back to the system of record, and put a guardrail in place so the same drift doesn't re-accumulate. The goal is a catalog where every SKU sits in the deepest correct node on every channel — and stays there.

First, define what "miscategorized" means for your catalog

Before you can fix anything you need a precise, testable definition, because "wrong category" hides at least four distinct failure modes and each one is fixed differently:

Flatly wrong — a cordless drill filed under hand tools. Rare but high-impact; usually a data-entry or bad-mapping error.
Too shallow — a yoga mat sitting in Sports & Outdoors instead of Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats. Technically "correct" at the top, invisible at the bottom, and it inherits none of the facets (material, thickness) that the deep node would expose.
Ambiguously placed — products that legitimately could sit in two nodes (a "desk lamp" vs. "task lighting"), where you've been inconsistent across similar SKUs.
Channel-mismatched — correct in your internal taxonomy, wrong against Google's product taxonomy, Amazon's browse nodes, or a specific distributor's tree.

Write these down as your taxonomy of errors. Most teams discover that "miscategorized" in their catalog is overwhelmingly too shallow and channel-mismatched, not flatly wrong — which changes the whole remediation strategy, because shallow and mismatched are mapping problems, not correction problems.

Quantify the problem before you touch a single SKU

You cannot prioritize what you haven't measured, and "the catalog is a mess" is not a number a stakeholder will fund. Run a diagnostic pass that produces a defect rate and a ranked worklist:

Leaf-depth distribution. Histogram every SKU by how deep its category path goes. A catalog where 40% of SKUs sit at depth 1–2 in a taxonomy that runs 4–5 deep has a shallowness problem you can size in one chart.
Sibling outlier detection. Within each category node, compare key attributes (title tokens, brand, GPC/UNSPSC code, supplier class) across the SKUs that share it. A node full of "valves" with three "safety glasses" in it flags the outliers automatically.
Attribute–category contradiction. If a SKU has voltage and amperage populated but lives under Plumbing, the attributes contradict the category. These cross-field conflicts are some of the highest-precision miscategorization signals you have.
Behavioral signals. Pull on-site search and browse analytics. Products with impressions but near-zero filtered-view appearances, or high "viewed from search, never from category" ratios, are frequently mis-shelved.
Channel rejection logs. Google Merchant Center disapprovals, Amazon category suppressions, and marketplace mapping errors are a free, already-labeled list of wrong categories — mine them first.
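The leaf-depth diagnostic can be sketched in a few lines. This is a minimal illustration, assuming category paths are stored as strings with " > " between levels. The sample rows and field names (sku, path) are hypothetical, not a real PIM schema.

```python
from collections import Counter

# Hypothetical sample rows; field names are assumptions for illustration.
skus = [
    {"sku": "A1", "path": "Sports & Outdoors"},
    {"sku": "A2", "path": "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats"},
    {"sku": "A3", "path": "Tools > Power Tools"},
    {"sku": "A4", "path": "Tools"},
]

def path_depth(path: str, sep: str = ">") -> int:
    """Count the non-empty levels in a category path."""
    return len([part for part in path.split(sep) if part.strip()])

def depth_histogram(rows):
    """Map depth -> number of SKUs sitting at that depth."""
    return Counter(path_depth(r["path"]) for r in rows)

def shallow_share(rows, max_shallow_depth: int = 2) -> float:
    """Fraction of SKUs at or above the given depth (1 = top level)."""
    if not rows:
        return 0.0
    shallow = sum(1 for r in rows if path_depth(r["path"]) <= max_shallow_depth)
    return shallow / len(rows)

hist = depth_histogram(skus)
share = shallow_share(skus)
print(dict(hist), share)
```

The shallow share is the single number to put in front of a stakeholder. The cutoff of depth 2 is an assumption and should follow how deep your own taxonomy actually runs.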

The output is a single ranked table: SKU, current node, suspected correct node, error type, confidence, and revenue/traffic weight. That table is your project plan.
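One way to produce that table from the attribute-versus-category check is sketched below. The hint table, the fixed confidence value, and ranking by confidence times weight are all assumptions made for illustration. A real pass would combine several signals and calibrate confidence per signal.

```python
from dataclasses import dataclass

# Assumed rule table: attributes that imply a top-level branch.
ATTRIBUTE_HINTS = {
    "voltage": "Electrical",
    "amperage": "Electrical",
    "thread_size": "Plumbing",
}

@dataclass
class WorklistRow:
    sku: str
    current_node: str
    suspected_node: str
    error_type: str
    confidence: float
    weight: float  # revenue or traffic weight

    @property
    def priority(self) -> float:
        # Assumed ranking: confident, high-value fixes float to the top.
        return self.confidence * self.weight

def find_contradiction(sku: dict):
    """Return the branch implied by populated attributes if it conflicts with the current one."""
    top = sku["path"].split(">")[0].strip()
    implied = {
        ATTRIBUTE_HINTS[name]
        for name, value in sku["attributes"].items()
        if name in ATTRIBUTE_HINTS and value not in (None, "")
    }
    if implied and top not in implied:
        return sorted(implied)[0]
    return None

def build_worklist(skus, confidence: float = 0.9):
    rows = []
    for s in skus:
        suspected = find_contradiction(s)
        if suspected is not None:
            rows.append(WorklistRow(s["sku"], s["path"], suspected,
                                    "flatly_wrong", confidence, s["weight"]))
    return sorted(rows, key=lambda r: r.priority, reverse=True)

# Hypothetical catalog rows.
catalog = [
    {"sku": "P-100", "path": "Plumbing > Fittings",
     "attributes": {"voltage": "120V", "amperage": "15A"}, "weight": 50.0},
    {"sku": "P-200", "path": "Plumbing > Valves",
     "attributes": {"thread_size": "1/2 in"}, "weight": 900.0},
    {"sku": "P-300", "path": "Plumbing",
     "attributes": {"voltage": "240V"}, "weight": 400.0},
]
worklist = build_worklist(catalog)
for row in worklist:
    print(row.sku, row.current_node, "->", row.suspected_node, row.priority)
```

P-200 never enters the worklist because its attributes agree with its branch, which is the point of the check: only contradictions become work.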

Fix the taxonomy before you fix the data

A very common, very expensive mistake is to start re-classifying SKUs against a taxonomy that is itself broken — duplicate nodes, overlapping leaves, missing branches. You'll spend weeks placing products into categories you'll later merge or delete. Stabilize the tree first:

Establish one golden internal taxonomy as your source of truth, ideally aligned to a standard (GS1 GPC, UNSPSC, eClass, or your industry's equivalent) so it maps cleanly outward later.
De-duplicate and disambiguate leaves. Merge Lighting > Lamps and Lamps > Lighting. Every leaf node needs a one-sentence inclusion rule and a couple of explicit exclusions so two analysts (or two models) make the same call.
Build explicit crosswalks from your golden tree to each channel's taxonomy — Google product categories, Amazon browse nodes, each major customer's internal scheme. Maintain these as versioned mapping tables, not tribal knowledge. When Google reshuffles its taxonomy (it does, periodically), you update one crosswalk, not a hundred thousand SKUs.
Decide the tie-break rules for genuinely dual-natured products up front, and apply them consistently. Consistency beats theoretical perfection here.

This step is governance, not data work, and it's the highest-leverage time you'll spend. A clean taxonomy with documented rules makes the bulk classification that follows dramatically more accurate.
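A versioned crosswalk can be as simple as a lookup table with a channel name and a version attached. The sketch below assumes golden paths are " > " separated strings. The channel name, version label, and channel node identifiers are made up for illustration and do not correspond to any real marketplace taxonomy. Falling back to the nearest mapped ancestor is a design choice of this sketch, not a requirement.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Crosswalk:
    channel: str
    version: str
    mapping: Dict[str, str]  # golden path -> channel node

    def resolve(self, golden_path: str) -> Tuple[Optional[str], bool]:
        """Return (channel_node, exact). Falls back to the nearest mapped ancestor."""
        parts = [p.strip() for p in golden_path.split(">") if p.strip()]
        full_depth = len(parts)
        while parts:
            key = " > ".join(parts)
            if key in self.mapping:
                return self.mapping[key], len(parts) == full_depth
            parts.pop()  # try the parent node next
        return None, False

# Hypothetical crosswalk for a hypothetical channel.
crosswalk = Crosswalk(
    channel="channel_x",
    version="2024-06",
    mapping={
        "Lighting > Lamps": "CX:lamps",
        "Lighting > Lamps > Desk Lamps": "CX:desk-lamps",
    },
)

exact_hit = crosswalk.resolve("Lighting > Lamps > Desk Lamps")
fallback_hit = crosswalk.resolve("Lighting > Lamps > Floor Lamps")
miss = crosswalk.resolve("Plumbing > Valves")
print(exact_hit, fallback_hit, miss)
```

The exact flag matters: a fallback hit is a too-shallow channel placement in the making, so those rows are worth logging as gaps in the crosswalk rather than shipping silently.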

Choose a classification method that matches your scale and judgment load

There is no single right tool — there's a portfolio, and mature catalogs use all of these in layers, cheapest-and-most-certain first:

Deterministic rules / lookups — When a supplier class, a GTIN-to-category map, or an exact attribute pattern uniquely determines the node, use a rule. It's free, auditable, and 100% repeatable. Push as much volume through rules as you can; it's your highest-confidence tier.
Classical ML classifier — Train on your already-correct SKUs (title, description, attributes → leaf node). Good for high-volume, repetitive catalogs with stable categories. Needs labeled data and retraining as the tree evolves, and it gives you a probability you can threshold on.
LLM-based classification — A model that reads what a product actually is (messy title, spec sheet, supplier copy) and matches it to the deepest correct node, with a rationale. This is the strongest option for the too-shallow and ambiguous cases that defeat rules, and for mapping into unfamiliar channel trees. Constrain it to your taxonomy (don't let it invent nodes), require a confidence score and a cited reason, and route low-confidence calls to review.
Human review — Not for the whole catalog — for the long tail of genuinely ambiguous, low-confidence, or high-revenue SKUs the automated tiers flag. Humans are expensive and inconsistent at volume, which is the original cause of the mess, so spend them only where judgment genuinely can't be encoded.

The tradeoff: rules are cheap but brittle; ML scales but needs labels and drifts; LLMs handle judgment and novelty but cost more per SKU and need guardrails; humans handle true edge cases but don't scale and grow inconsistent at volume. Layer them so each SKU is handled by the cheapest method that can be confident about it.
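The layering can be sketched as a cascade that tries each tier in cost order and stops at the first confident answer. Everything named here is hypothetical: the rule table, the keyword-based model_tier (a stand-in for a trained classifier or an LLM call), and the 0.8 threshold.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

class Decision(NamedTuple):
    node: Optional[str]
    confidence: float
    tier: str

# Assumed taxonomy and rule table for illustration.
ALLOWED_NODES = {
    "Tools > Power Tools > Drills",
    "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
    "Lighting > Lamps",
}
SUPPLIER_CLASS_RULES = {"PWR-DRILL": "Tools > Power Tools > Drills"}

def rule_tier(sku: dict) -> Optional[Tuple[str, float]]:
    """Deterministic lookup: a hit is treated as fully confident."""
    node = SUPPLIER_CLASS_RULES.get(sku.get("supplier_class"))
    return (node, 1.0) if node else None

def model_tier(sku: dict) -> Optional[Tuple[str, float]]:
    """Placeholder for a classifier or LLM; keyword matching only for the demo."""
    title = sku.get("title", "").lower()
    if "yoga mat" in title:
        return ("Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats", 0.92)
    if "lamp" in title:
        return ("Lighting > Lamps", 0.55)
    return None

Tier = Callable[[dict], Optional[Tuple[str, float]]]

def classify(sku: dict, tiers: List[Tuple[str, Tier]], threshold: float = 0.8) -> Decision:
    for name, tier in tiers:
        result = tier(sku)
        if result is None:
            continue
        node, confidence = result
        if node not in ALLOWED_NODES:
            continue  # never accept a node outside the taxonomy
        if confidence >= threshold:
            return Decision(node, confidence, name)
    return Decision(None, 0.0, "human_review")

TIERS = [("rules", rule_tier), ("model", model_tier)]

drill = classify({"supplier_class": "PWR-DRILL", "title": "18V cordless drill"}, TIERS)
mat = classify({"title": "Non-slip yoga mat 6mm"}, TIERS)
lamp = classify({"title": "Adjustable desk lamp"}, TIERS)
print(drill.tier, mat.tier, lamp.tier)
```

The lamp falls through to human review because the only tier with an opinion was below threshold. Recording which tier made each call also gives you a per-tier accuracy number once the sample audits come back.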

Run remediation in waves, not one big bang

Don't reclassify the whole catalog in a single transaction. Sequence it so you bank value early and contain risk:

Wave 1 — high-confidence, high-impact. The flatly-wrong and obvious-shallow SKUs that also carry traffic or revenue. These are the ones where a rule or a high-confidence model call agrees with a cross-field signal. Big wins, low risk.
Wave 2 — systematic shallowness. Whole branches that default to a shallow node. Often fixable with a single mapping rule applied across thousands of SKUs at once.
Wave 3 — channel re-mapping. Re-run the channel crosswalks so the now-correct internal node propagates to Google, Amazon, and customer trees. Clear the rejection/suppression backlog here.
Wave 4 — the ambiguous long tail. Human-in-the-loop review of everything that stayed low-confidence.

For each wave, stage the change, diff it against current values, sample-audit before you commit, and keep the previous category in a category_previous field so any wave is reversible. Track a defect rate per wave so you can show the trend line moving.
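A minimal sketch of the stage, diff, commit, and rollback cycle follows, assuming the catalog is a dictionary keyed by SKU with a category field. Only category_previous comes from the text above. The other names are illustrative.

```python
def stage_wave(catalog: dict, proposals: dict) -> list:
    """Build a diff of proposed changes without modifying the catalog."""
    diff = []
    for sku_id, new_node in proposals.items():
        old_node = catalog[sku_id]["category"]
        if old_node != new_node:
            diff.append({"sku": sku_id, "from": old_node, "to": new_node})
    return diff

def commit_wave(catalog: dict, diff: list) -> None:
    """Apply the diff, keeping the old value so the wave stays reversible."""
    for change in diff:
        record = catalog[change["sku"]]
        record["category_previous"] = record["category"]
        record["category"] = change["to"]

def rollback_wave(catalog: dict, diff: list) -> None:
    """Restore the previous category for every SKU touched by the diff."""
    for change in diff:
        record = catalog[change["sku"]]
        if record.get("category_previous") is not None:
            record["category"] = record["category_previous"]
            record["category_previous"] = None

# Hypothetical catalog and proposals.
catalog = {
    "Y-1": {"category": "Sports & Outdoors"},
    "D-1": {"category": "Tools > Power Tools > Drills"},
}
proposals = {
    "Y-1": "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
    "D-1": "Tools > Power Tools > Drills",  # unchanged, so it stays out of the diff
}

diff = stage_wave(catalog, proposals)
commit_wave(catalog, diff)
committed = catalog["Y-1"]["category"]
rollback_wave(catalog, diff)
restored = catalog["Y-1"]["category"]
print(diff, committed, restored)
```

A single category_previous field only supports one level of undo. If waves can overlap on the same SKU, an append-only change log is the safer variant.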

Validate before you write back — and measure after

A reclassification that's confidently wrong is worse than the original, because it looks fixed. Gate every wave with validation:

Sample audit. Pull a stratified random sample (by error type and confidence band) and have a human grade it. Hold to a published precision bar — e.g., "95% correct at the leaf before this wave ships."
Consistency checks. After the change, re-run sibling-outlier and attribute–category-contradiction detection. The number of contradictions should drop, not move sideways.
No-orphan check. Confirm every reassigned SKU lands on a live, indexable leaf that exposes the right facets — not a node that's hidden or facet-less.
Post-fix behavioral lift. This is the real scorecard. Watch filtered-view impressions, category-page entrances, on-site search recall, and channel approval rates for the affected SKUs over the following weeks. Correct categorization should show up as products appearing in browse paths they were absent from before. If the numbers don't move, your "correct" node may still be wrong or too shallow.

Measurement closes the loop: it tells you whether the project actually bought you discoverability or just rearranged labels.
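The sample audit and precision gate can be sketched as below. The field names (error_type, confidence_band), the per-stratum sample size, and the fixed seed are assumptions. The 95% bar is the example figure from the text above.

```python
import random
from collections import defaultdict

def stratified_sample(rows, per_stratum: int, seed: int = 0):
    """Draw up to per_stratum rows from each (error_type, confidence_band) group."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[(row["error_type"], row["confidence_band"])].append(row)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def passes_gate(grades, bar: float = 0.95) -> bool:
    """grades: one boolean per audited SKU, True when the human agreed with the new leaf."""
    if not grades:
        return False  # no audit, no ship
    return sum(grades) / len(grades) >= bar

# Hypothetical reclassified rows awaiting audit.
rows = (
    [{"sku": f"S-{i}", "error_type": "too_shallow", "confidence_band": "high"} for i in range(6)]
    + [{"sku": f"C-{i}", "error_type": "channel_mismatched", "confidence_band": "low"} for i in range(4)]
)
sample = stratified_sample(rows, per_stratum=3)
print(len(sample), passes_gate([True] * 96 + [False] * 4))
```

Stratifying keeps a small, low-confidence group from being drowned out by a large, easy one. For finer control, compute precision per stratum as well as overall.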

Write the fix back to the system of record — not just the channel

The cleanup only sticks if the corrected category lands in your source of truth — the PIM, ERP, or master catalog — and flows outward from there. Patch a category directly in a Google feed or an Amazon listing and you've created a fork: the next full sync from your PIM overwrites your fix with the original wrong value, and you're back where you started next quarter.

The durable pattern is one-directional: corrected internal node written to the PIM, channel-specific nodes derived from it through the versioned crosswalks, feeds regenerated from that. Your PIM stores the data; the categorization work — gathering, classifying to the deepest correct node per channel, and writing it back — is the layer that sits alongside it. That's exactly the kind of high-volume, rules-plus-judgment work Anglera does at catalog scale: read what each SKU actually is, place it in the most specific accurate node for every channel, and write it to your source of truth so the fix propagates everywhere and survives the next sync.
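The fork problem is easy to reproduce in miniature. In this sketch the PIM is a list of dictionaries, the crosswalk is a plain lookup, and the channel node names are invented. It stands in for whatever sync job regenerates your real feeds.

```python
def regenerate_feed(pim_records, crosswalk):
    """Derive every channel category from the system of record, never from the old feed."""
    return [
        {"sku": rec["sku"], "channel_category": crosswalk.get(rec["category"])}
        for rec in pim_records
    ]

# Hypothetical data: the PIM holds a too-shallow node for this lamp.
pim = [{"sku": "L-1", "category": "Lighting > Lamps"}]
crosswalk = {
    "Lighting > Lamps": "CX:lamps",
    "Lighting > Lamps > Desk Lamps": "CX:desk-lamps",
}

feed = regenerate_feed(pim, crosswalk)

# Patching the channel feed directly creates a fork...
feed[0]["channel_category"] = "CX:desk-lamps"

# ...and the next full sync silently reverts it.
after_sync = regenerate_feed(pim, crosswalk)

# The durable fix corrects the source; the feed then follows.
pim[0]["category"] = "Lighting > Lamps > Desk Lamps"
after_source_fix = regenerate_feed(pim, crosswalk)
print(after_sync, after_source_fix)
```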

Stop the drift: make categorization a pipeline step, not a cleanup project

If you fix the catalog and change nothing about how SKUs enter it, you've bought a temporary reprieve. Re-accumulation is guaranteed because the original causes — manual entry, supplier feeds with their own (wrong) categories, new channels with new trees, taxonomy updates — are all still running.

Build the guardrail in:

Classify at intake. Every new or changed SKU gets auto-categorized and confidence-scored before it goes live. New SKUs are the cheapest to get right and the most expensive to fix later.
Quarantine low-confidence. SKUs the classifier isn't sure about don't ship to channels until reviewed.
Re-validate on taxonomy change. When a channel reshuffles its tree, re-run the affected crosswalk and re-classify only the impacted branch.
Monitor the defect rate continuously. The same diagnostics from the quantification pass, run on a schedule, turn miscategorization from a periodic fire drill into a metric with a baseline and an alert threshold.
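The intake gate and the defect-rate alert can be sketched together. The classifier here is a stub, and the threshold, baseline, and tolerance values are placeholders to be replaced with numbers from your own audits.

```python
def intake_gate(sku: dict, classify, threshold: float = 0.8) -> dict:
    """Classify a new or changed SKU and decide whether it may go live."""
    node, confidence = classify(sku)
    confident = node is not None and confidence >= threshold
    return {
        **sku,
        "category": node,
        "confidence": confidence,
        "status": "live" if confident else "quarantined",
    }

def defect_rate(flagged: int, total: int) -> float:
    """Share of SKUs flagged by the scheduled diagnostics."""
    return flagged / total if total else 0.0

def should_alert(rate: float, baseline: float, tolerance: float) -> bool:
    """Alert when the defect rate drifts above baseline by more than the tolerance."""
    return rate > baseline + tolerance

def stub_classifier(sku: dict):
    # Stand-in for the real layered classifier; values chosen for the demo.
    if "yoga mat" in sku["title"].lower():
        return ("Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats", 0.93)
    return ("Sports & Outdoors", 0.40)

sure = intake_gate({"sku": "N-1", "title": "Cork yoga mat"}, stub_classifier)
unsure = intake_gate({"sku": "N-2", "title": "Balance trainer"}, stub_classifier)
rate = defect_rate(flagged=30, total=1000)
print(sure["status"], unsure["status"], rate, should_alert(rate, baseline=0.02, tolerance=0.005))
```

Quarantined SKUs still carry the classifier's best guess and score, which gives the reviewer a starting point instead of a blank field.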

The difference between a catalog that's clean once and one that stays clean is whether categorization lives in the ingestion pipeline or in a recurring spreadsheet emergency.

Step-by-step checklist

Define your error types: separate flatly-wrong, too-shallow, ambiguous, and channel-mismatched SKUs — they're fixed differently
Run a diagnostic to produce a real defect rate: leaf-depth histogram, sibling-outlier scan, attribute-vs-category contradictions, and behavioral/rejection signals
Mine free labeled data first: Google Merchant disapprovals, Amazon suppressions, and channel mapping errors are a ready-made worklist
Fix and freeze a golden internal taxonomy (aligned to GPC/UNSPSC) with one-sentence inclusion rules and explicit exclusions per leaf before reclassifying anything
Build versioned crosswalks from your golden tree to each channel and major customer taxonomy
Layer your classification: deterministic rules first, then ML or LLM for judgment cases, then human review only for the low-confidence, high-value long tail
Constrain any model to your taxonomy, require a confidence score and a cited reason, and route below-threshold SKUs to review
Remediate in waves (high-confidence/high-impact first), staging diffs and keeping a category_previous field so every wave is reversible
Gate each wave on a published precision bar via stratified sample audits before committing
Confirm every reassigned SKU lands on a live, indexable, correctly-faceted leaf node
Write corrections to the system of record (PIM/ERP), derive channel nodes from it, and regenerate feeds — never patch the channel directly
Move categorization into the intake pipeline: classify and confidence-score new SKUs before they go live, and monitor the defect rate on a schedule

Frequently asked questions

How do I find miscategorized SKUs without manually reviewing the whole catalog?

Use signals that flag suspects automatically. The four highest-yield ones are: leaf-depth analysis (SKUs sitting shallower than their branch allows), sibling-outlier detection (products whose attributes don't match their category neighbors), attribute-versus-category contradictions (a SKU with voltage and amperage filed under plumbing), and your existing channel rejection logs (Merchant Center disapprovals, Amazon suppressions). These narrow a 200,000-SKU catalog down to a ranked worklist of a few thousand genuine suspects, which is the only part a human or model needs to look at closely.

Should I use rules, machine learning, or an LLM to reclassify products?

All three, layered cheapest-first. Push everything you can through deterministic rules and lookups — they're free, auditable, and perfectly repeatable. Use a trained classifier for high-volume, stable, repetitive categories where you have labeled data. Use an LLM for the cases rules can't handle: too-shallow placements, genuinely ambiguous products, and mapping into unfamiliar channel taxonomies, where reading what the product actually is matters. Reserve human review for the low-confidence, high-revenue long tail. The mistake is picking one method for the whole catalog.

What's the difference between a wrong category and a too-shallow one, and why does it matter?

A wrong category puts a drill under plumbing: rare and obvious. A too-shallow one leaves a yoga mat in "Sports & Outdoors" instead of the "Yoga Mats" leaf. The shallow SKU is technically "correct" at the top, but it's invisible in the deep filtered view where buyers actually decide, and it inherits none of the facets (material, thickness) the leaf node would expose. It matters because most large catalogs are overwhelmingly too-shallow, not flatly-wrong, which means your fix is a mapping-to-depth problem, not an error-correction problem.

If I fix categories in my Google or Amazon feed, won't that solve it?

Only until the next sync. Patching the channel directly creates a fork: your PIM still holds the original wrong value, and the next full feed regeneration overwrites your fix. The durable pattern is one-directional — correct the node in your system of record, derive each channel's category from it through versioned crosswalks, and regenerate feeds from there. Fix the source, and the correction propagates to every channel and survives future syncs.

How do I keep SKUs from drifting back into the wrong categories?

Move categorization out of periodic cleanup and into the ingestion pipeline. Auto-classify and confidence-score every new or changed SKU before it goes live, quarantine anything the classifier isn't sure about, re-run the affected crosswalk whenever a channel changes its taxonomy, and monitor the defect rate on a schedule with an alert threshold. New SKUs are the cheapest to get right and the most expensive to fix later, so the guardrail belongs at intake.

How do I prove the reclassification actually worked?

Validate before and measure after. Before each wave, gate it on a stratified sample audit against a published precision bar (e.g., 95% correct at the leaf). After it ships, watch the behavioral scorecard for the affected SKUs: filtered-view impressions, category-page entrances, on-site search recall, and channel approval rates. Correct categorization shows up as products appearing in browse paths they were absent from before. If those numbers don't move, your "correct" node is probably still wrong or too shallow.