Build vs buy: product data enrichment
Every distributor, retailer, brand, and manufacturer with a large catalog eventually hits the same wall: the product data is too thin to get found and too voluminous to fix by hand. At 30 to 45 minutes of manual work per SKU, a 100,000-SKU catalog is 50,000 to 75,000 hours of enrichment, or roughly 25 to 38 person-years at about 2,000 working hours a year. Nobody is staffing that. So the question becomes build or buy: staff and tool an internal enrichment function, or bring in a vendor that does the work.
This is not a religious question, and the honest answer depends on your catalog volatility, your taxonomy complexity, your engineering bandwidth, and how fast you need results. A team enriching 2,000 stable SKUs a year should not be standing up an ML pipeline. A team onboarding 50,000 supplier SKUs a quarter against a 400-attribute taxonomy will drown a hand-built tool inside two quarters. Most teams land somewhere in between, and the right call is usually a specific split of build and buy rather than an all-or-nothing bet.
This guide gives you the actual decision criteria, a real total-cost model for each path (including the costs people forget), the failure modes that kill build projects, and a checklist you can run against your own situation this week. Where Anglera is genuinely relevant we say so plainly; everywhere else the framework stands on its own.
First, define what "enrichment" actually includes
Before you can decide build vs buy, you have to scope the work, because "enrichment" hides at least six distinct jobs and most build estimates only budget for one or two:
- Gathering — pulling missing specs from supplier PDFs, spec sheets, manufacturer sites, and images. This is the hardest part to build and the part most teams underestimate.
- Cleaning and normalization — deduping, standardizing units (in/mm, lb/kg), reconciling conflicting supplier values, fixing encoding and formatting.
- Categorization — mapping every SKU to a granular node in your taxonomy and to each channel's taxonomy (Google, Amazon, vertical marketplaces).
- Attribute fill — populating the structured fields buyers and search engines filter on: dimensions, material, compatibility, certifications, voltage, thread size.
- Copy and media — titles, descriptions, and selecting or generating imagery.
- Scoring and QA — judging completeness and accuracy against a standard, and flagging what needs human review.
A build that only does "AI writes a description" solves the easy 15%. The expensive, defensible work is gathering true facts and normalizing many messy sources into one clean record. Scope all six before you cost anything.
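To make the cleaning and normalization job concrete, here is a minimal sketch of unit standardization and conflict detection across suppliers. The function names, the set of supported units, and the 0.5 tolerance are illustrative assumptions, not part of any particular tool; a real pipeline would cover far more units and attribute types.

```python
from decimal import Decimal

# Conversion factors to canonical units (millimetres and kilograms).
# The inch and pound factors are exact by definition.
TO_MM = {"mm": Decimal("1"), "cm": Decimal("10"), "m": Decimal("1000"), "in": Decimal("25.4")}
TO_KG = {"kg": Decimal("1"), "g": Decimal("0.001"), "lb": Decimal("0.45359237")}


def normalize(value: str, unit: str) -> tuple[Decimal, str]:
    """Convert one supplier value to mm (length) or kg (weight)."""
    unit = unit.strip().lower()
    amount = Decimal(value.strip())
    if unit in TO_MM:
        return amount * TO_MM[unit], "mm"
    if unit in TO_KG:
        return amount * TO_KG[unit], "kg"
    # Unknown units are surfaced rather than guessed at.
    raise ValueError(f"unsupported unit: {unit!r}")


def reconcile(values: list[tuple[str, str]], tolerance: Decimal = Decimal("0.5")):
    """Return (agreed_value, canonical_unit), or None if the sources conflict.

    `tolerance` is a hypothetical threshold expressed in the canonical unit.
    """
    if not values:
        return None
    normalized = [normalize(v, u) for v, u in values]
    units = {u for _, u in normalized}
    amounts = [a for a, _ in normalized]
    if len(units) != 1 or max(amounts) - min(amounts) > tolerance:
        return None  # conflicting sources: flag for human review
    return amounts[0], units.pop()


# Two suppliers describing the same part in different units agree...
print(reconcile([("2", "in"), ("50.8", "mm")]))
# ...while a 55 mm claim against a 2 in claim is flagged as a conflict.
print(reconcile([("2", "in"), ("55", "mm")]))
```

Returning `None` on a conflict rather than picking a winner keeps the decision with a reviewer, which matters most for the attributes that drive returns.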
The criteria that actually decide it
Run your situation against these. The more of them that point toward high volume, high complexity, and urgency, the more buy wins:
- Catalog volume and churn. A few thousand stable SKUs favor build (or even manual work). Tens of thousands with constant supplier onboarding favor buy, because throughput is the whole problem.
- Taxonomy and attribute complexity. Flat catalog with 20 attributes is buildable. Deep, multi-vertical taxonomies with hundreds of conditional attributes need accuracy you won't reach with a weekend of prompt engineering.
- Source heterogeneity. If your data comes from one clean ERP export, build is realistic. If it comes from 200 suppliers in 200 formats (PDFs, fax-quality scans, inconsistent CSVs), gathering and normalization dominate, and that's specialist work.
- Required accuracy and liability. Wrong dimensions cause returns; wrong certifications or compliance flags cause recalls and legal exposure. High-stakes attributes raise the QA bar past most internal builds.
- Time to value. Build is 6–12 months to something production-grade. If you need lift this quarter, that alone can decide it.
- Engineering opportunity cost. Every engineer on an enrichment pipeline is an engineer not on your core product. Ask whether enrichment is a competitive differentiator for you or undifferentiated heavy lifting.
- Ongoing maintenance appetite. Enrichment is a loop, not a project (see below). Build means you own that loop forever — model drift, new channel rules, taxonomy changes, supplier format changes.
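One way to run the criteria above against your own numbers is a simple tally of which ones lean toward buying. This is a rough sketch, not a scoring method from this guide: the `CatalogProfile` fields and every threshold are illustrative assumptions loosely based on the ranges mentioned in the list, and you should replace them with your own.

```python
from dataclasses import dataclass


@dataclass
class CatalogProfile:
    sku_count: int
    attribute_count: int
    supplier_formats: int
    has_high_liability_attributes: bool
    months_until_results_needed: int
    enrichment_is_differentiator: bool


def buy_signals(p: CatalogProfile) -> list[str]:
    """List the criteria that lean toward buying. All thresholds are illustrative."""
    signals = []
    if p.sku_count >= 10_000:          # "tens of thousands" of SKUs
        signals.append("catalog volume")
    if p.attribute_count >= 100:       # "hundreds of conditional attributes"
        signals.append("attribute complexity")
    if p.supplier_formats >= 50:       # hypothetical cut-off between 1 and 200 formats
        signals.append("source heterogeneity")
    if p.has_high_liability_attributes:
        signals.append("accuracy and liability")
    if p.months_until_results_needed < 6:   # a build takes 6 to 12 months
        signals.append("time to value")
    if not p.enrichment_is_differentiator:
        signals.append("engineering opportunity cost")
    return signals


heavy = CatalogProfile(50_000, 400, 200, True, 3, False)
light = CatalogProfile(2_000, 20, 1, False, 12, True)
print(len(buy_signals(heavy)), "of 6 criteria lean toward buy")
print(len(buy_signals(light)), "of 6 criteria lean toward buy")
```

The tally is a conversation starter rather than a verdict: a single criterion, such as needing results this quarter, can outweigh all the others.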
The real cost of building (including what estimates miss)
A credible internal build is not a script. It's a system plus a standing team. Budget honestly:
- People: 1–2 ML/data engineers, a backend engineer, a taxonomy/data steward, and a pool of human reviewers for QA and edge cases. Loaded, that's commonly $600K–$1.2M/year before it enriches a single SKU at scale.
- The data-acquisition problem: gathering specs from PDFs and supplier sites means OCR, document parsing, web extraction, and per-source maintenance. Each supplier format you support is a small ongoing liability. This is where build timelines slip from months to years.
- Evaluation infrastructure: you can't ship enrichment you can't measure. You need a labeled gold set, completeness/accuracy scoring, and regression checks — a real engineering investment most teams forget to budget.
- The 80/20 trap: the demo works on clean SKUs in week three. The last 20% — ambiguous categories, conflicting sources, rare attributes, units that don't convert cleanly — consumes 80% of the timeline and never fully "finishes."
- Maintenance: models drift, channel schemas change, suppliers reformat. Plan for the team to stay, not disband at launch.
Build wins when enrichment logic is genuinely proprietary to your business, your volume justifies the standing team, and you have engineers to spare. It loses when you're rebuilding generic gather-clean-enrich-score plumbing that a vendor already runs at scale.
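The evaluation infrastructure mentioned above can start small. The sketch below scores enriched output against a labeled gold set and flags regressions between runs. The metric definitions, the case-insensitive string match, and the `max_drop` tolerance are assumptions chosen for illustration; real scoring usually needs per-attribute comparison rules.

```python
def score_against_gold(predicted: dict[str, dict[str, str]],
                       gold: dict[str, dict[str, str]]) -> dict[str, float]:
    """Completeness = share of gold attributes that were filled.
    Accuracy = share of filled attributes that match the gold value."""
    expected = filled = correct = 0
    for sku, gold_attrs in gold.items():
        pred_attrs = predicted.get(sku, {})
        for name, gold_value in gold_attrs.items():
            expected += 1
            value = pred_attrs.get(name)
            if value is None or value.strip() == "":
                continue  # missing or blank counts against completeness only
            filled += 1
            if value.strip().lower() == gold_value.strip().lower():
                correct += 1
    return {
        "completeness": filled / expected if expected else 0.0,
        "accuracy": correct / filled if filled else 0.0,
    }


def regressed(current: dict[str, float], baseline: dict[str, float],
              max_drop: float = 0.01) -> list[str]:
    """Names of metrics that fell more than max_drop below the baseline."""
    return [m for m, base in baseline.items() if base - current.get(m, 0.0) > max_drop]


gold = {
    "A1": {"material": "Steel", "voltage": "120V"},
    "A2": {"material": "Brass", "voltage": "240V"},
}
predicted = {
    "A1": {"material": "steel", "voltage": "110V"},  # one match, one wrong value
    "A2": {"material": "Brass"},                     # voltage never filled
}
scores = score_against_gold(predicted, gold)
print(scores)
print(regressed(scores, {"completeness": 0.80, "accuracy": 0.66}))
```

Running the same gold set after every model, prompt, or parser change is what turns this from a report into a regression check.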
The real cost of buying (including what to watch for)
Buying converts a capital project into an operating cost and collapses time-to-value from quarters to weeks. But "buy" is not free of work or risk:
- Pricing models vary widely: per-SKU, per-attribute, per-seat, or platform subscription. Per-SKU looks cheap until you re-enrich on every taxonomy change; understand what triggers a recharge.
- Accuracy is the product. Demand to see accuracy and completeness numbers on your SKUs, not a polished sample. Run a paid pilot on a representative, messy slice of your catalog before committing.
- Integration and write-back. The decisive question: does the vendor write enriched data back into your source of truth, or only into a feed? If it only feeds channels, the enrichment exists only at the point where data leaves your systems, and your real record stays thin (more below).
- Taxonomy fit. Can it map to your taxonomy and each channel's, or does it impose its own? Generic categorization that doesn't match your filters is busywork.
- Lock-in and data ownership. Confirm you own the enriched output and can export it cleanly if you leave.
- Human-in-the-loop. The best vendors score and surface low-confidence items for review rather than silently guessing. Ask how QA works.
Buy wins when you need throughput and speed, when enrichment isn't your differentiator, and when the vendor can prove accuracy on your data and write results back where they compound.
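The warning about per-SKU pricing is easy to check with arithmetic. The sketch below estimates a yearly bill when re-enrichment is charged like a new SKU. The price, volumes, and the assumption that a recharge costs the full per-SKU rate are all hypothetical; substitute the terms from an actual quote.

```python
def annual_per_sku_cost(price_per_sku: float, new_skus: int,
                        catalog_size: int, reenrichments_per_year: int,
                        share_reenriched: float) -> float:
    """Yearly bill if both new SKUs and re-enriched SKUs are charged per SKU."""
    recharged = catalog_size * share_reenriched * reenrichments_per_year
    return price_per_sku * (new_skus + recharged)


# Hypothetical numbers: $0.50 per SKU, 20,000 new SKUs a year, 100,000-SKU catalog.
without_recharge = annual_per_sku_cost(0.50, 20_000, 100_000, 0, 0.5)
# Two taxonomy changes a year, each touching half the catalog.
with_recharge = annual_per_sku_cost(0.50, 20_000, 100_000, 2, 0.5)

print(f"new SKUs only:       ${without_recharge:,.0f}")
print(f"with re-enrichment:  ${with_recharge:,.0f}")
```

Under these made-up inputs the bill grows sixfold once recharges are counted, which is why the recharge trigger belongs in the contract discussion rather than in a footnote.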
Why "buy a PIM and build on top" is a common, expensive mistake
Many teams think they're choosing build vs buy when they're actually buying the wrong thing. A PIM is a system of record — a very good filing cabinet. It stores a clean product record; it does not produce one. A PIM will not gather a missing spec from a supplier PDF, normalize twelve suppliers into one taxonomy, write a description, or score completeness against buyer signals.
So "we'll buy a PIM and enrich inside it" quietly becomes a build project: the PIM holds empty fields, and filling them still lands on a person or a pipeline you now have to build. The PIM's new "AI enrich" button helps at the margin but doesn't solve gathering or normalization at catalog scale.
The clean way to think about it: the PIM is the destination, not the labor. Decide separately how the fields get filled — that's the build-vs-buy decision this guide is about. Buying storage is not buying enrichment.
The hybrid that usually wins: buy the engine, own the standard
For most mid-to-large catalogs the right answer isn't pure build or pure buy. It's a split:
- Buy the heavy, undifferentiated machinery: document parsing, web gathering, normalization, first-pass attribute fill, categorization, and scoring at scale.
- Build/own the parts that are genuinely yours: your taxonomy and attribute standard, your definition of "complete enough," your business rules, your QA thresholds, and the human review of high-liability attributes.
This gets you throughput without surrendering control of the things that are actually your competitive edge. You're not paying a vendor to learn your business from scratch, and you're not paying your engineers to rebuild OCR and entity resolution.
A practical rule: buy the work, own the judgment. The vendor should do the gathering, cleaning, enriching, and scoring; you should set and audit the standard it's scored against.
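As an illustration of owning the judgment, the review rule can live in a few lines that your team controls, regardless of who runs the engine. In this sketch the `AttributeValue` shape, the vendor-reported `confidence` score, and the 0.80 threshold are assumptions; the high-liability attribute names echo the examples used in this guide.

```python
from dataclasses import dataclass

# Hypothetical standard owned by the catalog team, not the vendor.
DEFAULT_THRESHOLD = 0.80
HIGH_LIABILITY = {"dimensions", "certifications", "compliance"}


@dataclass
class AttributeValue:
    sku: str
    name: str
    value: str
    confidence: float  # assumed vendor-reported score between 0 and 1


def route(item: AttributeValue) -> str:
    """Return 'review' or 'accept' for one enriched attribute."""
    if item.name in HIGH_LIABILITY:
        return "review"  # high-liability attributes always get a human check
    return "accept" if item.confidence >= DEFAULT_THRESHOLD else "review"


batch = [
    AttributeValue("SKU-1", "material", "stainless steel", 0.95),
    AttributeValue("SKU-1", "certifications", "UL listed", 0.99),
    AttributeValue("SKU-2", "thread size", "M8", 0.55),
]
for item in batch:
    print(item.sku, item.name, "->", route(item))
```

Keeping the threshold and the high-liability list in your own configuration means you can tighten the standard without renegotiating anything with the vendor.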
Enrichment is a loop, not a project — budget for both
Whichever path you choose, the single biggest forecasting error is treating enrichment as a one-time cleanup. It isn't. New SKUs arrive, suppliers reformat, channels change required attributes, and AI shopping surfaces raise the bar on structured data. Do it once as a project and the catalog drifts back to thin by Q3.
That changes the build-vs-buy math: you're not comparing a one-time build cost to a one-time vendor fee. You're comparing two ongoing operating commitments. Build means a standing team that owns the loop forever. Buy means a recurring fee for someone else to run it. Cost the loop, not the launch.
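Costing the loop means comparing running totals, not launch prices. The sketch below uses the $600K to $1.2M annual team range quoted earlier in this guide for the build side; the vendor fee and the internal owner cost on the buy side are placeholder figures invented for illustration, so treat the output as the shape of the comparison rather than a result.

```python
def cumulative(annual_cost: int, years: int) -> list[int]:
    """Running total of a recurring annual commitment."""
    return [annual_cost * year for year in range(1, years + 1)]


BUILD_LOW, BUILD_HIGH = 600_000, 1_200_000  # standing team range from this guide
VENDOR_FEE = 250_000       # hypothetical recurring fee
INTERNAL_OWNER = 150_000   # hypothetical cost of owning the standard in-house

YEARS = 3
build_low = cumulative(BUILD_LOW, YEARS)
build_high = cumulative(BUILD_HIGH, YEARS)
buy = cumulative(VENDOR_FEE + INTERNAL_OWNER, YEARS)

for year in range(YEARS):
    print(f"year {year + 1}: build ${build_low[year]:,} to ${build_high[year]:,}"
          f" | buy ${buy[year]:,}")
```

Whatever figures you plug in, both columns keep growing every year, which is the point: neither path has a one-time price.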
It also reframes where the work should live. If enrichment runs continuously, you want it to land upstream — written back into your source of truth — so every channel, marketplace, and AI assistant inherits one complete record instead of you re-fixing the same SKU at each exit.
Where Anglera fits
This is exactly the line Anglera draws: your PIM stores the data; Anglera does the work. Anglera runs the full enrichment loop — gathering, cleaning, enriching, categorizing, and scoring every SKU against buyer signals (how buyers actually search, compare, and decide) — and writes the enriched result back into your source of truth, not just out to a feed. It sits alongside your PIM or ERP; it is not a PIM and not a CRM.
In build-vs-buy terms, Anglera is the "buy the engine" half of the hybrid above: it does the undifferentiated heavy lifting at catalog scale while you keep ownership of your taxonomy, your standard, and your QA thresholds. Typical implementation runs about 30 days, which is the practical argument against building: you get production throughput this quarter instead of standing up a pipeline over the next two to four.
If enrichment genuinely is your differentiator and you have engineers to spare, build. If it's undifferentiated work standing between your catalog and getting found, buying the loop and owning the standard is almost always the cheaper, faster call.
Evaluation checklist
- Scope all six jobs (gather, clean, categorize, fill attributes, copy/media, score) — not just "write descriptions" — before estimating either path
- Count your real numbers: SKU volume, quarterly churn, attribute count, and number of distinct supplier source formats
- Identify high-liability attributes (dimensions, certifications, compliance) and set the QA accuracy bar they require
- Cost a build as a standing team plus eval infrastructure plus maintenance — not a one-time script — and add the 80/20 long tail
- Cost a buy as a recurring operating commitment, and pin down exactly what triggers a re-charge (re-enrichment, taxonomy changes)
- Run a paid pilot on a messy, representative slice of your catalog and demand accuracy/completeness numbers on YOUR SKUs
- Confirm any vendor writes enriched data back into your source of truth, not just into channel feeds
- Verify taxonomy fit: it must map to your taxonomy and each channel's, and you must own and be able to export the output
- Separate the decisions: buying a PIM is buying storage, not enrichment — decide how fields get filled independently
- Default to the hybrid — buy the engine (gathering, normalization, scoring), own the standard (taxonomy, rules, QA)
- Budget for the loop, not the launch: account for new SKUs, supplier reformats, and changing channel/AI requirements
- Set a time-to-value deadline; if you need lift this quarter, that alone may rule out a 6–12 month build
Frequently asked questions
Is it ever right to build product data enrichment in-house?
Yes — when enrichment logic is genuinely proprietary to your business, your volume justifies a standing team of ML/data engineers plus reviewers, you have engineers to spare from core product work, and you can wait 6–12 months for production-grade results. For most teams, though, the gathering and normalization layers are undifferentiated heavy lifting that a vendor already runs at scale, and rebuilding them is hard to justify.
What's the most underestimated cost of building?
Data acquisition and the long tail. Pulling true specs from supplier PDFs and websites means OCR, document parsing, and per-source maintenance — each supplier format is an ongoing liability. And the last 20% of SKUs (ambiguous categories, conflicting sources, rare attributes) consumes roughly 80% of the timeline. The demo works in week three; "done" never quite arrives. Most estimates also forget evaluation infrastructure and ongoing maintenance.
Doesn't buying a PIM solve this?
No. A PIM stores a clean record; it doesn't produce one. It won't gather a missing spec, normalize twelve suppliers into one taxonomy, write a description, or score completeness. "Buy a PIM and enrich inside it" quietly becomes a build project, because the fields still have to be filled by a person or a pipeline you build. Storage and enrichment are separate decisions.
How do I evaluate an enrichment vendor before committing?
Run a paid pilot on a representative, messy slice of your real catalog — not a clean sample. Demand accuracy and completeness numbers on your SKUs. Confirm it maps to your taxonomy and each channel's, that it writes results back into your source of truth (not just a feed), how human-in-the-loop QA surfaces low-confidence items, and that you own and can export the enriched output.
What is the hybrid approach, and why does it usually win?
Buy the undifferentiated engine (document parsing, web gathering, normalization, first-pass attribute fill, categorization, scoring) and own the judgment (your taxonomy, your definition of complete, your business rules, your QA thresholds, and review of high-liability attributes). You get throughput without handing over the parts that are actually your competitive edge, and you don't pay engineers to rebuild generic plumbing.
Why does it matter where the enriched data lands?
Because enrichment is a continuous loop, not a one-time project, and AI shopping agents read surfaces you don't hand-tune — your PDP, marketplace listings, distributor copies of your SKU. If enrichment only lives in the feeds you tuned, you're complete on a few surfaces and thin everywhere else. Writing enriched data back upstream into your source of truth means every channel, marketplace, and assistant inherits one complete record instead of you re-fixing the same SKU at each exit.