Catalog Intelligence Platform

Background

This case study describes a catalog intelligence capability drawn from TrueLeaf Tech's engagement with a major B2B marketplace operator in India. The client connects millions of buyers and suppliers across an enormous range of product categories, and the integrity of its catalog is, in a real sense, the integrity of its entire business.

This case study draws on a confidential client engagement. The B2B marketplace operator involved works at a scale that makes catalog management one of the most consequential engineering problems in the business. Specific implementation details have been generalised to honour the engagement's confidentiality.

The problem we were solving

B2B marketplaces face a particular flavour of catalog challenge. Supplier-provided listings are inconsistent in quality. Product naming conventions vary widely. Categorisation is imprecise. Duplicates proliferate. Search relevance suffers, buyer experience degrades, and the marketplace's ability to make accurate matches between buyers and suppliers — the central function of the business — erodes over time.

The traditional approach to this problem is heavy on manual review and rule-based cleanup. This works at small scale and breaks down completely at the catalog sizes typical of large B2B marketplaces. The brief was to build an AI-augmented catalog intelligence layer that could classify, normalise, deduplicate, and enrich listings at the rate they were being added, with quality high enough that downstream search and matching genuinely improved.

What we built

The classification and normalisation layer

Every new listing enters the system as raw, supplier-provided text and attribute data. The first stage of the pipeline classifies it into the marketplace's taxonomy — often several levels deep — and normalises its attributes against a canonical schema for that category. This is harder than it sounds because the canonical schema for "industrial pumps" looks nothing like the canonical schema for "office furniture," and the classifier needs to handle both correctly.

We built the classification layer as a hybrid system: a fast rule-based pre-pass that handles the easy cases, an embedding-based similarity pass that handles the medium cases, and an LLM-based deep classifier that handles the genuinely ambiguous cases. Each pass is faster and cheaper than the next, so the system applies them in order and only escalates when necessary. The disciplines behind this approach are described in more detail in our writing on retrieval pipelines.
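
The escalation logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production implementation: the tier functions, the thresholds, and the category names below are all hypothetical stand-ins, and a real embedding or LLM pass would call out to a model rather than match keywords.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class Classification:
    category: str
    confidence: float
    tier: str  # name of the pass that produced the answer

# A tier returns (category, confidence), or None when it abstains.
Tier = Callable[[str], Optional[Tuple[str, float]]]

def classify(text: str, tiers: Sequence[Tuple[str, Tier, float]]) -> Classification:
    """Run tiers cheapest-first and accept the first answer that clears its threshold."""
    for name, tier, min_confidence in tiers:
        result = tier(text)
        if result is not None and result[1] >= min_confidence:
            return Classification(result[0], result[1], name)
    return Classification("unclassified", 0.0, "none")

# Hypothetical stand-ins for the three passes.
def rule_pass(text: str) -> Optional[Tuple[str, float]]:
    if "centrifugal pump" in text.lower():
        return ("industrial-pumps", 0.99)
    return None

def embedding_pass(text: str) -> Optional[Tuple[str, float]]:
    # A real pass would compare embeddings against labelled examples.
    if "chair" in text.lower():
        return ("office-furniture", 0.80)
    return None

def llm_pass(text: str) -> Optional[Tuple[str, float]]:
    # A real pass would prompt a model; here it always gives a weak answer.
    return ("needs-review", 0.50)

TIERS = [
    ("rules", rule_pass, 0.95),
    ("embeddings", embedding_pass, 0.75),
    ("llm", llm_pass, 0.0),
]
```

Calling `classify("SS Centrifugal Pump 2HP", TIERS)` is answered by the rule pass and never touches the more expensive tiers, which is the point of ordering them by cost.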

The deduplication layer

B2B marketplaces accumulate duplicates faster than almost any other type of catalog. The same product is listed by multiple suppliers, often with slightly different names, slightly different specifications, and meaningfully different prices. The deduplication challenge is to recognise that these listings are the same product without collapsing distinct products that happen to share many attributes.

The architecture uses a multi-signal matching approach: text similarity, attribute similarity, image similarity, and seller behaviour patterns all contribute to a confidence score. High-confidence matches are auto-deduplicated. Low-confidence candidates flow into a human review queue. The model is calibrated continuously against the outcomes of human review, so the system gets sharper over time.
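
One way to picture the scoring and routing is the sketch below. It assumes each signal has already been reduced to a similarity in the range 0 to 1 and combines them with a weighted sum. The weights, the two thresholds, and the third "keep separate" outcome for very weak candidates are illustrative assumptions; the engagement's actual scoring model is not described here.

```python
from dataclasses import dataclass

@dataclass
class PairSignals:
    """Similarity signals for one candidate pair, each assumed to lie in [0, 1]."""
    text: float
    attributes: float
    image: float
    seller: float

# Hypothetical weights; in practice these would be learned or tuned.
WEIGHTS = {"text": 0.35, "attributes": 0.35, "image": 0.20, "seller": 0.10}

def match_confidence(s: PairSignals) -> float:
    """Combine the four signals into a single confidence score."""
    return (
        WEIGHTS["text"] * s.text
        + WEIGHTS["attributes"] * s.attributes
        + WEIGHTS["image"] * s.image
        + WEIGHTS["seller"] * s.seller
    )

def route(confidence: float, auto_merge_at: float = 0.90, review_at: float = 0.60) -> str:
    """Decide what happens to a candidate pair based on its confidence."""
    if confidence >= auto_merge_at:
        return "auto_merge"
    if confidence >= review_at:
        return "human_review"
    return "keep_separate"
```

A pair scoring `PairSignals(0.8, 0.7, 0.6, 0.5)` lands at roughly 0.695 and goes to human review rather than being merged automatically.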

The enrichment layer

Beyond classification and deduplication, the system enriches listings with structured data that the supplier did not originally provide. Product specifications are inferred from descriptions. Use-case keywords are derived from category context. Compatible-with relationships are predicted from product taxonomy. Each enrichment is tagged with its confidence and source, so downstream systems can decide how much weight to give it.
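
A tagged enrichment might be represented along these lines. The field names, source labels, and the merge policy (supplier-provided values always win, inferred values are included only above a caller-chosen confidence) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass(frozen=True)
class Enrichment:
    field: str
    value: Any
    confidence: float  # assumed to lie in [0, 1]
    source: str        # hypothetical labels, e.g. "description_inference"

def trusted_attributes(
    supplier_attrs: Dict[str, Any],
    enrichments: List[Enrichment],
    min_confidence: float,
) -> Dict[str, Any]:
    """Merge inferred values into supplier data without overwriting it."""
    merged = dict(supplier_attrs)
    # Ascending sort so a higher-confidence value for the same field wins.
    for e in sorted(enrichments, key=lambda e: e.confidence):
        if e.confidence >= min_confidence and e.field not in supplier_attrs:
            merged[e.field] = e.value
    return merged
```

Because the threshold is a parameter, a search index and a pricing system can read the same enrichments and apply different levels of trust.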

The catalog is not what suppliers send you. It is what your platform extracts from what suppliers send you. The extraction is the product.

Engineering trade-offs we made

Precision versus recall in deduplication

The deduplication system has to balance two opposing failure modes. If it is too aggressive, it merges distinct products and degrades search results for both. If it is too conservative, it leaves obvious duplicates in the catalog. We tuned for precision, accepting more missed duplicates in exchange for fewer false merges, because the cost of an incorrect merge is much higher than the cost of a missed duplicate, and missed duplicates can be caught on the next pass.
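
The trade-off is easy to see on a small labelled sample. The sketch below computes precision and recall for an auto-merge threshold; the scores and labels are made-up illustrative data, not results from the engagement.

```python
from typing import Iterable, Tuple

def precision_recall(
    scored_pairs: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of auto-merging every pair scored at or above threshold.

    Each item is (match score, whether the pair is truly a duplicate).
    """
    pairs = list(scored_pairs)
    tp = sum(1 for score, dup in pairs if score >= threshold and dup)
    fp = sum(1 for score, dup in pairs if score >= threshold and not dup)
    fn = sum(1 for score, dup in pairs if score < threshold and dup)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up sample: four true duplicates, two distinct products.
SAMPLE = [(0.95, True), (0.90, True), (0.85, False), (0.80, True), (0.60, True), (0.40, False)]

print(precision_recall(SAMPLE, 0.70))  # (0.75, 0.75)
print(precision_recall(SAMPLE, 0.88))  # (1.0, 0.5)
```

Raising the threshold from 0.70 to 0.88 removes the one false merge in this sample but leaves two true duplicates unmerged, which is the direction the tuning described above takes.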

Real-time enrichment versus batch enrichment

An obvious design question was whether enrichment should happen at listing creation time or in periodic batches. Real-time enrichment keeps data fresher but adds variable latency to the listing creation flow. Batch enrichment is operationally simpler but creates a window during which listings are visibly under-enriched. We chose a hybrid: a fast first-pass enrichment at creation time, followed by a more comprehensive batch enrichment within a few hours.
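
In outline, the hybrid looks something like the sketch below: a cheap pass runs inline when the listing is created, and the listing is queued for a fuller pass later. The in-memory store and queue, the function names, and the toy enrichment logic are assumptions for illustration; a production system would use a durable queue and a scheduler.

```python
from collections import deque
from typing import Callable, Deque, Dict

Listing = Dict[str, object]

def fast_enrich(listing: Listing) -> Listing:
    """Cheap creation-time pass: here, just keyword tokens from the title."""
    keywords = str(listing["title"]).lower().split()
    return {**listing, "keywords": keywords, "enrichment_level": "fast"}

def deep_enrich(listing: Listing) -> Listing:
    """Stand-in for the slower, more comprehensive batch pass."""
    return {**listing, "enrichment_level": "full"}

def on_listing_created(
    listing: Listing, store: Dict[str, Listing], pending: Deque[str]
) -> Listing:
    """Enrich quickly, persist, and schedule the listing for the batch pass."""
    enriched = fast_enrich(listing)
    listing_id = str(enriched["id"])
    store[listing_id] = enriched
    pending.append(listing_id)
    return enriched

def run_batch(
    store: Dict[str, Listing],
    pending: Deque[str],
    enrich: Callable[[Listing], Listing] = deep_enrich,
) -> int:
    """Drain the queue, applying the full enrichment; returns the number processed."""
    processed = 0
    while pending:
        listing_id = pending.popleft()
        store[listing_id] = enrich(store[listing_id])
        processed += 1
    return processed
```

The `enrichment_level` marker is one way to make the under-enriched window explicit to downstream consumers until the batch pass has run.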

Single model versus ensemble

For the highest-stakes classifications, we use an ensemble of models rather than a single one, with explicit handling for disagreement cases. The ensemble approach is more expensive and more complex, but it produces more reliable outputs and — crucially — clearer signals about when the system is uncertain. The uncertainty signal is what makes the human review queue useful.
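
A simple form of disagreement handling is majority voting with an agreement floor, sketched below. The voting rule and the 0.6 floor are illustrative assumptions; the ensemble used in the engagement may combine model outputs differently.

```python
from collections import Counter
from typing import Dict, List, Optional, Union

def ensemble_vote(
    predictions: List[str], min_agreement: float = 0.6
) -> Dict[str, Union[Optional[str], float, bool]]:
    """Majority vote over model predictions, flagging weak agreement for review."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    confident = agreement >= min_agreement
    return {
        "label": label if confident else None,
        "agreement": agreement,
        "needs_review": not confident,
    }
```

With three models, two agreeing votes yield a label with an agreement of about 0.67, while three different answers yield no label and a review flag. The agreement value is the uncertainty signal that the review queue consumes.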

What we learned

Catalog quality is a flywheel. Better classification produces better search results, which drives more meaningful supplier behaviour, which produces better source data, which makes classification easier. The same flywheel runs in reverse for low-quality catalogs. Investing in the quality layer pays compounding returns.
Human review is the calibration loop. The deduplication and classification models would degrade over time without the human review queue, not because the models drift on their own but because the underlying catalog drifts: new product categories, new supplier behaviours, new edge cases. The human review queue is how the system stays calibrated.
Confidence is itself a product. The most useful feature of the system, from a downstream consumer perspective, is not the classification itself but the confidence rating attached to it. Downstream systems behave differently based on whether they trust the catalog data fully, partially, or not at all. Exposing that confidence rating was, in retrospect, the highest-leverage design decision.
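
As a small illustration of the last point, a downstream search service might map the confidence rating to a trust band and treat an attribute differently in each band. The band boundaries and the facet weights below are hypothetical.

```python
def trust_band(confidence: float, full_at: float = 0.9, partial_at: float = 0.6) -> str:
    """Map a confidence rating to one of three trust bands."""
    if confidence >= full_at:
        return "full"
    if confidence >= partial_at:
        return "partial"
    return "none"

# Hypothetical policy: hard filter, soft ranking boost, or ignore.
FACET_WEIGHT = {"full": 1.0, "partial": 0.5, "none": 0.0}

def facet_weight(confidence: float) -> float:
    """Weight a search service gives an attribute of the stated confidence."""
    return FACET_WEIGHT[trust_band(confidence)]
```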

How this informs our client work

The patterns described here — tiered classification, multi-signal matching, calibrated enrichment, and confidence-rated outputs — apply to any large-scale data quality problem. We have used variations on these patterns for retail catalog management, supplier data integration, and similar workloads.

If you are building or operating a system where data quality at scale is the constraint, get in touch. The engineering patterns travel well across industries, and the underlying disciplines — particularly the discipline of treating confidence as a first-class output — are widely applicable.

Work with us

Have a similar challenge in front of you?

If something in this case study resonates with what you're trying to build — or if you'd like to talk through a related problem — we'd be glad to spend a half-hour helping you think it through.

Frequently asked questions

How accurate is the classification layer?

Classification accuracy varies by category complexity but is consistently high in production. The hybrid approach — rules for easy cases, embeddings for medium cases, LLMs for ambiguous cases — produces accuracy that exceeds single-model approaches at a substantially lower cost per listing. Specific accuracy numbers depend on the catalog and the calibration of the human review loop.

Can the deduplication system be tuned for different precision-recall trade-offs?

Yes. The matching system produces a confidence score per candidate pair, and the threshold for auto-merge can be tuned per category to match the business's risk tolerance. Categories where false merges are particularly costly can be tuned more conservatively; categories where missed duplicates are particularly visible can be tuned more aggressively.
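
Per-category tuning can be as simple as a lookup with a default, as in this sketch. The category names and threshold values are made up for illustration.

```python
from typing import Dict

DEFAULT_AUTO_MERGE_THRESHOLD = 0.90

# Hypothetical overrides: stricter where false merges are costly,
# looser where missed duplicates are highly visible.
CATEGORY_THRESHOLDS: Dict[str, float] = {
    "industrial-pumps": 0.97,
    "office-stationery": 0.85,
}

def should_auto_merge(
    category: str,
    confidence: float,
    thresholds: Dict[str, float] = CATEGORY_THRESHOLDS,
    default: float = DEFAULT_AUTO_MERGE_THRESHOLD,
) -> bool:
    """True when the pair's confidence clears the category's auto-merge threshold."""
    return confidence >= thresholds.get(category, default)
```

The same pair confidence of 0.92 would be auto-merged under the default and under the looser stationery setting, but held back for pumps.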

How does the system handle catalog drift over time?

The human review queue is the primary mechanism for handling drift. New product types, new supplier behaviours, and new edge cases surface through low-confidence cases that go to human review. The outcomes of human review feed back into the models, keeping the system calibrated as the underlying catalog evolves.
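
One simple way review outcomes could feed back is by re-deriving the auto-merge threshold from recent reviewer decisions, as sketched below. This is an assumption about the form the feedback might take, not a description of the engagement's calibration method, and the sample data is invented.

```python
from typing import List, Optional, Tuple

def recalibrate_threshold(
    reviewed: List[Tuple[float, bool]], target_precision: float = 0.98
) -> Optional[float]:
    """Lowest score cut-off whose precision on reviewed pairs meets the target.

    Each item is (match score, whether the reviewer confirmed a duplicate).
    Returns None when no cut-off reaches the target.
    """
    for cut in sorted({score for score, _ in reviewed}):
        accepted = [dup for score, dup in reviewed if score >= cut]
        if sum(accepted) / len(accepted) >= target_precision:
            return cut
    return None

# Invented review outcomes.
REVIEWED = [(0.95, True), (0.90, True), (0.85, False), (0.80, True), (0.60, True)]

print(recalibrate_threshold(REVIEWED))        # 0.9
print(recalibrate_threshold(REVIEWED, 0.75))  # 0.6
```

Precision is not guaranteed to rise steadily with the cut-off on small samples, so a real calibration step would want more data and some smoothing before moving a production threshold.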

Can this approach be applied to non-marketplace catalogs?

Yes. The underlying patterns apply to any large-scale catalog problem — retail product management, supplier data integration, content classification, document categorisation. The specific tuning differs, but the architecture and the engineering disciplines transfer well across catalog types.

Related work

Avluz Price Intelligence

SellerBlaze Marketplace Insights

What we look for in retrieval pipelines

Have an ambitious idea? We'd love to hear it.