Technical Overview of Building a Scalable Machine Learning Solution
Learn how to design and implement a production-grade ML system that reduced customer onboarding time from weeks to days using embedding-based similarity search and automated retraining.

In my previous article, I discussed the business background of this project. In this article, I will take a technical deep dive into how we built a scalable machine learning solution that transformed a manual process into an automated system.
Business Background
A major technology company processing over $1 billion in transactions required an efficient ML-based solution to normalize and match customer-uploaded product catalogs against a master catalog of 80,000 product names. The existing manual mapping process was slow, error-prone, and took several weeks, significantly delaying customer onboarding.
The objective was to reduce onboarding time to three days by implementing a near real-time, scalable ML pipeline leveraging embedding-based similarity search, probabilistic matching, and automated retraining.
Problem Definition & Data Challenges
High Variability in Product Naming Conventions
- Typographical errors (e.g., "Henessey" vs. "Hennessy")
- Vendor abbreviations (e.g., "JW Blue" vs. "Johnnie Walker Blue Label")
- Encoding inconsistencies (e.g., non-standard Unicode characters, special characters)
- Structural differences (e.g., "Corona - 12oz Bottle" vs. "Corona Extra 12oz")
Scalability and Performance Requirements
- Matching must occur in near real-time as new product records are ingested
- Processing millions of product variations efficiently without memory overhead
- Continuous model improvement with full retraining cycles to prevent performance degradation over time
ML Model Architecture & Design
Embedding-Based Product Normalization
Embedding-based product normalization uses a deep learning model to generate vector representations of product names that capture their contextual meaning. Instead of relying on exact text matching, the system compares embeddings to identify variations, abbreviations, and synonyms, ensuring that different representations of the same product are matched correctly.
Tokenization Process (Splitting Text into Tokens)
- Used BERT-based WordPiece tokenization to split product names into subword units
- Example transformations:
  - "Fireball Cinnamon Whisky" → ["fire", "##ball", "cinnamon", "whisky"]
  - "Firebal Cinn. Whsky" → ["fire", "##bal", "cinn", "whsky"]
- Ensures robust similarity detection even with typos and partial matches (see the tokenization sketch below)
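As an illustrative sketch (not the production code), the snippet below applies a BERT WordPiece tokenizer from the Hugging Face `transformers` library to the example names; the library choice and model name are assumptions, and the exact subword splits depend on the pretrained vocabulary.

```python
# Minimal sketch: WordPiece tokenization of product names with a BERT tokenizer.
# The library (Hugging Face transformers) and model name are assumptions;
# actual subword splits vary with the tokenizer vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for name in ["Fireball Cinnamon Whisky", "Firebal Cinn. Whsky"]:
    print(name, "->", tokenizer.tokenize(name))
```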
Embedding Strategy (Converting Tokens into Vectors)
AWS SageMaker Object2Vec was selected for:
- Generating high-dimensional vector representations of product names
- Capturing contextual relationships beyond simple string matching
- Handling misspellings and abbreviation mismatches more robustly than exact string comparison (see the input-format sketch below)
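To make the pairwise training setup concrete, here is a hedged sketch of the JSON Lines record format that SageMaker Object2Vec consumes: each record carries a label and two integer token-id sequences (`in0`, `in1`). The helper function and toy vocabulary below are illustrative, not the production vocabulary.

```python
# Sketch of one Object2Vec training record: a label plus two token-id sequences.
# The helper name and toy vocabulary are illustrative assumptions.
import json

def to_object2vec_record(tokens_a, tokens_b, label, vocab):
    """Encode a (product name, product name, label) pair as an Object2Vec JSON line."""
    return json.dumps({
        "label": label,                        # 1 = same product, 0 = different products
        "in0": [vocab[t] for t in tokens_a],   # token ids for the first name
        "in1": [vocab[t] for t in tokens_b],   # token ids for the second name
    })

vocab = {"fire": 1, "##ball": 2, "cinnamon": 3, "whisky": 4, "##bal": 5, "cinn": 6, "whsky": 7}
print(to_object2vec_record(["fire", "##ball", "cinnamon", "whisky"],
                           ["fire", "##bal", "cinn", "whsky"], 1, vocab))
```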
Training Data Engineering & Model Optimization
Dataset Construction
- Positive Pairs: Extracted from historical mappings (15K product names)
- Negative Sampling:
  - 445K negative samples generated by pairing products known to be distinct
  - Ensured a 3:1 negative-to-positive sample ratio for robust learning (a sampling sketch follows this list)
- Final Dataset Size: 655K labeled training pairs
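Below is a minimal sketch of one way to generate such negative pairs by sampling distinct master-catalog products for each known mapping; the function name, ratio parameter, and sampling strategy are illustrative rather than the exact production sampler.

```python
# Sketch: build negative training pairs by pairing a vendor name with
# master-catalog products that are known NOT to be its true match.
# Helper name, ratio parameter, and sampling strategy are illustrative.
import random

def build_negative_pairs(positive_pairs, master_catalog, ratio=3, seed=42):
    """Return (vendor_name, wrong_master_name, 0) tuples, `ratio` per positive pair."""
    rng = random.Random(seed)
    negatives = []
    for vendor_name, true_master_name in positive_pairs:
        candidates = [p for p in master_catalog if p != true_master_name]
        for wrong_master in rng.sample(candidates, ratio):
            negatives.append((vendor_name, wrong_master, 0))  # 0 = not a match
    return negatives
```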
Computing Similarity Between Product Embeddings
How Cosine Similarity Works:
Cosine similarity measures the cosine of the angle between two product embeddings, producing a score that is higher when the vectors point in similar directions (a short computation sketch follows the examples below).
- Closer to 1 (cosine ≈ 1.0) → Products are very similar
- Closer to 0 (cosine ≈ 0.0) → Products are different
- Examples:
  - "Coca-Cola 12 Pack" vs. "Pepsi 12pk" → 0.34 ❌ (Different products)
  - "iPhone 14 Pro 256GB" vs. "iPhone 14 128GB" → 0.87 ✅ (Same model, different variant)
  - "Johnnie Walker Black Label" vs. "JW Black" → 0.96 ✅ (High similarity)
Matching Decision Thresholds
| Cosine Similarity | Action Taken |
|---|---|
| ≥ 0.85 | Auto-Approved |
| 0.65 - 0.84 | Flagged for Manual Review |
| < 0.65 | No Match Found |
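The table translates directly into a small routing function. The sketch below uses the thresholds above; the function and label names are illustrative.

```python
# Sketch: route a candidate match based on its cosine similarity score.
# Thresholds come from the table above; names are illustrative.
def route_match(score: float) -> str:
    if score >= 0.85:
        return "auto_approved"    # >= 0.85: accept the match automatically
    if score >= 0.65:
        return "manual_review"    # 0.65-0.84: send to a human reviewer
    return "no_match"             # < 0.65: treat as no match found
```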
Embedding Storage
To support efficient, near real-time retrieval of similar products, the embeddings are stored in an S3 bucket.
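A hedged sketch of that storage step with boto3 is shown below; the bucket and key names are placeholders, and the serialization format (a NumPy array) is an assumption.

```python
# Sketch: persist and reload the embedding matrix in S3 with boto3.
# Bucket/key names and the NumPy serialization format are assumptions.
import io
import boto3
import numpy as np

s3 = boto3.client("s3")

def save_embeddings(embeddings: np.ndarray, bucket: str, key: str) -> None:
    """Serialize the embedding matrix and upload it to S3."""
    buf = io.BytesIO()
    np.save(buf, embeddings)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

def load_embeddings(bucket: str, key: str) -> np.ndarray:
    """Download and deserialize the embedding matrix from S3."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return np.load(io.BytesIO(body))
```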
Baseline Model Performance (Pre-Retraining)
| Metric | Initial Accuracy |
|---|---|
| Top-1 Match Accuracy | 57.3% |
| Top-5 Match Accuracy | 75.4% |
Near Real-Time Inference
- AWS SageMaker Endpoint serves the trained embedding model
- AWS Lambda triggers inference upon new product data ingestion
- Matching occurs in near real-time, typically under three seconds, using a pre-loaded embedding model (a handler sketch follows this list)
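Here is a hedged sketch of the Lambda-side call: on ingestion of a new product record, the handler invokes the SageMaker endpoint and returns the model's response. The endpoint name, event fields, and payload schema are assumptions.

```python
# Sketch of a Lambda handler that sends a newly ingested product name to the
# SageMaker endpoint for embedding/matching. Endpoint name, event fields, and
# the request/response payloads are placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    product_name = event["product_name"]
    response = runtime.invoke_endpoint(
        EndpointName="product-embedding-endpoint",   # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": [product_name]}),
    )
    return json.loads(response["Body"].read())
```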
Feedback-Driven Incremental Learning & Full Retraining Cycles
Feedback Loop Integration
- Manual corrections captured via API and stored in a feedback dataset (a record-shape sketch follows this list)
- Incremental dataset augmentation to continuously refine embeddings
- Full model retraining scheduled every four weeks, with incremental fine-tuning updates
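For illustration, a corrected match might be captured as a record like the one sketched below; the field names are assumptions rather than the actual API schema.

```python
# Sketch of the shape a feedback record might take when a reviewer corrects a
# match; field names are illustrative, not the real API schema.
from datetime import datetime, timezone

def build_feedback_record(vendor_name: str, predicted_match: str, corrected_match: str) -> dict:
    return {
        "vendor_name": vendor_name,
        "predicted_match": predicted_match,
        "corrected_match": corrected_match,   # becomes a positive pair in the next retraining cycle
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
    }
```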
Automated Retraining Pipeline (AWS SageMaker Pipelines)
- Aggregate new feedback data from the database
- Merge it with the existing training data and generate updated embeddings
- Retrain the model from scratch on SageMaker GPU instances
- Deploy the updated model only if accuracy improves (see the deployment-gate sketch below)
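The last step is effectively a gate. A minimal sketch of that logic is below; the evaluation and deployment helpers are placeholders for the actual pipeline steps.

```python
# Sketch of the "deploy only if accuracy improves" gate at the end of the
# retraining pipeline; `evaluate` and `deploy` stand in for the real steps.
def maybe_deploy(new_model, live_accuracy: float, evaluate, deploy) -> bool:
    """Deploy the retrained model only when it beats the live model on holdout data."""
    new_accuracy = evaluate(new_model)   # e.g. Top-1 match accuracy on a holdout set
    if new_accuracy > live_accuracy:
        deploy(new_model)
        return True
    return False
```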
Post-Retraining Accuracy Improvement
| Metric | Initial Accuracy | Post-Retraining Accuracy |
|---|---|---|
| Top-1 Match Accuracy | 57.3% | 68.0% |
| Top-5 Match Accuracy | 75.4% | 86.0% |
MLOps & Production Model Deployment
Infrastructure as Code (IaC)
- CloudFormation for reproducible AWS deployment
- GitHub Actions for CI/CD integration
- Model versioning & rollback mechanisms for safe deployment
Automated Retraining Triggers
- Scheduled Retraining: Full model retraining occurs at regular intervals (every four weeks, as noted above)
- Performance Monitoring: Auto-retraining is triggered if match accuracy drops by 5% or more (a trigger sketch follows this list)
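A minimal sketch of that monitoring check is shown below; it interprets the 5% threshold as percentage points, and the retraining hook is a placeholder for the actual pipeline start call.

```python
# Sketch: trigger retraining when measured accuracy falls at least five
# percentage points below the baseline. `start_retraining` is a placeholder
# for starting the SageMaker retraining pipeline.
def check_and_trigger(baseline_accuracy: float, current_accuracy: float, start_retraining) -> bool:
    if baseline_accuracy - current_accuracy >= 0.05:
        start_retraining()
        return True
    return False
```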
Final Results & Business Impact
✅ Reduced onboarding time from weeks to three days
✅ Eliminated most manual product mapping efforts
✅ Scalable system handling near real-time product ingestion
✅ Continuous full retraining cycles ensure long-term accuracy improvements
Key Takeaways
This ML-driven product normalization and matching pipeline transformed a manual, time-consuming process into a fully automated, scalable solution. By leveraging AWS SageMaker, deep learning embeddings, and scheduled full retraining cycles, the system continuously improves, ensuring rapid and accurate product mapping.