Machine Learning · AWS · Cloud Architecture

Technical Overview of Building a Scalable Machine Learning Solution

Learn how to design and implement a production-grade ML system that reduced customer onboarding time from weeks to days using embedding-based similarity search and automated retraining.

Linjing (LJ) Wen
February 27, 2025
5 min read

In my previous article, I discussed the business background of this project. In this article, I will take a technical deep dive into how we built a scalable machine learning solution that transformed a manual process into an automated system.

Business Background

A major technology company processing over $1 billion in transactions required an efficient ML-based solution to normalize and match customer-uploaded product catalogs against a master catalog of 80,000 product names. The existing manual mapping process was slow, error-prone, and took several weeks, significantly delaying customer onboarding.

The objective was to reduce onboarding time to three days by implementing a near real-time, scalable ML pipeline leveraging embedding-based similarity search, probabilistic matching, and automated retraining.

Problem Definition & Data Challenges

High Variability in Product Naming Conventions

  • Typographical errors (e.g., "Henessey" vs. "Hennessy")
  • Vendor abbreviations (e.g., "JW Blue" vs. "Johnnie Walker Blue Label")
  • Encoding inconsistencies (e.g., non-standard Unicode characters, special characters)
  • Structural differences (e.g., "Corona - 12oz Bottle" vs. "Corona Extra 12oz")

Scalability and Performance Requirements

  • Matching must occur in near real-time as new product records are ingested
  • Processing millions of product variations efficiently without memory overhead
  • Continuous model improvement with full retraining cycles to prevent performance degradation over time

ML Model Architecture & Design

Embedding-Based Product Normalization

Embedding-based product normalization uses a neural language model to generate vector representations of product names, capturing their contextual meaning. Instead of relying on exact text matching, the system compares embeddings to identify variations, abbreviations, and synonyms, so that different representations of the same product are matched correctly.

Tokenization Process (Splitting Text into Tokens)

  • Used BERT-based WordPiece tokenization to split product names into subword units
  • Example transformation:
    • "Fireball Cinnamon Whisky" → ["fire", "##ball", "cinnamon", "whisky"]
    • "Firebal Cinn. Whsky" → ["fire", "##bal", "cinn", "whsky"]
  • Ensures robust similarity detection even with typos and partial matches
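
As a quick illustration, the same kind of split can be reproduced with the Hugging Face transformers library. This is a stand-in for whichever WordPiece implementation a pipeline actually uses, and the exact subwords depend on the tokenizer's learned vocabulary, so real output may differ slightly from the examples above:

```python
# Sketch: WordPiece tokenization of product names with a BERT tokenizer.
# Exact subword splits depend on the tokenizer's vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for name in ["Fireball Cinnamon Whisky", "Firebal Cinn. Whsky"]:
    print(name, "->", tokenizer.tokenize(name))
```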

Embedding Strategy (Converting Tokens into Vectors)

AWS SageMaker Object2Vec was selected for:

  • Generating high-dimensional vector representations of product names
  • Capturing contextual relationships beyond simple string matching
  • Handling misspellings and abbreviation mismatches more robustly than exact string matching
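
Below is a minimal sketch of how an Object2Vec training job can be configured with the SageMaker Python SDK. The instance type, hyperparameter values, IAM role, and S3 path are illustrative assumptions, not the production configuration:

```python
# Sketch: configuring SageMaker's built-in Object2Vec algorithm for
# pairwise product-name similarity. Values below are illustrative.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = sagemaker.image_uris.retrieve("object2vec", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    enc0_network="bilstm",    # encoder for the first name in each pair
    enc1_network="bilstm",    # encoder for the second name
    enc0_max_seq_len=50,
    enc1_max_seq_len=50,
    enc0_vocab_size=30000,
    enc1_vocab_size=30000,
    enc_dim=512,              # dimensionality of the output embeddings
    output_layer="softmax",   # binary match / no-match objective
    num_classes=2,
    epochs=20,
    mini_batch_size=256,
)

# Object2Vec expects JSON Lines records of the form
# {"label": 0 or 1, "in0": [token ids], "in1": [token ids]}.
estimator.fit({"train": "s3://example-bucket/train/pairs.jsonl"})
```

The paired-encoder setup is what lets the model learn that two differently written names refer to the same product: both names pass through an encoder, and the softmax output layer scores whether the pair is a match.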

Training Data Engineering & Model Optimization

Dataset Construction

  • Positive Pairs: Extracted from historical mappings covering 15K product names
  • Negative Sampling:
    • 445K negative samples generated by pairing products that are known to be distinct
    • Ensured 3:1 negative-to-positive sample ratio for robust learning
  • Final Dataset Size: 655K labeled training pairs
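
A simple version of this pairing logic might look like the following; the function name, record fields, and default sampling ratio are illustrative:

```python
# Sketch: building labeled pairs with negative sampling. Positives come
# from historical vendor-to-master mappings; negatives pair a vendor
# name with a randomly chosen *different* master product.
import random

def build_training_pairs(mappings, master_names, neg_per_pos=3, seed=42):
    """mappings: list of historical (vendor_name, master_name) matches."""
    rng = random.Random(seed)
    pairs = []
    for vendor, master in mappings:
        pairs.append({"in0": vendor, "in1": master, "label": 1})
        # Negative sampling: products known to be distinct from the true match.
        for _ in range(neg_per_pos):
            negative = rng.choice(master_names)
            while negative == master:
                negative = rng.choice(master_names)
            pairs.append({"in0": vendor, "in1": negative, "label": 0})
    return pairs
```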

Computing Similarity Between Product Embeddings

How Cosine Similarity Works:

Cosine similarity measures the cosine of the angle between two vectors (here, product-name embeddings).

  • Closer to 1 (cosine ≈ 1.0) → Products are very similar
  • Closer to 0 (cosine ≈ 0.0) → Products are different
  • Examples:
    • "Coca-Cola 12 Pack" vs. "Pepsi 12pk" → 0.34 ❌ (Different products)
    • "iPhone 14 Pro 256GB" vs. "iPhone 14 128GB" → 0.87 ✅ (Same model, different variant)
    • "Johnnie Walker Black Label" vs. "JW Black" → 0.96 ✅ (High similarity)

Matching Decision Thresholds

| Cosine Similarity | Action Taken |
| --- | --- |
| ≥ 0.85 | Auto-Approved |
| 0.65 – 0.84 | Flagged for Manual Review |
| < 0.65 | No Match Found |
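
These thresholds translate directly into a small routing function, sketched here for illustration (the return labels are placeholders):

```python
def route_match(score: float) -> str:
    """Map a cosine-similarity score to the action in the table above."""
    if score >= 0.85:
        return "auto-approved"
    if score >= 0.65:
        return "flagged-for-manual-review"
    return "no-match-found"

print(route_match(0.96))  # e.g. "Johnnie Walker Black Label" vs. "JW Black"
```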

Embedding Storage

To support near real-time retrieval of similar products, the generated embeddings are stored in an S3 bucket.
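
One plausible way to persist them, assuming a compressed NumPy archive and a hypothetical bucket name:

```python
# Sketch: writing the embedding matrix and its row-aligned IDs to S3.
import io

import boto3
import numpy as np

def save_embeddings(embeddings: np.ndarray, product_ids: list) -> None:
    """Serialize embeddings to a compressed archive and upload to S3."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, embeddings=embeddings, ids=np.array(product_ids))
    boto3.client("s3").put_object(
        Bucket="example-catalog-embeddings",  # hypothetical bucket
        Key="master/embeddings.npz",
        Body=buffer.getvalue(),
    )
```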

Baseline Model Performance (Pre-Retraining)

| Metric | Initial Accuracy |
| --- | --- |
| Top-1 Match Accuracy | 57.3% |
| Top-5 Match Accuracy | 75.4% |

Near Real-Time Inference

  • AWS SageMaker Endpoint serves the trained embedding model
  • AWS Lambda triggers inference upon new product data ingestion
  • Matching occurs in near real-time (typically under three seconds) using a pre-loaded embedding model
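
A hedged sketch of such a Lambda handler; the endpoint name, event shape, and payload format are assumptions:

```python
# Sketch: ingestion-triggered Lambda that calls a SageMaker endpoint.
import json

import boto3

# Created once per container so warm invocations reuse the connection.
runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    """Triggered on product ingestion; returns the model's match candidates."""
    response = runtime.invoke_endpoint(
        EndpointName="product-matching-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": [event["product_name"]]}),
    )
    return json.loads(response["Body"].read())
```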

Feedback-Driven Incremental Learning & Full Retraining Cycles

Feedback Loop Integration

  • Manual corrections captured via an API and stored in a feedback dataset
  • Incremental dataset augmentation to continuously refine embeddings
  • Full model retraining scheduled every four weeks, with incremental fine-tuning updates
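
For illustration, a captured correction might be stored as a record like this (all field names hypothetical):

```python
# Hypothetical shape of one captured correction; these records are
# appended to the feedback dataset that feeds the retraining cycles.
feedback_record = {
    "vendor_name": "JW Blue",
    "predicted_match": "Johnnie Walker Black Label",
    "approved": False,
    "corrected_match": "Johnnie Walker Blue Label",
}
```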

Automated Retraining Pipeline (AWS SageMaker Pipelines)

  1. Aggregate New Feedback Data from the feedback database
  2. Merge with Existing Training Data and generate updated embeddings
  3. Retrain the Model from scratch using SageMaker GPU instances
  4. Deploy the Updated Model only if accuracy improves
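
The flow above can be expressed as a SageMaker Pipeline. The sketch below shows only the train → evaluate → conditional-deploy skeleton; the images, scripts, JSON paths, and names are illustrative assumptions:

```python
# Sketch: retrain, evaluate, and gate deployment on improved accuracy.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThan
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

# Steps 1-2 (feedback aggregation and merging) are assumed to have
# produced the merged training file referenced below.
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("object2vec", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)
train_step = TrainingStep(
    name="RetrainMatcher",
    estimator=estimator,
    inputs={"train": "s3://example-bucket/merged/train.jsonl"},
)

# Evaluate the candidate model; evaluate.py (hypothetical) writes
# evaluation.json containing the measured accuracy.
report = PropertyFile(name="EvalReport", output_name="evaluation", path="evaluation.json")
eval_step = ProcessingStep(
    name="EvaluateMatcher",
    processor=ScriptProcessor(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/eval:latest",  # placeholder
        command=["python3"],
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
    ),
    code="evaluate.py",
    inputs=[ProcessingInput(
        source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
        destination="/opt/ml/processing/model",
    )],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    property_files=[report],
)

# Deploy only if Top-1 accuracy beats the current production model.
gate = ConditionStep(
    name="DeployIfImproved",
    conditions=[
        ConditionGreaterThan(
            left=JsonGet(step_name=eval_step.name, property_file=report,
                         json_path="metrics.top1_accuracy"),
            right=0.573,  # baseline Top-1 accuracy before retraining
        )
    ],
    if_steps=[],  # model registration / endpoint update steps go here
    else_steps=[],
)

pipeline = Pipeline(name="ProductMatchRetraining", steps=[train_step, eval_step, gate])
```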

Post-Retraining Accuracy Improvement

| Metric | Initial Accuracy | Post-Retraining Accuracy |
| --- | --- | --- |
| Top-1 Match Accuracy | 57.3% | 68.0% |
| Top-5 Match Accuracy | 75.4% | 86.0% |

MLOps & Production Model Deployment

Infrastructure as Code (IaC)

  • CloudFormation for reproducible AWS deployment
  • GitHub Actions for CI/CD integration
  • Model versioning & rollback mechanisms for safe deployment

Automated Retraining Triggers

  • Scheduled Retraining: Full model retraining runs on the four-week cadence described above
  • Performance Monitoring: Auto-retraining is triggered if match accuracy drops ≥5%
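
A minimal version of the accuracy-drop trigger, assuming the pipeline name from the sketch above and a current-accuracy figure supplied by an external monitoring job:

```python
# Sketch: start retraining when accuracy falls 5 points below baseline.
import boto3

BASELINE_TOP1 = 0.68  # Top-1 accuracy of the currently deployed model

def check_accuracy_and_retrain(current_top1: float) -> None:
    """Kick off the retraining pipeline on a >=5% accuracy drop."""
    if BASELINE_TOP1 - current_top1 >= 0.05:
        boto3.client("sagemaker").start_pipeline_execution(
            PipelineName="ProductMatchRetraining"  # pipeline from the sketch above
        )
```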

Final Results & Business Impact

✅ Reduced onboarding time from weeks to three days

✅ Eliminated most manual product mapping efforts

✅ Scalable system handling near real-time product ingestion

✅ Continuous full retraining cycles ensure long-term accuracy improvements

Key Takeaways

This ML-driven product normalization and matching pipeline transformed a manual, time-consuming process into a fully automated, scalable solution. By leveraging AWS SageMaker, deep learning embeddings, and scheduled full retraining cycles, the system continuously improves, ensuring rapid and accurate product mapping.