Technical Overview of Building a Scalable Machine Learning Solution
Learn how to design and implement a production-grade ML system that reduced customer onboarding time from weeks to days using embedding-based similarity search and automated retraining.

In my previous article, I discussed the business background of this project. In this article, I will take a technical deep dive into how we built a scalable machine learning solution that transformed a manual process into an automated system.
Business Background
A major technology company processing over $1 billion in transactions required an efficient ML-based solution to normalize and match customer-uploaded product catalogs against a master catalog of 80,000 product names. The existing manual mapping process was slow, error-prone, and took several weeks, significantly delaying customer onboarding.
The objective was to reduce onboarding time to three days by implementing a near real-time, scalable ML pipeline leveraging embedding-based similarity search, probabilistic matching, and automated retraining.
Problem Definition & Data Challenges
High Variability in Product Naming Conventions
- Typographical errors (e.g., "Henessey" vs. "Hennessy")
- Vendor abbreviations (e.g., "JW Blue" vs. "Johnnie Walker Blue Label")
- Encoding inconsistencies (e.g., non-standard Unicode characters, special characters)
- Structural differences (e.g., "Corona - 12oz Bottle" vs. "Corona Extra 12oz")
Scalability and Performance Requirements
- Matching must occur in near real-time as new product records are ingested
- Processing millions of product variations efficiently without memory overhead
- Continuous model improvement with full retraining cycles to prevent performance degradation over time
ML Model Architecture & Design
Embedding-Based Product Normalization
Embedding-based product normalization uses a deep learning model to generate vector representations of product names that capture their contextual meaning. Instead of relying on exact text matching, the system compares embeddings to identify variations, abbreviations, and synonyms, ensuring that different representations of the same product are matched correctly.
Tokenization Process (Splitting Text into Tokens)
- Used BERT-based WordPiece tokenization to split product names into subword units
- Example transformations:
  - "Fireball Cinnamon Whisky" → ["fire", "##ball", "cinnamon", "whisky"]
  - "Firebal Cinn. Whsky" → ["fire", "##bal", "cinn", "whsky"]
- Ensures robust similarity detection even with typos and partial matches (see the tokenization sketch below)
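As an illustrative sketch (not the production code), the snippet below applies a BERT WordPiece tokenizer from the Hugging Face `transformers` library to the example names; the library choice and model name are assumptions, and the exact subword splits depend on the pretrained vocabulary.

```python
# Minimal sketch: WordPiece tokenization of product names with a BERT tokenizer.
# The library (Hugging Face transformers) and model name are assumptions;
# actual subword splits vary with the tokenizer vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for name in ["Fireball Cinnamon Whisky", "Firebal Cinn. Whsky"]:
    print(name, "->", tokenizer.tokenize(name))
```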
Embedding Strategy (Converting Tokens into Vectors)
AWS SageMaker Object2Vec was selected for:
- Generating high-dimensional vector representations of product names
- Capturing contextual relationships beyond simple string matching
- Handling misspellings and abbreviation mismatches more robustly than exact string comparison (see the input-format sketch below)
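To make the pairwise training setup concrete, here is a hedged sketch of the JSON Lines record format that SageMaker Object2Vec consumes: each record carries a label and two integer token-id sequences (`in0`, `in1`). The helper function and toy vocabulary below are illustrative, not the production vocabulary.

```python
# Sketch of one Object2Vec training record: a label plus two token-id sequences.
# The helper name and toy vocabulary are illustrative assumptions.
import json

def to_object2vec_record(tokens_a, tokens_b, label, vocab):
    """Encode a (product name, product name, label) pair as an Object2Vec JSON line."""
    return json.dumps({
        "label": label,                        # 1 = same product, 0 = different products
        "in0": [vocab[t] for t in tokens_a],   # token ids for the first name
        "in1": [vocab[t] for t in tokens_b],   # token ids for the second name
    })

vocab = {"fire": 1, "##ball": 2, "cinnamon": 3, "whisky": 4, "##bal": 5, "cinn": 6, "whsky": 7}
print(to_object2vec_record(["fire", "##ball", "cinnamon", "whisky"],
                           ["fire", "##bal", "cinn", "whsky"], 1, vocab))
```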
Training Data Engineering & Model Optimization
Dataset Construction
- Positive Pairs: Extracted from historical mappings (15K product names)
- Negative Sampling:
  - 445K negative samples generated by pairing products known to be distinct
  - Ensured a 3:1 negative-to-positive sample ratio for robust learning (a sampling sketch follows this list)
- Final Dataset Size: 655K labeled training pairs
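Below is a minimal sketch of one way to generate such negative pairs by sampling distinct master-catalog products for each known mapping; the function name, ratio parameter, and sampling strategy are illustrative rather than the exact production sampler.

```python
# Sketch: build negative training pairs by pairing a vendor name with
# master-catalog products that are known NOT to be its true match.
# Helper name, ratio parameter, and sampling strategy are illustrative.
import random

def build_negative_pairs(positive_pairs, master_catalog, ratio=3, seed=42):
    """Return (vendor_name, wrong_master_name, 0) tuples, `ratio` per positive pair."""
    rng = random.Random(seed)
    negatives = []
    for vendor_name, true_master_name in positive_pairs:
        candidates = [p for p in master_catalog if p != true_master_name]
        for wrong_master in rng.sample(candidates, ratio):
            negatives.append((vendor_name, wrong_master, 0))  # 0 = not a match
    return negatives
```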
Computing Similarity Between Product Embeddings
How Cosine Similarity Works:
Cosine similarity measures the cosine of the angle between two product embeddings, producing a score that is higher when the vectors point in similar directions (a short computation sketch follows the examples below).
- Closer to 1 (cosine ≈ 1.0) → Products are very similar
- Closer to 0 (cosine ≈ 0.0) → Products are different
- Examples:
  - "Coca-Cola 12 Pack" vs. "Pepsi 12pk" → 0.34 ❌ (Different products)
  - "iPhone 14 Pro 256GB" vs. "iPhone 14 128GB" → 0.87 ✅ (Same model, different variant)
  - "Johnnie Walker Black Label" vs. "JW Black" → 0.96 ✅ (High similarity)
Matching Decision Thresholds
| Cosine Similarity | Action Taken |
|---|---|
| ≥ 0.85 | Auto-Approved |
| 0.65 - 0.84 | Flagged for Manual Review |
| < 0.65 | No Match Found |
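The table translates directly into a small routing function. The sketch below uses the thresholds above; the function and label names are illustrative.

```python
# Sketch: route a candidate match based on its cosine similarity score.
# Thresholds come from the table above; names are illustrative.
def route_match(score: float) -> str:
    if score >= 0.85:
        return "auto_approved"    # >= 0.85: accept the match automatically
    if score >= 0.65:
        return "manual_review"    # 0.65-0.84: send to a human reviewer
    return "no_match"             # < 0.65: treat as no match found
```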
Embedding Storage
To support efficient, near real-time retrieval of similar products, the embeddings are stored in an S3 bucket.
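A hedged sketch of that storage step with boto3 is shown below; the bucket and key names are placeholders, and the serialization format (a NumPy array) is an assumption.

```python
# Sketch: persist and reload the embedding matrix in S3 with boto3.
# Bucket/key names and the NumPy serialization format are assumptions.
import io
import boto3
import numpy as np

s3 = boto3.client("s3")

def save_embeddings(embeddings: np.ndarray, bucket: str, key: str) -> None:
    """Serialize the embedding matrix and upload it to S3."""
    buf = io.BytesIO()
    np.save(buf, embeddings)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

def load_embeddings(bucket: str, key: str) -> np.ndarray:
    """Download and deserialize the embedding matrix from S3."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return np.load(io.BytesIO(body))
```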
Baseline Model Performance (Pre-Retraining)
| Metric | Initial Accuracy |
|---|---|
| Top-1 Match Accuracy | 57.3% |
| Top-5 Match Accuracy | 75.4% |
Near Real-Time Inference
- AWS SageMaker Endpoint serves the trained embedding model
- AWS Lambda triggers inference upon new product data ingestion
- Matching occurs in near real-time, typically under three seconds, using a pre-loaded embedding model (a handler sketch follows this list)
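Here is a hedged sketch of the Lambda-side call: on ingestion of a new product record, the handler invokes the SageMaker endpoint and returns the model's response. The endpoint name, event fields, and payload schema are assumptions.

```python
# Sketch of a Lambda handler that sends a newly ingested product name to the
# SageMaker endpoint for embedding/matching. Endpoint name, event fields, and
# the request/response payloads are placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    product_name = event["product_name"]
    response = runtime.invoke_endpoint(
        EndpointName="product-embedding-endpoint",   # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": [product_name]}),
    )
    return json.loads(response["Body"].read())
```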
Feedback-Driven Incremental Learning & Full Retraining Cycles
Feedback Loop Integration
- Manual corrections captured via API and stored in a feedback dataset (a record-shape sketch follows this list)
- Incremental dataset augmentation to continuously refine embeddings
- Full model retraining scheduled every four weeks, with incremental fine-tuning updates
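For illustration, a corrected match might be captured as a record like the one sketched below; the field names are assumptions rather than the actual API schema.

```python
# Sketch of the shape a feedback record might take when a reviewer corrects a
# match; field names are illustrative, not the real API schema.
from datetime import datetime, timezone

def build_feedback_record(vendor_name: str, predicted_match: str, corrected_match: str) -> dict:
    return {
        "vendor_name": vendor_name,
        "predicted_match": predicted_match,
        "corrected_match": corrected_match,   # becomes a positive pair in the next retraining cycle
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
    }
```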
Automated Retraining Pipeline (AWS SageMaker Pipelines)
- Aggregate new feedback data from the database
- Merge it with the existing training data and generate updated embeddings
- Retrain the model from scratch on SageMaker GPU instances
- Deploy the updated model only if accuracy improves (see the deployment-gate sketch below)
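The last step is effectively a gate. A minimal sketch of that logic is below; the evaluation and deployment helpers are placeholders for the actual pipeline steps.

```python
# Sketch of the "deploy only if accuracy improves" gate at the end of the
# retraining pipeline; `evaluate` and `deploy` stand in for the real steps.
def maybe_deploy(new_model, live_accuracy: float, evaluate, deploy) -> bool:
    """Deploy the retrained model only when it beats the live model on holdout data."""
    new_accuracy = evaluate(new_model)   # e.g. Top-1 match accuracy on a holdout set
    if new_accuracy > live_accuracy:
        deploy(new_model)
        return True
    return False
```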
Post-Retraining Accuracy Improvement
| Metric | Initial Accuracy | Post-Retraining Accuracy |
|---|---|---|
| Top-1 Match Accuracy | 57.3% | 68.0% |
| Top-5 Match Accuracy | 75.4% | 86.0% |
MLOps & Production Model Deployment
Infrastructure as Code (IaC)
- CloudFormation for reproducible AWS deployment
- GitHub Actions for CI/CD integration
- Model versioning & rollback mechanisms for safe deployment
Automated Retraining Triggers
- Scheduled Retraining: Full model retraining occurs at regular intervals (every four weeks, as noted above)
- Performance Monitoring: Auto-retraining is triggered if match accuracy drops by 5% or more (a trigger sketch follows this list)
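A minimal sketch of that monitoring check is shown below; it interprets the 5% threshold as percentage points, and the retraining hook is a placeholder for the actual pipeline start call.

```python
# Sketch: trigger retraining when measured accuracy falls at least five
# percentage points below the baseline. `start_retraining` is a placeholder
# for starting the SageMaker retraining pipeline.
def check_and_trigger(baseline_accuracy: float, current_accuracy: float, start_retraining) -> bool:
    if baseline_accuracy - current_accuracy >= 0.05:
        start_retraining()
        return True
    return False
```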
Final Results & Business Impact
✅ Reduced onboarding time from weeks to three days
✅ Eliminated most manual product mapping efforts
✅ Scalable system handling near real-time product ingestion
✅ Continuous full retraining cycles ensure long-term accuracy improvements
Key Takeaways
This ML-driven product normalization and matching pipeline transformed a manual, time-consuming process into a fully automated, scalable solution. By leveraging AWS SageMaker, deep learning embeddings, and scheduled full retraining cycles, the system continuously improves, ensuring rapid and accurate product mapping.