Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts
In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rel...
Main Authors: | Ilagan, Jose Ramon; Ilagan, Joseph Benjamin R |
---|---|
Format: | text |
Published: | Archīum Ateneo 2024 |
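Subjects: | Agglomerative Clustering; Business Intelligence; Master Data Management; Retail; Retail Analytics; Shopper Insights; Shopping Analytics; Business; Computer Sciences; Physical Sciences and Mathematics |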
Online Access: | https://archium.ateneo.edu/qmit-faculty-pubs/26 https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf |
Institution: | Ateneo De Manila University |
id |
ph-ateneo-arc.qmit-faculty-pubs-1025 |
---|---|
record_format |
eprints |
spelling |
ph-ateneo-arc.qmit-faculty-pubs-1025 2024-09-30T07:38:02Z Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts Ilagan, Jose Ramon Ilagan, Joseph Benjamin R In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained from the digitization of receipts often contain substitution, insertion, and deletion errors. These errors prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach to link error-laden product names to underlying SKUs to remove this noise. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity SKU. The entity resolution problem can further be modeled as a clique-partitioning problem that can be solved in a reasonable time with an agglomerative clustering heuristic. The results of clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work may be done on implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
2024-01-01T08:00:00Z text application/pdf https://archium.ateneo.edu/qmit-faculty-pubs/26 https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf Quantitative Methods and Information Technology Faculty Publications Archīum Ateneo Agglomerative Clustering Business Intelligence Master Data Management Retail Retail Analytics Shopper Insights Shopping Analytics Business Computer Sciences Physical Sciences and Mathematics |
institution |
Ateneo De Manila University |
building |
Ateneo De Manila University Library |
continent |
Asia |
country |
Philippines |
content_provider |
Ateneo De Manila University Library |
collection |
archium.Ateneo Institutional Repository |
topic |
Agglomerative Clustering; Business Intelligence; Master Data Management; Retail; Retail Analytics; Shopper Insights; Shopping Analytics; Business; Computer Sciences; Physical Sciences and Mathematics |
spellingShingle |
Agglomerative Clustering Business Intelligence Master Data Management Retail Retail Analytics Shopper Insights Shopping Analytics Business Computer Sciences Physical Sciences and Mathematics Ilagan, Jose Ramon Ilagan, Joseph Benjamin R Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
description |
In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained from the digitization of receipts often contain substitution, insertion, and deletion errors. These errors prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach to link error-laden product names to underlying SKUs to remove this noise. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity SKU. The entity resolution problem can further be modeled as a clique-partitioning problem that can be solved in a reasonable time with an agglomerative clustering heuristic. The results of clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work may be done on implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings. |
format |
text |
author |
Ilagan, Jose Ramon; Ilagan, Joseph Benjamin R |
author_facet |
Ilagan, Jose Ramon; Ilagan, Joseph Benjamin R |
author_sort |
Ilagan, Jose Ramon |
title |
Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
title_short |
Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
title_full |
Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
title_fullStr |
Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
title_full_unstemmed |
Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts |
title_sort |
graph-partitioning entity resolution for resolving noisy product names in ocr scans of retail receipts |
publisher |
Archīum Ateneo |
publishDate |
2024 |
url |
https://archium.ateneo.edu/qmit-faculty-pubs/26 https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf |
_version_ |
1811611647708495872 |
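
The description in this record outlines the method only in prose: noisy product names are treated as references to underlying SKUs, and the resulting clique-partitioning problem is approached with an agglomerative clustering heuristic. The following is a minimal sketch of that general idea, not the authors' implementation. The similarity measure (normalized Levenshtein distance), the average-linkage merge rule, the `threshold` value, and the sample product names are all assumptions chosen for illustration; the record does not state which of these the paper uses.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def average_similarity(c1, c2) -> float:
    """Mean pairwise similarity between two clusters (average linkage)."""
    return sum(similarity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster_names(names, threshold=0.7):
    """Greedily merge the two most similar clusters until the best
    average similarity falls below the (hypothetical) threshold."""
    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            score = average_similarity(clusters[i], clusters[j])
            if best is None or score > best[0]:
                best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Invented sample strings imitating OCR errors (O read as 0, S read as 5).
names = [
    "COKE ZERO 1.5L",
    "C0KE ZERO 1.5L",
    "COKE ZER0 1.5L",
    "PRINGLES ORIG",
    "PRINGLE5 ORIG",
]
clusters = cluster_names(names)
for members in clusters:
    print(members)
```

If each pair of names is given the edge weight `similarity - threshold`, merging two clusters raises the total within-cluster weight exactly when their average similarity is at least `threshold`, which is why this merge rule can be read as a greedy heuristic for a clique-partitioning objective. The all-pairs comparison is quadratic in the number of names, which is the kind of cost that the blocking strategies mentioned as future work would aim to reduce.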