Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts

In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rel...

Bibliographic Details
Main Authors: Ilagan, Jose Ramon; Ilagan, Joseph Benjamin R
Format: text
Published: Archīum Ateneo 2024
Subjects:
Online Access: https://archium.ateneo.edu/qmit-faculty-pubs/26
https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf
Institution: Ateneo De Manila University
id ph-ateneo-arc.qmit-faculty-pubs-1025
record_format eprints
spelling ph-ateneo-arc.qmit-faculty-pubs-1025 2024-09-30T07:38:02Z
2024-01-01T08:00:00Z text application/pdf Quantitative Methods and Information Technology Faculty Publications
institution Ateneo De Manila University
building Ateneo De Manila University Library
continent Asia
country Philippines
content_provider Ateneo De Manila University Library
collection archium.Ateneo Institutional Repository
topic Agglomerative Clustering
Business Intelligence
Master Data Management
Retail
Retail Analytics
Shopper Insights
Shopping Analytics
Business
Computer Sciences
Physical Sciences and Mathematics
description In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained from the digitization of receipts often contain substitution, insertion, and deletion errors. These errors prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach to link error-laden product names to underlying SKUs to remove this noise. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity SKU. The entity resolution problem can further be modeled as a clique-partitioning problem that can be solved in a reasonable time with an agglomerative clustering heuristic. The results of clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work may be done on implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
format text
author Ilagan, Jose Ramon
Ilagan, Joseph Benjamin R
title Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts
title_sort graph-partitioning entity resolution for resolving noisy product names in ocr scans of retail receipts
publisher Archīum Ateneo
publishDate 2024
url https://archium.ateneo.edu/qmit-faculty-pubs/26
https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf
_version_ 1811611647708495872
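
The description above frames noisy product names as references to underlying SKUs and groups them with an agglomerative clustering heuristic. The following is a minimal sketch of that general idea, not the authors' implementation: the record does not state the similarity measure, linkage rule, or parameters used in the paper, so the normalized edit-distance similarity, the average linkage, the 0.7 threshold, and the sample names are all assumptions chosen for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity: 1 minus edit distance scaled by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_names(names, threshold=0.7):
    """Greedy average-linkage agglomerative clustering.

    Starts with one cluster per name and repeatedly merges the pair of
    clusters with the highest mean pairwise similarity, stopping once the
    best pair falls below the (assumed) threshold.
    """
    clusters = [[name] for name in names]

    def avg_link(c1, c2):
        total = sum(similarity(x, y) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), avg_link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Hypothetical OCR outputs for two products (not data from the paper).
noisy_names = [
    "COCA COLA 1.5L", "C0CA COLA 1.5L", "COCA C0LA 1.5",
    "PALMOLIVE SHAMPOO", "PALM0LIVE SHAMP00", "PALMOLlVE SHAMPO",
]
clusters = cluster_names(noisy_names)
for cluster in clusters:
    print(cluster)
```

Each resulting cluster stands in for one resolved entity. Every pairwise similarity is recomputed on each pass, which is acceptable for a handful of strings but is the kind of cost that the blocking strategies mentioned as future work would aim to reduce.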