Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts

Bibliographic Details
Main Authors: Ilagan, Jose Ramon; Ilagan, Joseph Benjamin R
Format: text
Published: Archīum Ateneo 2024
Online Access: https://archium.ateneo.edu/qmit-faculty-pubs/26
https://archium.ateneo.edu/context/qmit-faculty-pubs/article/1025/viewcontent/1_s2.0_S1877050924014236_main.pdf
Institution: Ateneo De Manila University
Description
Summary: In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained from the digitization of receipts often contain substitution, insertion, and deletion errors. These errors prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach to link error-laden product names to underlying SKUs to remove this noise. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity SKU. The entity resolution problem can further be modeled as a clique-partitioning problem that can be solved in a reasonable time with an agglomerative clustering heuristic. The results of clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work may be done on implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
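
Illustrative sketch (not from the paper): the summary describes grouping noisy product-name strings with an agglomerative clustering heuristic. The Python below is a minimal sketch of that general idea under stated assumptions. The similarity measure (difflib's SequenceMatcher ratio), the average-linkage rule, the 0.6 threshold, the function names, and the sample receipt strings are all hypothetical choices made for illustration; the record does not state which measure, linkage, or parameters the authors used.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def average_linkage(c1: list[str], c2: list[str]) -> float:
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def cluster_names(names: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy agglomerative clustering of product-name strings.

    Start with one cluster per name and repeatedly merge the two most
    similar clusters until no pair reaches the (assumed) threshold.
    """
    clusters = [[n] for n in names]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        best_score, (i, j) = max(
            (average_linkage(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break  # remaining clusters are treated as distinct entities
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical OCR output with substitution and deletion errors.
names = [
    "COCA COLA 1.5L",
    "C0CA COLA 1.5L",
    "COCA C0LA 15L",
    "SKYFLAKES CRACKERS",
    "SKYFLAKE5 CRACKER",
    "SKYFLKES CRACKERS",
]
clusters = cluster_names(names)
for members in clusters:
    print(members)
```

With this toy input, each resulting cluster stands for one underlying product, and every string in it is treated as a reference to that product. The sketch compares every pair of clusters on each pass, so it is only practical for small inputs; the blocking strategies the summary lists as future work are one common way to cut down the number of comparisons.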