Incorporating intrinsic structures into entity matching and representation learning

Bibliographic Details
Main Author: LEE, Ween Jiann
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Online Access: https://ink.library.smu.edu.sg/etd_coll/625
https://ink.library.smu.edu.sg/context/etd_coll/article/1623/viewcontent/GPIS_AY2019_PhD_Lee_Ween_Jiann.pdf
Institution: Singapore Management University
Description
Summary: The proliferation of internet-connected devices and online services has generated vast amounts of user-generated content in various formats, such as text, visual, and spatial information. Despite the potential of advanced deep learning techniques, challenges such as fragmentation, lack of cohesive structure, and the inability to capture intrinsic data structures persist, affecting data amalgamation and quality. Our research addresses these challenges by enhancing entity matching and representation learning across graph, semi-ordered, and spatial data. These advancements have significant implications for applications in transportation, recommendation systems, and urban planning.
In entity matching, we introduce Robust BiPoly-Matching and Semi-Ordered Bidirectional Poly-Matching. Matching records across datasets is fundamental for generating high-quality data applicable to diverse domains. While prior research has extensively explored various matching paradigms, significant gaps remain in two key areas. First, previous work presumes that the entities to be matched are of comparable granularity. Our approach addresses one-to-many, or poly-matching, scenarios in which entities vary in granularity. A distinctive feature of our method is its bidirectional nature, allowing the ‘one’ or the ‘many’ to originate from either source. By incorporating notions of receptivity and reclusivity into a robust matching objective, we effectively handle diverse entity representations and noisy similarity values. Second, existing methods tailored to ordered datasets primarily focus on globally ordered records, assuming consistency across data sources. Our work addresses the challenge of matching records within partially ordered datasets, where groups of records exhibit internal order but their alignment may vary across datasets. We formalize these problems, demonstrate their computational intractability, and introduce novel heuristics that are both effective and efficient. Comprehensive evaluations on real-world and constructed datasets validate the effectiveness of our proposed algorithms in resolving matches across multiple datasets.
In representation learning, existing approaches often fall short of explicitly capturing both semantic and spatial information, relying instead on proxies and synthetic features. We present the GeoNN model for spatially-aware embeddings. GeoNN leverages edge features generated from geodesic functions, dynamically selecting relevant features based on relative locations. It comes in both transductive (GeoNN-T) and inductive (GeoNN-I) variants, ensuring effective encoding of geospatial features and scalability with entity changes. Extensive experiments demonstrate GeoNN’s effectiveness in various tasks, where it outperforms baselines across various evaluation measures.
Our research not only bridges critical gaps in entity matching and representation learning but also provides robust methodologies that can be applied to diverse real-world scenarios. By addressing the complexities of multi-granular and semi-ordered data and capturing intrinsic spatial relationships, our work significantly advances the fields of data matching and representation learning. It supports a wide range of applications, ultimately contributing to improved data quality and more effective solutions in various domains.