Learning to query: Focused web page harvesting for entity aspects

As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical por...

Bibliographic Details
Main Authors:	FANG, Yuan, ZHENG, Vincent W., CHANG, Kevin Chen-Chuan
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2016
Subjects:	Harvesting Websites business analytics Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/4066 https://ink.library.smu.edu.sg/context/sis_research/article/5069/viewcontent/icde16_l2q.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-5069
record_format	dspace
spelling	sg-smu-ink.sis_research-5069 2018-07-20T04:59:03Z Learning to query: Focused web page harvesting for entity aspects FANG, Yuan ZHENG, Vincent W. CHANG, Kevin Chen-Chuan As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical portal. Given that search engines are the de facto gateways to access information on the Web, we recognize the essence of our problem as Learning to Query (L2Q) - to intelligently select queries so that we can harvest pages, via a search engine, focused on an entity aspect of interest. Thus, it is crucial to quantify the utilities of the candidate queries w.r.t. some entity aspect. In order to better estimate the utilities, we identify two opportunities and address their challenges. First, a target entity in a given domain has many peers. We leverage these peer entities to become domain aware. Second, a candidate query may “overlap” with the past queries that have already been fired. We account for these past queries to become context aware. Empirical results show that our approach significantly outperforms both algorithmic and manual baselines by 16% and 10% in F-scores, respectively. 2016-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4066 info:doi/10.1109/ICDE.2016.7498308 https://ink.library.smu.edu.sg/context/sis_research/article/5069/viewcontent/icde16_l2q.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Harvesting Websites business analytics Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Harvesting Websites business analytics Databases and Information Systems
spellingShingle	Harvesting Websites business analytics Databases and Information Systems FANG, Yuan ZHENG, Vincent W. CHANG, Kevin Chen-Chuan Learning to query: Focused web page harvesting for entity aspects
description	As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical portal. Given that search engines are the de facto gateways to access information on the Web, we recognize the essence of our problem as Learning to Query (L2Q) - to intelligently select queries so that we can harvest pages, via a search engine, focused on an entity aspect of interest. Thus, it is crucial to quantify the utilities of the candidate queries w.r.t. some entity aspect. In order to better estimate the utilities, we identify two opportunities and address their challenges. First, a target entity in a given domain has many peers. We leverage these peer entities to become domain aware. Second, a candidate query may “overlap” with the past queries that have already been fired. We account for these past queries to become context aware. Empirical results show that our approach significantly outperforms both algorithmic and manual baselines by 16% and 10% in F-scores, respectively.
format	text
author	FANG, Yuan ZHENG, Vincent W. CHANG, Kevin Chen-Chuan
author_facet	FANG, Yuan ZHENG, Vincent W. CHANG, Kevin Chen-Chuan
author_sort	FANG, Yuan
title	Learning to query: Focused web page harvesting for entity aspects
title_sort	learning to query: focused web page harvesting for entity aspects
publisher	Institutional Knowledge at Singapore Management University
publishDate	2016
url	https://ink.library.smu.edu.sg/sis_research/4066 https://ink.library.smu.edu.sg/context/sis_research/article/5069/viewcontent/icde16_l2q.pdf
_version_	1770574239295864832

Similar Items