Domain-specific cross-language relevant question retrieval

Chinese developers often cannot effectively search questions in English, because they may have difficulties in translating technical words from Chinese to English and formulating proper English queries. For the purpose of helping Chinese developers take advantage of the rich knowledge base of Stack...

Bibliographic Details
Main Authors: XU, Bowen; XING, Zhenchang; XIA, Xin; LO, David; LI, Shanping
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2018
Online Access: https://ink.library.smu.edu.sg/sis_research/3842
https://ink.library.smu.edu.sg/context/sis_research/article/4844/viewcontent/101007_2Fs10664_017_9568_3.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-4844
record_format dspace
spelling sg-smu-ink.sis_research-4844 2019-03-19T03:32:25Z
2018-04-01T07:00:00Z text application/pdf info:doi/10.1007/s10664-017-9568-3 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Cross-language question retrieval
Domain-specific translation
Computer programming
Knowledge based systems
Linguistics
Search engines
Vector spaces
Cross-language question
Document Representation
Domain-specific translation
Query formulation
Retrieval algorithms
Retrieval process
Similarity metrics
Vector space models
Translation (languages)
Databases and Information Systems
Numerical Analysis and Scientific Computing
Software Engineering
description Chinese developers often cannot effectively search questions in English, because they may have difficulties in translating technical words from Chinese to English and formulating proper English queries. For the purpose of helping Chinese developers take advantage of the rich knowledge base of Stack Overflow and simplify the question retrieval process, we propose an automated cross-language relevant question retrieval (CLRQR) system to retrieve relevant English questions for a given Chinese question. CLRQR first extracts essential information (both Chinese and English) from the title and description of the input Chinese question, then performs domain-specific translation of the essential Chinese information into English, and finally formulates an English query for retrieving relevant questions in a repository of English questions from Stack Overflow. We propose three different retrieval algorithms (word-embedding, word-matching, and vector-space-model based methods) that exploit different document representations and similarity metrics for question retrieval. To evaluate the performance of our approach and investigate the effectiveness of different retrieval algorithms, we propose four baseline approaches based on the combination of different sources of query words, query formulation mechanisms and search engines. We randomly select 80 Java, 20 Python and 20 .NET questions in SegmentFault and V2EX (two Chinese Q&A websites for computer programming) as the query Chinese questions. We conduct a user study to evaluate the relevance of the retrieved English questions using CLRQR with different retrieval algorithms and the four baseline approaches. The experiment results show that CLRQR with word-embedding based retrieval achieves the best performance.
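The description names a vector-space-model based method as one of the three retrieval algorithms but does not give its details. The following is a minimal sketch of that general technique only, not the authors' implementation: it assumes the Chinese question has already been turned into an English query, represents the query and each candidate question as a TF-IDF vector, and ranks candidates by cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `retrieve`), the whitespace tokenizer, and the smoothed IDF formula are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumption; real systems do more)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so terms that occur in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, questions, top_k=3):
    """Rank English questions against an (already translated) English query."""
    docs = [tokenize(q) for q in questions]
    # The query is vectorized together with the repository so it shares the IDF.
    vectors = tfidf_vectors(docs + [tokenize(query)])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), q) for vec, q in zip(vectors, questions)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical repository entries, made up for illustration.
    repo = [
        "How to read a text file line by line in Java",
        "Python list comprehension with multiple conditions",
        "Java NullPointerException when calling a method",
    ]
    for score, question in retrieve("java read file line", repo):
        print(f"{score:.3f}  {question}")
```

The extraction and domain-specific translation steps that the description places before retrieval are outside the scope of this sketch.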
format text
author XU, Bowen
XING, Zhenchang
XIA, Xin
LO, David
LI, Shanping
title Domain-specific cross-language relevant question retrieval
publisher Institutional Knowledge at Singapore Management University
publishDate 2018
url https://ink.library.smu.edu.sg/sis_research/3842
https://ink.library.smu.edu.sg/context/sis_research/article/4844/viewcontent/101007_2Fs10664_017_9568_3.pdf