POSIT: Simultaneously tagging natural and programming languages

Bibliographic Details
Main Authors:	PÂRȚACHI, Profir-Petru; DASH, Santanu; TREUDE, Christoph; BARR, Earl T.
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2020
Subjects:	Code-switching; Language identification; Mixed-code; Part-of-speech tagging; Programming Languages and Compilers; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/8907 https://ink.library.smu.edu.sg/context/sis_research/article/9910/viewcontent/icse20a.pdf
Institution:	Singapore Management University

Description
Summary:	Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state of the art in accuracy by 10.6% on language identification and by 23.7% on PoS/AST tagging.
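
Illustrative note: the summary describes a biLSTM with a Conditional Random Field output layer. As a rough sketch of what such an output layer does at prediction time, the Python below runs Viterbi decoding over per-token tag scores and tag-to-tag transition scores. This is not the authors' implementation; the function name viterbi_decode, the parameter names emissions and transitions, and the toy scores are assumptions made only for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at token position t
                        (in a tagger these would come from the biLSTM).
    transitions[i][j] : score of tag i being followed by tag j
                        (the CRF layer's learned parameters).
    Both inputs here are hypothetical toy values, not trained weights.
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers: List[List[int]] = []  # best previous tag for each (position, tag)

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximizes the path score into tag j.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With all transition scores set to zero, the decoder reduces to picking the best tag per token independently; non-zero transitions let neighbouring tags influence each other, which is the reason for placing a CRF layer on top of per-token scores.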

Similar Items