POSIT: Simultaneously tagging natural and programming languages

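Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the Clang compiler and part-of-speech tags from the standard Stanford part-of-speech tagger. POSIT improves on the state of the art in accuracy by 10.6% for language identification and by 23.7% for PoS/AST tagging.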
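The record's description says POSIT emits, for every token, a tag (AST or part-of-speech) together with a source-language label, decoded through a Conditional Random Field output layer. The following is a minimal sketch of that output shape only, not the authors' implementation: the biLSTM is replaced by a hand-written scoring function, and the label set, lexicon, scores, and function names are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical joint labels (language, tag): "NL" carries a part-of-speech
# tag, "PL" carries an AST-style tag. Not the label set used in the paper.
Label = Tuple[str, str]
LABELS: List[Label] = [
    ("NL", "VERB"),
    ("NL", "NOUN"),
    ("NL", "DET"),
    ("PL", "call_expr"),
]

# Illustrative lexicon standing in for learned features (an assumption).
LEXICON: Dict[str, str] = {"call": "VERB", "the": "DET", "function": "NOUN"}


def emission_score(token: str, label: Label) -> float:
    """Toy stand-in for the per-token scores a trained biLSTM would produce."""
    lang, tag = label
    looks_like_code = any(ch in token for ch in "(_.")
    if lang == "PL":
        return 2.0 if looks_like_code else -1.0
    if looks_like_code:
        return -1.0
    return 1.5 if LEXICON.get(token.lower()) == tag else 0.0


def transition_score(prev: Label, curr: Label) -> float:
    """Toy CRF transition: mildly prefer staying in the same language."""
    return 0.5 if prev[0] == curr[0] else -0.5


def viterbi(tokens: List[str]) -> List[Label]:
    """Return the highest-scoring label sequence for a linear-chain model."""
    if not tokens:
        return []
    best = [{lab: emission_score(tokens[0], lab) for lab in LABELS}]
    back: List[Dict[Label, Label]] = []
    for tok in tokens[1:]:
        scores: Dict[Label, float] = {}
        pointers: Dict[Label, Label] = {}
        for curr in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + transition_score(p, curr),
            )
            scores[curr] = (
                best[-1][prev]
                + transition_score(prev, curr)
                + emission_score(tok, curr)
            )
            pointers[curr] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label to recover the path.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def tag_mixed_text(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Pair each token with its predicted language and tag."""
    return [(tok, lang, tag) for tok, (lang, tag) in zip(tokens, viterbi(tokens))]


if __name__ == "__main__":
    for row in tag_mixed_text(["Call", "the", "function", "foo_bar()"]):
        print(row)
```

Decoding a single joint label per token means the language label and the tag cannot disagree, which is one simple way to obtain both predictions at once; whether POSIT couples them this way is not stated in this record.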

Bibliographic Details
Main Authors: PÂRȚACHI, Profir-Petru, DASH, Santanu, TREUDE, Christoph, BARR, Earl T.
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
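Subjects: Code-switching; Language identification; Mixed-code; Part-of-speech tagging; Programming Languages and Compilers; Software Engineering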
Online Access: https://ink.library.smu.edu.sg/sis_research/8907
https://ink.library.smu.edu.sg/context/sis_research/article/9910/viewcontent/icse20a.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9910
record_format dspace
spelling sg-smu-ink.sis_research-9910
2024-06-27T08:10:22Z
2020-05-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/8907
info:doi/10.1145/3377811.3380440
https://ink.library.smu.edu.sg/context/sis_research/article/9910/viewcontent/icse20a.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Code-switching
Language identification
Mixed-code
Part-of-speech tagging
Programming Languages and Compilers
Software Engineering
description Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the Clang compiler and part-of-speech tags from the standard Stanford part-of-speech tagger. POSIT improves on the state of the art in accuracy by 10.6% for language identification and by 23.7% for PoS/AST tagging.
format text
author PÂRȚACHI, Profir-Petru
DASH, Santanu
TREUDE, Christoph
BARR, Earl T.
title POSIT: Simultaneously tagging natural and programming languages
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/sis_research/8907
https://ink.library.smu.edu.sg/context/sis_research/article/9910/viewcontent/icse20a.pdf
_version_ 1814047627925258240