POSIT: Simultaneously tagging natural and programming languages
Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reu...
Main Authors: | PÂRȚACHI, Profir-Petru; DASH, Santanu; TREUDE, Christoph; BARR, Earl T. |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2020 |
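Subjects: | Code-switching; Language identification; Mixed-code; Part-of-speech tagging; Programming Languages and Compilers; Software Engineering |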
Online Access: | https://ink.library.smu.edu.sg/sis_research/8907 https://ink.library.smu.edu.sg/context/sis_research/article/9910/viewcontent/icse20a.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-9910 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-9910 2024-06-27T08:10:22Z 2020-05-01T07:00:00Z text application/pdf info:doi/10.1145/3377811.3380440 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
topic |
Code-switching; Language identification; Mixed-code; Part-of-speech tagging; Programming Languages and Compilers; Software Engineering |
description |
Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state of the art on language identification by 10.6% and on PoS/AST tagging by 23.7% in accuracy. |
author |
PÂRȚACHI, Profir-Petru; DASH, Santanu; TREUDE, Christoph; BARR, Earl T. |
title |
POSIT: Simultaneously tagging natural and programming languages |
_version_ |
1814047627925258240 |