METHOD OF FILE-TYPE IDENTIFICATION WITH SUMMARIZED N-GRAM USING STAGE SAMPLE SELECTION

File-type identification is a complex problem because of the variety of data types and file types. Common software for identifying file types often fails when a file is damaged or modified, because it relies on file extensions, file signatures, and database software. Several studies on...

Bibliographic Details
Main Author: Supriyatno, Gigih
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/38786
Institution: Institut Teknologi Bandung
id id-itb.:38786
timestamp 2019-06-17T15:13:12Z
keywords n-gram, number summarization, letter summarization, non-summarization, file-type identification
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description File-type identification is a complex problem because of the variety of data types and file types. Common software for identifying file types often fails when a file is damaged or modified, because it relies on file extensions, file signatures, and database software. Several studies have approached file-type identification with different methods, one of which is n-gram analysis. Studies that use n-grams for file classification generally use only short n-grams (1-grams to 2-grams). In 2011, Mayer developed the summarized n-gram concept to make use of n-grams with length n > 2. His method eliminates short n-grams and uses long n-grams as predictors. In 2013, Burman improved on Mayer's method by including short n-grams in his algorithm. However, the two researchers used different learning files to build their predictor models for file classification, and differences in both methods and learning files affect the resulting performance. This study builds on the methods of Mayer and Burman by analysing summarization methods and selecting learning files systematically. The results indicate that selecting learning files in stages, combined with an appropriate n-gram extraction method, produces better performance than the Burman experiment.
format Theses
author Supriyatno, Gigih
title METHOD OF FILE-TYPE IDENTIFICATION WITH SUMMARIZED N-GRAM USING STAGE SAMPLE SELECTION
url https://digilib.itb.ac.id/gdl/view/38786
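
The byte n-gram analysis mentioned in the description can be illustrated with a minimal Python sketch. It is not the method of Mayer, Burman, or this thesis: the function names `summarize_byte` and `ngram_counts` are hypothetical, and the "summarization" step (collapsing every ASCII digit to one symbol and every ASCII letter to another) is only an assumption based on the keywords "number summarization" and "letter summarization". The record does not define those terms.

```python
from collections import Counter


def summarize_byte(b: int) -> int:
    """Hypothetical summarization (an assumption, not taken from the thesis):
    map every ASCII digit to b'0' and every ASCII letter to b'a'."""
    if 0x30 <= b <= 0x39:                          # '0'..'9'
        return 0x30
    if 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A:     # 'A'..'Z', 'a'..'z'
        return 0x61
    return b                                       # all other bytes unchanged


def ngram_counts(data: bytes, n: int, summarize: bool = False) -> Counter:
    """Count overlapping byte n-grams of length n in data."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if summarize:
        data = bytes(summarize_byte(b) for b in data)
    # Slide a window of width n over the byte string; an input shorter
    # than n yields an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


if __name__ == "__main__":
    sample = b"%PDF-1.4 obj 12 0"                  # made-up bytes for illustration
    print(ngram_counts(sample, 3).most_common(3))
    print(ngram_counts(sample, 3, summarize=True).most_common(3))
```

Under this assumed summarization, n-grams that differ only in which digit or letter they contain fall into the same bucket, so long n-grams (n > 2) repeat more often across files of one type. How the resulting counts are turned into a predictor, and how learning files are selected in stages, is not specified in this record and is left out of the sketch.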