VULNERABILITY DETECTION IN PHP WEB APPLICATION USING LEXICAL ANALYSIS APPROACH WITH MACHINE LEARNING


Bibliographic Details
Main Author: RIZKI ANBIYA - NIM : 23515029, DHIKA
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/26584
Institution: Institut Teknologi Bandung
Description
Summary: One important aspect of PHP web application development is security. Data breaches are caused by vulnerabilities in web applications. One method for detecting vulnerabilities is static analysis, which analyzes an application without executing it. The advantage of static analysis is that it checks the source code in depth, so the root cause of a security problem can be found rather than just its symptoms. However, static analysis requires an expert and takes a huge amount of time.

Security vulnerability detection can also be done using lexical analysis and machine learning. Lexical analysis transforms the source code into a form that is easy to process, such as tokens, which are then passed to a machine learning model for classification. The choice of features and classification algorithm affects the detection results. This research applies cross-project detection. The data comes from the CVE Details website and consists of 264 SQL injection (SQLi), 80 cross-site scripting, 117 directory traversal, and 136,090 non-vulnerable samples. Because the data distribution is imbalanced, it is handled with SMOTE oversampling and Cluster Centroid undersampling. The features are AST tokens and PHP tokens, as well as pruned ASTs and modified PHP tokens. Classification uses Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Decision Tree; clustering uses KMeans. The KMeans algorithm is weighted by giving more weight to features that often appear in the vulnerable classes.
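
As a rough illustration of the lexical analysis step described in the summary, the Python sketch below turns a PHP snippet into a list of tokens and then into a token-count feature vector. It is not the tokenizer used in the thesis: the regex-based lexer, the token class names (`T_VARIABLE`, `T_STRING_LITERAL`, and so on), and the helper names `tokenize` and `token_counts` are assumptions made for this example, and the sample PHP line is invented.

```python
import re
from collections import Counter

# Hypothetical, simplified token classes. A real PHP lexer distinguishes
# many more token types than this sketch does.
TOKEN_SPEC = [
    ("T_OPEN_TAG", r"<\?php"),
    ("T_CLOSE_TAG", r"\?>"),
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("T_STRING_LITERAL", r'"(?:\\.|[^"\\])*"' r"|'(?:\\.|[^'\\])*'"),
    ("T_NUMBER", r"\d+(?:\.\d+)?"),
    ("T_IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_WHITESPACE", r"\s+"),
    ("T_SYMBOL", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(source):
    """Return a flat list of tokens for a PHP source string."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "T_WHITESPACE":
            continue  # whitespace carries no signal in this sketch
        if kind == "T_IDENTIFIER":
            # Keep function and keyword names; they may hint at risky calls.
            tokens.append(match.group().lower())
        elif kind == "T_SYMBOL":
            tokens.append(match.group())
        else:
            # Abstract away concrete variable names and literal values.
            tokens.append(kind)
    return tokens


def token_counts(source, vocabulary):
    """Count how often each vocabulary token occurs in the source."""
    counts = Counter(tokenize(source))
    return [counts[token] for token in vocabulary]


# Invented sample line, not taken from the thesis data set.
sample = '<?php $id = $_GET["id"]; mysql_query("SELECT * FROM users WHERE id=" . $id); ?>'
print(tokenize(sample))
print(token_counts(sample, ["T_VARIABLE", "mysql_query", "echo"]))  # [3, 1, 0]
```

A vector like this, built for every source unit, is the kind of input a classifier such as GNB could be trained on. The vocabulary here is chosen by hand purely for illustration.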

Based on the test results, the GNB algorithm with modified PHP tokens as features has the highest recall for both two-class and four-class vulnerability classification, but it has very low precision.
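
To make the recall and precision trade-off concrete, here is a minimal Python sketch of how the two metrics are computed in a two-class setting. The labels below are invented for illustration and are not results from the thesis; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy labels: 1 = vulnerable, 0 = not vulnerable.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A detector that flags generously: it catches both vulnerable items
# but also raises four false alarms.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.33 recall=1.00
```

In this toy case no vulnerable item is missed (recall 1.0), but two thirds of the alarms are false (precision about 0.33). This is the same pattern the summary reports for GNB, shown on a much smaller invented scale.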