Detection and analysis of web-based malware and vulnerability

Bibliographic Details
Main Author: Wang, Junjie
Other Authors: Liu Yang
Format: Theses and Dissertations
Language: English
Published: 2019
Online Access: https://hdl.handle.net/10356/89049
http://hdl.handle.net/10220/47659
Institution: Nanyang Technological University
Description
Summary: Since the dawn of the Internet, all of us have been swept up by the Niagara of information that fills our daily lives. Browsers play an extremely important role in this process. Modern browsers have evolved from simple text displays into complex software that supports rich user interfaces and a variety of file formats and protocols. This enlarges the attack surface and makes browsers one of the main targets of cyber attacks. Within Internet security, JavaScript malware is one of the major threats: it exploits vulnerabilities in browsers to launch attacks remotely. To protect end users from these threats, this thesis makes two main contributions, identifying JavaScript malware and detecting vulnerabilities in browsers, which together aim at a complete solution for Internet security. To identify JavaScript malware, we first propose to classify it using a machine learning approach combined with dynamic confirmation. Static and dynamic approaches both have merits and drawbacks. Dynamic approaches are effective but not scalable. Static approaches are efficient but normally suffer from a high false negative ratio. To identify JavaScript malware both effectively and efficiently, we propose a two-phase approach. The first phase uses lightweight classification to separate JavaScript malware from benign web pages. The second phase then further subdivides the attack behaviors of the JavaScript malware. We implement our approach as an online tool and conduct a large-scale experiment to show its effectiveness. For an insightful analysis of the evolution trend of JavaScript malware, it is desirable to further classify samples according to the exploited attack vector and the corresponding attack behaviors. Considering the emergence of numerous new JavaScript malware samples and their variants, such automated classification can significantly speed up the overall response to JavaScript malware and even shorten the time needed to discover zero-day attacks.
We propose to use a Deterministic Finite Automaton (DFA) to summarize patterns of malware. Our approach can automatically learn a DFA from the dynamic execution traces of JavaScript malware. The experimental results demonstrate that our approach is more scalable and effective in JavaScript malware detection and classification than commercial anti-virus tools. Through the previous two works, we realized that the root cause of the prevalence of JavaScript malware is the existence of vulnerabilities in browsers. Therefore, finding vulnerabilities in browsers and improving mitigation are of significant importance. We propose a novel data-driven seed generation approach to test the core components of browsers, especially XML engines and XSLT engines. We first learn a Probabilistic Context-Sensitive Grammar (PCSG) from a large number of samples of one specific grammar. The PCSG helps us generate samples whose syntax and semantics are correct with high probability. The experimental results demonstrate that both the bug-finding capability and the code coverage of fuzzing are improved. We further improve coverage-based greybox fuzzing by proposing a new grammar-aware approach for programs that process structured inputs. In detail, our approach requires the grammar of the test inputs, which is often publicly available. Based on the grammar, we propose a grammar-aware trimming strategy that trims test inputs at the tree level. In addition, we introduce two grammar-aware mutation strategies (i.e., enhanced dictionary-based mutation and tree-based mutation). Tree-based mutation works by replacing sub-trees of the Abstract Syntax Tree (AST) of parsed test inputs. With grammar awareness, we can effectively mutate test inputs while keeping the input structure valid, quickly carrying the fuzzing exploration forward in both breadth and depth. We conduct experiments to evaluate its effectiveness on one XML engine, libplist, and two JavaScript engines, WebKit and JerryScript.
The results demonstrate that our approach outperforms other fuzzing tools in both code coverage and bug-finding capability.
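
The tree-based mutation described in the summary (replacing sub-trees of a parsed input so that the result stays structurally valid) can be illustrated with a minimal sketch. This is not the thesis implementation: the `Node` class, the two-symbol toy grammar ("element" and "text"), and the helpers `unparse`, `tree_mutate`, and `is_well_formed` are all hypothetical names chosen for illustration, and a real fuzzer would work on the AST produced by the target's actual grammar.

```python
import copy
import random
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical toy tree: symbol is "element" or "text".
    symbol: str
    text: str = ""  # tag name for elements, literal content for text
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield every node in the tree, parent before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def unparse(node):
    """Serialize a tree back into an XML-like input string."""
    if node.symbol == "text":
        return node.text
    inner = "".join(unparse(c) for c in node.children)
    return f"<{node.text}>{inner}</{node.text}>"


def tree_mutate(target, donor, rng):
    """Replace one sub-tree of `target` with a same-symbol sub-tree of `donor`.

    Matching on the grammar symbol is what keeps the mutated input
    structurally valid. Returns True if a replacement was made.
    """
    by_symbol = {}
    for n in walk(donor):
        by_symbol.setdefault(n.symbol, []).append(n)
    # Candidate slots: (parent, index) pairs whose child symbol the donor has.
    slots = [
        (parent, i)
        for parent in walk(target)
        for i, child in enumerate(parent.children)
        if child.symbol in by_symbol
    ]
    if not slots:
        return False
    parent, i = rng.choice(slots)
    replacement = rng.choice(by_symbol[parent.children[i].symbol])
    parent.children[i] = copy.deepcopy(replacement)  # leave the donor intact
    return True


def is_well_formed(xml_text):
    """Check that a mutated input still parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

In this sketch the root of the target is never replaced, so the mutated input keeps its outermost element. Passing in a seeded `random.Random` makes each mutation reproducible, which is useful when a crashing input has to be regenerated.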