INTERNET BROWSING HISTORY DATA ANALYSIS FOR AUTOMATIC NEGATIVE CONTENT WEBSITE IDENTIFICATION (CASE STUDY: TRUST+ POSITIF)

Bibliographic Details
Main Author: ARISTOFANY - NIM: 23514105 , ARMY
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/25808
Institution: Institut Teknologi Bandung
Description
Summary: A negative content website is a website containing one or more of the following elements: pornography, violence and coercion involving children, incitement to anarchy, and gambling. Negative content websites have grown along with the development of the internet. Authorities have tried several ways of filtering out negative content websites. Among them is the creation of TRUST+™ Positive, a list of negative content websites that serves as the reference for the blocking carried out by ISPs (Internet Service Providers). TRUST+™ Positive has been updated through manual verification of community reports and back engine crawling processes. To improve this updating process, we analyze internet browsing history data so that negative content websites can be identified automatically. We use data mining on internet browsing history data to identify negative content websites, a technique also known as web usage mining. Several data mining algorithms can be used, such as association rules and frequent sequences. Referring to the previous work of Mathias Géry and Hatem Haddad, Evaluation of Web Usage Mining Approaches for User’s Next Request Prediction, we use association rules because they allow some new navigation possibilities to be predicted. The association rule algorithm used is Apriori. Several data preparation steps are taken to get the best results, including the sequence of pages visited by internet users and the average time a single user spends on the network. Websites visited by internet users who have never visited a negative content website are filtered out to minimize the amount of data processed in web usage mining. Finally, the web usage mining results are compared with the newer TRUST+™ Positive list to see how many new negative content websites can be identified.
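The filtering and association rule steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it covers only item pairs (the first two levels of Apriori), treats each user's history as an unordered set of domains, and every name in it (`mine_pair_rules`, `histories`, `blacklist`) is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(histories, blacklist, min_support, min_confidence):
    """Return rules (known_negative_site -> candidate_site, support, confidence).

    histories: dict mapping a user id to an iterable of visited domains
               (an assumed input shape, not the thesis's data format).
    blacklist: set of domains already known to carry negative content.
    """
    # Data preparation: keep only users who visited at least one known
    # negative content site, which shrinks the data to be mined.
    baskets = [set(sites) for sites in histories.values()
               if blacklist & set(sites)]
    n = len(baskets)
    if n == 0:
        return []

    # Level 1: count single sites and keep the frequent ones.
    item_counts = Counter(site for basket in baskets for site in basket)
    frequent = {s for s, c in item_counts.items() if c / n >= min_support}

    # Level 2: Apriori pruning, a pair can be frequent only if both
    # of its members are frequent on their own.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket & frequent), 2):
            pair_counts[pair] += 1

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            # Keep rules that lead from a known negative site to a site
            # that is not yet on the list.
            if (lhs in blacklist and rhs not in blacklist
                    and confidence >= min_confidence):
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))
```

The right-hand sides of the returned rules are only candidates; in the workflow described here they would still be checked against a newer version of the list.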
Web usage mining with the Apriori algorithm shows that the maximum support and confidence that can be used is 0.001. This is due to the huge variety of websites visited by internet users and the unpopularity of negative content websites. Separating the websites that users visit before and after visiting negative content websites during data preparation does not give better results than using the overall data. Filtering out websites visited by internet users who do not visit negative content websites during data preparation reduces the amount of source data processed in web usage mining while maintaining the number of new negative content websites found. Web usage mining results that are not listed in the newer version of TRUST+™ Positive are also suspected of having negative content because they contain keywords related to negative content websites, such as "sex", "porn", "fuck", "tits", "cock", "bokep", "xxx", "poker", "lesbi", and "hentai". The conclusion drawn from this research is that internet browsing history can be used to identify negative content websites automatically. The results obtained can be used to update the TRUST+™ Positive list. The data preparation model used when identifying negative content websites from internet browsing history has a great effect on the results. Preparing internet browsing history data from users who occupy the network for longer than 25.5 minutes and filtering out websites visited by internet users who do not visit negative content websites provides the best negative content website identification with the minimum amount of source data.
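The comparison against the newer list and the keyword check on unlisted results can be sketched as follows. This is an assumed, simplified procedure rather than the one used in the thesis: the function name `screen_candidates`, the plain substring match, and the three output groups are illustrative choices; only the keyword list comes from the summary.

```python
# Keywords quoted in the summary above.
KEYWORDS = ("sex", "porn", "fuck", "tits", "cock",
            "bokep", "xxx", "poker", "lesbi", "hentai")


def screen_candidates(candidates, newer_list, keywords=KEYWORDS):
    """Split mined candidate domains into three groups.

    confirmed:  already present in the newer version of the list
    suspected:  not listed, but the domain name contains a keyword
    unresolved: neither listed nor matching any keyword
    """
    listed = {domain.lower() for domain in newer_list}
    confirmed, suspected, unresolved = [], [], []
    for domain in candidates:
        name = domain.lower()
        if name in listed:
            confirmed.append(domain)
        elif any(keyword in name for keyword in keywords):
            suspected.append(domain)
        else:
            unresolved.append(domain)
    return confirmed, suspected, unresolved
```

Plain substring matching can flag harmless names that merely contain a keyword, so the suspected group is better treated as input to manual verification than as a final verdict.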