Detecting fraud via statistical anomalies
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/175633 |
Institution: | Nanyang Technological University |
Summary: | Urban planners and researchers are increasingly integrating mobility data into the design of smarter and
more sustainable cities. It is therefore crucial to identify anomalies in such datasets to prevent poor
planning or flawed statistical inferences. Mobility data can come from public sources or from data brokers
such as CityData, which offers products for its customers’ economic development. A few studies have detected
anomalies in the September 2020 dataset provided by CityData in the course of their research [1], [2],
but studies that focus on analysing those anomalies are generally lacking. Therefore, the purpose
of this report is to find anomalies not reported in previous studies, determine the percentage of manipulated
pings in each Singapore zone, and then determine whether the data was intentionally manipulated. We
did this by synthesising statistical techniques proposed by [3] with three other mathematical methods.
We found three more anomalies: a circle and line segment, excessive pings, and squares. The number
of decimal places (d.p.) a ping could have was classified into 16 independent and uniformly distributed
bins. We found that our statistical anomalies were the excessive-ping anomalies whose d.p. do not follow
a uniform distribution. Our results indicated that Mandai and Southern Islands had the highest
manipulated percentages, while River Valley had the lowest. Moreover,
Central Area had the largest standard deviation (SD) in manipulated percentage across all regions. Thus, CityData might
have intentionally manipulated the dataset to align with the interests of Singapore’s urban planners. |
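The summary describes sorting pings into 16 decimal-place bins and checking whether the counts follow a uniform distribution. The record does not say which test the report used, so the following is only a minimal sketch under stated assumptions: coordinates are available as text (so trailing zeros survive), the 16 bins correspond to 0 through 15 decimal places, and a chi-square goodness-of-fit test at the 5% level stands in for the report's actual procedure. The function names are hypothetical.

```python
from collections import Counter
from decimal import Decimal

N_BINS = 16  # assumption: bins are 0..15 decimal places
CHI2_CRIT_DF15_ALPHA05 = 24.996  # chi-square critical value, df = 15, alpha = 0.05


def decimal_places(coord: str) -> int:
    """Count digits after the decimal point in a coordinate given as text."""
    exponent = Decimal(coord).as_tuple().exponent
    return max(0, -exponent)


def dp_histogram(coords):
    """Count coordinates per decimal-place bin; longer values fall in the last bin."""
    counts = Counter(min(decimal_places(c), N_BINS - 1) for c in coords)
    return [counts.get(b, 0) for b in range(N_BINS)]


def chi_square_uniform(observed):
    """Chi-square statistic of observed counts against equal expected counts."""
    total = sum(observed)
    if total == 0:
        raise ValueError("no observations")
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def looks_non_uniform(coords) -> bool:
    """True when the uniform hypothesis is rejected at the 5% level."""
    return chi_square_uniform(dp_histogram(coords)) > CHI2_CRIT_DF15_ALPHA05
```

Under these assumptions, a zone whose pings all carry the same number of decimal places would be flagged, while a zone with evenly spread decimal places would not. The threshold and the binning rule are illustrative choices and would need to be matched to the report's own definitions.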