Seeing Through NAT to Detect Shadow IT: A Machine Learning Approach

Abstract

Network Address Translation (NAT) is present in many routers and Customer Premises Equipment (CPE) devices, where it is used to share internet access among several local hosts. Most NAT devices implement Port Address Translation (PAT), which maps multiple private IP addresses to a single public IP address. The private network behind a NAT device is therefore hidden from the public internet, and only a single outward-facing IP address is visible to Internet Service Providers (ISPs). With the proliferation of unauthorized wired and wireless NAT routers, internet subscribers can redistribute an internet connection or deploy hidden devices, causing a problem known as shadow IT. It is therefore in ISPs' interest to know how their services are used. This study proposes a method to detect NAT devices and to identify the size of the network (number of hosts) hidden behind them. A supervised Machine Learning (ML) algorithm that uses aggregated network traffic flow features is proposed to detect NAT devices. Traffic features are aggregated over multiple window sizes to study the effect of feature aggregation on NAT detection. Host counting is likewise performed by a machine learning approach on features from real network traffic. This research demonstrates that eXtreme Gradient Boosting (XGBoost) performs best in NAT detection and hidden network size detection, whereas the Random Forest (RF) classifier predicts the exact number of hidden hosts better than any other algorithm. The XGBoost NAT detection model detects NAT devices with a 97.09% F1 score, which significantly outperforms many state-of-the-art methods. The exact host counting model achieved a 65.53% F1 score, and the result increased to 90.63% after transforming the problem into a binary one.
Most previous methods focused on achieving a high detection rate on given datasets rather than on the model's generalizability. This thesis instead focuses on the performance of the detection algorithms when the network data is subjected to intentional obfuscation or when the network environment changes. The performance of the detection models dropped below 70% when they were tested in a new network environment. The thesis also interprets the behavior of the complex algorithms in order to enhance trust in the results, understand their generalizability, and explain the importance of feature aggregation for NAT detection. Two eXplainable Artificial Intelligence (XAI) methods are used to analyze how well a given feature set generalizes to different network environments or after obfuscation techniques are applied. These methods are also used to study the sensitivity of the detection algorithms to the extracted aggregated feature set. Finally, this study uses transfer learning to build an optimized model that remains usable when the features of the network traffic data change.
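The idea of aggregating flow features over several window sizes can be illustrated with a minimal sketch. It is not the thesis implementation: the record fields (src_ip, start, bytes, packets, dst_port), the chosen summary statistics, and the function name aggregate_flows are all hypothetical and stand in for whatever feature set the thesis actually extracts.

```python
from collections import defaultdict
from statistics import mean


def aggregate_flows(flows, window_seconds):
    """Group flow records by (source IP, time window) and summarize each group.

    The feature names below are illustrative assumptions, not the thesis's
    actual feature set.
    """
    buckets = defaultdict(list)
    for flow in flows:
        # Flows whose start times fall in the same window share an index.
        window_index = int(flow["start"] // window_seconds)
        buckets[(flow["src_ip"], window_index)].append(flow)

    features = {}
    for key, group in buckets.items():
        features[key] = {
            "flow_count": len(group),
            "total_bytes": sum(f["bytes"] for f in group),
            "mean_packets": mean(f["packets"] for f in group),
            "distinct_dst_ports": len({f["dst_port"] for f in group}),
        }
    return features


# Hypothetical flow records; addresses come from documentation ranges.
flows = [
    {"src_ip": "203.0.113.5", "start": 1.0, "bytes": 500, "packets": 4, "dst_port": 443},
    {"src_ip": "203.0.113.5", "start": 12.0, "bytes": 1500, "packets": 10, "dst_port": 80},
    {"src_ip": "203.0.113.5", "start": 70.0, "bytes": 300, "packets": 2, "dst_port": 443},
    {"src_ip": "198.51.100.7", "start": 5.0, "bytes": 800, "packets": 6, "dst_port": 443},
]

# Recompute the same features at several window sizes so their effect
# on a downstream detector can be compared.
for window in (10, 60):
    rows = aggregate_flows(flows, window)
    print(window, len(rows))
```

Each resulting row (one per source address and window) could then serve as one training example for a supervised classifier, with a label stating whether that address is a NAT device. Larger windows give each row more flows to summarize, at the cost of fewer rows and slower detection.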

Keywords

Network Address Translation, NAT, Network Security, Passive Detection, Client Counting, Machine Learning, NAT Detection, Host Identification, User Anonymity