AUB ScholarWorks

Generalized Machine Learning Based Network Traffic Classification


dc.contributor.advisor Elhajj, Imad Chaiban, Jean Paul 2023-01-10T11:53:55Z 2023-01-10T11:53:55Z 2023-01-10 2023-01-10
dc.description.abstract With the exponential rise in online activity, Internet Service Providers (ISPs) have prioritized network traffic classification in order to dynamically adapt their networks to best serve their customers while increasing their gains. While most work on machine learning based classification has compared different models and techniques, none has studied the effect of the traffic capture location on the model, or whether a model can be generalized to work effectively with different flow capture directions. The aim of this work is to find the best approach to creating network traffic classification models that are, on the one hand, capable of generalizing to different environments while targeting narrower classes of internet traffic and, on the other hand, adaptive, scalable and performant in different production environments. While most previous work attempted to separate general classes such as SSH traffic, VPN traffic and HTTP/HTTPS, we attempt to separate very similar gaming-related classes that use common protocols and backends, with the added complexity of background noise traffic. Another contribution of this work is tackling the traffic direction problem, which is directly related to the traffic capture location. Since no multi-location dataset was available, this work is limited in this regard; we addressed the problem by training and testing our models on each flow direction separately and then comparing the results against the full flow. Our approach is thus two-fold. On one side, we tackle the loss of generalizability with respect to traffic capture direction by creating several models and testing their generalizability. On the other side, we tackle a second generalizability issue: whether the machine learning models used in previous work remain applicable when classifying narrower classes.
Using the Gaming Network Traffic Dataset, we attempt to classify gaming network traffic with much narrower user activity classes than previous work. We create several models. The first, used as a baseline, is random forest, the state-of-the-art algorithm, with pre-engineered features such as interarrival times, packet length and other flow statistics; it obtained a testing accuracy of 44.14%. The second is a Convolutional Neural Network (CNN) based deep learning model, also based on previous work, that takes as input raw network traffic converted into either a grayscale or an RGB image; the optimal bi-flow grayscale model reached a testing accuracy of 47.24%. The third, a deeper CNN-Long Short-Term Memory (LSTM) model that takes into account the temporal dimension of consecutive flows, obtained a testing accuracy of 52.27%, surpassing both the random forest and CNN state-of-the-art models. This model also consistently showed a significant increase in accuracy on client-server side traffic, where traffic categories are harder to separate. Finally, our proposed semi-supervised stacked CNN-random forest model obtained a testing accuracy of 53.4%. We then analyze and compare the results of the proposed simple CNN model and the proposed semi-supervised CNN-random forest architecture on different datasets. The proposed algorithm performed best on the Gaming Network Traffic Dataset specifically, surpassing both the CNN state-of-the-art algorithm and our previous LSTM-based model. This result was, however, bound to the dataset: the model showed only very slight improvement on time-series based datasets and worse results on regular image classification tasks, where it also fell behind models from previous work. Model generalizability will have a major impact on future model development, large-scale deployment and adaptation to different networks.
Future work will study further model optimization for the Gaming Network Traffic Dataset; the collection of a multi-location dataset with a larger number of samples on internet service provider premises; improved generalizability; and the application of more complex recurrent network structures on one side and unsupervised clustering algorithms on the other for choosing the initial model class subgroups.
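The abstract mentions two preprocessing ideas: evaluating each flow direction separately, and converting raw traffic into a grayscale image for a CNN. The following is a minimal illustrative sketch of both, not the thesis's actual pipeline. The `Packet` type, the function names, the choice of identifying direction by source endpoint, and the 28x28 image size are all assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record; the real dataset format may differ."""
    src: str        # source endpoint identifier (assumed)
    payload: bytes  # raw packet bytes


def split_by_direction(
    packets: List[Packet], client: str
) -> Tuple[List[Packet], List[Packet]]:
    """Split a bi-directional flow into client-to-server and
    server-to-client packets, preserving packet order."""
    upstream = [p for p in packets if p.src == client]
    downstream = [p for p in packets if p.src != client]
    return upstream, downstream


def to_grayscale(packets: List[Packet], side: int = 28) -> List[List[int]]:
    """Concatenate raw bytes, truncate or zero-pad to side*side bytes,
    and reshape into rows of pixel intensities in the range 0..255.
    The default side of 28 is an assumption, not a value from the thesis."""
    size = side * side
    raw = b"".join(p.payload for p in packets)[:size]
    raw = raw.ljust(size, b"\x00")  # zero-pad short flows
    return [list(raw[r * side:(r + 1) * side]) for r in range(side)]


# Tiny usage example with a 2x2 image so the layout is easy to inspect.
flow = [Packet("client", b"\x01\x02"), Packet("server", b"\xff"),
        Packet("client", b"\x03")]
up, down = split_by_direction(flow, client="client")
up_image = to_grayscale(up, side=2)      # client-to-server direction only
full_image = to_grayscale(flow, side=2)  # full bi-flow
```

Each byte maps to one pixel, so a per-direction image and a full bi-flow image of the same flow can be fed to the same network architecture and compared, which is the kind of direction-versus-full-flow comparison the abstract describes.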
dc.language.iso en
dc.subject machine learning
dc.subject deep learning
dc.subject CNN
dc.subject LSTM
dc.subject network traffic
dc.subject classification
dc.subject unsupervised learning
dc.subject DBSCAN
dc.subject cybersecurity
dc.title Generalized Machine Learning Based Network Traffic Classification
dc.type Thesis
dc.contributor.department Department of Electrical and Computer Engineering
dc.contributor.faculty Faculty of Engineering and Architecture
dc.contributor.commembers Kayssi, Ayman
dc.contributor.commembers Hajj, Hazem ME
dc.contributor.AUBidnumber 202124523
