By Paulo Freitas de Araujo Filho

As a result of Tempest’s efforts to disseminate knowledge and contribute to science, academia, industry, and the cybersecurity community, we are launching a series of five blog posts discussing the use of machine learning in intrusion detection systems (IDSs).

In this first post, we discuss the advantages and disadvantages of signature-based and anomaly-based IDSs, as well as how machine learning can empower IDSs and allow them to detect more complex cyber-attacks. We explain the differences between supervised and unsupervised learning, and the main techniques used by unsupervised IDSs to detect unknown attacks. Finally, we discuss the downsides and future trends in the use of machine learning to detect intrusions.

In the other four posts of this series, we discuss four methods that are commonly used in state-of-the-art unsupervised IDSs: clustering, one-class novelty detection, autoencoders, and generative adversarial networks (GANs) [1]. We will present them as the result of academic research efforts, but with the practical goal of enhancing cybersecurity for industries and enterprises, and will introduce the machine learning concepts required to easily understand these techniques. Welcome to the journey!

The occurrence of cyber-attacks has been rapidly increasing over the years. They compromise the operation of services and industries, cause significant financial losses, and put people’s lives at risk [1]. The threat is even greater in the case of Advanced Persistent Threats (APTs), which are among the biggest security challenges we face. APTs are sophisticated and premeditated attacks designed to persist in a system or network until they fulfill their goals. In 2014, for example, the Carbanak APT campaign successfully attacked up to 100 financial institutions across the globe, causing losses as high as $1 billion [2].

In this context, IDSs, which detect cyber-attacks when other security mechanisms fail, have gained even more importance in protecting and securing systems and networks [3]–[5]. IDSs monitor devices and networks to identify evidence of intrusions and report potentially malicious activities [1]. Many IDSs, such as Snort [6], Bro [7], and Suricata [8], have been proposed, developed, and evaluated over the past two decades following the signature-based (or misuse-based) approach, which relies on predefined attack patterns and identifies matches between those patterns and the current state of systems and networks. This approach has been shown to be very effective at detecting known attacks with low false positive rates.

However, signature-based IDSs are only as good as their database of attacks and cannot efficiently detect unknown attacks without raising many false alarms. By definition, there is no signature for an unknown attack, and loosening existing signatures so that they match unknown attacks raises many false alarms. Moreover, the database of attacks needs to be constantly updated as new attacks emerge, and the larger it becomes, the longer it takes to search for matches [1]. Anomaly-based IDSs, on the other hand, overcome these limitations: they detect attacks by measuring deviations between observed data patterns and what they consider to be the normal behavior of systems and networks. Although statistical methods can be used to construct such a normal behavior profile and measure deviations from it, the use of machine learning has recently been showing very promising results in anomaly-based IDSs, while also allowing the detection of more complex cyber-attacks [9].

Machine Learning-Based IDSs

Machine learning-based IDSs do not rely on signatures of known attacks, but on observing and identifying patterns in data. They learn the patterns that data from systems and networks exhibit when there is no malicious activity, and possibly also the patterns that occur during attacks. Then, they evaluate whether the data samples under analysis resemble the patterns of normal or malicious activities, and decide whether or not to raise alerts.

According to their learning approach, machine learning-based IDSs can be classified as supervised or unsupervised. Supervised IDSs train machine learning models with data from the normal operation of systems and networks and with data from past examples of cyber-attacks. Hence, they use labels to distinguish between normal and malicious data so that they know which data corresponds to what, e.g., all normal data are marked with the label “0” and all malicious data with the label “1”. Then, they compute a probability or score indicating whether a sample under evaluation has the same pattern as normal or malicious samples, i.e., whether it belongs to the normal class or to the attack class.

Moreover, supervised IDSs can also be trained with data from more than two classes, e.g., with normal data and data from different types of cyber-attacks. Such multi-class systems learn the patterns that correspond to the normal operation of systems and networks and to the occurrence of different cyber-attacks. Thus, they classify samples as normal or as specific types of cyber-attacks. Different classification algorithms, such as Decision Trees, Support Vector Machines (SVMs), K-Nearest Neighbors (KNN), and neural networks, have been successfully used for detecting intrusions in supervised IDSs. However, two main drawbacks of supervised IDSs are their inability to detect unknown attacks and their need to be retrained every time a new attack is discovered [1], [10].
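As a toy illustration of the supervised approach, the sketch below trains a decision tree on labeled synthetic data. The two features (and all numbers) are made up for the example; a real IDS would use features extracted from actual traffic:

```python
# Hypothetical sketch: a supervised IDS classifier on synthetic, labeled data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Illustrative features only, e.g., packets per second and mean payload size.
normal = rng.normal(loc=[100.0, 500.0], scale=[10.0, 50.0], size=(500, 2))
attack = rng.normal(loc=[400.0, 120.0], scale=[30.0, 20.0], size=(500, 2))

X = np.vstack([normal, attack])
y = np.array([0] * 500 + [1] * 500)  # 0 = normal, 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the classifier only knows the two labeled classes it saw during training, an attack with a pattern unlike anything in the training set may well be classified as normal, which is exactly the drawback discussed above.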

On the other hand, unsupervised IDSs do not rely on labels to distinguish between different types of data. They use data only from the normal operation of systems and networks, and learn the pattern of how systems and networks behave when there is no attack. Then, they detect cyber-attacks as the samples that do not conform with the normal pattern. Since they only rely on normal data, unsupervised IDSs are able to detect cyber-attacks without any prior knowledge of past attacks and do not need to be retrained when new types of attacks are discovered.

Main Strategies of Unsupervised IDSs

The four methods most commonly used in state-of-the-art unsupervised IDSs are clustering, one-class novelty detection, autoencoders, and GANs [1]. While clustering is a broader category of algorithms concerned with grouping data samples that look alike, the other three rely on more specific machine learning techniques. One-class novelty detection covers several traditional machine learning algorithms that are trained with data from only a single class, whereas autoencoders and GANs rely on specific deep learning architectures. In the following, we briefly describe these four strategies, which will be further discussed in the other blog posts of this series.

Clustering-based Unsupervised IDSs

Clustering algorithms separate a finite unlabeled dataset, i.e., data samples that are not marked with labels, into a finite and discrete set of “natural” hidden data structures. They partition the given data into groups with high inner similarity and outer dissimilarity, without any prior knowledge. Put simply, samples that look like each other are grouped together and kept away from samples that are different from them. Then, scores are assigned to the formed clusters, and clusters whose score exceeds a threshold are considered malicious.
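As a minimal, hypothetical sketch of this idea, the snippet below uses DBSCAN [11] from scikit-learn; in this variant, samples that do not join any dense cluster are the ones flagged as suspicious. The two-dimensional features are synthetic and purely illustrative:

```python
# Hypothetical sketch: clustering-based anomaly flagging with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# A dense "normal" region around the origin.
normal = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(300, 2))
# Five far-away points standing in for malicious traffic (placed by hand
# so they are clearly outside the dense region).
anomalies = np.array([[4.0, 4.0], [6.0, -5.0], [-5.0, 6.0], [7.0, 7.0], [-6.0, -6.0]])
X = np.vstack([normal, anomalies])

# Points that do not belong to any dense cluster receive the label -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
flagged = np.where(labels == -1)[0]
print("flagged indices:", flagged)
```

Note that `eps` and `min_samples` play the role of the threshold mentioned above: they control how dense a group must be to count as a legitimate cluster rather than noise.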

One-Class Novelty Detection

Novelty detection is the task of identifying data patterns that differ from the data patterns available for training. Thus, one-class machine learning classifiers can be used to learn a decision boundary that encloses all data patterns available for training, such that different data patterns lie outside of that boundary. In simpler terms, if we think of each data sample as a point, one-class classifiers define the region in which all training points should lie, so that points that are too different from them fall outside that region and are considered malicious.
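A minimal sketch of this idea with scikit-learn's One-Class SVM, trained only on synthetic "normal" samples (all numbers are illustrative, not from any real network):

```python
# Hypothetical sketch: One-Class SVM trained on "normal" samples only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Training uses normal data only -- no attack samples, no labels.
normal_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal_train)

normal_test = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
attacks = rng.normal(loc=6.0, scale=0.5, size=(10, 2))  # far outside the boundary

# predict() returns +1 for points inside the learned boundary and -1 outside it.
print("attacks flagged:", (ocsvm.predict(attacks) == -1).sum(), "/ 10")
print("normals flagged:", (ocsvm.predict(normal_test) == -1).sum(), "/ 100")
```

The `nu` parameter bounds the fraction of training points allowed to fall outside the boundary, which is one way of tuning the trade-off between missed detections and false alarms.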

Autoencoder-based Unsupervised IDSs

Autoencoders are neural networks that learn to reconstruct their input. They encode data samples into usually smaller representations of themselves, and then decode those representations to recover the samples in their original form. Thus, if autoencoders are trained only on normal data, they learn how to reconstruct normal data so that the reconstruction error, i.e., the difference between the original sample and its reconstruction, is small. However, when such autoencoders try to reconstruct malicious samples, they produce large reconstruction errors, since they were not trained on that type of data. Therefore, data samples are evaluated by measuring the deviation between them and their reconstruction through the autoencoder, so that large deviations indicate a high probability that a data pattern represents malicious activity [13].
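A hypothetical sketch of reconstruction-error scoring, using scikit-learn's MLPRegressor as a stand-in for a proper deep autoencoder (the synthetic 4-D features are built to be compressible, which is what lets the bottleneck work):

```python
# Hypothetical sketch: anomaly scoring by autoencoder reconstruction error.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
# 4-D "normal" features that lie near a 1-D line, hence easy to compress.
t = rng.normal(size=(1000, 1))
normal = np.hstack([t, 2 * t, -t, 0.5 * t]) + rng.normal(scale=0.05, size=(1000, 4))

# A 2-neuron bottleneck forced to reproduce its own 4-D input.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  max_iter=2000, random_state=0).fit(normal, normal)

def reconstruction_error(x):
    # Mean squared difference between a sample and its reconstruction.
    return np.mean((ae.predict(x) - x) ** 2, axis=1)

normal_err = reconstruction_error(normal).mean()
attack = rng.uniform(-3, 3, size=(20, 4))  # breaks the learned structure
attack_err = reconstruction_error(attack).mean()
print(f"normal error: {normal_err:.4f}, attack error: {attack_err:.4f}")
```

An IDS built on this idea would pick a threshold on the reconstruction error (for example, a high percentile of the errors seen on normal training data) and alert on any sample above it.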

GAN-based Unsupervised IDSs

GANs are a framework that simultaneously trains two competing neural networks through an adversarial process: a generator and a discriminator. The generator learns the probabilistic distribution of the training data and how to produce similar data from a random vector drawn from a known prior distribution. That is, the generator is a neural network that learns how to produce samples that look like the training data. For example, it can be trained with cat images so that it learns to generate other cat images. Or, in our case, it can be trained with normal data from systems and networks so that it learns what that data should look like. The discriminator, on the other hand, learns to distinguish between real data and data produced by the generator, i.e., to identify whether a sample is real or fake. In our cat example, given a cat image, the discriminator tells us whether that image is a real picture of a cat or a fake image produced by the generator. Hence, if the GAN is trained with normal data from systems and networks, the discriminator learns to distinguish samples that are normal from samples that do not conform to the normal behavior and may be the result of cyber-attacks [9].
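The adversarial game can be sketched even without a deep learning framework. Below is a deliberately minimal, hypothetical one-dimensional GAN in plain NumPy, with hand-derived gradients; real GAN-based IDSs use much larger networks and frameworks such as TensorFlow or PyTorch:

```python
# Hypothetical sketch: a minimal 1-D GAN. The generator is g(z) = a*z + b and
# the discriminator is D(x) = sigmoid(w*x + c); both use hand-derived gradients.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

real_mean, real_std = 3.0, 0.5   # the "normal" data lives around 3.0
a, b = 1.0, 0.0                  # generator parameters
w, c = 1.0, 0.0                  # discriminator parameters
lr, batch = 0.01, 64

for step in range(5000):
    x_real = rng.normal(real_mean, real_std, batch)
    z = rng.normal(size=batch)
    x_fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w -= lr * np.mean(-(1 - d_real) * x_real + d_fake * x_fake)
    c -= lr * np.mean(-(1 - d_real) + d_fake)

    # Generator step (non-saturating loss): push D(fake) toward 1.
    d_fake = sigmoid(w * x_fake + c)
    dL_dg = -(1 - d_fake) * w
    a -= lr * np.mean(dL_dg * z)
    b -= lr * np.mean(dL_dg)

fake_mean = np.mean(a * rng.normal(size=1000) + b)
print(f"generator mean after training: {fake_mean:.2f} (target ~3.0)")
```

After training, the generator's samples cluster around the real data, and the trained discriminator (or a combination of discriminator output and generator reconstruction, as in [9]) can serve as the anomaly score for new samples.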

The Downside

Despite everything we have discussed so far, it is important to highlight that machine learning is not the one and only solution for detecting intrusions, nor will it replace the entire cybersecurity arsenal that already exists. Although machine learning-based IDSs significantly contribute to detecting cyber-attacks, like any other detection approach, they also have disadvantages. For instance, supervised IDSs need to be constantly retrained, and unsupervised IDSs usually present more false positives.

Moreover, using machine learning to detect cyber-attacks also presents limitations that challenge its deployment. Perhaps the first and most important of these limitations is the availability of data. To produce good and reliable results, machine learning models need a lot of training data. However, obtaining a large amount of malicious samples is neither an easy nor a cheap task.

Finally, while machine learning contributes to detecting cyber-attacks, it can also introduce vulnerabilities. It has recently been shown that specially crafted noise that is not easily perceived can be added to data samples to compromise the output of machine learning-based classifiers. That is, by adding such noise to cat images, so that the images still look the same to a human, machine learning-based image classifiers may be fooled into misclassifying the tampered images. Following the same concept, an adversary can craft such noise, called an adversarial perturbation, to perform an adversarial attack, so that machine learning-based intrusion detection systems, or even malware detection systems, are compromised and no longer achieve good accuracy [14].
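As a hypothetical sketch of the idea, the snippet below crafts an FGSM-style perturbation against a simple logistic-regression "detector". The features are synthetic, and the epsilon is exaggerated so the flip is obvious; real adversarial perturbations are kept small enough to be imperceptible:

```python
# Hypothetical sketch: evading a simple detector with an FGSM-style perturbation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))
attack = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(300, 2))
X = np.vstack([normal, attack])
y = np.array([0] * 300 + [1] * 300)

clf = LogisticRegression().fit(X, y)
w = clf.coef_[0]

# Take a malicious sample the model detects correctly...
x = np.array([4.0, 4.0])
assert clf.predict([x])[0] == 1

# ...and nudge it against the model's score gradient (for logistic regression,
# the gradient of the score with respect to the input is simply w).
eps = 3.0
x_adv = x - eps * np.sign(w)
print("original prediction:", clf.predict([x])[0])
print("perturbed prediction:", clf.predict([x_adv])[0])
```

The perturbed sample keeps its malicious purpose but is now scored as normal, which is precisely what makes adversarial attacks dangerous for ML-based IDSs.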

Conclusion and Future Trends

As you may have realized by now, while machine learning can empower intrusion detection systems, there are still several challenges to be addressed. For instance, although some hybrid IDSs have already been proposed, it is still an open challenge to combine the strengths of signature-based, supervised, and unsupervised IDSs into a more effective and efficient IDS. Besides, since IDSs produce isolated security alerts, it is still very challenging to correlate detected alerts to identify larger, distributed, and more sophisticated cyber-attacks. Finally, it is also essential to harden machine learning models against adversarial attacks, so as to mitigate the risks of using machine learning in cybersecurity.

At Tempest, we are working towards solving such challenging problems to better protect your business! Stay tuned for our next post!

References

[1] A. Nisioti, A. Mylonas, P. D. Yoo and V. Katos, “From Intrusion Detection to Attacker Attribution: A Comprehensive Survey of Unsupervised Methods,” in IEEE Commun. Surveys & Tut., vol. 20, no. 4, pp. 3369-3388, Fourth quarter 2018, doi: 10.1109/COMST.2018.2854724.

[2] (2015). Kaspersky Security Bulletin. Accessed: Feb. 15, 2022. [Online]. Available: https://securelist.com/analysis/kaspersky-security-bulletin/73038/kaspersky-security-bulletin-2015-overall-statistics-for-2015/

[3] Y. Jia, F. Zhong, A. Alrawais, B. Gong, and X. Cheng, “FlowGuard: An Intelligent Edge Defense Mechanism Against IoT DDoS Attacks,” IEEE Internet of Things J., vol. 7, no. 10, pp. 9552–9562, 2020. doi: 10.1109/JIOT.2020.2993782.

[4] D. Li, D. Chen, B. Jin, L. Shi, J. Goh, and S.-K. Ng, “MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks,” in Springer Int. Conf. on Artif. Neural Netw., 2019, pp. 703–716.

[5] N. Chaabouni, M. Mosbah, A. Zemmari, C. Sauvignac, and P. Faruki, “Network Intrusion Detection for IoT Security Based on Learning Techniques,” IEEE Commun. Surveys & Tut., vol. 21, no. 3, pp. 2671–2701, 2019. doi: 10.1109/COMST.2019.2896380.

[6] M. Roesch, “Snort: Lightweight Intrusion detection for networks,” in Proc. LISA, vol. 99. Seattle, WA, USA, Nov. 1999, pp. 229–238.

[7] V. Paxson, “Bro: A system for detecting network intruders in real-time,” Comput. Netw., vol. 31, nos. 23–24, pp. 2435–2463. 1999.

[8] Suricata-IDS. Accessed: Feb. 15, 2022. [Online]. Available: https://suricata-ids.org/

[9] P. Freitas de Araujo-Filho, G. Kaddoum, D. R. Campelo, A. Gondim Santos, D. Macêdo and C. Zanchettin, “Intrusion Detection for Cyber–Physical Systems Using Generative Adversarial Networks in Fog Environment,” IEEE Internet of Things J., vol. 8, no. 8, pp. 6247-6256, Apr. 2021, doi: 10.1109/JIOT.2020.3024800.

[10] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surveys, vol. 41, no. 3, p. 15, 2009.

[11] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, pp. 19:1–19:21, Jul. 2017.

[12] P. Freitas De Araujo-Filho, A. J. Pinheiro, G. Kaddoum, D. R. Campelo and F. L. Soares, “An Efficient Intrusion Prevention System for CAN: Hindering Cyber-Attacks With a Low-Cost Platform,” in IEEE Access, vol. 9, pp. 166855-166869, 2021, doi: 10.1109/ACCESS.2021.3136147.

[13] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: An ensemble of autoencoders for online network intrusion detection,” arXiv preprint arXiv:1802.09089, 2018.

[14] J. Liu, M. Nogueira, J. Fernandes and B. Kantarci, “Adversarial Machine Learning: A Multi-Layer Review of the State-of-the-Art and Challenges for Wireless and Mobile Systems,” IEEE Commun. Surveys & Tut., doi: 10.1109/COMST.2021.3136132.


Other articles in this series

Empowering Intrusion Detection Systems with Machine Learning

Part 1 of 5: Signature vs. Anomaly-Based Intrusion Detection Systems

Part 2 of 5: Clustering-Based Unsupervised Intrusion Detection Systems

Part 3 of 5: One-Class Novelty Detection Intrusion Detection Systems

Part 4 of 5: Intrusion Detection using Autoencoders

Part 5 of 5: Intrusion Detection using Generative Adversarial Networks