Human Factors in Phishing Attacks: A Systematic Literature Review


Bibliometrics & citations.

  • Yasin A Fatima R Wen L JiangBin Z Niazi M (2025) What goes wrong during phishing education? A probe into a game-based assessment with unfavorable results Entertainment Computing 10.1016/j.entcom.2024.100815 52 (100815) Online publication date: Jan-2025 https://doi.org/10.1016/j.entcom.2024.100815
  • Albarrak A (2024) Integration of Cybersecurity, Usability, and Human-Computer Interaction for Securing Energy Management Systems Sustainability 10.3390/su16188144 16 :18 (8144) Online publication date: 18-Sep-2024 https://doi.org/10.3390/su16188144
  • Fan Z Li W Laskey K Chang K (2024) Investigation of Phishing Susceptibility with Explainable Artificial Intelligence Future Internet 10.3390/fi16010031 16 :1 (31) Online publication date: 17-Jan-2024 https://doi.org/10.3390/fi16010031

Index Terms

Human-centered computing

Human computer interaction (HCI)

Security and privacy

Human and societal aspects of security and privacy

Intrusion/anomaly detection and malware mitigation

Social engineering attacks


Information

Published in

ACM Computing Surveys

University of Sydney, Australia

Association for Computing Machinery

New York, NY, United States

Author Tags

  • human factors
  • cybersecurity

Funding Sources

  • Italian Ministry of University and Research (MUR)
  • PON projects LIFT, TALIsMAn, and SIMPLe
  • “Dipartimento di Eccellenza”
  • DATACLOUD, DESTINI, and FIRST
  • RoMA—Resilience of Metropolitan Areas


  • 36 Total Citations View Citations
  • 3,655 Total Downloads
  • Downloads (Last 12 months) 1,209
  • Downloads (Last 6 weeks) 117
  • Katsarakes E Edwards M Still J (2024) Where Do Users Look When Deciding If a Text Message is Safe or Malicious? Proceedings of the Human Factors and Ergonomics Society Annual Meeting 10.1177/10711813241264204 Online publication date: 12-Aug-2024 https://doi.org/10.1177/10711813241264204
  • Guo S Fan Y (2024) X-Phishing-Writer: A Framework for Cross-lingual Phishing E-mail Generation ACM Transactions on Asian and Low-Resource Language Information Processing 10.1145/3670402 23 :7 (1-34) Online publication date: 26-Jun-2024 https://dl.acm.org/doi/10.1145/3670402
  • Kanaoka A Isohara T (2024) Enhancing Smishing Detection in AR Environments: Cross-Device Solutions for Seamless Reality 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW) 10.1109/VRW62533.2024.00108 (565-572) Online publication date: 16-Mar-2024 https://doi.org/10.1109/VRW62533.2024.00108
  • Sarker O Jayatilaka A Haggag S Liu C Babar M (2024) A Multi-vocal Literature Review on challenges and critical success factors of phishing education, training and awareness Journal of Systems and Software 10.1016/j.jss.2023.111899 208 :C Online publication date: 4-Mar-2024 https://dl.acm.org/doi/10.1016/j.jss.2023.111899
  • Varshney G Kumawat R Varadharajan V Tupakula U Gupta C (2024) Anti-phishing Expert Systems with Applications: An International Journal 10.1016/j.eswa.2023.122199 238 :PF Online publication date: 27-Feb-2024 https://dl.acm.org/doi/10.1016/j.eswa.2023.122199
  • Baltuttis D Teubner T (2024) Effects of visual risk indicators on phishing detection behavior: An eye-tracking experiment Computers & Security 10.1016/j.cose.2024.103940 144 (103940) Online publication date: Sep-2024 https://doi.org/10.1016/j.cose.2024.103940
  • Marshall N Sturman D Auton J (2024) Exploring the evidence for email phishing training Computers and Security 10.1016/j.cose.2023.103695 139 :C Online publication date: 16-May-2024 https://dl.acm.org/doi/10.1016/j.cose.2023.103695


A comprehensive survey of AI-enabled phishing attacks detection techniques

  • Published: 23 October 2020
  • Volume 76, pages 139–154 (2021)


  • Abdul Basit 1 ,
  • Maham Zafar 1 ,
  • Xuan Liu   ORCID: orcid.org/0000-0002-7966-4488 2 ,
  • Abdul Rehman Javed 3 ,
  • Zunera Jalil 3 &
  • Kashif Kifayat 3  


In recent times, a phishing attack has become one of the most prominent attacks faced by internet users, governments, and service-providing organizations. In a phishing attack, the attacker(s) collects the client’s sensitive data (i.e., user account login details, credit/debit card numbers, etc.) by using spoofed emails or fake websites. Phishing websites are common entry points of online social engineering attacks, including numerous frauds on the websites. In such types of attacks, the attacker(s) create website pages by copying the behavior of legitimate websites and sends URL(s) to the targeted victims through spam messages, texts, or social networking. To provide a thorough understanding of phishing attack(s), this paper provides a literature review of Artificial Intelligence (AI) techniques: Machine Learning, Deep Learning, Hybrid Learning, and Scenario-based techniques for phishing attack detection. This paper also presents the comparison of different studies detecting the phishing attack for each AI technique and examines the qualities and shortcomings of these methodologies. Furthermore, this paper provides a comprehensive set of current challenges of phishing attacks and future research direction in this domain.


1 Introduction

The practice of protecting cyberspace from attacks is known as cyber security [16, 32, 37]. It covers protecting internet-connected resources from cyber-attacks, preventing such attacks, and recovering from them [20, 38, 47]. The complexity of the cybersecurity domain increases daily, which makes identifying, analyzing, and controlling the relevant risk events a significant challenge. Cyberattacks are malicious digital attempts to steal, damage, or intrude into personal or organizational confidential data [2]. A phishing attack uses fake websites to obtain sensitive client data, for example, account login credentials and credit card numbers. In 2018, the Anti-Phishing Working Group (APWG) reported more than 51,401 unique phishing websites. Another report, by RSA, estimated that organizations worldwide suffered losses of $9 billion due to phishing attacks in 2016 [26]. These statistics suggest that current anti-phishing techniques and efforts are not sufficiently effective. Figure 1 shows how a typical phishing attack takes place.

Fig. 1 Phishing attack diagram [26]

Fig. 2 Phishing report for the third quarter of 2019 [1]

Personal computer users fall victim to phishing attacks for five primary reasons [60]: (1) users lack basic knowledge of Uniform Resource Locators (URLs); (2) users do not know exactly which pages can be trusted; (3) users cannot see the full location of a page because of redirection or hidden URLs; (4) a URL can take many possible forms, or a page is entered accidentally; and (5) users cannot differentiate a phishing web page from a legitimate one.

Phishing websites are common entry points for online social engineering attacks, including numerous ongoing web scams [30]. In this type of attack, the attackers create web pages by copying genuine websites and send suspicious URLs to the targeted victims through spam messages, texts, or online social networks. An attacker distributes a fake version of an original website through email, phone, or text messages [5], in the expectation that the targeted victims will accept the claims made in the message. The aim is to get the victim to enter personal or highly sensitive data (e.g., bank details, government savings number, etc.). A successful phishing attack leaves the attacker with bank card information and login data. Several methods exist to combat phishing [27]. The expanded use of Artificial Intelligence (AI) has affected essentially every industry, including cyber-security. In email security, AI has brought speed, accuracy, and the capacity for detailed investigation. AI can detect spam, phishing, spear phishing, and other kinds of attacks by using previous knowledge in the form of datasets. These types of attacks are likely to erode clients' trust in social services such as web services. According to the APWG report, 1,220,523 phishing attacks were reported in 2016, a 65% increase over 2015 [1]. Figure 2 shows the phishing report for the third quarter of 2019.

According to Parekh et al. [51], a generic phishing attack has four stages. First, the phisher creates and sets up a fake website that looks like an authentic one. Second, the phisher sends a URL link to the website to a targeted victim while posing as a genuine organization or user. Third, the victim is tempted to visit the fake website. Fourth, the victim clicks on the fake link and enters his/her valuable data. The phisher then uses the victim's personal data to carry out impersonation activities. APWG contributes individual reports on phishing URLs and analyzes the continually evolving nature and techniques of cybercrime. The APWG tracks the number of unique phishing websites, a primary measure of phishing across the globe. Phishing sites are identified by their unique base URLs. The total number of phishing websites detected by APWG in the 3rd quarter of 2019 was 266,387 [3]. This was up 46% from the 182,465 seen in Q2 2019 and almost double the 138,328 seen in Q4 2018.
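The quarter-over-quarter comparison can be checked directly from the three APWG counts quoted above. The following is a minimal sketch; the function and variable names are illustrative and not taken from the APWG report.

```python
# APWG counts of unique phishing websites, as quoted in the text above.
q4_2018 = 138_328
q2_2019 = 182_465
q3_2019 = 266_387


def percent_change(old: int, new: int) -> float:
    """Relative change from `old` to `new`, expressed in percent."""
    return (new - old) / old * 100.0


# Growth from Q2 2019 to Q3 2019, and Q3 2019 relative to Q4 2018.
q2_to_q3 = percent_change(q2_2019, q3_2019)
ratio_vs_q4_2018 = q3_2019 / q4_2018

print(f"Q2 -> Q3 2019 change: {q2_to_q3:.1f}%")
print(f"Q3 2019 / Q4 2018 ratio: {ratio_vs_q4_2018:.2f}")
```

The change from Q2 to Q3 rounds to 46%, and the Q3 2019 count is a little under twice the Q4 2018 count, which matches the "almost double" reading of the reported figures.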

Figure 3 shows the most targeted industries in 2019. Attacks on cloud storage and file hosting websites and on financial institutions remained frequent, while attacks on the gaming, insurance, energy, government, and healthcare sectors were less prominent during the 3rd quarter [3].

MarkMonitor is an online brand protection company that secures intellectual property. In the 3rd quarter of 2019, the main targets of phishing remained Software as a Service (SaaS) and webmail websites. Phishers continue to harvest credentials for these kinds of websites and use them to carry out business email compromise (BEC) and to break into corporate SaaS accounts.

Fig. 3 Most targeted industry sectors, 3rd quarter 2019 [3]

Fig. 4 Taxonomy of this survey focusing on phishing attack detection studies

This survey covers four aspects of a phishing attack: communication media, target devices, attack techniques, and counter-measures, as shown in Fig. 4. Communication media are the channels through which a human interacts with the application targeted by the attack. Seven types of communication media are identified from the literature: Email, Messenger, Blog & Forum, Voice over Internet Protocol (VoIP), Website, Online Social Network (OSN), and Mobile platform. Devices play a significant role in the selection of attack strategies because victims interact online through physical devices. A phishing attack may target personal computers, smart devices, voice devices, and/or WiFi-smart devices, which include VoIP devices as well as mobile phones.

Attack techniques are grouped into two categories: attack launching and data collection. Techniques identified for attack launching include email spoofing, attachments, abusing social settings, URL spoofing, website spoofing, interactive voice response, collaboration in a social network, reverse social engineering, man-in-the-middle attacks, spear phishing, spoofed mobile internet browsers, and installed web content. Data collection takes place during and after the victim's interaction with the attack and uses various techniques [49]. These fall into two types: automated data collection (such as fake website forms, key loggers, and recorded messages) and manual data collection (such as human misdirection and social networking). Finally, counter-measures address the victim's data collected or used before and after the attack; they are used to detect and prevent attacks. We categorize counter-measures into four groups: (1) Deep learning-based techniques, (2) Machine learning techniques, (3) Scenario-based techniques, and (4) Hybrid techniques.

To the best of our knowledge, the existing literature [11, 18, 28, 40, 62] includes a limited number of surveys, which focus mainly on providing an overview of attack detection techniques. These surveys do not cover all deep learning, machine learning, hybrid, and scenario-based techniques in detail. They also lack an extensive discussion of current and future challenges for phishing attack detection.

Keeping in sight the above limitations, this article makes the following contributions:

Provide a comprehensive and easy-to-follow survey focusing on deep learning, machine learning, hybrid learning, and scenario-based techniques for phishing attack detection.

Provide an extensive discussion on various phishing attack techniques and comparison of results reported by various studies.

Provide an overview of current practices, challenges, and future research directions for phishing attack detection.

The study is divided into the following sections: Sect. 1 presents the introduction to phishing attacks. Section 2 presents the literature survey focusing on deep learning, machine learning, hybrid learning, and scenario-based phishing attack detection techniques and compares these techniques. Section 3 presents a discussion of the various approaches used in the literature. Section 4 presents the current and future challenges. Section 5 concludes the paper with recommendations for future research.

2 Literature survey

This paper reviews the literature available in prominent journals, conferences, and book chapters, drawing relevant articles from Springer, IEEE, Elsevier, Wiley, Taylor & Francis, and other well-known publishers. The review was formulated after an exhaustive search of the literature published in the last 10 years.

A phishing attack is one of the most serious threats to any organization. In this section, we present the work done on phishing attacks, along with their different types, in more depth. Initially, phishing attacks were performed on telephone networks, a practice known as phone phreaking, which is why the “f” in “fishing” was replaced with “ph” to give the term “phishing”. According to reports from the Anti-Phishing Working Group (APWG) [1], phishing was discovered in 1996, when America Online (AOL) accounts were attacked through social engineering. Phishing has become a danger to many people, especially those who are unaware of the risks of being online. According to a report by the Federal Bureau of Investigation (FBI) [4], phishing attacks caused damage of 2.3 billion dollars between October 2013 and February 2016. In general, users tend to overlook the URL of a website. Phishing scams delivered through phishing websites can often be prevented by checking whether a URL belongs to a phishing website or an authentic one. When a website is recognized as a suspected phish, the user can escape the criminal's trap.

Conventional approaches to phishing attack detection give low accuracy and can recognize only about 20% of phishing attacks. Machine learning approaches give good results for phishing detection but are time-consuming even on small datasets and do not scale. Heuristic techniques for phishing recognition give high false-positive rates. User awareness is a significant factor in defending against phishing attacks. Phishers use fake URLs to capture the targeted victim's confidential data, such as bank account data, personal data, usernames, and passwords.

Previous work on phishing attack detection has focused on one or more techniques to improve accuracy; however, accuracy can be further improved through feature reduction and the use of an ensemble model. Existing work on phishing attack detection can be placed in four categories:

Deep learning for phishing attack detection

Machine learning for phishing attack detection

Scenario-based phishing attack detection

Hybrid learning based Phishing attack detection

2.1 Deep learning (DL) for phishing attack detection

This section describes DL-based approaches to intrusion detection. Recent advances in DL suggest that classifying phishing websites with deep neural networks (NN) should outperform traditional Machine Learning (ML) algorithms. However, the results of deep NN depend heavily on the setting of different learning parameters [61]. Multiple DL approaches are used for cybersecurity intrusion detection [25], namely: (1) deep neural network, (2) feed-forward deep neural network, (3) recurrent neural network, (4) convolutional neural network, (5) restricted Boltzmann machine, (6) deep belief network, and (7) deep auto-encoder. Figure 5 shows the working of deep learning models. A batch of input data is fed to the neurons, which apply weights to it, and the network predicts whether each input is a phishing attack or legitimate traffic.

Fig. 5 Working of deep learning models
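The batch-wise prediction described above can be illustrated with a small feed-forward network. This is only a sketch under stated assumptions: the layer sizes, the random (untrained) weights, the four binary input features, and the 0.5 decision threshold are all hypothetical and are not taken from any of the surveyed studies.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) for a single sample."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights stand in for trained ones (assumption for illustration)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def predict_batch(batch, layers):
    """Forward pass: ReLU on hidden layers, sigmoid on the output layer."""
    scores = []
    for sample in batch:
        hidden = sample
        for i, (weights, biases) in enumerate(layers):
            is_output = i == len(layers) - 1
            hidden = dense(hidden, weights, biases, sigmoid if is_output else relu)
        scores.append(hidden[0])
    return scores


rng = random.Random(0)
layers = [init_layer(4, 3, rng), init_layer(3, 1, rng)]

# Hypothetical batch: each row holds four binary URL or page features.
batch = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
scores = predict_batch(batch, layers)
labels = ["phishing" if s >= 0.5 else "legitimate" for s in scores]
print(list(zip(scores, labels)))
```

Because the weights are random, the labels printed here carry no meaning; in practice the weights would be learned from labeled phishing and legitimate samples.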

Benavides et al. [15] synthesized and classified the works they selected. They characterized the DL algorithms used in each study and found that the most commonly used are the Deep Neural Network (DNN) and the Convolutional Neural Network (CNN). Diverse DL approaches have been presented and analyzed, but a research gap remains in the use of DL algorithms for recognizing cyber-attacks.

Shie [55] examined different techniques and discussed strategies for accurately recognizing phishing attacks. Of the strategies evaluated, DL techniques that use feature extraction perform well because they combine high accuracy with robustness. Classification models also perform well. Maurya and Jain [46] proposed an anti-phishing framework that uses a DL-based phishing identification model at the ISP level to provide security at a vertical scale rather than through horizontal deployment. This approach adds an intermediate security layer at the ISP, placed between the various servers and the end users. The efficiency of this framework lies in the fact that a single point of blocking can protect a large number of users from a specific phishing attack. The computational overhead of the phishing detection model is confined to the ISP, and end users receive secure service regardless of their system configuration, without needing high-performance machines.

Subasi et al. [57] compared AdaBoost and multi-boosting for detecting phishing websites. They used the UCI machine learning repository dataset, which has 11,055 instances and 30 features. AdaBoost and MultiBoost are the ensemble learners proposed in this research to improve the performance of phishing detection algorithms. Ensemble models improve classifier performance in terms of precision, F-measure, and ROC area. Experimental results show that ensemble models can recognize phishing pages with a precision of 97.61%. Abdelhamid et al. [9] proposed a comparison based on model content and features. They used a dataset from PhishTank containing around 11,000 examples. Their approach, named enhanced dynamic rule induction (eDRI), is claimed to be the first machine learning and DL algorithm applied to an anti-phishing tool. The algorithm processes datasets using two main thresholds, frequency and rule strength. The training dataset stores only “strong” features, which become part of the rules, while the others are removed.
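To make the boosting idea concrete, the following is a minimal sketch of discrete AdaBoost with one-feature decision stumps. It is not the setup used by Subasi et al. [57]: the three binary features, the toy labeling rule (a page is phishing when at least two features are set), and the number of rounds are assumptions chosen only to keep the example small.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Predict +1/-1 from a single feature compared against a threshold."""
    return polarity if row[feature] >= threshold else -polarity


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(
                    wi
                    for wi, row, yi in zip(w, X, y)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds=3):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, polarity))
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in model)
    return 1 if score >= 0 else -1


# Hypothetical binary features: [host_is_ip, long_url, has_at_symbol].
# Toy labels: +1 (phishing) when at least two features are set, else -1.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [1 if sum(row) >= 2 else -1 for row in X]

model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
print(predictions == y)
```

No single stump can express the "two of three" rule, but the weighted vote of three stumps can, which is the sense in which an ensemble can outperform its individual base learners.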

Mao et al. [44] proposed a learning-based system that uses page layout similarity to distinguish phishing pages. They defined rules for effective page layout features and built a phishing page classifier with two conventional learning algorithms, SVM and DT. They tested the approach on real web page samples from phishtank.com and alexa.com. Jain and Gupta [34] performed experiments on five datasets: the first from Phishtank, containing 1528 phishing websites; the second from Openphish, containing 613 phishing websites; the third from Alexa, containing 1600 legitimate websites; the fourth from payment gateways, containing 66 legitimate websites; and the fifth from top banking websites, containing 252 legitimate websites. By applying machine learning algorithms, they improved accuracy for phishing detection. They used RF, SVM, Neural Networks (NN), LR, and NB, with a feature extraction approach on the client side.

Li et al. [42] proposed a novel approach in which a URL is given as input and both URL-related and HTML-related features are extracted. After feature extraction, a stacking model combines the classifiers. They performed experiments on two datasets. The first was obtained from Phishtank and has 2000 web pages (1000 legitimate and 1000 phishing). The second is larger, with 49,947 web pages (30,873 legitimate and 19,074 phishing), and was taken from Alexa. They used a support vector machine, NN, DT, and RF, and combined these through stacking to achieve better accuracy.

Some studies are limited to a few classifiers, and others used many classifiers but with techniques that were not efficient or accurate. Two datasets have been commonly used by researchers in the past; both are publicly accessible, from Phishtank and the UCI machine learning repository. ML techniques have been used without feature reduction, and some studies compared results across only a few classifiers.

2.2 Machine learning (ML) for phishing attack detection

ML approaches are popular for phishing website detection because they turn it into a simple classification problem. To train a machine learning model for a learning-based detection system, the data at hand must have features that relate to the phishing and legitimate website classes. Different classifiers are used to detect a phishing attack. Previous studies show that detection accuracy is high when robust ML techniques are used. Several feature selection techniques are used to reduce the number of features. Figure 6 shows the working of the machine learning model. A batch of input data is given to the machine learning model for training, and the model then predicts whether an input is a phishing attack or legitimate traffic.

Fig. 6 Working of the machine learning model

Reducing features makes dataset visualization more efficient and understandable. The classifiers that were used in various studies and found to give good phishing attack detection accuracy are C4.5, k-NN, and SVM. Of these, C4.5 is based on decision trees (DTs) and is reported to give the highest accuracy and efficiency in detecting phishing attacks. Researchers have also stated the limitations of their work. Many highlighted a common limitation: ensemble learning techniques were not used, and in some studies feature reduction was not done. James et al. [36] used different classifiers such as C4.5, IBK, NB, and SVM. Similarly, Liew et al. [43] used RF to distinguish phishing attacks from original web pages. Adebowale et al. [10] used a robust scheme based on the Adaptive Neuro-Fuzzy Inference System, with integrated features, for phishing attack detection and protection.

Zamir et al. [65] examined supervised learning and stacking models for recognizing phishing websites. The rationale behind these experiments was to improve classification accuracy through the proposed features, PCA, and the stacking of the most efficient classifiers. Stacking (RF, NN, bagging) outperformed the other classifiers with the proposed features N1 and N2. The experiments were performed on a phishing websites dataset containing 32 pre-processed features for 11,055 websites. Alsariera et al. [13] used four meta-learner models built on the extra-tree base classifier: AdaBoost-Extra Tree (ABET), Bagging-Extra Tree (BET), Rotation Forest-Extra Tree (RoFBET), and LogitBoost-Extra Tree (LBET). The proposed meta-algorithms were fitted to phishing website datasets, and their performance was tested. The proposed models outperformed existing ML-based models in phishing attack recognition, so the authors recommend adopting meta-algorithms when building phishing attack identification models.

El Aassal et al. [22] proposed a benchmarking framework called PhishBench, which makes it possible to assess and compare existing features for phishing detection under identical test conditions, i.e., a unified system specification, datasets, classifiers, and performance metrics. The experiments indicated that classification performance dropped as the ratio of phishing to legitimate samples decreased towards 1 to 10. The drop in performance ranged from 5.9 to 42% in F1-score. PhishBench was also used to test earlier techniques on new and diverse datasets.
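One reason F1-score falls as legitimate samples come to outnumber phishing samples can be seen with a short worked calculation. The sketch below assumes a hypothetical classifier with a fixed true-positive rate of 0.95 and false-positive rate of 0.05; these rates and the sample counts are illustrative and do not come from the PhishBench experiments.

```python
def f1_at_ratio(tpr: float, fpr: float, n_phish: int, n_legit: int) -> float:
    """F1-score on the phishing class for fixed per-class error rates."""
    tp = tpr * n_phish       # phishing samples correctly flagged
    fn = n_phish - tp        # phishing samples missed
    fp = fpr * n_legit       # legitimate samples wrongly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Same classifier behaviour, two class ratios (phishing : legitimate).
balanced = f1_at_ratio(0.95, 0.05, n_phish=1000, n_legit=1000)     # 1 : 1
imbalanced = f1_at_ratio(0.95, 0.05, n_phish=1000, n_legit=10000)  # 1 : 10

print(f"F1 at 1:1  = {balanced:.3f}")
print(f"F1 at 1:10 = {imbalanced:.3f}")
```

With these assumed rates, recall stays at 0.95 in both cases, but at 1:10 the false positives grow tenfold, so precision and therefore F1 fall even though the classifier itself is unchanged.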

Subasi and Kremic [56] proposed an intelligent phishing website identification system. They used different ML models to classify websites as genuine or phishing, applying several classification methods to build an accurate and smart phishing website detection framework. ROC area, F-measure, and AUC were used to assess the performance of the ML techniques. Results showed that AdaBoost with SVM performed best among all classification techniques, achieving the highest accuracy of 97.61%. Ali and Malebary [12] proposed a phishing website detection technique that uses Particle Swarm Optimization (PSO)-based feature weighting to improve the detection of phishing websites. Their approach uses PSO to weight different website features, achieving higher accuracy when distinguishing phishing websites. In particular, the PSO-based weighting separates website features according to how significantly each contributes to distinguishing phishing websites from real ones. Results showed that the ML models improved with the proposed PSO-based feature weighting and could effectively distinguish between phishing and real websites.

James et al. [36] used datasets from Alexa and Phishtank. Their approach reads URLs one by one and analyzes the host name, URL, and path to classify each as an attack or a legitimate activity using four classifiers: NB, DT, KNN, and Support Vector Machine (SVM). Subasi et al. [57] used Artificial Neural Network (ANN), KNN, SVM, RF, Rotation Forest, and C4.5. They discussed in detail how accurately these classifiers detect a phishing attack. They claim that the accuracy of RF is not more than 97.26%; all other classifiers obtained the accuracy given in the study. Hutchinson et al. [31] studied phishing website detection with a focus on feature selection. They used the UCI machine learning repository dataset, which contains 11,055 URLs and 30 features, and divided these features into six groups. They selected three groups and concluded that these groups are suitable options for accurate phishing attack detection.
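Analyzing the host name, URL, and path, as described above, usually starts by turning each URL into a vector of lexical features. The following sketch uses Python's standard `urllib.parse`; the particular features and the example URLs are illustrative assumptions, not the feature set of James et al. [36] or of the UCI dataset.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.10.5 (no range validation).
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Extract simple lexical features from the URL, its host name, and its path."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(path),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": bool(IPV4_HOST.match(host)),
        "uses_https": parsed.scheme == "https",
        "path_depth": len([segment for segment in path.split("/") if segment]),
    }


# Hypothetical URLs used only to exercise the function.
suspicious = url_features("http://192.168.10.5/secure-login/update/account.php")
ordinary = url_features("https://www.example.com/")

print(suspicious)
print(ordinary)
```

A feature dictionary like this could then be passed to any of the classifiers named in the text (NB, DT, KNN, SVM); which features actually help is an empirical question that the surveyed studies address with feature selection.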

Abdelhamid et al. [9] created a method called Enhanced Dynamic Rule Induction (eDRI) to detect phishing attacks. They used feature extraction, the Remove Replace Feature Selection Technique (RRFST), and ANOVA to reduce features. The results show an accuracy of 93.5%, the highest in comparison with other studies. The research in [29] proposed the feature selection technique RRFST. The authors report obtaining a phishing email dataset containing 47 features from khoonji's anti-phishing website. A DT was used to predict the performance measures.

Tyagi et al. [58] used a dataset from the UCI machine learning repository that contains 2456 unique URL instances and 11,055 URLs in total, of which 6157 are phishing websites and 4898 are legitimate websites. They extracted 30 URL features and used them to predict phishing attacks. There were two possible outcomes: either the user is notified that the website is a phishing site, or the user is informed that the website is safe. They used ML algorithms such as DT, RF, Gradient Boosting (GBM), Generalized Linear Model (GLM), and PCA. Chen and Chen [17] used the SMOTE method, which improves the detection coverage of the model. They trained machine learning models including bagging, RF, and XGBoost, and achieved the highest accuracy with XGBoost. They used a Phishtank dataset with 24,471 phishing websites and 3850 legitimate websites.
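SMOTE balances a dataset by creating synthetic minority-class samples that lie between existing minority samples and their nearest minority neighbours. The sketch below shows that interpolation step only; the value of `k`, the random seed, and the two-dimensional toy points are assumptions for illustration and do not reflect the configuration used by Chen and Chen [17].

```python
import random


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic samples by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (needs at least two points).
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25]]
new_samples = smote(minority, n_synthetic=6)

print(len(minority) + len(new_samples), "minority samples after oversampling")
```

Each synthetic point is a convex combination of two real minority points, so it stays inside the region the minority class already occupies. In the Phishtank data described above, the legitimate class (3850 samples) would be the minority.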

Authors in Joshi et al. [ 39 ] used an RF algorithm as a binary classifier and the ReliefF algorithm for feature selection. They used a dataset from the Mendeley website, which is given as input to the feature selection algorithm to select efficient features. Next, they trained an RF algorithm on the selected features to predict the phishing attack. Authors in Ubing et al. [ 59 ] based their work on ensemble learning, using three techniques: bagging, boosting, and stacking. Their dataset contains 30 features with a result column of 5126 records. The dataset is taken from UCI and is publicly accessible. They combined their classifiers to acquire the maximum accuracy, which they got from a DT. Authors in Mao et al. [ 45 ] used different machine learning classifiers, including SVM, DT, AdaBoost, and RF, to predict the phishing attack. Authors in Sahingoz et al. [ 54 ] created their own dataset. The dataset contains 73,575 URLs, of which 36,400 are legitimate URLs and 37,175 are phishing URLs. Because, as they mention, Phishtank does not offer a free dataset on its web page, they created their own. They used seven classification algorithms and natural language processing (NLP) based features for phishing attack detection.

Table 1 presents the summary of ML approaches for phishing website detection. The table shows that some studies provide highly efficient results for phishing attack detection.

2.3 Scenario-based phishing attack detection

In this section, we provide a comparison of scenario-based phishing attack detection used by various researchers. The comparison of scenario-based techniques to detect a phishing attack is shown in Table 2. Studies show that different scenarios worked with various methods and provide different outcomes.

Authors in Begum and Badugu [ 14 ] discussed some approaches that are useful for detecting a phishing attack. They performed a detailed survey of existing techniques such as Machine Learning (ML) based approaches, non-machine-learning-based approaches, neural-network-based approaches, and behavior-based detection approaches for phishing attack detection. Authors in Yasin et al. [ 64 ] consolidated various studies that researchers have used to explain the different activities of social engineers. Moreover, they proposed that a better understanding of social engineering attack scenarios would be possible using thematic and game-based analysis techniques. The proposed method for interpreting social engineering attack scenarios is one such attempt to enable people to understand general attack scenarios. Even though the initial results were neutral, the theoretically consistent framework of this method still merits future extension and re-evaluation.

Authors in Fatima et al. [ 23 ] presented PhishI as a systematic approach to designing serious games for security education. They define a game design framework that incorporates the body of knowledge on social networking and that needs authoritative players. They used spear phishing as an example to show how the proposed approach works, and then evaluated the learning effects of the resulting game based on empirical data collected from students' activity. In the PhishI game, participants are required to exchange phishing emails and are able to comment on the effectiveness of the attack scenario. Results demonstrated that students' awareness of spear-phishing risks is improved and that resistance to the first potential attack is enhanced. Moreover, the game had a positive effect on participants' understanding of excessive online data and information disclosure.

Authors in Chiew et al. [ 18 ] studied phishing attacks in detail through the features of the medium and vector in which they live and through their technical approaches. Besides, they believe this information will help the general public take precautionary and preventive actions against these phishing attacks and will support the implementation of policies to check any further misuse by the phishers. Relying only on user education as a preventive measure against phishing attacks is not sufficient. Their survey shows that the development of intelligent systems to counter these technical approaches is required, as such countermeasures will be able to detect and disable both existing attacks and new phishing threats.

Authors in Yao et al. [ 63 ] used a logo extraction method together with an identity detection process to detect phishing. Two non-overlapping datasets were made from a total of 726 pages. Phishing pages are from the PhishTank website, and the legitimate website pages are from the Alexa website. They limited their work by not using the DL technique. Authors in Curtis et al. [ 21 ] introduced the concept of dark triad attackers. Phishing effort and performance, and end-users' categorization of emails, form the theoretical approach of the dark triad method. A limitation of their work is that end-user participants may have been hyper-aware of potential deception and therefore more cautious in their ratings of each email than they would be in their normal workplace. Authors in Williams et al. [ 62 ] used a mixed approach to detect a phishing attack. They used ensemble learning to investigate 62,000 instances over a six-week time frame to detect targeted phishing messages, called spear phishing. A drawback is that they took information from only two organizations, so employee perceptions and experiences are likely to be affected by a range of factors that might be specific to the organizations considered.

Authors in Parsons et al. [ 52 ] used the ANOVA method. In a scenario-based phishing study, a total of 985 participants completed a role-play task. A two-way repeated-measures analysis of variance (ANOVA) was conducted to assess the effect of email legitimacy, and that effect was the focus of the study. The study included only one phishing email and one genuine email with one of the principles and did not test the effect of multiple principles within an email. What follows is a comparison of studies that use one specific classifier, RF, which is the algorithm most used by the researchers.

Table 3 provides a comparison of RF classifiers with different datasets and different approaches. Some studies reduced features without creating a lot of impact on accuracy and the remaining studies focused on accuracy. Authors in Subasi et al. [ 57 ] used different classifiers to detect phishing attacks and they achieved an accuracy of 97.36% by RF algorithm.

Authors in Tyagi et al. [ 58 ] used 30 features to detect the attack with RF. They used other classifiers as well, but their result with RF was better than with the other classifiers. Similarly, authors in Mao et al. [ 45 ] collected a dataset of 49 phishing websites from PhishTank.com. They used four learning classifiers to detect phishing attacks and concluded that RF classifiers are much better than the others. Authors in Jagadeesan et al. [ 33 ] used two datasets. The first, from the UCI Machine Learning Repository, has 30 features and one target class and contains 2456 instances of phishing and non-phishing URLs. The second dataset comprises 1353 URLs with 10 features, grouped into three classes: phishing, non-phishing, and suspicious. They concluded that RF provides better accuracy than the support vector machine. Authors in Joshi et al. [ 39 ] used a dataset from the Mendeley website, which is publicly accessible. The dataset contains 5000 legitimate and 5000 phishing records. Authors in Sahingoz et al. [ 54 ] used the Ebbu2017 Phishing Dataset containing 73,575 URLs, of which 36,400 are legitimate URLs and 37,175 are phishing URLs. They proposed seven different classification algorithms including Natural Language Processing (NLP) based features. The dataset they used is not commonly used for detecting phishing attacks.

Authors in Williams et al. [ 62 ] conducted two studies considering different aspects of emails. The email that is received, the person who received that email, and the context of the email were the theoretical aspects studied in this paper. They believe that the current study will pave the way for theoretical development in this field. They considered 62,000 employees over 6 weeks and observed the individuals and targeted phishing emails, known as spear phishing. Authors in Parsons et al. [ 52 ] worked with 985 participants who completed a role in a scenario-based phishing study. They used a two-way repeated-measures analysis of variance (ANOVA) to assess the effect of email legitimacy and email influence. The email used in their research indicates that the recipient has previously donated to some charity.

Authors in Yao et al. [ 63 ] proposed a methodology that mainly includes two processes: logo extraction and identity detection. In the proposed methodology, logo extraction extracts the logo from the image from the two-dimensional code after performing image processing. Next, the identity detection process assesses the relationship between the actual identity of the website and its described identity. If the two identities match, the website is legitimate; if they do not, it is a phishing website. They created two non-overlapping datasets from 726 web pages. The datasets contain phishing web pages and legitimate web pages. The legitimate pages are taken from Alexa, whereas the phishing pages are taken from Phishtank. They believe that logo extraction can be improved in the future. Authors in Curtis et al. [ 21 ] introduced the concept of dark triad attackers. They used a dark triad score to complete the 27-item Short Dark Triad with both attackers. The end-users were asked to participate in the scenario to assign scores based on psychopathy, narcissism, and Machiavellianism.
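The identity detection step described above for Yao et al. [ 63 ] can be pictured as a consistency check between the brand a page claims to be and the domain it is actually served from. The sketch below assumes the logo has already been recognized and mapped to a brand name; the brand table, the function name, and the domains are hypothetical, and the image processing itself is not shown.

```python
from urllib.parse import urlsplit

# Hypothetical table: brand recognised from the logo -> domains that brand really uses.
KNOWN_BRAND_DOMAINS = {
    "examplebank": {"examplebank.com"},
}

def identity_is_consistent(page_url: str, logo_brand: str) -> bool:
    """Return True if the page's host belongs to the brand shown in its logo."""
    host = (urlsplit(page_url).hostname or "").lower()
    allowed = KNOWN_BRAND_DOMAINS.get(logo_brand.lower(), set())
    # Accept the registered domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in allowed)

genuine = identity_is_consistent("https://www.examplebank.com/login", "ExampleBank")
spoofed = identity_is_consistent("http://examplebank.com.evil.example/login", "ExampleBank")
```

A mismatch between the described identity (the logo) and the actual identity (the host) is what flags the page as a phishing candidate.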

2.4 Hybrid learning (HL) based phishing attack detection

In this section, we present the comparison of HL models which are used by state-of-the-art studies, as shown in Tables 4 and 5. The studies show how accuracies were improved by ensemble and HL techniques.

Authors in Kumar et al. [ 41 ] separated some irrelevant features from the content and pictures and applied SVM as a binary classifier. They classify real and phished messages with techniques like text parsing, word tokenization, and stop word removal. The authors in Jain et al. [ 35 ] used TF-IDF to find the most significant features of the website to be used in the search query, with adjustments to improve performance. The proposed approach was found to be more accurate than existing techniques that use the traditional TF-IDF approach.
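The preprocessing steps named above (tokenization and stop word removal) and the traditional TF-IDF weighting can be sketched in a few lines. This is the textbook form of TF-IDF, not the adjusted variant of Jain et al. [ 35 ]; the stop word list and the sample messages are assumptions for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "to", "your", "is", "and", "of", "for"}

def tokenize(text):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                       for t, c in counts.items()})
    return scores

emails = [
    "Verify your account to avoid suspension",
    "Meeting agenda for the project account review",
    "Your invoice for the project is attached",
]
weights = tf_idf(emails)
```

Terms that occur in few messages, such as "verify" here, receive a higher weight than terms shared across messages, such as "account", which is why the top-weighted terms are useful as a search query or as classifier features.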

Authors in Adebowale et al. [ 10 ] proposed a hybrid approach comprising Search and Heuristic Rule and Logistic Regression (SHLR) for efficient phishing attack detection. The authors proposed a three-step approach: (1) a web page shown in the results of a search query is most likely legal if its domain matches the domain name of the websites retrieved in the results for the query, (2) heuristic rules defined by character features, and (3) an ML model to predict whether the web page is a legal web page or a phishing attack. Authors in Patil et al. [ 53 ] used LR, DT, and RF techniques to detect a phishing attack, and they believe that RF is a much-improved way to detect the attack. The drawback of this system is that it produces a small number of false-positive and false-negative results. Authors in Niranjan et al. [ 48 ] used the UCI dataset on phishing containing 6157 legitimate and 4898 phishing instances out of a total of 11,055 instances. The EKRV model was used, which involves a combination of KNN and random committee techniques. Authors in Chiew et al. [ 19 ] used two datasets: 5000 phishing web pages based on URLs from PhishTank and OpenPhish, and another 5000 legitimate web pages based on URLs from Alexa and the Common Crawl archive. They used a Hybrid Ensemble Strategy. Authors in Pandey et al. [ 50 ] used the Website Phishing dataset, available online in a repository of the University of California. This dataset has 10 features and 1353 instances. They trained an RF-SVM hybrid model that achieved an accuracy of 94%.
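The three-step SHLR idea described above can be read as a cascade in which cheaper checks run first. The sketch below is one possible reading under stated assumptions: the order in which the stages short-circuit, the concrete heuristic rules, the 0.5 threshold, and the `model_score` callable (a stand-in for a trained logistic regression model) are all hypothetical and are not taken from the cited study.

```python
from urllib.parse import urlsplit

def domain(url):
    """Host name of a URL in lower case (empty string if missing)."""
    return (urlsplit(url).hostname or "").lower()

def heuristic_flags(url):
    """Illustrative character-based rules; the actual rules are not given here."""
    host = domain(url)
    return "@" in url or host.count(".") > 3 or host.replace(".", "").isdigit()

def classify(url, search_result_urls, model_score, threshold=0.5):
    # Stage 1: the page's domain appears among the search results -> legitimate.
    if domain(url) in {domain(u) for u in search_result_urls}:
        return "legitimate"
    # Stage 2: a heuristic rule fires -> phishing.
    if heuristic_flags(url):
        return "phishing"
    # Stage 3: fall back to the model's estimated phishing probability.
    return "phishing" if model_score(url) >= threshold else "legitimate"

by_search = classify("https://example.com/login", ["https://example.com/"], lambda u: 0.9)
by_rule = classify("http://10.0.0.7/login", [], lambda u: 0.1)
by_model = classify("http://shop.example.net/", [], lambda u: 0.2)
```

In this arrangement the model is consulted only for pages that neither the search step nor the rules can decide.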

Authors in Niranjan et al. [ 48 ] proposed an ensemble technique through the voting and stacking method. They selected the UCI ML phishing dataset and took only 23 of its 30 features for further attack detection. Out of a total of 11,055 instances, the dataset has 6157 legitimate and 4898 phishing instances. They used the EKRV model to predict the phishing attack. Authors in Patil et al. [ 53 ] proposed a hybrid solution that uses three approaches: blacklist and whitelist, heuristics, and visual similarity. The proposed methodology monitors all traffic on the end-user system and compares each URL with the white list of trusted domains. The website is analyzed for various details that serve as features. The three outcomes are suspicious websites, phishing websites, and legitimate websites. The ML classifier is used to collect data and to generate a score. If the score is greater than the threshold, the URL is marked as a phishing attack and immediately blocked. They used LR, DT, and RF to predict the accuracy of their test websites.
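The voting idea behind ensembles such as the one mentioned above can be illustrated with a plain majority vote over several base classifiers. The three rule-based classifiers below are hypothetical stand-ins for trained models (they are not KNN or random committee), and the feature names and cut-offs are assumptions for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical base classifiers: each maps a feature dict to a label.
def rule_ip(x):
    return "phishing" if x["has_ip_host"] else "legitimate"

def rule_length(x):
    return "phishing" if x["url_length"] > 75 else "legitimate"

def rule_https(x):
    return "legitimate" if x["uses_https"] else "phishing"

def ensemble_predict(x, classifiers=(rule_ip, rule_length, rule_https)):
    """Collect one vote per classifier and return the majority label."""
    return majority_vote([clf(x) for clf in classifiers])

long_http = ensemble_predict({"has_ip_host": False, "url_length": 90, "uses_https": False})
short_http = ensemble_predict({"has_ip_host": False, "url_length": 40, "uses_https": False})
```

Using an odd number of voters for a two-class problem avoids ties; stacking differs in that a further model is trained on the base predictions instead of counting them.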

Authors in Jagadeesan et al. [ 33 ] utilized RF and SVM to detect phishing attacks. They used two datasets. The first is from the UCI machine learning repository and has 30 features. This dataset consists of 2456 entries of phishing and non-phishing URLs. The second dataset consists of 1353 URLs with 10 features and three categories: phishing, non-phishing, and suspicious. Authors in Pandey et al. [ 50 ] used a dataset from a repository of the University of California. The dataset has 10 features and 1353 instances. They trained a hybrid model comprising RF and SVM, which they used to predict the accuracy.

3 Discussion

Phishing is a deceitful attempt to obtain sensitive data, for example, usernames and passwords, using social networking approaches in an endeavor to deceive website users and get their sensitive credentials [ 24 ]. Phishers prey on human emotion and the urge to follow instructions in a flow. Phishing is so omnipresent in the internet world that it has become a constant threat. The biggest challenge in phishing is that attackers are continuously devising new approaches to deceive clients so that they fall prey to phishing traps.

A comparative study of previous works using different approaches is discussed in detail in the sections above. Machine learning based approaches, deep learning based approaches, scenario-based approaches, and hybrid techniques have been deployed in the past to tackle this problem. A detailed comparative analysis revealed that machine learning methods are the most frequently used and most effective methods for detecting a phishing attack. Different classification methods such as SVM, RF, ANN, C4.5, k-NN, and DT have been used. Techniques with feature reduction give better performance. Classification through ELM, SVM, LR, C4.5, LC-ELM, kNN, and XGB, combined with feature selection using ANOVA, detected phishing attacks with 99.2% accuracy, which is the highest among all methods proposed so far but comes with trade-offs in terms of computational cost.

The RF method gives the best performance, with the highest accuracy among the classification methods on different datasets. Several studies showed that more than 95% attack detection accuracy can be achieved using an RF classification method. The UCI machine learning dataset is the most common dataset that researchers have used for phishing attack detection in the past.

In various studies, the researchers also created a scenario-based environment to detect phishing attacks, but these solutions are only applicable to a particular environment. Individual users in each organization exhibit different behaviors, and individuals in the organization are sometimes aware of the scenarios. The hybrid learning approach is another way to detect phishing attacks, as it occasionally gave better accuracy than RF. Researchers are of the view that some ensemble models can further improve performance.

Nowadays, defense against phishing attacks is considered a hard job by system security experts. A feasible detection system should identify phishing attacks with low false positives. The defense approaches discussed so far are based on machine learning and deep learning algorithms. Besides having high computational costs, these methods have high false-positive rates; however, they are better at distinguishing phishing attacks. The machine learning techniques provide the best results when compared with other approaches. The most effective defense against phishing attacks is an educated and well-aware employee. But still, people are people, with their built-in curiosity. They have a thirst to explore and know more. To mitigate the risk of falling victim to phishing tricks, organizations should try to keep employees away from their inherent core processes and help them develop a mindset of abstaining from clicking suspicious links and webpages.
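Since the discussion above compares methods by accuracy and false-positive rate, it may help to recall how these quantities follow from a confusion matrix. The counts below are made-up numbers chosen for easy arithmetic, not results from any of the surveyed studies.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, true-positive rate, and false-positive rate.

    tp: phishing correctly flagged      fp: legitimate wrongly flagged
    tn: legitimate correctly passed     fn: phishing missed
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical evaluation on 100 phishing and 100 legitimate pages.
metrics = detection_metrics(tp=95, fp=10, tn=90, fn=5)
```

With these counts the detector has an accuracy of 0.925 and catches 95% of phishing pages, yet it still blocks 10% of legitimate pages, which is why accuracy alone does not show whether a system has low false positives.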

4 Current practices and future challenges

A phishing attack is still considered an attractive form of attack for luring a novice internet user into passing his/her private confidential data to the attackers. There are different measures available, yet whenever a solution is proposed to overcome these attacks, attackers study the vulnerabilities of that solution to continue with their attacks. Several solutions to control phishing attacks have been proposed in the past. A recent increase in the number of phishing attacks linked to COVID-19, performed between March 1 and March 23, 2020, and attacks on online collaboration tools (Zoom, Microsoft Teams, etc.) have led researchers to pay more attention to this research domain. Most work, be it at the government or corporate level, educational activities, businesses, as well as non-commercial activities, has switched online from the traditional on-premises approach. More users are relying on the web to perform their routine work. This has increased the importance of having a comprehensive phishing attack detection solution with better accuracy and better response time [ 6 , 7 , 8 ].

The conventional approaches for phishing attack detection are not accurate and can recognize only about 20% of phishing attacks. ML approaches give better results but with a scalability trade-off, and they are time-consuming even on small datasets. Phishing detection by heuristic techniques gives high false-positive rates. User cautiousness is a key requirement for preventing phishing attacks. Besides educating the client regarding safe browsing, some changes can be made in user interfaces, such as giving dynamic warnings and automatically identifying malicious emails. Because classified resources are accessible to IoT gadgets while the security architectures and features of these gadgets are not yet mature, they are an exceptionally obvious target for attackers.

Phishing is a door for all kinds of malware and ransomware. Malware attacks on organizations use ransomware, and ransomware operators demand heavy ransoms in exchange for not disclosing stolen data, which is a recent trend in 2020. Phishing scams in 2020 deliberately impersonate COVID-19 and healthcare-related organizations and individuals, exploiting unprepared users. It is better to safeguard the doors at our end and be proactive in defense than to rely on reactive strategies once a phishing attack has happened.

Fake phishing websites appear to be original, and they are hard to identify because attackers imitate the appearance and functionality of real websites. Prevention is better than cure, so there is a need for anti-phishing frameworks or plug-ins for web browsers. These plug-ins or frameworks may perform content filtering and identify as well as block suspected phishing websites from proceeding further. An automated reporting feature can be added that reports phishing attacks from the user's end to the affected organization, such as a bank or government organization. The time lost on remediation after a phishing attack can have a damaging impact on the productivity and profitability of businesses. In the current scenario, organizations need to provide their employees with awareness and feasible solutions to detect and report phishing attacks proactively and promptly, before they cause any harm.

In the future, an all-inclusive phishing attack detection solution can be designed to identify, report, and block malicious websites without the user's involvement. If a website is asking for login credentials or sensitive information, a framework or smart web plug-in solution should be responsible for ensuring that the website is legitimate and for informing the owner (organization, business, etc.) beforehand. Web page health checking during user browsing has become a need of the time, and a scalable as well as robust solution is needed.

5 Conclusion

This survey enables researchers to comprehend the various methods, challenges, and trends for phishing attack detection. Nowadays, prevention of phishing attacks is considered a tough job in the system security domain. An efficient detection system ought to be able to identify phishing attacks with low false positives. The protection strategies discussed in this paper are data mining and heuristics, ML, and deep learning algorithms. Besides having high computational costs, heuristic and data mining methods have high FP rates; however, they are better at distinguishing phishing attacks. The ML techniques give the best results when compared with other strategies. Some of the ML techniques can achieve a TP rate of up to 99%. Malicious URLs are created every other day, and attackers use techniques to fool users and modify the URLs to attack. Nowadays, deep learning and machine learning methods are used to detect a phishing attack. Classification methods such as RF, SVM, C4.5, DT, PCA, and k-NN are also common. These methods are the most useful and effective for detecting a phishing attack. Future research can target a more scalable and robust method, including smart plug-in solutions that tag/label whether a website is legitimate or leading towards a phishing attack.

Abbreviations

Support vector machine

Random forest

Instant base learner

Artificial neural network

Rotation forest

Decision forest

Enhanced dynamic rule induction

Linear regression

Classification and regression tree

Extreme gradient boost

Gradient boosting decision tree

Neural-networks

Gradient boosting machine

Generalized linear model

Naive Bayes

K-nearest neighbor

Combination extreme learning machine

Extreme learning machine

Random committee

Principal component analysis

(2016). Apwg trend report. http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf . Accessed from 20 July 2020

(2018) Phishing activity trends report. http://docs.apwg.org/reports/apwg_trends_report_q2_2018.pdf . Accessed from 20 July 2020

(2019) Apwg trend report. https://docs.apwg.org/reports/apwg_trends_report_q3_2019.pdf . Accessed from 20 July 2020

(2019) Fbi warns of dramatic increase in business e-mail compromise (bec) schemes—fbi. https://www.fbi.gov/contact-us/field-offices/memphis/news/press-releases/fbi-warns-of-dramatic-increase-in-business-e-mail-compromise-bec-schemes . Accessed from 20 July 2020

(2019) What is phishing? https://www.phishing.org/what-is-phishing . Accessed from 20 July 2020

(2020) Coronavirus-related spear phishing attacks see 667% increase. https://www.securitymagazine.com/articles/92157-coronavirus-related-spear-phishing-attacks-see-667-increase-in-march-2020 . Accessed from 20 July 2020

(2020) Cost of black market phishing kits soars 149% in 2019. https://www.infosecurity-magazine.com/news/black-phishing-kits/ . Accessed from 20 July 2020

(2020) Recent phishing attacks. https://www.infosec.gov.hk/english/anti/recent.html . Accessed from 20 July 2020

Abdelhamid, N., Thabtah, F., Abdel-jaber, H. (2017). Phishing detection: A recent intelligent machine learning comparison based on models content and features. In 2017 IEEE international conference on intelligence and security informatics (ISI) (pp. 72–77). IEEE.

Adebowale, M. A., Lwin, K. T., Sanchez, E., & Hossain, M. A. (2019). Intelligent web-phishing detection and protection scheme using integrated features of images, frames and text. Expert Systems with Applications, 115, 300–313.

Aleroud, A., & Zhou, L. (2017). Phishing environments, techniques, and countermeasures: A survey. Computers and Security, 68, 160–196.

Ali, W., & Malebary, S. (2020). Particle swarm optimization-based feature weighting for improving intelligent phishing website detection. IEEE Access, 8, 116766–116780.

Alsariera, Y. A., Adeyemo, V. E., Balogun, A. O., & Alazzawi, A. K. (2020). Ai meta-learners and extra-trees algorithm for the detection of phishing websites. IEEE Access, 8, 142532–142542.

Begum, A., & Badugu, S. (2020). A study of malicious url detection using machine learning and heuristic approaches. In Advances in decision sciences, security and computer vision, image processing (pp. 587–597). Berlin: Springer.

Benavides, E., Fuertes, W., Sanchez, S., & Sanchez, M. (2020). Classification of phishing attack solutions by employing deep learning techniques: A systematic literature review. In Developments and advances in defense and security (pp. 51–64). Springer.

Cabaj, K., Domingos, D., Kotulski, Z., & Respício, A. (2018). Cybersecurity education: Evolution of the discipline and analysis of master programs. Computers and Security, 75, 24–35.

Chen, Y. H., & Chen, J. L. (2019). Ai@ ntiphish—machine learning mechanisms for cyber-phishing attack. IEICE Transactions on Information and Systems , 102 (5), 878–887.

Chiew, K. L., Yong, K. S. C., & Tan, C. L. (2018). A survey of phishing attacks: Their types, vectors and technical approaches. Expert Systems with Applications, 106, 1–20.

Chiew, K. L., Tan, C. L., Wong, K., Yong, K. S., & Tiong, W. K. (2019). A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Information Sciences, 484, 153–166.

Conklin, W. A., Cline, R. E., & Roosa, T. (2014). Re-engineering cybersecurity education in the us: An analysis of the critical factors. In 2014 47th Hawaii international conference on system sciences (pp. 2006–2014). IEEE.

Curtis, S. R., Rajivan, P., Jones, D. N., & Gonzalez, C. (2018). Phishing attempts among the dark triad: Patterns of attack and vulnerability. Computers in Human Behavior, 87, 174–182.

El Aassal, A., Baki, S., Das, A., & Verma, R. M. (2020). An in-depth benchmarking and evaluation of phishing detection research for security needs. IEEE Access, 8, 22170–22192.

Fatima, R., Yasin, A., Liu, L., & Wang, J. (2019). How persuasive is a phishing email? A phishing game for phishing awareness. Journal of Computer Security, 27(6), 581–612.

Feng, Q., Tseng, K. K., Pan, J. S., Cheng, P., & Chen, C. (2011). New anti-phishing method with two types of passwords in openid system. In 2011 Fifth international conference on genetic and evolutionary computing (pp. 69–72). IEEE.

Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020). Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications, 50, 102419.

Forecast. (2017). Global fraud and cybercrime forecast. https://www.rsa.com/en-us/blog/2016-12/2017-global-fraud-cybercrime-forecast . Accessed from 20 July 2020

Gupta, B. B., Tewari, A., Jain, A. K., & Agrawal, D. P. (2017). Fighting against phishing attacks: State of the art and future challenges. Neural Computing and Applications, 28(12), 3629–3654.

Gupta, B. B., Arachchilage, N. A., & Psannis, K. E. (2018). Defending against phishing attacks: Taxonomy of methods, current issues and future directions. Telecommunication Systems, 67(2), 247–267.

Hota, H., Shrivas, A., & Hota, R. (2018). An ensemble model for detecting phishing attack with proposed remove-replace feature selection technique. Procedia Computer Science, 132, 900–907.

Hulten, G. J., Rehfuss, P. S., Rounthwaite, R., Goodman, J. T., Seshadrinathan, G., Penta, A. P., Mishra, M., Deyo, R. C., Haber, E. J., & Snelling, D. A. W. et al. (2014). Finding phishing sites. US Patent 8,839,418.

Hutchinson, S., Zhang, Z., & Liu, Q. (2018). Detecting phishing websites with random forest. In International conference on machine learning and intelligent communications (pp. 470–479). Springer.

Iwendi, C., Jalil, Z., Javed, A. R., Reddy, T., Kaluri, R., Srivastava, G., et al. (2020). Keysplitwatermark: Zero watermarking algorithm for software protection against cyber-attacks. IEEE Access, 8, 72650–72660.

Jagadeesan, S., Chaturvedi, A., & Kumar, S. (2018). Url phishing analysis using random forest. International Journal of Pure and Applied Mathematics, 118(20), 4159–4163.

Jain, A. K., & Gupta, B. B. (2018). Towards detection of phishing websites on client-side using machine learning based approach. Telecommunication Systems, 68(4), 687–700.

Jain, A. K., Parashar, S., Katare, P., & Sharma, I. (2020). Phishskape: A content based approach to escape phishing attacks. Procedia Computer Science, 171, 1102–1109.

James, J., Sandhya, L., & Thomas, C. (2013). Detection of phishing urls using machine learning techniques. In 2013 International conference on control communication and computing (ICCC) (pp. 304–309). IEEE.

Javed, A. R., Jalil, Z., Moqurrab, S. A., Abbas, S., & Liu, X. (2020). Ensemble adaboost classifier for accurate and fast detection of botnet attacks in connected vehicles. Transactions on Emerging Telecommunications Technologies.

Javed, A. R., Usman, M., Rehman, S. U., Khan, M. U., & Haghighi, M. S. (2020). Anomaly detection in automated vehicles using multistage attention-based convolutional neural network. IEEE Transactions on Intelligent Transportation Systems, pp. 1–10.

Joshi, A., Pattanshetti, P., & Tanuja, R. (2019). Phishing attack detection using feature selection techniques. In International conference on communication and information processing (ICCIP), Nutan College of Engineering and Research.

Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: A literature survey. IEEE Communications Surveys and Tutorials, 15(4), 2091–2121.

Kumar, A., Chatterjee, J. M., & Díaz, V. G. (2020). A novel hybrid approach of svm combined with nlp and probabilistic neural network for email phishing. International Journal of Electrical and Computer Engineering, 10(1), 486.

Li, Y., Yang, Z., Chen, X., Yuan, H., & Liu, W. (2019). A stacking model using url and html features for phishing webpage detection. Future Generation Computer Systems, 94, 27–39.

Liew, S. W., Sani, N. F. M., Abdullah, M. T., Yaakob, R., & Sharum, M. Y. (2019). An effective security alert mechanism for real-time phishing tweet detection on twitter. Computers and Security, 83, 201–207.

Mao, J., Bian, J., Tian, W., Zhu, S., Wei, T., Li, A., et al. (2018). Detecting phishing websites via aggregation analysis of page layouts. Procedia Computer Science, 129, 224–230.

Mao, J., Bian, J., Tian, W., Zhu, S., Wei, T., Li, A., et al. (2019). Phishing page detection via learning classifiers from page layout feature. EURASIP Journal on Wireless Communications and Networking, 2019(1), 43.

Maurya, S., & Jain, A. (2020). Deep learning to combat phishing. Journal of Statistics and Management Systems, pp. 1–13.

Mittal, M., Iwendi, C., Khan, S., & Rehman Javed, A. (2020). Analysis of security and energy efficiency for shortest route discovery in low-energy adaptive clustering hierarchy protocol using Levenberg–Marquardt neural network and gated recurrent unit for intrusion detection system. Transactions on Emerging Telecommunications Technologies, p. e3997.

Niranjan, A., Haripriya, D., Pooja, R., Sarah, S., Shenoy, P. D., & Venugopal, K. (2019). Ekrv: Ensemble of knn and random committee using voting for efficient classification of phishing. In Progress in advanced computing and intelligent engineering (pp. 403–414). Springer.

Ollmann, G. (2004). The phishing guide understanding and preventing phishing attacks. NGS Software Insight Security Research.

Pandey, A., Gill, N., Nadendla, K. S. P., & Thaseen, I. S. (2018). Identification of phishing attack in websites using random forest-svm hybrid model. In International conference on intelligent systems design and applications (pp. 120–128). Springer.

Parekh, S., Parikh, D., Kotak, S., & Sankhe, S. (2018). A new method for detection of phishing websites: Url detection. In 2018 Second international conference on inventive communication and computational technologies (ICICCT) (pp. 949–952). IEEE.

Parsons, K., Butavicius, M., Delfabbro, P., & Lillie, M. (2019). Predicting susceptibility to social influence in phishing emails. International Journal of Human-Computer Studies, 128, 17–26.

Patil, V., Thakkar, P., Shah, C., Bhat, T., & Godse, S. (2018). Detection and prevention of phishing websites using machine learning approach. In 2018 Fourth international conference on computing communication control and automation (ICCUBEA) (pp. 1–5). IEEE.

Sahingoz, O. K., Buber, E., Demir, O., & Diri, B. (2019). Machine learning based phishing detection from urls. Expert Systems with Applications, 117, 345–357.

Shie, E. W. S. (2020). Critical analysis of current research aimed at improving detection of phishing attacks. Selected computing research papers, p. 45.

Subasi, A., & Kremic, E. (2020). Comparison of adaboost with multiboosting for phishing website detection. Procedia Computer Science, 168, 272–278.

Subasi, A., Molah, E., Almkallawi, F., & Chaudhery, T. J. (2017). Intelligent phishing website detection using random forest classifier. In 2017 International conference on electrical and computing technologies and applications (ICECTA) (pp. 1–5). IEEE.

Tyagi, I., Shad, J., Sharma, S., Gaur, S., & Kaur, G. (2018). A novel machine learning approach to detect phishing websites. In 2018 5th International conference on signal processing and integrated networks (SPIN) (pp. 425–430). IEEE.

Ubing, A. A., Jasmi, S. K. B., Abdullah, A., Jhanjhi, N., & Supramaniam, M. (2019). Phishing website detection: An improved accuracy through feature selection and ensemble learning. International Journal of Advanced Computer Science and Applications , 10 (1), 252–257.

Volkamer, M., Renaud, K., Reinheimer, B., & Kunz, A. (2017). User experiences of torpedo: Tooltip-powered phishing email detection. Computers and Security , 71 , 100–113.

Vrbančič, G., Fister Jr, I., & Podgorelec, V. (2018). Swarm intelligence approaches for parameter setting of deep learning neural network: Case study on phishing websites classification. In Proceedings of the 8th international conference on web intelligence, mining and semantics (pp. 1–8).

Williams, E. J., Hinds, J., & Joinson, A. N. (2018). Exploring susceptibility to phishing in the workplace. International Journal of Human-Computer Studies , 120 , 1–13.

Yao, W., Ding Y., & Li, X. (2018). Logophish: A new two-dimensional code phishing attack detection method. In 2018 IEEE international conference on parallel and distributed processing with applications, ubiquitous computing and communications, big data and cloud computing, social computing and networking, sustainable computing and communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) (pp. 231–236). IEEE.

Yasin, A., Fatima, R., Liu, L., Yasin, A., & Wang, J. (2019). Contemplating social engineering studies and attack scenarios: A review study. Security and Privacy , 2 (4), e73.

Zamir, A., Khan, H. U., Iqbal, T., Yousaf, N., Aslam, F., Anjum, A., et al. (2020). Phishing web site detection using diverse machine learning algorithms. The Electronic Library .

Author information

Authors and affiliations.

Department of Computer Science, Air University, E-9, Islamabad, Pakistan

Abdul Basit & Maham Zafar

School of Information Engineering, Yangzhou University, Yangzhou, China

Department of Cyber Security, Air University, E-9, Islamabad, Pakistan

Abdul Rehman Javed, Zunera Jalil & Kashif Kifayat

Corresponding author

Correspondence to Xuan Liu .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Basit, A., Zafar, M., Liu, X. et al. A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommun Syst 76 , 139–154 (2021). https://doi.org/10.1007/s11235-020-00733-2

Accepted : 09 October 2020

Published : 23 October 2020

Issue Date : January 2021

DOI : https://doi.org/10.1007/s11235-020-00733-2

  • Phishing attack
  • Security threats
  • Advanced phishing techniques
  • Cyberattack
  • Internet security
  • Machine learning
  • Deep learning
  • Hybrid learning

  • Open access
  • Published: 25 May 2022

An effective detection approach for phishing websites using URL and HTML features

  • Ali Aljofey 1 , 2 ,
  • Qingshan Jiang 1 ,
  • Abdur Rasool 1 , 2 ,
  • Hui Chen 1 , 2 ,
  • Wenyin Liu 3 ,
  • Qiang Qu 1 &
  • Yang Wang 4  

Scientific Reports volume 12, Article number: 8842 (2022)

  • Computer science
  • Information technology
  • Scientific data

The growing number of phishing websites poses a significant threat because such sites are hard to detect. They rely on internet users mistaking them for genuine sites and unknowingly revealing private information such as login IDs, passwords, and credit card numbers. This paper proposes a new approach to the anti-phishing problem. Its features are the URL character sequence, which requires no prior phishing knowledge, various hyperlink information, and the textual content of the webpage; these are combined and used to train an XGBoost classifier. One of the major contributions of this paper is the selection of new features that can detect zero-hour (0-h) attacks and do not depend on any third-party services. In particular, we extract character-level Term Frequency-Inverse Document Frequency (TF-IDF) features from the noisy parts of the HTML and the plaintext of the given webpage. Moreover, our proposed hyperlink features determine the relationship between the content and the URL of a webpage. Because no large phishing data set is publicly available, we created our own data set of 60,252 webpages to validate the proposed solution. It contains 32,972 benign webpages and 27,280 phishing webpages. We evaluate the performance of each category of the proposed feature set and employ various classification algorithms. The empirical results show that the proposed individual features are valuable for phishing detection, and that integrating all the features further improves the detection of phishing sites. The proposed approach achieved an accuracy of 96.76% with a false-positive rate of only 1.39% on our dataset, and an accuracy of 98.48% with a false-positive rate of 2.09% on a benchmark dataset, which outperforms the existing baseline approaches.

Introduction

Phishing offenses are increasing, resulting in billions of dollars in losses 1 . In these attacks, users enter critical information (e.g., credit card details and passwords) into a forged website that appears to be legitimate. Software-as-a-Service (SaaS) and webmail sites are the most common targets of phishing 2 . The phisher builds websites that look very similar to benign websites. The link to the phishing website is then sent to millions of internet users via email and other communication media. These cyber-attacks are usually initiated through emails, instant messages, or phone calls 3 . The aim of a phishing attack is not only to steal the victim's identity; it can also be carried out to spread other types of malware such as ransomware, to exploit system weaknesses, or to obtain monetary profit 4 . According to the Anti-Phishing Working Group (APWG) report for the third quarter of 2020, the number of phishing attacks has grown since March, and 28,093 unique phishing sites were detected between July and September 2 . The average amount demanded in wire transfer Business E-mail Compromise (BEC) attacks was $48,000 in the third quarter, down from $80,000 in the second quarter and $54,000 in the first.

Detecting and preventing phishing offenses is a significant challenge for researchers because phishers design their attacks to bypass existing anti-phishing techniques. Moreover, phishers can target even educated and experienced users with new phishing scams. Thus, software-based phishing detection techniques are preferred for fighting phishing attacks. The most widely available methods for detecting phishing attacks are blacklists/whitelists 5 , natural language processing 6 , visual similarity 7 , rules 8 , machine learning techniques 9 , 10 , etc. Techniques based on blacklists/whitelists fail to detect unlisted phishing sites (i.e., zero-hour, or 0-h, attacks), and they also fail when a blacklisted URL is encountered with minor changes. In machine learning-based techniques, a classification model is trained on various heuristic features (e.g., URL, webpage content, website traffic, search engine, WHOIS record, and PageRank) in order to improve detection efficiency. However, these heuristic features are not guaranteed to be present in all phishing websites and may also be present in benign websites, which can cause classification errors. Moreover, some of the heuristic features are hard to access and depend on third parties. Some third-party services (e.g., page rank, search engine indexing, and WHOIS) may not be sufficient to identify phishing websites hosted on hacked servers; such websites are inaccurately identified as benign because they appear in search results. Websites hosted on compromised servers are usually more than a day old, unlike other phishing websites, which are only a few hours old. These services also misidentify new benign websites as phishing sites because of their low domain age. Visual similarity-based heuristic techniques compare a new website with the pre-stored signature of a website. The website's visual signature includes screenshots, font styles, images, page layouts, logos, etc. Thus, these techniques cannot identify fresh phishing websites and produce a high false-negative rate (phishing classified as benign). URL-based techniques do not consider the HTML of the webpage and may misjudge some malicious websites hosted on free or compromised servers. Many existing approaches 11 , 12 , 13 extract hand-crafted URL-based features, e.g., number of dots, presence of special symbols such as “@”, “#”, and “–”, URL length, brand names in the URL, position of the top-level domain (TLD), whether the hostname is an IP address, presence of multiple TLDs, etc. However, extracting URL features manually remains costly, since the human effort involved takes time and adds maintenance labor. Hence, network security managers strongly recommend hybrid methods rather than a single approach.

This paper provides an efficient solution for phishing detection that extracts features from a website's URL and HTML source code. Specifically, we propose a hybrid feature set comprising URL character sequence features that require no expert knowledge, various hyperlink information, and features based on the plaintext and noisy HTML data within the HTML source code. These features are then used to create the feature vector required to train the proposed approach with an XGBoost classifier. Extensive experiments show that the proposed anti-phishing approach attains competitive performance on a real dataset in terms of different evaluation statistics.

Our anti-phishing approach has been designed to meet the following requirements.

High detection efficiency: To provide high detection efficiency, incorrect classification of benign sites as phishing (false-positive) should be minimal and correct classification of phishing sites (true-positive) should be high.

Real-time detection: The prediction of the phishing detection approach must be provided before exposing the user's personal information on the phishing website.

Target independent: Due to the features extracted from both URL and HTML the proposed approach can detect new phishing websites targeting any benign website (zero-day attack).

Third-party independent: The feature set defined in our work is lightweight and client-side adaptable, and it does not rely on third-party services such as blacklists/whitelists, Domain Name System (DNS) records, WHOIS records (domain age), search engine indexing, network traffic measures, etc. Though third-party services may raise the effectiveness of a detection approach, they might misclassify a benign website if it is newly registered. Furthermore, the DNS database and domain age record may be poisoned, leading to false-negative results (phishing classified as benign).

Hence, a light-weight technique is needed for phishing websites detection adaptable at client side. The major contributions in this paper are itemized as follows.

We propose a phishing detection approach that extracts efficient features from the URL and HTML of the given webpage without relying on third-party services. Thus, it can be adapted to the client side and offers better privacy.

We propose eight novel features, namely URL character sequence features (F1), character-level textual content features (F2), and various hyperlink features (F3, F4, F5, F6, F7, and F14), along with seven existing features adopted from the literature.

We conducted extensive experiments using various machine learning algorithms to measure the efficiency of the proposed features. The evaluation results show that the proposed approach identifies legitimate websites precisely, with a high true-negative rate and a very low false-positive rate.

We release a real phishing webpage detection dataset to be used by other researchers on this topic.

The rest of this paper is structured as follows. The "Related work" section reviews related work on phishing detection. The "Proposed approach" section presents an overview of our proposed solution and describes the feature set used to train the machine learning algorithms. The "Experiments and result analysis" section presents extensive experiments, including the experimental dataset and the evaluation of results. The "Discussion and limitation" section discusses the proposed approach and its limitations. Finally, the "Conclusion" section concludes the paper and discusses future work.

Related work

This section provides an overview of the phishing detection techniques proposed in the literature. Anti-phishing methods are divided into two categories: raising user awareness so that users can distinguish the characteristics of phishing and benign webpages 14 , and using additional software. Software-based techniques are further categorized into list-based detection and machine learning-based detection. However, the problem of phishing is so sophisticated that there is no definitive solution that efficiently counters all threats; thus, multiple techniques are often dedicated to restraining particular phishing offenses.

List-based detection

List-based phishing detection methods use either a whitelist or a blacklist. A blacklist contains suspicious domains, URLs, and IP addresses and is used to check whether a URL is fraudulent. Conversely, a whitelist is a list of legitimate domains, URLs, and IP addresses used to validate a suspected URL. Wang et al. 15 , Jain and Gupta 5 and Han et al. 16 use whitelist-based methods to detect suspected URLs. Blacklist-based methods are widely used in openly available anti-phishing toolbars, such as Google Safe Browsing, which maintains a blacklist of URLs and warns users once a URL is considered phishing. Prakash et al. 17 proposed a technique for predicting phishing URLs called PhishNet. In this technique, phishing URLs are identified from existing blacklisted URLs using the directory structure, equivalent IP address, and brand name. Felegyhazi et al. 18 developed a method that compares the domain name and name server information of new suspicious URLs with the information of blacklisted URLs for the classification process. Sheng et al. 19 demonstrated that forged domains were added to blacklists only after a considerable amount of time, and approximately 50–80% of the forged domains were appended after the attack had been carried out. Since thousands of deceptive websites are launched every day, a blacklist needs to be updated periodically from its source. Thus, machine learning-based detection techniques are more efficient in dealing with phishing offenses.
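
The weakness of list-based detection can be illustrated with a minimal sketch. The blacklist entry, the URLs, and both helper functions below are hypothetical and are not taken from any of the cited tools; the sketch only shows that an exact-match lookup misses a listed URL with a minor change, and that even a looser hostname match misses a freshly registered host.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entry, for illustration only.
BLACKLIST = {"http://paypa1-login.example.com/signin"}


def is_blacklisted(url: str) -> bool:
    """Exact-match lookup: the URL must equal a listed entry."""
    return url in BLACKLIST


def host_is_blacklisted(url: str) -> bool:
    """Looser lookup: compare only the hostname with listed hostnames."""
    bad_hosts = {urlsplit(entry).hostname for entry in BLACKLIST}
    return urlsplit(url).hostname in bad_hosts


listed = "http://paypa1-login.example.com/signin"
variant = "http://paypa1-login.example.com/signin?session=2"  # minor change
new_host = "http://paypa1-login2.example.com/signin"          # unlisted host

for url in (listed, variant, new_host):
    print(url, is_blacklisted(url), host_is_blacklisted(url))
```

Neither lookup flags the unlisted host, which is the zero-hour case that motivates learning-based detection.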

Machine learning-based detection

Data mining techniques have provided outstanding performance in many applications, e.g., data security and privacy 20 , game theory 21 , blockchain systems 22 , healthcare 23 , etc. With the recent development of phishing detection methods, various machine learning-based techniques have also been employed 6 , 9 , 10 , 13 to assess the legitimacy of websites. The effectiveness of these methods relies on the feature collection, the training data, and the classification algorithm. Features are extracted from different sources, e.g., the URL, webpage content, third-party services, etc. However, some of the heuristic features are hard to access and time-consuming to obtain, so some machine learning approaches require heavy computation to extract them.

Jain and Gupta 24 proposed an anti-phishing approach that extracts features from the URL and source code of the webpage and does not rely on any third-party services. Although the approach attained high accuracy in detecting phishing webpages, it used a limited dataset (2141 phishing and 1918 legitimate webpages). The same authors 9 present a phishing detection method that identifies phishing attacks by analyzing the hyperlinks extracted from the HTML of the webpage. The method is a client-side and language-independent solution. However, it depends entirely on the HTML of the webpage and may misclassify phishing webpages if the attacker changes all webpage resource references (i.e., JavaScript, CSS, images, etc.). Rao and Pais 25 proposed a two-level anti-phishing technique called BlackPhish. At the first level, a blacklist of signatures is created using visual similarity-based features (i.e., file names, paths, and screenshots) rather than a blacklist of URLs. At the second level, heuristic features are extracted from the URL and HTML to identify phishing websites that pass the first-level filter. As a drawback, legitimate websites always undergo both levels of filtering. Some studies 26 use a search engine-based mechanism as the first level of authentication for a webpage. At the second level, various hyperlinks within the HTML of the website are processed to detect phishing websites. Search-based approaches assume that a genuine website appears in the top search results. Although search engine-based techniques increase the number of legitimate websites correctly identified as legitimate, they also increase the number of legitimate websites incorrectly identified as phishing when newly created authentic websites are not found in the top search results.

In a recent study, Rao et al. 27 proposed a new phishing website detection method that uses word embeddings extracted from the plain text and domain-specific text of the HTML source code. They implemented different word embeddings to evaluate their model using ensemble and multimodal techniques. However, the method depends entirely on plain text and domain-specific text and may fail when the text is replaced with images. Some researchers have tried to identify phishing attacks by extracting different hyperlink relationships from webpages. Guo et al. 28 proposed a phishing webpage detection approach called HinPhish. The approach builds a heterogeneous information network (HIN) based on domain nodes and loading resource nodes and establishes three relationships between the four hyperlink types: external link, empty link, internal link and relative link. Then, they applied an authority ranking algorithm to calculate the effect of the different relationships and obtain a quantitative score for each node.

In the work of Sahingoz et al. 6 , a distributed representation of the words within a URL is adopted, and seven different machine learning classifiers are then employed to decide whether a suspicious URL is a phishing website. Rao et al. 13 proposed an anti-phishing technique called CatchPhish. They extracted hand-crafted and Term Frequency-Inverse Document Frequency (TF-IDF) features from URLs and then trained a classifier on these features using the random forest algorithm. Although the above methods have shown satisfactory performance, they suffer from the following restrictions: (1) they cannot handle unobserved characters, because URLs usually contain meaningless and unknown words that are not in the training set; (2) they do not consider the content of the website. Accordingly, some URLs that differ from others but imitate legitimate sites may not be identified from the URL string alone. Because these works rely only on URL features, they are not sufficient to detect phishing websites. In contrast, our approach uses three different types of features to detect phishing websites more effectively. Specifically, we propose a hybrid feature set consisting of URL character sequence, various hyperlink information, and textual content-based features.

Deep learning methods, e.g., Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and Recurrent Convolutional Neural Networks (RCNN), have been used for phishing detection owing to the success these techniques have achieved in Natural Language Processing (NLP). However, deep learning methods are not widely employed in phishing detection because of their extensive training time. Aljofey et al. 3 proposed a phishing detection approach based on a character-level convolutional neural network applied to the URL. The approach was compared against various machine learning and deep learning algorithms and different types of features, such as TF-IDF characters, count vectors, and manually crafted features. Le et al. 29 proposed URLNet, a method for detecting phishing webpages from the URL. They extract character-level and word-level features from URL strings and employ CNNs for training and testing. Chatterjee and Namin 30 introduced a phishing detection technique based on deep reinforcement learning to identify phishing URLs. They applied their model to a balanced, labeled dataset of benign and phishing URLs, extracting 14 hand-crafted features from the given URLs to train the model. More recently, Xiao et al. 31 proposed a phishing website detection approach named CNN–MHSA. A CNN is applied to extract character features from URLs, while a multi-head self-attention (MHSA) mechanism calculates the corresponding weights for the features learned by the CNN. Zheng et al. 32 proposed a new Highway Deep Pyramid Neural Network (HDP-CNN), a deep convolutional network that integrates both character-level and word-level embedding representations to decide whether a given URL is phishing or legitimate. Although the above approaches have shown valuable performance, they might misclassify phishing websites hosted on compromised servers since their features are extracted only from the URL of the website.

The features extracted in some previous studies are hand-crafted and require additional effort, since they need to be reset according to the dataset, which may lead to overfitting of anti-phishing solutions. Motivated by the above-mentioned studies, we propose an approach that extracts character sequence features from the URL without manual intervention. Moreover, our approach employs the noisy data of the HTML, the plaintext, and the hyperlink information of the website, which helps identify new phishing websites. Table 1 presents a detailed comparison of existing machine learning-based phishing detection approaches.

Proposed approach

Our approach extracts and analyzes different features of suspected webpages for effective identification of large-scale phishing offenses. The main contribution of this paper is the combined use of these feature sets. To improve the detection accuracy of phishing webpages, we propose eight new features. Our proposed features determine the relationship between the URL of the webpage and the webpage content.

System architecture

The overall architecture of the proposed approach is divided into three phases. In the first phase, all the essential features are extracted and the HTML source code is crawled. The second phase applies feature vectorization to generate a feature vector for each webpage. The third phase identifies whether the given webpage is phishing. Figure 1 shows the system structure of the proposed approach. Details of each phase are described as follows.

Figure 1. General architecture of the proposed approach.

Feature generation

The features are generated in this component. Our features are based on the URL and HTML source code of the webpage. A Document Object Model (DOM) tree of the webpage is used to extract the hyperlink and textual content features using a web crawler automatically. The features of our approach are categorized into four groups as depicted in Table 2 . In particular, features F1–F7, and F14 are new and proposed by us; Features F8–F13, and F15 are taken from other approaches 9 , 11 , 12 , 24 , 33 but we adjusted them for better results. Moreover, the observational method and strategy regarding the interpretation of these features are applied differently in our approach. A detailed explanation of the proposed features is provided in the feature extraction section of this paper.

Feature vectorization

After the features are extracted, we apply feature vectorization to generate a feature vector for each webpage and so create a labeled dataset. We integrate the URL character sequence features with the textual content TF-IDF features and the hyperlink information features to create the feature vector required for training the proposed approach. The hyperlink feature combination outputs a 13-dimensional feature vector \(F_{H} = \left\langle {f_{3} ,f_{4} ,f_{5} , \ldots ,f_{{15}} } \right\rangle\) , and the URL character sequence feature combination outputs a 200-dimensional feature vector \(F_{U} = \left\langle {c_{1} ,c_{2} ,c_{3} , \ldots ,c_{{200}} } \right\rangle\) . We fix the URL length at 200: if a URL is longer than 200 characters, the excess is ignored; otherwise, the remainder of the sequence is filled with 0. This value was chosen from the distribution of URL lengths within our dataset, in which most URLs are shorter than 200 characters. A vector that is too long may contain useless information, whereas one that is too short may contain insufficient features. The character-level TF-IDF combination outputs a \(D\) -dimensional feature vector \(F_{T} = \left\langle {t_{1} ,t_{2} ,t_{3} , \ldots ,t_{D} } \right\rangle\) , where \(D\) is the size of the dictionary computed from the textual content corpus. In our experiments the dictionary size was \(D\)  = 20,332, and it grows as the corpus grows. The three feature vectors are combined to generate the final feature vector \(F_{V} = F_{T} \cup F_{U} \cup F_{H} = \left\langle {t_{1} ,t_{2} , \ldots ,t_{D} ,c_{1} ,c_{2} , \ldots ,c_{{200}} ,f_{3} ,f_{4} ,f_{5} , \ldots ,f_{{15}} } \right\rangle\) , which is fed as input to the machine learning algorithms to classify the website.
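
The assembly of the final feature vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pad_or_truncate` and `build_feature_vector` are hypothetical, the inputs are toy values, and plain lists stand in for whatever array type the actual implementation uses.

```python
def pad_or_truncate(seq, length=200, pad_value=0):
    """Cut a sequence to `length` items, or fill the remainder with zeros."""
    head = list(seq[:length])
    return head + [pad_value] * (length - len(head))


def build_feature_vector(f_t, f_u, f_h, url_len=200, n_hyperlink=13):
    """Concatenate F_T (TF-IDF), F_U (URL characters) and F_H (f3..f15)."""
    if len(f_h) != n_hyperlink:
        raise ValueError("expected 13 hyperlink features (f3..f15)")
    return list(f_t) + pad_or_truncate(f_u, url_len) + list(f_h)


# Toy inputs: a 4-term TF-IDF vector stands in for the D-dimensional one.
f_t = [0.1, 0.0, 0.3, 0.0]
f_h = [0.5] * 13
short = build_feature_vector(f_t, [5, 6, 7], f_h)             # padded URL
long_ = build_feature_vector(f_t, list(range(1, 251)), f_h)   # truncated URL
print(len(short), len(long_))
```

With these toy inputs both vectors have the same length, D + 200 + 13, regardless of the original URL length.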

Detection module

The Detection phase includes building a strong classifier by using the boosting method, XGBoost classifier. Boosting integrates many weak and relatively accurate classifiers to build a strong and therefore robust classifier for detecting phishing offences. Boosting also helps to combine diverse features resulting in improved classification performance 34 . Here, XGBoost classifier is employed on integrated feature sets of URL character sequence \({F}_{U}\) , various hyperlinks information \({F}_{H}\) , login form features \({F}_{L}\) , and textual content-based features \({F}_{T}\) to build a strong classifier for phishing detection. In the training phase, XGBoost classifier is trained using the feature vector \(({F}_{U}\cup {F}_{H} \cup {F}_{L} \cup {F}_{T})\) collected from each record in the training dataset. At the testing phase, the classifier detects whether a particular website is a malicious website or not. The detailed description is shown in Fig.  2 .

Figure 2. Phishing detection algorithm.

Features extraction

Because of the limitations of the search engine-based and third-party methods discussed in the literature, our approach extracts features on the client side. We have introduced eleven hyperlink features (F3–F13), two login form features (F14 and F15), character level TF-IDF features (F2), and URL character sequence features (F1). All these features are discussed in the following subsections.

URL character sequence features (F1)

URL stands for Uniform Resource Locator. It provides the location of resources on the web, such as images, files, hypertext, and video. Each URL starts with a protocol (e.g., http, https, or ftp) used to access the requested resource. In this part, we extract character sequence features from the URL. We employ the method used in 35 to process the URL at the character level, where more information is contained. Phishers often imitate the URLs of legitimate websites by changing characters in ways that are hard to notice, e.g., writing “ www.icbc.com ” as “ www.1cbc.com ”. Character-level URL processing also solves the out-of-vocabulary problem. Character-level sequences capture substantial information from specific groups of characters that appear together, which could be a sign of phishing. In general, a URL is a string of characters or words, some of which have little semantic meaning. Character sequences help find this sensitive information and improve the efficiency of phishing URL detection. During the learning task, machine learning techniques can be applied directly to the extracted character sequence features without expert intervention. The main steps in generating character sequences are: preparing the character vocabulary; creating a tokenizer object with the Keras preprocessing package ( https://Keras.io ) to process URLs at the character level and adding a “UNK” token to the vocabulary after the maximum value of the character dictionary; transforming the text of URLs into sequences of tokens; and padding the sequences so that all vectors have equal length. The URL feature extraction is described in Algorithm 1.

figure a
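
The steps listed above (vocabulary, "UNK" token, tokenization, padding) can be illustrated with a minimal sketch. It does not use the Keras tokenizer mentioned in the text and is not the authors' Algorithm 1: the alphabet, the `max_len` value, the right-padding with zeros, and the names `CHAR_INDEX`, `UNK_INDEX`, and `url_to_sequence` are all assumptions made for illustration.

```python
import string

# Assumed vocabulary: lowercase letters, digits, and ASCII punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
# Index 0 is reserved for padding; known characters get 1..len(ALPHABET).
CHAR_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
# The "UNK" token takes the index right after the largest known index.
UNK_INDEX = max(CHAR_INDEX.values()) + 1


def url_to_sequence(url, max_len=200):
    """Map a URL to a fixed-length list of integer character tokens."""
    tokens = [CHAR_INDEX.get(ch, UNK_INDEX) for ch in url.lower()]
    tokens = tokens[:max_len]                      # truncate long URLs
    return tokens + [0] * (max_len - len(tokens))  # right-pad with zeros


# The two look-alike URLs from the text differ in a single token.
legit = url_to_sequence("www.icbc.com", max_len=16)
spoof = url_to_sequence("www.1cbc.com", max_len=16)
```

Because every character has its own token, the substitution of "1" for "i" stays visible to the classifier, whereas a word-level vocabulary would map both unseen hostnames to the same out-of-vocabulary entry.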

HTML features

The source code of a webpage is the code behind it. For websites, this code can be viewed by anyone using various tools, including the web browser itself. In this section, we extract the textual and hyperlink features present in the HTML source code of the webpage.

Textual content-based features (F2)

TF-IDF stands for Term Frequency-Inverse Document Frequency. The TF-IDF weight is a statistical measure of the importance of a term to a document in a corpus 36 . TF-IDF vectors can be created at various levels of input tokens (words, characters, n-grams) 37 . The TF-IDF technique has been used in many approaches to detect phishing webpages by inspecting URLs 13 , to obtain indirectly associated links 38 , to identify the target website 11 , and to check the validity of a suspected website 39 . Although the TF-IDF technique extracts prominent keywords from the text content of a webpage, it has some limitations. One of them is that it fails when the extracted keywords are meaningless, misspelled, skipped, or replaced with images. Since our approach extracts both plaintext and noisy data (i.e., attribute values of div, h1, h2, body, and form tags) from the given webpage using the BeautifulSoup parser, the character-level TF-IDF technique is applied with a maximum of 25,000 features. To obtain valid textual information, extraneous portions of the webpage (i.e., JavaScript code, CSS code, punctuation symbols, and numbers) are removed with regular expressions, and Natural Language Processing packages ( http://www.nltk.org/nltk_data/ ) are used for sentence segmentation, word tokenization, text lemmatization, and stemming, as shown in Fig. 3.

figure 3

The process of generating text features.

Phishers usually mimic the textual content of the target website to trick the user. Moreover, phishers may alter or overwrite some text (e.g., title, copyright, metadata) and tags in phishing webpages to avoid revealing the actual identity of the webpage. However, tag attributes stay the same to preserve the visual similarity between the phishing site and the targeted site, which share the style and theme of the benign webpage. Therefore, it is necessary to extract the text features (plaintext and noisy part of the HTML) of the webpage. The basis of this step is to extract a vector representation of the text and the effective webpage content. A TF-IDF object is employed to vectorize the text of the webpage. The detailed process of the text vector generation algorithm is as follows.

figure b
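
As a rough illustration of the text vector generation described above, the sketch below collects plaintext and the attribute values of the noisy tags, cleans them with regular expressions, and computes character-level TF-IDF weights. It is a standard-library stand-in, not the authors' algorithm: it uses `html.parser` instead of BeautifulSoup, single characters as tokens, a smoothed IDF with L2 normalisation (an assumed weighting), and it omits the 25,000-feature cap and the NLTK lemmatization and stemming steps. All function and class names are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

NOISY_TAGS = {"div", "h1", "h2", "body", "form"}  # tags named in the text


class TextAndNoise(HTMLParser):
    """Collect visible text plus attribute values of the noisy tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in NOISY_TAGS:
            self.parts.extend(value for _, value in attrs if value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:  # ignore JavaScript and CSS code
            self.parts.append(data)


def page_text(html):
    """Return cleaned plaintext plus noisy attribute values."""
    parser = TextAndNoise()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()


def char_tfidf(docs):
    """Return one {char: weight} dict per document (L2-normalised)."""
    counts = [Counter(doc) for doc in docs]
    n = len(docs)
    df = Counter(ch for c in counts for ch in c)  # document frequency
    idf = {ch: math.log((1 + n) / (1 + k)) + 1 for ch, k in df.items()}
    vectors = []
    for c in counts:
        raw = {ch: tf * idf[ch] for ch, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({ch: w / norm for ch, w in raw.items()})
    return vectors
```

A call such as `char_tfidf([page_text(html_a), page_text(html_b)])` yields one sparse vector per page; characters that occur in fewer pages receive a larger IDF factor.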

Script, CSS, img, and anchor files (F3, F4, F5, and F6)

External JavaScript and external Cascading Style Sheets (CSS) files are separate files that are referenced through links, typically within the head section of a webpage. JavaScript, CSS, image, and similar files may carry malicious code that runs when a webpage loads or when a specific link is clicked. Moreover, phishing websites have fragile and unprofessional content as the number of hyperlinks referring to a different domain name increases. The <img> and <script> tags carry a “src” attribute from which the images and external JavaScript files of a website can be extracted. Similarly, CSS and anchor files are found in the “href” attribute of <link> and <a> tags. In Eqs. (1–4), we calculate the ratio of img and script tags that have a “src” attribute, and of link and anchor tags that have an “href” attribute, to the total number of hyperlinks in a webpage. These tags usually link to the image, JavaScript, CSS, and anchor files required by a website.

In these equations, \({\text{F}}_{\text{Script}\_\text{files}}\) , \({\text{F}}_{\text{CSS}\_\text{files}}\) , \({\text{F}}_{\text{Img}\_\text{files}}\) , and \({\text{F}}_{\text{a}\_\text{files}}\) are the numbers of JavaScript, CSS, image, and anchor files in a webpage, and \({\text{F}}_{\text{Total}}\) is the total number of hyperlinks in the webpage.
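
A minimal sketch of these four ratio features follows. The mapping of F3 to F6 onto script, CSS, image, and anchor files is assumed from the order in the subsection heading; taking the total as the sum of the four counts and returning 0.0 for a page without hyperlinks are also assumptions, as are the class and function names. The standard-library `html.parser` stands in for the BeautifulSoup parser used by the authors.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count tags that carry a non-empty src or href attribute."""

    SRC_TAGS = {"img", "script"}
    HREF_TAGS = {"link", "a"}

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "link": 0, "img": 0, "a": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SRC_TAGS and attrs.get("src"):
            self.counts[tag] += 1
        elif tag in self.HREF_TAGS and attrs.get("href"):
            self.counts[tag] += 1


def file_ratio_features(html):
    """Return the share of script, CSS, image, and anchor links."""
    counter = LinkCounter()
    counter.feed(html)
    counter.close()
    total = sum(counter.counts.values())  # assumed definition of F_Total

    def ratio(key):
        # Assumed convention: a page without hyperlinks scores 0.0.
        return counter.counts[key] / total if total else 0.0

    return {
        "F3_script": ratio("script"),
        "F4_css": ratio("link"),
        "F5_img": ratio("img"),
        "F6_anchor": ratio("a"),
    }
```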

Empty hyperlinks (F7 and F8)

In an empty hyperlink, the “href” or “src” attribute of an anchor, link, script, or img tag does not contain any URL. An empty link returns the user to the same webpage when clicked. A benign website contains many webpages; a scammer therefore leaves the hyperlink values empty so that the phishing website appears to behave like the benign one, with hyperlinks that look active. For example, the HTML snippets <a href="#">, <a href="#content">, and <a href="javascript:void(0);"> are used to create null hyperlinks 24 . To establish the empty hyperlink features, we define the ratio of empty hyperlinks to the total number of hyperlinks in a webpage, and the ratio of anchor tags without an “href” attribute to the total number of hyperlinks in a webpage. The following formulas are used to compute the empty hyperlink features.

In these formulas, \({\text{F}}_{\text{a}\_\text{null}}\) and \({\text{F}}_{\text{null}}\) are the numbers of anchor tags without an href attribute and of null hyperlinks in a webpage, respectively.
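
The two empty hyperlink ratios can be sketched as below. The pattern for "null" values covers the three examples given in the text plus the empty string; counting the total as every href/src attribute present on anchor, link, script, and img tags, and returning 0.0 when there are none, are assumptions. The names `NULL_LINK`, `EmptyLinkCounter`, and `empty_link_features` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Values treated as null links: empty, "#", "#fragment", javascript:void(0)
NULL_LINK = re.compile(r"^\s*(#.*|javascript:\s*void\(0\);?)?\s*$", re.I)


class EmptyLinkCounter(HTMLParser):
    """Count hyperlinks, null hyperlinks, and anchors without href."""

    ATTR = {"a": "href", "link": "href", "script": "src", "img": "src"}

    def __init__(self):
        super().__init__()
        self.total = 0           # hyperlinks found (assumed F_Total)
        self.null = 0            # hyperlinks with a null value
        self.a_without_href = 0  # <a> tags with no href attribute

    def handle_starttag(self, tag, attrs):
        name = self.ATTR.get(tag)
        if name is None:
            return
        attrs = dict(attrs)
        if name in attrs:
            self.total += 1
            if NULL_LINK.match(attrs[name] or ""):
                self.null += 1
        elif tag == "a":
            self.a_without_href += 1


def empty_link_features(html):
    counter = EmptyLinkCounter()
    counter.feed(html)
    counter.close()
    if counter.total == 0:  # assumed convention for pages without links
        return {"null_ratio": 0.0, "a_null_ratio": 0.0}
    return {
        "null_ratio": counter.null / counter.total,
        "a_null_ratio": counter.a_without_href / counter.total,
    }
```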

Total hyperlinks feature (F9)

Phishing websites usually contain fewer pages than benign websites. Furthermore, a phishing webpage sometimes contains no hyperlinks at all, because phishers usually create only a login page. Equation ( 7 ) computes the number of hyperlinks in a webpage by extracting the hyperlinks from the anchor, link, script, and img tags in the HTML source code.

Internal and external hyperlinks (F10, F11, and F12)

In an external hyperlink, the base domain name differs from the website domain name, whereas in an internal hyperlink the base domain name is the same as the website domain name. Phishing websites may contain many external hyperlinks that point to the target websites, because cybercriminals commonly copy the HTML code of the targeted legitimate websites to create their phishing websites. Most hyperlinks in a benign website share the same base domain name, whereas many hyperlinks in a phishing site may include the domain of the corresponding benign website. In our approach, the internal and external hyperlinks are extracted from the “src” attribute of img, script, and frame tags, the “action” attribute of form tags, and the “href” attribute of anchor and link tags. We compute the ratio of internal hyperlinks to the total links in a webpage (Eq. 8) to establish the internal hyperlink feature, and the ratio of external hyperlinks to the total links (Eq. 9) to establish the external hyperlink feature. Moreover, to establish the external/internal hyperlink feature, we compute the ratio of external hyperlinks to internal hyperlinks (Eq. 10). Some previous studies 5 , 9 , 24 that used these features for classification applied a fixed threshold to detect suspected websites. For example, if the ratio of external hyperlinks to total links is greater than 0.5, the website is flagged as phishing. However, fixing a specific number as a detection parameter may cause classification errors.

In these equations, \({\text{F}}_{\text{Internal}}\) , \({\text{F}}_{\text{External}}\) , and \({\text{F}}_{\text{Total}}\) are the numbers of internal, external, and total hyperlinks in a website, respectively.
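
The three ratios can be sketched as follows, given the page URL and a list of already extracted link values. Taking the last two labels of the hostname as the base domain is a simplifying assumption (it mishandles suffixes such as "co.uk"; a public suffix list would be needed in practice), and the zero-denominator conventions and function names are likewise assumptions rather than the authors' definitions.

```python
from urllib.parse import urljoin, urlparse


def base_domain(url):
    """Approximate the base domain as the last two hostname labels."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def internal_external_features(page_url, links):
    """Return internal, external, and external/internal link ratios."""
    page_domain = base_domain(page_url)
    internal = external = 0
    for link in links:
        absolute = urljoin(page_url, link)  # resolve relative links
        if base_domain(absolute) == page_domain:
            internal += 1
        else:
            external += 1
    total = internal + external
    return {
        "internal_ratio": internal / total if total else 0.0,
        "external_ratio": external / total if total else 0.0,
        # Assumed convention: 0.0 when there are no internal links.
        "external_to_internal": external / internal if internal else 0.0,
    }
```

The returned values are continuous, so a learned classifier can place its own decision boundary instead of relying on a fixed cut-off such as 0.5.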

Error in hyperlinks (F13)

Phishers sometimes add dead or broken hyperlinks to the fake website. In the hyperlink error feature, we check whether each hyperlink in the website is a valid URL. We do not consider the 403 and 404 response codes of hyperlinks, because fetching the response code of every link over the internet is time-consuming. The hyperlink error feature is defined as the total number of invalid links divided by the total number of links, as represented in Eq. ( 11 ).

In this equation, \({\text{F}}_{\text{Error}}\) is the total number of invalid hyperlinks.
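
An offline validity check of this kind, which never requests the link, might look like the sketch below. The source does not say how validity is tested, so the rule used here (http or https scheme plus a syntactically well-formed hostname) is an assumption, and it deliberately rejects bare IP addresses. The links are assumed to be absolute already, and the names `HOST_RE`, `is_valid_url`, and `hyperlink_error_ratio` are hypothetical.

```python
import re
from urllib.parse import urlparse

# Hostname made of dot-separated labels ending in an alphabetic TLD.
HOST_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.I,
)


def is_valid_url(url):
    """Syntactic check only; no network request is made."""
    try:
        parts = urlparse(url)
        host = parts.hostname
    except ValueError:  # e.g. malformed IPv6 brackets
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    return bool(HOST_RE.match(host))


def hyperlink_error_ratio(links):
    """Share of links that fail the syntactic validity check."""
    if not links:  # assumed convention for pages without links
        return 0.0
    invalid = sum(not is_valid_url(link) for link in links)
    return invalid / len(links)
```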

Login form features (F14 and F15)

In a fraudulent website, the common trick for acquiring the user's personal information is to include a login form. In a benign webpage, the action attribute of the login form commonly contains a hyperlink whose base domain is the same as the one that appears in the browser address bar 24 . In phishing websites, however, the form action attribute contains a URL with a different base domain (external link), an empty link, or an invalid URL (Eq. 13). The suspicious form feature (Eq. 14) is defined as the total number of suspicious forms S divided by the total number of forms in a webpage (Eq. 12).

In these equations, \({\text{F}}_{\text{S}}\) and \({\text{L}}_{\text{Total}}\) are the numbers of suspicious forms and total forms present in a webpage, respectively.
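
A minimal sketch of the suspicious form ratio follows. A form is treated as suspicious when its action is empty or null, is not a usable http(s) URL, or resolves to a different base domain than the page; treating a missing action attribute as empty, approximating the base domain by the last two hostname labels, and returning 0.0 for pages without forms are assumptions. All names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def base_domain(host):
    """Approximate the base domain as the last two hostname labels."""
    return ".".join((host or "").lower().split(".")[-2:])


class FormActions(HTMLParser):
    """Collect the action attribute of every form tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing action is recorded as an empty string (assumption).
            self.actions.append((dict(attrs).get("action") or "").strip())


def is_suspicious(action, page_url):
    if action in ("", "#") or action.lower().startswith("javascript:"):
        return True  # empty or null action
    target = urlparse(urljoin(page_url, action))  # resolve relative paths
    if target.scheme not in ("http", "https") or not target.hostname:
        return True  # not a usable URL
    page_host = urlparse(page_url).hostname
    return base_domain(target.hostname) != base_domain(page_host)


def suspicious_form_ratio(html, page_url):
    parser = FormActions()
    parser.feed(html)
    parser.close()
    if not parser.actions:  # assumed convention for pages without forms
        return 0.0
    flagged = sum(is_suspicious(a, page_url) for a in parser.actions)
    return flagged / len(parser.actions)
```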

Figure 4 compares benign and phishing hyperlink features based on the average occurrence rate per feature within each website in our dataset. The figure shows that the ratios of external to internal hyperlinks and of null hyperlinks are higher in phishing websites than in benign websites, whereas benign sites contain more anchor files, internal hyperlinks, and total hyperlinks.

figure 4

Distribution of hyperlink-based features in our data.

Classification algorithms

To measure the effectiveness of the proposed features, we trained several machine learning classifiers on them: eXtreme Gradient Boosting (XGBoost), Random Forest, Logistic Regression, Naïve Bayes, and an ensemble of Random Forest and AdaBoost classifiers. The main aim of comparing different classifiers is to find the classifier that best fits our feature set. The scikit-learn package is used to apply the different machine learning classifiers, and Python is used for feature extraction. From the empirical results, we noticed that XGBoost outperformed the other classifiers. XGBoost is an ensemble classifier that combines weak learners into a robust one and suits our proposed feature set, which explains its high performance.

XGBoost (extreme gradient boosting) is a scalable machine learning system for tree boosting proposed by Chen and Guestrin 40 . Suppose there are \(N\) websites in the dataset \(\left\{ \left( x_{i}, y_{i} \right) \mid i = 1, 2, \ldots, N \right\}\) , where \(x_{i} \in R^{d}\) is the vector of extracted features associated with the \(i\)-th website and \(y_{i} \in \left\{ 0, 1 \right\}\) is the class label, such that \(y_{i} = 1\) if and only if the website is labelled as a phishing website. The final output \(f_{K} \left( x \right)\) of the model is as follows 41 , 46 :

where \(l\) is the training loss function and \(\Omega \left( G_{k} \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{t = 1}^{T} \omega_{t}^{2}\) is the regularization term. Since XGBoost uses additive training and all previous \(k-1\) base learners are fixed, we assume that we are at step \(k\), which optimizes \(f_{k} \left( x \right)\) . Here \(T\) is the number of leaf nodes in the base learner \(G_{k}\) , \(\gamma\) is the complexity of each leaf, \(\lambda\) is a parameter that scales the penalty, and \(\omega_{t}\) is the output value at each final leaf node. If we apply the Taylor expansion to expand the loss function at \(f_{k-1} \left( x \right)\) , we have 41 :

where \(g_{i} = \frac{\partial l\left( y_{i}, f_{k-1}\left( x_{i} \right) \right)}{\partial f_{k-1}\left( x_{i} \right)}\) and \(h_{i} = \frac{\partial^{2} l\left( y_{i}, f_{k-1}\left( x_{i} \right) \right)}{\partial f_{k-1}\left( x_{i} \right)^{2}}\) are, respectively, the first and second derivatives of the loss function.
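
The displayed equations that the surrounding text refers to are not present in this copy. For orientation, the standard XGBoost formulation from Chen and Guestrin, written with the symbols defined above, is sketched below; the label \(\mathrm{Obj}\) for the objective is an assumed notation, and the authors' typeset equations may differ in form.

```latex
\begin{align}
  f_{K}(x) &= \sum_{k=1}^{K} G_{k}(x), \\
  \mathrm{Obj} &= \sum_{i=1}^{N} l\bigl(y_{i}, f_{K}(x_{i})\bigr)
                + \sum_{k=1}^{K} \Omega(G_{k}), \\
  \mathrm{Obj}^{(k)} &= \sum_{i=1}^{N}
      l\bigl(y_{i}, f_{k-1}(x_{i}) + G_{k}(x_{i})\bigr) + \Omega(G_{k}) \\
  &\approx \sum_{i=1}^{N} \Bigl[ l\bigl(y_{i}, f_{k-1}(x_{i})\bigr)
      + g_{i}\, G_{k}(x_{i}) + \tfrac{1}{2} h_{i}\, G_{k}(x_{i})^{2} \Bigr]
      + \Omega(G_{k}).
\end{align}
```

Since \(l\left( y_{i}, f_{k-1}\left( x_{i} \right) \right)\) is constant at step \(k\), only the terms in \(g_{i}\), \(h_{i}\), and \(\Omega \left( G_{k} \right)\) influence the choice of the new base learner \(G_{k}\).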

XGBoost also provides a number of practical advantages: (i) it can handle missing values in the training set, (ii) it can handle huge datasets that do not fit into memory, and (iii) it can use multiple CPU cores for faster computation. The websites are classified into two categories, phishing and benign, using a binary classifier. When a user requests a new site, the trained XGBoost classifier determines the validity of the webpage from the created feature vector.

Experiments and result analysis

In this section, we describe the training and testing datasets, performance metrics, implementation details, and outcomes of our approach. The proposed features described in the “Features extraction” section are used to build a binary classifier, which classifies phishing and benign websites accurately.

We collected the dataset from two sources for our experimental implementation. The benign webpages were collected in February 2020 from Stuff Gate 42 , whereas the phishing webpages were collected from PhishTank 43 and had been validated between August 2016 and April 2020. Our dataset consists of 60,252 webpages and their HTML source code, of which 27,280 are phishing and 32,972 are benign. Table 3 provides the distribution of the benign and phishing instances. We use two datasets: D1 is our dataset, and D2 is a dataset used in existing literature 6 . A database management tool (pgAdmin) was used with Python to import and pre-process the data. Each dataset was randomly split in an 80:20 ratio for training and testing, respectively.

Performance metrics

To measure the performance of the proposed anti-phishing approach, we used several statistical metrics, namely true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), sensitivity or recall, accuracy (Acc), precision (Pre), F-Score, and AUC, which are presented in Table 4 . \({N}_{B}\) and \({N}_{P}\) indicate the total numbers of benign and phishing websites, respectively. \({N}_{B\to B}\) is the number of benign websites correctly marked as benign, \({N}_{B\to P}\) is the number of benign websites incorrectly marked as phishing, \({N}_{P\to P}\) is the number of phishing websites correctly marked as phishing, and \({N}_{P\to B}\) is the number of phishing websites incorrectly marked as benign. The receiver operating characteristic (ROC) curve and AUC are commonly used to evaluate a binary classifier. The horizontal coordinate of the ROC curve is the FPR, which indicates the probability that a benign website is misclassified as phishing; the vertical coordinate is the TPR, which indicates the probability that a phishing website is identified as phishing.

Evaluation of features

In this section, we evaluate the performance of our proposed features (URL and HTML). We implemented different machine learning (ML) classifiers to evaluate the features used in our approach. For Table 5 , we extracted various text features, namely TF-IDF word level, TF-IDF n-gram level (n-gram length between 2 and 3), TF-IDF character level, count vectors (bag-of-words), word sequence vectors, Global Vectors (GloVe) pre-trained word embeddings, trained word embeddings, and character sequence vectors, and implemented various classifiers, namely XGBoost, Random Forest, Logistic Regression, Naïve Bayes, Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks. The main purpose of this experiment was to find the textual content features best suited to our data. The experimental results show that TF-IDF character-level features outperformed the other features in accuracy, precision, F-Score, recall, and AUC with the XGBoost and DNN classifiers. Hence, we used the TF-IDF character-level technique to generate the text features (F2) of the webpage. Figure 5 presents the performance of the textual content-based features. As shown in the figure, text features can correctly filter a large share of phishing websites and achieved an accuracy of 88.82%.

figure 5

Performance of textual content features.

Table 6 shows the experimental results with hyperlink features. The empirical results show that the Random Forest classifier is superior to the other classifiers, with an accuracy of 82.27%, precision of 77.59%, F-Measure of 81.63%, recall of 86.10%, and AUC of 82.57%. The ensemble and XGBoost classifiers also attained good accuracy, 82.18% and 80.49%, respectively. Figure 6 presents the classification results of the hyperlink-based features (F3–F15). As shown in the figure, hyperlink-based features correctly classify 79.04% of benign websites and 86.10% of phishing websites.

figure 6

Performance of hyperlink based features.

In Table 7 , we integrated the URL and HTML (hyperlink and text) features using various classifiers to verify their complementary behavior in phishing website detection. The empirical results show that the LR classifier achieves adequate accuracy, precision, F-Score, AUC, and recall on the HTML features, whereas the NB classifier achieves good accuracy, precision, F-Score, AUC, and recall when all features are combined. The RF and ensemble classifiers achieved high accuracy, recall, F-Score, and AUC on the URL-based features. The XGBoost classifier outperformed the others with an accuracy of 96.76%, F-Score of 96.38%, AUC of 96.58%, and recall of 94.56% when all features are combined. URL and HTML features are both valuable in phishing detection. However, a single type of feature is not suitable for identifying all kinds of phishing webpages and does not yield high accuracy. Thus, we combined all features to obtain a more comprehensive feature set. The results of the various classifiers on the combined feature set are also shown in Fig. 7. In Fig. 8, we compare the three feature sets in terms of accuracy, TNR, FPR, FNR, and TPR.

figure 7

Test results of various classifiers with respect to combined features.

figure 8

Performance of different feature combinations using XGBoost on dataset D1.

The confusion matrix is used to measure results; each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The confusion matrix of the proposed approach is shown in Table 8 . Combining all kinds of features correctly identified 5212 out of 5512 phishing webpages and 6448 out of 6539 benign webpages, attaining an accuracy of 96.76%. Our approach yields a low false positive rate (about 1.39% of benign webpages incorrectly classified as phishing) and a high true positive rate (about 94.56% of phishing webpages correctly classified). We also tested our feature sets (URL and HTML) on the existing dataset D2. Since dataset D2 contains only legitimate and malicious URLs, we needed to extract the HTML source code features for these URLs. The results are given in Table 9 and Fig. 9. The results show that combining all kinds of features outperformed the other feature sets, with an accuracy of 98.48%, TPR of 99.04%, and FPR of 2.09%.
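
The rates quoted above follow directly from the four confusion matrix counts. The short sketch below recomputes them from the figures given in the text (5212 of 5512 phishing and 6448 of 6539 benign webpages correctly identified); the function name `binary_metrics` and the dictionary keys are hypothetical, and phishing is taken as the positive class.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary metrics with phishing as the positive class."""
    tpr = tp / (tp + fn)            # recall / sensitivity
    tnr = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "TPR": tpr,
        "FNR": 1 - tpr,
        "TNR": tnr,
        "FPR": 1 - tnr,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "F-Score": 2 * precision * tpr / (precision + tpr),
    }


# Counts taken from the text: phishing 5212/5512, benign 6448/6539.
m = binary_metrics(tp=5212, fn=5512 - 5212, tn=6448, fp=6539 - 6448)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

Rounded to two decimals, the computed accuracy, TPR, and FPR agree with the reported 96.76%, 94.56%, and 1.39%.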

figure 9

Performance of the proposed approach on dataset D2.

Comparison with existing approaches

In this experiment, we compare our approach with existing anti-phishing approaches. We applied the methods of Le et al. 29 and Aljofey et al. 3 to dataset D1 to evaluate the efficiency of the proposed approach. To compare the proposed approach with the works of Sahingoz et al. 6 , Rao et al. 13 , and Chatterjee and Namin 30 , we evaluated our approach on the benchmark dataset D2 6 , 13 , 30 using the four statistical metrics reported in those papers. The comparison results are shown in Table 10 . The results show that our approach performs better than the other approaches discussed in the literature, which demonstrates its efficiency in detecting phishing websites.

In Table 11 , we applied the methods of Le et al. 29 and Aljofey et al. 3 to our dataset D1; our approach outperformed them with an accuracy of 96.76%, precision of 98.28%, and F-Score of 96.38%. It should also be mentioned that the method of Aljofey et al. achieved 97.86% recall, which is 3.3% higher than our method, whereas our approach gives a TNR that is higher by 4.97% and an FPR that is lower by 4.96%. Our approach accurately identifies legitimate websites, with a high TNR and low FPR. Some phishing detection methods achieve high recall; however, misclassifying legitimate websites is more serious than misclassifying phishing sites.

Discussion and limitations

A phishing website looks similar to its benign official counterpart, and the challenge is how to distinguish between them. This paper proposed a novel anti-phishing approach that involves different features (URL, hyperlink, and text) that have not previously been taken into consideration. The proposed approach is a completely client-side solution. We applied these features to various machine learning algorithms and found that XGBoost attained the best performance. Our main aim is to design a real-time approach that has a high true-negative rate and a low false-positive rate. The results show that our approach correctly filtered the benign webpages, with only a small number of benign webpages incorrectly classified as phishing. For phishing webpage classification, we constructed the dataset by extracting relevant and useful features from benign and phishing webpages.

A desktop machine with a Core™ i7 processor at 3.4 GHz clock speed and 16 GB of RAM was used to execute the proposed anti-phishing approach. Since Python provides excellent library support and has reasonable compile time, the proposed approach is implemented in the Python programming language. The BeautifulSoup library is employed to parse the HTML of the specified URL. The detection time is the time between entering the URL and generating the output. When a URL is entered as a parameter, the approach attempts to fetch all specified features from the URL and the HTML code of the webpage, as discussed in the feature extraction section. The URL is then classified as benign or phishing based on the values of the extracted features. The total execution time of our approach for phishing webpage detection is around 2–3 s, which is quite low and acceptable in a real-time environment. Response time depends on several factors, such as input size, internet speed, and server configuration. Using our dataset D1, we also measured the time taken for training, testing, and detection by the proposed approach (all feature combinations) for webpage classification. The results are given in Table 12 .

To better understand the learning behavior, we also present the classification error and the log loss as a function of the number of iterations run by XGBoost. Log loss, short for logarithmic loss, is a loss function for classification that quantifies the price paid for inaccurate predictions in classification problems. Figure 10 shows the logarithmic loss and the classification error of the XGBoost approach for each epoch on the training and test sets of dataset D1. The figure shows that the learning algorithm converges after approximately 100 iterations.

figure 10

XGBoost learning curve of logarithmic loss and classification error on dataset D1.
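
The two quantities plotted in the learning curve can be written down in a few lines. The sketch below uses the standard definitions of binary log loss and classification error for predicted phishing probabilities; the clipping constant `eps` and the 0.5 decision threshold are assumed defaults, not values stated in the text.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss for predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def classification_error(y_true, p_pred, threshold=0.5):
    """Share of samples whose thresholded prediction is wrong."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)
```

Log loss penalises confident wrong predictions far more heavily than hesitant ones, so it can keep decreasing over iterations even when the classification error has already levelled off.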

Limitations

Although our proposed approach has attained outstanding accuracy, it has some limitations. The first limitation is that the textual features of our phishing detection approach depend on the English language. This may cause errors in the classification results when the suspicious webpage includes a language other than English. About 60.5% of websites use English as their text language 44 . However, our approach also employs URL, noisy HTML, and hyperlink-based features, which are language-independent. The second limitation is that, although the proposed approach uses URL-based features, it may fail to identify phishing websites when phishers use embedded objects (e.g., JavaScript, images, Flash) to hide the textual content and HTML code from anti-phishing solutions. Many attackers use a single server-side script to hide the HTML source code. Based on our experiments, we noticed that legitimate pages usually contain rich textual content and a large number of hyperlinks (at least one hyperlink in the HTML source code). At present, some phishing webpages include malware, for example, a Trojan horse that is installed on the user's system when the user opens the website. Hence, the next limitation of this approach is that it cannot adequately detect attached malware, because it does not read and process content from the webpage's external files, whether they are cross-domain or not. Finally, our approach's training time is relatively long due to the high-dimensional vector generated by the textual content features. However, the trained approach is more accurate than the existing baseline methods.

Conclusion and future work

Phishing website attacks are a massive challenge for researchers, and they continue to show a rising trend in recent years. Blacklist/whitelist techniques are the traditional way to alleviate such threats. However, these methods fail to detect non-blacklisted phishing websites (i.e., 0-day attacks). As an improvement, machine learning techniques are being used to increase detection efficiency and reduce the misclassification ratio. However, some of them extract features from third-party services, search engines, website traffic, etc., which are complicated and difficult to access. In this paper, we propose a machine learning-based approach which can speedily and precisely detect phishing websites using URL and HTML features of the given webpage. The proposed approach is a completely client-side solution, and does not rely on any third-party services. It uses URL character sequence features without expert intervention, and hyperlink specific features that determine the relationship between the content and the URL of a webpage. Moreover, our approach extracts TF-IDF character level features from the plaintext and noisy part of the given webpage's HTML.

A new dataset is constructed to measure the performance of the phishing detection approach, and various classification algorithms are employed. Furthermore, the performance of each category of the proposed feature set is also evaluated. According to the empirical and comparison results from the implemented classification algorithms, the XGBoost classifier with all kinds of features combined provides the best performance. It achieved a 1.39% false-positive rate and 96.76% overall detection accuracy on our dataset, and an accuracy of 98.48% with a 2.09% false-positive rate on a benchmark dataset.

In future work, we plan to include new features to detect phishing websites that contain malware. As noted in the “Limitations” section, our approach cannot detect malware attached to a phishing webpage. Blockchain technology is increasingly popular and appears to be an attractive target for phishing attacks such as phishing scams on the blockchain. Blockchain is an open and distributed ledger that can effectively register transactions between receiving and sending parties, demonstrably and constantly, making it common among investors 45 . Thus, detecting phishing scams in the blockchain environment is a challenge that calls for more research and development. Moreover, detecting phishing attacks on mobile devices is another important topic in this area due to the popularity of smartphones 47 , which has made them a common target of phishing offenses.

Data availability

The dataset generated during the current study is available in the Google Drive repository: https://drive.google.com/file/d/18ZZHsCeMmF9HKTaL_yd41oJ_3Fgk0gWE/view?usp=sharing .

References

RSA. RSA fraud report. https://go.rsa.com/l/797543/2020-07-08/3njln/797543/48525/RSA_Fraud_Report_Q1_2020.pdf (2020) (Accessed 14 January 2021).

APWG. Phishing Attack Trends Reports, 24, November 2020. https://docs.apwg.org/reports/apwg_trends_report_q3_2020.pdf (2020) (Accessed 14 January 2021).

Aljofey, A., Jiang, Q., Qu, Q., Huang, M. & Niyigena, J.-P. An effective phishing detection model based on character level convolutional neural network from URL. Electronics 9 , 1514 (2020).

Dhamija, R., Tygar, J.D., & Hearst, M. Why phishing works. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 22–27 April 2006 , 581–590 (2006).

Jain, A. K. & Gupta, B. B. A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP J. on Info. Security. 9 , 1–11. https://doi.org/10.1186/s13635-016-0034-3 (2016).

Sahingoz, O. K., Buber, E., Demir, O. & Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 117 , 345–357 (2019).

Haruta, S., Asahina, H., & Sasase, I. Visual similarity-based phishing detection scheme using image and CSS with target website finder. IEEE (2017).

Cook, D. L., Gurbani, V. K., & Daniluk, M. Phishwish: A stateless phishing filter using minimal rules. in Financial Cryptography and Data Security , (ed. Gene Tsudik) 324, (Berlin, Heidelberg, Springer-Verlag, 2008).

Jain, A. K. & Gupta, B. B. A machine learning based approach for phishing detection using hyperlinks information. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-018-0798-z (2018).

Li, Y., Yang, Z., Chen, X., Yuan, H. & Liu, W. A stacking model using URL and HTML features for phishing webpage detection. Futur. Gener. Comput. Syst. 94 , 27–39 (2019).

Xiang, G., Hong, J., Rose, C. P. & Cranor, L. CANTINA+: a feature rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14 (2), 1–28. https://doi.org/10.1145/2019599.2019606 (2011).

Zhang, W., Jiang, Q., Chen, L. & Li, C. Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web 20 (4), 797–813 (2017).

Rao, R. S., Vaishnavi, T. & Pais, A. R. CatchPhish: Detection of phishing websites by inspecting URLs. J. Ambient. Intell. Humanized Comput. 11 , 813–825 (2019).

Arachchilage, N. A. G., Love, S. & Beznosov, K. Phishing threat avoidance behaviour: An empirical investigation. Comput. Hum. Behav. 60 , 185–197 (2016).

Wang, Y., Agrawal, R., & Choi, B.Y. Light weight anti-phishing with user whitelisting in a web browser. in Region 5 conference, 2008 IEEE, IEEE , 1–4 (2008).

Han, W., Cao, Y., Bertino, E. & Yong, J. Using automated individual white-list to protect web digital identities. Expert Syst. Appl. 39 (15), 11861–11869 (2012).

Prakash, P., Kumar, M., Kompella, R.R., Gupta, M. Phishnet: Predictive blacklisting to detect phishing attacks. in INFOCOM, 2010 Proceedings IEEE, IEEE , 1–5. https://doi.org/10.1109/INFCOM.2010.5462216 (2010)

Felegyhazi, M., Kreibich, C. & Paxson, V. On the potential of proactive domain blacklisting. LEET 10 , 6–6 (2010).

Sheng, S., Wardman, B., Warner, G., Cranor, L.F., Hong, J., & Zhang, C. An empirical analysis of phishing blacklists. in Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09) (2010).

Qi, L. et al. Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment. IEEE Trans. Ind. Inform. 17 (6), 4159–4167. https://doi.org/10.1109/TII.2020.3012157 (2021).

Liu, Y. et al. A label noise filtering and label missing supplement framework based on game theory. Digital Commun. Netw. https://doi.org/10.1016/j.dcan.2021.12.008 (2022).

Muzammal, M., Qu, Q. & Nasrulin B. Renovating blockchain with distributed databases: An open source system. Future Gener. Comput. Syst. 90 , 105–117. https://doi.org/10.1016/j.future.2018.07.042 (2019).

Liu, Y. et al. Bidirectional GRU networks-based next POI category prediction for healthcare. Int. J. Intell. Syst. https://doi.org/10.1002/int.22710 (2021).

Jain, A. K. & Gupta, B. B. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. https://doi.org/10.1007/s11235-017-0414-0 (2017).

Rao, R. S. & Pais, A. R. Two level filtering mechanism to detect phishing sites using lightweight visual similarity approach. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-019-01637-z (2019).

Jain, A. K. & Gupta, B. B. Two-level authentication approach to protect from phishing attacks in real time. J. Ambient. Intell. Human Comput. https://doi.org/10.1007/s12652-017-0616-z (2017).

Rao, R. S., Umarekar, A. & Pais, A. R. Application of word embedding and machine learning in detecting phishing websites. Telecommun. Syst. 79 , 33–45. https://doi.org/10.1007/s11235-021-00850-6 (2022).

Guo, B. et al. HinPhish: An effective phishing detection approach based on heterogeneous information networks. Appl. Sci. 11 (20), 9733. https://doi.org/10.3390/app11209733 (2021).

Le, H., Pham, Q., Sahoo, D., & Hoi, S.C.H. Urlnet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv: 1802.03162 (2018).

Chatterjee, M., & Namin, A.S. Detecting phishing websites through deep reinforcement learning. in 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC) . 978-1-7281-2607-4/19/$31.00 ©2019 IEEE. (IEE Computer Society, 2019). https://doi.org/10.1109/COMPSAC.2019.10211 .

Xiao, X., Zhang, D., Hu, G., Jiang, Y. & Xia, S. CNN-MHSA: A convolutional neural network and multi-head self- attention combined approach for detecting phishing websites. Neural Netw. 125 , 303–312. https://doi.org/10.1016/j.neunet.2020.02.013 (2020).

Article   PubMed   Google Scholar  

Zheng, F., Yan Q., Victor C.M. Leung, F. Richard Yu, Ming Z. HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection, computers & security. https://doi.org/10.1016/j.cose.2021.102584 (2021)

Mohammad, R. M., Thabtah, F. & McCluskey, L. Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 25 (2), 443–458 (2014).

Ramanathan, V. & Wechsler, H. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Comput. Security. 34 , 123–139 (2013).

Zhang, X., Zhao, J., & LeCun, Y. Character-level convolutional networks for text classification. in Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015 (2015).

Stecanella, B. What is TF-IDF? https://monkeylearn.com/blog/what-is-tf-idf/ . (2019) (Accessed 20 December 2020).

Bansal, S.A. Comprehensive guide to understand and implement text classification in python. https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-andimplement-text-classification-in-python/ (2018) (Accessed 1 July 2020).

Ramesh, G., Krishnamurthi, I. & Kumar, K. S. S. An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. 2014 (61), 12–22 (2014).

Zhang, Y., Hong, J.I., & Cranor, L.F. Cantina: A content- based approach to detecting phishing websites. in Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007 , 639–648 (2007).

Chen, T., & Guestrin, C.: Xgboost: A scalable tree boosting system. in Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining. ACM , 785–794 (2016)

Aljofey, A., Jiang, Q. & Qu, Q. A supervised learning model for detecting Ponzi contracts in Ethereum Blockchain. In Big Data and Security. ICBDS 2021. Communications in Computer and Information Science Vol. 1563 (eds Tian, Y. et al. ) (Springer, 2022). https://doi.org/10.1007/978-981-19-0852-1_52 .

Chapter   Google Scholar  

http://stuffgate.com/stuff/website/ . (Accessed February 2020).

http://www.phishtank.com . (Accessed April 2020).

Usage of content languages for websites. https://w3techs.com/technologies/overview/content_language/all . (2021) (Accessed 19 January 2021).

Iansiti, M. & Lakhani, K. R. The truth about blockchain. Harvard Bus. Rev. 95 (1), 118–127 (2017).

https://github.com/YC-Coder-Chen/Tree-Math/blob/master/XGboost.md . (Accessed September 2021).

Qu, Q., Liu, S., Yang, B. & Jensen, C. S. Efficient top-k spatial locality search for co-located spatial web objects. 2014 IEEE 15th International Conference on Mobile Data Management. 1 , 269–278 (2014).

Download references

Acknowledgements

This research work is supported by the National Key Research and Development Program of China Grant nos. 2021YFF1200104 and 2021YFF1200100.

Author information

Authors and affiliations.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

Ali Aljofey, Qingshan Jiang, Abdur Rasool, Hui Chen & Qiang Qu

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Beijing, 100049, China

Ali Aljofey, Abdur Rasool & Hui Chen

Department of Computer Science, Guangdong University of Technology, Guangzhou, China

Cloud Computing Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China


Contributions

Data curation, A.A. and Q.J.; Funding acquisition, Q.J. and Q.Q.; Investigation, Q.J. and Q.Q.; Methodology, A.A. and Q.J.; Project administration, Q.J.; Software, A.A.; Supervision, Q.J.; Validation, A.R. and H.C.; Writing—original draft, A.A.; Writing—review & editing, Q.J., W.L, Y.W, and Q.Q; All authors reviewed the manuscript.

Corresponding author

Correspondence to Qingshan Jiang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Aljofey, A., Jiang, Q., Rasool, A. et al. An effective detection approach for phishing websites using URL and HTML features. Sci Rep 12 , 8842 (2022). https://doi.org/10.1038/s41598-022-10841-5


Received : 17 December 2021

Accepted : 06 April 2022

Published : 25 May 2022

DOI : https://doi.org/10.1038/s41598-022-10841-5


Open Access

Peer-reviewed

Research Article

Detecting phishing websites using machine learning technique

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliation Department of Computer Science and Information System, College of Applied Sciences, Almaarefa University, Riyadh, Saudi Arabia


  • Ashit Kumar Dutta

PLOS

  • Published: October 11, 2021
  • https://doi.org/10.1371/journal.pone.0258361

In recent years, advancements in Internet and cloud technologies have led to a significant increase in electronic trading, in which consumers make online purchases and transactions. This growth has been accompanied by unauthorized access to users' sensitive information and damage to enterprise resources. Phishing is a common attack that tricks users into accessing malicious content and divulging their information. In terms of website interface and uniform resource locator (URL), most phishing webpages look identical to the actual webpages. Various strategies for detecting phishing websites, such as blacklist and heuristic approaches, have been suggested. However, due to inefficient security technologies, there is an exponential increase in the number of victims. The anonymous and uncontrollable framework of the Internet makes it more vulnerable to phishing attacks. Existing research shows that the performance of phishing detection systems is limited. There is a demand for an intelligent technique to protect users from these cyber-attacks. In this study, the author proposes a URL detection technique based on machine learning approaches. A recurrent neural network method is employed to detect phishing URLs. The researcher evaluated the proposed method with 7900 malicious and 5800 legitimate sites. The experimental outcome shows that the proposed method performs better than recent approaches in malicious URL detection.

Citation: Dutta AK (2021) Detecting phishing websites using machine learning technique. PLoS ONE 16(10): e0258361. https://doi.org/10.1371/journal.pone.0258361

Editor: Zhihan Lv, Qingdao University, CHINA

Received: April 26, 2021; Accepted: September 26, 2021; Published: October 11, 2021

Copyright: © 2021 Ashit Kumar Dutta. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are located within the manuscript and its Supporting information files, and at https://github.com/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques.git .

Funding: No funding received for this research.

Competing interests: No conflict of interest.

1. Introduction

Phishing is a fraudulent technique that uses social and technological tricks to steal customer identification and financial credentials. Social engineering schemes use spoofed e-mails purporting to be from legitimate companies and agencies to lead users to fake websites, where they divulge financial details such as usernames and passwords [ 1 ]. Hackers also install malicious software on computers to steal credentials, often using systems that intercept the usernames and passwords of consumers' online accounts. Phishers use multiple methods, including email, Uniform Resource Locators (URL), instant messages, forum postings, telephone calls, and text messages, to steal user information. Phishing content is structured to resemble the original content and tricks users into accessing it so that their sensitive data can be obtained. The primary objective of phishing is to gain personal information for financial gain or identity theft. Phishing attacks are causing severe economic damage around the world. Moreover, according to the latest phishing pattern studies of the Anti-Phishing Working Group (APWG), most phishing attacks target financial/payment institutions and webmail [ 1 ].

To obtain confidential data, criminals develop unauthorized replicas of a real website and email, typically from a financial institution or other organization dealing with financial data [ 2 – 4 ]. The e-mail is rendered using a legitimate company's logos and slogans. The design and structure of HTML allow images or an entire website to be copied [ 5 ]. This ease of copying is one of the factors behind the rapid growth of the Internet as a communication medium, but it also enables the misuse of brands, trademarks and other company identifiers that customers rely on as authentication mechanisms [ 6 – 8 ]. To trap users, a phisher sends "spoofed" e-mails to as many people as possible. When these e-mails are opened, the customers tend to be diverted from the legitimate entity to a spoofed website.

There is a significant chance of exploitation of user information. For these reasons, phishing is an urgent, challenging, and critical problem in modern society [ 9 , 10 ]. Several recent anti-phishing studies are based on the characteristics of a domain, such as website URLs, website content, a combination of URLs and content, the source code of the website, and a screenshot of the website [ 11 ]. However, there is a lack of useful anti-phishing tools that detect malicious URLs within an organization to protect its users. If malicious code is implanted on a website, hackers may steal user information and install malware, which poses a serious risk to cybersecurity and user privacy. Malicious URLs on the Internet can be identified by analyzing them with Machine Learning (ML) techniques [ 12 , 13 ]. The conventional URL detection approach is based on a blacklist (a set of malicious URLs) obtained from user reports or manual review. The blacklist is used to verify a URL, and the URLs in the blacklist are updated frequently. However, the number of malicious URLs not on the blacklist is increasing significantly. For instance, cybercriminals can use a Domain Generation Algorithm (DGA) to circumvent the blacklist by creating new malicious URLs. Thus, it is almost impossible to maintain an exhaustive blacklist of malicious URLs [ 14 , 15 ], and new malicious URLs cannot be identified with the existing approaches. To resolve the limitations of blacklist-based systems, researchers have suggested machine learning methods to identify malicious URLs [ 16 – 18 ]. Malicious URL detection is treated as a binary classification task with two classes: malicious and benign. Training the ML method consists of finding the best mapping between the d-dimensional feature vector space and the output variable [ 19 – 21 ]. This strategy has a strong generalization capacity to find unknown malicious URLs compared to the blacklist approach.
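The blacklist limitation described above can be illustrated with a minimal sketch. The `Blacklist` class, the `normalize` helper, and the example hosts are hypothetical names introduced here for illustration; they are not part of the study. The sketch only shows that an exact-match lookup cannot flag a newly generated URL until someone reports it.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path + query so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


class Blacklist:
    """Hypothetical exact-match blacklist built from reported URLs."""

    def __init__(self, urls):
        self._entries = {normalize(u) for u in urls}

    def add(self, url: str) -> None:
        self._entries.add(normalize(url))

    def is_listed(self, url: str) -> bool:
        return normalize(url) in self._entries


blacklist = Blacklist(["http://Evil.example/login"])
known = blacklist.is_listed("http://evil.example/login#top")   # already reported
fresh = blacklist.is_listed("http://xkqjz.example/login")      # newly generated domain
```

A classifier trained on URL features does not depend on the URL having been reported before, which is the generalization advantage the paragraph refers to.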

Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) is one of the ML techniques that offers a solution for complex real-time problems [ 22 ]. LSTM allows an RNN to store inputs for a longer period [ 23 ], similar to the concept of storage in a computer. In addition, each feature is processed according to the uniform distribution [ 24 ]. The combination of RNN and LSTM makes it possible to extract a lot of information from a minimal set of data. It therefore helps a phishing detection system identify a malicious site in a shorter time.

In contrast to most previous approaches, this study focuses on identifying malicious URLs within a massive set of URLs. The study therefore proposes a Recurrent Neural Network (RNN) based URL detection approach. The objectives of the study are as follows:

  • To develop a novel approach to detect malicious URLs and alert users.
  • To apply ML techniques in the proposed approach in order to analyze real-time URLs and produce effective results.
  • To implement the concept of RNN, a familiar ML technique that is capable of handling large amounts of data.

The rest of the paper is organized as follows: Section 1 introduces the concept of malicious URLs and the objectives of the study. The background of the study and related literature on URL detection are discussed in Section 2. Section 3 presents the methodology of the research. Results and discussion are presented in Section 4. Finally, Section 5 concludes the study with its future direction.

2. Research background and related works

Phishing attacks are categorized according to the phisher's mechanism for trapping targeted users. Forms of these attacks include keyloggers and DNS poisoning [ 2 ]. The channels used to initiate social engineering include online blogs, short message services (SMS), social media platforms that use Web 2.0 services, such as Facebook and Twitter, peer-to-peer file-sharing services, and Voice over IP (VoIP) systems in which attackers spoof caller IDs [ 3 , 4 ]. Each form of phishing differs slightly in how the process is carried out to defraud the unsuspecting consumer. E-mail phishing attacks occur when an attacker sends potential victims an e-mail with a link that directs them to a phishing website.

2.1 Classification of phishing attack techniques

Phishing websites are a challenge for organizations and individuals because of their similarity to legitimate websites [ 5 ]. Fig 1 presents the multiple forms of phishing attacks. Technical subterfuge refers to attacks such as keylogging, DNS poisoning, and malware. In these attacks, the attacker aims to gain access through a tool or technique: users trust the network while it has in fact been compromised by the attacker. Social engineering attacks include spear phishing, whaling, SMS phishing, vishing, and attacks through mobile applications. In these attacks, attackers focus on a group of people or an organization and trick them into using the phishing URL [ 6 , 7 ]. Apart from these, many new attacks are emerging rapidly as technology evolves.


https://doi.org/10.1371/journal.pone.0258361.g001

2.2 Phishing detection approaches

Phishing detection schemes that detect phishing on the server side are better than phishing prevention strategies and user training systems. These systems can be used either through a web browser on the client or through specific host-site software [ 8 , 9 ]. Fig 2 presents the classification of phishing detection approaches. Heuristic and ML-based approaches rely on supervised and unsupervised learning techniques and require features or labels to learn an environment and make a prediction. Proactive phishing URL detection is similar to the ML approach; however, it processes URLs directly to help a system predict whether a URL is legitimate or malicious [ 11 – 15 ]. Blacklist and whitelist approaches are the traditional methods for identifying phishing sites [ 16 – 21 ]. The exponential growth of web domains reduces the performance of these traditional methods [ 22 – 24 ].


https://doi.org/10.1371/journal.pone.0258361.g002

Existing methods keep reliance on novice internet users to a minimum. Once they identify a phishing website, either the site is made inaccessible or the user is informed of the probability that the website is not genuine. This approach requires minimal user training and no modifications to existing website authentication systems. The performance of detection systems is calculated from the following counts:

  • Number of True Positives (TP): The total number of malicious websites correctly predicted as malicious.
  • Number of True Negatives (TN): The total number of legitimate websites correctly predicted as legitimate.
  • Number of False Positives (FP): The total number of legitimate websites incorrectly predicted as malicious.
  • Number of False Negatives (FN): The total number of malicious websites incorrectly predicted as legitimate.
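As a minimal sketch of how these four counts turn into the usual evaluation measures, the following assumes string labels "malicious" and "legitimate" and treats "malicious" as the positive class. The function and class names are illustrative choices, not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int = 0  # malicious predicted as malicious
    tn: int = 0  # legitimate predicted as legitimate
    fp: int = 0  # legitimate predicted as malicious
    fn: int = 0  # malicious predicted as legitimate


def count_outcomes(y_true, y_pred, positive="malicious"):
    """Tally TP, TN, FP and FN from paired true and predicted labels."""
    counts = ConfusionCounts()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                counts.tp += 1
            else:
                counts.fp += 1
        else:
            if actual == positive:
                counts.fn += 1
            else:
                counts.tn += 1
    return counts


def safe_div(numerator, denominator):
    """Avoid division by zero when a class never occurs."""
    return numerator / denominator if denominator else 0.0


def summarize(c: ConfusionCounts):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


y_true = ["malicious", "malicious", "legitimate", "legitimate", "malicious"]
y_pred = ["malicious", "legitimate", "legitimate", "malicious", "malicious"]
counts = count_outcomes(y_true, y_pred)
metrics = summarize(counts)
```

In this toy example one malicious site is missed (FN) and one legitimate site is wrongly flagged (FP), so precision and recall differ from plain accuracy.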

The accuracy of phishing detection systems is usually evaluated on benchmark datasets. The datasets commonly used to train ML-based techniques are as follows:

2.2.1 Normal dataset.

AlexaRank [ 25 ] is used as a benchmark dataset of benign, normal websites. Alexa is a commercial enterprise that carries out web data analysis. It obtains the browsing habits of users from different sources and analyses them objectively for the reporting and ranking of websites. Researchers use the rankings provided by Alexa to collect a number of highly ranked websites as the normal dataset for testing and classifying websites. Alexa presents the dataset as a raw text file in which each line, in ascending order of rank, gives the rank and domain name of a website.
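A ranked text file of this kind can be read with a few lines of code. The sketch below assumes each line has the form `rank,domain`; the comma separator, the function name `read_ranked_domains`, and the sample domains are assumptions made for illustration, since the text only states that each line holds a rank and a domain name.

```python
import io


def read_ranked_domains(lines, limit=None):
    """Parse 'rank,domain' lines into (rank, domain) tuples, skipping bad lines."""
    domains = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rank_text, _, domain = line.partition(",")
        if not rank_text.isdigit() or not domain:
            continue  # malformed line: no numeric rank or no domain
        domains.append((int(rank_text), domain.lower()))
        if limit is not None and len(domains) >= limit:
            break
    return domains


# Assumed sample content; a real file would be opened with open(path).
sample = io.StringIO("1,first.example\n2,second.example\n\nbad line\n3,Third.Example\n")
ranked = read_ranked_domains(sample)
```

Passing `limit` keeps only the top-ranked entries, which matches the practice of collecting highly ranked sites as the benign set.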

2.2.2 Phishing dataset.

Phishtank is a well-known phishing website benchmark dataset, available at https://phishtank.org/ . It is a community-based framework that tracks phishing sites. Users and third parties submit suspected phishing sites, which other users then vote on to verify whether they are genuine phishing sites. Thus, Phishtank offers a phishing website dataset in real time. Researchers use Phishtank's website to build data collections for testing and detecting phishing websites. The Phishtank dataset is available in Comma Separated Value (CSV) format, with one record per line of the file. The details provided include ID, URL, time of submission, verification status, online status and target URLs.
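A CSV export with these fields can be filtered with Python's standard `csv` module. The column names used below (`phish_id`, `url`, `submission_time`, `verified`, `online`, `target`) and the "yes"/"no" values are assumptions about the file layout, and the sample rows are invented for illustration only.

```python
import csv
import io

# Assumed header and invented rows; a real export may use different columns.
SAMPLE_CSV = """phish_id,url,submission_time,verified,online,target
101,http://login.bad.example/a,2020-06-01T10:00:00+00:00,yes,yes,Other
102,http://old.bad.example/b,2020-06-02T11:00:00+00:00,yes,no,Other
103,http://maybe.bad.example/c,2020-06-03T12:00:00+00:00,no,yes,Other
"""


def load_verified_online(csv_file):
    """Return URLs of entries that are both verified and still online."""
    urls = []
    for row in csv.DictReader(csv_file):
        if row.get("verified") == "yes" and row.get("online") == "yes":
            urls.append(row["url"])
    return urls


phishing_urls = load_verified_online(io.StringIO(SAMPLE_CSV))
```

Filtering on verification and online status is one plausible way to keep only confirmed, reachable phishing pages in the malicious set.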

2.3 Research questions

The researcher framed the Research Questions (RQ) according to the objective of the study and its background. They are as follows:

  • RQ1: How do URL detectors identify phishing URLs or websites?
  • RQ2: How can ML methods be applied to classify malicious and legitimate websites?
  • RQ3: How can the performance of a URL detector be evaluated?

RQ1 and RQ2 help to develop an ML-based phishing detection system for securing a network from phishing attacks, while RQ3 specifies the importance of evaluating the performance of a phishing detection technique. To address RQ1, the authors reviewed recent literature related to URL detection using Artificial Intelligence (AI) techniques. The remainder of this section presents the studies in detail with Table 2.

The authors of study [ 2 ] proposed a URL-based anti-phishing machine learning method. They extracted 14 URL features that distinguish phishing websites from legitimate websites and used them to classify a website as malicious or legitimate. More than 33,000 phishing and valid URLs were used to train Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The phishing detection method focused on the learning process. Their experiment reached over 90% precision when websites were detected with the SVM classifier.
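To make the idea of URL features concrete, the sketch below extracts a handful of lexical properties from a URL string. The 14 features used in study [ 2 ] are not listed in this text, so the eight features here are hypothetical examples of the kind commonly used, and `lexical_features` is a name chosen for illustration.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features of a URL (hypothetical set)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        # A dotted-quad host is a common phishing indicator.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parts.scheme == "https",
    }


ip_features = lexical_features("http://192.168.0.1/login")
named_features = lexical_features("https://secure-login.example.com")
```

A dictionary like this can be turned into a numeric row and passed to any standard classifier, such as the SVM or Naïve Bayes models mentioned above.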

Study [ 3 ] explored multiple machine learning and deep learning methods to detect phishing URLs by analyzing various URL components. The authors addressed various supervised learning methods for identifying phishing URLs based on lexical features, WHOIS properties, PageRank, traffic rank information and page importance properties. They studied how the volume of training data influences the accuracy of classifiers. The research covers Support Vector Machine (SVM), K-NN, random forest classification (RFC) and Artificial Neural Network (ANN) techniques for classification.

Study [ 4 ] carried out a comparative study of machine learning algorithms based on their output with and without feature selection. Experiments were carried out on a phishing dataset with 30 features, comprising 4898 phishing and 6157 benign web pages. Several ML methods were used to yield a better outcome, and a feature selection method was subsequently employed to increase model performance. The random forest algorithm achieved the highest accuracy both before and after feature selection, and dramatically increased building time. The results of the experiment showed that using the feature selection approach with machine learning algorithms can boost the effectiveness of classification models for phishing detection without reducing their performance.

In study [ 5 ], the authors proposed URLNet, a CNN-based deep neural network for URL detection. They argued that current methods often use Bag-of-Words (BoW)-like features and suffer from some essential limitations, such as the failure to detect sequential concepts in a URL string, the lack of automated feature extraction, and the failure to handle unseen features in real-time URLs. They developed character-level CNNs and word-level CNNs and configured the network accordingly. In addition, they suggested advanced techniques that were particularly effective for handling rare words, a problem that commonly exists in malicious URL detection tasks. This allows URLNet to learn embeddings and use subword information from unseen words during the testing phase.

The authors in [ 6 ] introduced a method for detecting phishing URLs with innovative lexical features and a blacklist. They collected a list of URLs from URL repositories using a crawler and extracted 18 common lexical features. They implemented advanced ML techniques consisting of under/oversampling and classification. The automated approaches outperformed other existing ML approaches. The study focused on content features rather than lexical features, which made it difficult to implement in real-world environments. The experimental results were better than those of existing classification algorithms.

In study [ 7 ], the author investigated how well phishing URLs can be classified within a set of URLs that also contains benign URLs. The study discussed randomisation, feature engineering, and feature extraction using host-based lexical analysis and statistical analysis. For the comparative study, several classifiers were applied, and the results across the different classifiers were found to be almost consistent. The authors argued that they proposed a convenient approach for extracting features from URLs using simple standard words. Experimenting with more features could lead to better results. The dataset used in the study includes some older URLs, so performance may be reduced.

The authors of [ 8 ] suggested a high-precision URL detector for phishing attacks. They argued that the technique could be scaled to various sizes and proactively adapted. A limited data collection of 572 cases, covering both legitimate and malicious URLs, was employed. The characteristics were extracted and then weighted as cases for use in the prediction process. The test results were highly reliable with and without online phishing threats. A genetic algorithm (GA) was used to improve the accuracy. Table 1 presents the outcome of the comparative study of the literature.


https://doi.org/10.1371/journal.pone.0258361.t001

The authors of [ 9 ] developed a detection approach for classifying malicious and normal webpages. The outcome of this study indicated that the true positive rate was higher than the false positive rate. In another study [ 10 ], the authors proposed a Convolutional Neural Network (CNN) to detect phishing URLs. The researchers employed a sequential pattern to capture the URL information. The approach achieved accuracies of 98.58%, 95.46%, and 95.22% on the benchmark datasets.

In study [ 11 ], the authors employed a generative adversarial network for classifying URLs and bypassing blacklist-based phishing detectors. In addition, the researchers argued that the system can bypass both simple and novice ML detection techniques.

Based on the related work and its performance, the authors selected two studies for comparison with the proposed URL detector: those of Hung Le et al. [ 5 ] and Hong J. et al. [ 6 ]. These studies were selected because they applied deep learning methods and achieved an average accuracy of 90%.

3. Research methodology

RQ2 asks how an ML method can be employed to classify a URL as malicious or legitimate. To present a solution, the authors propose the framework shown in Fig 3 for classifying URLs and identifying phishing URLs.


https://doi.org/10.1371/journal.pone.0258361.g003


During the training phase, the RNN stores the properties Pm and Pl to learn the environment. Each URL in the dataset, from Phishtank [ 23 ] and from the crawler, is used to train the model. Algorithms 3.1 and 3.2 present the steps involved in data collection and pre-processing, respectively. Algorithms 3.3 and 3.4 show the training phase and testing phase, respectively. The training phase uses the labels to train the RNN to distinguish malicious and legitimate URLs. The testing phase of the proposed RNN model then receives each URL and predicts its type. The RNN (LSTM) was developed with Python 3.0 in a Windows 10 environment on an i7 processor.

The LSTM model is an effective predictive model. It generates an output based on an arbitrary number of steps. Five essential components enable the model to handle long-term and short-term data.

Cell state (CS): The cell space that accommodates both long-term and short-term memories.

Hidden state (HS): The output state information used to classify a URL, computed from the current data, the previous hidden state and the current cell input. The hidden state is used to retrieve both short-term and long-term memory in order to make a prediction.

Input gate (IT): Determines how much information flows into the cell state.

Forget gate (FT): Determines how much data flows from the current input and the past cell state into the present cell state.

Output gate (OT): Determines how much information flows into the hidden state.

3.1. Input gate

The input gate identifies which input values are used to alter the memory. The sigmoid function decides which values to let through, with outputs between 0 and 1, and the tanh function weights the values passed through, rating their significance from -1 to 1. Eqs 5 and 6 represent the input gate and cell state, respectively. W_n is the weight, HT_{t-1} is the previous hidden state, x_t is the input, and b_n is the bias vector, which is learnt during the training phase. CT is calculated using the tanh function.
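For reference, the textbook form of the LSTM input gate and candidate cell state is given below, written with the symbols used in this section. This is the standard formulation and is assumed, not confirmed, to correspond to Eqs 5 and 6. The subscripted weights and biases W_i, W_c, b_i, b_c and the tilde notation for the candidate cell state are labels chosen here; \sigma denotes the sigmoid function and [HT_{t-1}, x_t] the concatenation of the previous hidden state and the current input.

```latex
\begin{aligned}
IT_t &= \sigma\left(W_i\,[HT_{t-1},\, x_t] + b_i\right) \\
\widetilde{CT}_t &= \tanh\left(W_c\,[HT_{t-1},\, x_t] + b_c\right)
\end{aligned}
```

The sigmoid output IT_t lies in (0, 1) and scales how much of the candidate, whose entries lie in (-1, 1), is written into memory.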


3.2. Forget gate

The forget gate determines which information is to be discarded from the memory block. It is described by the sigmoid function. In Eq 7, the previous hidden state (HT_{t-1}) and the current input (x_t) are examined, and a number between 0 and 1 is output for each number in the cell state CT_{t-1}.
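In the standard LSTM formulation, the forget gate and the resulting cell state update can be written as follows. This is assumed, not confirmed, to match Eq 7. W_f and b_f are labels chosen here for the forget-gate weight and bias, IT_t is the input gate output, \widetilde{CT}_t is the tanh-based candidate cell state, and \odot denotes element-wise multiplication.

```latex
\begin{aligned}
FT_t &= \sigma\left(W_f\,[HT_{t-1},\, x_t] + b_f\right) \\
CT_t &= FT_t \odot CT_{t-1} + IT_t \odot \widetilde{CT}_t
\end{aligned}
```

An entry of FT_t close to 0 discards the corresponding entry of CT_{t-1}, while an entry close to 1 keeps it.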


3.3. Output gate

The input and the memory of the block are used to determine the output. The sigmoid function decides which values to let through, with outputs between 0 and 1. The tanh function weights the values passed through, rating their importance from -1 to 1, and the result is multiplied by the output of the sigmoid.
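The three gates can be combined into a single time step. The sketch below is a plain-Python rendering of the standard LSTM cell, not the study's implementation; the parameter names (`W_i`, `W_f`, `W_o`, `W_c` and the matching biases) and the helper functions are assumptions introduced for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W·v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    concat = h_prev + x_t  # concatenation [HT_{t-1}, x_t]
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]    # input gate
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]    # forget gate
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]    # output gate
    g_t = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]  # candidate
    # Keep part of the old memory and add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Output gate scales the tanh-squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy parameters: hidden size 1, input size 1, all weights and biases zero.
zero_params = {name: [[0.0, 0.0]] for name in ("W_i", "W_f", "W_o", "W_c")}
zero_params.update({name: [0.0] for name in ("b_i", "b_f", "b_o", "b_c")})
h_t, c_t = lstm_step([1.0], [0.0], [2.0], zero_params)
```

With all-zero parameters every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved; this makes the role of each gate easy to check by hand.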

Fig 4 represents the processes involved in data collection. Data repositories such as Phishtank and Crawler are used to collect malicious and benign URLs. A crawler was developed to collect URLs from the AlexaRank website, which publishes a ranked set of URLs to support the research community. In this study, the crawler collected 7658 URLs from AlexaRank between June 2020 and November 2020, and 6042 URLs were collected from the Phishtank datasets. During data collection, the extracted data are stored in W and returned as W1, with N denoting the number of URLs.

https://doi.org/10.1371/journal.pone.0258361.g004

Fig 5 illustrates the steps of data pre-processing. Here, url denotes one element of the URL dataset. The raw data is pre-processed by scanning each URL in the dataset. A set of functions was developed to remove irrelevant data. Finally, D2 is the set of features returned by the pre-processing activity.

https://doi.org/10.1371/journal.pone.0258361.g005

Fig 6 represents the data transformation process, in which each feature of D2 is converted into a vector. Each item in D2 is processed using the GenerateVectors function. The resulting vector, “Num”, is returned by the transformation process and passed as an input to the training phase.
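The pre-processing and transformation steps can be illustrated with a small character-level encoder. This is a hedged sketch only: the paper does not list its cleaning functions or the internals of GenerateVectors, so the names `preprocess` and `generate_vector`, the character vocabulary, the lowercasing rule, and the fixed length of 32 are all assumptions.

```python
import string
from urllib.parse import urlsplit

# Assumed vocabulary: lowercase letters, digits, and common URL punctuation.
VOCAB = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
# Index 0 is reserved for both padding and unknown characters.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(VOCAB)}


def preprocess(url):
    """Strip whitespace, lowercase, and drop the scheme and a leading 'www.'."""
    url = url.strip().lower()  # assumption: case is not treated as a feature
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    rest = parts.path + ("?" + parts.query if parts.query else "")
    return host + rest


def generate_vector(text, max_len=32):
    """Map each character to an integer id, then truncate or zero-pad to max_len."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


cleaned = preprocess("  HTTP://WWW.Example.com/Login?id=1 ")
vector = generate_vector(cleaned)
```

A fixed-length integer vector of this kind is the usual input shape for an embedding layer followed by an LSTM; other encodings, such as word-level tokens, would work equally well in principle.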

https://doi.org/10.1371/journal.pone.0258361.g006

Fig 7 shows the processes involved in the training phase. Each URL is processed using its vector. LSTMLib is one of the functions in the LSTM that predicts an output from the vectors. The library is updated with the extracted features, which contain the necessary data related to malicious and normal web pages. An iterative process then scans each vector and suspicious URL and generates a final outcome. Lastly, op is the prediction returned by the proposed method during the training phase.

https://doi.org/10.1371/journal.pone.0258361.g007

Fig 8 shows the testing phase of the proposed URL detection. The proposed method compares each element from the LSTMMemory function with the vector of a URL and decides an output. Here, f is the feedback element collected from the crawler, which indicates the page rank of a website. The page rank indicates the value of a website, and the lowest-ranked websites are declared malicious or suspicious in order to alert users.

https://doi.org/10.1371/journal.pone.0258361.g008

Fig 9 shows a snippet of the epoch settings in the training phase. The epoch value sets the number of complete passes over the training data and therefore largely determines the execution time of a method. The learning rate can be tuned to improve the performance of a method.

https://doi.org/10.1371/journal.pone.0258361.g009

4. Results and discussions

The proposed method (LURL) was developed in Python 3.0 with the support of the scikit-learn and NumPy packages. The existing URL detectors were also constructed in order to evaluate the performance of LURL. Table 2 shows the parameter settings of the methods during the training and testing phases. Learning rate, maximum epoch, batch size, and decay are the parameters that govern how the methods are executed during training. Threshold values and vocabulary size are the important parameters in the testing phase for generating results from the test dataset.

https://doi.org/10.1371/journal.pone.0258361.t002

The methods are evaluated in terms of learning rate, accuracy, and precision. Table 3 presents the learning rate of the methods during the training phase. The performance of the three detectors during the training phase is similar, which indicates that the learning ability of the methods is comparable; the authors maintained similar parameters for all detectors. However, the proposed method, LURL, produced a better outcome than Hung Le et al. [ 5 ] and Hong J. et al. [ 6 ]. LURL covered 94.3 percent of the data at a learning rate of 5.0, whereas Hung Le et al. and Hong J. et al. reached 93.8 and 92.8 percent, respectively. The learning rate of LURL is reasonable compared with the other two methods. The results indicate that the ML-based methods were able to scan an average of 84% of the dataset to learn the environment at a rate of 1.0.

https://doi.org/10.1371/journal.pone.0258361.t003

Table 4 shows the learning rate of the methods for the Crawler dataset. As discussed in section 3, the Crawler dataset was generated with the support of the AlexaRank dataset. It contains a larger number of normal URLs than malicious URLs. The intention in employing Crawler is to teach the methods to predict legitimate URLs. It is very difficult to classify a website without analysing its content, because a phishing site closely resembles a legitimate website. Therefore, the methods need to learn the differences between legitimate and malicious websites. Based on the outcome, the performance of all detectors is similar. As with the Phishtank dataset, all three methods consumed an average of 86% of the data at a rate of 1.0. The reason for the faster rate is that an RNN can read numeric data faster than images [ 12 ].

https://doi.org/10.1371/journal.pone.0258361.t004

There is a demand for an effective phishing detection system to secure a network or an individual’s privacy and data. RQ3 asks how the efficiency of URL detectors can be measured, and it is addressed by evaluating the proposed method using the learning rate, accuracy, and F1 score. Tables 5 and 6 present the results. Table 5 shows the accuracy of the detectors on the Phishtank and Crawler datasets. LURL produced an average of 97.4% and 96.8% for the Phishtank and Crawler datasets, respectively. Hung Le et al. and Hong J. et al. reached averages of 93.8, 94.1, 96.7, and 93.6 for the Phishtank and Crawler datasets. The performance of LURL is therefore better than that of the other URL detectors. Fig 10 illustrates the corresponding graph of Table 4 . It shows that LURL generated its output in less time than the other predictors.

https://doi.org/10.1371/journal.pone.0258361.g010

https://doi.org/10.1371/journal.pone.0258361.t005

https://doi.org/10.1371/journal.pone.0258361.t006

Finally, Table 6 compares the F1-scores of the URL detectors. As presented in section 2, TP and TN indicate the malicious and legitimate URLs, respectively. Precision and recall are calculated from the TP, TN, FP, and FN counts, and the F1-measure is computed from these two values. It indicates the retrieval ability of a URL detector. The outcome indicates that the proposed URL detector, LURL, is superior to the other two URL detectors, which is attributed to the memory capability of the LSTM. Fig 11 shows the F1-score against the computation time. LURL achieved an F1-score of 96.4 in 4.62 seconds on the Phishtank dataset, whereas Hung Le et al. and Hong J. et al. achieved 95.8 and 92.7 in 3.87 and 5.23 seconds, respectively. For the Crawler dataset, the F1-score of LURL is 94.8, whereas Hung Le et al. and Hong J. et al. reached 95.6 and 95.3, respectively.
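The relationship between the confusion-matrix counts and the reported metrics can be made concrete with a short calculation. The sketch below uses the standard definitions of precision, recall, accuracy, and F1; the function name `f1_metrics` and the example counts are hypothetical and are not taken from the paper's tables.

```python
def f1_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 90 phishing URLs caught, 20 missed,
# 80 legitimate URLs passed, 10 wrongly flagged.
metrics = f1_metrics(tp=90, tn=80, fp=10, fn=20)
```

Note that TN enters only the accuracy calculation; precision, recall, and F1 depend on TP, FP, and FN alone, which is why F1 is often preferred when legitimate URLs greatly outnumber malicious ones.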

https://doi.org/10.1371/journal.pone.0258361.g011

5. Conclusion

This study framed phishing detection as a classification problem, in which websites are automatically categorized into a predetermined set of class values based on several features and the class variable. ML-based phishing detection techniques depend on website functionalities to gather information that can help classify websites and detect phishing sites. The problem of phishing cannot be eradicated; nonetheless, it can be reduced in two ways: by improving targeted anti-phishing procedures and techniques, and by informing the public on how fraudulent phishing websites can be detected and identified. To combat the ever-evolving complexity of phishing attacks and tactics, ML anti-phishing techniques are essential. The authors employed the LSTM technique to identify malicious and legitimate websites. A crawler was developed that crawled 7900 URLs from the AlexaRank portal, and the Phishtank dataset was also employed to measure the efficiency of the proposed URL detector. The outcome of this study reveals that the proposed method presents superior results compared with the existing deep learning methods. A total of 7900 malicious URLs were detected using the proposed URL detector. It achieved better accuracy and F1-score in a limited amount of time. The future direction of this study is to develop an unsupervised deep learning method to generate insight from a URL. In addition, the study can be extended to generate an outcome for a larger network and to protect the privacy of an individual.

Supporting information

https://doi.org/10.1371/journal.pone.0258361.s001

https://doi.org/10.1371/journal.pone.0258361.s002

Acknowledgments

The author would like to acknowledge the support provided by AlMaarefa University while conducting this research work.

  • 1. Anti-Phishing Working Group (APWG), https://docs.apwg.org//reports/apwg_trends_report_q4_2019.pdf
  • 4. Gandotra E., Gupta D, “An Efficient Approach for Phishing Detection using Machine Learning”, Algorithms for Intelligent Systems , Springer, Singapore, 2021, https://doi.org/10.1007/978-981-15-8711-5_12 .
  • 5. Hung Le, Quang Pham, Doyen Sahoo, and Steven C.H. Hoi, “URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection”, Conference’17 , Washington, DC, USA, arXiv:1802.03162, July 2017.
  • 6. Hong J., Kim T., Liu J., Park N., Kim SW, “Phishing URL Detection with Lexical Features and Blacklisted Domains”, Autonomous Secure Cyber Systems . Springer, https://doi.org/10.1007/978-3-030-33432-1_12 .
  • 7. J. Kumar, A. Santhanavijayan, B. Janet, B. Rajendran and B. S. Bindhumadhava, “Phishing Website Classification and Detection Using Machine Learning,” 2020 International Conference on Computer Communication and Informatics (ICCCI) , Coimbatore, India, 2020, pp. 1–6, 10.1109/ICCCI48352.2020.9104161.
  • 11. AlEroud A, Karabatis G. Bypassing detection of URL-based phishing attacks using generative adversarial deep neural networks. In: Proceedings of the Sixth International Workshop on Security and Privacy Analytics 2020 Mar 16 (pp. 53–60).
  • 13. J. Anirudha and P. Tanuja, “Phishing Attack Detection using Feature Selection Techniques”, Proceedings of International Conference on Communication and Information Processing (ICCIP) , 2019, http://dx.doi.org/10.2139/ssrn.3418542
  • 14. Wu CY, Kuo CC, Yang CS, “A phishing detection system based on machine learning”, In: 2019 International Conference on Intelligent Computing and its Emerging Applications (ICEA) , pp 28–32, 2019.
  • 16. Srinivasa Rao R, Pais AR, “Detecting phishing websites using automation of human behavior”, In: Proceedings of the 3rd ACM workshop on cyber-physical system security , ACM, pp 33–42, 2017.
  • 21. Gull S and SA Parah, “Color image authentication using dual watermarks”, In: 2019 fifth international conference on image information processing (ICIIP) , pp 240–245, 2019.
  • 25. AlexaRank, https://www.alexa.com/siteinfo , Accessed: 2020–06–01

September 23, 2024

Quishing 2.0: QR Code Phishing Evolves with Two-Step Attacks and SharePoint Abuse

  • Peleg Cabra, Product Marketing Manager

Quishing, or QR code phishing, is quickly advancing as threat actors adapt their tactics to bypass email security QR scanners. Quishing has become one of the fastest growing email threat vectors. By introducing QR codes into phishing campaigns, threat actors added a layer of evasion that makes attacks harder for traditional security solutions to detect. But now that many cybersecurity vendors have followed Perception Point’s lead in implementing QR scanners, “Quishing 2.0” attacks have emerged, and they’re more evasive than ever.

One Not-So-Small Step for Quishing: New Evasion Method

In a new quishing campaign discovered by Perception Point’s security research team, threat actors took QR code phishing to a whole new level. This highly complex attack exploits widely trusted platforms like SharePoint and online QR scanning services, combining them in a way that evades almost every email security solution today (almost).

Quishing 2.0 Diagram

To fully understand the complexity of Quishing 2.0 attacks, we will first walk through the target’s journey before diving into the innovative evasion tactics adopted by the threat actors.

From Email to Exploit: A Walkthrough of a Quishing 2.0 Attack

Email Message

The target receives an email appearing to be from a real business. In some cases, the attacker even spoofs the business’s domain and impersonates a business partner the target is familiar with. The subject line and attached PDF file suggest it’s a Purchase Order (PO).

PDF Attachment

Inside the PDF document, the target sees a large QR code along with instructions to scan it in order to view the full purchase order. The PDF includes the physical address of the impersonated business, further reinforcing its credibility.

QR Scanning Service (Me-QR)

When the target scans the QR code, they are redirected to me-qr.com, a legitimate QR code creation and scanning service. The page indicates that the QR code was successfully scanned, with a button labeled “Skip Advertisement.” This step adds another layer of authenticity, as it uses a trusted service. We’ll touch on this step more when discussing the attackers’ tactics.

SharePoint Folder

Clicking the “Skip advertisement” button leads the recipient to a real SharePoint page, seemingly connected to the impersonated business. Based on the URL, we believe this Microsoft account was created using the spoofed domain that was used to deliver the email. This is where the attack takes full advantage of trusted services to mask malicious intent.

The recipient sees what appears to be a .url file (basically a link in the form of a file) with the name of the PO. The file was uploaded to SharePoint by a user with the same name as the one who emailed the target. 

.url File and M365 Phishing Page

If the recipient clicks on the file in SharePoint, they are redirected to the final payload, and the first truly malicious step: a fake OneDrive page. The Microsoft 365 login form, designed to steal the victim’s credentials, appears over what look like files of scanned invoices from the PO in the background.

The Good, the Bad and the SharePoint: Quishing Inception

Before delivering the email to the target, the threat actors came up with an innovative and unique evasion technique that hides the malicious payload behind multiple layers of legitimacy. Quishing 2.0 actually involves two QR codes! 

Let’s break it down.

The first QR code, to be used in the “back-end” of the attack, leads to the legitimate SharePoint page (associated with a compromised or spoofed business account) that will lead to the malicious phishing page. For the sake of simplicity, let’s call this one the “Bad” QR code. So far, this is pretty much the average quishing scheme, but instead of pointing the QR code directly at the phishing page, the threat actors add a “legitimate” hop (SharePoint).

The attacker then takes the “Bad” QR code and uploads it to an online QR scanning service such as me-qr.com. Services of this kind allow users to submit an image (JPEG/PNG) containing a QR code and extract the URL for them.

upload or scan QR code

Before taking the user to the destination URL, the scanner service first shows an advertisement and a prompt saying the QR code was scanned successfully. Clicking a button to skip the ad redirects the user to the URL behind the QR code.

The threat actors took the very URL from this result/ad page and generated a second QR code from it: the “Clean” QR code. This is the “front-end” QR code, the one that targets will ultimately see and interact with on the PDF attachment.

The “Clean” QR code links to the result page on me-qr.com. Together with the prompt about the successful scan, it appears completely legitimate to the target and bypasses almost any initial email security scan.

The sophistication of Quishing 2.0 attacks lies in the multiple evasion tactics employed by the threat actors. Using two QR codes, a legitimate scanning service, and a real SharePoint account, they create multiple layers of legitimacy that most email security systems would not follow through. Each element (the clean QR code that appears harmless, the trusted QR scanning service that adds a layer of authenticity, and the use of a widely trusted platform) combines to make the email appear credible to both users and security solutions. However, Perception Point’s Advanced Email Security was still able to prevent it before any harm was done. But how?

Decoding Quishing 2.0 with Dynamic URL Analysis and Computer Vision

Dynamic URL Analysis plays a crucial role in breaking down the layers of Quishing 2.0 and identifying the actual malicious content, even when the target is sent through legitimate services like QR scanners and SharePoint. Perception Point goes beyond surface-level inspection by tracking the journey and scanning every destination URL in real time before the email gets to the user.

But what truly sets Advanced Email Security apart is our Advanced Object Detection Model, designed to combat evasive, two-step phishing techniques like this. Using computer vision, the model analyzes the content of webpages just as a user would see them, detecting clickable elements such as buttons (“skip advertisement”) or login forms. Paired with the Recursive Unpacker, Perception Point automatically clicks through these elements to trace the full path of the attack, uncovering the malicious payloads hidden beneath layers of seemingly legitimate services and QR codes.
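The general idea of following every clickable element until a suspicious destination appears can be sketched with a simulated link graph. This is an illustrative toy, not Perception Point's implementation: the `Page` record, the `trace_chain` function, the example hosts, and the "login form on an untrusted host" rule are all assumptions, and a real system would render each page and use computer vision to find the clickable elements.

```python
from collections import namedtuple

# Hypothetical page record: host name, outgoing clickable links, login-form flag.
Page = namedtuple("Page", ["host", "links", "has_login_form"])

# Simulated three-hop chain: scanner result page -> shared folder -> fake login.
START = "https://qr-scanner.example/result/1"
PAGES = {
    START: Page("qr-scanner.example", ["https://files.example/po-folder"], False),
    "https://files.example/po-folder": Page(
        "files.example", ["https://login-portal.example/onedrive"], False
    ),
    "https://login-portal.example/onedrive": Page("login-portal.example", [], True),
}

# Assumed allow-list of hosts that may legitimately show a Microsoft login form.
TRUSTED_LOGIN_HOSTS = {"login.microsoftonline.com"}


def trace_chain(start_url, pages, max_depth=5):
    """Follow links depth-first; return the path to the first untrusted login form."""
    stack = [(start_url, [start_url])]
    seen = set()
    while stack:
        url, path = stack.pop()
        if url in seen or len(path) > max_depth:
            continue
        seen.add(url)
        page = pages.get(url)
        if page is None:
            continue  # unknown destination: nothing to inspect in this simulation
        if page.has_login_form and page.host not in TRUSTED_LOGIN_HOSTS:
            return path
        for link in page.links:
            stack.append((link, path + [link]))
    return None


suspicious_path = trace_chain(START, PAGES)
```

The depth limit matters in this sketch: a scanner that stops after the first or second hop sees only the legitimate services, which is exactly the gap the two-step attack relies on.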

Using this multi-layered detection stack, Perception Point provides the most robust, real-time protection against quishing attacks of all types. 

Phishing Attacks in Social Engineering: A Review

  • August 2023

Kofi Sarpong Adu-Manu at University of Ghana

  • University of Ghana

Richard Kwasi Ahiable at University of Ghana

IMAGES

  1. (PDF) Study on Phishing Attacks

  2. (PDF) Phishing Attacks and Its Preventions

  3. (PDF) Towards Intelligent User Interfaces to Prevent Phishing Attacks

  4. (PDF) Detection of phishing attacks

  5. (PDF) Phishing Detection Using Machine Learning Based on URL's

  6. (PDF) PhishStorm: Detecting Phishing With Streaming Analytics

VIDEO

  1. PDF Phishing Silent Exploit with Calina Phishing PDF Builder! 1

  2. What You Need To Know About Phishing

  3. How to Write a Scientific Research Paper

  4. Phishing & Anti-Phishing (2-way Authentication System) by Rockey Killer (h4ck3r.in)

  5. PHISHING

  6. DMARC Training-6: Information for Domain Owners and 3rd Parties

COMMENTS

  1. Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    However, info-security professionals reported a higher frequency of all types of social engineering attacks year-on-year according to a report presented by Proofpoint. Spear phishing increased to 64% in 2018 from 53% in 2017, Vishing and/or SMishing increased to 49% from 45%, and USB attacks increased to 4% from 3%.

  2. A Systematic Literature Review on Phishing and Anti-Phishing Techniques

    h to find out different types of phishing and anti-phishing techniques. Research study evaluated that spear phishing, Email Spoofing, Email Manipulation and phone phishing are the most commonly used phishing techniques. On the other hand, according to the SLR, machine learning approaches have the highest accuracy of preventing.

  3. (PDF) Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    This article proposes a new detailed anatomy of phishing which involves attack phases, attacker's types, vulnerabilities, threats, targets, attack mediums, and attacking techniques. Moreover ...

  4. Phishing Attack, Its Detections and Prevention Techniques

    anti-phishing toolbars, machine learning, and artificial intelligence are among the technologies deployed to detect and prevent phishing attacks. Continuous research and innovation, along with ...

  5. A systematic literature review on phishing website detection techniques

    Phishing is a fraud attempt in which an attacker acts as a trusted person or entity to obtain sensitive information from an internet user. In this Systematic Literature Survey (SLR), different phishing detection approaches, namely Lists Based, Visual Similarity, Heuristic, Machine Learning, and Deep Learning based techniques, are studied and compared.

  6. (PDF) A COMPREHENSIVE STUDY OF PHISHING ATTACKS AND ...

    Abstract. This research paper presents a comprehensive study of phishing attacks and their countermeasures. Phishing attacks are a major threat to individuals and organizations worldwide, and ...

  7. All About Phishing Exploring User Research through a Systematic

    of human-centered research on phishing, as well as justify our focus on the current state of user studies in the phishing literature. Figure 1: Word cloud depicting relative representation of conference publication venues in our data set of 51 papers. 3. Study Methodology. Our systematic literature review focused on published research on phishing.

  8. Mitigation strategies against the phishing attacks: A systematic

    The paper presents the outcomes of SLR conducted while focusing on four research questions. The paper advocates that technology-only solutions are never going to be enough to protect against attacks targeted toward human users, therefore, there is a need to consider the role and abilities of human users in the development of anti-phishing ...

  9. Defending against Phishing Attacks- Taxonomy of Methods, Current Issues

    Therefore, the aim of this paper is to look at the current phishing literature to determine seriousness of the problem. To give a brief overview of evolution of research in this field as well as current trends in phishing and its remedies to provides a view of the issues and challenges that are still prevailing in this area of research.

  10. A survey of phishing attack techniques, defence mechanisms and open

    Therefore, this paper presents a detailed analysis of phishing attack methods and defense techniques. This survey is presented in five folds. First, we discuss in detail the lifecycle of phishing attack, its history, and motivation behind this attack. Second, we present various distribution methods that are used to spread phishing attacks.

  11. Human Factors in Phishing Attacks: A Systematic Literature Review

    Phishing is the fraudulent attempt to obtain sensitive information by disguising oneself as a trustworthy entity in digital communication. It is a type of cyber attack often successful because users are not aware of their vulnerabilities or are unable to understand the risks. This article presents a systematic literature review conducted to ...

  12. (PDF) Study on Phishing Attacks

    Phishing is one such type of methodologies which are used to acquire the information. Phishing is a cyber crime in which emails, telephone, text messages, personally identifiable information ...

  13. A comprehensive survey of AI-enabled phishing attacks detection

    In recent times, a phishing attack has become one of the most prominent attacks faced by internet users, governments, and service-providing organizations. In a phishing attack, the attacker(s) collects the client's sensitive data (i.e., user account login details, credit/debit card numbers, etc.) by using spoofed emails or fake websites. Phishing websites are common entry points of online ...

  14. Phishing in Organizations: Findings from a Large-Scale and Long-Term Study

    To summarize, this paper makes the following contributions: 1) Extensive measurement study on human factors of phishing and phishing prevention in large organizations. 2) Supportive results for several previous research findings with improved ecological validity. 3) Contradicting findings that challenge the conclusions of

  15. An effective detection approach for phishing websites using URL and

    Phishing offenses are increasing, resulting in billions of dollars in loss. In these attacks, users enter their critical information (e.g., credit card details, passwords, etc.) to the forged website which ...

  16. Detecting phishing websites using machine learning technique

    2. Research background and related works. Phishing attacks are categorized according to Phisher's mechanism for trapping alleged users. Several forms of these attacks are keyloggers, DNS toxicity, etc. The initiation processes in social engineering include online blogs, short message services (SMS), social media platforms that use web 2.0 services, such as Facebook and Twitter, file ...

  17. (PDF) Mitigation Strategies against the Phishing Attacks: A Systematic

    The paper presents a systematic literature review featuring 248 articles (from the beginning of 2018 until March 2023) across the main digital libraries to identify, (1) the existing mitigation ...

  18. PDF REEXAMINING PHISHING RESEARCH 1 SOK: A Comprehensive Reexamination of

    For this paper, we mainly focus on research published between the years 2010-2017. We cover papers appearing up to March 2018, and any pre-2010 paper that is highly cited or appeared in a major security venue. We also cover general phishing surveys appearing up to 2018. Our search gave us over 734 papers on phishing detection and user studies ...

  19. Phishing Detection: A Literature Survey

    This article surveys the literature on the detection of phishing attacks. Phishing attacks target vulnerabilities that exist in systems due to the human factor. Many cyber attacks are spread via mechanisms that exploit weaknesses found in end-users, which makes users the weakest element in the security chain. The phishing problem is broad and no single silver-bullet solution exists to mitigate ...

  20. Different Types of Phishing Attacks and Detection ...

    This paper reviews different types of phishing attacks and detection techniques. Most cyber attacks spread through users' weaknesses, which makes users the weakest element in the security chain. Phishing mostly targets vulnerabilities that already exist in the system due to the human factor. Phishing is a huge problem and there is no single solution for mitigating ...

  21. (PDF) Phishing

    Phishing is a major threat to all Internet users and is difficult to trace or defend against since it does not present itself as obviously malicious in nature. In today's society, everything is ...

  22. Quishing Evolves with Two-Step Attacks & SharePoint Abuse

    Quishing, or QR code phishing, is quickly advancing as threat actors adapt their tactics to bypass email security QR scanners. Quishing has become one of the fastest growing email threat vectors. By introducing QR codes into phishing campaigns, threat actors added an additional layer of evasion, making it harder for traditional security solutions to detect.

  23. Phishing Attacks Detection A Machine Learning-Based Approach

    LR is a supervised machine learning technique used for predicting a discrete output class, that is, for classification and binary classification [3]. It is based on different hypothesis functions for predicting a binary-valued output. In this paper, the sigmoid function is considered as the hypothesis function. It is given by $h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$.

  24. (PDF) Phishing Attacks in Social Engineering: A Review

    Phishing Attacks in Social Engineering: A Review. Kofi Sarpong Adu-Manu*, Richard Kwasi Ahiable, Justice Kwame Appati and Ebenezer Essel Mensah. Department of Computer Science, University of ...
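
Item 23 above describes logistic regression (LR) with a sigmoid hypothesis function for binary classification. The following is a minimal sketch of that idea, not the cited paper's implementation. The function names, the weight vector `theta`, the feature values, and the 0.5 decision threshold are all assumptions introduced here for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta: list[float], x: list[float]) -> float:
    """Sigmoid hypothesis: sigmoid of the dot product of weights and features."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


def predict(theta: list[float], x: list[float], threshold: float = 0.5) -> int:
    """Return 1 (e.g. 'phishing') if the hypothesis reaches the threshold, else 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0


if __name__ == "__main__":
    # Hypothetical weights and features; x[0] = 1.0 acts as the bias term.
    theta = [0.0, 2.0, -1.0]
    x = [1.0, 0.0, 3.0]
    print(hypothesis(theta, x), predict(theta, x))
```

In practice the weights would be learned from labeled data rather than set by hand, and the threshold would be tuned to the desired balance between false positives and false negatives.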