Detecting Malicious URLs and Preventing Attacks

Malicious URLs and Their Forms

Many cyber-attacks, including spamming and phishing, are carried out using malicious URLs. Detecting these URLs is one way to stop such attacks. Knowing the type of threat also helps determine the severity of an attack and choose the best countermeasure.

Most existing methods detect malicious URLs for only one type of attack. This paper proposes a machine learning-based method that detects malicious URLs across all popular attack types and also identifies the type of attack a malicious URL is trying to launch. Our method employs a range of discriminative features, including textual properties, link structures, webpage contents, DNS information, and network traffic.
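To make "textual properties" concrete, here is a minimal sketch of how a few lexical features might be computed from a URL string. The feature names and the example URL are assumptions chosen for illustration; they are not the feature set used in the experiments described here.

```python
from urllib.parse import urlsplit


def lexical_features(url):
    """Compute a few simple textual properties of a URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Count only non-empty path segments, so "/a/b/" has depth 2.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "has_query": bool(parts.query),
    }


# Hypothetical URL, used only to show the shape of the output.
features = lexical_features("http://login.example-bank.com/verify/account?id=123")
```

A dictionary like this could be turned into one row of a feature matrix for any standard classifier. The other feature families mentioned above (link structure, page content, DNS, network traffic) would need external lookups and are left out of this sketch.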

Many of these features are novel and highly effective. Our experiments with 40,000 benign URLs and 32,000 malicious URLs, gathered from real-life Internet sources, show that our method performs well: accuracy was 98% for detecting malicious URLs and 93% for identifying the attack type. We also discuss the effectiveness of each discriminative feature and how easily an attacker could evade it.


How Does It Affect Users?

An attack chain is usually a series of tricks played on users. This could include convincing them to open a malicious file because it looks like a video, or an infected PDF because it appears to contain financial information. Social engineering techniques like these are hard to defend against, but their simplicity makes them a double-edged sword for attackers and a boon to detection systems.

Social engineering attacks are often carried out in bulk and are therefore aimed at the lowest common denominator: users who are not tech-savvy or who are easily exploited. The tricks are often not very sophisticated, and most users can recognize the deception if they take the time to think about it. Because the tricks follow predictable patterns, defenders can detect them. A URL might include the words “free” and “cash” to tempt users to click it, but a detection system can use those same words as triggers to flag the URL. What would happen if we taught a machine-learning model to give every URL that extra thought? We built a malicious URL detector to help answer this question.
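The keyword idea can be sketched in a few lines. This is a deliberately naive illustration, not the detector described in this post: the word list contains only the two words mentioned above, and the function name is made up for the example.

```python
import re

# Only the two trigger words from the text; a real list would be much longer.
SUSPICIOUS_WORDS = {"free", "cash"}


def keyword_hits(url):
    """Return the suspicious words that appear as whole tokens in the URL."""
    # Split on anything that is not a lowercase letter or digit.
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return sorted(set(tokens) & SUSPICIOUS_WORDS)


print(keyword_hits("http://example.com/get-free-cash-now"))
print(keyword_hits("https://example.com/freedom"))
```

Matching whole tokens keeps a word like "freedom" from triggering the "free" rule, but it also shows why hand-written rules are brittle: an attacker only has to write "fr33" to slip past them. That brittleness is the motivation for letting a model learn the patterns instead.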


Setting Up URLs

The image shows a user entering a URL while our machine-learning model analyzes it in real time to determine whether the URL is safe. You can adjust the URL and receive immediate feedback on whether the change makes it look more or less malicious. As you can see in the above gif, our model recognizes certain red flags.

One red flag is an unencrypted page (i.e., an HTTP page). The model has also learned to identify a common trick used by attackers, which is to take a well-known website such as Google and make it falsely appear that you are browsing to that website by using its name as a subdomain: “” in this example. The model was trained on more than 100 million malicious and benign URLs to learn to recognize these suspicious patterns.
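For comparison with what the model learns on its own, the two red flags above can be written as explicit rules. This is a hand-coded sketch, not the model: the brand list, the function name, and the example URLs are hypothetical, and the registered domain is approximated as the last two host labels, which is wrong for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit

# Hypothetical watch list of well-known names.
KNOWN_BRANDS = {"google"}


def red_flags(url):
    """Return a list of simple rule-based warnings for a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    flags = []

    # Flag 1: plain HTTP instead of HTTPS.
    if parts.scheme == "http":
        flags.append("unencrypted")

    # Flag 2: a known brand appears only as a subdomain of some other domain.
    # Assumption: the registered domain is the last two labels.
    subdomain_labels = labels[:-2]
    if any(label in KNOWN_BRANDS for label in subdomain_labels):
        flags.append("brand_in_subdomain")

    return flags


print(red_flags("http://google.example.com/login"))
print(red_flags("https://www.google.com/"))
```

A production rule would use a public-suffix list rather than the last-two-labels shortcut. The point of the learned model is that it does not need rules like these spelled out.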

Although the model’s architecture is fairly complex, it basically combines character-level embeddings with a convolutional neural network. At a high level, this means that the model operates character by character and must learn on its own to recognize meaningful patterns of characters such as “g – o – o – g – l – e”. The model learns to read the URL’s characters from left to right, much like a person would, looking for words and patterns that suggest a malicious URL.

Our approach is unique in that the model is not fed literal characters like “g” and “o” but “embeddings.” These embeddings, which are numerical representations of letters as high-dimensional vectors, give the model a richer understanding of the context in which each letter is found.

For example, the embeddings of “A” and “Z” may capture a similarity between the two letters because both are uppercase. This is a less literal approach than using the exact letters of URLs, and it allows our system to generalize more easily when scoring URLs. In the subdomain trick, for example, it is not important that the exact text “securesite” is placed between “google” and “com”. What matters is that any text is there at all.
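To illustrate what "feeding embeddings instead of characters" means mechanically, here is a minimal sketch of an embedding lookup. Everything in it is an assumption made for illustration: the vocabulary, the 8-dimensional vector size, and above all the random vectors, which stand in for values a real model would learn during training.

```python
import random
import string

# Assumed vocabulary: printable ASCII letters, digits, and punctuation.
VOCAB = string.ascii_letters + string.digits + string.punctuation
# Index 0 is reserved for characters outside the vocabulary.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(VOCAB)}
EMBEDDING_DIM = 8  # assumed size; real models typically use more dimensions

# Random placeholder vectors; in a trained model these values are learned.
_rng = random.Random(0)
EMBEDDINGS = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(VOCAB) + 1)
]


def embed_url(url):
    """Map each character of the URL to its embedding vector."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in url]
    return [EMBEDDINGS[i] for i in indices]


vectors = embed_url("http://example.com")
print(len(vectors), "characters,", len(vectors[0]), "numbers each")
```

The same character always maps to the same vector, so the URL becomes a sequence of vectors that later layers can compare and combine. With learned values, related characters (such as two uppercase letters) can end up close together, which random placeholders cannot show.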

The embeddings are used together with the convolutional neural network so that the model can encode general relationships among letters and later combine them into useful patterns or rules for detecting malicious websites. At the same time, the embeddings contain enough information about individual characters to let the system learn specific words such as “google” when needed.
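The way a convolutional filter picks out a character pattern can be shown with a toy example. This sketch makes two simplifying assumptions that differ from the real model: one-hot vectors stand in for learned embeddings, and the single filter is set by hand to match "goo" rather than being learned from data.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz."


def one_hot(ch):
    """Toy stand-in for an embedding: 1.0 at the character's position, else 0.0."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]


def embed(text):
    return [one_hot(ch) for ch in text.lower()]


def conv1d_max(embedded, kernel, bias=0.0):
    """Slide one filter over the sequence, apply ReLU, and keep the maximum."""
    width = len(kernel)
    best = 0.0  # ReLU output is never negative, so 0.0 is a safe starting point
    for start in range(len(embedded) - width + 1):
        total = bias
        for offset in range(width):
            vec = embedded[start + offset]
            weights = kernel[offset]
            total += sum(v * w for v, w in zip(vec, weights))
        best = max(best, max(total, 0.0))
    return best


# Hand-set filter of width 3 that responds most strongly to "goo".
goo_filter = embed("goo")

print(conv1d_max(embed("google.example.com"), goo_filter))  # → 3.0
print(conv1d_max(embed("example.com"), goo_filter))         # → 1.0
```

The filter fires fully wherever "goo" appears, regardless of position, and only weakly on partial matches. A trained network stacks many such filters and learns their weights, so it can discover patterns like a brand name followed by extra host labels without being told to look for them.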


The above image shows the URL model being queried as the user types the URL. However, most malicious URLs are found in static links in emails, Word documents, and webpages. Social engineering attacks can hide a malicious link behind any text, even text that appears legitimate. Sometimes attackers swap characters that look alike, such as lowercase L’s and uppercase I’s, or the letter O and zeroes, which obscures the true destination of the link. Verifying a URL’s true nature is difficult for humans, and even security-aware users may be fooled by a benign-looking email link. A model that examines the link’s characters can stop such an attack.
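The lookalike-character trick can be illustrated with a small normalization check. This is a rule-based sketch rather than the character-level model described in this post: the substitution table covers only the swaps mentioned above, and the brand list and function name are hypothetical.

```python
# Hypothetical watch list of well-known names.
KNOWN_BRANDS = {"google"}

# Only the swaps mentioned in the text: uppercase I for lowercase L, zero for O.
LOOKALIKES = str.maketrans({"I": "l", "0": "o"})


def spoofed_brands(displayed_host):
    """Return brands that a host name imitates using lookalike characters.

    The host is taken as displayed to the user, with its original casing,
    because the uppercase-I trick depends on how the text is rendered.
    """
    hits = []
    for label in displayed_host.split("."):
        normalized = label.translate(LOOKALIKES).lower()
        # A hit means the label matches a brand only after normalization.
        if normalized in KNOWN_BRANDS and label.lower() != normalized:
            hits.append(normalized)
    return hits


print(spoofed_brands("www.googIe.com"))
print(spoofed_brands("g00gle.com"))
print(spoofed_brands("www.google.com"))
```

A fixed table like this catches only the substitutions someone thought to list. A model that works on the characters themselves has a chance of learning similar patterns from examples, including ones nobody wrote a rule for.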




