Phishing is a popular attack vector among criminals and often involves the use of malicious URLs. Adversaries tempt their targets into clicking such links, which can compromise the user’s web browser (by exploiting a browser vulnerability) or prompt the download of malicious files or applications. Malicious URLs can cause severe harm, with effects ranging from data theft to major financial scams. Distinguishing between malicious and benign URLs, however, is not always trivial. The CERTITUDE (Certificate Transparency URL Detector) tool aims to enhance the detection of malicious URLs by applying Machine Learning techniques and leveraging external resources such as Certificate Transparency and WHOIS data.
The Machine Learning model was engineered to use both lexical and certificate features. Lexical features (20 in total) are derived from the URL string itself, while certificate features are derived from metadata in the public logs of Certificate Authorities (CAs). Initial testing suggests that certificate features in particular have the potential to discriminate well between malicious and benign URLs. Including lexical features reduces the false positive rate but also makes the model more sensitive to biases in the training data. Notably, the model is by no means static and should be retrained over time. Incorporating additional training data (from sources not employed thus far) in this process will likely make the model more robust and accurate.
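To illustrate what "derived from the URL string itself" can mean, here is a minimal sketch of lexical feature extraction. The function name `lexical_features` and the eight features it computes are assumptions chosen for illustration; they are not CERTITUDE's actual 20 features, which are not listed here.

```python
import ipaddress
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    Hypothetical feature set for illustration only; not CERTITUDE's own.
    """
    # urlsplit only recognises the host if a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login/verify"))
```

Features like these need no network access, which makes them cheap to compute, but they depend entirely on how URLs in the training set happen to look, which is one way the bias mentioned above can enter the model.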
CERTITUDE is distributed as a Python package. Users can input a single URL or a list of URLs, and the incorporated Machine Learning model determines which of these it believes to be malicious. CERTITUDE requires whois, which may not be available on some systems, so a Docker image is also provided. The present tool was trained with data acquired from the Anti-Phishing Working Group (confirmed malicious URLs dating back to the early 2000s) and the Dataset of Malicious and Benign Webpages published by A.K. Singh. To ensure sufficient data diversity, the data was filtered so that every domain name appears no more than 5 times.
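The per-domain cap described above could be sketched as follows. This is an assumption-laden illustration, not the filtering code actually used: the function name `cap_per_domain` is hypothetical, "domain name" is taken to mean the full hostname, and the first five URLs encountered for each hostname are the ones kept.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_domain(urls, max_per_domain=5):
    """Keep at most `max_per_domain` URLs per hostname, preserving order.

    Hypothetical helper: treats the full hostname as the "domain name".
    """
    counts = defaultdict(int)
    kept = []
    for url in urls:
        # Add a scheme if missing so that urlsplit can locate the host.
        parts = urlsplit(url if "://" in url else "http://" + url)
        host = parts.hostname or ""  # urlsplit lowercases the hostname
        if counts[host] < max_per_domain:
            counts[host] += 1
            kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [f"http://a.example/{i}" for i in range(8)] + ["http://b.example/"]
    print(len(cap_per_domain(sample)))
```

Capping per hostname keeps a single heavily represented site from dominating the training set. Grouping by registered domain instead (so that subdomains share one quota) would require a public suffix list and is left out of this sketch.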