CSE361 -- Fall 2017
Web Crawler + Form bruteforcer
As we mentioned in the course, a major problem with users reusing passwords across services is that when one service's password database is compromised, attackers extract the usernames and passwords and try them on unrelated websites across the Internet.
In this project, you will develop a crawler that autonomously navigates websites, collecting and tokenizing every word it finds; these words will later be used as candidate passwords against the website's login form. The crawler must be able to autonomously identify the login page and to detect whether a given username/password combination succeeded or not.
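One common heuristic for identifying the login page is to flag any page whose HTML contains a form with a password-type input. A minimal sketch using only the standard library's HTML parser (the class and function names are illustrative, not required):

```python
from html.parser import HTMLParser

class LoginFormFinder(HTMLParser):
    """Flags a page as a likely login page if a <form> on it
    contains an <input type="password"> field."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_login_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            if attrs.get("type", "").lower() == "password":
                self.has_login_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_login_page(html):
    finder = LoginFormFinder()
    finder.feed(html)
    return finder.has_login_form
```

Detecting whether a login attempt succeeded is left open; comparing the response to a known-failed attempt (status code, redirect target, or page content) is one reasonable approach.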
- The crawler should be configurable, with the configuration options being the following:
- The user must be able to provide a custom user-agent for use with each GET/POST request
- Depth-first/Breadth-first choice of crawling
- Maximum depth of pages to crawl
- Maximum total number of crawled pages
- In addition to using all words as you find them, you must also convert each word to lowercase, to uppercase, to its reverse (e.g. "facebook" becomes "koobecaf"), and to leet-speak using the following case-insensitive substitutions: "a" -> "4", "e" -> "3", "l" -> "1", "t" -> "7", "o" -> "0"
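The mutations above can be generated in a few lines; `str.translate` handles the case-insensitive leet-speak mapping by listing both cases in the translation table:

```python
# Map both cases of each letter, since the conversions are case-insensitive.
LEET_MAP = str.maketrans("aAeElLtToO", "4433117700")

def mutations(word):
    """Return the candidate-password set derived from one tokenized word:
    the word itself, lowercase, uppercase, reversed, and leet-speak."""
    return {
        word,
        word.lower(),
        word.upper(),
        word[::-1],          # "facebook" -> "koobecaf"
        word.translate(LEET_MAP),
    }
```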
- Your crawler must also try to find parts of a website not explicitly linked from other pages by the following two methods:
- Crawl robots.txt and identify paths that you have not crawled before
- Identify whether the site uses common popular subdomains (use this list) and crawl any pages you find there.
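The robots.txt half of this can be sketched as a simple line parser: `Disallow`/`Allow` entries frequently reveal paths not linked from anywhere else on the site. (Subdomain probing is a separate DNS/HTTP check and is not shown here.)

```python
def paths_from_robots(robots_txt):
    """Extract candidate paths from the body of a robots.txt file.
    Disallow/Allow lines often point at directories that are not
    reachable through ordinary link-following."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        for field in ("disallow:", "allow:"):
            if line.lower().startswith(field):
                path = line[len(field):].strip()
                if path and path != "/":       # "/" covers the whole site
                    paths.append(path)
    return paths
```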
- Your crawler must be able to remain within the target website (e.g. if we are crawling facebook.com, we don't want to follow a link to cnn.com and start crawling that)
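A simple scope check: resolve each (possibly relative) link against the page it was found on and compare hosts. This sketch treats only an exact host match as in scope; whether `www.facebook.com` counts as part of `facebook.com` is a design choice left to you.

```python
from urllib.parse import urljoin, urlparse

def same_site(base_url, link):
    """Resolve a link relative to the current page and keep it only
    if it stays on the same host as the target website."""
    resolved = urljoin(base_url, link)
    return urlparse(resolved).netloc == urlparse(base_url).netloc
```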
- Your crawler must be able to recover from dead-end pages (e.g. pages without links) and HTTP 4xx/5xx errors
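Since you are reading raw HTTP responses (see the next requirement), recovery means checking the status line yourself before processing a page; a hedged sketch of that check:

```python
def response_status(raw):
    """Parse the status code out of a raw HTTP response.
    The crawler should skip (but never crash on) 4xx/5xx responses,
    and returns None for responses it cannot parse at all."""
    try:
        status_line = raw.split(b"\r\n", 1)[0]
        return int(status_line.split(b" ")[1].decode("ascii"))
    except (IndexError, ValueError, UnicodeDecodeError):
        return None
```

Dead-end pages need no special handling beyond this: a page with no links simply adds nothing to the crawl frontier, and the crawl continues from whatever is queued next.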
- You may use libraries that help you create network sockets (e.g. Python's Twisted) but you may not use ready-made GET/POST requests (i.e. I want you to construct your own HTTP requests)
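Constructing your own request means assembling the request line and headers by hand and writing the raw bytes to a socket. A minimal sketch for a GET request over plain HTTP (no TLS handling shown; the function names are illustrative):

```python
import socket

def build_get_request(host, path, user_agent):
    """Assemble a minimal HTTP/1.1 GET request by hand:
    no ready-made request objects, just the bytes on the wire."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"User-Agent: {user_agent}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    ).encode("ascii")

def fetch(host, path, user_agent, port=80):
    """Send the hand-built request over a TCP socket and
    return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(build_get_request(host, path, user_agent))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

POST requests for the login form follow the same pattern, with a `Content-Type: application/x-www-form-urlencoded` header, a `Content-Length` header, and the URL-encoded form fields as the body.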
- You may use HTML parsing libraries to identify links and text (e.g. BeautifulSoup) but you may not use ready-made crawlers (i.e. you must implement the depth-first/breadth-first functionality yourself).
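The depth-first/breadth-first logic reduces to how you take items off the crawl frontier: a `deque` used as a queue gives BFS, used as a stack gives DFS. A sketch of the core loop, abstracted over a `get_links(url)` callable (which in your crawler would fetch the page and extract its links):

```python
from collections import deque

def crawl_order(start, get_links, strategy="bfs", max_depth=5, max_pages=100):
    """Visit pages starting from `start`, honoring the strategy,
    depth, and page-count limits. Returns URLs in visit order."""
    frontier = deque([(start, 0)])
    seen = {start}                       # never enqueue a URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        # popleft() = FIFO queue = BFS; pop() = LIFO stack = DFS
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return visited
```

In the real crawler, `get_links` is also where you would apply the same-site filter, tokenize the page text into candidate words, and check for the login form.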