Neo4j Site Crawler

In this tutorial series we will build Yet Another Web Crawler. Yawn! But the crawler is really an excuse to work through the following technologies:

  • Jsoup – a Java library for fetching and parsing HTML; we will use it extensively
  • ExecutorService and Callable – We’ll crawl with multiple threads
  • Neo4j – A graph database is a natural fit for representing a complete site map, as you will see
  • Spring Data – a super easy way to interact with our database, using either the native or the REST API
  • Spring Boot – a way to wire up all our Spring goodies

Each post in this series will focus on a bite-sized chunk of technology, and post by post we will put together a fairly robust crawler. When we are done, you will be able to create a complete graph of a website: not just a hierarchical list of URLs, but a graph showing how all the pages link to each other. Who knows? It just might uncover a few mysteries buried deep within your pages.
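
To make the "graph, not a hierarchy" idea concrete before any real tooling shows up, here is a minimal in-memory sketch in plain Java. It is only an illustration: the class name SiteGraph, its methods, and the example URLs are hypothetical, and nothing here uses Jsoup or Neo4j yet. The point it tries to show is that once links are stored as edges, you can ask "which pages link to this one?", a question a flat or hierarchical URL list cannot answer.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: an in-memory stand-in for the site graph.
// The names here are assumptions made for this sketch only.
class SiteGraph {
    // page URL -> URLs that page links to (insertion order preserved)
    private final Map<String, Set<String>> outgoing = new LinkedHashMap<>();

    // Record a directed link and make sure both pages exist as nodes.
    public void addLink(String from, String to) {
        outgoing.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        outgoing.computeIfAbsent(to, k -> new LinkedHashSet<>());
    }

    // Pages that the given page links to.
    public Set<String> linksFrom(String page) {
        return Collections.unmodifiableSet(
                outgoing.getOrDefault(page, Collections.emptySet()));
    }

    // Pages that link to the given page (found by scanning every edge).
    public Set<String> linksTo(String page) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : outgoing.entrySet()) {
            if (entry.getValue().contains(page)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SiteGraph graph = new SiteGraph();
        graph.addLink("/", "/about");
        graph.addLink("/", "/blog");
        graph.addLink("/blog", "/about");
        graph.addLink("/about", "/");   // a back-link: already not a tree

        System.out.println("Links to /about: " + graph.linksTo("/about"));
    }
}
```

Scanning every edge to find incoming links is fine for a toy like this, but it is exactly the kind of query a graph database is designed to handle for us, which is one reason a graph store suits this job.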