This is a tutorial series in which we will create Yet Another Web Crawler. Yawn! But it will serve as a tutorial for the following technologies:
- Jsoup – We will make extensive use of this wonderful library
- ExecutorService and Callable – We’ll crawl with multiple threads
- Neo4j – A graph database is a natural fit for representing a complete site map, as you’ll see
- Spring Data – A super easy way to interact with our database, through both native and REST APIs
- Spring Boot – A way to wire up all our Spring goodies
Each post in this series will focus on a bite-sized chunk of technology. Slowly we will put together a fairly robust crawler. When we are done, you can create a complete graph of a website. Not just a hierarchical list of URLs, but a complete graph showing how all the pages connect to each other. Who knows? It just might uncover a few mysteries buried deep within your pages.
- Part 1 – Visit pages with Jsoup
- Part 2 – Create a Multi-threaded crawl manager
- Part 3 – Java 8 and Lambda Expressions
- Part 4 – Adding our unit tests and the Mockito framework
- Part 5 – Testing a function that calls static methods
- Part 6 – Using Java 8 BiPredicate to externalize decisions
- Part 7 – Adding Spring Data and Neo4j