Second, we need to have a manager object to:
- Manage the list of URLs visited
- Create tasks to crawl specific pages
- Collect information from the visits
- Manage multiple threads, without duplicating effort.
There are several ways to implement such a manager. Many times we look at a list of requirements like this and assume we need to write code for all of it. However, the approach at ScrumBucket is to find libraries to perform the work and then design things such that the ‘right things’ happen without actually writing code.
Now you see why we made GrabPage a task rather than just an object. Let’s see how those tasks are used by breaking down our GrabManager class:
private Set<URL> masterList = new HashSet<>();
Somewhere we have to keep track of which URLs we have visited. That screams, “Set!” Sets are perfect for this application because you can only have one of a particular value. We also lucked out: the standard URL class uses the text of the URL for its identity, so we can drop URL objects into the set as is – no need to wrap URL with our own class.
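Here is a quick, throwaway sketch of the behavior we are relying on (the URL is made up; this isn't part of the crawler):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class UrlSetDemo {
    public static void main(String[] args) throws MalformedURLException {
        Set<URL> visited = new HashSet<>();
        // Adding the same address twice still leaves exactly one entry,
        // because URL's equals()/hashCode() are based on the URL's parts.
        visited.add(new URL("http://example.com/page"));
        visited.add(new URL("http://example.com/page"));
        System.out.println(visited.size()); // prints 1
    }
}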
private ExecutorService executorService = Executors.newFixedThreadPool(5);
The abstract ExecutorService allows us to use all sorts of thread management schemes. Given the simple nature of our project, a fixed-size thread pool will do just fine. We can drop any number of GrabPage objects into the executorService and expect it to juggle them correctly. It is also easy to tune: you can try a single thread, or hammer the site by setting it to 100+. Personally, I find 5 a good balance for most sites.
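Tuning really is a one-line change. A sketch of the alternatives (you would keep exactly one of these field declarations, of course):

// One worker thread, for polite crawling:
private ExecutorService executorService = Executors.newSingleThreadExecutor();

// Or a much bigger pool, if you want to hammer the site:
private ExecutorService executorService = Executors.newFixedThreadPool(100);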
private List<Future<GrabPage>> futures = new ArrayList<>();
Future objects are how we check the status of a thread and get the results of the operation. There are many patterns here, but it's super easy to have the thread return the GrabPage object itself. Again, if you don't have to create an extra class, don't. Many times we get wrapped up in future possibilities and add classes for unforeseen functionality.
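If a Callable that returns itself is a new trick to you, here is a minimal, self-contained sketch of the idea (the Worker class is made up purely for illustration):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Worker implements Callable<Worker> {
    private String result;

    @Override
    public Worker call() {
        result = "done"; // the real work happens here
        return this;     // hand the task itself back -- no extra result class
    }

    public String getResult() {
        return result;
    }
}

class WorkerDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Worker> future = pool.submit(new Worker());
        // get() blocks until call() finishes, then returns the completed task.
        System.out.println(future.get().getResult());
        pool.shutdown();
    }
}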
In my youth I spent a lot of energy building entire frameworks for future use. Balderdash. Yes, I still build frameworks when that is my intent. But when I’m building code for action, I reuse and stay clean.
One more note on GrabPage objects. We could add logic in there about scrubbing URLs or deciding when enough is enough. But that’s a bad plan and will actually slow down our ability to reuse this class later. Rather, we want ALL the decision making isolated to the GrabManager class. This is known as the Software Pattern: Isolated Ugliness. In other words, let’s keep all the decision making about visitation in one place.
private void submitNewURL(URL url, int depth) {
    if (shouldVisit(url, depth)) {
        masterList.add(url);
        GrabPage grabPage = new GrabPage(url, depth);
        Future<GrabPage> future = executorService.submit(grabPage);
        futures.add(future);
    }
}
This is a simple little method. It creates the GrabPage object for a given URL and drops it on the execution queue. That's it. Creating tasks and putting them in a backlog should never be more than that. But anyone who has built a thread manager from scratch knows it's not that simple to manage tasks! Fortunately, Java executors give us a clean abstraction, and we can defer the strategy decisions to somewhere else.
After creating the task, we put the Future object into a list for further monitoring. This backlog can be in the thousands, but we don’t care.
private boolean checkPageGrabs() throws InterruptedException {
    Thread.sleep(PAUSE_TIME);

    Set<GrabPage> pageSet = new HashSet<>();
    Iterator<Future<GrabPage>> iterator = futures.iterator();
    while (iterator.hasNext()) {
        Future<GrabPage> future = iterator.next();
        if (future.isDone()) {
            iterator.remove();
            try {
                pageSet.add(future.get());
            } catch (InterruptedException e) {
                // skip pages that load too slow
            } catch (ExecutionException e) {
            }
        }
    }

    for (GrabPage grabPage : pageSet) {
        addNewURLs(grabPage);
    }

    return (futures.size() > 0);
}
Frankly, I struggled with this function. There are as many ways to check the status of a thread pool as there are programmers; one alternative is sketched right after the list below. Still, this little method is pretty clean.
Here’s how it works:
- Sleep for a little while. Note: when polling tasks in other threads, you have to inject some sort of sleep, otherwise your manager thread uses all the resources.
- Loop through the futures.
- When a future is done, remove it from the polling list.
- If there is an execution exception or a timeout, the task is simply dropped from the list. We can flesh this out later for our crawler stats; for now, we just need to keep the list clean.
- Go through the completed GrabPage objects looking for more URLs to process.
- Return true if there are still futures to process.
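For the curious, here is a rough sketch of one of those alternatives: java.util.concurrent.CompletionService hands futures back as they complete, so no sleep-and-poll loop is needed. The method name and the tasks parameter are hypothetical; it assumes the same executorService, addNewURLs, and imports as GrabManager. It fits best when the batch of work is known up front, and since our crawler keeps adding work as it discovers links, the simple poll loop above is arguably the better fit here.

private void drainTasks(List<GrabPage> tasks) throws InterruptedException {
    CompletionService<GrabPage> completionService =
            new ExecutorCompletionService<>(executorService);
    for (GrabPage grabPage : tasks) {
        completionService.submit(grabPage);
    }
    for (int i = 0; i < tasks.size(); i++) {
        try {
            // take() blocks until the next task finishes -- no Thread.sleep() polling.
            GrabPage finished = completionService.take().get();
            addNewURLs(finished);
        } catch (ExecutionException e) {
            // a failed page grab; skip it, as before
        }
    }
}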
package com.stevesando.crawljsoup;

import org.junit.Test;

import java.io.IOException;
import java.net.URL;

public class TestGrabManager {

    @Test
    public void happy() throws IOException, InterruptedException {
        GrabManager grabManager = new GrabManager(2, 50);
        grabManager.go(new URL("http://news.yahoo.com"));
        grabManager.write("urllist.txt");
    }
}
Here's a happy-path test. It will go 2 levels deep and grab up to 50 URLs. Our manager spawns 5 threads to process the URLs, and the output goes to urllist.txt.
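If you want the happy path to actually assert something, one cheap check is that the output file really got written. A sketch of the extra lines (they go right after grabManager.write(...), and need java.io.File plus a static import of org.junit.Assert.assertTrue):

File output = new File("urllist.txt");
assertTrue("expected the crawler to write its URL list", output.exists());
assertTrue("expected at least one URL in the file", output.length() > 0);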
Bingo! We have a working multi-threaded crawler in less than 200 lines of code!
Here is the complete system. First the manager class:
package com.stevesando.crawljsoup;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.StopWatch;

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
import java.util.concurrent.*;

public class GrabManager {

    public static final int THREAD_COUNT = 5;
    private static final long PAUSE_TIME = 1000;

    private Set<URL> masterList = new HashSet<>();
    private List<Future<GrabPage>> futures = new ArrayList<>();
    private ExecutorService executorService = Executors.newFixedThreadPool(THREAD_COUNT);

    private String urlBase;
    private final int maxDepth;
    private final int maxUrls;

    public GrabManager(int maxDepth, int maxUrls) {
        this.maxDepth = maxDepth;
        this.maxUrls = maxUrls;
    }

    public void go(URL start) throws IOException, InterruptedException {
        // stay within same site
        urlBase = start.toString().replaceAll("(.*//.*/).*", "$1");

        StopWatch stopWatch = new StopWatch();
        stopWatch.start();

        submitNewURL(start, 0);
        while (checkPageGrabs())
            ;

        stopWatch.stop();
        System.out.println("Found " + masterList.size() + " urls");
        System.out.println("in " + stopWatch.getTime() / 1000 + " seconds");
    }

    private boolean checkPageGrabs() throws InterruptedException {
        Thread.sleep(PAUSE_TIME);

        Set<GrabPage> pageSet = new HashSet<>();
        Iterator<Future<GrabPage>> iterator = futures.iterator();
        while (iterator.hasNext()) {
            Future<GrabPage> future = iterator.next();
            if (future.isDone()) {
                iterator.remove();
                try {
                    pageSet.add(future.get());
                } catch (InterruptedException e) {
                    // skip pages that load too slow
                } catch (ExecutionException e) {
                }
            }
        }

        for (GrabPage grabPage : pageSet) {
            addNewURLs(grabPage);
        }

        return (futures.size() > 0);
    }

    private void addNewURLs(GrabPage grabPage) {
        for (URL url : grabPage.getUrlList()) {
            if (url.toString().contains("#")) {
                try {
                    url = new URL(StringUtils.substringBefore(url.toString(), "#"));
                } catch (MalformedURLException e) {
                }
            }
            submitNewURL(url, grabPage.getDepth() + 1);
        }
    }

    private void submitNewURL(URL url, int depth) {
        if (shouldVisit(url, depth)) {
            masterList.add(url);
            GrabPage grabPage = new GrabPage(url, depth);
            Future<GrabPage> future = executorService.submit(grabPage);
            futures.add(future);
        }
    }

    /**
     * Rudimentary visitation filter.
     */
    private boolean shouldVisit(URL url, int depth) {
        if (masterList.contains(url)) {
            return false;
        }
        if (!url.toString().startsWith(urlBase)) {
            return false;
        }
        if (url.toString().endsWith(".pdf")) {
            return false;
        }
        if (depth > maxDepth) {
            return false;
        }
        if (masterList.size() >= maxUrls) {
            return false;
        }
        return true;
    }

    public void write(String path) throws IOException {
        FileUtils.writeLines(new File(path), masterList);
    }
}
The GrabPage task:
package com.stevesando.crawljsoup;

import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;

public class GrabPage implements Callable<GrabPage> {

    static final int TIMEOUT = 60000; // one minute

    private URL url;
    private int depth;
    private Set<URL> urlList = new HashSet<>();

    public GrabPage(URL url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    @Override
    public GrabPage call() throws Exception {
        Document document = null;
        System.out.println("Visiting (" + depth + "): " + url.toString());
        document = Jsoup.parse(url, TIMEOUT);
        processLinks(document.select("a[href]"));
        return this;
    }

    private void processLinks(Elements links) {
        for (Element link : links) {
            String href = link.attr("href");
            if (StringUtils.isBlank(href) || href.startsWith("#")) {
                continue;
            }
            try {
                URL nextUrl = new URL(url, href);
                urlList.add(nextUrl);
            } catch (MalformedURLException e) {
                // ignore bad urls
            }
        }
    }

    public Set<URL> getUrlList() {
        return urlList;
    }

    public int getDepth() {
        return depth;
    }
}
And finally, the Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.stevesando.crawljsoup</groupId>
    <artifactId>basecrawler</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
    </dependencies>
</project>
- Part 1 – Visit pages with Jsoup
- Part 2 – Create a Multi-threaded crawl manager
- Part 3 – Java 8 and Lambda Expressions
- Part 4 – Adding our unit tests and the Mockito framework
- Part 5 – Testing a function that calls static methods
- Part 6 – Using Java 8 BiPredicate to externalize decisions
- Part 7 – Adding Spring Data and Neo4j