First, we create a 'callable' task that will visit a URL and find all the links on that page. We make the task callable so we can invoke it from a Java ExecutorService. Here's the task code:
```java
package com.stevesando.crawljsoup;

import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;

public class GrabPage implements Callable<GrabPage> {

    static final int TIMEOUT = 60000; // one minute

    private final Set<URL> urlList = new HashSet<>();
    private final URL url;

    public GrabPage(URL url) {
        this.url = url;
    }

    @Override
    public GrabPage call() throws Exception {
        // Fetch and parse the page; this throws on timeouts and HTTP errors.
        Document document = Jsoup.parse(url, TIMEOUT);

        Elements links = document.select("a[href]");

        for (Element link : links) {
            String href = link.attr("href");
            if (StringUtils.isBlank(href) || href.startsWith("#")) {
                continue;
            }

            try {
                // NOTE: the set will not store the same URL twice, even as two different objects.
                URL nextUrl = new URL(url, href);
                urlList.add(nextUrl);
            } catch (MalformedURLException e) {
                // just ignore bad URLs
            }
        }

        return this;
    }

    public void dump() {
        for (URL url1 : urlList) {
            System.out.println("Links to " + url1.toString());
        }
    }
}
```
Here are a few key points (line numbers refer to the listing above):
- Line 14 – Here we implement the 'Callable' interface, which lets the task run on a separate thread. Because it is a Callable<GrabPage>, the task can return itself, so we can retrieve the harvested links later.
- Line 18 – We are going to store the links as URLs. By using a set we ensure we never store the same URL twice.
- Line 26 – Our call() method returns the GrabPage itself. It is also declared to throw exceptions, which might come from a timeout, a 404, or some other problem; the manager of these tasks will have to decide how to handle them.
- Line 30 – This is our Jsoup selector. It finds all the a tags with an 'href' attribute. Of course, we can evolve our GrabPage class to look in other places, but this gets us started.
- Line 34 – Apache's StringUtils is really handy. This test returns true if the href is null, empty, or all whitespace.
- Line 40 – We construct URL objects to get two things. One, there's built-in validation that catches stray text we may have picked up. Two, it resolves relative hrefs against the current page for us (see the snippet after this list). Maximal use of libraries is true to the ScrumBucket style!
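Those last two points are easy to check in isolation. Here's a tiny throwaway snippet, separate from the crawler (the class name UrlDemo and the sample hrefs are invented for illustration), that shows the two-argument URL constructor resolving relative links and StringUtils.isBlank in action:

```java
package com.stevesando.crawljsoup;

import org.apache.commons.lang3.StringUtils;

import java.net.MalformedURLException;
import java.net.URL;

public class UrlDemo {

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/docs/index.html");

        // The two-argument constructor resolves relative hrefs against the base page.
        System.out.println(new URL(base, "page2.html")); // http://example.com/docs/page2.html
        System.out.println(new URL(base, "/about"));     // http://example.com/about

        // Stray text with an unknown protocol fails fast with MalformedURLException.
        try {
            new URL(base, "javascript:void(0)");
        } catch (MalformedURLException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // StringUtils.isBlank treats null, "", and whitespace-only strings the same way.
        System.out.println(StringUtils.isBlank("   ")); // true
        System.out.println(StringUtils.isBlank(null));  // true
    }
}
```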
We need a pom.xml to get the ball rolling:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.stevesando.crawljsoup</groupId>
    <artifactId>basecrawler</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.0</version>
        </dependency>
    </dependencies>
</project>
```
Just so you can see how this page grabber runs, here’s a unit test:
```java
package com.stevesando.crawljsoup;

import org.junit.Test;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TestGrabPage {

    @Test
    public void happy() throws MalformedURLException, ExecutionException, InterruptedException {
        ExecutorService executorService = Executors.newSingleThreadExecutor();

        Future<GrabPage> future = executorService.submit(new GrabPage(new URL("http://example.com")));

        GrabPage done = future.get();
        done.dump();
    }
}
```
Executors are simple to use, considering all their power. By putting our page grabber in a task, we can take advantage of all sorts of executor methods. This test uses a simple single-thread executor, created with Executors.newSingleThreadExecutor(), to exercise the interface.
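Part 2 will build a proper multi-threaded crawl manager, but as a preview of what the task shape buys us, here's a rough sketch (the PoolSketch class and its seed URLs are invented for illustration, not code from this series) that fans several GrabPage tasks across a fixed thread pool with invokeAll:

```java
package com.stevesando.crawljsoup;

import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSketch {

    public static void main(String[] args) throws Exception {
        // A fixed pool lets several pages download in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<GrabPage> tasks = Arrays.asList(
                new GrabPage(new URL("http://example.com")),
                new GrabPage(new URL("http://example.org")));

        // invokeAll blocks until every task has completed (or failed).
        for (Future<GrabPage> future : pool.invokeAll(tasks)) {
            future.get().dump(); // get() rethrows any exception thrown by call()
        }

        pool.shutdown();
    }
}
```

Swapping newSingleThreadExecutor for newFixedThreadPool is the only change the tasks need, which is exactly the point of wrapping the page grab in a Callable.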
- Part 1 – Visit pages with Jsoup
- Part 2 – Create a Multi-threaded crawl manager
- Part 3 – Java 8 and Lambda Expressions
- Part 4 – Adding our unit tests and the Mockito framework
- Part 5 – Testing a function that calls static methods
- Part 6 – Using Java 8 BiPredicate to externalize decisions
- Part 7 – Adding Spring Data and Neo4j