Part 1 – Visit pages with Jsoup

First, we create a ‘callable’ task that will visit a URL and find all the links on that page. We make the task callable so we can submit it to a Java ExecutorService. Here’s the task code:

Here are a few key points:

  • Line 14 – Here we implement the ‘Callable’ interface. This lets an executor run the object on a separate thread. In addition, a callable can return a value; here the task returns itself so we can retrieve the information later.
  • Line 18 – We are going to store the links as URLs. By using a set we ensure we never store the same URL twice.
  • Line 26 – Our call() method returns the task itself. It also throws exceptions, for example on a timeout, a 404, or some other problem. The manager of these tasks will have to decide how to handle the problem.
  • Line 30 – This is our Jsoup selector. It finds all the ‘a’ tags with an ‘href’ attribute. Of course, we can evolve our GrabPage class to look in other places, but this gets us started.
  • Line 34 – Apache StringUtils is really handy. This test returns true if the string is null, empty, or all whitespace.
  • Line 40 – We are constructing URL objects to get two things. One, there’s a built-in validator that will make sure we didn’t pick up stray text. Two, it will process relative hrefs for us. Maximal use of libraries is true to the ScrumBucket style!
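
The blank check and the URL construction from the last two points can be tried out without Jsoup, using only the JDK. What follows is a minimal sketch, not the GrabPage code itself: `LinkResolver` and `resolve` are hypothetical names, the hrefs are passed in as plain strings instead of coming from a selector, and a null/trim check stands in for the StringUtils test. It also de-duplicates on the string form of each URL, which is an assumption of this sketch, chosen because `URL.equals` and `URL.hashCode` may trigger host name lookups.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original GrabPage class.
class LinkResolver {

    /**
     * Resolves each href against the page URL, skipping blank values,
     * duplicates, and anything java.net.URL refuses to parse.
     */
    static List<URL> resolve(URL page, List<String> hrefs) {
        Set<String> seen = new HashSet<>();   // string forms already collected
        List<URL> links = new ArrayList<>();
        for (String href : hrefs) {
            // Stand-in for the StringUtils blank test: null, empty, or whitespace only.
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            try {
                // The two-argument constructor resolves relative hrefs against the page.
                URL url = new URL(page, href.trim());
                if (seen.add(url.toExternalForm())) {
                    links.add(url);
                }
            } catch (MalformedURLException e) {
                // Stray text or an unsupported scheme: skip it.
            }
        }
        return links;
    }
}
```

For a page at http://example.com/docs/index.html, an href of `../about.html` resolves to http://example.com/about.html, while `javascript:void(0)` is rejected because the JDK has no handler for that scheme.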

We need a pom.xml to get the ball rolling:
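
As a rough guide, the dependencies section of such a pom.xml might look like the fragment below. The coordinates are the published ones for Jsoup, Apache Commons Lang 3, and JUnit 4, but the choice of these artifacts and the versions shown are assumptions; the original project may well use different ones (for example the older commons-lang 2.x line).

```xml
<!-- Assumed dependency set; adjust artifacts and versions to your project. -->
<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.12.0</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.13.2</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```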

Just so you can see how this page grabber runs, here’s a unit test:

Executors are simple to use, considering all their power. By putting our page grabber in a task, we can take advantage of all sorts of executor methods. This test uses the simple SingleThreadExecutor to exercise the interface.
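
To show the pattern in isolation, here is a minimal sketch that uses only java.util.concurrent. `EchoTask` and `SingleThreadDemo` are hypothetical stand-ins for the page grabber and its test, and the five-second timeout is an arbitrary assumption. The task does some trivial work in call() and returns itself, and the caller reads the result from the returned object.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the page grabber: call() does the work,
// then returns the task itself so the caller can read the results.
class EchoTask implements Callable<EchoTask> {
    private final String input;
    private String result;

    EchoTask(String input) {
        this.input = input;
    }

    @Override
    public EchoTask call() throws Exception {
        result = input.toUpperCase();
        return this;
    }

    String getResult() {
        return result;
    }
}

class SingleThreadDemo {
    static String run(String input) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<EchoTask> future = executor.submit(new EchoTask(input));
            // get() blocks until call() finishes. An exception thrown inside
            // call() surfaces here wrapped in an ExecutionException.
            EchoTask done = future.get(5, TimeUnit.SECONDS);
            return done.getResult();
        } finally {
            // Always release the worker thread.
            executor.shutdown();
        }
    }
}
```

Swapping in a different executor, such as a fixed thread pool, would leave the task class untouched, which is the main payoff of wrapping the work in a Callable.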