Part 2 – Create a Multi-threaded crawl manager

Second, we need to have a manager object to:

  • Manage the list of URLs visited
  • Create tasks to crawl specific pages
  • Collect information from the visits
  • Manage multiple threads, without duplicating effort.

There are several ways to implement such a manager.  Many times we look at a list of requirements, such as these, and assume we need to write code for all of this.  However, the approach at ScrumBucket is to find libraries to perform the work and then design things such that the ‘right things’ happen without actually writing code.

Now you see why we made GrabPage a task rather than just an object.  Let’s see how those tasks are used by breaking down our GrabManager class:

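Somewhere we have to keep track of which URLs we have visited.  That screams, “Set!”  Sets are perfect for this application as you can only have one of a particular value.  We also lucked out because the standard URL class implements equals() and hashCode(), so two URL objects for the same address count as the same value.  This means we can drop URL objects into the set as is, with no need to wrap the URL in our own class.  One caveat: java.net.URL compares hosts by resolving them to IP addresses, so those calls can block on DNS.  If that ever becomes a problem, java.net.URI compares purely on text.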
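To make the Set idea concrete, here is a minimal sketch of the visited list.  The class name VisitedUrls and the helpers markVisited() and url() are my own inventions for illustration, not names from the GrabManager code.  It leans on the fact that Set.add() returns false when the value is already present, and it uses localhost addresses so nothing depends on an outside site.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

class VisitedUrls {
    // Every URL we have already queued or crawled; a Set rejects duplicates for us.
    private final Set<URL> visited = new HashSet<>();

    /** Returns true the first time a URL is seen and false on every repeat. */
    boolean markVisited(URL url) {
        return visited.add(url);
    }

    int size() {
        return visited.size();
    }

    /** Hypothetical helper: builds a URL without forcing callers to handle a checked exception. */
    static URL url(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(text, e);
        }
    }

    public static void main(String[] args) {
        VisitedUrls seen = new VisitedUrls();
        System.out.println(seen.markVisited(url("http://localhost/index.html"))); // first sighting
        System.out.println(seen.markVisited(url("http://localhost/index.html"))); // repeat
        System.out.println(seen.size());
    }
}
```

The nice part is that "have we been here?" and "remember that we have been here" collapse into a single call.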

The ExecutorService abstraction allows us to use all sorts of thread management schemes.  Given the simple nature of our project, a fixed-size thread pool will do just fine.  We can drop any number of GrabPage objects into the executorService and expect it to juggle them accurately.  Also, it is so easy to tune.  You can try a single thread or hammer the site by setting it to 100+.  Personally, I find 5 a good balance for most sites.
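Here is a small sketch of how little code a fixed-size pool needs.  The PoolDemo class and its trivial squaring tasks are placeholders I made up so the example runs without touching the network; in the crawler the tasks would be GrabPage objects.  The only tuning knob is the poolSize argument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolDemo {
    /** Runs taskCount trivial tasks on poolSize threads and adds up their results. */
    static int sumOfSquares(int poolSize, int taskCount) {
        // Changing poolSize is the whole tuning story: 1 is gentle, 100+ hammers the site.
        ExecutorService executorService = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= taskCount; i++) {
                final int n = i;
                Callable<Integer> task = () -> n * n; // stand-in for real work
                futures.add(executorService.submit(task));
            }
            int sum = 0;
            for (Future<Integer> future : futures) {
                sum += future.get(); // blocks until that task has finished
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executorService.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(5, 10));
    }
}
```

Whether you pass 1 or 5 as the pool size, the answer is the same; only the amount of parallel work changes.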

Future objects are how we check the status of a task and get the result of the operation.  There are many patterns here, but it’s super easy to have the task return the GrabPage object itself.  Again, if you don’t have to create an extra class, don’t.  Many times we get wrapped up in future possibilities and add classes for unforeseen functionality.
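The "return yourself" trick looks roughly like this.  PageTask is a hypothetical stand-in for GrabPage, and its fake linkCount field replaces the real page fetching, so treat this as a sketch of the pattern rather than the actual class.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PageTask implements Callable<PageTask> {
    private final String url;
    private int linkCount = -1; // filled in by call()

    PageTask(String url) {
        this.url = url;
    }

    @Override
    public PageTask call() {
        // A real task would fetch and parse the page; we fake a result instead.
        linkCount = url.length();
        return this; // the task itself is the result, so no extra result class is needed
    }

    String getUrl() {
        return url;
    }

    int getLinkCount() {
        return linkCount;
    }

    /** Submits one task and waits (up to 5 seconds) for it to hand itself back. */
    static PageTask runOne(String url) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<PageTask> future = pool.submit(new PageTask(url));
            return future.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Once the Future reports done, everything the task learned is sitting on the object you get back.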

In my youth I spent a lot of energy building entire frameworks for future use.  Balderdash.  Yes, I still build frameworks when that is my intent.  But when I’m building code for action, I reuse and stay clean.

One more note on GrabPage objects.  We could add logic in there about scrubbing URLs or deciding when enough is enough.  But that’s a bad plan and will actually slow down our ability to reuse this class later.  Rather, we want ALL the decision making isolated to the GrabManager class.  This is known as the Software Pattern: Isolated Ugliness.  In other words, let’s keep all the decision making about visitation in one place.
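As a rough illustration of keeping the ugliness in one spot, here is a sketch of a single decision method.  VisitPolicy, shouldVisit() and scrub() are names I invented for this example, and the two limits (maximum depth and maximum number of URLs) are assumptions about the kind of rules the manager might enforce.

```java
import java.util.HashSet;
import java.util.Set;

class VisitPolicy {
    private final int maxDepth;
    private final int maxUrls;
    private final Set<String> accepted = new HashSet<>();

    VisitPolicy(int maxDepth, int maxUrls) {
        this.maxDepth = maxDepth;
        this.maxUrls = maxUrls;
    }

    /** Scrub: drop any "#fragment" so the same page is not counted twice. */
    static String scrub(String url) {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    /** The one place that decides whether a URL gets crawled. */
    boolean shouldVisit(String url, int depth) {
        if (depth > maxDepth) {
            return false; // too deep
        }
        if (accepted.size() >= maxUrls) {
            return false; // enough is enough
        }
        return accepted.add(scrub(url)); // false when we have already accepted it
    }
}
```

The task class never sees any of this, which is exactly why it stays reusable.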

This is a simple little method.  It creates the GrabPage object for a given URL and drops it on the execution queue.  That’s it.  Creating tasks and putting them in a backlog should never be more than that.  But anyone who has built a thread manager from scratch knows it’s not that simple to manage tasks!  Fortunately, Java executors have a clean abstraction and we can defer the strategy decisions to somewhere else.

After creating the task, we put the Future object into a list for further monitoring.  This backlog can be in the thousands, but we don’t care.
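A submit method in that spirit might look like the sketch below.  FakeGrab is a hypothetical stand-in for GrabPage that does no fetching, and the names SubmitSketch, submitNewUrl() and backlogSize() are assumptions of mine; the pool size of 5 is the one suggested earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SubmitSketch {
    /** Hypothetical stand-in for GrabPage: it just hands itself back. */
    static class FakeGrab implements Callable<FakeGrab> {
        final String url;
        final int depth;

        FakeGrab(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        @Override
        public FakeGrab call() {
            return this; // no real fetching in this sketch
        }
    }

    private final ExecutorService executorService = Executors.newFixedThreadPool(5);
    private final List<Future<FakeGrab>> futures = new ArrayList<>();

    /** Create the task, queue it, remember its Future. Nothing more. */
    void submitNewUrl(String url, int depth) {
        FakeGrab task = new FakeGrab(url, depth);
        futures.add(executorService.submit(task));
    }

    /** Number of Futures we are still holding for monitoring. */
    int backlogSize() {
        return futures.size();
    }

    void shutdown() {
        executorService.shutdown();
    }
}
```

Submitting a thousand URLs costs a thousand small objects in a list; the executor decides when each one actually runs.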

Frankly, I struggled with this function.  There are as many ways to check the status of thread pools as there are programmers.  This little method is pretty clean.

Here’s how it works:

  1. Sleep for a little while.  Note: when polling tasks in other threads, you have to inject some sort of sleep; otherwise your manager thread spins in a busy loop and eats the CPU.
  2. Loop through the futures.
  3. When a future is done, remove it from the polling list.
  4. If there is an execution exception or a timeout, then the task is just removed from the list.  We can flesh this out later for our crawler stats.  For now, we just need to keep the list clean.
  5. Go through the completed GrabPage objects looking for more URLs to process.
  6. Return true if there are still futures to process.
  6. Return true if there are still futures to process.
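The steps above can be sketched as one generic polling pass.  This is not the original method: PollSketch, pollOnce(), runDemo() and the 100 ms pause are all assumptions of mine, and the tasks are plain strings instead of GrabPage objects.  Because the sketch only calls get() on futures that are already done, it does not need a timeout; a version that waits on unfinished futures would use get() with a timeout instead.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PollSketch {
    static final long PAUSE_MILLIS = 100; // assumed pause between polling passes

    /** One polling pass. Returns true while unfinished futures remain. */
    static <T> boolean pollOnce(List<Future<T>> futures, List<T> completed)
            throws InterruptedException {
        Thread.sleep(PAUSE_MILLIS);                  // step 1: do not spin

        Iterator<Future<T>> it = futures.iterator();
        while (it.hasNext()) {                       // step 2: loop through the futures
            Future<T> future = it.next();
            if (!future.isDone()) {
                continue;                            // still running, check again next pass
            }
            it.remove();                             // step 3: done, stop polling it
            try {
                completed.add(future.get());         // returns at once because it is done
            } catch (ExecutionException | CancellationException e) {
                // step 4: a failed task is simply dropped for now
            }
        }
        // step 5 would go here: scan the completed results for more URLs to submit
        return !futures.isEmpty();                   // step 6
    }

    /** Small demo: two tasks succeed, one throws, and we poll until nothing is left. */
    static List<String> runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Callable<String> first = () -> "page-1";
        Callable<String> failing = () -> { throw new IllegalStateException("boom"); };
        Callable<String> second = () -> "page-2";

        List<Future<String>> futures = new ArrayList<>();
        futures.add(pool.submit(first));
        futures.add(pool.submit(failing));
        futures.add(pool.submit(second));

        List<String> completed = new ArrayList<>();
        try {
            while (pollOnce(futures, completed)) {
                // keep polling until the list is empty
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return completed;
    }
}
```

Removing through the Iterator matters here: deleting from the list directly while looping over it would throw a ConcurrentModificationException.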

Here’s a happy path test.  It will go 2 levels deep and grab up to 50 urls.  Our manager spawns 5 threads to process the urls and the output goes to urllist.txt.

Bingo! We have a working multi-threaded crawler in less than 200 lines of code!

Here is the complete system.  First the manager class:


The GrabPage task:

Maven: