02 Jan 2022

Build A Web Crawler with Java

Creating a web crawler is a smart way of retrieving useful information available online. With a web crawler, you can scan the Internet, browse through individual websites, and analyze and extract their content.

The Java programming language provides a simple way of building a web crawler and harvesting data from websites. You can use the extracted data for various use cases, such as for analytical purposes, providing a service that uses third-party data, or generating statistical data.

In this article, we’ll walk you through the process of building a web crawler using Java and ProxyCrawl.

What you’ll need

Typically, crawling web data involves creating a script that sends a request to the targeted web page, accesses its underlying HTML code, and scrapes the required information.

To accomplish that objective, you’ll need the following:

  • Basic working knowledge of Java
  • JDK 11 or later, since the HttpClient API used in this tutorial was introduced in Java 11
  • A ProxyCrawl account and its API tokens (the free trial account is enough to follow along)

Before we develop the crawling logic, let’s clarify why using ProxyCrawl is important for web crawling.

Why use ProxyCrawl for Crawling

ProxyCrawl is a powerful data crawling and scraping tool you can use to harvest information from websites quickly and easily.

Here are some reasons why you should use it for crawling online data:

  • Easy to use: It comes with a simple API that you can set up quickly without any programming hurdles. With just a few lines of code, you can start using the API to crawl websites and retrieve their content.
  • Supports advanced crawling: ProxyCrawl allows you to perform advanced web crawling and scrape data from complicated websites. Since it supports JavaScript rendering, ProxyCrawl lets you extract data from dynamic websites. It offers a headless browser that allows you to extract what real users see in their web browsers, even if a site is built with modern frameworks like Angular or React.js.
  • Bypass crawling obstacles: ProxyCrawl can handle all the restrictions often associated with crawling online data. It has an extensive network of proxies as well as more than 17 data centers around the world. You can use it to avoid access restrictions, resolve CAPTCHAs, and evade other anti-scraping measures implemented by web applications. What’s more, you can crawl websites while remaining anonymous, without worrying about exposing your identity.
  • Free trial account: You can test how ProxyCrawl works without giving out your payment details. The free account comes with 1,000 credits for trying out the tool’s capabilities.

How ProxyCrawl Works

ProxyCrawl provides the Crawling API for crawling and scraping data from websites. You can easily integrate the API into your Java project and retrieve information from web pages smoothly.

Each request made to the Crawling API starts with the following base URL:

https://api.proxycrawl.com

Also, you’ll need to add the following mandatory parameters to the API:

  • Authentication token
  • URL

The authentication token is a unique token that authorizes you to use the Crawling API. Once you sign up for an account, ProxyCrawl will give you two types of tokens:

  • Normal token: This is for making generic crawling requests.
  • JavaScript token: This is for crawling dynamic websites. It provides you with headless browser capabilities for crawling web pages rendered using JavaScript. As pointed out earlier, it’s a useful way of crawling advanced websites.

Here is how to add the authentication token to your API request:

https://api.proxycrawl.com/?token=INSERT_TOKEN

The second mandatory parameter is the URL to crawl. It should start with http or https and be fully encoded. Encoding converts the URL string into a format that can be transmitted over the Internet reliably.

Here is how to insert the URL into your API request:

https://api.proxycrawl.com/?token=INSERT_TOKEN&url=INSERT_URL

If you run the above line, for example with cURL in your terminal or by pasting it into a browser’s address bar, it’ll execute the API request and return the entire HTML source code of the targeted web page.

It’s that easy and simple!

If you want to perform advanced crawling, you may add other parameters to the API request. For example, when using the JavaScript token, you can add the page_wait parameter to instruct the browser to wait for the specified number of milliseconds before the resulting HTML code is captured.

Here is an example:

https://api.proxycrawl.com/?token=INSERT_TOKEN&page_wait=1000&url=INSERT_URL
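
If you prefer to assemble such a request URL directly in Java, here is a minimal sketch. It assumes the same URLEncoder and StandardCharsets classes used in the later examples; INSERT_JS_TOKEN and the target URL are placeholders, not real values:

// A minimal sketch: builds a Crawling API request URL with the page_wait parameter.
// INSERT_JS_TOKEN and the target URL below are placeholders.
String targetUrl = URLEncoder.encode("https://www.example.com/", StandardCharsets.UTF_8);
String apiUrl = "https://api.proxycrawl.com/?token=INSERT_JS_TOKEN"
    + "&page_wait=1000"
    + "&url=" + targetUrl;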

Building a Web Crawler with Java and ProxyCrawl

In this Java web crawling tutorial, we’ll use the HttpClient API to create the crawling logic. The API was introduced in Java 11, and it comes with lots of useful features for sending requests and retrieving their responses.

The HttpClient API supports both HTTP/1.1 and HTTP/2. By default, it uses the HTTP/2 protocol to send requests. If a request is sent to a server that does not support HTTP/2, it is automatically downgraded to HTTP/1.1.

Furthermore, requests can be sent synchronously or asynchronously, request and response bodies can be handled as reactive streams, and the API uses the common builder pattern.

The API comprises three core classes:

  • HttpRequest
  • HttpClient
  • HttpResponse

Let’s talk about each of them in more detail.

1. HttpRequest

The HttpRequest, as the name implies, is an object encapsulating the HTTP request to be sent. To create new instances of HttpRequest, call HttpRequest.newBuilder(). After it has been created, the request is immutable and can be sent multiple times.

The Builder class comes with different methods for configuring the request.

These are the most common methods:

  • URI method
  • Request method
  • Protocol version method
  • Timeout method

Let’s talk about each of them in more detail.

a) URI method

The first thing to do when configuring the request is to set the URL to crawl. We can do so by calling the uri() method on the Builder instance. We’ll also use the URI.create() method to create the URI by parsing the string of the URL we intend to crawl.

Here is the code:

String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/", StandardCharsets.UTF_8.name());

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))

Notice that we provided the URL string in ProxyCrawl’s request format, with the token and the target page passed as parameters. This is the web page whose contents we intend to scrape.

We also encoded the URL using the Java URLEncoder class. As earlier mentioned, ProxyCrawl requires URLs to be encoded.

b) Request method

The next thing to do is to specify the HTTP method to be used for making the request. We can call any of the following methods from Builder:

  • GET()
  • POST()
  • PUT()
  • DELETE()

In this case, since we want to request data from the target web page, we’ll use the GET() method.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
    .GET()

So far, HttpRequest has all the parameters that should be passed to HttpClient. However, you may need to include other parameters, such as the HTTP protocol version and timeout.

Let’s see how you can add the additional parameters.

c) Protocol version method

As earlier mentioned, the HttpClient API uses the HTTP/2 protocol by default. Nonetheless, you can specify the version of the HTTP protocol you want to use.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
    .version(HttpClient.Version.HTTP_2)

d) Timeout method

You can set the amount of time to wait before a response is received. Once the defined period expires, an HttpTimeoutException will be thrown. By default, the timeout is set to infinity.

You can define the timeout by calling the timeout() method on the builder instance. You’ll also need to pass a Duration object to specify the amount of time to wait.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
    .timeout(Duration.ofSeconds(20))
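
Putting the builder methods from this section together, a complete request configuration might look like the following sketch. It reuses the encoded url variable and the INSERT_TOKEN placeholder from earlier, and assumes java.time.Duration is imported:

// A sketch combining the uri(), GET(), version(), and timeout() builder calls
HttpRequest request = HttpRequest.newBuilder()
    .GET()
    .uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
    .version(HttpClient.Version.HTTP_2)
    .timeout(Duration.ofSeconds(20))
    .build();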

2. HttpClient

The HttpClient class is the main entry point of the API—it acts as a container for the configuration details shared among multiple requests. It is the HTTP client used for sending requests and receiving responses.

You can call either the HttpClient.newBuilder() or the HttpClient.newHttpClient() method to instantiate it. After an instance of the HttpClient has been created, it’s immutable.

The HttpClient class offers several helpful and self-describing methods you can use when working with requests and responses.

These are some things you can do:

  • Set protocol version
  • Set redirect policy
  • Send synchronous and asynchronous requests

Let’s talk about each of them in more detail.

a) Set protocol version

As earlier mentioned, the HttpClient class uses the HTTP/2 protocol by default. However, you can set your preferred protocol version, either HTTP/1.1 or HTTP/2.

Here is an example:

HttpClient client = HttpClient.newBuilder()
    .version(HttpClient.Version.HTTP_1_1)

b) Set redirect policy

If the targeted web page has moved to a different address, you’ll get a 3xx HTTP status code. Since the address of the new URI is usually provided with the status code information, setting the correct redirect policy can make HttpClient forward the request automatically to the new location.

You can set it by using the followRedirects() method on the Builder instance.

Here is an example:

HttpClient client = HttpClient.newBuilder()
    .followRedirects(HttpClient.Redirect.NORMAL)
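
Combining these options, a fully configured client could be created with a single builder chain, for example (a minimal sketch):

// A sketch combining the protocol version and redirect policy settings
HttpClient client = HttpClient.newBuilder()
    .version(HttpClient.Version.HTTP_2)
    .followRedirects(HttpClient.Redirect.NORMAL)
    .build();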

c) Send synchronous and asynchronous requests

HttpClient supports two ways of sending requests:

  • Synchronously by using the send() method. This blocks the client until the response is received, before continuing with the rest of the execution.

Here is an example:

HttpResponse<String> response = client.send(request,
    BodyHandlers.ofString());

Note that we used BodyHandlers and called the ofString() method to return the HTML response as a string.

  • Asynchronously by using the sendAsync() method. This does not wait for the response to be received; it’s non-blocking. Once the sendAsync() method is called, it returns instantly with a CompletableFuture<HttpResponse<String>>, which completes once the response is received. The returned CompletableFuture can be combined using various techniques to define dependencies among different asynchronous tasks.

Here is an example:

CompletableFuture<HttpResponse<String>> response =
    client.sendAsync(request, HttpResponse.BodyHandlers.ofString());

3. HttpResponse

The HttpResponse, as the name implies, represents the response received after sending an HttpRequest. HttpResponse offers different helpful methods for handling the received response.

These are the most important methods:

  • statusCode(): This method returns the status code of the response as an int.
  • body(): This method returns the body of the response. The return type depends on the BodyHandler parameter passed to the send() method.

Here is an example:

// Handling the response body as a String
HttpResponse<String> response = client
    .send(request, BodyHandlers.ofString());

// Printing response body
System.out.println(response.body());

// Printing status code
System.out.println(response.statusCode());

// Handling the response body as a file
HttpResponse<Path> fileResponse = client
    .send(request, BodyHandlers.ofFile(Paths.get("myexample.html")));
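
Since statusCode() is available on every response, you could also guard the body handling with a simple status check; here is a minimal sketch:

// Only process the body when the request succeeded
if (response.statusCode() == 200) {
    System.out.println(response.body());
} else {
    System.out.println("Request failed with status code: " + response.statusCode());
}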

Synchronous Example

Here is an example that uses the HttpClient synchronous method to crawl a web page and output its content:

package javaHttpClient;

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.charset.StandardCharsets;

public class SyncExample {

    public static void main(String[] args) throws IOException, InterruptedException {

        // Encoding the URL
        String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/", StandardCharsets.UTF_8.name());

        // Instantiating HttpClient
        HttpClient client = HttpClient.newHttpClient();

        // Configuring HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
            .GET()
            .uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
            .build();

        // Handling the response
        HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
        System.out.println(response.body());
    }

}

If you run the program, it prints the entire HTML source code of the targeted web page.

Asynchronous Example

When using the HttpClient asynchronous method to crawl a web page, the sendAsync() method is called instead of send().

Here is an example:

package javaHttpClient;

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncExample {

    public static void main(String[] args) throws IOException, InterruptedException, ExecutionException, TimeoutException {

        // Encoding the URL
        String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/", StandardCharsets.UTF_8.name());

        // Instantiating HttpClient
        HttpClient client = HttpClient.newHttpClient();

        // Configuring HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
            .GET()
            .version(HttpClient.Version.HTTP_2)
            .uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
            .build();

        // Handling the response
        CompletableFuture<HttpResponse<String>> response =
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString());

        String result = response.thenApply(HttpResponse::body).get(5, TimeUnit.SECONDS);

        System.out.println(result);
    }

}

Conclusion

That’s how to build a web crawler in Java. The HttpClient API, which was introduced in Java 11, makes it easy to send requests and handle their responses.

And if the API is combined with a versatile tool like ProxyCrawl, it can make web crawling tasks smooth and rewarding.

With ProxyCrawl, you can create a scraper that can help you to retrieve information from websites anonymously and without worrying about being blocked.

It’s the tool you need to take your crawling efforts to the next level.

Click here to create a free ProxyCrawl account.

Happy scraping!


Yaniv Levy

Yaniv Levy, entrepreneur, visionary & technology enthusiast with over 20 years of experience as a Senior Software Engineer and Software Architect.
