Build A Web Crawler with Java

02 Jan 2022

Creating a web crawler is a smart way of retrieving useful information available online. With a web crawler, you can scan the Internet, browse through individual websites, and analyze and extract their content.

The Java programming language provides a simple way of building a web crawler and harvesting data from websites. You can use the extracted data for various use cases, such as for analytical purposes, providing a service that uses third-party data, or generating statistical data.

In this article, we’ll walk you through the process of building a web crawler using Java and ProxyCrawl.

What you’ll need

Typically, crawling web data involves creating a script that sends a request to the targeted web page, accesses its underlying HTML code, and scrapes the required information.

To accomplish that objective, you’ll need the following:

  • A Java development environment with JDK 11 or later, since the HttpClient API used in this tutorial was introduced in Java 11
  • A ProxyCrawl account and its API token for authorizing crawling requests

Before we develop the crawling logic, let’s clarify why using ProxyCrawl is important for web crawling.

Why use ProxyCrawl for Crawling

ProxyCrawl is a powerful data crawling and scraping tool you can use to harvest information from websites fast and easily.

Here are some reasons why you should use it for crawling online data:

  • Easy to use: It comes with a simple API that you can set up quickly without any programming hurdles. With just a few lines of code, you can start using the API to crawl websites and retrieve their content.
  • Supports advanced crawling: ProxyCrawl allows you to perform advanced web crawling and scrape data from complicated websites. Since it supports JavaScript rendering, ProxyCrawl lets you extract data from dynamic websites. It offers a headless browser that allows you to extract what real users see on their web browsers—even if a site is created using modern frameworks like Angular or React.js.
  • Bypasses crawling obstacles: ProxyCrawl can handle all the restrictions often associated with crawling online data. It has an extensive network of proxies as well as more than 17 data centers around the world. You can use it to avoid access restrictions, resolve CAPTCHAs, and evade other anti-scraping measures implemented by web applications. What’s more, you can crawl websites while remaining anonymous; you won’t have to worry about exposing your identity.
  • Free trial account: You can test how ProxyCrawl works without giving out your payment details. The free account comes with 1,000 credits for trying out the tool’s capabilities.

How ProxyCrawl Works

ProxyCrawl provides the Crawling API for crawling and scraping data from websites. You can easily integrate the API in your Java development project and retrieve information from web pages smoothly.

Each request made to the Crawling API starts with the following base part:

https://api.proxycrawl.com

Also, you’ll need to add the following mandatory parameters to the API:

  • Authentication token
  • URL

The authentication token is a unique token that authorizes you to use the Crawling API. Once you sign up for an account, ProxyCrawl will give you two types of tokens:

  • Normal token This is for making generic crawling requests.
  • JavaScript token This is for crawling dynamic websites. It provides you with headless browser capabilities for crawling web pages rendered using JavaScript. As pointed out earlier, it’s a useful way of crawling advanced websites.

Here is how to add the authentication token to your API request:

https://api.proxycrawl.com/?token=INSERT_TOKEN

The second mandatory parameter is the URL to crawl. It should start with HTTP or HTTPS, and be completely encoded. Encoding converts the URL string into a format that can be transferred over the Internet validly and easily.
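
For example, using Java’s built-in URLEncoder (the same approach used in the full examples later in this article), the target URL from this tutorial can be encoded with a minimal sketch like this:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

String encodedUrl = URLEncoder.encode(
        "https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/",
        StandardCharsets.UTF_8.name());
// encodedUrl is now:
// https%3A%2F%2Fwww.forextradingbig.com%2F7-reasons-why-you-should-quit-forex-trading%2F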

Here is how to insert the URL to your API request:

https://api.proxycrawl.com/?token=INSERT_TOKEN&url=INSERT_URL

If you run the above line—for example, on your terminal using cURL or pasting it on a browser’s address bar—it’ll execute the API request and return the entire HTML source code of the targeted web page.

It’s that easy and simple!

If you want to perform advanced crawling, you may add other parameters to the API request. For example, when using the JavaScript token, you can add the page_wait parameter to instruct the browser to wait for the specified number of milliseconds before the resulting HTML code is captured.

Here is an example:

https://api.proxycrawl.com/?token=INSERT_TOKEN&page_wait=1000&url=INSERT_URL

Building a Web Crawler in Java and ProxyCrawl

In this Java web crawling tutorial, we’ll use the HttpClient API to create the crawling logic. The API was introduced in Java 11, and it comes with lots of useful features for sending requests and retrieving their responses.

The HttpClient API supports both HTTP/1.1 and HTTP/2. By default, it uses HTTP/2 to send requests. If a request is sent to a server that does not yet support HTTP/2, it will automatically be downgraded to HTTP/1.1.

Furthermore, its requests can be sent asynchronously or synchronously, it handles requests and response bodies as reactive-streams, and uses the common builder pattern.

The API consists of three core classes:

  • HttpRequest
  • HttpClient
  • HttpResponse

Let’s talk about each of them in more detail.

1. HttpRequest

The HttpRequest, as the name implies, is an object encapsulating the HTTP request to be sent. To create new instances of HttpRequest, call HttpRequest.newBuilder(). After it has been created, the request is immutable and can be sent multiple times.

The Builder class comes with different methods for configuring the request.

These are the most common methods:

  • URI method
  • Request method
  • Protocol version method
  • Timeout method

Let’s talk about each of them in more detail.

a) URI method

The first thing to do when configuring the request is to set the URL to crawl. We can do so by calling the uri() method on the Builder instance. We’ll also use the URI.create() method to create the URI by parsing the string of the URL we intend to crawl.

Here is the code:

String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/",
StandardCharsets.UTF_8.name());

HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url="
+ url))

Notice that we provided the URL string using ProxyCrawl’s settings. This is the web page whose contents we intend to scrape.

We also encoded the URL using the Java URLEncoder class. As earlier mentioned, ProxyCrawl requires URLs to be encoded.

b) Request method

The next thing to do is to specify the HTTP method to be used for making the request. We can call any of the following methods from Builder:

  • GET()
  • POST()
  • PUT()
  • DELETE()

In this case, since we want to request data from the target web page, we’ll use the GET() method.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
.GET()

So far, HttpRequest has all the parameters that should be passed to HttpClient. However, you may need to include other parameters, such as the HTTP protocol version and timeout.

Let’s see how you can add the additional parameters.

c) Protocol version method

As earlier mentioned, the HttpClient API uses the HTTP/2 protocol by default. Nonetheless, you can specify the version of the HTTP protocol you want to use.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
.version(HttpClient.Version.HTTP_2)

d) Timeout method

You can set the amount of time to wait before a response is received. Once the defined period expires, an HttpTimeoutException will be thrown. By default, the timeout is set to infinity.

You can define timeout by calling the timeout() method on the builder instance. You’ll also need to pass the Duration object to specify the amount of time to wait.

Here is the code:

HttpRequest request = HttpRequest.newBuilder()
.timeout(Duration.ofSeconds(20))
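
Putting the pieces from this section together, a complete request might look like the following sketch; it assumes the encoded url variable from section a) and the INSERT_TOKEN placeholder used throughout this article:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
        .GET()
        .version(HttpClient.Version.HTTP_2)
        .timeout(Duration.ofSeconds(20))
        .build();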

2. HttpClient

The HttpClient class is the main entry point of the API—it acts as a container for the configuration details shared among multiple requests. It is the HTTP client used for sending requests and receiving responses.

You can call either the HttpClient.newBuilder() or the HttpClient.newHttpClient() method to instantiate it. After an instance of the HttpClient has been created, it’s immutable.

The HttpClient class offers several helpful and self-describing methods you can use when working with requests and responses.

These are some things you can do:

  • Set protocol version
  • Set redirect policy
  • Send synchronous and asynchronous requests

Let’s talk about each of them in more detail.

a) Set protocol version

As earlier mentioned, the HttpClient class uses the HTTP/2 protocol by default. However, you can set your preferred protocol version, either HTTP/1.1 or HTTP/2.

Here is an example:

HttpClient client = HttpClient.newBuilder()
.version(Version.HTTP_1_1)

b) Set redirect policy

If the targeted web page has moved to a different address, you’ll get a 3xx HTTP status code. Since the address of the new URI is usually provided with the status code information, setting the correct redirect policy can make HttpClient forward the request automatically to the new location.

You can set it by using the followRedirects() method on the Builder instance.

Here is an example:

HttpClient client = HttpClient.newBuilder()
.followRedirects(Redirect.NORMAL)

c) Send synchronous and asynchronous requests

HttpClient supports two ways of sending requests:

  • Synchronously by using the send() method. This blocks the client until the response is received, before continuing with the rest of the execution.

Here is an example:

HttpResponse<String> response = client.send(request,
BodyHandlers.ofString());

Note that we used BodyHandlers and called the ofString() method to return the HTML response as a string.

  • Asynchronously by using the sendAsync() method. This does not wait for the response to be received; it’s non-blocking. Once the sendAsync() method is called, it returns immediately with a CompletableFuture<HttpResponse<String>>, which completes once the response is received. The returned CompletableFuture can be combined using various techniques to define dependencies among asynchronous tasks.

Here is an example:

CompletableFuture<HttpResponse<String>> response =
client.sendAsync(request, HttpResponse.BodyHandlers.ofString());

3. HttpResponse

The HttpResponse, as the name implies, represents the response received after sending an HttpRequest. HttpResponse offers different helpful methods for handling the received response.

These are the most important methods:

  • statusCode() This method returns the HTTP status code of the response as an int.
  • body() This method returns the body of the response. The return type depends on the BodyHandler parameter that is passed to the send() method.

Here is an example:

// Handling the response body as a String
HttpResponse<String> response = client
.send(request, BodyHandlers.ofString());

// Printing response body
System.out.println(response.body());

// Printing status code
System.out.println(response.statusCode());

// Handling the response body as a file
HttpResponse<Path> fileResponse = client
.send(request, BodyHandlers.ofFile(Paths.get("myexample.html")));

Synchronous Example

Here is an example that uses the HttpClient synchronous method to crawl a web page and output its content:

package javaHttpClient;

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.charset.StandardCharsets;

public class SyncExample {

public static void main(String[] args) throws IOException, InterruptedException {

// Encoding the URL
String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/", StandardCharsets.UTF_8.name());

// Instantiating HttpClient
HttpClient client = HttpClient.newHttpClient();

// Configuring HttpRequest
HttpRequest request = HttpRequest.newBuilder()
.GET()
.uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
.build();

// Handling the response
HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
System.out.println(response.body());
}

}

The output is the entire HTML source code of the crawled web page, printed to the console.

Asynchronous Example

When using the HttpClient asynchronous method to crawl a web page, the sendAsync() method is called, instead of send().

Here is an example:

package javaHttpClient;

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncExample {


public static void main(String[] args) throws IOException, InterruptedException, ExecutionException, TimeoutException {

// Encoding the URL
String url = URLEncoder.encode("https://www.forextradingbig.com/7-reasons-why-you-should-quit-forex-trading/", StandardCharsets.UTF_8.name());

// Instantiating HttpClient
HttpClient client = HttpClient.newHttpClient();

// Configuring HttpRequest
HttpRequest request = HttpRequest.newBuilder()
.GET()
.version(HttpClient.Version.HTTP_2)
.uri(URI.create("https://api.proxycrawl.com/?token=INSERT_TOKEN&url=" + url))
.build();

// Handling the response
CompletableFuture<HttpResponse<String>> response =
client.sendAsync(request, HttpResponse.BodyHandlers.ofString());

String result = response.thenApply(HttpResponse::body).get(5, TimeUnit.SECONDS);

System.out.println(result);

}

}

Conclusion

That’s how to build a web crawler in Java. The HttpClient API, which was introduced in Java 11, makes it easy to send requests and handle their responses.

And if the API is combined with a versatile tool like ProxyCrawl, it can make web crawling tasks smooth and rewarding.

With ProxyCrawl, you can create a scraper that can help you to retrieve information from websites anonymously and without worrying about being blocked.

It’s the tool you need to take your crawling efforts to the next level.

Click here to create a free ProxyCrawl account.

Happy scraping!

Composite Primary Keys in JPA

14 May 2021

1. Introduction

In this tutorial, we’ll learn about Composite Primary Keys and the corresponding annotations in JPA.

2. Composite Primary Keys

A composite primary key – also called a composite key – is a combination of two or more columns to form a primary key for a table.

In JPA, we have two options to define the composite keys: The @IdClass and @EmbeddedId annotations.

In order to define the composite primary keys, we should follow some rules:

  • The composite primary key class must be public
  • It must have a no-arg constructor
  • It must define equals() and hashCode() methods
  • It must be Serializable

3. The IdClass Annotation

Let’s say we have a table called Account and it has two columns – accountNumber, accountType – that form the composite key. Now we have to map it in JPA.

As per the JPA specification, let’s create an AccountId class with these primary key fields:

public class AccountId implements Serializable {
    private String accountNumber;

    private String accountType;

    // default constructor

    public AccountId(String accountNumber, String accountType) {
        this.accountNumber = accountNumber;
        this.accountType = accountType;
    }

    // equals() and hashCode()
}
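
For completeness, here is a minimal sketch of the equals() and hashCode() methods that the rules above require, based on both key fields. These methods belong inside the AccountId class shown above and use java.util.Objects:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    AccountId accountId = (AccountId) o;
    return Objects.equals(accountNumber, accountId.accountNumber)
        && Objects.equals(accountType, accountId.accountType);
}

@Override
public int hashCode() {
    return Objects.hash(accountNumber, accountType);
}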

Next, let’s associate the AccountId class with the entity Account.

In order to do that, we need to annotate the entity with the @IdClass annotation. We must also declare the fields from the AccountId class in the entity Account and annotate them with @Id:

@Entity
@IdClass(AccountId.class)
public class Account {
    @Id
    private String accountNumber;

    @Id
    private String accountType;

    // other fields, getters and setters
}

4. The EmbeddedId Annotation

@EmbeddedId is an alternative to the @IdClass annotation.

Let’s consider another example where we have to persist some information of a Book with title and language as the primary key fields.

In this case, the primary key class, BookId, must be annotated with @Embeddable:

@Embeddable
public class BookId implements Serializable {
    private String title;
    private String language;

    // default constructor

    public BookId(String title, String language) {
        this.title = title;
        this.language = language;
    }

    // getters, equals() and hashCode() methods
}

Then, we need to embed this class in the Book entity using @EmbeddedId:

@Entity
public class Book {
    @EmbeddedId
    private BookId bookId;

    // constructors, other fields, getters and setters
}

5. @IdClass vs @EmbeddedId

As we just saw, the difference on the surface between these two is that with @IdClass, we had to specify the columns twice – once in AccountId and again in Account. But, with @EmbeddedId we didn’t.

There are some other tradeoffs, though.

For example, these different structures affect the JPQL queries that we write.

For example, with @IdClass, the query is a bit simpler:

SELECT account.accountNumber FROM Account account

With @EmbeddedId, we have to do one extra traversal:

SELECT book.bookId.title FROM Book book

Also, @IdClass can be quite useful in places where we are using a composite key class that we can’t modify.

Finally, if we’re going to access parts of the composite key individually, we can make use of @IdClass, but in places where we frequently use the complete identifier as an object, @EmbeddedId is preferred.
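
As a quick illustration of that usage, here is a sketch of loading both entities by their composite keys with a plain EntityManager; the em variable and the key values are assumed for the example:

// With @IdClass, the id class instance is passed as the primary key
Account account = em.find(Account.class, new AccountId("12345", "SAVINGS"));

// With @EmbeddedId, the embeddable key works the same way
Book book = em.find(Book.class, new BookId("War and Peace", "English"));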

6. Conclusion

In this quick article, we explored composite primary keys in JPA.

@Scope – How to get Scope of Bean from Code

14 May 2021

When we create a bean, we are creating actual instances of the class defined by that bean definition. We can also control the scope of the objects created from a particular bean definition.

There are five types of bean scopes:

  • singleton (default scope)
  • prototype
  • request
  • session
  • global-session

Singleton:
A single bean instance per Spring IoC container.

Prototype:
A single bean definition scoped to any number of object instances.

Request:
A single bean instance per HTTP request. Only valid in a web-aware Spring ApplicationContext.

Session:
A single bean instance per HTTP session. Only valid in a web-aware Spring ApplicationContext.

Global-Session:
Similar to session scope, but it only makes sense in the context of portlet-based web applications. Only valid in a web-aware Spring ApplicationContext.
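
Since the title promises getting the scope from code, here is a minimal sketch (class and bean names are illustrative) that declares a prototype-scoped bean and then reads its scope back from the bean definition:

import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;

@Component
@Scope("prototype")
class ReportGenerator {
}

class ScopeLookup {
    public static void main(String[] args) {
        try (AnnotationConfigApplicationContext context =
                 new AnnotationConfigApplicationContext(ReportGenerator.class)) {
            // read the scope from the registered bean definition
            String scope = context.getBeanFactory()
                    .getBeanDefinition("reportGenerator")
                    .getScope();
            System.out.println(scope); // prints "prototype"
        }
    }
}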

Apache Kafka For Beginners

31 Mar 2021

What is Kafka?

We use Apache Kafka to enable communication between producers and consumers using message-based topics. Apache Kafka is a fast, scalable, fault-tolerant, publish-subscribe messaging system. Basically, it provides a platform for high-end, new-generation distributed applications. It also supports a large number of permanent or ad-hoc consumers. One of Kafka’s best features is that it is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.

Moreover, this technology replaces conventional message brokers such as JMS and AMQP, with the ability to provide higher throughput, reliability, and replication. As its core abstractions, Kafka offers the Kafka broker, the Kafka Producer, and the Kafka Consumer. A Kafka broker is a node in the Kafka cluster; its role is to persist and replicate data. A Kafka Producer pushes messages into a message container called a Kafka Topic, whereas a Kafka Consumer pulls messages from a Kafka Topic.

Before moving forward in this Kafka tutorial, let’s understand what the term “messaging system” actually means in Kafka.

a. Messaging System in Kafka

When we transfer data from one application to another, we use a messaging system. As a result, applications can focus on the data itself without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. There are two types of messaging patterns available, i.e. point-to-point and publish-subscribe (pub-sub); however, most messaging systems follow the pub-sub pattern.

Apache Kafka — Kafka Messaging System
  • Point to Point Messaging System

Here, messages are persisted in a queue. Although one or more consumers can consume the messages in the queue, a particular message can be consumed by at most one consumer. The system also makes sure that as soon as a consumer reads a message from the queue, the message disappears from that queue.

  • Publish-Subscribe Messaging System

Here, messages are persisted in a topic. In this system, Kafka consumers can subscribe to one or more topics and consume all the messages in those topics. Moreover, message producers are called publishers and message consumers are called subscribers.

History of Apache Kafka

Previously, LinkedIn was facing the problem of ingesting a huge amount of data from its website with low latency into a lambda architecture capable of processing real-time events. As a solution, Apache Kafka was developed in 2010, since none of the existing solutions could deal with this problem.

There were technologies available for batch processing, but the deployment details of those technologies had to be shared with downstream users. Hence, when it came to real-time processing, those technologies were not suitable enough. Then, in 2011, Kafka was made public.

Why Should we use Apache Kafka Cluster?

As we all know, Big Data involves an enormous volume of data, and with it come two main challenges: one is to collect the large volume of data, and the other is to analyze the collected data. To overcome those challenges, we need a messaging system, and this is where Apache Kafka has proved its utility. Apache Kafka offers numerous benefits, such as:

  • Tracking web activities by storing/sending the events for real-time processes.
  • Alerting and reporting the operational metrics.
  • Transforming data into the standard format.
  • Continuous processing of streaming data to the topics.

Therefore, because of its wide use, this technology gives tough competition to some of the most popular messaging applications like ActiveMQ, RabbitMQ, AWS, etc.

Kafka Tutorial — Audience

Professionals aspiring to make a career in Big Data analytics using the Apache Kafka messaging system should refer to this Kafka tutorial. It will give you a complete understanding of Apache Kafka.

Kafka Tutorial — Prerequisites

You must have a good understanding of Java or Scala, distributed messaging systems, and the Linux environment before proceeding with this Apache Kafka tutorial.

Kafka Architecture

Below we are discussing four core APIs in this Apache Kafka tutorial:

Apache Kafka — Kafka Architecture

a. Kafka Producer API

This Kafka Producer API permits an application to publish a stream of records to one or more Kafka topics.
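
As a minimal sketch of this API (the broker address and topic name are placeholders), a producer in Java looks like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // send one record to a topic and close the producer
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
        }
    }
}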

b. Kafka Consumer API

To subscribe to one or more topics and process the stream of records produced to them in an application, we use this Kafka Consumer API.
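
A matching consumer sketch (again with placeholder broker, group id, and topic; poll(Duration) assumes a Kafka 2.0+ client) looks like this:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            // one poll for brevity; real consumers poll in a loop
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}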

c. Kafka Streams API

The Kafka Streams API allows an application to act as a stream processor: consuming an input stream from one or more topics, producing an output stream to one or more output topics, and effectively transforming the input streams into output streams.

d. Kafka Connector API

This Kafka Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

Kafka Components

Using the following components, Kafka achieves messaging:

a. Kafka Topic

Topics are basically how Kafka stores and organizes messages across its system; essentially, a topic is a collection of messages. In addition, topics can be replicated (copied) and partitioned (divided). You can visualize them as logs in which Kafka stores messages. This ability to replicate and partition topics is one of the factors that enable Kafka’s fault tolerance and scalability.

Apache Kafka — Kafka Topic

b. Kafka Producer

It publishes messages to a Kafka topic.

c. Kafka Consumer

This component subscribes to a topic(s), reads and processes messages from the topic(s).

d. Kafka Broker

Kafka Broker manages the storage of messages in the topic(s). If Kafka has more than one broker, that is what we call a Kafka cluster.

e. Kafka Zookeeper

Kafka uses ZooKeeper to provide the brokers with metadata about the processes running in the system and to facilitate health checking and broker leadership election.

Kafka Tutorial — Log Anatomy

In this Kafka tutorial, we view the log as the partitions. Basically, a data source writes messages to the log. One of the advantages is that, at any time, one or more consumers can read from the log they select. The diagram below shows a log being written by the data source and being read by consumers at different offsets.

Apache Kafka Tutorial — Log Anatomy

Kafka Tutorial — Data Log

Kafka retains messages for a considerable amount of time, and consumers can read them at their convenience. However, if Kafka is configured to keep messages for 24 hours and a consumer is down for more than 24 hours, the consumer will lose messages. If the consumer’s downtime is only 60 minutes, messages can be read from the last known offset. Kafka doesn’t keep state on what consumers are reading from a topic.

Kafka Tutorial — Partition in Kafka

Every Kafka broker holds a few partitions, and each partition can be either a leader or a replica of a topic. The leader is responsible for all writes and reads to a topic, as well as for updating the replicas with new data. If the leader fails, a replica takes over as the new leader.

Apache Kafka Tutorial — Partition In Kafka

Importance of Java in Apache Kafka

Apache Kafka is written in Scala and Java, and Kafka’s native API is Java. Clients in many other languages, such as C++, Python, .NET, and Go, also support Kafka. Still, Java is the platform where no third-party library is needed, and writing code in languages other than Java adds a little overhead.

In addition, we can use the Java language if we need the high processing rates that come standard with Kafka. Java also provides good community support for Kafka consumer clients. Hence, Java is the right choice for implementing Kafka clients.

Kafka Use Cases

There are several use Cases of Kafka that show why we actually use Apache Kafka.

  • Messaging

Kafka works well as a replacement for a more traditional message broker. It has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications.

  • Metrics

Kafka is a good fit for operational monitoring data. This includes aggregating statistics from distributed applications to produce centralized feeds of operational data.

  • Event Sourcing

Since it supports very large amounts of stored log data, Kafka is an excellent backend for event-sourcing applications.

Kafka Tutorial — Comparisons in Kafka

Many applications, such as ActiveMQ, RabbitMQ, Apache Flume, Storm, and Spark, offer functionality similar to Kafka’s. So why should you go for Apache Kafka instead of the others?

Let’s see the comparisons below:

a. Apache Kafka vs Apache Flume

Kafka Tutorial — Apache Kafka vs Flume

i. Types of tool

Apache Kafka – a general-purpose tool for multiple producers and consumers.

Apache Flume – a special-purpose tool for specific applications.

ii. Replication feature

Apache Kafka – it replicates events using ingest pipelines.

Apache Flume – it does not replicate events.

b. RabbitMQ vs Apache Kafka

One among the foremost Apache Kafka alternatives is RabbitMQ. So, let’s see how they differ from one another:

Kafka Tutorial — Kafka vs RabbitMQ

i. Features

Apache Kafka – Kafka is distributed, and the data is shared and replicated with guaranteed durability and availability.

RabbitMQ – it offers relatively less support for these features.

ii. Performance rate

Apache Kafka – its performance rate is high, to the tune of 100,000 messages/second.

RabbitMQ – the performance rate of RabbitMQ is around 20,000 messages/second.

iii. Processing

Apache Kafka – it allows reliable, distributed log processing, with stream-processing semantics built into Kafka Streams.

RabbitMQ – the consumer is simply FIFO-based, reading from the head of the queue and processing messages one by one.

c. Traditional queuing systems vs Apache Kafka

Kafka Tutorial — Traditional queuing systems vs Apache Kafka

i. Messages Retaining

Traditional queuing systems – most queueing systems remove messages after they have been processed, typically from the end of the queue.

Apache Kafka – here, messages persist even after being processed. They don’t get removed as consumers receive them.

ii. Logic-based processing

Traditional queuing systems – they do not allow processing logic based on similar messages or events.

Apache Kafka – it allows processing logic based on similar messages or events.

So, this was all about Apache Kafka Tutorials. Hope you like our explanation.

Conclusion: Kafka Tutorial

Hence, in this Kafka tutorial, we have seen the whole concept of Apache Kafka and what Kafka is. Moreover, we discussed Kafka components, use cases, and Kafka architecture. At last, we compared Kafka with other messaging tools. Furthermore, if you have any query regarding this Kafka tutorial, feel free to ask in the comment section. Also, keep visiting Data Flair for more knowledgeable articles on Apache Kafka.

Spring Boot JSP View Resolver Example

30 Mar 2021

Learn to create and configure a Spring Boot JSP view resolver which uses JSP template files to render the view layer. This example uses the embedded Tomcat server to run the application.

Maven dependencies – pom.xml

This application uses the dependencies given below.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.fusebes</groupId>
    <artifactId>spring-boot-demo</artifactId>
    <packaging>war</packaging>
    <version>0.0.1-SNAPSHOT</version>
    <name>spring-boot-demo Maven Webapp</name>
    <url>http://maven.apache.org</url>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.1.RELEASE</version>
    </parent>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <!-- Web -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!-- Tomcat Embed -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
            <scope>provided</scope>
        </dependency>

        <!-- JSTL -->
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>jstl</artifactId>
        </dependency>

        <!-- To compile JSP files -->
        <dependency>
            <groupId>org.apache.tomcat.embed</groupId>
            <artifactId>tomcat-embed-jasper</artifactId>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>

Spring Boot Application Initializer

The first step in producing a deployable war file is to provide a SpringBootServletInitializer subclass and override its configure() method. This makes use of Spring Framework’s Servlet 3.0 support and allows you to configure your application when it’s launched by the servlet container.

package com.fusebes.app.controller;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;

@SpringBootApplication
public class SpringBootWebApplication extends SpringBootServletInitializer {

    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
        return application.sources(SpringBootWebApplication.class);
    }

    public static void main(String[] args) throws Exception {
        SpringApplication.run(SpringBootWebApplication.class, args);
    }
}

Spring Controller

Controller classes can have methods mapped to specific URLs in the application. The given application has two mappings, i.e. “/” and “/next”.

package com.fusebes.app.controller;

import java.util.Map;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class IndexController {

    @RequestMapping("/")
    public String home(Map<String, Object> model) {
        model.put("message", "Fusebes Reader !!");
        return "index";
    }

    @RequestMapping("/next")
    public String next(Map<String, Object> model) {
        model.put("message", "You are in new page !!");
        return "next";
    }
}

Spring Boot JSP ViewResolver Configuration

To resolve JSP files location, you can have two approaches.

1) Add entries in application.properties

spring.mvc.view.prefix=/WEB-INF/view/
spring.mvc.view.suffix=.jsp

# For detailed logging during development
logging.level.org.springframework=TRACE
logging.level.com=TRACE

2) Configure InternalResourceViewResolver to serve JSP pages

package com.fusebes.app.controller;

import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;
import org.springframework.web.servlet.config.annotation.ViewResolverRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;
import org.springframework.web.servlet.view.InternalResourceViewResolver;
import org.springframework.web.servlet.view.JstlView;

@Configuration
@EnableWebMvc
@ComponentScan
public class MvcConfiguration extends WebMvcConfigurerAdapter {

    @Override
    public void configureViewResolvers(ViewResolverRegistry registry) {
        InternalResourceViewResolver resolver = new InternalResourceViewResolver();
        resolver.setPrefix("/WEB-INF/view/");
        resolver.setSuffix(".jsp");
        resolver.setViewClass(JstlView.class);
        registry.viewResolver(resolver);
    }
}

JSP Files

The two JSP files used in this Spring Boot JSP example are below.

index.jsp

<!DOCTYPE html>
<%@ taglib prefix="spring" uri="http://www.springframework.org/tags"%>
<html lang="en">
<body>
    <div>
        <div>
            <h1>Spring Boot JSP Example</h1>
            <h2>Hello ${message}</h2>
            Click on this <strong><a href="next">link</a></strong> to visit another page.
        </div>
    </div>
</body>
</html>

next.jsp

<!DOCTYPE html>
<%@ taglib prefix="spring" uri="http://www.springframework.org/tags"%>
<html lang="en">
<body>
    <div>
        <div>
            <h1>Another page</h1>
            <h2>Hello ${message}</h2>
            Click on this <strong><a href="/">link</a></strong> to visit the previous page.
        </div>
    </div>
</body>
</html>

Demo

After the whole code is written and placed inside the proper folders, run the application by executing the main() method in the SpringBootWebApplication class.

Now hit the URL: http://localhost:8080/

Then click the “next” link to visit the second page.

Create Maven Web Project In Eclipse

30 Mar 2021

Learn to create a Maven web project which we can import into the Eclipse IDE for further development.

To create an Eclipse-supported web project, we first need to create a normal Maven web application and then make it compatible with the Eclipse IDE.

1. Create maven web project in eclipse

Run this maven command to create a maven web project named ‘demoWebApplication‘. Maven archetype used is ‘maven-archetype-webapp‘.

$ mvn archetype:generate -DgroupId=com.howtodoinjava -DartifactId=demoWebApplication -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

This will create maven web project structure and web application specific files like web.xml.


2. Convert to eclipse dynamic web project

To convert the created Maven web project to an Eclipse dynamic web project, the following Maven command needs to be run.

$ mvn eclipse:eclipse -Dwtpversion=2.0

Please remember that adding “-Dwtpversion=2.0” is necessary, otherwise using only “mvn eclipse:eclipse” will convert it to only normal Java project (without web support), and you will not be able to run it as web application.

3. Import web project in Eclipse

  1. Click on the File menu and click on the Import option.
  2. Now, click on “Existing Projects into Workspace” in the General section.
  3. Now, browse to the project root folder, click OK, and then Finish.
  4. The above steps will import the project into the Eclipse workspace. You can verify the project structure once the import completes.

In this Maven tutorial, we learned how to create a Maven dynamic web project in Eclipse. In this example, I used Eclipse Oxygen; you may have a different Eclipse version, but the steps to follow will be the same.

Happy Learning !!

Spring Boot Annotations

30 Mar 2021

The Spring Boot annotations are mostly placed in the org.springframework.boot.autoconfigure and org.springframework.boot.autoconfigure.condition packages. Let’s learn about some frequently used Spring Boot annotations, as well as those that work behind the scenes.

1. @SpringBootApplication

Spring Boot is mostly about auto-configuration. This auto-configuration is done by component scanning, i.e. finding all classes in the classpath annotated with @Component. It also involves scanning for the @Configuration annotation and initializing some extra beans.

The @SpringBootApplication annotation enables all of this in one step. It enables the following three features:

  1. @EnableAutoConfiguration : enable auto-configuration mechanism
  2. @ComponentScan : enable @Component scan
  3. @SpringBootConfiguration : register extra beans in the context

The java class annotated with @SpringBootApplication is the main class of a Spring Boot application and application starts from here.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

2. @EnableAutoConfiguration

This annotation enables auto-configuration of the Spring Application Context, attempting to guess and configure beans that we are likely to need based on the presence of predefined classes in classpath.

For example, if we have tomcat-embedded.jar on the classpath, we are likely to want a TomcatServletWebServerFactory.

As this annotation is already included via @SpringBootApplication, adding it again on the main class has no impact. It is also advised to include this annotation only once, via @SpringBootApplication.

Auto-configuration classes are regular Spring configuration beans. They are located using the SpringFactoriesLoader mechanism (keyed against this class). Generally, auto-configuration beans are @Conditional beans (most often using the @ConditionalOnClass and @ConditionalOnMissingBean annotations).
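
For reference, here is a sketch of the META-INF/spring.factories entry that this mechanism reads; the configuration class name below is a hypothetical example, and note that newer Spring Boot versions have since moved to a dedicated imports file:

# META-INF/spring.factories (the class name is a hypothetical example)
org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
com.example.autoconfigure.MyServiceAutoConfiguration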

3. @SpringBootConfiguration

It indicates that a class provides Spring Boot application configuration. It can be used as an alternative to Spring’s standard @Configuration annotation so that configuration can be found automatically.

Application should only ever include one @SpringBootConfiguration and most idiomatic Spring Boot applications will inherit it from @SpringBootApplication.

The main difference between the two annotations is that @SpringBootConfiguration allows configuration to be located automatically. This can be especially useful for unit or integration tests.

4. @ImportAutoConfiguration

It imports and applies only the specified auto-configuration classes. The difference between @ImportAutoConfiguration and @EnableAutoConfiguration is that the latter attempts to configure beans found in the classpath during scanning, whereas @ImportAutoConfiguration only runs the configuration classes we provide in the annotation.

We should use @ImportAutoConfiguration when we don’t want to enable the default auto-configuration.

@ComponentScan("path.to.your.controllers")
@ImportAutoConfiguration({
    WebMvcAutoConfiguration.class,
    DispatcherServletAutoConfiguration.class,
    EmbeddedServletContainerAutoConfiguration.class,
    ServerPropertiesAutoConfiguration.class,
    HttpMessageConvertersAutoConfiguration.class
})
public class App {

    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }
}

5. @AutoConfigureBefore, @AutoConfigureAfter, @AutoConfigureOrder

We can use the @AutoConfigureAfter or @AutoConfigureBefore annotations if our configuration needs to be applied in a specific order (before or after).

If we want to order certain auto-configurations that should not have any direct knowledge of each other, we can also use @AutoConfigureOrder. That annotation has the same semantic as the regular @Order annotation but provides a dedicated order for auto-configuration classes.

@Configuration
@AutoConfigureAfter(CacheAutoConfiguration.class)
@ConditionalOnBean(CacheManager.class)
@ConditionalOnClass(CacheStatisticsProvider.class)
public class RedissonCacheStatisticsAutoConfiguration {

    @Bean
    public RedissonCacheStatisticsProvider redissonCacheStatisticsProvider() {
        return new RedissonCacheStatisticsProvider();
    }
}

6. Condition Annotations

All auto-configuration classes generally have one or more @Conditional annotations, which allow a bean to be registered only when the condition is met. The following are some useful conditional annotations.

6.1. @ConditionalOnBean and @ConditionalOnMissingBean

These annotations let a bean be included based on the presence or absence of specific beans.

Its value attribute is used to specify beans by type or by name. Also, the search attribute lets us limit the ApplicationContext hierarchy that should be considered when searching for beans.

Using these annotations at the class level prevents registration of the @Configuration class as a bean if the condition does not match.

In below example, bean JpaTransactionManager will only be loaded if a bean of type JpaTransactionManager is not already defined in the application context.

@Bean
@ConditionalOnMissingBean(type = "JpaTransactionManager")
JpaTransactionManager transactionManager(EntityManagerFactory entityManagerFactory) {
    JpaTransactionManager transactionManager = new JpaTransactionManager();
    transactionManager.setEntityManagerFactory(entityManagerFactory);
    return transactionManager;
}

6.2. @ConditionalOnClass and @ConditionalOnMissingClass

These annotations let configuration classes be included based on the presence or absence of specific classes. Notice that annotation metadata is parsed by using spring ASM module, and even if a class might not be present in runtime – you can still refer to the class in annotation.

We can also use value attribute to refer the real class or the name attribute to specify the class name by using a String value.

Below configuration will create EmbeddedAcmeService only if this class is available in runtime and no other bean with same name is present in application context.

@Configuration
@ConditionalOnClass(EmbeddedAcmeService.class)
static class EmbeddedConfiguration {

    @Bean
    @ConditionalOnMissingBean
    public EmbeddedAcmeService embeddedAcmeService() { ... }
}

6.3. @ConditionalOnNotWebApplication and @ConditionalOnWebApplication

These annotations let configuration be included depending on whether the application is a “web application” or not. In Spring, a web application is one which meets at least one of below three requirements:

  1. uses a Spring WebApplicationContext
  2. defines a session scope
  3. has a StandardServletEnvironment
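
The article shows no snippet for this pair, so here is a minimal sketch in the same style as the other examples; MyWebHelper is a hypothetical class used only for illustration:

@Configuration
@ConditionalOnWebApplication
public class WebOnlyConfiguration {

    @Bean
    public MyWebHelper myWebHelper() {
        // this bean is only registered when the application is a web application
        return new MyWebHelper();
    }
}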

6.4. @ConditionalOnProperty

This annotation lets configuration be included based on the presence and value of a Spring Environment property.

For example, if we have different datasource definitions for different environments, we can use this annotation.

@Bean
@ConditionalOnProperty(name = "env", havingValue = "local")
DataSource localDataSource() {
    // ...
}

@Bean
@ConditionalOnProperty(name = "env", havingValue = "prod")
DataSource prodDataSource() {
    // ...
}

6.5. @ConditionalOnResource

This annotation lets configuration be included only when a specific resource is present in the classpath. Resources can be specified by using the usual Spring conventions.

@ConditionalOnResource(resources = "classpath:vendor.properties")
Properties additionalProperties() {
    // ...
}

6.6. @ConditionalOnExpression

This annotation lets configuration be included based on the result of a SpEL expression. Use this annotation when condition to evaluate is complex one and shall be evaluated as one condition.

@Bean
@ConditionalOnExpression("${env} && ${havingValue == 'local'}")
DataSource dataSource() {
    // ...
}

6.7. @ConditionalOnCloudPlatform

This annotation lets configuration be included when the specified cloud platform is active.

@Configuration
@ConditionalOnCloudPlatform(CloudPlatform.CLOUD_FOUNDRY)
public class CloudConfigurationExample {

    @Bean
    public MyBean myBean(MyProperties properties) {
        return new MyBean(properties.getParam());
    }
}

Drop me your questions related to spring boot annotations in comments.

Happy Learning !!

Ref: Spring Boot Docs

Clarified CQRS

28 Mar 2021

After listening to how the community has interpreted Command-Query Responsibility Segregation, I think the time has come for some clarification. Some have been tying it together with Event Sourcing. Most have been overlaying their previous layered architecture assumptions on it. Here I hope to identify CQRS itself, and describe in which places it can connect to other patterns.
This is quite a long post.
Why CQRS Before describing the details of CQRS we need to understand the two main driving forces behind it: collaboration and staleness.
Collaboration refers to circumstances under which multiple actors will be using/modifying the same set of data – whether or not the intention of the actors is actually to collaborate with each other. There are often rules which indicate which user can perform which kind of modification and modifications that may have been acceptable in one case may not be acceptable in others. We’ll give some examples shortly. Actors can be human like normal users, or automated like software.Staleness refers to the fact that in a collaborative environment, once data has been shown to a user, that same data may have been changed by another actor – it is stale. Almost any system which makes use of a cache is serving stale data – often for performance reasons. What this means is that we cannot entirely trust our users decisions, as they could have been made based on out-of-date information.Standard layered architectures don’t explicitly deal with either of these issues. While putting everything in the same database may be one step in the direction of handling collaboration, staleness is usually exacerbated in those architectures by the use of caches as a performance-improving afterthought.A picture for referenceI’ve given some talks about CQRS using this diagram to explain it:The boxes named AC are Autonomous Components. We’ll describe what makes them autonomous when discussing commands. But before we go into the complicated parts, let’s start with queries:QueriesIf the data we’re going to be showing users is stale anyway, is it really necessary to go to the master database and get it from there? Why transform those 3rd normal form structures to domain objects if we just want data – not any rule-preserving behaviors? Why transform those domain objects to DTOs to transfer them across a wire, and who said that wire has to be exactly there? Why transform those DTOs to view model objects?In short, it looks like we’re doing a heck of a lot of unnecessary work based on the assumption that reusing code that has already been written will be easier than just solving the problem at hand. Let’s try a different approach:How about we create an additional data store whose data can be a bit out of sync with the master database – I mean, the data we’re showing the user is stale anyway, so why not reflect in the data store itself. We’ll come up with an approach later to keep this data store more or less in sync.Now, what would be the correct structure for this data store? How about just like the view model? One table for each view. Then our client could simply SELECT * FROM MyViewTable (or possibly pass in an ID in a where clause), and bind the result to the screen. That would be just as simple as can be. You could wrap that up with a thin facade if you feel the need, or with stored procedures, or using AutoMapper which can simply map from a data reader to your view model class. The thing is that the view model structures are already wire-friendly, so you don’t need to transform them to anything else.You could even consider taking that data store and putting it in your web tier. It’s just as secure as an in-memory cache in your web tier. Give your web servers SELECT only permissions on those tables and you should be fine.Query Data StorageWhile you can use a regular database as your query data store it isn’t the only option. Consider that the query schema is in essence identical to your view model. 
You don’t have any relationships between your various view model classes, so you shouldn’t need any relationships between the tables in the query data store.So do you actually need a relational database?The answer is no, but for all practical purposes and due to organizational inertia, it is probably your best choice (for now).Scaling QueriesSince your queries are now being performed off of a separate data store than your master database, and there is no assumption that the data that’s being served is 100% up to date, you can easily add more instances of these stores without worrying that they don’t contain the exact same data. The same mechanism that updates one instance can be used for many instances, as we’ll see later.This gives you cheap horizontal scaling for your queries. Also, since your not doing nearly as much transformation, the latency per query goes down as well. Simple code is fast code.Data modificationsSince our users are making decisions based on stale data, we need to be more discerning about which things we let through. Here’s a scenario explaining why:Let’s say we have a customer service representative who is one the phone with a customer. This user is looking at the customer’s details on the screen and wants to make them a ‘preferred’ customer, as well as modifying their address, changing their title from Ms to Mrs, changing their last name, and indicating that they’re now married. What the user doesn’t know is that after opening the screen, an event arrived from the billing department indicating that this same customer doesn’t pay their bills – they’re delinquent. At this point, our user submits their changes.Should we accept their changes?Well, we should accept some of them, but not the change to ‘preferred’, since the customer is delinquent. But writing those kinds of checks is a pain – we need to do a diff on the data, infer what the changes mean, which ones are related to each other (name change, title change) and which are separate, identify which data to check against – not just compared to the data the user retrieved, but compared to the current state in the database, and then reject or accept.Unfortunately for our users, we tend to reject the whole thing if any part of it is off. At that point, our users have to refresh their screen to get the up-to-date data, and retype in all the previous changes, hoping that this time we won’t yell at them because of an optimistic concurrency conflict.As we get larger entities with more fields on them, we also get more actors working with those same entities, and the higher the likelihood that something will touch some attribute of them at any given time, increasing the number of concurrency conflicts.If only there was some way for our users to provide us with the right level of granularity and intent when modifying data. That’s what commands are all about.CommandsA core element of CQRS is rethinking the design of the user interface to enable us to capture our users’ intent such that making a customer preferred is a different unit of work for the user than indicating that the customer has moved or that they’ve gotten married. Using an Excel-like UI for data changes doesn’t capture intent, as we saw above.We could even consider allowing our users to submit a new command even before they’ve received confirmation on the previous one. 
We could have a little widget on the side showing the user their pending commands, checking them off asynchronously as we receive confirmation from the server, or marking them with an X if they fail. The user could then double-click that failed task to find information about what happened.Note that the client sends commands to the server – it doesn’t publish them. Publishing is reserved for events which state a fact – that something has happened, and that the publisher has no concern about what receivers of that event do with it.Commands and ValidationIn thinking through what could make a command fail, one topic that comes up is validation. Validation is different from business rules in that it states a context-independent fact about a command. Either a command is valid, or it isn’t. Business rules on the other hand are context dependent.In the example we saw before, the data our customer service rep submitted was valid, it was only due to the billing event arriving earlier which required the command to be rejected. Had that billing event not arrived, the data would have been accepted.Even though a command may be valid, there still may be reasons to reject it.As such, validation can be performed on the client, checking that all fields required for that command are there, number and date ranges are OK, that kind of thing. The server would still validate all commands that arrive, not trusting clients to do the validation.Rethinking UIs and commands in light of validationThe client can make of the query data store when validating commands. For example, before submitting a command that the customer has moved, we can check that the street name exists in the query data store.At that point, we may rethink the UI and have an auto-completing text box for the street name, thus ensuring that the street name we’ll pass in the command will be valid. But why not take things a step further? Why not pass in the street ID instead of its name? Have the command represent the street not as a string, but as an ID (int, guid, whatever).On the server side, the only reason that such a command would fail would be due to concurrency – that someone had deleted that street and that that hadn’t been reflected in the query store yet; a fairly exceptional set of circumstances.Reasons valid commands fail and what to do about itSo we’ve got a well-behaved client that is sending valid commands, yet the server still decides to reject them. Often the circumstances for the rejection are related to other actors changing state relevant to the processing of that command.In the CRM example above, it is only because the billing event arrived first. But “first” could be a millisecond before our command. What if our user pressed the button a millisecond earlier? Should that actually change the business outcome? Shouldn’t we expect our system to behave the same when observed from the outside?So, if the billing event arrived second, shouldn’t that revert preferred customers to regular ones? Not only that, but shouldn’t the customer be notified of this, like by sending them an email? In which case, why not have this be the behavior for the case where the billing event arrives first? And if we’ve already got a notification model set up, do we really need to return an error to the customer service rep? 
Reasons valid commands fail and what to do about it

So we’ve got a well-behaved client that is sending valid commands, yet the server still decides to reject them. Often the circumstances for the rejection are related to other actors changing state relevant to the processing of that command.

In the CRM example above, it is only because the billing event arrived first. But “first” could be a millisecond before our command. What if our user pressed the button a millisecond earlier? Should that actually change the business outcome? Shouldn’t we expect our system to behave the same when observed from the outside?

So, if the billing event arrived second, shouldn’t that revert preferred customers to regular ones? Not only that, but shouldn’t the customer be notified of this, say by sending them an email? In which case, why not have this be the behavior for the case where the billing event arrives first? And if we’ve already got a notification model set up, do we really need to return an error to the customer service rep? I mean, it’s not like they can do anything about it other than notifying the customer.

So, if we’re not returning errors to the client (who is already sending us valid commands), maybe all we need to do on the client when sending a command is to tell the user “thank you, you will receive confirmation via email shortly”. We don’t even need the UI widget showing pending commands.

Commands and Autonomy

What we see is that in this model, commands don’t need to be processed immediately – they can be queued. How fast they get processed is a question of Service-Level Agreement (SLA) and not architecturally significant. This is one of the things that makes the node that processes commands autonomous from a runtime perspective – we don’t require an always-on connection to the client.

Also, we shouldn’t need to access the query store to process commands – any state that is needed should be managed by the autonomous component; that’s part of the meaning of autonomy.

Another part is the issue of failed message processing due to the database being down or hitting a deadlock. There is no reason that such errors should be returned to the client – we can just roll back and try again. When an administrator brings the database back up, all the messages waiting in the queue will be processed successfully and our users will receive confirmation.

The system as a whole is quite a bit more robust to any error conditions.

Also, since we don’t have queries going through this database any more, the database itself is able to keep more of the rows/pages that serve commands in memory, improving performance. When both commands and queries were being served off of the same tables, the database server was always juggling rows between the two.

Autonomous Components

While in the picture above we see all commands going to the same AC, we could logically have each command processed by a different AC, each with its own queue. That would give us visibility into which queue was the longest, letting us see very easily which part of the system was the bottleneck. While this is interesting for developers, it is critical for system administrators.

Since commands wait in queues, we can now add more processing nodes behind those queues (using the distributor with NServiceBus) so that we’re only scaling the part of the system that’s slow. No need to waste servers on any other requests.
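
To make the queued, per-command processing model concrete, here is a deliberately framework-free Java sketch using an in-memory queue. A real autonomous component would sit behind a durable, transactional queue (for example MSMQ via a service bus) rather than a BlockingQueue, and all names here are illustrative.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Sketch of an autonomous component pulling one type of command off its own queue.
final class CommandProcessor<C> implements Runnable {
    private final BlockingQueue<C> queue = new LinkedBlockingQueue<>();
    private final Consumer<C> handler;

    CommandProcessor(Consumer<C> handler) {
        this.handler = handler;
    }

    void enqueue(C command) {
        queue.add(command);   // the client gets an immediate "thank you", not a result
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            final C command;
            try {
                command = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            try {
                handler.accept(command);   // load state, apply business rules, persist, publish
            } catch (RuntimeException transientFailure) {
                // Database down or a deadlock: nothing is returned to the client.
                // Put the command back and try again; a transactional queue gives you this for free.
                queue.add(command);
            }
        }
    }
}

Because each command type gets its own processor and queue, queue depth per command becomes the natural metric for spotting the bottleneck.
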
Service Layers

Our command processing objects in the various autonomous components actually make up our service layer. The reason you don’t see this layer explicitly represented in CQRS is that it isn’t really there – at least not as an identifiable logical collection of related objects. Here’s why:

In the layered architecture (AKA 3-Tier) approach, there is no statement about dependencies between objects within a layer – or rather, it is implied that they’re allowed. However, when taking a command-oriented view of the service layer, what we see are objects handling different types of commands. Each command is independent of the others, so why should we allow the objects which handle them to depend on each other?

Dependencies are things which should be avoided, unless there is good reason for them.

Keeping the command-handling objects independent of each other will allow us to more easily version our system, one command at a time, without even needing to bring down the entire system, given that the new version is backwards compatible with the previous one.

Therefore, keep each command handler in its own VS project, or possibly even in its own solution, thus guiding developers away from introducing dependencies in the name of reuse (it’s a fallacy). If you do decide, as a deployment concern, that you want to put them all in the same process feeding off of the same queue, you can ILMerge those assemblies and host them together, but understand that you will be undoing much of the benefit of your autonomous components.

Whither the domain model?

Although in the diagram above you can see the domain model beside the command-processing autonomous components, it’s actually an implementation detail. There is nothing that states that all commands must be processed by the same domain model. Arguably, you could have some commands be processed by transaction script, others using table module (AKA active record), as well as those using the domain model. Event sourcing is another possible implementation.

Another thing to understand about the domain model is that it is no longer used to serve queries. So the question is: why do you need to have so many relationships between entities in your domain model?

(You may want to take a second to let that sink in.)

Do we really need a collection of orders on the customer entity? In what command would we need to navigate that collection? In fact, what kind of command would need any one-to-many relationship? And if that’s the case for one-to-many, many-to-many would definitely be out as well. I mean, most commands only contain one or two IDs in them anyway.

Any aggregate operations that may have been calculated by looping over child entities could be pre-calculated and stored as properties on the parent entity. Following this process across all the entities in our domain would result in isolated entities needing nothing more than a couple of properties for the IDs of their related entities – “children” holding the parent ID, like in databases.

In this form, commands could be entirely processed by a single entity – voilà, an aggregate root that is a consistency boundary.

Persistence for command processing

Given that the database used for command processing is not used for querying, and that most (if not all) commands contain the IDs of the rows they’re going to affect, do we really need to have a column for every single domain object property? What if we just serialized the domain entity and put it into a single column, and had another column containing the ID? This sounds quite similar to the key-value storage that is available from the various cloud providers. In which case, would you really need an object-relational mapper to persist to this kind of storage?
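
As a rough sketch of that idea in Java: one ID column, one column holding the serialized entity. The customer_entities table, the PostgreSQL-style upsert, and the use of Jackson for JSON serialization are all assumptions made for this example, not prescriptions.

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.UUID;

// Key-value style persistence for the command side:
// assumes a table customer_entities(id uuid primary key, body text).
final class CustomerEntityStore {
    private final Connection connection;
    private final ObjectMapper mapper = new ObjectMapper();

    CustomerEntityStore(Connection connection) {
        this.connection = connection;
    }

    void save(UUID id, Object customer) throws SQLException, IOException {
        String body = mapper.writeValueAsString(customer);   // the whole entity in one column
        String sql = "INSERT INTO customer_entities (id, body) VALUES (?, ?) "
                   + "ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setObject(1, id);
            stmt.setString(2, body);
            stmt.executeUpdate();
        }
    }

    <T> T load(UUID id, Class<T> type) throws SQLException, IOException {
        String sql = "SELECT body FROM customer_entities WHERE id = ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setObject(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? mapper.readValue(rs.getString(1), type) : null;
            }
        }
    }
}
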
You could also pull out an additional column per piece of data where you’d want the “database” to enforce uniqueness.

I’m not suggesting that you do this in all cases – rather, I’m just trying to get you to rethink some basic assumptions.

Let me reiterate

How you process the commands is an implementation detail of CQRS.

Keeping the query store in sync

After the command-processing autonomous component has decided to accept a command, modifying its persistent store as needed, it publishes an event notifying the world about it. This event is often the “past tense” of the command submitted:

MakeCustomerPreferredCommand -> CustomerHasBeenMadePreferredEvent

The publishing of the event is done transactionally together with the processing of the command and the changes to its database. That way, any kind of failure on commit will result in the event not being sent. This is something that should be handled by default by your message bus, and if you’re using MSMQ as your underlying transport, it requires the use of transactional queues.

The autonomous component which processes those events and updates the query data store is fairly simple, translating from the event structure to the persistent view model structure. I suggest having an event handler per view model class (AKA per table) – a minimal sketch of such a handler follows.
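
In this sketch, the CustomerHasBeenMadePreferred event class and the customer_view table are illustrative assumptions; the real structure would mirror your own events and view model classes.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

// Event published by the command side once the command has been accepted.
final class CustomerHasBeenMadePreferred {
    final UUID customerId;

    CustomerHasBeenMadePreferred(UUID customerId) {
        this.customerId = customerId;
    }
}

// One event handler per view model class (per table): it only translates
// the event into an update of the persistent view model, nothing more.
final class CustomerViewUpdater {
    private final Connection queryStore;

    CustomerViewUpdater(Connection queryStore) {
        this.queryStore = queryStore;
    }

    void handle(CustomerHasBeenMadePreferred event) throws SQLException {
        String sql = "UPDATE customer_view SET preferred = TRUE WHERE customer_id = ?";
        try (PreparedStatement stmt = queryStore.prepareStatement(sql)) {
            stmt.setObject(1, event.customerId);
            stmt.executeUpdate();
        }
    }
}
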
Here’s the picture of all the pieces again:

Bounded Contexts

While CQRS touches on many pieces of software architecture, it is still not at the top of the food chain. CQRS, if used, is employed within a bounded context (DDD) or a business component (SOA) – a cohesive piece of the problem domain. The events published by one BC are subscribed to by other BCs, each updating their query and command data stores as needed.

The UIs from the CQRS-based BCs can be “mashed up” in a single application, providing users with a single composite view of all parts of the problem domain. Composite UI frameworks are very useful for these cases.

Summary

CQRS is about coming up with an appropriate architecture for multi-user collaborative applications. It explicitly takes into account factors like data staleness and volatility and exploits those characteristics to create simpler and more scalable constructs.

One cannot truly enjoy the benefits of CQRS without considering the user interface and making it capture user intent explicitly. When client-side validation is taken into account, command structures may be somewhat adjusted. Thinking through the order in which commands and events are processed can lead to notification patterns which make returning errors unnecessary.

While the result of applying CQRS to a given project is a more maintainable and performant code base, this simplicity and scalability require understanding the detailed business requirements and are not the result of any technical “best practice”. If anything, we can see a plethora of approaches to apparently similar problems being used together – data readers and domain models, one-way messaging and synchronous calls.

An Introduction to CQRS
28
Mar
2021

An Introduction to CQRS

CQRS has been around for a long time, but if you’re not familiar with it, it’s new to you. Take a look at a quick introduction to what it is and how it works.

CQRS, or Command Query Responsibility Segregation, is a method for optimizing writes to databases (commands) separately from reads from them (queries). Nowadays, many companies work with one large database. But these databases weren’t originally built to scale. When they were planned in the 1990s, there wasn’t so much data that needed to be consumed quickly.

In the age of Big Data, many databases can’t handle the growing number of complex reads and writes, resulting in errors, bottlenecks, and slow customer service. This is something DevOps engineers need to find a solution for.

Take a ride-sharing service like Uber or Lyft (just as an example; this is not an explanation of how these companies work). Traditionally (before CQRS), the client app queries the service database for drivers, whose profiles users see on their screens. At the same time, drivers can send commands to the service database to update their profiles. The service database needs to be able to cross-check queries for drivers, user locations, and driver commands about their profiles, and send the resulting data back to client apps and drivers. These kinds of queries can put a strain on the database.


CQRS to the rescue. CQRS dictates the segregation of complex queries and commands. Instead of all queries and commands going to the database through the same code path, the query and data-manipulation routines are simplified by adding a processing step in between. This reduces stress on the database by making it easier to read and write cached data.

A manifestation of such a segregation would be:

  • Pushing the commands into a robust and scalable queue platform like Apache Kafka and storing the raw commands in data warehouse software like AWS Redshift or HP Vertica (see the sketch after this list).
  • Stream processing with tools like Apache Flink or Apache Spark.
  • Creating an aggregated cache table where data is queried from and displayed to the users.
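
As a rough illustration of the first step, here is a minimal Java sketch that pushes a command onto a Kafka topic. The broker address, topic name, and JSON payload are assumptions made for the example.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class CommandPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The command is serialized as JSON; the driver ID is used as the key so
            // all commands for one driver land in the same partition, in order.
            String commandJson = "{\"type\":\"UpdateDriverProfile\",\"driverId\":\"42\",\"city\":\"Berlin\"}";
            producer.send(new ProducerRecord<>("driver-commands", "42", commandJson));
        }
    }
}

The stream-processing layer (Flink, Spark, or similar) would then consume this topic and maintain the aggregated cache table that the queries read from.
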

The magic happens in the streaming part. Calculations based on Google’s MapReduce model are distributed across many machines, so large amounts of data are processed quickly and the right data gets to the right users on time.

CQRS can be used by any service that is based on fast-growing data, whether user data (e.g. user profiles) or machine data (e.g. monitoring metrics). If you want to learn more about CQRS, check out this article.

Persistence in Event Driven Architectures
28
Mar
2021

Persistence in Event Driven Architectures

The importance of being persistent in event driven architectures.

Enterprises have to constantly adapt and evolve their enterprise architecture strategies in order to deliver the desired business outcomes. The evolving architecture patterns may involve business processing of sales transactions with a human in the loop, or they may involve machine-to-machine data processing using automation. Earlier, enterprises adopted a request-driven model in which a microservice made a call to a service and that service responded to the request. In this request-driven model, you run into challenges around flexibility as you try to scale your global deployment footprint.

A new approach that is quickly gaining adoption in enterprises is event-driven architecture. In this approach, you increase application agility and flexibility by allowing multiple data producers to coexist with multiple data consumers, and data is processed only after an event or state change. In this enterprise architecture, the producers and consumers of data can be quickly extended to deliver better flexibility and agility as you scale your operation globally. Examples of event-driven solutions are available in a hosted, managed format from most cloud providers today. In this blog post, we will look at a Kafka solution running in a Kubernetes cluster and how you can make sure persistence is achieved for the solution. This approach is running in production at customers today, supported by Confluent and Portworx.

Why Use Event-Driven Architecture?

With the advent of 5G technology, a vast amount of data will be generated by sensors, devices, systems, and humans in the loop to track, manage, and achieve business outcomes. Use cases for event-driven architectures include the following:

  • Business process state changes – You can use an event to signal a change of state, such as a purchase order moving on to accounts receivable. The event-based approach allows a human or a decision engine to take the next appropriate step in the process.
  • Log and Metrics processing – The event-driven model allows for multiple actions to be triggered based on a single metric. The ability to send messages to multiple event handlers in different subsystems offers the scalability and resiliency required by certain business applications.

Why Use Kafka?

Apache Kafka is a scalable, fault-tolerant messaging system that enables you to build distributed real-time applications with an event-driven architecture. Kafka delivers events with fast ingestion rates and provides persistence and in-order guarantees. Kafka adoption in your solution will depend on your specific use case. Below are some important concepts about Apache Kafka (a small consumer sketch follows the list):

  • Kafka organizes messages into “topics”.
  • The process that does the work in Kafka is called the “broker”. A producer pushes data into a topic hosted on a broker and a consumer pulls messages from a topic via a broker.
  • Kafka topics can be divided into “partitions”. This allows for parallelizing a topic across multiple brokers and increasing message ingestion and throughput.
  • Brokers can hold multiple partitions, but at any given time only one broker acts as the leader for a given partition. The leader is responsible for updating any replicas with new data.
  • Brokers are responsible for storing messages to disk. The messages are stored with unique offsets. Messages in Kafka do not have a unique ID.
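
To tie these concepts together, here is a minimal Java consumer sketch that subscribes to a topic and prints each record’s partition and offset. The broker address, consumer group ID, and topic name are assumptions made for the example.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "profile-readers");            // consumers in one group share partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("driver-profile-updates")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is identified by its partition and offset, not by a unique message ID.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
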

Persistence With Portworx and Kafka on Kubernetes

Kafka needs ZooKeeper to be deployed as a StatefulSet in Kubernetes. Kafka brokers, which maintain the state of the topics and partitions, also need to be deployed as StatefulSets, backed by persistent volumes.

  • Kafka offers replication of topics between different brokers. In the case of a node failure, Kafka can recover using the replicated topics. This recovery mechanism does create additional network calls in order to synchronize with the replica on a different broker. The recovery time of the failed node and its broker depends on the amount of data that needs to be rehydrated and on network latencies in the cluster.
  • Portworx offers data replication using the replication parameter in the Kubernetes storage class. In this scenario, the storage system is responsible for maintaining copies of the topic on different nodes. In the case of a node failure, the Kafka broker is rescheduled on a node that already contains the replicated topic data. The rebuild time of the broker is reduced because it uses the storage system to rehydrate the topic data without any network latencies. Once the data is rehydrated using the storage system, the broker can quickly catch up on the topic offset from an existing broker, thus reducing the overall recovery time.

Let’s walk through a Kubernetes node failure scenario on a Kubernetes cluster where a Kafka application is running and is backed by Portworx volumes.

Figure 1: We will describe the deployment of Kafka and Portworx on a 5-node Kubernetes cluster. In the image below, you can see that the Kafka deployment has 3 brokers, each with 2 partitions and a replication factor of 2. For the Portworx data platform, we have a volume replication factor set to 3 for each volume.

Figure 2: We will describe a node failure event. In the diagram below, we have simulated a node failure and identified that Kubernetes worker node 2 has been taken out of production. Kubernetes worker node 2 contains a Kafka broker with 2 partitions and 2 Portworx persistent volumes.
Figure 3: Once the Kubernetes node failure is detected by the Kubernetes API server, the Kafka broker is deployed to a node with available resources. The Portworx platform has the ability to influence the placement of the broker on a Kubernetes node. Since Kubernetes node 4 already has a copy of the Kafka broker’s persistent volumes, Portworx makes sure that the Kafka broker is deployed on Kubernetes node 4. On the Portworx platform’s recovery side, the volumes are also created on other nodes in order to maintain the defined replication factor of 3.

Figure 4: Once the failed node is back in production, the Kafka application does not get scheduled on the recovered node, since the application is already at the desired number of Kafka brokers. On the Portworx platform side, replicated volumes are placed on the recovered Kubernetes node to maintain the desired replication factor for the volumes.
