Data dumping through REST API using Spring Batch




Most cloud services provide an API to fetch their data. The data is usually returned as paginated results, because returning the complete data set in one response would make the payload too large. To retrieve the complete list of books, e-courses or cloud machine details, we need to call the API page by page until the end. In this scenario, we can use Spring Batch to fetch the data page by page and dump it into a file.
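As a minimal sketch of the page-by-page idea, the plain Java example below walks an in-memory list in fixed-size pages until no next offset is reported. The Page class, the fetchPage and fetchAll methods and the sample data are hypothetical stand-ins for a real paginated REST API; they are not part of any cloud service's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a paginated API: serves an in-memory list page by page.
final class PagedFetchSketch {

    // One page of results plus the offset of the next page (null when there is none).
    static final class Page {
        final List<String> items;
        final Integer nextOffset;

        Page(List<String> items, Integer nextOffset) {
            this.items = items;
            this.nextOffset = nextOffset;
        }
    }

    // Simulates "GET /items?start=<offset>&limit=<limit>" against the given data.
    static Page fetchPage(List<String> data, int offset, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        int end = Math.min(offset + limit, data.size());
        List<String> items = new ArrayList<>(data.subList(offset, end));
        Integer next = end < data.size() ? end : null;
        return new Page(items, next);
    }

    // Calls fetchPage repeatedly until the source reports no further page.
    static List<String> fetchAll(List<String> data, int limit) {
        List<String> all = new ArrayList<>();
        Integer offset = 0;
        while (offset != null) {
            Page page = fetchPage(data, offset, limit);
            all.addAll(page.items);
            offset = page.nextOffset;
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e");
        // With a page size of 2, three calls are needed: 2 + 2 + 1 items.
        System.out.println(fetchAll(data, 2));
    }
}
```

The loop ends on the "no next page" signal from the source rather than on a precomputed page count, which is the same shape the tasklet in this blog follows.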

In this blog, we will use one of the free-to-use APIs from Coursera to take a dump of its e-courses. Coursera is one of the popular MOOC sites, and it exposes its e-courses through a REST API. For a basic introduction to Spring Batch and getting-started material, please refer to the previous blog.

In Spring Batch, we can use a tasklet, which gives us a free hand to start a task and repeat it according to our own logic. A tasklet is a single task executed inside a step. A traditional step has a reader, a processor and a writer, which works well for file transformation or loading, but fitting our paginated get-and-dump scenario into that model is a bit cumbersome. A tasklet lets us place the GET API request inside the execute method and repeat it until we reach the end of the data.

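import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class CourseGetTasklet implements Tasklet, StepExecutionListener {
    @Override
    public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext)
            throws Exception {
        // The task logic goes here.
        // Returning RepeatStatus.CONTINUABLE makes Spring Batch call execute() again.
        return RepeatStatus.FINISHED;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // Runs before the tasklet starts.
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Runs after the tasklet has completed; null leaves the exit status unchanged.
        return null;
    }
}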
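To illustrate the repeat contract, here is a simplified plain Java model of a step that keeps calling a task until it reports that it is finished. Status, SimpleTasklet and runStep are hypothetical names chosen for this sketch; they only mimic the idea behind RepeatStatus and Tasklet and do not reproduce Spring Batch's classes or its transaction handling.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RepeatLoopSketch {

    // Simplified stand-in for the two outcomes a tasklet can report.
    enum Status { CONTINUABLE, FINISHED }

    // Simplified stand-in for the tasklet contract: one unit of work per call.
    interface SimpleTasklet {
        Status execute();
    }

    // Keeps calling the tasklet until it reports FINISHED; returns the call count.
    static int runStep(SimpleTasklet tasklet) {
        int calls = 0;
        Status status;
        do {
            status = tasklet.execute();
            calls++;
        } while (status == Status.CONTINUABLE);
        return calls;
    }

    public static void main(String[] args) {
        // Pretend there are three pages to fetch, one per execute() call.
        AtomicInteger pagesLeft = new AtomicInteger(3);
        int calls = runStep(() ->
                pagesLeft.decrementAndGet() > 0 ? Status.CONTINUABLE : Status.FINISHED);
        System.out.println("execute() was called " + calls + " times");
    }
}
```

The task itself decides when to stop, so the paging logic can live entirely inside execute.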

Let's set up the Spring Batch application through annotations. Create a class whose main method calls SpringApplication.run. The class carries the @EnableBatchProcessing annotation, which enables Spring Batch features and provides a base configuration for setting up batch jobs in an @Configuration class, roughly equivalent to using the <batch:*> XML namespace. @Configuration marks the class as a Spring configuration class. @EnableAutoConfiguration lets Spring Boot configure beans automatically based on the dependencies available on the classpath.

JobBuilderFactory is used to create the job; the RunIdIncrementer gives every launch a new run.id job parameter so that the job can be run repeatedly. StepBuilderFactory is used to create the step, which is built around the tasklet (CourseGetTasklet).

SampleBatchApplication.java

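import java.util.Date;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.core.listener.JobExecutionListenerSupport;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableAutoConfiguration
@EnableBatchProcessing
public class SampleBatchApplication {

    @Autowired
    private JobBuilderFactory jobs;

    @Autowired
    private StepBuilderFactory steps;

    @Bean
    public Job job() throws Exception {
        // JobExecutionListener is an interface and cannot be instantiated directly, so a
        // no-op JobExecutionListenerSupport is used here; swap in your own listener if needed.
        return this.jobs.get("job").incrementer(new RunIdIncrementer())
                .listener(new JobExecutionListenerSupport()).start(step1()).build();
    }

    @Bean
    protected Step step1() throws Exception {
        // The epoch suffix gives the step a unique name on every start.
        String epochStr = String.valueOf(new Date().getTime());
        return this.steps.get("step1v" + epochStr)
                .tasklet(new CourseGetTasklet()).throttleLimit(1).build();
    }

    public static void main(String[] args) throws Exception {
        // System.exit is common for Batch applications since the exit code can be used to
        // drive a workflow
        System.exit(SpringApplication.exit(SpringApplication.run(
                SampleBatchApplication.class, args)));
    }
}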

Provide the data source properties in application.properties; Spring Boot uses them to auto-configure the data source for the job repository. The spring.batch.jdbc.initialize-schema property initializes the database with the tables required for jobs and steps.

application.properties:

## Spring DATASOURCE (DataSourceAutoConfiguration & DataSourceProperties)
spring.datasource.url = jdbc:mysql://localhost:3306/courseradump?createDatabaseIfNotExist=true&allowPublicKeyRetrieval=true&useSSL=false
spring.datasource.username = root
spring.datasource.password = ****
spring.datasource.platform=mysql
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.initialization-mode=always
spring.batch.jdbc.initialize-schema=always
spring.jpa.hibernate.ddl-auto=create


The Coursera API documentation can be found at this location. The list-of-courses GET API returns all the courses. We fetch the courses with a page size of 100 using CloseableHttpClient. Before each request, we read the current offset from the step execution context; after receiving the response, we put the next offset back into it. The number of iterations is also kept in the step context and is incremented and updated on every call.

Read the response entity as a string and use Jackson to deserialize it into the CourseResponse model, which holds the course elements and the paging parameters. The FileWriter is opened in a try-with-resources block; once the response has been deserialized, the course elements are serialized again with Jackson and written with the fileWriter.write method.
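Because the tasklet runs once per page, the file is opened once per page as well. If every page should end up in the same file, the writer has to be opened in append mode. The sketch below shows that write step in isolation, using only the JDK; appendPage and roundTrip are hypothetical helper names, and the one-page-per-line layout is an assumption made for this example.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class AppendPageSketch {

    // Appends one page (already serialized as JSON text) as a single line.
    // The second FileWriter argument enables append mode, so earlier pages are kept.
    static void appendPage(Path file, String pageJson) {
        try (FileWriter writer = new FileWriter(file.toFile(), true)) {
            writer.write(pageJson);
            writer.write(System.lineSeparator());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } // the writer is closed automatically, even if write() throws
    }

    // Writes the given pages to a temporary file and reads the lines back.
    static List<String> roundTrip(List<String> pages) {
        try {
            Path file = Files.createTempFile("courses", ".json");
            try {
                for (String page : pages) {
                    appendPage(file, page);
                }
                return Files.readAllLines(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = roundTrip(Arrays.asList("[{\"id\":\"1\"}]", "[{\"id\":\"2\"}]"));
        System.out.println(lines.size() + " pages written");
    }
}
```

Without the append flag, each call would truncate the file and only the last page would survive.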

Now check the paging values: if the next value is null or greater than the total value, we have reached the last page, so we return RepeatStatus.FINISHED. In all other cases, we update the page offset and the iteration count in the step context and return RepeatStatus.CONTINUABLE.

 

@Override
public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext)
        throws Exception {

    // The step execution context carries state from one execute() call to the next.
    ExecutionContext context = stepContribution.getStepExecution().getExecutionContext();
    int noOfTimes = context.getInt("noOfTimes", 0);
    int offset = context.getInt("offset", 0);

    String courseraUrl = "https://api.coursera.org/api/courses.v1?start=" + offset
            + "&limit=" + PAGE_LIMIT;
    logger.info("Get the courseurl {}", courseraUrl);

    CourseResponse courseResponse = null;

    // The file is opened in append mode so that earlier pages are not overwritten;
    // the HTTP client, the response and the writer are closed automatically.
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         FileWriter fileWriter = new FileWriter("output.json", true)) {

        HttpGet request = new HttpGet(courseraUrl);

        // add request headers
        request.addHeader("Accept", "application/json");

        try (CloseableHttpResponse response = httpClient.execute(request)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // read the body as a String
                String result = EntityUtils.toString(entity);
                courseResponse = objectMapper.readValue(result, CourseResponse.class);
                // one JSON array per page, one page per line
                fileWriter.write(objectMapper
                        .writeValueAsString(courseResponse.getElements()));
                fileWriter.write(System.lineSeparator());
                logger.info("elements {} paging {}",
                        courseResponse.getElements(),
                        courseResponse.getPaging());
            }
        }

    } catch (IOException exception) {
        // pass the exception as the last argument so that the stack trace is logged
        logger.error("io exception happened", exception);
        throw exception;
    } catch (Exception exception) {
        logger.error("general exception happened", exception);
        throw exception;
    }

    noOfTimes++;
    context.putInt("noOfTimes", noOfTimes);

    // Nothing more to fetch when the response carried no body or no paging block.
    if (courseResponse == null || courseResponse.getPaging() == null) {
        return RepeatStatus.FINISHED;
    }

    CourseResponse.PageModel paging = courseResponse.getPaging();
    if (paging.isNextNull() || paging.getNextValue() > paging.getTotalValue()) {
        return RepeatStatus.FINISHED;
    }

    context.putInt("offset", paging.getNextValue());
    return RepeatStatus.CONTINUABLE;
}
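The end-of-data decision made at the bottom of execute can be isolated into a small helper that is easy to check on its own. This is only a sketch: isLastPage is a hypothetical helper, and it assumes that the next and total paging values arrive as strings and that next is absent on the last page.

```java
final class PagingCheckSketch {

    // Returns true when there is no further page to request.
    // next:  offset of the next page as text, or null when the API sent none
    // total: total number of records as text (must be a valid integer)
    static boolean isLastPage(String next, String total) {
        if (next == null) {
            return true;
        }
        return Integer.parseInt(next) > Integer.parseInt(total);
    }

    public static void main(String[] args) {
        System.out.println(isLastPage("100", "250")); // false: more pages follow
        System.out.println(isLastPage(null, "250"));  // true: no next offset given
        System.out.println(isLastPage("300", "250")); // true: next is past the total
    }
}
```

Checking for null first matters, because parsing a missing next value as an integer would throw a NumberFormatException.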

The CourseResponse model mirrors the response of the course GET API, which contains the elements and the paging information. Each element is defined by the CourseMetaData model.

CourseResponse.java

package com.springbatch.tutorials.batch.model;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonInclude;

import java.util.List;


@JsonInclude(JsonInclude.Include.NON_NULL)
@JsonIgnoreProperties(ignoreUnknown = true)
public class CourseResponse {

    private List<CourseMetaData> elements;

    private PageModel paging;


    public List<CourseMetaData> getElements() {
        return elements;
    }

    public void setElements(List<CourseMetaData> elements) {
        this.elements = elements;
    }

    public PageModel getPaging() {
        return paging;
    }

    public void setPaging(PageModel paging) {
        this.paging = paging;
    }

    public static class PageModel {
        private String next;
        private String total;

        public String getNext() {
            return next;
        }

        public Integer getNextValue() {
            return Integer.parseInt(next);
        }

        public void setNext(String next) {
            this.next = next;
        }

        public String getTotal() {
            return total;
        }

        public Integer getTotalValue() {
            return Integer.parseInt(total);
        }

        public boolean isNextNull() {
            return this.next == null;
        }

        public void setTotal(String total) {
            this.total = total;
        }

        @Override
        public String toString() {
            return "PageModel{" +
                    "next='" + next + '\'' +
                    ", total='" + total + '\'' +
                    '}';
        }
    }

}

 

CourseMetaData.java 

package com.springbatch.tutorials.batch.model;

public class CourseMetaData {
    private String courseType;
    private String id;
    private String slug;
    private String name;

    public String getCourseType() {
        return courseType;
    }

    public void setCourseType(String courseType) {
        this.courseType = courseType;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getSlug() {
        return slug;
    }

    public void setSlug(String slug) {
        this.slug = slug;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "CourseMetaData{" +
                "courseType='" + courseType + '\'' +
                ", id='" + id + '\'' +
                ", slug='" + slug + '\'' +
                ", name='" + name + '\'' +
                '}';
    }
}

In afterStep, we can access the step context. It provides typed getters, through which we read the iteration count and the offset and write them to the log.

@Override
public ExitStatus afterStep(StepExecution stepExecution) {
    try {
        int noOfTimes = stepExecution.getExecutionContext().getInt("noOfTimes", 0);
        int offset = stepExecution.getExecutionContext().getInt("offset", 0);

        // The message must be a single string literal to compile.
        logger.info("After step execution completed {} after running {} times and last offset {}",
                stepExecution.getStartTime(), noOfTimes, offset);

    } catch (Exception exception) {
        // pass the exception so that the stack trace is logged
        logger.error("exception in afterStep", exception);
    }
    // Returning null leaves the exit status of the step unchanged.
    return null;
}

  

Screenshots of the batch job run and of the final output file are given below.


The complete source code is available in the GitHub repo, which also includes the output JSON file.


   

DevGroves Technologies is an IT consulting and services start-up focused predominantly on web technologies, catering to static websites, workflow-based CRM websites, e-commerce websites and reporting websites tailored to customer needs. We also support the open source community by writing blogs about how, why and where open source tools can be used.
