Data Integration (ETL) Lab

Learning Objectives:

In this tutorial you will learn:

  1. How to make a connection between an external data source and IBM Cloud Pak for Data.
  2. How to create an ETL pipeline using IBM DataStage.
  3. How to create a scheduling rule for a data integration pipeline.

Prerequisites

  1. IBM Cloud Pak for Data
  2. IBM DataStage
  3. External Data Sources (Amazon S3, Amazon Redshift, Amazon Aurora PostgreSQL)
  4. IBM Watson Knowledge Catalog

Estimated time

It should take you approximately 15 minutes to complete this lab.

Lab Steps:

Step 1: Login to IBM Cloud Pak for Data

To perform this lab you need IBM Cloud Pak for Data credentials, which include a username and a password. If you do not have the credentials, refer to the Infra Provisioning Lab to obtain them.

Login

Step 2: Create new project

Click the Navigation Menu, expand Projects, and click All Projects. Then click New Project + and select Analytics project to create a new analytics project.

DV Menu

Click Create an empty project.

DV Menu

Specify a name for the new project and click Create.

DV Menu

Once the project is created, you will see the project homepage.

DV Menu

Step 3: Create new connections to external data sources

  1. Click the Assets tab and then click New asset +

Create connection

Scroll down and click Connection

Create connection

  2. Here you should see the many connectors available in IBM Cloud Pak for Data. Choose the Amazon S3 connector.

Amazon S3

  3. Specify the Amazon S3 connection details such as the name, bucket name, endpoint, region, and credential details (provided as part of the Infra Provisioning Lab) to make the connection between Amazon S3 and IBM Cloud Pak for Data.

  4. Click Test connection to validate the connection. If it is successful, click Create to create the S3 connection.

Amazon S3 Test Connection

Amazon S3 Test Connection

  5. Similarly, perform the same steps to create connections for the Amazon Redshift and Amazon RDS for PostgreSQL asset types.

Create connection

Again, from the project page, click New asset + and then click Connections. Select the Amazon Redshift connector from the available connectors.

Data Ingestion

Specify the connection details provided as part of the Infra Provisioning Lab and then click Test connection. If the test connection is successful, click Create to create the new Redshift connection.

Data Ingestion

In the project home page, click New asset + to create the Amazon RDS connection.

Create connection

Select Amazon RDS for PostgreSQL

Data Ingestion

Specify the connection details provided as part of the Infra Provisioning Lab and then click Test connection. If the test connection is successful, click Create to create the new Aurora PostgreSQL DB connection. (An optional Python sketch for checking these connections appears at the end of this step.)

Data Ingestion
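Optionally, you can sanity-check the same credentials outside of the Cloud Pak for Data UI. The sketch below is a minimal Python example of roughly what Test connection validates; it assumes the boto3 and psycopg2 packages are installed, and every endpoint, bucket, region, and credential value shown is a hypothetical placeholder to be replaced with the output of the Infra Provisioning Lab.

```python
# Minimal connectivity check, roughly what "Test connection" validates in CP4D.
# Every endpoint, bucket, host, and credential value below is a placeholder --
# substitute the values from your Infra Provisioning Lab output.
import boto3
import psycopg2

# Amazon S3: confirm the bucket is reachable with the supplied credentials.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",                      # placeholder region
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
s3.head_bucket(Bucket="your-lab-bucket")          # raises if unreachable or forbidden
print("Amazon S3 connection OK")

# Amazon RDS (Aurora PostgreSQL): confirm the database accepts connections.
conn = psycopg2.connect(
    host="your-aurora-endpoint.rds.amazonaws.com",  # placeholder endpoint
    port=5432,
    dbname="postgres",
    user="YOUR_DB_USER",
    password="YOUR_DB_PASSWORD",
    connect_timeout=10,
)
conn.close()
print("Aurora PostgreSQL connection OK")
```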

Step 4: Create DataStage pipeline

Now that we have connected to the external data sources, let's go back to the newly created project to integrate data from those three connections.

  1. To create the integration pipeline, click Add assets + and then DataStage flow.

DataStage

  2. Download the DataStage pipeline zip file here: Datastage_Integration_Pipeline.zip.

  3. Click the ZIP file import tab and click Drag and drop file here or upload to upload the zip file.

  4. After uploading the file, click Import.

  5. You will see a screen that says Import successful with error. Ignore the errors for the moment and close the dialog.

  6. Click the DataStage asset icon.

  7. You will see a DataStage pipeline where data is ingested from three different sources and passed through stages that perform ETL operations on the incoming data: a Funnel stage to merge/integrate the data, a Remove_Duplicates stage to filter out duplicates from the integrated data, and a Sort stage to sort the data. Finally, the data is stored in an Amazon RDS database. (An equivalent transformation is sketched in Python at the end of this step.)

  8. Now let's link the data assets in this pipeline.

  9. Double-click actavis_pharma_healthcare_personnel_table_1.

  10. Ignore the error and close the error popup. The error appears because we have not yet linked the data assets to the DataStage pipeline.

  11. Expand the Properties option.

  12. Click Connection. Here you will see the connections you created earlier in this lab. Select the connection and click Save.

  13. Repeat the same steps for the mylan_specialty_personnel_data_table_1 connection.

  14. Now click apotheca_healthcare_personnel_data_1 and choose the S3 connection under the Connection option, similar to what we did in the previous steps. Here you also need to specify the S3 bucket name, which you should have received as an output of the Infra Provisioning Lab.

  15. Click Save.

  16. Double-click Amazon_RDS_for_PostgreSQL_1.

    DataStage

  17. Update the Connection as shown below. Don't click Save yet.

    DataStage

  18. Specify a name for the output table, e.g. Datastage_Output_Table_v1, and then click Save.

  19. Click Compile to validate that everything is fine. Within a few seconds you should see a Compile successful message.

  20. Click Run to run the pipeline. After a few seconds you should see a Run successful message.
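For readers who prefer code to screenshots, the DataStage stages above map to ordinary dataframe operations. The sketch below is not part of the lab; it is a minimal pandas/SQLAlchemy illustration of the same Funnel (merge), Remove_Duplicates, Sort, and database-write sequence, with hypothetical CSV file names, an example sort key, and placeholder Aurora PostgreSQL connection details.

```python
# Illustrative only: the same ETL steps the DataStage flow performs, expressed
# with pandas and SQLAlchemy. File names, sort key, and connection details are
# hypothetical placeholders, not values used by the lab.
import pandas as pd
from sqlalchemy import create_engine

# Ingest: three personnel data sets from the three connected sources.
frames = [
    pd.read_csv("actavis_pharma_healthcare_personnel.csv"),
    pd.read_csv("mylan_specialty_personnel_data.csv"),
    pd.read_csv("apotheca_healthcare_personnel_data.csv"),
]

# Funnel stage: merge/integrate the three inputs into one data set.
merged = pd.concat(frames, ignore_index=True)

# Remove_Duplicates stage: filter out duplicate rows from the integrated data.
deduped = merged.drop_duplicates()

# Sort stage: sort the data (here on the first column, as an example key).
result = deduped.sort_values(by=deduped.columns[0])

# Target stage: write the result to Aurora PostgreSQL, mirroring the
# Datastage_Output_Table_v1 table created by the pipeline.
engine = create_engine(
    "postgresql+psycopg2://YOUR_DB_USER:YOUR_DB_PASSWORD"
    "@your-aurora-endpoint.rds.amazonaws.com:5432/postgres"
)
result.to_sql("Datastage_Output_Table_v1", engine, if_exists="replace", index=False)
```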

Step 5: Ingest the integrated data into the project

  1. Now we have the integrated data available in Amazon RDS for PostgreSQL. Let's ingest the data from that data source.

Create connection

Data Ingestion

  2. Click Select source

    Data Ingestion

  3. Search for the integrated table and click Select.

    The table name should be the same as what you specified for the output, e.g. Datastage_Output_Table_v1.

    DataStage Asset Browser

  4. Specify the name of the asset.

    DataStage Asset Browser

  5. Verify that the data asset is present in the project. This asset will be used in the next lab. An optional query to double-check the output table is sketched below.

    DataStage Asset Browser
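As an optional check outside the UI, you can confirm that the pipeline's output table is queryable before moving on. This is a small sketch assuming the table name Datastage_Output_Table_v1 from the previous step and placeholder Aurora PostgreSQL connection details; it is not required to complete the lab.

```python
# Optional sanity check: preview a few rows of the pipeline's output table
# directly from Aurora PostgreSQL. Connection details are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://YOUR_DB_USER:YOUR_DB_PASSWORD"
    "@your-aurora-endpoint.rds.amazonaws.com:5432/postgres"
)

# Table name as specified when saving the Amazon_RDS_for_PostgreSQL_1 stage.
preview = pd.read_sql('SELECT * FROM "Datastage_Output_Table_v1" LIMIT 5', engine)
print(preview)
```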

Summary

In this lab, you learned how to create connections to Amazon S3, Amazon RDS, and Amazon Redshift and how to collect data from these data sources. You also learned how to create an ETL pipeline using IBM DataStage.