In this tutorial you will learn:
- How to make a connection between external data sources & IBM CP4D connectors.
- How to query multiple data sources wihout creating data replicas using IBM Data Virtualization service.
At the end of the workshop you will learn how to connects multiple data sources across locations by creating one virtual data view.
For this workshop, you have Covid-19 regional data stored in Amazon S3 and Amazon Aurora PostgreSQL data sources. In this workshop you will use IBM Data Virtualization Service to create virtual data view.
The proposed solution will solves data silos challenges faced by enterprises which force them to copy the data into centralized repository for analytics.
- IBM Cloud Pak for Data
- Data Virtualization on IBM Cloud Pak for Data
- External Data Sources (Amazon S3, Amazon Aurora PostgreSQL)
- IBM Watson Knowledge Catalog
It should take you approximately 10-15 minutes to complete this lab.
Step 1: Login to IBM Cloud Pak for Data with valid credentials
To perform this lab you need IBM Cloud Pak for Data’s user credentials (eg. userxxxx) which you have created earlier. Credentials include both username and password. If you do not have the credentials then refer Infra Provisioning Lab exercise to get the credentials.
Step 2: Create connection with AWS data sources using IBM CP4D connectors
In this lab, you will create connection with Amazon S3, and Amazon RDS data sources. Follow the below steps to create connection with those data sources:
You can also specify a connection name and other details as shown in below image then click Create to create Amazon RDS connection.
Similarly add Amazon S3 datasource by selecting connection type Amazon S3 and fill connection details provided using Infra Provisioning Lab. Similar to previous connection, You will have to specify the connection name and other required information such as Bucket, Endpoint URL, Region, Credentials (Shared, Basic).
Once you create both data sources connection successfully, you should see both Amazon S3 and Amazon RDS PostgreSQL connection listed on the Data sources page as shown below.
Step 3: Create virtual tables
Congratulations!
In the previous step you have successfully created connection between external data sources and Data Virtualization service. Now you can select tables and file from the connection and create virtualized tables or objects. Once tables are virtualized you can create a VIEW by joining two virtual tables.
If there are multiple connections listed, choose only one which you have created in previous step.
Select the tables (ts_wallonia_region_table and ts_flanders_region_table) one by one and then click Add to cart. Once after adding both the tables to the cart, click View cart.
You might need tables name in the next step. so, note it down for future reference.
Now click on three dots and then click Edit columns option (eg ts_flanders_region_table) to verify name, type and length of the columns matches with image below and click Apply.
Match the column name, type and length with the above image. Otherwise you may get error while creating View.
Repeat the same step for another table (here ts_wallonia_region_table).
Once after making the changes click Apply and then click Virtualize in Review cart and virtualize table page.
Click Virtualize more data option
Match the column name, type and length with the below image. Otherwise you may get error while creating View.
In the Review cart and virtualize page check Virtualized data option then click Virtualize button.
Click Continue to create Virtual tables.
Step 4: Create VIEW by joining two virtual tables/objects
Till now we have created connections with external data sources and from external data sources we picked tables and files to Virtualize them. Now we will use those Virtualize tables/objects and join them to create a VIEW. This VIEW will give us capability to query multiple data sources without creating data replicas.
You can follow below steps to create VIEW:
You may see multiple tables, choose only those two that you have created.
Click go to virtualized data
By following all the steps you have created a single joined view from different data source. Now let’s go to the Catalog to view the data.
Click on the view for more details.
This lab you have learned how to use Data Virtualization on IBM Cloud Pak for Data to virtualize data and create merged VIEW to query multiple data sources without moving or copying the data. You have also learned how view can be exported to catalog and how IBM Watson Knowledge Catalog can generate statistics of the data inside view.