Data Cleansing & Reshaping Lab
In the previous two exercises of this lab, you learned how to integrate data. In this exercise, you will learn how to clean and reshape data using an IBM Data Refinery flow.
Learning Objectives:
In this tutorial you will learn:
- How to filter data based on age criteria
- How to visualize data
Prerequisites
- IBM Cloud Pak for Data
- IBM Data Refinery
- External Data Sources (Amazon S3, Amazon Aurora PostgreSQL)
- IBM Watson Knowledge Catalog
Estimated time
It should take you approximately 15 minutes to complete this lab.
Lab Steps:
Step 1: Create Data Refinery flow
- Go to the project home page by clicking Navigation Menu -> Projects -> All projects. From the project home page, click Add asset and select Data Refinery.

- Select the final ingested data from the DataStage lab, e.g. Datastage_Output_Table_v1.

Step 2: Define filter and criteria
- Click New step.

- Click Remove duplicates to remove rows with duplicate email addresses.

- Filter healthcare personnel by age, keeping only those aged 65 and over.
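
For readers who want to see the logic behind the two steps above, here is a minimal sketch in plain Python. It is not how Data Refinery implements these operations; it only illustrates the idea of dropping duplicate rows by email address and then keeping rows where the age is 65 or more. The sample rows, the column names (Email, Age, Place), and the helper functions are all hypothetical and are not taken from the lab's table.

```python
# Hypothetical sample rows. The column names are assumptions for
# illustration only, not the schema of the lab's table.
rows = [
    {"Email": "a@example.com", "Age": 70, "Place": "Boston"},
    {"Email": "a@example.com", "Age": 70, "Place": "Boston"},  # duplicate email
    {"Email": "b@example.com", "Age": 42, "Place": "Austin"},
    {"Email": "c@example.com", "Age": 65, "Place": "Austin"},
]


def remove_duplicates(records, key):
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def filter_min_age(records, min_age):
    """Keep only records whose Age is at least `min_age`."""
    return [record for record in records if record["Age"] >= min_age]


deduplicated = remove_duplicates(rows, "Email")
seniors = filter_min_age(deduplicated, 65)
print([record["Email"] for record in seniors])
# prints ['a@example.com', 'c@example.com']
```

Removing duplicates before filtering mirrors the order of the steps in this lab. In this sketch the order does not change the result, because the duplicate rows are identical.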

- Click Profile to see data statistics

- Click Visualization to create a pie chart of Places.
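
The Profile and Visualization views summarize the data for you. As a rough illustration of what such summaries contain, the sketch below computes basic statistics for an age column and the share of rows per place, which is the quantity a pie chart displays. The sample rows and column names (Age, Place) are hypothetical, and the statistics chosen here are an assumption, not a list of what Data Refinery reports.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows left after cleaning. The column names are
# assumptions for illustration only.
rows = [
    {"Age": 70, "Place": "Boston"},
    {"Age": 65, "Place": "Austin"},
    {"Age": 81, "Place": "Austin"},
]

# Basic profile of the Age column.
ages = [row["Age"] for row in rows]
profile = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
}

# Number of rows per place, and each place's share of the total.
# The shares are the slice sizes a pie chart would draw.
place_counts = Counter(row["Place"] for row in rows)
total = sum(place_counts.values())
place_shares = {place: count / total for place, count in place_counts.items()}

print(profile)
print(place_shares)
```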

- Create a job to apply the changes using the Save and create a job option.

- Give the job a name and click Next.

- Within a few seconds, the reshaped asset will be added to the project.

- Click the navigation menu, then Catalog, then All catalogs, and then New Catalog to create a new catalog.

- Provide a name for the catalog, click Enforce Data Protection Rules, and then click Create to create the new catalog.

- Go back to the project (Data_Fabric_Project) and check the reshaped asset generated by the Data Refinery flow. Click the three dots at the far right of the asset and then click Publish to Catalog.

- Go back to Data_Fabric_Catalog and click the asset to open it.

- Click Asset tab to view the data.

- Click Profile to view data statistics.

Summary
In this lab, you learned how to clean and reshape data using IBM Data Refinery. You also learned how to create a catalog and how to publish data from a project to the catalog.