# Notebook for generating training data distribution and configuring Fairness

This notebook analyzes the training data and outputs a JSON file that contains the data distribution and the fairness configuration. To use this notebook, do the following:

1. Read the training data into a pandas dataframe called "data_df".  
2. Edit the cells below and provide the training data and fairness configuration information. 
3. Run the notebook. It generates a JSON file, and a download link for that file appears at the very end of the notebook.
4. Download the JSON file by clicking on the link and upload it in the IBM AI OpenScale GUI.

If you have multiple models (deployments), you will have to repeat the above steps for each model (deployment).

**Note:** Restart the kernel after executing the cell below.

In [1]:
!pip install pandas
!pip install ibm-cos-sdk
!pip install numpy
!pip install scikit-learn==0.24.1 
!pip install pyspark
!pip install lime
!pip install --upgrade ibm-watson-openscale
!pip install "ibm-wos-utils==4.1.1"

Collecting scikit-learn==0.24.1
  Downloading scikit_learn-0.24.1-cp39-cp39-manylinux2010_x86_64.whl (23.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
Successfully installed scikit-learn-0.24.1
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764027 sha256=71e7ad3f13443a688e8771947c68f241677607138298

# Read training data into a pandas data frame

In [2]:
VERSION = "5.0.1"

In [3]:
import pandas as pd
data_df=pd.read_csv('https://ibm-aws-immersion-day.s3.us-east-2.amazonaws.com/publicdata/Data-region-RI-SM-encoded.csv')

In [4]:
data_df.dtypes

REGION         int64
Total_cases    int64
Risk_Index     int64
dtype: object

# Select the services for which configuration information needs to be generated

This notebook can generate configuration information for the fairness, explainability, and drift services. Use the flags below to control which service-specific configuration information is generated.

Details of the service-specific flags available:

- enable_fairness : Flag to allow generation of fairness specific data distribution needed for configuration
- enable_explainability : Flag to allow generation of explainability specific information
- enable_drift: Flag to allow generation of drift detection model needed by drift service


    service_configuration_support = {
        "enable_fairness": True,
        "enable_explainability": True,
        "enable_drift": False
    }



In [5]:
data_df['REGION'] = pd.to_numeric(data_df['REGION'])
data_df['Total_cases'] = pd.to_numeric(data_df['Total_cases'])
data_df['Risk_Index'] = pd.to_numeric(data_df['Risk_Index'])

In [6]:
data_df.dtypes

REGION         int64
Total_cases    int64
Risk_Index     int64
dtype: object

In [7]:
service_configuration_support = {
    "enable_fairness": True,
    "enable_explainability": True,
    "enable_drift": True
}

# Training Data and Fairness Configuration Information

Please provide information about the training data which is used to train the model.  

In [8]:
training_data_info = {
    "class_label": "Risk_Index",
    "feature_columns": ['REGION','Total_cases'],
    "categorical_columns": ['REGION']
}
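
Before running the rest of the notebook, it can help to confirm that the column names above actually match the data frame. The following is a minimal sketch, not part of the OpenScale SDK: `validate_training_data_info` is a hypothetical helper, the sample frame is made-up data with the same column names, and the rule that categorical columns must be a subset of the feature columns is an assumption.

```python
import pandas as pd

def validate_training_data_info(df, info):
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []
    columns = set(df.columns)
    if info["class_label"] not in columns:
        problems.append(f"class_label '{info['class_label']}' is not a column of the data frame")
    missing = [c for c in info["feature_columns"] if c not in columns]
    if missing:
        problems.append(f"feature_columns not found: {missing}")
    # Assumption: every categorical column is expected to be a feature column as well
    not_features = [c for c in info["categorical_columns"] if c not in info["feature_columns"]]
    if not_features:
        problems.append(f"categorical_columns not in feature_columns: {not_features}")
    if info["class_label"] in info["feature_columns"]:
        problems.append("class_label must not also be listed as a feature column")
    return problems

# Made-up stand-in for data_df with the same column names
sample_df = pd.DataFrame({"REGION": [0, 1, 1], "Total_cases": [10, 250, 40], "Risk_Index": [0, 2, 1]})
sample_info = {
    "class_label": "Risk_Index",
    "feature_columns": ["REGION", "Total_cases"],
    "categorical_columns": ["REGION"],
}
print(validate_training_data_info(sample_df, sample_info))
```

In the notebook itself, the same check could be run as `validate_training_data_info(data_df, training_data_info)`.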

# Specify the Model Type

In the next cell, specify the type of your model.  If your model is a binary classification model, then set the type to "binary". If it is a multi-class classifier then set the type to "multiclass". If it is a regression model (e.g., Linear Regression), then set it to "regression".

In [9]:
#Set model_type. Acceptable values are:["binary","multiclass","regression"]
model_type = "multiclass"
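
If you are unsure which value applies, the label column itself gives a strong hint. The sketch below is a rough heuristic only: `suggest_model_type` and its `max_classes` cut-off are hypothetical and not part of the OpenScale SDK, so treat the result as a suggestion to verify against how the model was actually trained.

```python
import pandas as pd

def suggest_model_type(labels: pd.Series, max_classes: int = 20) -> str:
    """Guess the model type from the label column (heuristic, assumption-based)."""
    n_unique = labels.nunique(dropna=True)
    # Float labels or a very large number of distinct values suggest a regression target
    if pd.api.types.is_float_dtype(labels) or n_unique > max_classes:
        return "regression"
    return "binary" if n_unique == 2 else "multiclass"

print(suggest_model_type(pd.Series([0, 1, 2, 1])))    # three distinct integer labels
print(suggest_model_type(pd.Series([0, 1, 0])))       # two distinct integer labels
print(suggest_model_type(pd.Series([0.5, 1.2, 3.3]))) # continuous target
```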

# Specify the Fairness Configuration

You need to provide the following information for the fairness configuration: 

- fairness_attributes: The attributes on which you wish to monitor fairness. 
- With indirect bias support, you can also monitor protected attributes for fairness. Protected attributes are attributes that are present in the training data but are not used to train the model. To check whether indirect bias exists with respect to a protected attribute because of a possible correlation with a feature column, specify that attribute in the fairness configuration.
- type: The data type of the fairness attribute (e.g., float, int, or double).
- majority: The reference group against which the minority group is compared.
- minority: The minority group for which we want to ensure that the model is not biased.  

In [10]:
fairness_attributes = [{
    "type": "int",  # data type of the column, e.g. float, int, or double
    "feature": "REGION",
    "majority": [
        [1, 1]  # range of values for the column, e.g. [31, 45] for int or [31.4, 45.1] for float
    ],
    "minority": [
        [0, 0]  # range of values for the column, e.g. [80, 100] for int or [80.0, 99.9] for float
    ],
    "threshold": 0.9
}]

# Specify the Favorable and Unfavorable class values

The second part of the fairness configuration specifies the favourable and unfavourable class values. To measure fairness, we need to know which target field values are considered favourable and which are considered unfavourable.  

In [11]:
# For classification models use the below.
parameters = {
        "favourable_class" :  [0],
        "unfavourable_class": [2]
    }
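
To make the roles of the majority, minority, threshold, and favourable class concrete, here is a minimal sketch of a disparate impact ratio computed on made-up data. It is an illustration only and not the computation OpenScale performs: `in_ranges` and `disparate_impact` are hypothetical helpers, and reading `threshold` as a lower bound on this ratio is an assumption.

```python
import pandas as pd

def in_ranges(series, ranges):
    """Boolean mask of values that fall in any [low, high] range (inclusive on both ends)."""
    mask = pd.Series(False, index=series.index)
    for low, high in ranges:
        mask |= series.between(low, high)
    return mask

def disparate_impact(df, attribute, label_column, favourable_class):
    """Favourable-outcome rate of the minority group divided by that of the majority group."""
    favourable = df[label_column].isin(favourable_class)
    feature = df[attribute["feature"]]
    minority_rate = favourable[in_ranges(feature, attribute["minority"])].mean()
    majority_rate = favourable[in_ranges(feature, attribute["majority"])].mean()
    return minority_rate / majority_rate

# Made-up data: REGION 0 is the minority group, REGION 1 the majority group
sample_df = pd.DataFrame({
    "REGION":     [0, 0, 0, 0, 1, 1, 1, 1],
    "Risk_Index": [0, 2, 2, 1, 0, 0, 2, 0],
})
attribute = {"feature": "REGION", "majority": [[1, 1]], "minority": [[0, 0]], "threshold": 0.9}

ratio = disparate_impact(sample_df, attribute, "Risk_Index", favourable_class=[0])
print(f"disparate impact: {ratio:.3f}, meets threshold: {ratio >= attribute['threshold']}")
```

In this toy data the minority group receives the favourable class in 1 of 4 records and the majority group in 3 of 4, so the ratio falls well below the 0.9 threshold.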

# Specify the number of records which should be processed for Fairness

The final piece of information that needs to be provided is the number of records (min_records) that should be used for computing the fairness.

In [12]:
# min_records = <Minimum number of records to be considered for performing scoring>
min_records = 10

# End of Input 

You need not edit anything beyond this point.  Run the notebook and go to the very last cell.  There will be a link to download the JSON file (called: "Download training data distribution JSON file").  Download the file and upload it using the IBM AI OpenScale GUI.

*Note: Set the drop_na parameter of the TrainingStats object to 'False' if NA values have already been handled while reading the training data in the cells above.*
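
As a rough illustration of what handling NA values while reading the data could look like, the sketch below counts missing values per column and drops incomplete rows using standard pandas calls. The sample frame is made up, and dropping rows is only one possible strategy; imputation may suit your data better.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values
sample_df = pd.DataFrame({
    "REGION": [0, 1, np.nan, 1],
    "Total_cases": [10, np.nan, 40, 250],
    "Risk_Index": [0, 2, 1, 2],
})

# Count missing values per column and show only the affected columns
na_counts = sample_df.isna().sum()
print(na_counts[na_counts > 0])

# Drop rows that contain any missing value
clean_df = sample_df.dropna().reset_index(drop=True)

# If nothing is missing any more, drop_na can be set to False
still_has_na = bool(clean_df.isna().any().any())
print("rows kept:", len(clean_df), "| NA values remaining:", still_has_na)
```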

In [13]:
from ibm_watson_openscale.utils.training_stats import TrainingStats

enable_explainability = service_configuration_support.get('enable_explainability')
enable_fairness = service_configuration_support.get('enable_fairness')

if enable_explainability or enable_fairness:
    fairness_inputs = None
    if enable_fairness:
        fairness_inputs = {
                "fairness_attributes": fairness_attributes,
                "min_records" : min_records,
                "favourable_class" :  parameters["favourable_class"],
                "unfavourable_class": parameters["unfavourable_class"]
            }
    
    input_parameters = {
        "label_column": training_data_info["class_label"],
        "feature_columns": training_data_info["feature_columns"],
        "categorical_columns": training_data_info["categorical_columns"],
        "fairness_inputs": fairness_inputs,  
        "problem_type" : model_type  
    }

    training_stats = TrainingStats(data_df,input_parameters, explain=enable_explainability, fairness=enable_fairness, drop_na=True)
    config_json = training_stats.get_training_statistics()
    config_json["notebook_version"] = VERSION
#print(config_json)



### Indirect Bias
In the case of indirect bias, i.e. if protected attributes (sensitive attributes such as race or gender that are present in the training data but are not used to train the model) are being monitored for fairness:
- The bias service identifies correlations between the protected attribute and the model features. Correlated attributes are also known as proxy features.
- Correlations with model features can result in indirect bias with respect to the protected attribute even though it is not used to train the model.
- Highly correlated attributes, selected by their correlation strength, are considered when computing bias for a given protected attribute.

The following cell checks the feature columns, the non-feature columns, and the fairness configuration to determine whether the user has configured a protected attribute for fairness. If protected attributes are configured, it identifies the correlations and stores them in the fairness configuration.
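
The idea of a proxy feature can be sketched with plain pandas. This is an illustration under stated assumptions, not the method used by `IndirectBiasProcessor`: `correlated_features`, the `min_strength` cut-off, the use of Pearson correlation, and the `GENDER`, `INCOME`, and `AGE` columns are all hypothetical.

```python
import pandas as pd

def correlated_features(df, protected_attribute, feature_columns, min_strength=0.5):
    """Return {feature: correlation} for features strongly correlated with the protected attribute."""
    correlations = df[feature_columns].corrwith(df[protected_attribute])
    strong = correlations[correlations.abs() >= min_strength]
    return strong.to_dict()

# Made-up data: GENDER is a numerically encoded protected attribute that is not a model feature
sample_df = pd.DataFrame({
    "GENDER": [0, 0, 0, 1, 1, 1],
    "INCOME": [30, 35, 32, 60, 65, 58],  # closely tracks GENDER in this toy data
    "AGE":    [25, 40, 31, 40, 25, 31],  # same values in both groups, so no linear relationship
})

proxies = correlated_features(sample_df, "GENDER", ["INCOME", "AGE"])
print(proxies)
```

Here `INCOME` would be flagged as a proxy feature for `GENDER`, while `AGE` would not.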

In [14]:
# Checking if protected attributes are configured for fairness monitoring. If yes, then computing correlation information for each meta-field and updating it in the fairness configuration
if enable_fairness:
    fairness_configuration = config_json.get('fairness_configuration')
    training_columns = data_df.columns.tolist()
    label_column = training_data_info.get('class_label')
    training_columns.remove(label_column)
    feature_columns = training_data_info.get('feature_columns')
    non_feature_columns = list(set(training_columns) - set(feature_columns))
    if non_feature_columns is not None and len(non_feature_columns) > 0:
        protected_attributes = []
        fairness_attributes_list = [attribute.get('feature') for attribute in fairness_attributes]
        for col in non_feature_columns:
            if col in fairness_attributes_list:
                protected_attributes.append(col)
        if len(protected_attributes) > 0:
            from ibm_watson_openscale.utils.indirect_bias_processor import IndirectBiasProcessor
            fairness_configuration = IndirectBiasProcessor().get_correlated_attributes(data_df, fairness_configuration, feature_columns, protected_attributes, label_column)        

In [15]:
import json

print("Finished generating training distribution data")

# Create a file download link
import base64
from IPython.display import HTML

def create_download_link( title = "Download training data distribution JSON file", filename = "training_distribution.json"):  
    if enable_explainability or enable_fairness:
        output_json = json.dumps(config_json, indent=2)
        b64 = base64.b64encode(output_json.encode())
        payload = b64.decode()
        html = '<a download="{filename}" href="data:text/json;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print('No download link generated as fairness/explainability services are disabled.')

create_download_link()

Finished generating training distribution data
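
The download link above works by embedding the JSON in a base64 data URI. The sketch below shows that encoding and its reverse in isolation, which can be handy for checking that the payload survives the round trip. `to_data_uri` and `from_data_uri` are hypothetical helpers, and the `config` dictionary is made-up minimal content rather than real notebook output.

```python
import base64
import json

def to_data_uri(obj, mime="application/json"):
    """Serialize obj to JSON and wrap it in a base64 data URI."""
    raw = json.dumps(obj, indent=2).encode()
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"

def from_data_uri(uri):
    """Decode the JSON payload of a base64 data URI."""
    _header, payload = uri.split(",", 1)
    return json.loads(base64.b64decode(payload))

# Made-up minimal content standing in for config_json
config = {"notebook_version": "5.0.1", "fairness_configuration": {"min_records": 10}}

uri = to_data_uri(config)
print(uri[:40] + "...")
print(from_data_uri(uri) == config)
```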


# Drift detection model generation

Update the score function that is used to generate the drift detection model. Generating the model might take some time, depending on the size of the training dataset. The score function should output two arrays: an array of model predictions and an array of probabilities. The score function in this notebook returns them in the order `probability_array, predicted_vector`.

- The user must make sure that the data type of the "class label" column and the data type of the prediction column are the same. For example, if the class label is numeric, the prediction array should also be numeric.

- Each entry of the probability array should contain the probabilities of all unique class labels.
  For example, if model_type=multiclass and the unique class labels are A, B, C, D, each entry in the probability array should be an array of size 4, e.g. [ [0.50,0.30,0.10,0.10], [0.40,0.20,0.30,0.10], ...]
  
**Note:**
- *The user is expected to add a "score" method, which should output the prediction column array and the probability column array.*
- *The data type of the label column and the prediction column should be the same. The user needs to make sure that the label column and the prediction column array have the same unique class labels.*
- **Update the score function below with the help of the templates documented [here](https://github.com/IBM-Watson/aios-data-distribution/blob/master/Score%20function%20templates%20for%20drift%20detection.md)**
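
The requirements above can be checked mechanically before the score function is handed to the drift trainer. The following is a minimal sketch: `check_score_output` is a hypothetical helper, not part of `ibm-wos-utils`, and the arrays are made-up values for a three-class model.

```python
import numpy as np

def check_score_output(probabilities, predictions, class_labels):
    """Return a list of problems with the two arrays a score function produced."""
    probabilities = np.asarray(probabilities)
    predictions = np.asarray(predictions)
    problems = []
    # Each probability entry needs one value per unique class label
    if probabilities.ndim != 2 or probabilities.shape[1] != len(class_labels):
        problems.append(f"expected probabilities of shape (n, {len(class_labels)}), got {probabilities.shape}")
    if len(probabilities) != len(predictions):
        problems.append("probabilities and predictions have different lengths")
    # Predictions must use the same values (and therefore the same type) as the label column
    unknown = set(predictions.tolist()) - set(class_labels)
    if unknown:
        problems.append(f"predictions contain values that are not class labels: {sorted(unknown, key=str)}")
    return problems

class_labels = [0, 1, 2]  # unique values of the label column
good = check_score_output([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [0, 2], class_labels)
bad = check_score_output([[0.7, 0.3], [0.4, 0.6]], ["0", "2"], class_labels)
print(good)
print(bad)
```

The second call reports two problems: the probability entries have the wrong size, and the predictions are strings while the class labels are integers.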

## Update SageMaker Credentials in the below cell

In [16]:
SAGEMAKER_CREDENTIALS = {
    "access_key_id": '',
    "secret_access_key": '',
    "region": 'us-east-2'
}

### Update SageMaker endpoint name in the below cell

In [17]:
def score(training_data_frame):
    import io
    import json

    import boto3
    import numpy as np

    # User input needed
    endpoint_name = 'UPDATE THE SAGEMAKER ENDPOINT NAME'

    access_id = SAGEMAKER_CREDENTIALS.get('access_key_id')
    secret_key = SAGEMAKER_CREDENTIALS.get('secret_access_key')
    region = SAGEMAKER_CREDENTIALS.get('region')

    # Convert the training data frame to a CSV payload (one row per line, no header)
    train_df_bytes = io.BytesIO()
    np.savetxt(train_df_bytes, training_data_frame.values, delimiter=',', fmt='%g')
    payload_data = train_df_bytes.getvalue().decode().rstrip()

    # Score the training data against the SageMaker endpoint
    runtime = boto3.client('sagemaker-runtime', region_name=region, aws_access_key_id=access_id, aws_secret_access_key=secret_key)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType='text/csv', Body=payload_data)
    results_decoded = json.loads(response['Body'].read().decode())

    # Extract the predicted label and the class probabilities of each record
    results = results_decoded['predictions']

    predicted_vector_list = []
    probability_array_list = []

    for value in results:
        predicted_vector_list.append(value['predicted_label'])
        probability_array_list.append(value['score'])

    # Convert to numpy arrays
    probability_array = np.array(probability_array_list)
    predicted_vector = np.array(predicted_vector_list)

    return probability_array, predicted_vector
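
The two data-handling steps of the score function can be tried without an AWS account. The sketch below repeats the CSV serialization and the response parsing from the cell above against a mocked response body. `to_csv_payload`, `parse_predictions`, and the mocked values are hypothetical; the response layout (`predictions`, `predicted_label`, `score`) is taken from the score function above and may differ for your endpoint.

```python
import io
import json

import numpy as np
import pandas as pd

def to_csv_payload(frame):
    """Serialize a data frame to header-less CSV text, one record per line."""
    buffer = io.BytesIO()
    np.savetxt(buffer, frame.values, delimiter=",", fmt="%g")
    return buffer.getvalue().decode().rstrip()

def parse_predictions(response_body):
    """Turn a JSON response body into (probability_array, predicted_vector)."""
    results = json.loads(response_body)["predictions"]
    predicted = np.array([r["predicted_label"] for r in results])
    probabilities = np.array([r["score"] for r in results])
    return probabilities, predicted

# Made-up feature rows with the notebook's feature column names
frame = pd.DataFrame({"REGION": [0, 1], "Total_cases": [10, 250]})
payload = to_csv_payload(frame)
print(payload)

# Mocked endpoint response for a three-class model
mock_body = json.dumps({"predictions": [
    {"predicted_label": 0, "score": [0.7, 0.2, 0.1]},
    {"predicted_label": 2, "score": [0.1, 0.3, 0.6]},
]})
probabilities, predicted = parse_predictions(mock_body)
print(probabilities.shape, predicted)
```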

In [18]:
print(data_df.shape[0])
print(score)

2220
<function score at 0x7fe40f026dc0>


In [19]:
#Generate drift detection model
from ibm_wos_utils.drift.drift_trainer import DriftTrainer
enable_drift = service_configuration_support.get('enable_drift')
if enable_drift:
    drift_detection_input = {
        "feature_columns":training_data_info.get('feature_columns'),
        "categorical_columns":training_data_info.get('categorical_columns'),
        "label_column": training_data_info.get('class_label'),
        "problem_type": model_type
    }
    
    drift_trainer = DriftTrainer(data_df,drift_detection_input)
    if model_type != "regression":
        #Note: batch_size can be customized by user as per the training data size
        drift_trainer.generate_drift_detection_model(score,batch_size=32)
    
    #Note:
    # - Two column constraints are not computed beyond two_column_learner_limit(default set to 200)
    # - Categorical columns with large (determined by categorical_unique_threshold; default > 0.8) number of unique values relative to total rows in the column are discarded. 
    #User can adjust the value depending on the requirement
    
    drift_trainer.learn_constraints(two_column_learner_limit=2, categorical_unique_threshold=0.8)
    drift_trainer.create_archive()

Scoring training dataframe...: 100%|██████████| 1776/1776 [00:02<00:00, 627.03rows/s]
Optimising Drift Detection Model...: 100%|██████████| 40/40 [00:37<00:00,  1.06models/s]
Scoring training dataframe...: 100%|██████████| 444/444 [00:00<00:00, 775.00rows/s]
Computing feature stats...: 100%|██████████| 2/2 [00:00<00:00, 461.88features/s]
Learning single feature constraints...: 100%|██████████| 3/3 [00:00<00:00, 404.76constraints/s]
Learning two feature constraints...: 100%|██████████| 2/2 [00:00<00:00, 121.22constraints/s]
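
The note on `categorical_unique_threshold` in the cell above can be illustrated with a short sketch. It reflects one reading of that comment, namely that a categorical column is discarded when its number of unique values divided by the number of rows exceeds the threshold; it is not the code `DriftTrainer` runs. `high_cardinality_columns` and the `PATIENT_ID` column are hypothetical.

```python
import pandas as pd

def high_cardinality_columns(df, categorical_columns, categorical_unique_threshold=0.8):
    """List categorical columns whose unique-value ratio exceeds the threshold."""
    n_rows = len(df)
    return [
        column for column in categorical_columns
        if df[column].nunique() / n_rows > categorical_unique_threshold
    ]

# Made-up data: REGION repeats, PATIENT_ID is unique per row
sample_df = pd.DataFrame({
    "REGION": [0, 1, 0, 1, 0],
    "PATIENT_ID": ["a", "b", "c", "d", "e"],
})

discarded = high_cardinality_columns(sample_df, ["REGION", "PATIENT_ID"])
print(discarded)
```

`REGION` has a ratio of 2/5 and is kept, while `PATIENT_ID` has a ratio of 5/5 and would be discarded under this reading.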


In [20]:
#Generate a download link for drift detection model
from IPython.display import HTML
import base64

def create_download_link_for_ddm(title="Download Drift detection model", filename="drift_detection_model.tar.gz"):
    if enable_drift:
        # Read the drift detection model archive and embed it in the link as base64
        with open(filename, 'rb') as file:
            ddm = file.read()
        b64 = base64.b64encode(ddm)
        payload = b64.decode()

        html = '<a download="{filename}" href="data:application/gzip;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload, title=title, filename=filename)
        return HTML(html)
    else:
        print("Drift Detection is not enabled. Please enable and rerun the notebook")

create_download_link_for_ddm()
