# ML Experiment Tracking and Model Management

ML experiment tracking is the process of organizing, recording, and analyzing the results of machine learning experiments. This document explains how to create a workflow to enable ML experiment tracking.

You can find the complete ML experiment tracking workflow code in [Treasure Boxes](https://github.com/treasure-data/treasure-boxes/blob/automl/machine-learning-box/automl/ml_experiment.dig).

**Table of Contents**

* [Track ML Experiments](#track-ml-experiments)
* [Record Evaluation Results for each Model](#record-evaluation-results-for-each-model)
* [Detect Drift in Model Performance over Time](#detect-drift-in-model-performance-over-time)


# Track ML Experiments

As a best practice, track each ML experiment in your end-to-end data processing workflow by adding a *track_experiment* task after the train task. The *track_experiment* task issues a SQL query that records ML experiment information and the model name in a TD table named *automl_experiments*. Sample workflow code follows:


```yaml
+create_db_tbl_if_not_exists:
  td_ddl>: null
  create_databases:
    - '${output_database}'
  create_tables:
    - automl_experiments
    - automl_eval_results
+train:
  ml_train>:
    docker:
      task_mem: 128g
    notebook: gluon_train
    model_name: 'gluon_model_${session_id}'
    input_table: '${input_database}.${train_data_table}'
    target_column: '${target_column}'
    time_limit: '${fit_time_limit}'
    share_model: true
    export_leaderboard: '${output_database}.leaderboard_${train_data_table}'
    export_feature_importance: '${output_database}.feature_importance_${train_data_table}'
+track_experiment:
  td>: queries/track_experiment.sql
  insert_into: '${output_database}.automl_experiments'
  last_executed_notebook: '${automl.last_executed_notebook}'
  user_id: '${automl.last_executed_user_id}'
  user_email: '${automl.last_executed_user_email}'
  model_name: 'gluon_model_${session_id}'
  shared_model: '${automl.shared_model}'
  task_attempt_id: '${attempt_id}'
  session_time: '${session_local_time}'
  engine: presto
```

The above workflow code generates the following example content in the *automl_experiments* table:

| task_attempt_id | session_time | user_id | user_email | model_name | shared_model | notebook_url |
|  --- | --- | --- | --- | --- | --- | --- |
| 849779333 | 2023-05-18 7:19:18 | 7776 | xxx@treasure-data.com | gluon_model_161722236 | b4a568da-e6f3-4057-b694-e2e19bf0e924 | https://console.treasuredata.com/app/workflows/automl/notebook/4a3c431b3aea4705b32a47d85ca46368 |
| 849772621 | 2023-05-18 7:08:30 | 7776 | xxx@treasure-data.com | gluon_model_161721046 | 94ad5d0e-89ac-4836-99c4-2bc8f975ccbe | https://console.treasuredata.com/app/workflows/automl/notebook/b390b932d4a64fd3a2dc3b75503430fb |
| 849768123 | 2023-05-18 7:01:13 | 7777 | yyy@treasure-data.com | gluon_model_161720337 | 4f2351a3-dd8c-418e-8057-4c8ec9a90cbe | https://console.treasuredata.com/app/workflows/automl/notebook/e8b3319c982345a48ff74db0003d7c9c |
| 849760942 | 2023-05-18 6:49:50 | 7776 | xxx@treasure-data.com | gluon_model_161718676 | 93e68b09-1a2f-4049-bb89-2bfe596ca9b3 | https://console.treasuredata.com/app/workflows/automl/notebook/b02959b1469e4b9c86ec6c6809acc5ff |
| 849753199 | 2023-05-18 6:36:36 | 7776 | xxx@treasure-data.com | gluon_model_161717236 | a7e456d3-8fcf-4173-afb7-f2d58bb985cd | https://console.treasuredata.com/app/workflows/automl/notebook/d3dcbbab99774bd594106a496ec2b2ab |


In the table, each record contains the model name, details of the user who created the model, the session time when the model was created, and a link to the generated notebook.

# Record Evaluation Results for each Model

You can optionally record each model's quality using an evaluation dataset. The following example workflow records model quality using [AUROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), a standard evaluation measure for classification problems. The `record_evaluation` task records evaluation results in the *automl_eval_results* table.


```yaml
+predict:
  ml_predict>:
    docker:
      task_mem: 64g
    notebook: gluon_predict
    model_name: 'gluon_model_${session_id}'
    input_table: '${input_database}.${test_data_table}'
    output_table: '${output_database}.predicted_${test_data_table}_${session_id}'
+evaluation:
  td>: queries/auc.sql
  table: '${output_database}.predicted_${test_data_table}_${session_id}'
  target_column: '${target_column}'
  positive_class: ' >50K'
  store_last_results: true
  engine: hive
+record_evaluation:
  td>: queries/record_evaluation.sql
  insert_into: '${output_database}.automl_eval_results'
  engine: presto
  model_name: 'gluon_model_${session_id}'
  test_table: '${input_database}.${test_data_table}'
  session_time: '${session_local_time}'
  auc: '${td.last_results.auc}'
```

Treasure Data's Hive execution engine supports Hivemall, which provides a number of evaluation measures. See the [Hivemall documentation](https://hivemall.github.io/eval/binary_classification_measures.html) for details.
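
To make the evaluation measure concrete, the following is a minimal pure-Python sketch of how AUROC can be computed from predicted scores using the rank-sum (Mann-Whitney U) formulation. It is not the content of `queries/auc.sql` or Hivemall's implementation; the function name `auroc` and its parameters are hypothetical and chosen only for illustration.

```python
def auroc(labels, scores, positive_class):
    """Compute AUROC with the rank-sum (Mann-Whitney U) formulation.

    labels: true class of each example
    scores: predicted score for the positive class (higher = more positive)
    positive_class: the label value treated as positive, e.g. ' >50K'
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Group tied scores and give each member the average 1-based rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == positive_class:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative example")
    # U statistic: number of (positive, negative) pairs ranked correctly,
    # with ties counted as half.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Illustrative values only; ' >50K' mirrors the positive_class in the workflow.
example_labels = [" >50K", " <=50K", " >50K", " <=50K"]
example_scores = [0.9, 0.1, 0.2, 0.3]
print(auroc(example_labels, example_scores, " >50K"))
```

A value of 1.0 means every positive example is scored above every negative one, and 0.5 corresponds to random ranking.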

Example content in the *automl_eval_results* table:

| session_time | model_name | test_table | auroc |
|  --- | --- | --- | --- |
| 2023-06-06 6:21:40 | gluon_model_164947310 | ml_datasets.gluon_test | 0.9226243033 |
| 2023-06-14 6:49:22 | gluon_model_166350110 | ml_datasets.gluon_test | 0.9299335758 |
| 2023-06-15 7:35:30 | gluon_model_166532223 | ml_datasets.gluon_test | 0.9300292252 |
| 2023-05-18 7:19:18 | gluon_model_161722236 | ml_datasets.gluon_test | 0.9238149699 |

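Because both tables carry `model_name`, experiment records can be joined with their evaluation results. The sketch below illustrates the idea locally with Python's built-in `sqlite3` module as a stand-in for Treasure Data; the reduced column set, the `test_table` column name, and the helper names are assumptions made for illustration, and the rows are copied from the example tables in this document.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the two tracking tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE automl_experiments (
            task_attempt_id INTEGER, session_time TEXT,
            user_email TEXT, model_name TEXT
        );
        CREATE TABLE automl_eval_results (
            session_time TEXT, model_name TEXT, test_table TEXT, auroc REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO automl_experiments VALUES (?, ?, ?, ?)",
        [
            (849779333, "2023-05-18 7:19:18", "xxx@treasure-data.com", "gluon_model_161722236"),
            (849772621, "2023-05-18 7:08:30", "xxx@treasure-data.com", "gluon_model_161721046"),
        ],
    )
    conn.executemany(
        "INSERT INTO automl_eval_results VALUES (?, ?, ?, ?)",
        [
            ("2023-05-18 7:19:18", "gluon_model_161722236", "ml_datasets.gluon_test", 0.9238149699),
            ("2023-06-06 6:21:40", "gluon_model_164947310", "ml_datasets.gluon_test", 0.9226243033),
        ],
    )
    return conn


def experiments_with_scores(conn):
    """List tracked experiments, newest first, with AUROC when one was recorded."""
    return conn.execute(
        """
        SELECT e.model_name, e.user_email, r.auroc
        FROM automl_experiments AS e
        LEFT JOIN automl_eval_results AS r ON r.model_name = e.model_name
        ORDER BY e.task_attempt_id DESC
        """
    ).fetchall()


for row in experiments_with_scores(build_demo_db()):
    print(row)
```

The `LEFT JOIN` keeps experiments that have no evaluation record yet, so models that still need evaluating show up with an empty score.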

# Detect Drift in Model Performance over Time

"Drift" is a term used in machine learning to describe how the performance of a model gradually degrades or becomes stale over time. There are two main types of drift: data drift and [concept drift](https://en.wikipedia.org/wiki/Concept_drift). Both can lead to a decline in the performance of a machine learning model.

Using the following workflow tasks, you can record each model's accuracy and quality to detect drift in data and model performance. You can use a scheduled workflow job to keep track of model performance and issue a warning if the performance drifts.

There are several schemes for drift detection. The following example workflow identifies a degradation in ML model performance using an evaluation measure and sends an alert email when drift is detected:


```yaml
# timezone: PST
# schedule:
#  daily>: 07:00:00
+evaluation:
  td>: queries/auc.sql
  table: '${output_database}.predicted_${test_data_table}_${session_id}'
  target_column: '${target_column}'
  positive_class: ' >50K'
  store_last_results: true
  engine: hive
+alert_if_drift_detected:
  if>: '${td.last_results.auc < 0.93}'
  _do:
    mail>:
      data: 'Detect drift in model performance. AUC was ${td.last_results.auc}.'
    subject: Drift detected
    to:
      - me@example.com
    bcc:
      - foo@example.com
      - bar@example.com
```

You can [schedule workflow executions](https://docs.digdag.io/scheduling_workflow.html?highlight=schedule) for drift detection. When drift is detected, you can send an alert email or rebuild the model using a [conditional operator](https://docs.digdag.io/operators/if.html).
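
The threshold check in the workflow above can also be expressed outside of Digdag. The following is a minimal Python sketch of the same rule (alert when the latest AUC falls below a fixed minimum), with an optional second rule that compares the latest value to the best earlier one. The function name `drift_detected`, its parameters, and the sample values are hypothetical; only the 0.93 threshold comes from the workflow example.

```python
def drift_detected(auc_history, min_auc=0.93, max_drop=None):
    """Decide whether the most recent evaluation indicates drift.

    auc_history: list of (session_time, auc) tuples ordered oldest to newest
    min_auc: absolute floor, mirroring `td.last_results.auc < 0.93`
    max_drop: optional allowed drop relative to the best earlier AUC
    """
    if not auc_history:
        raise ValueError("auc_history must contain at least one evaluation")
    latest = auc_history[-1][1]
    # Rule 1: fixed threshold, as in the workflow's if> condition.
    if latest < min_auc:
        return True
    # Rule 2 (optional): relative degradation against the best earlier run.
    if max_drop is not None and len(auc_history) > 1:
        baseline = max(auc for _, auc in auc_history[:-1])
        return baseline - latest > max_drop
    return False


# Made-up AUC values for illustration only.
history = [("day 1", 0.95), ("day 2", 0.94), ("day 3", 0.92)]
if drift_detected(history):
    print("Drift detected: latest AUC was", history[-1][1])
```

A fixed floor is simple to reason about, while the relative rule can catch degradation in a model whose AUC is still above the floor. Which rule and which thresholds are appropriate depends on the model and the data.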