AutoML Pipeline

Learn how to build an automated machine learning pipeline.

April 4, 2024

You can use HPE ML Data Management to build an automated machine learning pipeline that trains a model on a CSV file.

Before You Start #

Tutorial #

The Docker image that holds the user code for this tutorial is built on the python:3.7-slim-buster base image. The user code relies on the mljar-supervised package for automated feature engineering, model selection, and hyperparameter tuning, so you can train a model on structured data without writing that logic yourself.
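
As a rough illustration of what such user code might look like, the sketch below parses the flags that this tutorial's pipeline template passes to /workdir/automl.py (--input, --target-col, --output) and forwards two optional settings to mljar-supervised. It is not the automl.py shipped in the image: the function names, the CSV discovery logic, and the mapping of --mode and --random_state onto AutoML keyword arguments are all assumptions made for this example.

```python
import argparse
import glob
import os


def parse_args(argv=None):
    """Parse the flags that the pipeline's transform command passes in."""
    parser = argparse.ArgumentParser(description="Train an AutoML model on CSV files.")
    parser.add_argument("--input", required=True, help="Directory containing CSV files")
    parser.add_argument("--target-col", required=True, help="Name of the target column")
    parser.add_argument("--output", required=True, help="Directory for model artifacts")
    # Optional settings forwarded to mljar-supervised (assumed mapping).
    parser.add_argument("--mode", default="Explain")
    parser.add_argument("--random_state", type=int, default=42)  # arbitrary default
    return parser.parse_args(argv)


def train(args):
    """Fit an AutoML model on every CSV file found under args.input."""
    # Imported lazily so that argument parsing works without these packages.
    import pandas as pd
    from supervised.automl import AutoML

    csv_files = sorted(glob.glob(os.path.join(args.input, "*.csv")))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {args.input}")
    data = pd.concat((pd.read_csv(path) for path in csv_files), ignore_index=True)

    features = data.drop(columns=[args.target_col])
    target = data[args.target_col]

    automl = AutoML(results_path=args.output, mode=args.mode, random_state=args.random_state)
    automl.fit(features, target)


# Parse the same flags the tutorial's pipeline would pass (no training happens here).
example = parse_args([
    "--input", "/pfs/csv_data/",
    "--target-col", "MEDV",
    "--output", "/pfs/out/",
    "--mode", "Explain",
    "--random_state", "42",
])
print(example)
```

Inside a container, a script like this would end with train(parse_args()) so that it reads the flags from the command line.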

1. Create a Project & Input Repo #

  1. Create a project named automl-tutorial.
    pachctl create project automl-tutorial
  2. Set the project as current.
    pachctl config update context --project automl-tutorial
  3. Create a new csv_data repo.
    pachctl create repo csv_data
  4. Upload the housing-simplified-1.csv file to the repo.
    pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-1.csv

2. Create a Jsonnet Pipeline #

  1. Download or save our automl.jsonnet template.

    ////
    // Template arguments:
    //
    // name : The name of this pipeline, for disambiguation when 
    //          multiple instances are created.
    // input : the repo from which this pipeline will read the csv file to which
    //       it applies automl.
    // target_col : the column of the csv to be used as the target
    // args : additional parameters to pass to the automl regressor (e.g. "--random_state 42")
    ////
    function(name='regression', input, target_col, args='')
    {
      pipeline: { name: name },
      input: {
        pfs: {
          glob: "/",
          repo: input
        }
      },
      transform: {
        cmd: ["python", "/workdir/automl.py", "--input", "/pfs/" + input + "/", "--target-col", target_col, "--output", "/pfs/out/"] + std.split(args, ' '),
        image: "jimmywhitaker/automl:dev0.02"
      }
    }
  2. Create the AutoML pipeline by referencing and filling out the template’s arguments:

    pachctl update pipeline --jsonnet /path/to/automl.jsonnet  \
        --arg name="regression" \
        --arg input="csv_data" \
        --arg target_col="MEDV" \
        --arg args="--mode Explain --random_state 42"

    Training starts automatically. When it completes, the trained model and evaluation metrics are written to the pipeline's output repo, which has the same name as the pipeline (regression in this example).
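
To make the template expansion concrete, the following sketch mirrors the automl.jsonnet function in Python and prints the pipeline specification it would produce for the arguments used above. It is only an illustration of how the arguments are substituted, not how pachctl evaluates Jsonnet; the render_pipeline helper and its input_repo parameter name are made up for this example.

```python
import json


def render_pipeline(input_repo, target_col, name="regression", args=""):
    """Mirror the automl.jsonnet template and return the pipeline spec as a dict."""
    cmd = [
        "python", "/workdir/automl.py",
        "--input", "/pfs/" + input_repo + "/",
        "--target-col", target_col,
        "--output", "/pfs/out/",
    ] + args.split(" ")  # same idea as std.split(args, ' ') in the template
    return {
        "pipeline": {"name": name},
        "input": {"pfs": {"glob": "/", "repo": input_repo}},
        "transform": {"cmd": cmd, "image": "jimmywhitaker/automl:dev0.02"},
    }


spec = render_pipeline(
    input_repo="csv_data",
    target_col="MEDV",
    name="regression",
    args="--mode Explain --random_state 42",
)
print(json.dumps(spec, indent=2))
```

The glob pattern "/" treats the whole repo as a single datum, so every change to csv_data causes the pipeline to reprocess all of its files. In this Python mirror, an empty args string yields one empty-string element at the end of cmd; the Jsonnet template may behave the same way, so passing at least one flag is the safer choice.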

3. Update the Dataset #

  1. Update the dataset using housing-simplified-2.csv; HPE ML Data Management retrains the model automatically.
    pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-2.csv
  2. Repeat the previous step as many times as you want. Each time, HPE ML Data Management automatically retrains the model and outputs the new model and evaluation metrics to the AutoML output repo.
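
If you want to script the repeated uploads, a small wrapper around the same pachctl put file command is one option. The sketch below only builds and prints the commands by default; the helper names and the dry_run switch are assumptions for this example, and the local paths are the placeholders used in this tutorial.

```python
import subprocess


def put_file_command(local_path, repo="csv_data", branch="master",
                     dest="housing-simplified.csv"):
    """Build the pachctl command that overwrites the dataset with local_path."""
    return ["pachctl", "put", "file", f"{repo}@{branch}:{dest}", "-f", local_path]


def upload(local_path, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    command = put_file_command(local_path)
    print(" ".join(command))
    if not dry_run:
        # Requires pachctl on the PATH and an active context for the project.
        subprocess.run(command, check=True)
    return command


# Placeholder paths; replace them with your own dataset versions.
for path in ["/path/to/housing-simplified-1.csv", "/path/to/housing-simplified-2.csv"]:
    upload(path)  # dry run: prints the command without calling pachctl
```

Each upload writes to the same destination path in the repo, so every run replaces the previous dataset and produces a new commit for the pipeline to process.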

User Code Assets #

The Docker image used in this tutorial was built with the following assets:

Assets: