Task Parallelism Pipeline

Learn how to build a scalable inference pipeline using task parallelism.

April 4, 2024

In this tutorial, we’ll build a scalable inference pipeline for breast cancer detection using task parallelism.

Before You Start #

Tutorial #

The user code in this tutorial’s Docker image is built on top of the pytorch/pytorch base image, which includes the necessary dependencies. The underlying code and pre-trained breast cancer detection model come from this repo, developed by the Center for Data Science and Department of Radiology at NYU. Their original paper can be found here.

1. Create an Input Repo #

  1. Make sure the Tutorials project you created in the Standard ML Pipeline tutorial is set in your active context. (This only needs attention if you have changed your active context since completing the first tutorial.)

    pachctl config get context localhost:80
    
    # {
    #   "pachd_address": "grpc://localhost:80",
    #   "cluster_deployment_id": "KhpCZx7c8prdB268SnmXjELG27JDCaji",
    #   "project": "Tutorials"
    # }
  2. Create the following repos:

    pachctl create repo models
    pachctl create repo sample_data

2. Create CPU Pipelines #

In task parallelism, we separate the CPU-based preprocessing tasks from the GPU-based tasks, which reduces cloud costs when scaling because GPU resources are requested only by the steps that need them. Because inference is split into multiple tasks, each task pipeline can be updated independently, which simplifies model deployment and collaboration.

We can split the run.sh script used in the previous tutorial (Data Parallelism Pipeline) into 5 separate processing steps (4 already defined in the script + a visualization step) which will become Pachyderm pipelines, so each can be scaled separately.
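The dependency structure of these steps can be sketched in Python. This is a minimal illustration, not part of the tutorial's user code: the repo and pipeline names are taken from the pipeline specs in this tutorial, while `PIPELINE_INPUTS`, `SOURCE_REPOS`, and `creation_order` are hypothetical names introduced here. The visualization step is left out because its spec is not defined in this section.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each pipeline maps to the repos it reads from, as declared in this tutorial's specs.
PIPELINE_INPUTS = {
    "crop": {"sample_data"},
    "extract_centers": {"crop"},
    "generate_heatmaps": {"crop", "extract_centers", "models"},
    "classify": {"crop", "extract_centers", "generate_heatmaps", "models"},
}

# Repos created by hand with `pachctl create repo`, not produced by a pipeline.
SOURCE_REPOS = {"sample_data", "models"}


def creation_order(pipeline_inputs, source_repos):
    """Return pipelines in an order where every input repo already exists."""
    sorter = TopologicalSorter(pipeline_inputs)
    return [name for name in sorter.static_order() if name not in source_repos]


order = creation_order(PIPELINE_INPUTS, SOURCE_REPOS)
print(order)
```

The resulting order matches the order in which the pipelines are created in the steps that follow: the two CPU pipelines first, then the two GPU pipelines.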

Crop Pipeline #

  1. Create a file named crop.json with the following contents:
{
"pipeline": {
  "name": "crop"
},
"description": "Remove background of image and save cropped files.",
"input": {
  "pfs": {
    "repo": "sample_data",
    "glob": "/*"
  }
},
"transform": {
  "cmd": [
    "/bin/bash",
    "multi-stage/crop.sh"
  ],
  "image": "pachyderm/breast_cancer_classifier:1.11.6"
}
}
  2. Save the file.
  3. Create the pipeline.
    pachctl create pipeline -f /path/to/crop.json

Extract Centers Pipeline #

  1. Create a file named extract_centers.json with the following contents:
{
  "pipeline": {
      "name": "extract_centers"
  },
  "description": "Compute and Extract Optimal Image Centers.",
  "input": {
    "pfs": {
      "repo": "crop",
      "glob": "/*"
    }
  },
  "transform": {
      "cmd": [
          "/bin/bash",
          "multi-stage/extract_centers.sh"
      ],
      "image": "pachyderm/breast_cancer_classifier:1.11.6"
  }
}
  2. Save the file.
  3. Create the pipeline.
    pachctl create pipeline -f /path/to/extract_centers.json
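The two CPU pipeline specs differ only in their name, description, input repo, and script. As a rough sketch of that shared shape, the following Python generates both specs from one helper. The helper name `simple_spec` and the `rendered` dictionary are hypothetical; the field values are copied from the specs above.

```python
import json

IMAGE = "pachyderm/breast_cancer_classifier:1.11.6"


def simple_spec(name, description, input_repo, script):
    """Build a single-input pipeline spec that treats each top-level entry as a datum."""
    return {
        "pipeline": {"name": name},
        "description": description,
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        "transform": {"cmd": ["/bin/bash", script], "image": IMAGE},
    }


specs = [
    simple_spec(
        "crop",
        "Remove background of image and save cropped files.",
        "sample_data",
        "multi-stage/crop.sh",
    ),
    simple_spec(
        "extract_centers",
        "Compute and Extract Optimal Image Centers.",
        "crop",
        "multi-stage/extract_centers.sh",
    ),
]

# Render each spec as JSON text, keyed by pipeline name.
rendered = {spec["pipeline"]["name"]: json.dumps(spec, indent=2) for spec in specs}
print(rendered["crop"])
```

Each rendered string can be saved as `<name>.json` and passed to `pachctl create pipeline -f`, exactly like the hand-written files.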

3. Create GPU Pipelines #

Generate Heatmaps Pipeline #

  1. Create a file named generate_heatmaps.json with the following contents:
{
"pipeline": {
  "name": "generate_heatmaps"
},
"description": "Generates benign and malignant heatmaps for cropped images using patch classifier.",
"input": {
  "cross": [
    {
      "join": [
        {
          "pfs": {
            "repo": "crop",
            "glob": "/(*)",
            "join_on": "$1",
            "lazy": false
          }
        },
        {
          "pfs": {
            "repo": "extract_centers",
            "glob": "/(*)",
            "join_on": "$1",
            "lazy": false
          }
        }
      ]
    },
    {
      "pfs": {
        "repo": "models",
        "glob": "/",
        "lazy": false
      }
    }
  ]
},
"transform": {
  "cmd": [
    "/bin/bash",
    "multi-stage/generate_heatmaps.sh"
  ],
  "image": "pachyderm/breast_cancer_classifier:1.11.6"
},
"resource_limits": {
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
},
"resource_requests": {
  "memory": "4G",
  "cpu": 1
}
}
  2. Save the file.
  3. Create the pipeline.
    pachctl create pipeline -f /path/to/generate_heatmaps.json
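The `generate_heatmaps` input combines a join with a cross. The following Python sketch approximates how those two operators pair up data: the join matches top-level entries that share the same captured name (`$1`), and the cross combines each matched pair with the whole `models` repo. It is a simplified model for intuition only, not Pachyderm's implementation, and the entry names (`/exam_a` and so on) and function names are hypothetical.

```python
from itertools import product


def captures(paths):
    """Approximate glob "/(*)": each top-level entry matches, and its name becomes $1."""
    return {path.lstrip("/").split("/", 1)[0] for path in paths}


def inner_join(repos):
    """Keep only the keys present in every repo; emit one datum per shared key."""
    shared = set.intersection(*(captures(paths) for paths in repos.values()))
    return [{repo: f"/{key}" for repo in repos} for key in sorted(shared)]


def cross(left_datums, right_datums):
    """Combine every datum on the left with every datum on the right."""
    return [{**left, **right} for left, right in product(left_datums, right_datums)]


# Hypothetical repo contents: extract_centers has not produced /exam_c yet.
crop_entries = ["/exam_a", "/exam_b", "/exam_c"]
centers_entries = ["/exam_a", "/exam_b"]

joined = inner_join({"crop": crop_entries, "extract_centers": centers_entries})
# Glob "/" on models yields a single datum containing the whole repo.
datums = cross(joined, [{"models": "/"}])

for datum in datums:
    print(datum)
```

Under these assumptions, the pipeline would process two datums, each containing one cropped entry, its matching centers entry, and the full set of models.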

Classify Pipeline #

  1. Create a file named classify.json with the following contents:
{
"pipeline": {
  "name": "classify"
},
"description": "Runs the image only model and image+heatmaps model for breast cancer prediction.",
"input": {
  "cross": [
    {
      "join": [
        {
          "pfs": {
            "repo": "crop",
            "glob": "/(*)",
            "join_on": "$1"
          }
        },
        {
          "pfs": {
            "repo": "extract_centers",
            "glob": "/(*)",
            "join_on": "$1"
          }
        },
        {
          "pfs": {
            "repo": "generate_heatmaps",
            "glob": "/(*)",
            "join_on": "$1"
          }
        }
      ]
    },
    {
      "pfs": {
        "repo": "models",
        "glob": "/"
      }
    }
  ]
},
"transform": {
  "cmd": [
    "/bin/bash",
    "multi-stage/classify.sh"
  ],
  "image": "pachyderm/breast_cancer_classifier:1.11.6"
},
"resource_limits": {
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
},
"resource_requests": {
  "memory": "4G",
  "cpu": 1
}
}
  2. Save the file.
  3. Create the pipeline.
    pachctl create pipeline -f /path/to/classify.json
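Because the `classify` input nests a join inside a cross, it can be useful to list every repo a spec reads from before creating the pipeline. The sketch below walks the `input` section recursively. The function name `input_repos` is hypothetical, and the walk only handles the `pfs`, `cross`, and `join` keys used in this tutorial.

```python
def input_repos(node):
    """Collect repo names from a pipeline spec's input section, depth first."""
    repos = []
    if "pfs" in node:
        repos.append(node["pfs"]["repo"])
    # Only the combinators used in this tutorial are handled here.
    for key in ("cross", "join"):
        for child in node.get(key, []):
            repos.extend(input_repos(child))
    return repos


# The input section of classify.json, as defined above.
classify_input = {
    "cross": [
        {
            "join": [
                {"pfs": {"repo": "crop", "glob": "/(*)", "join_on": "$1"}},
                {"pfs": {"repo": "extract_centers", "glob": "/(*)", "join_on": "$1"}},
                {"pfs": {"repo": "generate_heatmaps", "glob": "/(*)", "join_on": "$1"}},
            ]
        },
        {"pfs": {"repo": "models", "glob": "/"}},
    ]
}

repos = input_repos(classify_input)
print(repos)
```

Comparing this list against the repos and pipelines that already exist is one way to catch a misspelled repo name before running `pachctl create pipeline`.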

4. Upload Dataset #

  1. Open or download this GitHub repo.
    gh repo clone pachyderm/docs-content
  2. Navigate to this tutorial.
    cd content/2.7.x/build-dags/tutorials/task-parallelism
  3. Upload the sample_data and models folders to your repos.
    pachctl put file -r sample_data@master -f sample_data/
    pachctl put file -r models@master -f models/
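If you prefer to script the upload, the following Python sketch builds the same two `pachctl put file` commands and reports any local folder that is missing before anything is sent. The function names are hypothetical, and the sketch only prints the commands; it assumes you would pass each argument list to `subprocess.run(argv, check=True)` yourself to execute it.

```python
import shlex
from pathlib import Path


def put_file_commands(folders, branch="master"):
    """Build one recursive `pachctl put file` command per folder, as argument lists."""
    return [
        ["pachctl", "put", "file", "-r", f"{folder}@{branch}", "-f", f"{folder}/"]
        for folder in folders
    ]


def missing_folders(folders, root="."):
    """Return the folders that do not exist under root, so the upload can be skipped."""
    return [folder for folder in folders if not (Path(root) / folder).is_dir()]


folders = ["sample_data", "models"]
commands = put_file_commands(folders)

for argv in commands:
    print(shlex.join(argv))

missing = missing_folders(folders)
if missing:
    print("Not found in the current directory:", ", ".join(missing))
```

Run it from the tutorial directory so that the relative folder names resolve correctly.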

User Code Assets #

The Docker image used in this tutorial was built with the following assets:

Assets: