Data Parallelism Pipeline

Learn how to build a scalable inference pipeline using data parallelism.

April 4, 2024

In this tutorial, we’ll build a scalable inference pipeline for breast cancer detection using data parallelism.

Before You Start #

Tutorial #

The user code in this tutorial’s Docker image is built on top of the pytorch/pytorch base image, which includes the necessary dependencies. The underlying code and pre-trained breast cancer detection model come from this repo, developed by the Center for Data Science and Department of Radiology at NYU. Their original paper can be found here.

1. Create a Project & Input Repos #

  1. Create a project named data-parallelism-tutorial.

    pachctl create project data-parallelism-tutorial
  2. Set the project as current.

    pachctl config update context --project data-parallelism-tutorial
  3. Create the following repos:

    pachctl create repo models
    pachctl create repo sample_data

2. Create a Classification Pipeline #

First, we need to build a pipeline that classifies the breast cancer images. We’ll use a cross input to combine the sample data and models.

  1. Create a file named bc_classification.json with the following contents:

    Resource:
  2. Save the file.

  3. Create the pipeline.

    pachctl create pipeline -f /path/to/bc_classification.json
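
The contents of bc_classification.json are not reproduced above. As a rough illustration only, the Python sketch below builds a pipeline spec with a cross input over the sample_data and models repos and prints it as JSON. The pipeline name, glob patterns, image, and command are all assumptions made for this sketch, not the tutorial's actual values; substitute the real ones before creating a pipeline from it.

```python
import json


def build_spec():
    """Return an illustrative pipeline spec that crosses two input repos.

    Everything marked "assumed" or "placeholder" is hypothetical and must be
    replaced with the tutorial's real values.
    """
    return {
        "pipeline": {"name": "bc_classification"},  # assumed name
        "description": "Illustrative cross-input classification pipeline.",
        "input": {
            "cross": [
                # Assumed glob: one datum per top-level exam directory.
                {"pfs": {"repo": "sample_data", "glob": "/*"}},
                # Assumed glob: the whole models repo as a single datum,
                # so it is paired with every exam.
                {"pfs": {"repo": "models", "glob": "/"}},
            ]
        },
        "transform": {
            "image": "example/bc-classification:placeholder",  # placeholder
            "cmd": ["/bin/sh", "-c", "echo replace-with-real-entrypoint"],
        },
    }


if __name__ == "__main__":
    # Print the spec so it can be reviewed before saving it to a file.
    print(json.dumps(build_spec(), indent=2))
```

With a cross input, every datum from the first input is combined with every datum from the second, which is why the models repo is sketched here as a single datum.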

Datum Shape #

When you define a glob pattern in your pipeline, you tell HPE ML Data Management how to split the input data into datums, so that your code can run as parallel jobs without changes to the underlying implementation.

In this case, we treat each exam (four images and a list file) as a single datum. Each datum is processed individually, which allows each newly added exam to be computed in parallel. The file structure for our sample_data is organized as follows:

sample_data/
├── <unique_exam_id>
│   ├── L_CC.png
│   ├── L_MLO.png
│   ├── R_CC.png
│   ├── R_MLO.png
│   └── gen_exam_list_before_cropping.pkl
├── <unique_exam_id>
│   ├── L_CC.png
│   ├── L_MLO.png
│   ├── R_CC.png
│   ├── R_MLO.png
│   └── gen_exam_list_before_cropping.pkl
...

The gen_exam_list_before_cropping.pkl file is a pickled version of the image list, which the underlying library requires.
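
To make the per-exam datum idea concrete, here is a minimal Python sketch that groups file paths by their top-level directory, which is roughly the effect a glob pattern such as /* would have on the layout above. It only illustrates the grouping and is not how HPE ML Data Management computes datums; the /* pattern and the exam IDs exam_a and exam_b are assumptions made for this example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_into_datums(paths):
    """Group repo file paths by top-level directory (one group per exam)."""
    datums = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        # For absolute paths parts[0] is "/", so the exam ID is parts[1].
        exam_id = parts[1] if parts[0] == "/" else parts[0]
        datums[exam_id].append(parts[-1])
    return {exam: sorted(files) for exam, files in datums.items()}


# Hypothetical exam IDs; the file names follow the layout shown above.
EXAM_FILES = [
    "L_CC.png",
    "L_MLO.png",
    "R_CC.png",
    "R_MLO.png",
    "gen_exam_list_before_cropping.pkl",
]
SAMPLE_PATHS = [f"/{exam}/{name}" for exam in ("exam_a", "exam_b") for name in EXAM_FILES]

if __name__ == "__main__":
    for exam, files in group_into_datums(SAMPLE_PATHS).items():
        print(exam, files)
```

Each resulting group corresponds to one unit of work, so adding a third exam directory would add a third group without touching the existing two.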

3. Upload Dataset #

  1. Open or download this GitHub repo.

    gh repo clone pachyderm/docs-content
  2. Navigate to this tutorial.

    cd content/2.7.x/build-dags/tutorials/data-parallelism
  3. Upload the sample_data and models folders to your repos.

    pachctl put file -r sample_data@master -f sample_data/
    pachctl put file -r models@master -f models/
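
Before uploading, it can help to confirm that every exam directory is complete, since each exam is processed as one datum. The sketch below is an optional local check, not part of the tutorial's tooling: it assumes the sample_data layout described in the Datum Shape section, and the function name find_incomplete_exams is hypothetical.

```python
from pathlib import Path

# File names taken from the sample_data layout described in this tutorial.
EXPECTED_FILES = {
    "L_CC.png",
    "L_MLO.png",
    "R_CC.png",
    "R_MLO.png",
    "gen_exam_list_before_cropping.pkl",
}


def find_incomplete_exams(root):
    """Return {exam_dir_name: sorted missing file names} for incomplete exams."""
    problems = {}
    for exam_dir in sorted(Path(root).iterdir()):
        if not exam_dir.is_dir():
            continue  # skip stray files at the top level
        present = {f.name for f in exam_dir.iterdir() if f.is_file()}
        missing = EXPECTED_FILES - present
        if missing:
            problems[exam_dir.name] = sorted(missing)
    return problems


if __name__ == "__main__":
    # Run from the tutorial directory that contains sample_data/.
    issues = find_incomplete_exams("sample_data")
    print("All exams complete." if not issues else issues)
```

An empty result means every exam directory contains the four images and the pickled list file.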

User Code Assets #

The Docker image used in this tutorial was built with the following assets:

Assets: