Beginner Tutorial

Learn how to quickly ingest photos, trace their outlines, and output a collage using the transformed data.

April 4, 2024

Before You Start #

Context #

How HPE ML Data Management Works #

HPE ML Data Management is deployed within a Kubernetes cluster to manage and version your data using projects, input repositories, pipelines, datums and output repositories. A project can house many repositories and pipelines, and when a pipeline runs a data transformation job it chunks your inputs into datums for processing.

The number of datums is determined by the glob pattern defined in your pipeline specification; if the shape of your glob pattern encompasses all inputs, it will process one datum; if the shape of your glob pattern encompasses each input individually, it will process one datum per file in the input, and so on.
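As an illustration, the following local mock (plain shell, no cluster required) shows how the datum count follows the glob pattern for a repo holding two top-level files; the directory name is hypothetical:

```shell
# Local mock of an input repo with two top-level files (no cluster needed).
mkdir -p /tmp/glob_demo
touch /tmp/glob_demo/cat.mp4 /tmp/glob_demo/dog.png

# glob "/"  -> the repo root matches once: one datum covering everything.
# glob "/*" -> each top-level entry matches: one datum per file.
datums_for_star=$(ls /tmp/glob_demo | wc -l)
echo "glob \"/\" datums: 1"
echo "glob \"/*\" datums: $datums_for_star"
```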

The end result of your data transformation should always be saved to /pfs/out. The contents of /pfs/out are automatically made accessible from the pipeline’s output repository by the same name. So all files saved to /pfs/out for a pipeline named foo are accessible from the foo output repository.
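To make this concrete, here is a minimal sketch of what user code effectively does, using `/tmp/pfs` to mock the mounts (inside a real job, `/pfs/...` is mounted for you; the file name is hypothetical):

```shell
# Mock the mounts a job for pipeline "foo" would see
# (real jobs get /pfs/..., not /tmp/pfs/...).
mkdir -p /tmp/pfs/raw_videos_and_images /tmp/pfs/out
touch /tmp/pfs/raw_videos_and_images/photo.png

# User code reads from /pfs/<input_repo>/ and writes results to /pfs/out/;
# everything under /pfs/out lands in the "foo" output repo when the job finishes.
cp /tmp/pfs/raw_videos_and_images/photo.png /tmp/pfs/out/photo.png
```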

Pipelines combine to create DAGs, and a DAG can consist of just one pipeline. Don’t worry if this sounds confusing! We’ll walk you through the process step-by-step.

How to Interact with HPE ML Data Management #

You can interact with your HPE ML Data Management cluster using the PachCTL CLI or through Console, a GUI.

  • PachCTL is great for users already experienced with using a CLI.
  • Console is great for beginners and helps with visualizing relationships between projects, repos, and pipelines.

Tutorial: Image & Video Processing with OpenCV #

In this tutorial, we’ll walk you through how to use HPE ML Data Management to process images and videos using OpenCV. OpenCV is a popular open-source computer vision library that can be used to perform image processing and video analysis.

This DAG has six steps, with the goal of ingesting raw photos and video content, drawing edge-detected traces, and outputting a comparison collage of the original and processed images:

  1. Convert videos to MP4 format
  2. Extract frames from videos
  3. Trace the outline of each frame and standalone image
  4. Create .gifs from the traced video frames
  5. Re-shuffle the content so it is organized by “original” and “traced” images
  6. Build a comparison collage using a static HTML page

1. Create a Project #

By default, when you first start up an instance, the default project is attached to your active context. Create a new project and set it as your active PachCTL context so that you don’t have to pass the project name (e.g., --project video-to-frame-traces) with every command.
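With PachCTL, that looks something like the following (a sketch; requires a running cluster, and `pachctl config update context` rewrites your active context in place):

```shell
# Create the project used throughout this tutorial.
pachctl create project video-to-frame-traces

# Point the active context at the new project.
pachctl config update context --project video-to-frame-traces

# Confirm the project exists.
pachctl list project
```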


2. Create an Input Repo #

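With PachCTL, creating the input repo is one command (the repo name must match what the pipeline specs below reference; requires a running cluster):

```shell
# Create the input repo the first pipeline reads from.
pachctl create repo raw_videos_and_images

# Verify it exists.
pachctl list repo
```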

3. Upload Content #

To upload content, specify the repo and branch you’d like to upload to (e.g., a master or staging branch). Console automatically defaults to repo@master; with PachCTL, you’ll need to use the repo@master:filename.ext pattern. By default, your pipeline triggers any time new data is uploaded to the master branch, unless otherwise specified in the pipeline spec at input.pfs.branch or through a branch trigger. For this tutorial, we’re going to stick with the default master branch.

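With PachCTL, an upload looks like this (a sketch; the local file names are hypothetical placeholders, and any image or video works):

```shell
# Upload a standalone image and a video to the master branch.
pachctl put file raw_videos_and_images@master:liberty.png -f ./liberty.png
pachctl put file raw_videos_and_images@master:cat-sleeping.mp4 -f ./cat-sleeping.mp4

# List what landed on master.
pachctl list file raw_videos_and_images@master
```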

4. Create the Video Converter Pipeline #

We want to make sure that our DAG can handle videos in multiple formats, so first we’ll create a pipeline that will:

  • Skip images
  • Skip videos already in the correct format (.mp4)
  • Convert videos to .mp4 format

The converted videos are made available to the next pipeline in the DAG via the video_mp4_converter output repo because the user code saves every converted video to /pfs/out/. This is the standard location for storing output data so that it can be accessed by the next pipeline in the DAG.

  1. Open your IDE terminal.
  2. Create a new folder for your project called video-to-frame-traces.
  3. Copy and paste the following pipeline spec into the terminal to create the file.
cat <<EOF > video_mp4_converter.yaml
pipeline:
  name: video_mp4_converter
input:
  pfs:
    repo: raw_videos_and_images
    glob: "/*"
transform:
  image: lbliii/video_mp4_converter:1.0.14
  cmd:
    - python3
    - /video_mp4_converter.py
    - --input
    - /pfs/raw_videos_and_images/
    - --output
    - /pfs/out/
autoscaling: true
EOF
  4. Create the pipeline by running the following command:
pachctl create pipeline -f video_mp4_converter.yaml 

Every pipeline, at minimum, needs a name, an input, and a transform. The input is the data that the pipeline will process, and the transform is the user code that will process the data. transform.image is the Docker image, available in a container registry (e.g., Docker Hub), that will be used to run the user code. transform.cmd is the command run inside the Docker container; it is the entrypoint for the user code executed against the input data.
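Once a pipeline is created, you can watch its first job run. A few commonly used PachCTL commands (a sketch; requires a running cluster):

```shell
# Show pipelines and their current state.
pachctl list pipeline

# Show jobs, newest first; a video_mp4_converter job appears once data is committed.
pachctl list job

# Stream the user-code logs for a specific pipeline.
pachctl logs --pipeline=video_mp4_converter
```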

5. Create the Image Flattener Pipeline #

Next, we’ll create a pipeline that will flatten the videos into individual .png image frames. Like the previous pipeline, the user code outputs the frames to /pfs/out so that the next pipeline in the DAG can access them in the image_flattener repo.

cat <<EOF > image_flattener.yaml
pipeline:
  name: image_flattener
input:
  pfs:
    repo: video_mp4_converter
    glob: "/*"
transform:
  image: lbliii/image_flattener:1.0.0
  cmd:
    - python3
    - /image_flattener.py
    - --input
    - /pfs/video_mp4_converter
    - --output
    - /pfs/out/
autoscaling: true
EOF
pachctl create pipeline -f image_flattener.yaml

6. Create the Image Tracing Pipeline #

Up until this point, we’ve used a simple single input from the Pachyderm file system (input.pfs) and a basic glob pattern (/*) to specify the shape of our datums. This particular pattern treats each top-level file and directory as a single datum. However, in this pipeline, we have some special requirements:

  • We want to process only the raw images from the raw_videos_and_images repo
  • We want to process all of the flattened video frame images from the image_flattener pipeline

To achieve this, we’re going to need to use a union input (input.union) to combine the two inputs into a single input for the pipeline.

  • For the raw_videos_and_images input, we can use a more powerful glob pattern to ensure that only image files are processed (/*.{png,jpg,jpeg})
  • For the image_flattener input, we can use the same glob pattern as before (/*) to ensure that each video’s collection of frames is processed together

Notice how we also update the transform.cmd to accommodate having two inputs.

cat <<EOF > image_tracer.yaml
pipeline:
  name: image_tracer
description: A pipeline that performs image edge detection by using the OpenCV library.
input:
  union:
    - pfs:
        repo: raw_videos_and_images
        glob: "/*.{png,jpg,jpeg}"
    - pfs:
        repo: image_flattener
        glob: "/*"
transform:
  image: lbliii/image_tracer:1.0.8
  cmd:
    - python3
    - /image_tracer.py
    - --input
    - /pfs/raw_videos_and_images
    - /pfs/image_flattener
    - --output
    - /pfs/out/
autoscaling: true
EOF
pachctl create pipeline -f image_tracer.yaml

Since this pipeline traces every standalone image and video frame, it may take a few minutes to complete.

7. Create the Gif Pipeline #

Next, we’ll create a pipeline that will create two gifs:

  1. A gif of the original video’s flattened frames (from the image_flattener output repo)
  2. A gif of the video’s traced frames (from the image_tracer output repo)

To make a gif of both the original video frames and the traced frames, we’re going to again need to use a union input so that we can process the image_flattener and image_tracer output repos.

Notice that the glob pattern has changed; here, we want to treat each directory in an input as a single datum, so we use the glob pattern /*/. This is because we’ve declared in the user code to store the video frames in a directory with the same name as the video file.

cat <<EOF > movie_gifer.yaml
pipeline:
  name: movie_gifer
description: A pipeline that converts frames into a gif using the OpenCV library.
input:
  union:
    - pfs:
        repo: image_flattener
        glob: "/*/"
    - pfs:
        repo: image_tracer
        glob: "/*/"
transform:
  image: lbliii/movie_gifer:1.0.5
  cmd:
    - python3
    - /movie_gifer.py
    - --input
    - /pfs/image_flattener
    - /pfs/image_tracer
    - --output
    - /pfs/out/
autoscaling: true
EOF
pachctl create pipeline -f movie_gifer.yaml

Since this pipeline is converting video frames to gifs, it may take a few minutes to complete.

8. Create the Content Shuffler Pipeline #

We have everything we need to make the comparison collage, but first we need to re-shuffle the content so that the original images and gifs live in one directory (originals) and the traced images and gifs in another (edges). This makes the data easier to process in the collage user code. This common pattern in HPE ML Data Management is referred to as a shuffle pipeline.

cat <<EOF > content_shuffler.yaml
pipeline:
  name: content_shuffler
description: A pipeline that collapses our inputs into one datum for the collager.
input:
  union:
    - pfs:
        repo: movie_gifer
        glob: "/"
    - pfs:
        repo: raw_videos_and_images
        glob: "/*.{png,jpg,jpeg}"
    - pfs:
        repo: image_tracer
        glob: "/*.{png,jpg,jpeg}"

transform:
  image: lbliii/content_shuffler:1.0.0
  cmd:
    - python3
    - /content_shuffler.py
    - --input
    - /pfs/movie_gifer
    - /pfs/raw_videos_and_images
    - /pfs/image_tracer
    - --output
    - /pfs/out/
autoscaling: true
EOF
pachctl create pipeline -f content_shuffler.yaml

9. Create the Content Collager Pipeline #

Finally, we’ll create a pipeline that produces a static html page for viewing the original and traced content side-by-side.

cat <<EOF > content_collager.yaml
pipeline:
  name: content_collager
description: A pipeline that creates a static HTML collage.
input:
  pfs:
    glob: "/"
    repo: content_shuffler


transform:
  image: lbliii/content_collager:1.0.64
  cmd:
    - python3
    - /content_collager.py
    - --input
    - /pfs/content_shuffler
    - --output
    - /pfs/out/
autoscaling: true
EOF
pachctl create pipeline -f content_collager.yaml

Exploring Resources, Data, & Logs #

Congratulations! You’ve successfully created a DAG of pipelines that process video files into a collage. However, we’ve only just scratched the surface of what you can do with HPE ML Data Management. Now that you have a working pipeline, try out some of these commands to explore all of the details associated with the DAG.

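A few starting points for exploring the DAG (a sketch; the names refer to the repos and pipelines created above, and all commands require a running cluster):

```shell
# Inspect a repo or pipeline in detail.
pachctl inspect repo raw_videos_and_images
pachctl inspect pipeline content_collager

# Download the finished collage output to a local directory.
pachctl get file -r content_collager@master:/ -o ./collage/
```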

For a comprehensive list of operations, check out the Build DAGs section of the documentation or browse the Command Library.

Bonus Exercise #

  • How would you update the glob pattern in the video converter pipeline spec (video_mp4_converter.yaml) to process only video files in the raw_videos_and_images repo? Doing so would let you simplify the user code in the process_video_files function and make the pipeline more efficient.
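One possible answer, as a sketch (the extension list is an assumption; extend it to whatever video formats you expect, using the same brace syntax the image_tracer spec already uses):

```yaml
input:
  pfs:
    repo: raw_videos_and_images
    glob: "/*.{mp4,mov,avi}"
```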