Run Commands

Sidecar S3 Gateway

Learn how to use S3-protocol-enabled pipelines and interact with external input/output data.

November 29, 2023

You can interact with input/output data through the S3 protocol using HPE ML Data Management’s S3-protocol-enabled pipelines.

About #

HPE ML Data Management’s S3-protocol-enabled pipelines run a separate S3 gateway instance in a sidecar container within the pipeline-worker pod. This approach maintains data provenance because the external code (for example, code running in a Kubeflow pod) is executed in, and associated with, an HPE ML Data Management job.

When enabled, input and output repositories are exposed as S3 Buckets via the S3 gateway sidecar instance.
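The bucket naming follows the examples on this page: an S3-enabled input is exposed under a bucket named after the input's name field, and the output repository under s3://out. A minimal sketch of that mapping (the helper names below are hypothetical, for illustration only):

```python
# Sketch of the bucket naming shown in this page's examples:
# an S3-enabled input named "labresults" appears as s3://labresults,
# and the pipeline's output repository appears as s3://out.
# The helper names are hypothetical, for illustration only.

def input_bucket_url(input_name: str) -> str:
    """Return the sidecar bucket URL for a named S3-enabled input."""
    return f"s3://{input_name}"

OUTPUT_BUCKET_URL = "s3://out"

print(input_bucket_url("labresults"))  # s3://labresults
print(OUTPUT_BUCKET_URL)               # s3://out
```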

Example with Kubeflow Pod #

The following diagram shows communication between the S3 gateway deployed in a sidecar and the Kubeflow pod.

(Diagram: Kubeflow S3 gateway)

Configure an S3-enabled Pipeline #

  1. Open your pipeline spec.
  2. Add "s3": true to input.pfs.
  3. Add "s3Out": true at the top level of the spec.
  4. Save your spec.
  5. Update your pipeline.
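The steps above can be sketched programmatically. This is a hedged example: it toggles the two flags on a spec held as a Python dict and prints the result (in practice you would load and save your own spec file, then update the pipeline with it):

```python
import json

# Minimal spec for illustration; in practice you would load your own,
# e.g. with json.load(open("pipeline.json")) -- the file name is an assumption.
spec = {
    "pipeline": {"name": "s3_protocol_enabled_pipeline"},
    "input": {"pfs": {"glob": "/", "repo": "labresults", "name": "labresults"}},
    "transform": {"cmd": ["sh"], "stdin": ["..."]},
}

# Step 2: expose the input repo through the S3 gateway sidecar.
spec["input"]["pfs"]["s3"] = True
# Step 3: expose the output repo as s3://out.
spec["s3Out"] = True

print(json.dumps(spec, indent=2))
```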

Example Pipeline Spec #

The following example spec reads files in the input bucket labresults and copies them to the pipeline’s output bucket:

  {
    "pipeline": {
      "name": "s3_protocol_enabled_pipeline"
    },
    "input": {
      "pfs": {
        "glob": "/",
        "repo": "labresults",
        "name": "labresults",
        "s3": true
      }
    },
    "transform": {
      "cmd": [ "sh" ],
      "stdin": [ "set -x && mkdir -p /tmp/result && aws --endpoint-url $S3_ENDPOINT s3 ls && aws --endpoint-url $S3_ENDPOINT s3 cp s3://labresults/ /tmp/result/ --recursive && aws --endpoint-url $S3_ENDPOINT s3 cp /tmp/result/ s3://out --recursive" ],
      "image": "pachyderm/ubuntu-with-s3-clients:v0.0.1"
    },
    "s3Out": true
  }

User Code Requirements #

Your user code is responsible for the tasks described in the following sections.

Accessing the Sidecar #

Use the S3_ENDPOINT environment variable to access the sidecar. No authentication is needed; you can only read from the input bucket and write to the output bucket.

aws --endpoint-url $S3_ENDPOINT s3 cp /tmp/result/ s3://out --recursive
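A minimal sketch of assembling that command from the environment (the helper name is hypothetical; S3_ENDPOINT is provided by the sidecar, and running the command still requires the aws CLI inside the pipeline container):

```python
import os

def aws_cp_command(src: str, dst: str) -> list[str]:
    """Build the aws-cli argv for a recursive copy against the sidecar.

    Hypothetical helper: it only assembles the command shown above;
    it does not execute it.
    """
    endpoint = os.environ["S3_ENDPOINT"]
    return ["aws", "--endpoint-url", endpoint,
            "s3", "cp", src, dst, "--recursive"]

# Placeholder value for illustration; inside a pipeline the sidecar sets this.
os.environ.setdefault("S3_ENDPOINT", "http://sidecar-endpoint")
print(" ".join(aws_cp_command("/tmp/result/", "s3://out")))
```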

Triggering External Pipelines #

If authentication is enabled, you can access the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in your pipeline’s user code to forward your pipeline’s auth credentials to third-party tools such as Spark.
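One way to forward those credentials is sketched below, using the Hadoop S3A configuration keys commonly used to pass credentials to Spark. This is a hypothetical helper; confirm the exact keys for your Spark version and setup:

```python
import os

def spark_s3_credentials() -> dict[str, str]:
    """Collect the pipeline's auth credentials as Spark/Hadoop S3A options.

    Hypothetical sketch: the S3A keys below are commonly used to pass
    credentials to Spark; verify them for your Spark version.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": os.environ["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": os.environ["AWS_SECRET_ACCESS_KEY"],
    }

# Placeholder credentials for illustration only; inside an auth-enabled
# pipeline these variables are set for you.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "placeholder-id")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "placeholder-secret")
print(spark_s3_credentials())
```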

Constraints #