
Input Group PPS

Group files stored in one or multiple repos.

December 4, 2023

Spec

This is a top-level attribute of the pipeline spec.

{
  "pipeline": {...},
  "transform": {...},
  "input": {
    "group": [
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "groupBy": string,
          "lazy": bool,
          "emptyFiles": bool,
          "s3": bool
        }
      },
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "groupBy": string,
          "lazy": bool,
          "emptyFiles": bool,
          "s3": bool
        }
      }
    ]
  },
  ...
}

Attributes

| Attribute | Description |
|-----------|-------------|
| name | The name of the PFS input that appears in the INPUT field when you run the pachctl list pipeline command. If an input name is not specified, it defaults to the name of the repo. |
| repo | The name of the HPE ML Data Management repository that contains the input data. |
| branch | The branch to watch for commits. If left blank, HPE ML Data Management sets this value to master. |
| glob | A wildcard pattern that defines how a dataset is broken up into datums for further processing. In a group input, the glob pattern also defines the capture groups that HPE ML Data Management uses to group the files. |
| groupBy | The grouping key, expressed in terms of the capture groups defined in the glob pattern (for example, $1 refers to the first capture group). Files whose captured values are identical are placed in the same datum. |
| lazy | Controls how the data is exposed to jobs. The default is false, which means the job eagerly downloads the data it needs to process and exposes it as normal files on disk. If lazy is set to true, data is exposed as named pipes instead, and no data is downloaded until the job opens the pipe and reads it. If the pipe is never opened, no data is downloaded. |
| emptyFiles | Controls how files are exposed to jobs. If set to true, files from this PFS input are presented as empty files. This is useful in shuffle pipelines where you want to read the names of files and reorganize them by using symlinks. |
| s3 | Indicates whether the input data is stored in an S3 object store. |
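
As an illustration of how glob and groupBy interact, consider a hypothetical repo named readings that contains files such as /NYC-2023.txt, /SEA-2023.txt, and /NYC-2024.txt. A minimal sketch of one group entry (the repo and file names are assumptions for the example):

{
  "pfs": {
    "repo": "readings",
    "glob": "/(*)-(*).txt",
    "groupBy": "$2"
  }
}

The glob defines two capture groups; groupBy refers to the second one (the year), so /NYC-2023.txt and /SEA-2023.txt are grouped into one datum and /NYC-2024.txt into another.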

Behavior

The group input in an HPE ML Data Management Pipeline Spec allows you to group input files by a specific pattern.

To use the group input, you specify one or more PFS inputs, each with a groupBy parameter. This parameter references the capture groups defined in the input's glob pattern and determines which files belong to the same group. The resulting groups are then passed to your pipeline as a series of grouped datums, where each datum is a single group of files.

You can specify multiple PFS inputs under group in an HPE ML Data Management Pipeline Spec, each with its own groupBy parameter. This allows you to group files from multiple repos by a shared key and pass each group to your pipeline as a separate datum, as in the sketch below.
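
A minimal sketch of a complete spec that groups files across two hypothetical repos, sales and returns, by a region key captured from each file name (the pipeline name, repo names, image, and command are all assumptions for the example):

{
  "pipeline": {
    "name": "group-by-region"
  },
  "transform": {
    "cmd": ["bash", "-c", "ls -R /pfs/"],
    "image": "ubuntu:20.04"
  },
  "input": {
    "group": [
      {
        "pfs": {
          "repo": "sales",
          "glob": "/(*)-*.csv",
          "groupBy": "$1"
        }
      },
      {
        "pfs": {
          "repo": "returns",
          "glob": "/(*)-*.csv",
          "groupBy": "$1"
        }
      }
    ]
  }
}

With this configuration, each datum contains every sales and returns file whose captured region key matches, so your pipeline code can process each region's files together.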

Both the glob and groupBy parameters must be configured for every input in the group.

When to Use

Consider using the group input in an HPE ML Data Management Pipeline Spec when you have large datasets with many files that you want to partition or group by a specific field or pattern. This is useful when you need to perform complex analysis on a large dataset, or when you need to group files by some attribute or characteristic to facilitate further processing.
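
Before creating such a pipeline, it can help to preview how the files will be grouped. A hedged sketch, assuming the spec above is saved as pipeline.json and your pachctl client supports datum dry runs:

pachctl list datum -f pipeline.json

Each row of the output corresponds to one datum (one group of files), which makes it easy to verify the groupBy expression before any jobs run.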

Example scenarios: