Run Commands
Read the PPS series >

Input Group PPS

Group files stored in one or multiple repos.

April 4, 2024

Spec #

This is a top-level attribute of the pipeline spec.

{
  "pipeline": {...},
  "transform": {...},
  "input": {
    "group": [
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "group_by": string,
          "lazy": bool,
          "empty_files": bool,
          "s3": bool
        }
      },
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "group_by": string,
          "lazy": bool,
          "empty_files": bool,
          "s3": bool
        }
      }
    ]
  },
  ...
}

Attributes #

AttributeDescription
nameThe name of the PFS input that appears in the INPUT field when you run the pachctl list pipeline command. If an input name is not specified, it defaults to the name of the repo.
repoSpecifies the name of the HPE ML Data Management repository that contains the input data.
branchThe branch to watch for commits. If left blank, HPE ML Data Management sets this value to master.
globA wildcard pattern that defines how a dataset is broken up into datums for further processing. When you use a glob pattern in a group input, it creates a naming convention that HPE ML Data Management uses to group the files.
group_byA parameter that is used to group input files by a specific pattern.
lazyControls how the data is exposed to jobs. The default is false which means the job eagerly downloads the data it needs to process and exposes it as normal files on disk. If lazy is set to true, data is exposed as named pipes instead, and no data is downloaded until the job opens the pipe and reads it. If the pipe is never opened, then no data is downloaded.
empty_filesControls how files are exposed to jobs. If set to true, it causes files from this PFS to be presented as empty files. This is useful in shuffle pipelines where you want to read the names of files and reorganize them by using symlinks.
s3Indicates whether the input data is stored in an S3 object store.

Behavior #

The group input in a HPE ML Data Management Pipeline Spec allows you to group input files by a specific pattern.

To use the group input, you specify one or more PFS inputs with a group_by parameter. This parameter specifies a pattern or field to use for grouping the input files. The resulting groups are then passed to your pipeline as a series of grouped datums, where each datum is a single group of files.

You can specify multiple group input fields in a HPE ML Data Management Pipeline Spec, each with their own group_by parameter. This allows you to group files by multiple fields or patterns, and pass each group to your pipeline as a separate datum.

The glob and group_by parameters must be configured.

When to Use #

You should consider using the group input in a HPE ML Data Management Pipeline Spec when you have large datasets with multiple files that you want to partition or group by a specific field or pattern. This can be useful in a variety of scenarios, such as when you need to perform complex data analysis on a large dataset, or when you need to group files by some attribute or characteristic in order to facilitate further processing.

Example scenarios:

  • Partitioning data by time: If you have a large dataset that spans a long period of time, you might want to partition it by day, week, or month in order to perform time-based analysis or modeling. In this case, you could use the group input field to group files by date or time, and then process each group separately.

  • Grouping data by user or account: If you have a dataset that includes data from multiple users or accounts, you might want to group the data by user or account in order to perform user-based analysis or modeling. In this case, you could use the group input field to group files by user or account, and then process each group separately.

  • Partitioning data by geography: If you have a dataset that includes data from multiple geographic regions, you might want to partition it by region in order to perform location-based analysis or modeling. In this case, you could use the group input field to group files by region, and then process each group separately.