Datum Set Spec PPS

Define how a pipeline should group its datums.

December 4, 2023

Spec

This is a top-level attribute of the pipeline spec.

    "pipeline": {...},
    "transform": {...},
    "datumSetSpec": {
        "number": 0,
        "sizeBytes": 0,
        "perWorker": 0,

Attributes

number: The desired number of datums in each datum set. If specified, each datum set will contain the specified number of datums. If the total number of input datums is not evenly divisible by the number of datums per set, the last datum set may contain fewer datums than the others.
sizeBytes: The desired target size of each datum set in bytes. If specified, HPE ML Data Management will attempt to create datum sets with the specified size, though the actual size may vary due to the size of the input files.
perWorker: The desired number of datum sets that each worker should process at a time. This field is similar to number, but specifies the number of sets per worker instead of the number of datums per set.
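
To make the grouping rules for number and sizeBytes concrete, here is a minimal Python sketch. It is an illustration only, not HPE ML Data Management's implementation: the function names group_by_number and group_by_size, the greedy size-based strategy, and the sample values are all assumptions made for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_by_number(datums: Sequence[T], number: int) -> List[List[T]]:
    """Split datums into consecutive sets of `number` datums each.

    Hypothetical helper: the last set is smaller when the total
    is not evenly divisible by `number`.
    """
    if number <= 0:
        raise ValueError("this sketch requires a positive number")
    return [list(datums[i:i + number]) for i in range(0, len(datums), number)]


def group_by_size(sizes: Sequence[int], size_bytes: int) -> List[List[int]]:
    """Greedily close a set once its total size reaches `size_bytes`.

    Hypothetical helper: because datums are not split, a set can
    overshoot or undershoot the target, which is one way the actual
    size may differ from the requested size.
    """
    if size_bytes <= 0:
        raise ValueError("this sketch requires a positive size_bytes")
    sets: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes:
        current.append(size)
        current_size += size
        if current_size >= size_bytes:
            sets.append(current)
            current, current_size = [], 0
    if current:  # leftover datums form a final, smaller set
        sets.append(current)
    return sets


# 10 datums grouped 4 at a time: the last set holds the remainder.
sets_by_number = group_by_number([f"datum-{i}" for i in range(10)], 4)
print([len(s) for s in sets_by_number])  # [4, 4, 2]

# Datum sizes in bytes grouped toward a 1000-byte target.
sets_by_size = group_by_size([300, 300, 500, 100, 900, 50], 1000)
print([sum(s) for s in sets_by_size])  # [1100, 1000, 50]
```

In the size-based example, the first set overshoots the target and the last one falls well short of it, which mirrors the caveat that the actual size depends on the input files.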

Behavior

The datumSetSpec attribute in an HPE ML Data Management pipeline spec controls how the input data is partitioned into individual datum sets for processing. Datum sets are the unit of work that workers claim, so each worker claims one or more datums at a time. Once a worker finishes processing, it commits the full datum set.

When to Use

Consider using the datumSetSpec attribute in your HPE ML Data Management pipeline when you are experiencing stragglers: situations where most of the workers are idle but a few are still processing jobs. This happens when the work is not divided evenly, so that some workers are overloaded while others have nothing left to do.
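
The straggler effect can be illustrated with a toy simulation. The sketch below assumes a simplified model in which an idle worker immediately claims the next datum set, and it compares the total run time for large and small sets. The functions makespan and chunk_costs, the per-datum costs, and the scheduling model itself are hypothetical and are not taken from HPE ML Data Management.

```python
import heapq
from typing import List


def chunk_costs(datum_costs: List[int], number: int) -> List[int]:
    """Total processing cost of each datum set when grouping `number` datums per set."""
    return [sum(datum_costs[i:i + number]) for i in range(0, len(datum_costs), number)]


def makespan(set_costs: List[int], workers: int) -> int:
    """Time at which the last worker finishes.

    Toy model (an assumption): whichever worker becomes free first
    claims the next datum set in order.
    """
    free_at = [0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    for cost in set_costs:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# 16 datums: 12 cheap ones (cost 1) followed by 4 expensive ones (cost 5).
datum_costs = [1] * 12 + [5] * 4

# Two large sets: only 2 of the 4 workers get work, and one of them straggles.
coarse = makespan(chunk_costs(datum_costs, 8), workers=4)

# Eight small sets: the expensive datums are spread across more claims.
fine = makespan(chunk_costs(datum_costs, 2), workers=4)

print(coarse, fine)  # 24 12
```

In this model the smaller sets halve the run time because no single worker ends up holding all the expensive datums. Real pipelines also pay a per-set overhead that this sketch ignores, so the best setting depends on the workload.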