Skip to content

Latest commit

 

History

History
383 lines (344 loc) · 14.4 KB

job-specification.md

File metadata and controls

383 lines (344 loc) · 14.4 KB

Ingestion Job Spec

The ingestion job spec is used while generating, running, and pushing segments from the input files.

The Job spec can be in either YAML or JSON format (0.5.0 onwards). Property names remain the same in both formats.

To use the JSON format, add the propertyjob-spec-format=jsonin the properties file while launching the ingestion job. The properties file can be passed as follows

pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job_spec.json -propertyFile /path/to/job.properties

The following configurations are supported by Pinot

Top Level Spec

Property Description
executionFrameworkSpec Contains config related to the executor to use to ingest data. See Execution Framework Spec
jobType

Type of job to execute. The following types are supported

  • SegmentCreation -
  • SegmentTarPush
  • SegmentUriPush
  • SegmentMetadataPush
  • SegmentCreationAndTarPush
  • SegmentCreationAndUriPush
  • SegmentCreationAndMetadataPush
inputDirURI Absolute Path along with scheme of the directory containing all the files to be ingested, e.g. s3://bucket/path/to/input, /path/to/local/input
includeFileNamePattern

Include file name pattern, supported glob and regex patterns. E.g.

'glob:*.avro'or 'regex:^.*\.(avro)$' will include all avro files just under the inputDirURI, not sub directories

'glob:**/*.avro' will include all the avro files under inputDirURI recursively.

excludeFileNamePattern Exclude file name pattern, supported glob pattern. Similar usage as includeFilePatternName
outputDirURI Absolute Path along with scheme of the directory where to output all the segments.
overwriteOutput Set to true to overwrite segments if already present in the output directory. Or set tofalseto raise exceptions.
pinotFSSpecs List of all the filesystems to be used for ingestions. You can mention multiple values in case input and output directories are present in different filesystems. For more details, scroll down to Pinot FS Spec.
tableSpec Defines table name and where to fetch corresponding table config and table schema. For more details, scroll down to Table Spec.
recordReaderSpec Parser to use to read and decode input data. For more details, scroll down to Record Reader Spec.
segmentNameGeneratorSpec Defines how the names of the segments will be. For more details, scroll down to Segment Name Generator Spec.
pinotClusterSpecs Defines the Pinot Cluster Access Point. For more details, scroll down to Pinot Cluster Spec.
pushJobSpec Defines segment push job-related configuration. For more details, scroll down to Push Job Spec.

Example

executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: hdfs://examples/batch/airlineStats/staging
jobType: SegmentCreationAndTarPush
inputDirURI: 'examples/batch/airlineStats/rawdata'
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'hdfs:///examples/batch/airlineStats/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS 
recordReaderSpec:
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

Execution Framework Spec

The configs specify the execution framework to use to ingest data. Check out Batch Ingestion for configs related to all the supported frameworks

Property Description
name name of the execution framework. can be one of spark,hadoop or standalone
segmentGenerationJobRunnerClassName The class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface to run the segment generation job
segmentTarPushJobRunnerClassName The class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface to push the segment TAR file
segmentUriPushJobRunnerClassName The class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface to send segment URI
segmentMetadataPushJobRunnerClassName The class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentMetadataPushJobRunner interface to send segment Metadata
extraConfigs Key-value pairs of configs related to the framework of the executions

Example

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'

Pinot FS Spec

field description
schema used to identify a PinotFS. E.g. local, hdfs, dbfs, etc
className

Class name used to create the PinotFS instance. E.g.

org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem

org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS

configs configs used to init PinotFS instance

Table Spec

Table spec is used to specify the table in which data should be populated along with schema.

Property Description
tableName name of the table in which to populate the data
schemaURI location from which to read the schema for the table. Supports both File systems as well as HTTP URI
tableConfigURI location from which to read the config for the table. Supports both File systems as well as HTTP URI

Example

tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
​

Record Reader Spec

field description
dataFormat Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
className

Corresponding RecordReader class name. E.g.

org.apache.pinot.plugin.inputformat.avro.AvroRecordReader

org.apache.pinot.plugin.inputformat.csv.CSVRecordReader

org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader

org.apache.pinot.plugin.inputformat.json.JSONRecordReader

org.apache.pinot.plugin.inputformat.orc.ORCRecordReader

org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader

configClassName

Corresponding RecordReaderConfig class name, it's mandatory for CSV and Thrift file format. E.g.

org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig

org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig

configs Used to init RecordReaderConfig class name, this config is required for CSV and Thrift data format.

Segment Name Generator Spec

Property Description
type

the type of name generator to use. Following values are supported -

  • simple - this is the default spec.
  • normalizedDate - use this type when the time column in your data is in the String format instead of epoch time.
  • fixed - configure the segment name by the user.
configs Configs to init SegmentNameGenerator
segment.name For fixed SegmentNameGenerator. Explicitly set the segment name.
segment.name.postfix For simple SegmentNameGenerator.
Postfix will be appended to all the segment names.
segment.name.prefix For normalizedDate SegmentNameGenerator.
The Prefix will be prepended to all the segment names.
exclude.sequence.id

Whether to include sequence ids in segment name.

Needed when there are multiple segments for the same time range.

use.global.directory.sequence.id

Assign sequence ids to input files based on all input files under the directory.

Set to false to use local directory sequence id. This is useful when generating multiple segments for multiple days. In that case, each of the days will start from sequence id 0.

Example

segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true

Pinot Cluster Spec

Property Description
controllerURI URI to use to fetch table/schema information and push data

Example

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Push Job Spec

Property Description
pushAttempts Number of attempts for push job. Default is 1, which means no retry
pushParallelism Workers to use for push job. Default is 1
pushRetryIntervalMillis Time in milliseconds to wait for between retry attempts Default is 1 second.
segmentUriPrefix append this string before the path of the push destination. Generally, it is the scheme of the filesystem e.g. s3:// , file:// etc.
segmentUriSuffix append this string after the path of the push destination.

Example

pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
  segmentUriPrefix : 'file://'
  segmentUriSuffix : my-dir/