This is an intro level workshop and is designed to get you started with Apache Pinot, familiarize you with its components, and help you run Pinot locally. For this workshop, we will focus on running Pinot locally using a pre-built Docker image and ingesting some batch data.
In this workshop, you will learn:
- The basics of Apache Pinot
- What are the essential components
- How to create tables and schemas
- How to injest batch data
- How to use the Pinot UI
In order to run this workshop, you will need the following prerequisites:
- Docker Desktop We will be using Docker to run Pinot locally. If you need to install it, you can go download Docker Destktop here and follow the intructions to install it.
- Resources
Pinot is not designed as a desktop solution. Running it locally needs a minimum of the following resources:
- 8gb Memory
- 10 gb disk space
In this workshop, we will be completing three challenges. The challenges build upon each other, so each one needs to be completed for the next one to work.
If you do not have Docker Desktop installed, complete the prerequisite. If you do have Docker Desktop intalled, make sure it's running before moving on to the next steps.
Use the following command from a command line to pull the Docker image:
docker pull apachepinot/pinot:1.1.0
Verify that this step worked by checking, using the command:
docker images
You should see the Apache Pinot image.
To run the Docker container, run the following command in the command window:
docker run -it --entrypoint /bin/bash -p 9000:9000 apachepinot/pinot:1.1.0
We are overriding the default entry point of this Docker image with the bash shell. This is to have more control over how the Pinot components are started.
We map the port 9000 of the Docker container to that of the local system, so we can access it via localhost. Port 9000 is used to run the Pinot UI by default.
At this point, you should be running the bash shell in the docker container. Take a moment to look at the contents of the container by listing the contents and looking around.
In this section, we will start the Pinot cluster. For our purposes, we will run all four components that are essential to run a Pinot cluster in one docker container. We are only doing this for learning purposes; usually each component would run in its own container.
A typical Pinot cluster consists of the following components:
Zookeeper (not strictly a Pinot component, but Pinot depends on it)
To start Zookeeper, run the following command:
bin/pinot-admin.sh StartZookeeper &
Zookeeper is the strongly consistent distributed state store for maintaining the global metadata (e.g. configs and schemas) of the system.
Pinot Controller
To start the Pinot controller, run the following command:
bin/pinot-admin.sh StartController &
The Pinot controller is responsible for mapping which servers are responsible for which segments, maintaining admin endpoints, and other management activities for the Pinot cluster.
Pinot Broker
To start Pinot broker, run the following command:
bin/pinot-admin.sh StartBroker &
The Pinot Broker is responsible for routing queries to the right servers and consolidating the results.
Pinot Server
To start Pinot server, run the following command:
bin/pinot-admin.sh StartServer &
Servers host the data segments and serve queries off the data they host.
To verify that the cluster is running, in a browser, navigate to http://localhost:9000
You should see the Pinot UI. We will discuss the UI further in the upcoming sections.
Ingesting data from a filesystem involves the following steps:
- Define a schema
- Define a table config
- Upload the schema and table configs
- Upload the data
For our workshop, we will be using the the configs and data that are part of the docker container we downloaded.
Navigate to the examples/batch folder:
cd examples/batch
You will notice that there are several folders here. Let's navigate to the githubEvents folder:
cd githubEvents
You should see the following files:
- githubEvents_offline_table_config.json
- ingestionJobSpec.yaml
- spartIngestionJobSpec.yaml
- githubEvents_schema.json
- rawdata (folder)
Use the following command to create the table and schema:
/opt/pinot/bin/pinot-admin.sh AddTable -schemaFile /opt/pinot/examples/batch/githubEvents/githubEvents_schema.json -tableConfigFile /opt/pinot/examples/batch/githubEvents/githubEvents_offline_table_config.json -exec
At this point, you should be able to verify that the Table and schema were created by navigating to http://localhost:9000 and selecting tables.
Now that we have the table created, let's add some data to the table. From the folder /opt/pinot, run the following command:
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /opt/pinot/examples/batch/githubEvents/ingestionJobSpec.yaml
You can verify that the table is populated by navigating to the Pinot UI at http://localhost:9000 and selecting Tables
in the left-hand nav. You should see that the table size is greater than 0 MB.
We have already looked at the UI, but in this section we will explore the Pinot UI.
Navigate to the URL http://localhost:9000. On the home page, you can see the number of controllers, brokers, servers and tables. When you scroll down, you can see the IP address and ports for each of the Pinot cluster components.
Note that minions are not essential for running the Pinot cluster. Note also that a tenant is created by default.
To navigate to the Queries section, select the second option titled 'Query Console' from the left hand side menu. This should show you the tables creates as wellas a query window where you can typ some queries.
You can click on the table name, and it will populate the SQL editor with a defaul query and execute it.
Run the following queries to explore the data:
select type, count(type) from githubEvents group by type
select actor, count(actor) from githubEvents group by actor order by count(actor) desc
Note the timeUseMS and numDocsScanned.
Select the menu item from the left hand side menu titled SWAGGER REST APIs. This should open a new browser tab with SWAGGER end points. Let's check the health of the cluster:
- Select the Cluster 'Get/pinot-controller/admin' option
- select the Try It Out button
- You should see that the cluster is running
GOOD
.
Since most of the work was done in a Docker container, the tear down is easy. You can either stop the Docker container, or delete it.
- Start a new command window
- Get the name of the container by running 'docker ps'
- Stop the docker container by running 'docker stop'
- Remove (optionally) the container by running 'docker rm'
docker ps
docker stop <name>
docker rm <name>