Hands-on Kite Lab 1: Using the Kite CLI (Solution)

Download the Kite dataset tool:

wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.1/kite-tools-0.17.1-binary.jar -O kite-dataset
chmod +x kite-dataset

Download the sample data:

wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz

Infer the schema for movie records from the sample data (u.item is pipe-delimited, hence the --delimiter option):

./kite-dataset csv-schema ml-100k/u.item --delimiter '|' --class org.kitesdk.examples.data.Movie -o movie.avsc
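
The generated movie.avsc is a plain Avro record schema. A minimal sketch of its shape (the field names and count shown here are illustrative, since they depend on what csv-schema infers from u.item; the record name and namespace come from the --class option):

{
  "type" : "record",
  "name" : "Movie",
  "namespace" : "org.kitesdk.examples.data",
  "fields" : [
    { "name" : "id", "type" : [ "null", "long" ] },
    { "name" : "title", "type" : [ "null", "string" ] }
  ]
}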

Infer the schema for rating records from the sample data:

./kite-dataset csv-schema ml-100k/u.data --class org.kitesdk.examples.data.Rating -o rating.avsc
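
The inferred rating schema has the same shape, with each numeric column typed as a nullable long, e.g. (a sketch; the timestamp field name matters below because we partition on it):

{ "name" : "timestamp", "type" : [ "null", "long" ] }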

Since we'll be partitioning on the timestamp field later, and partition source fields can't be nullable, we need to update the automatically generated schema to make the fields non-null:

sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
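
This rewrites every long field in place, so the timestamp entry above becomes:

{ "name" : "timestamp", "type" : "long" }

The -iorig flag tells GNU sed to edit the file in place, keeping a backup copy with an orig suffix.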

Now we can create the movies dataset:

./kite-dataset create movies -s movie.avsc
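
To sanity-check the new dataset, the CLI should be able to print its schema back (assuming the schema command is available in this release):

./kite-dataset schema movies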

We can also import the sample data into our new dataset:

./kite-dataset csv-import --delimiter '|' ml-100k/u.item movies

We want to partition the rating data by time, so let's create a partition configuration:

./kite-dataset partition-config timestamp:year timestamp:month timestamp:day -s rating.avsc -o rating-part.json
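
The output is Kite's JSON partition-strategy format; it should look roughly like this (a sketch based on the three partitioners requested above):

[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
}, {
  "name" : "day",
  "source" : "timestamp",
  "type" : "day"
} ]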

Now we can create the ratings dataset from the schema and partition configuration:

./kite-dataset create ratings -s rating.avsc -p rating-part.json

Since we want to write to a partitioned dataset, it's useful to stage the raw data in HDFS so we can launch a MapReduce job to partition the data and load it into the dataset:

hdfs dfs -put ml-100k/u.data
./kite-dataset csv-import hdfs:/user/cloudera/u.data ratings
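
Once the import finishes, the dataset's storage directory contains Hive-style partition directories, one level per partitioner, roughly like this (a sketch: the actual location depends on the default repository, and the dates come from the data):

ratings/
  year=1997/
    month=9/
      day=20/
        (Avro data files)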

Let's take a look at some of the movies data:

./kite-dataset show movies

We can do the same thing with the ratings data:

./kite-dataset show ratings
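
Note that show prints only a sample of the records, not the whole dataset. The default count and any flags for controlling it vary by release; ./kite-dataset help show lists what this version supports.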