Hands-on Kite Lab 1: Using the Kite CLI (Solution)
Download the Kite dataset tool:
wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.1/kite-tools-0.17.1-binary.jar -O kite-dataset
chmod +x kite-dataset
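If the download worked, the tool will print its usage and list its available subcommands, which is a quick sanity check before continuing:
./kite-dataset help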
Download the sample data:
wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz
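It can help to glance at the first couple of lines of each input file before inferring schemas; u.item is pipe-delimited (which is why the --delimiter '|' flag appears below), while u.data uses the default delimiter:
head -n 2 ml-100k/u.item
head -n 2 ml-100k/u.data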
Infer the schema for movie records from the sample data:
./kite-dataset csv-schema ml-100k/u.item --delimiter '|' --class org.kitesdk.examples.data.Movie -o movie.avsc
Infer the schema for rating records from the sample data:
./kite-dataset csv-schema ml-100k/u.data --class org.kitesdk.examples.data.Rating -o rating.avsc
Since we'll be partitioning on the timestamp column later, and Kite can't partition on a nullable field, we need to update our automatically generated schema to change the fields to be non-null:
sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
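To see what that edit does, compare a nullable long field such as timestamp before and after; roughly, the generated union type
{ "name" : "timestamp", "type" : [ "null", "long" ] }
becomes a plain required long:
{ "name" : "timestamp", "type" : "long" }
(The exact formatting of the generated schema may differ slightly.)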
Now we can create the movies dataset:
./kite-dataset create movies -s movie.avsc
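To confirm the dataset was created with the expected schema, you can ask the CLI to print it back:
./kite-dataset schema movies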
We can also import the sample data into our new dataset:
./kite-dataset csv-import --delimiter '|' ml-100k/u.item movies
We want to partition the rating data, so let's create a partition configuration:
./kite-dataset partition-config timestamp:year timestamp:month timestamp:day -s rating.avsc -o rating-part.json
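rating-part.json is a small JSON partition strategy; it should look something like the following (exact formatting may vary), with each entry naming a partition field, its source column, and the partitioning function:
[ {
  "name" : "year", "source" : "timestamp", "type" : "year"
}, {
  "name" : "month", "source" : "timestamp", "type" : "month"
}, {
  "name" : "day", "source" : "timestamp", "type" : "day"
} ]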
Now we can create the ratings dataset from the schema and partition configuration:
./kite-dataset create ratings -s rating.avsc -p rating-part.json
Since we want to write to a partitioned dataset, it's useful to stage the raw data in HDFS so we can launch a MapReduce job to partition the data and load it into the dataset:
hdfs dfs -put ml-100k/u.data
./kite-dataset csv-import hdfs:/user/cloudera/u.data ratings
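Once the MapReduce job finishes, the ratings data should be laid out under year/month/day partition directories. Assuming the dataset lives in the CLI's default Hive warehouse location, you can check with something like:
hdfs dfs -ls -R /user/hive/warehouse/ratings | head -n 20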
Let's take a look at some of the movies data:
./kite-dataset show movies
We can do the same thing with the ratings data:
./kite-dataset show ratings
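When you're done, or if you want to re-run the lab from scratch, both datasets can be dropped with the delete subcommand:
./kite-dataset delete ratings
./kite-dataset delete movies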