Let's profile a dataset

Javier Flores edited this page Nov 15, 2021 · 2 revisions

First, you will need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:

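import org.apache.spark.sql.SparkSession

// Create (or reuse) a local Spark session that uses all available cores
val spark = SparkSession
      .builder()
      .master("local[*]")
      .getOrCreate()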

Note that this tutorial requires importing the NextiaJD implicit class, which makes the profiling method available on dataframes. You can import it like this:

import edu.upc.essi.dtim.NextiaJD.implicits
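To see why the import matters, here is a minimal sketch of the general Scala pattern in which an implicit class adds a method to Spark dataframes. This is not NextiaJD's code: the names ExampleImplicits, ColumnCounter and columnCount are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical example object; it only illustrates the implicit-class pattern
object ExampleImplicits {
  // Wraps any DataFrame and exposes one extra method on it
  implicit class ColumnCounter(df: DataFrame) {
    def columnCount(): Int = df.columns.length
  }
}

// Without this import the compiler does not know columnCount(),
// and a call such as someDataFrame.columnCount() would not compile
import ExampleImplicits.ColumnCounter
```

Once the implicit class is in scope, the extra method can be called on any dataframe as if it were a built-in one, which is the same way attProfile() is called later in this tutorial.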

Profiling your first dataset

Now, let's get started with our first profile. We first need to read a dataset using Apache Spark's read method. To read a CSV file, you can use the following code:

val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data"
val filename = "wikipedia-iso-country-codes.csv"
val dataset = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename}")

Note that, depending on your dataset, you might need additional options to read it. For example, Spark assumes "," as the separator for CSV files; if your file uses ";" instead, you will need to add .option("delimiter",";") to the read call. For more options on reading CSV files, see the Spark documentation. Because a wrong option is easy to miss, it is always a good idea to display a preview of the dataset. You can do this with the following code:

dataset.show()
// The method will display something like the following:
+-----------------------------+------------+------------+------------+-------------+
|English short name lower case|Alpha-2 code|Alpha-3 code|Numeric code|   ISO 3166-2|
+-----------------------------+------------+------------+------------+-------------+
|                  Afghanistan|          AF|         AFG|           4|ISO 3166-2:AF|
|                Åland Islands|          AX|         ALA|         248|ISO 3166-2:AX|
|                      Albania|          AL|         ALB|           8|ISO 3166-2:AL|
...
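The separator option mentioned earlier can be tried out with a minimal sketch like the one below. The file contents and the names tmpFile and semicolonDataset are made up for illustration; the snippet writes a tiny ";"-separated file to a temporary location so that it does not depend on any file from this tutorial. It is written in the same snippet style as the rest of the page, so paste it into spark-shell or wrap it in a main method.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical sample data: two rows separated by ";" instead of ","
val tmpFile = Files.createTempFile("semicolon-example", ".csv")
Files.write(tmpFile, "name;alpha2\nAfghanistan;AF\nAlbania;AL\n".getBytes("UTF-8"))

// Same read call as in this tutorial, plus the delimiter option
val semicolonDataset = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("delimiter", ";")
  .csv(tmpFile.toString)

// Preview the result to check that two separate columns were detected
semicolonDataset.show()
```

If the delimiter option is left out, Spark treats each line as a single column, which the preview makes immediately visible.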

After reading a dataset, profiling is as simple as calling the attProfile() method on the dataframe, which returns a profile dataframe. For this example, the profile is created as follows:

val profile = dataset.attProfile()

To visualize the profile, we can call the show() method on the profile variable:

profile.show()
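Since the profile is described above as a dataframe, the usual Spark inspection methods should also apply to it. The helper below is a small sketch under that assumption; inspectFrame is a hypothetical name, not part of NextiaJD, and it only uses standard Spark calls.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: prints basic facts about any DataFrame
def inspectFrame(df: DataFrame): Unit = {
  // Column names and types
  df.printSchema()
  // Overall size of the dataframe
  println(s"rows: ${df.count()}, columns: ${df.columns.length}")
  // First five rows, without truncating long cell values
  df.show(5, false)
}
```

With the variables from this tutorial, the call would be inspectFrame(profile). Printing the schema first is helpful when the profile has many columns and the default show() output gets too wide to read.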
