Let's profile a dataset
First, you will need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()
Note that this tutorial requires importing the implicit class. You can import it like this:
import edu.upc.essi.dtim.NextiaJD.implicits
Now, let's get started with our first profile. We first need to read a dataset using Apache Spark's read method. To read a CSV file, you can use the following code:
// Change the path to the folder that holds your own data
val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data"
val filename = "wikipedia-iso-country-codes.csv"
val dataset = spark.read.option("header", true).option("inferSchema", true).csv(s"${path}/${filename}")
Note that, depending on your dataset, you might need more options to read it correctly. For example, Spark assumes "," as the separator for CSV files; if your file uses ";" instead, you will need to add .option("delimiter", ";") to the read call. For more options on reading CSV files, see the Apache Spark documentation for the CSV data source. Because a wrong option can go unnoticed at read time, it is always a good idea to display a preview of the dataset. You can do this with the following code:
dataset.show()
// The method will display something like the following:
+-----------------------------+------------+------------+------------+-------------+
|English short name lower case|Alpha-2 code|Alpha-3 code|Numeric code|   ISO 3166-2|
+-----------------------------+------------+------------+------------+-------------+
|                  Afghanistan|          AF|         AFG|           4|ISO 3166-2:AF|
|                Åland Islands|          AX|         ALA|         248|ISO 3166-2:AX|
|                      Albania|          AL|         ALB|           8|ISO 3166-2:AL|
...
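The note above about separators can be tried out with a small self-contained sketch. It assumes a hypothetical three-column sample file written to a temporary location (the file contents, the column names and the variable names tmpFile and semicolonDataset are made up for illustration and are not part of NextiaJD), and it uses only standard Spark and Java calls:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// getOrCreate() returns the already running session if one exists.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical sample data: a tiny semicolon-separated file in a temporary location.
val tmpFile = Files.createTempFile("countries-", ".csv")
Files.write(
  tmpFile,
  "name;alpha2;numeric\nAfghanistan;AF;4\nAlbania;AL;8\n".getBytes(StandardCharsets.UTF_8)
)

// Same read call as in the tutorial, plus the delimiter option.
val semicolonDataset = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("delimiter", ";")
  .csv(tmpFile.toString)

// Check the column names and inferred types, then preview the rows.
semicolonDataset.printSchema()
semicolonDataset.show()
```

If the delimiter option is left out, Spark would be expected to read each line of this file as a single column, which is the kind of problem a quick printSchema() or show() makes visible before profiling.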
After reading a dataset, profiling is as simple as calling the attProfile() method on a dataframe, which returns a profile dataframe. For this example, the profile is created as follows:
val profile = dataset.attProfile()
To visualize the profile, we can call the show() method on the profile variable:
profile.show()