Compute multiple profiles
First, you will need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()
Note that this tutorial requires importing the NextiaJD implicit class. You can import it like this:
import edu.upc.essi.dtim.NextiaJD.implicits
There are different ways to profile multiple datasets. For this example, we will assume a CSV file listing all the datasets we would like to profile. This CSV file contains the columns dataset, path and delimiter, and looks something like the following:
dataset,path,delimiter
dataset1,/home/nextiajd/datasets/,","
dataset2,/home/javier/datasets/,","
dataset3,/home/nextiajd/,";"
Then, we read this CSV file in Spark with the following code:
val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data"
val filename = "datasets_info.csv"
val datasetsInfo = spark.read
  .option("header", "true").option("inferSchema", "true")
  .option("escape", "\"")
  .csv(s"${path}/${filename}")
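To quickly verify that the file was read correctly, you can print its contents:
// sanity check: show the rows describing the datasets to profile
datasetsInfo.show(false)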
Next, we define a helper method to read each dataset:
import org.apache.spark.sql.DataFrame

def readDataset(spark: SparkSession, pathCSV: String, delim: String): DataFrame = {
  spark.read
    .option("header", "true").option("inferSchema", "true")
    .option("delimiter", delim)
    .csv(pathCSV)
}
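As a quick check, the helper can first be used on a single dataset. This sketch simply reuses the values from the first example row of the CSV above:
// read one dataset using the example values from the first CSV row
val df1 = readDataset(spark, "/home/nextiajd/datasets/dataset1", ",")
// compute its attribute profile
df1.attProfile()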
Finally, we iterate over each row of the datasetsInfo variable and call the readDataset() method with the row's information. This looks like the following:
import org.apache.spark.sql.Row

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      // read the dataset and compute its attribute profile
      readDataset(spark, s"${path}/${dataset}", delimiter).attProfile()
  }
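If some of the paths in the CSV might be invalid, a defensive variant (a sketch using the standard scala.util.Try, not a NextiaJD feature) can skip the datasets that fail to load:
import scala.util.{Failure, Success, Try}

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      // attempt to read and profile; report and continue on failure
      Try(readDataset(spark, s"${path}/${dataset}", delimiter).attProfile()) match {
        case Success(_) => println(s"Profiled $dataset")
        case Failure(e) => println(s"Skipping $dataset: ${e.getMessage}")
      }
  }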
If you would like to store the datasets or their profiles in a variable, we can use a map where the key is the dataset name and the value is the DataFrame object. Our code then looks like this:
var datasets = Map[String, DataFrame]()

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      val df = readDataset(spark, s"${path}/${dataset}", delimiter)
      // compute the profile
      df.attProfile()
      // store the dataset, using the dataset name as key and the DataFrame as value
      datasets = datasets + (dataset -> df)
  }
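Similarly, if we want the profiles themselves in a variable, a second map can hold them. This is a sketch that assumes attProfile() returns the profile as a DataFrame, as in the NextiaJD examples:
// assumption: attProfile() returns the attribute profile as a DataFrame
var profiles = Map[String, DataFrame]()

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      val df = readDataset(spark, s"${path}/${dataset}", delimiter)
      // store the profile under the dataset name
      profiles = profiles + (dataset -> df.attProfile())
  }

// later, retrieve a dataset or its profile by name
val d1 = datasets("dataset1")
val p1 = profiles("dataset1")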