
Compute multiple profiles


First, you will need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
      .builder()
      .master("local[*]")
      .getOrCreate()

Note that this tutorial requires importing NextiaJD's implicit class. You can import it like this:

import edu.upc.essi.dtim.NextiaJD.implicits
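
With this import in scope, NextiaJD's profiling methods, such as attProfile(), become available directly on any DataFrame. For example (a minimal sketch, assuming a CSV file exists at the path shown):

// the implicits add attProfile() to DataFrame
val df = spark.read.option("header", "true").csv("/home/nextiajd/datasets/dataset1")
df.attProfile()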

Profiling multiple datasets

There are different ways to profile multiple datasets. For this example, we will assume a CSV file that lists the information of all the datasets we would like to profile. This file contains the columns dataset, path and delimiter, and looks something like the following:

dataset,path,delimiter
dataset1,/home/nextiajd/datasets/,","
dataset2,/home/javier/datasets/,","
dataset3,/home/nextiajd/,";"
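
If all your datasets live in one directory and share the same delimiter, one way to produce this metadata file is to generate it from a directory listing. The following is only a sketch, assuming comma-delimited files under /home/nextiajd/datasets:

import java.io.{File, PrintWriter}

val dir = new File("/home/nextiajd/datasets")
val writer = new PrintWriter(new File(dir, "datasets_info.csv"))
writer.println("dataset,path,delimiter")
// one metadata row per CSV file found in the directory
dir.listFiles.filter(_.getName.endsWith(".csv"))
  .foreach(f => writer.println(f.getName + "," + dir.getAbsolutePath + "," + "\",\""))
writer.close()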

Then, we read this metadata file in Spark with the following code:

val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data"
val filename = "datasets_info.csv"
val datasetsInfo = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("escape", "\"")
      .csv(s"${path}/${filename}")

Next, we define a helper method that reads a dataset given its path and delimiter:

import org.apache.spark.sql.{DataFrame, SparkSession}

def readDataset(spark: SparkSession, pathCSV: String, delim: String): DataFrame = {
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", delim)
    .csv(pathCSV)
}
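
For example, you could read a single dataset by hand with the values from the first row of the metadata file above (assuming a file exists at that location):

val df1 = readDataset(spark, "/home/nextiajd/datasets/dataset1", ",")
df1.printSchema()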

Finally, the idea is to iterate over each row of the datasetsInfo variable and call the readDataset() method with the row's information, computing the profile with attProfile(). This looks like the following:

import org.apache.spark.sql.Row

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      readDataset(spark, s"${path}/${dataset}", delimiter).attProfile()
  }
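
Note that collect() brings the metadata rows to the driver, which is fine for a small list of datasets. If some paths might be missing or unreadable, you could wrap each read in scala.util.Try so that one bad entry does not abort the whole loop. A sketch:

import scala.util.{Failure, Success, Try}

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      Try(readDataset(spark, s"${path}/${dataset}", delimiter).attProfile()) match {
        case Success(_) => println(s"profiled $dataset")
        case Failure(e) => println(s"could not profile $dataset: ${e.getMessage}")
      }
  }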

If you would like to store the datasets or the profiles in a variable, you can use a map whose key is the dataset name and whose value is the DataFrame object. The code would then look like this:

var datasets = Map[String, DataFrame]()

datasetsInfo.select("dataset", "path", "delimiter").collect()
  .foreach {
    case Row(dataset: String, path: String, delimiter: String) =>
      val df = readDataset(spark, s"${path}/${dataset}", delimiter)
      // this will compute the profile
      df.attProfile()
      // save the DataFrame using the dataset name as the key
      datasets = datasets + (dataset -> df)
  }
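
Afterwards, any dataset can be retrieved from the map by its name, for example:

// look up the DataFrame that was profiled under the name dataset1
datasets("dataset1").show()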