Let's do data discovery

Javier Flores edited this page Nov 15, 2021 · 1 revision

First, you will need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
      .builder()
      .master("local[*]")
      .getOrCreate()

Note that this tutorial also requires importing the implicit class. You can import it like this:

import edu.upc.essi.dtim.NextiaJD.implicits

Perform data discovery

NextiaJD provides two types of discovery: query-by-attribute and query-by-dataset.

Query-by-attribute

Query-by-attribute focuses on discovery starting from a single reference attribute. First, let's read the dataset for which we want to find other joinable datasets. We will call it the query dataset:

val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data/datasets"
val filename = "wikipedia-iso-country-codes.csv"
val queryDataset = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename}")

Then, let's read the candidate datasets that we would like to compare against. We will create a variable called datasets with the list of DataFrame objects:

val filename1 = "Life_expectancy.csv"
val candidateDataset1 = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename1}")

val filename2 = "Military_Expenditure.csv"
val candidateDataset2 = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename2}")

val datasets = Seq(candidateDataset1, candidateDataset2)
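When there are more than a couple of candidate files, the repeated read calls can be folded into a small helper. The following is a minimal sketch using only the Spark reader calls already shown on this page; readCsv, readCandidates and candidateFiles are hypothetical names introduced here for illustration and are not part of NextiaJD.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of NextiaJD): reads one CSV file with the
// same header and schema-inference options used in the reads above.
def readCsv(spark: SparkSession, dir: String, name: String): DataFrame =
  spark.read
    .option("header", true)
    .option("inferSchema", true)
    .csv(s"$dir/$name")

// Hypothetical helper: reads every listed file from `dir`, preserving order.
def readCandidates(spark: SparkSession, dir: String, names: Seq[String]): Seq[DataFrame] =
  names.map(name => readCsv(spark, dir, name))

// The two candidate files used in this tutorial.
val candidateFiles = Seq("Life_expectancy.csv", "Military_Expenditure.csv")
```

With the session and path defined earlier, readCandidates(spark, path, candidateFiles) should yield the same two DataFrames as the explicit reads, so its result can be used wherever datasets is used on this page.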

Finally, to perform a discovery we call the discovery() method on the queryDataset variable. This method takes the list of candidate datasets and the attribute from the query dataset for which we want to find join attributes.

val resultDiscovery = queryDataset.discovery(datasets,"English short name lower case")
resultDiscovery.show(false)
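Once discovery suggests a candidate attribute, it can be useful to sanity-check the pair with plain Spark before joining. The sketch below is not how NextiaJD ranks attributes; it is a simple, independent overlap check. The helper name distinctOverlap and its parameters are hypothetical, and the candidate column name has to be taken from the discovery output.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (plain Spark, not NextiaJD): fraction of the distinct,
// non-null values of the query attribute that also occur in the candidate
// attribute. Values are compared as strings.
def distinctOverlap(query: DataFrame, queryAttr: String,
                    candidate: DataFrame, candidateAttr: String): Double = {
  val q = query.select(query(queryAttr).cast("string").as("v")).na.drop().distinct()
  val c = candidate.select(candidate(candidateAttr).cast("string").as("v")).na.drop().distinct()
  val total = q.count()
  if (total == 0) 0.0
  else q.join(c, Seq("v"), "left_semi").count().toDouble / total
}
```

A value close to 1.0 means that almost every distinct value of the query attribute has a match in the candidate attribute; a value close to 0.0 suggests that a join on that pair would drop most rows.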

Query-by-dataset

Query-by-dataset refers to a full discovery over all the attributes in the query dataset. In this case, we only need to provide the query dataset and a list of candidate datasets.

val resultDiscovery = queryDataset.discovery(datasets)
resultDiscovery.show(false)