Let's do data discovery
First, you need a Spark session to work with NextiaJD. If you have not created one yet, you can use the following code:
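import org.apache.spark.sql.SparkSession

// Create a local Spark session (or reuse one that already exists)
val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()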
Note that this tutorial requires importing the NextiaJD implicit class, which makes the discovery() method used below available on DataFrames. You can import it like this:
import edu.upc.essi.dtim.NextiaJD.implicits
NextiaJD provides two types of discovery: query-by-attribute and query-by-dataset.
Query-by-attribute focuses on discovery from a single reference attribute. First, let's read the dataset for which we want to find joinable datasets. We will call it the query dataset:
val path = "/Users/javierflores/Documents/UPC_projects/tmp3/n/nextiajd_wiki/data/datasets"
val filename = "wikipedia-iso-country-codes.csv"
val queryDataset = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename}")
Then, let's read the candidate datasets that we would like to compare against the query dataset. We will create a variable called datasets that holds the candidate DataFrames:
val filename1 = "Life_expectancy.csv"
val candidateDataset1 = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename1}")
val filename2 = "Military_Expenditure.csv"
val candidateDataset2 = spark.read.option("header", true).option("inferSchema",true).csv(s"${path}/${filename2}")
val datasets = Seq(candidateDataset1, candidateDataset2)
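When there are more than a couple of candidate files, the repeated read calls above can be folded into a small helper. The following is only a sketch: the object CsvLoading and its method readCandidates are hypothetical names introduced for illustration and are not part of NextiaJD. It uses the same Spark read options as the rest of this tutorial.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of NextiaJD.
object CsvLoading {
  // Reads every file name in `files` from the directory `dir`,
  // using the same CSV options as the tutorial (header row, inferred types).
  def readCandidates(spark: SparkSession, dir: String, files: Seq[String]): Seq[DataFrame] =
    files.map { file =>
      spark.read
        .option("header", true)
        .option("inferSchema", true)
        .csv(s"$dir/$file")
    }
}

// Usage, assuming `spark` and `path` are defined as earlier in this tutorial:
//   val datasets = CsvLoading.readCandidates(
//     spark, path, Seq("Life_expectancy.csv", "Military_Expenditure.csv"))
```

The resulting sequence can be passed to discovery() in the same way as the hand-built datasets value.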
Finally, to perform a discovery we call the discovery() method on the queryDataset variable. This method requires the list of candidate datasets and the attribute of the query dataset for which we want to find join attributes.
val resultDiscovery = queryDataset.discovery(datasets,"English short name lower case")
resultDiscovery.show(false)
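Because the reference attribute is passed as a plain string (here a column name containing spaces), a typo is easy to make. A minimal sketch of a check that can be run before calling discovery() is shown below. AttributeCheck and findAttribute are hypothetical names introduced for illustration, not part of NextiaJD, and the exact-match rule is an assumption that may be stricter than what the library itself requires.

```scala
// Hypothetical helper, not part of NextiaJD.
object AttributeCheck {
  // Returns Right(attribute) when the name matches a column exactly.
  // Otherwise returns Left with a message, plus a hint when a column
  // differs only by letter case or surrounding whitespace.
  def findAttribute(columns: Seq[String], attribute: String): Either[String, String] =
    if (columns.contains(attribute)) Right(attribute)
    else {
      val target = attribute.trim.toLowerCase
      val near = columns.filter(_.trim.toLowerCase == target)
      val hint = if (near.nonEmpty) s" Did you mean: ${near.mkString(", ")}?" else ""
      Left(s"Column '$attribute' not found.$hint")
    }
}

// Usage, assuming `queryDataset` is the DataFrame read earlier:
//   AttributeCheck.findAttribute(queryDataset.columns.toSeq, "English short name lower case") match {
//     case Right(attr) => queryDataset.discovery(datasets, attr).show(false)
//     case Left(error) => println(error)
//   }
```

Returning an Either keeps the check separate from the discovery call, so the caller decides whether to abort or to correct the attribute name.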
Query-by-dataset refers to a full discovery over all the attributes in the query dataset. In this case, we only need to provide the query dataset and a list of candidate datasets.
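// A distinct name avoids redefining resultDiscovery when both examples run in the same scope
val resultDiscoveryAll = queryDataset.discovery(datasets)
resultDiscoveryAll.show(false)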