-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate EOSC services mining with IIS primary workflow #1398
Integrate EOSC services mining with IIS primary workflow #1398
Conversation
Implementing datasource importer module importing services. Supplementing unit and integration tests suite.
Implementing service mining exporter module. Supplementing unit and integration tests suite.
Introducing sqlite_builder phase producing sqlitedb with services imported from an avro datastore. Introducing main_sqlite phase performing the text mining. Binding all sub-phases in main uber workflow. Introducing integration tests for eosc service mining.
Integrating EOSC services mining with IIS primary workflow. Integrating EOSC services mining with the IIS primary processing integration test.
Changing service record eligibility rule from eosctype=Service to monitoring if given record was collected from the specified datasource. This was required due to the datasource deduplication process which resulted in some datasources having eosctype set to 'Service' being replaced by the new representative having eosctype set to "Data Source". Increasing allowed mapper timeout for main_sqlite phase due to slow performance when running mining on the full set of datasources, not services only.
Making `url` field in `eu.dnetlib.iis.importer.schemas.Service` avro record mandatory as it is required by the mining script.
One of the commits addresses an obsolete class removal: #1393. |
Is there a good reason to mix this change in? Looks quite separate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good as far as I can tell. I only have two small questions and one small suggestion that could be handled in a separate PR.
// ------------------------ PRIVATE -------------------------- | ||
|
||
/** | ||
* Creates similarity related actions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something, but is this code related to similarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is probably some copy/paste leftover. Fixed.
@@ -136,10 +133,14 @@ public static void main(String[] args) throws Exception { | |||
OafEntityToAvroConverter<Result, DataSetReference> datasetConverter = new DatasetMetadataConverter(dataInfoBasedApprover); | |||
OafEntityToAvroConverter<Organization, eu.dnetlib.iis.importer.schemas.Organization> organizationConverter = new OrganizationConverter(); | |||
OafEntityToAvroConverter<Project, eu.dnetlib.iis.importer.schemas.Project> projectConverter = new ProjectConverter(); | |||
OafEntityToAvroConverter<Datasource, eu.dnetlib.iis.importer.schemas.Service> serviceConverter = new ServiceConverter( | |||
(StringUtils.isNotBlank(params.eligibleServiceCollectedFromDatasourceId) | |||
&& !UNDEFINED_NONEMPTY_VALUE.equals(params.eligibleServiceCollectedFromDatasourceId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract the test to a method? It looks reusable and is quite long. Possibly even do:
String eligibleId = nullIfNotValid(params.eligibleServiceCollectedFromDatasourceId);
serviceConverter = new ServiceConverter(eligibleId);
There are already similar methods in WorkflowRuntimeParameters
but not an exact match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, WorkflowRuntimeParameters
is where this piece of code was taken from. Just didn't bother to do the refactoring but you're right, let's keep it clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just introduced getValueOrNullIfNotValid()
in WorkflowRuntimeParameters
and refactored other methods to rely on this newly introduced method.
|
||
String targetDbLocation = System.getProperty("java.io.tmpdir") + File.separatorChar + "services.db"; | ||
File targetDbFile = new File(targetDbLocation); | ||
targetDbFile.setWritable(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method does not work for nonexistent files, so is the file supposed to exist already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I vaguely remember it was required at some point so it looks like under some circumstances (idk, maybe running this process on the same datanode which run this process before?) this file might already exist.
This piece of code also originates from other *DBBuilder
class so what I could do is to extract this to a dedicated common method and create a dedicated issue to take a closer look at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial idea was to put this change into a separate commit which seemed be enough in terms of "isolation". But now I realized the was followed up with a bunch of #1394 related commits which I'd prefer to squash into the single one thus making this initial idea obsolete. I will find a way to revert this change and reintroduce it as a separate commit outside this thread. |
Introducing code review fixes.
Originally requested in: #1394 and on redmine: https://support.openaire.eu/issues/6354.
This PR introduces new EOSC services mining subworkflow and fully integrates it with the IIS primary workflow.
It includes importing services from the graph, processing (mining) and exporting matched datasources as a dedicated ActionSet.