performance on 8.8k triples (508kb turtle) ontology #154

VladimirAlexiev · 2020-03-02T11:23:00Z

(Extracted from #153).

https://github.com/VladimirAlexiev/soml/tree/master/owl2soml/eg#schemaorg describes running the script https://github.com/VladimirAlexiev/soml/tree/master/owl2soml on schema.org renditions (508k ttl, 730k rdf, 808k jsonld). The script produces 428k of yaml and is slow: about 4 minutes for ttl (I have not yet been able to make it run on jsonld and rdf).

time perl ../owl2soml.pl -voc schema schema.ttl    > schema1.yaml
real    4m9.203s
user    0m0.000s
sys     0m0.094s

My code doesn't use SPARQL (for now), just Model -> subjects/properties/objects/holds and Iter -> next/elements. What's the easiest way to profile this code?
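A low-tech way to start, before reaching for a full profiler such as Devel::NYTProf (run as `perl -d:NYTProf script.pl`, then `nytprofhtml`), is to wrap each phase of the script in a wall-clock timer. The following is a minimal sketch using only the core Time::HiRes module; the `timed` helper and the `traverse` phase name are hypothetical and not part of owl2soml.pl.

```perl
use strict;
use warnings;
use Time::HiRes ();

my %elapsed;   # phase name => accumulated wall-clock seconds

# Hypothetical helper: run a code block and add its wall-clock time to %elapsed.
sub timed {
    my ($phase, $code) = @_;
    my $start  = Time::HiRes::time();
    my @result = $code->();
    $elapsed{$phase} += Time::HiRes::time() - $start;
    return wantarray ? @result : $result[0];
}

# Stand-in workload; in a real script the wrapped blocks would be the
# parse, model-traversal, and serialization steps.
my $sum = timed(traverse => sub { my $s = 0; $s += $_ for 1 .. 100_000; $s });

# Report per-phase totals on STDERR so they do not mix with the script output.
printf STDERR "%-10s %.3fs\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Coarse timings like these only show which phase dominates; a line-level profiler is still needed to see which calls inside that phase are expensive.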

I suspect significant time is spent converting between Attean::IRI and URI (#151). I use lazy to defer IRI parsing, but there is no such option for URI:

sub iri ($) {
  # convert string or URI (returned by URI::NamespaceMap $MAP) to Attean::IRI
  my $uri = shift or return;
  Attean::IRI->new (value => ref($uri) ? $uri->as_string : $uri, lazy => 1)
}

sub uri ($) {
  my $iri = shift;
  URI->new (ref($iri) ? $iri->as_string : $iri);
}
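If conversion does turn out to be a hot spot, one option is to cache converted objects, since an ontology repeats the same IRIs (rdf:type, rdfs:domain and so on) across many triples. The sketch below assumes the converted objects are treated as immutable, because every caller receives the same shared object. `make_cached` is a hypothetical helper, and the constructor used here is a counting stand-in rather than the real Attean::IRI or URI constructor.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a one-argument constructor so that each distinct
# string is converted only once and the same object is returned afterwards.
sub make_cached {
    my ($construct) = @_;
    my %cache;
    return sub {
        my $str = shift;
        return unless defined $str;
        return $cache{$str} //= $construct->($str);
    };
}

# Stand-in constructor that counts its calls. In the script above it would be
# something like sub { URI->new($_[0]) } (an assumption, not tested here).
my $calls  = 0;
my $to_uri = make_cached(sub { $calls++; return { value => $_[0] } });

my $first  = $to_uri->('http://schema.org/name');
my $second = $to_uri->('http://schema.org/name');    # served from the cache
my $other  = $to_uri->('http://schema.org/Person');
```

Whether this helps in practice depends on how often the same string is converted; only profiling can confirm that.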

The Turtle file is 8.8k triples and RIOT takes 6s to convert it to N-Triples:

time riot -out ntriples schema.ttl | wc -l
8858

real    0m6.228s
user    0m0.169s
sys     0m0.479s

I wonder how long Attean would take on such a conversion...


kasei · 2020-03-02T16:07:14Z

Doing some quick profiling suggests to me that the bulk of the time is not spent in IRI, but in Type::Tiny. This is an area where I don't have a lot of intuition behind the performance, but I'll try to take a look.

kasei · 2020-03-02T17:17:55Z

Apologies. The performance issues I was seeing in Type::Tiny are a result of my having more aggressive (opt-in) type checking turned on.

VladimirAlexiev · 2020-03-06T14:55:42Z

@kasei can I help turn some of this off, to see how much performance will improve?

kasei · 2020-03-06T16:37:38Z

Turning what off? I'd be happy to see PRs on IRI or the Attean parsers to improve performance.

kasei · 2020-03-06T16:39:20Z

That being said, I think any changes that completely bypass the IRI validity checks will have to be opt-in in an obvious way that helps to indicate that it may cause problems elsewhere in Attean or related modules.

kasei · 2020-03-06T17:41:46Z

Also, in profiling the code I noticed that a lot of the time is spent not in the serialization but in the memory model/store code. This is an area where Attean has lagged behind RDF::Trine. Improvements to this code, or a port (or new implementation) of something like RDF::Trine::Store::DBI (based on SQLite or other) might have a bigger impact on performance than trying to avoid IRI parsing...

VladimirAlexiev · 2020-03-08T07:50:30Z

"Turn off" the optional Type checks.
The IRI constructor (unlike URI) has a lazy option that I use.
Re DBI or SQLite: but the number of triples in this case is very small, so a simple in-memory store should be fastest?

kasei · 2020-03-09T01:58:30Z

> The IRI constructor (unlike URI) has a lazy option that I use.

The lazy option just defers IRI component parsing until something is done with the IRI object (such as calling any of its accessors). This helps in cases where an IRI is constructed but never used (as in query evaluation where lots of intermediate results do not end up in the final result set), but I suspect would not help in your case where you are constructing IRIs and then accessing their contents to re-serialize.
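The distinction can be illustrated with a toy class. This is not Attean::IRI's actual implementation: `LazyIRI`, its parse counter, and its crude regex split are all hypothetical. Construction is cheap, but the first accessor call pays the full parsing cost, so a workload that reads every IRI it builds saves nothing.

```perl
use strict;
use warnings;

package LazyIRI;

our $PARSES = 0;   # counts how many times component parsing actually ran

sub new {
    my ($class, %args) = @_;
    return bless { value => $args{value} }, $class;   # no parsing here
}

# Split into scheme and remainder on first use, then reuse the stored result.
sub _components {
    my $self = shift;
    return $self->{components} //= do {
        $PARSES++;
        my ($scheme, $rest) = $self->{value} =~ m{^([A-Za-z][A-Za-z0-9+.-]*):(.*)$}s;
        +{ scheme => $scheme, rest => $rest };
    };
}

sub scheme { $_[0]->_components->{scheme} }

package main;

my @iris = map { LazyIRI->new(value => "http://schema.org/p$_") } 1 .. 3;
my $after_construct = $LazyIRI::PARSES;   # construction alone parses nothing

my @schemes = map { $_->scheme } @iris;   # touching each IRI parses it once
my $after_access = $LazyIRI::PARSES;
```

In this sketch the saving only materializes for objects whose accessors are never called, which matches the query-evaluation case described above.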

> Re DBI or SQLite: but the number of triples in this case is very small, so a simple in-memory store should be fastest?

Not necessarily. The memory store in Attean is a trivial implementation, but even though it's all in-memory and you are using a small dataset, something like SQLite might be faster just as a result of working with native datatypes (for example). I'm not guaranteeing that such an implementation would be faster, but work in this area (whether on a more optimized in-memory store, or on a bridge to something like SQLite) would certainly improve performance in this sort of use case.
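As a rough illustration of what a "more optimized in-memory store" could mean, compare a pattern match that scans every triple with one that consults a subject index. This is a toy sketch with hypothetical names and data; it does not describe how Attean's memory store or any AtteanX store is implemented.

```perl
use strict;
use warnings;

# Triples as [subject, predicate, object] string arrays (toy data).
my @triples = (
    ['ex:Person', 'rdf:type',              'rdfs:Class'],
    ['ex:name',   'rdf:type',              'rdf:Property'],
    ['ex:name',   'schema:domainIncludes', 'ex:Person'],
);

# Naive lookup: every query walks the whole list, O(n) per call.
sub objects_scan {
    my ($s, $p) = @_;
    return map { $_->[2] } grep { $_->[0] eq $s && $_->[1] eq $p } @triples;
}

# Indexed lookup: build subject => predicate => [objects] once, then each
# query is two hash lookups regardless of how many triples are stored.
my %spo;
push @{ $spo{ $_->[0] }{ $_->[1] } }, $_->[2] for @triples;

sub objects_indexed {
    my ($s, $p) = @_;
    my $by_pred = $spo{$s} or return;        # unknown subject: no matches
    return @{ $by_pred->{$p} // [] };
}
```

A script that issues many subject/property lookups pays the scan cost once per call in the first version, so the difference grows with both the number of triples and the number of lookups.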

VladimirAlexiev · 2020-09-12T17:16:21Z

I should add timing with a Java (rdf4j) reimplementation that we have. AFAIR it's 10x faster.

kasei · 2020-11-26T20:41:46Z

I've pushed a beta version of AtteanX::Store::LMDB to CPAN and along with some minor performance improvements in Attean (unreleased, available via GitHub for now), saw a large performance improvement on your owl2soml.pl code. To act similarly to a memory store, initialize it like this:

use File::Temp qw(tempdir);
my $path = tempdir(CLEANUP => 1);
my $store = Attean->get_store('LMDB')->new(filename => $path, initialize => 1);

There are probably still improvements to be had with an actually lazy implementation of IRI, but I'd be interested to hear how a more performant store impacts your use cases.

VladimirAlexiev · 2020-11-28T12:19:37Z

Thanks! I'll try it soon.
I have another case: on 3.5Mb of IEC CIM (ENTSOE CGMES) ontologies the current version takes 80 min.

kasei · 2020-11-28T18:47:20Z

@VladimirAlexiev is that 3.5Mb file available somewhere? I'd be happy to give it a try and profile the run to see where else might benefit from improvements.

kasei · 2020-11-30T06:07:47Z

@VladimirAlexiev following up on the mention of the LMDB store, I just noticed that it requires manually installing LMDB as a system library. I had thought it was built-in to the LMDB_File module, but that seems not to be the case. I think it's still the best solution right now for performant use, but obviously might be an issue in some environments. I'll try to have a look at some of the more portable store options for improving performance (either improving the memory store or porting the SQLite store from RDF::Trine).

kasei · 2020-12-04T17:08:40Z

@VladimirAlexiev It turns out I had the SQLite code sitting around unreleased, which I've now pushed to CPAN. So if the system library installation for LMDB is problematic, AtteanX::Store::DBI will probably be the next best option. To get a temporary SQLite store (in-memory only), do this:

our $store = Attean->get_store('DBI')->temporary_store();

VladimirAlexiev changed the title ~~performance on 500-700k ontology~~ performance on 8.8k triples (508kb turtle) ontology Mar 2, 2020

This was referenced Oct 10, 2024

make rdfpuml faster VladimirAlexiev/rdf2rml#36

Open

make owl2soml faster VladimirAlexiev/soml#10

Open
