index.json
[{"authors":["admin"],"categories":null,"content":"Nelson Bighetti is a professor of artificial intelligence at the Stanford AI Lab. His research interests include distributed robotics, mobile computing and programmable matter. He leads the Robotic Neurobiology group, which develops self-reconfiguring robots, systems of self-organizing robots, and mobile sensor networks.\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Sed neque elit, tristique placerat feugiat ac, facilisis vitae arcu. Proin eget egestas augue. Praesent ut sem nec arcu pellentesque aliquet. Duis dapibus diam vel metus tempus vulputate.\n","date":-62135596800,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":-62135596800,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://ga11u.github.io/author/marc-gallofre/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/author/marc-gallofre/","section":"authors","summary":"Nelson Bighetti is a professor of artificial intelligence at the Stanford AI Lab. His research interests include distributed robotics, mobile computing and programmable matter. He leads the Robotic Neurobiology group, which develops self-reconfiguring robots, systems of self-organizing robots, and mobile sensor networks.","tags":null,"title":"Marc Gallofré","type":"authors"},{"authors":["Marc Gallofré"],"categories":["Terraform","Ansible","Open Stack","Cloud Infrastructure","News Hunter"],"content":"Upgrading a running cloud infrastructure is a critical task that have to planned carefully in advance. Before choosing which strategy to follow for upgrading, we have to asks ourselves some questions:\n Is okay for us to have some downtime in our service? Can we stop our services and for how long? Is there any data compromised which need to be saved and restored? Do we really need to updrage our infrastructure? Is there any security risk in doing so? Do we have the necessary resources? In this article we will deal with the scenario where we can have some downtime and there is data that need to be saved and restored.\nIn our scenario we start with 11 instances running in a OpenStack cloud provider, with the following configuration:\n 3 x m1.medium instances (1 VCPUs + 4GB RAM) 6 x m1.large instances (2 VCPUs + 8GB RAM) 2 x m1.xlarge instances (4 VCPUs + 16GB RAM) 6 x 80GB disk We have a Docker Swarm platform running in all instances with services to run real-time processes, 3 Apache Cassandra nodes, 3 Blazegraph nodes, 3 Zookeeper nodes, 3 Kafka nodes, a Mongo DB and other services.\nWe want to upgrade our infrastructure to:\n 4 x m1.medium instances (1 VCPUs + 4GB RAM) 9 x m1.large instances (2 VCPUs + 8GB RAM) 4 x m1.xlarge instances (4 VCPUs + 16GB RAM) 3 x 3TB disk 1 x 11TB disk The Apache Cassandra and Blazegraph data need to be migrated to the new disks and all services will have to start running again with the minimun downtime possible.\nTo do so, we will make use of Terraform, to modify the previous configuration with the need requirments and create volume disks snaptshots for migrating our data. Then, with Ansible we will mount the new volumes with the migrated data and install the Docker Swarm platform again.\nCreating voume disks snaptshots First, we need to backup our data. To do so, we can either use an external service such as Google Cloud Storage or use the OpenStack\u0026rsquo;s volume snaptshots. 
In any case, my advice is to stop the services running on the instances where you have the data to avoid any data corruption during the process.\nIn our case, we are going to create volume snapshots with OpenStack and remember the snapshot ID, which will be needed later in the Terraform configuration.\n\rAlthough this is the first step, if you are working in a real-time environment I recommend doing it at the end, right before you run Terraform, to minimise the data loss during the downtime.\r\r\rAs our platform is using Swarm to orchestrate Docker containers, we don\u0026rsquo;t have an option to stop a running container as we do with docker-compose or stand-alone containers. Thus, we will have to remove the running service to stop it:\ndocker service rm \u0026lt;service-name\u0026gt; Removing a service may take some time, so it is important to verify that the manager has removed the service:\ndocker service inspect \u0026lt;service-name\u0026gt; As well as to verify that the service was removed from the node where it was running:\ndocker ps If the service does not appear, then we can be sure that we successfully removed the service and we can proceed to create a snapshot.\nOnce the service has been removed, we can proceed to detach the volume from the instance and create the snapshot. This process can be done manually from the OpenStack Dashboard or CLI. When detaching the volume from the instance, the status of the volume should appear as Available instead of In-use. The snapshot can be created with attached volumes too; however, creating a snapshot from a detached volume is safer and recommended for our purpose.\n\rIt is not possible to delete the old volumes, because they have an assigned snapshot. To delete them, first we have to delete the snapshots. This process should be done at the end, once we have verified that our data migration succeeded.\r\r\rUpdating Terraform files The Terraform configuration files should be updated to reflect our desired new configuration.\nThe flavor types and the number of instances of each type can be updated like this, where we map each type of instance to its characteristics and count.\nflavor_name = { \u0026quot;manager\u0026quot; = \u0026quot;m1.medium\u0026quot;, \u0026quot;cassandra\u0026quot; = \u0026quot;m1.large\u0026quot;, \u0026quot;blazegraph\u0026quot; = \u0026quot;m1.xlarge\u0026quot;, \u0026quot;worker\u0026quot; = \u0026quot;m1.large\u0026quot;, \u0026quot;hpc-worker\u0026quot; = \u0026quot;m1.xlarge\u0026quot;, \u0026quot;mongo\u0026quot; = \u0026quot;m1.medium\u0026quot;, \u0026quot;zookafka\u0026quot; = \u0026quot;m1.large\u0026quot; } instance_count = { \u0026quot;manager\u0026quot; = 3, \u0026quot;cassandra\u0026quot; = 3, \u0026quot;blazegraph\u0026quot; = 1, \u0026quot;worker\u0026quot; = 3, \u0026quot;hpc-worker\u0026quot; = 3, \u0026quot;mongo\u0026quot; = 1, \u0026quot;zookafka\u0026quot; = 3 } The image used in the instances can be updated to a newer version. It is important to check the image version since images can be outdated or unavailable. To look for updated and available images:\nopenstack image list --status active To update the Terraform file with the desired image, use the Image ID instead of the image name to ensure that you always use the same image:\nimage_id = { \u0026quot;Ubuntu20.04LTS\u0026quot; = \u0026quot;7085d64d-f591-4a23-bdfe-dbbd1288afcf\u0026quot; } To create new instances, we have to define new resources.
In this case, we are creating a new hpc-worker instance using the previously mapped values. The count key is used to create multiple instances according to our previously declared variable instance_count, and the same applies for the other variable mappings (var.). The count.index is used to extract the index of each instance [0,1,2 \u0026hellip; n) and generate a name like hpc-worker-0 for the first instance.\nresource \u0026quot;openstack_compute_instance_v2\u0026quot; \u0026quot;hpc-worker\u0026quot; { count = var.instance_count[\u0026quot;hpc-worker\u0026quot;] name = \u0026quot;${var.node_name[\u0026quot;hpc-worker\u0026quot;]}-${count.index}\u0026quot; image_id = var.image_id[var.image_name[\u0026quot;image\u0026quot;]] flavor_name = var.flavor_name[\u0026quot;hpc-worker\u0026quot;] key_pair = var.key_pub security_groups = var.security_group network { name = var.network } metadata = { ssh_user = var.role_ssh_user[\u0026quot;hpc-worker\u0026quot;], prefer_ipv6 = false, my_server_role = var.node_name[\u0026quot;hpc-worker\u0026quot;], python_bin = \u0026quot;/usr/bin/python3\u0026quot; } } Next, we create a 3TB volume using the previous volume snapshot, so the data will be migrated to the new volume. Thus, we need to indicate the snapshot_id which we want to use. If the snapshot is smaller than the new volume, we will need to expand the volume later, otherwise our instance will not use the full volume capacity. To attach the volume to the new instance, we need to provide the instance_id where we want to attach the volume and the volume_id to attach.\nresource \u0026quot;openstack_blockstorage_volume_v3\u0026quot; \u0026quot;volume_cassandra_1\u0026quot; { name = \u0026quot;${var.volume_name}-cassandra-1\u0026quot; size = 3000 snapshot_id = \u0026quot;f2714817-f9f8-42f3-aa7c-363d7b887983\u0026quot; } resource \u0026quot;openstack_compute_volume_attach_v2\u0026quot; \u0026quot;attach_cassandra_volume_0_to_db_instances\u0026quot; { instance_id = openstack_compute_instance_v2.cassandra[1].id volume_id = openstack_blockstorage_volume_v3.volume_cassandra_1.id } Upgrading our infrastructure with Terraform Once we have defined our desired infrastructure, we can start with the deployment:\nterraform plan terraform apply Mounting and expanding disks and deploying Docker Swarm with Ansible mkdir -p /mnt/data sudo mount /dev/sdb /mnt/data xfs_growfs /dev/sdb ","date":1599130432,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1599130432,"objectID":"7b5d2fd2235f3bbf8666cf55a7f0c3c1","permalink":"https://ga11u.github.io/post/upgrading-cloud-platform-with-terraform/","publishdate":"2020-09-03T12:53:52+02:00","relpermalink":"/post/upgrading-cloud-platform-with-terraform/","section":"post","summary":"Upgrading a running cloud infrastructure is a critical task that has to be planned carefully in advance. Before choosing which strategy to follow for upgrading, we have to ask ourselves some questions:","tags":["Terraform","Ansible","Open Stack","Cloud Infrastructure","News Hunter"],"title":"Upgrading a Cloud Infrastructure With Terraform and Ansible","type":"post"},{"authors":["Marc Gallofré"],"categories":["Cassandra","Nodetool"],"content":"Apache Cassandra is a well-known and powerful distributed database which comes with its own tool for managing a Cassandra cluster and getting interesting information.
In this post, I will show different nodetool commands that I found really useful.\nThe nodetool command looks like:\nnodetool \u0026lt;options\u0026gt; \u0026lt;command\u0026gt;\r The basic options are -h (\u0026ndash;host) to pass as argument the hostname or IP address of a Cassandra node, and -p (\u0026ndash;port) to pass as argument the port number where nodetool will make the connection (by default 7199). We can also use -pwf (\u0026ndash;password-file) to pass the password file path or -pw (\u0026ndash;password) to pass the password string.\nNodetool can be run either from the same machine where Cassandra is installed and running or from an external machine using the appropriate options. In case we would like to use nodetool in a running Cassandra Docker service on a Swarm stack deployment, it is enough to execute the nodetool command inside the running container:\ndocker exec -it \u0026lt;CONTAINER_ID\u0026gt; nodetool\r#To find the CONTAINER_ID, it is possible to use `docker ps` on the node where the Cassandra container is running\r nodetool status Nodetool status provides information about each node such as the status (up U or down D) and the state (normal N, leaving L, joining J, moving M), the node IP address, the amount of data used (Load), the number of tokens, how much data is owned by the node (Owns), the Host ID and the Rack.\n-- Address Load Tokens Owns (effective) Host ID Rack\rUN 10.0.1.28 70.68 MiB 256 67.5% da80b540-0424-4ad1-b64c-51cf4c46dcfe rack1\rUN 10.0.1.12 73.48 MiB 256 69.4% fd8af11e-f1d9-4adb-abe5-84925113f9bb rack1\rUN 10.0.1.14 67.69 MiB 256 63.0% e04c747d-ae77-476f-92fc-34bda99ab1e4 rack1\r The status command accepts a keyspace argument where we can pass the name of a keyspace to get information related to that specific keyspace, in case we have set up more than one.\nnodetool \u0026lt;options\u0026gt; status \u0026lt;keyspace\u0026gt;\r nodetool tablestats Nodetool tablestats provides information and statistics about keyspaces, indexes and tables. The information I found most interesting to check is the SSTable Compression Ratio, which represents the compression ratio (e.g., 0.25 means that 10MB of information is compressed to 2.5MB in Cassandra), the Read/Write count and Local read/write count, which represent the total number of read or write requests since startup, and the Read/Write latency and Local read/write latency, the round-trip time in milliseconds to complete the most recent read or write request.\nKeyspace : example-case\rRead Count: 1349\rRead Latency: 0.245 ms\rWrite Count: 4032\rWrite Latency: 0.10584201388888889 ms\rPending Flushes: 0\rTable: corpus\rSSTable count: 1\rSpace used (live): 74714036\rSpace used (total): 74714036\rSpace used by snapshots (total): 0\rOff heap memory used (total): 95800\rSSTable Compression Ratio: 0.25010121781248695\rNumber of partitions (estimate): 42120\rMemtable cell count: 1344\rMemtable data size: 11265339\rMemtable off heap memory used: 0\rMemtable switch count: 0\rLocal read count: 345\rLocal read latency: 0.226478953 ms\rLocal write count: 1344\rLocal write latency: 0.1053876324 ms\r...
[output truncated]\r With tablestats it is possible to provide the argument \u0026lt;keyspace.table\u0026gt; to visualise stats for a specific keyspace or table.\nnodetool \u0026lt;options\u0026gt; tablestats \u0026lt;keyspace.table\u0026gt;\r ","date":1598882510,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598882510,"objectID":"299cd38b5556d4a5454fce2016b31feb","permalink":"https://ga11u.github.io/post/cassandra-using-nodetool-for-statistics/","publishdate":"2020-08-31T16:01:50+02:00","relpermalink":"/post/cassandra-using-nodetool-for-statistics/","section":"post","summary":"Apache Cassandra is a well-known and powerful distributed database which comes with its own tool for managing a Cassandra cluster and getting interesting information. In this post, I will show different nodetool commands that I found really useful.","tags":["Cassandra","Nodetool"],"title":"Using Nodetool in Cassandra for Interesting Statistics","type":"post"},{"authors":["Marc Gallofré"],"categories":["Docker","Swarm","Private Repository"],"content":"By default Docker Swarm pulls images from Docker Hub, but sometimes we want to have our own private repositories and private images. Then, a few things have to be changed to allow it.\nIn this guide, I will show you how to do it, using the example of GitLab repositories.\nLet\u0026rsquo;s assume that we have our private images in GitLab and we have already deployed a Swarm stack called example-project.\nThe first thing you need to do is:\ndocker login \u0026lt;gitlab-url\u0026gt;:\u0026lt;repository-port\u0026gt;\r You will need to check the port with your GitLab or repository provider; a commonly used port in GitLab is 4567.\nThen you will be asked to provide your username and password. The password is often a token or a secure key you have created only for accessing your private repository.\nFinally, you can deploy your stack as always by adding the option --with-registry-auth.\ndocker stack deploy -c docker-compose.yml example-project --with-registry-auth\r ","date":1598877620,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598877620,"objectID":"7e7f22f67ecb2d5a7ea675fba1f6b31e","permalink":"https://ga11u.github.io/post/swarm-deploying-from-private-repo/","publishdate":"2020-08-31T14:40:20+02:00","relpermalink":"/post/swarm-deploying-from-private-repo/","section":"post","summary":"By default Docker Swarm pulls images from Docker Hub, but sometimes we want to have our own private repositories and private images. Then, a few things have to be changed to allow it.","tags":["Docker","Swarm","Private Repository"],"title":"Deploying in Docker Swarm From a Private Repository","type":"post"},{"authors":["Marc Gallofré"],"categories":["Docker","Swarm","docker-compose"],"content":"Updating a running container may be a critical task if not done properly, as it can cause undesired side effects or stop production processes. Thus, with this guide I want to show you how to conduct a container update process in an easy and safe way.\nI assume you are using a container deployment and management tool such as Swarm or Kubernetes; otherwise, I strongly recommend you to get familiar with such tools and use them on your applications.
For the rest of this guide, I will use Swarm as an example.\nLet\u0026rsquo;s imagine that we have the following service definition in a docker-compose.yml file:\nservices:\rcassandra1:\r\u0026lt;\u0026lt;: *cassandra-base\renvironment:\rCASSANDRA_CLUSTER_NAME: cassandra-cluster\rCASSANDRA_PASSWORD: cassandra\rCASSANDRA_BROADCAST_ADDRESS: tasks.cassandra1\rCASSANDRA_LISTEN_ADDRESS: tasks.cassandra1\rCASSANDRA_SEEDS: \u0026quot;tasks.cassandra1,tasks.cassandra2,tasks.cassandra3\u0026quot; CASSANDRA_PASSWORD_SEEDER: \u0026quot;yes\u0026quot;\rports:\r- target: 7000\rpublished: 7000\r- target: 9042\rpublished: 9042\r This service creates an Apache Cassandra node which is connected in a cluster to 2 other nodes, making a cluster of 3 nodes. It also opens the ports 7000 and 9042.\nNow, we run docker stack deploy -c docker-compose.yml example_project and our cluster of 3 nodes is successfully running and working. So far, so good!\nHowever, after a while we realise that we have forgotten to open the port 7199, from where we can connect with the nodetool tool to manage our cluster and get some statistics. So what can we do now?\nIt\u0026rsquo;s easy: (1) update the docker-compose.yml as we wish, e.g., with the port 7199:\nservices:\rcassandra1:\r\u0026lt;\u0026lt;: *cassandra-base\renvironment:\rCASSANDRA_CLUSTER_NAME: cassandra-cluster\rCASSANDRA_PASSWORD: cassandra\rCASSANDRA_BROADCAST_ADDRESS: tasks.cassandra1\rCASSANDRA_LISTEN_ADDRESS: tasks.cassandra1\rCASSANDRA_SEEDS: \u0026quot;tasks.cassandra1,tasks.cassandra2,tasks.cassandra3\u0026quot; CASSANDRA_PASSWORD_SEEDER: \u0026quot;yes\u0026quot;\rports:\r- target: 7000\rpublished: 7000\r- target: 9042\rpublished: 9042\r- target: 7199\rpublished: 7199\r (2) Run the command again:\ndocker stack deploy -c docker-compose.yml example_project\r In that way, Docker Swarm will create a new version of our service and stop the old version after that.\nFinally, if everything has worked fine, we should be able to do something like this: nodetool tablestats -h \u0026lt;CASSANDRA_NODE_IP\u0026gt;.\n","date":1598869580,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598869580,"objectID":"f8002ac0f8b9a0bb3ad524d22792f557","permalink":"https://ga11u.github.io/post/updating-doker-containers-without-downtime/","publishdate":"2020-08-31T12:26:20+02:00","relpermalink":"/post/updating-doker-containers-without-downtime/","section":"post","summary":"How to update and modify a running Docker container without downtime using Docker Swarm and docker-compose","tags":["Docker","Swarm","docker-compose"],"title":"Updating Docker Containers Without Downtime","type":"post"},{"authors":["Marc Gallofré"],"categories":["News Angler","triple-store"],"content":"","date":1598539912,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598539912,"objectID":"734f0bd9e3ced27eb1cbfc04a28760f1","permalink":"https://ga11u.github.io/post/expanding-blazegraph-data-volumes/","publishdate":"2020-08-27T16:51:52+02:00","relpermalink":"/post/expanding-blazegraph-data-volumes/","section":"post","summary":"","tags":["News Angler","triple-store"],"title":"Expanding Blazegraph Data Volume","type":"post"},{"authors":["Marc Gallofré"],"categories":["News
Angler","Cassandra"],"content":"","date":1598539891,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598539891,"objectID":"03fb7afc75de589c5f48bc5ffd260dda","permalink":"https://ga11u.github.io/post/expanding-cassandra-data-volumes/","publishdate":"2020-08-27T16:51:31+02:00","relpermalink":"/post/expanding-cassandra-data-volumes/","section":"post","summary":"","tags":["News Angler","Cassandra"],"title":"Expanding Cassandra Data Volumes","type":"post"},{"authors":["Marc Gallofré"],"categories":["News Angler","Blazegraph","Software Architecture","Semantic technologies"],"content":"Blazegraph1 is a high performance scale-out triple-store for big data which can support up to ~12.7B triples in a single machine (see previous post). Even though it is presented as an ultra high-performance graph database and designed to scale-out, the scale-out feature has not been that much supported as developers would wish.\nAt the beginning (from the 2.0.0 release), the scale-out was moved to a Enterprise fueture under licence supriptions, as we can see from the following information about High Availability (HA) and Scale-out features in the Blazegraph Blog (https://blog.blazegraph.com/):\n Enterprise Features (HA and Scale-out)\nStarting in release 2.0.0, the Scale-out and HA capabilities are moved to Enterprise features. These are available to uses with support and/or license subscription. If you are an existing GPLv2 user of these features, we have some easy ways to migrate. Contact us for more information. We’d like to make it as easy as possible.\n Later, the support for High Availability was droped out from the project, due to the lack of open source community, as it is corroborated by Bryan Thompson (@thompsonbry), the Chief Scientist and founder of SYSTAP and one of the contributors of Blazegraph, in the issue #116 at Blazegraph/database GitHub (https://github.com/blazegraph/database/issues/116):\n The HA configuration is not functional in more recent releases. Systap halted development of the Blazegraph HA feature several years ago (long before we came to Amazon). Full HA is a complex thing to develop and maintain with master failure, testing of the various failover configurations, longevity testing, targeted failure mode tests, etc. We self-funded quite a bit, but we did not get the engagement from the open source community to make it worth while to continue HA as an open source feature.\nYou can always do the poor man\u0026rsquo;s HA, put the updates onto a durable queue, and then apply writes to each server in parallel. You would need to handle master failover of course. Or you can capture the IChangeLog from one server and replicate the post-facto changes (in terms of statements added and removed) to the other servers, again using a durable queue to capture the post-commit change set. To do the latter, you would also need to report additions to the dictionary indices (which is not currently done, but which would not be that difficult to add in the LexiconRelation and an apply loop interface for the replicas to apply the deltas on their local indices). I think this might \u0026ldquo;just work\u0026rdquo;. The local journal tracks the transactions in flight and manages the recycling of deleted records once no transaction can read on those records. So transactional access to data should \u0026ldquo;work\u0026rdquo; on the replicas without doing anything else. 
Again, you would need to handle master failover, etc.\nThanks, Bryan\n Currently, since Blazegraph was taken over by Amazon (Neptune AWS2), the project does not seem to be actively maintained. However, it is possible to scale out Blazegraph and configure it with HA clusters.\nThe Blazegraph scale-out architecture (Figure 1) is based on a shared disk volume where all Blazegraph\u0026rsquo;s nodes have access to the data. This shared disk volume can be set up across different machines, racks or regions with services like Gluster. A load balancer distributes the data requests and updates between the Blazegraph nodes, and Zookeeper manages the services running on each node.\n\rBlazegraph scale-out architecture\r\r\rThe Blazegraph scale-out architecture provides horizontal scaling as both nodes and disk space can be extended. However, this architecture does not distribute data across the nodes, as would be the case with Apache Cassandra3.\nDue to the lack of documentation and support for clustered and HA Blazegraph deployments, the scale-out option is only for those fearless adventurers who want to try out the clustered and HA version of Blazegraph. A guide about how to deploy the clustered configuration can be found at https://github.com/blazegraph/database/wiki/ClusterGuide with the following advice: We recommend that you ask for help when attempting your first cluster install!, the HA configuration is explained at https://github.com/blazegraph/database/wiki/HAJournalServer#Basic_Deployment, and an example with the Wikidata deployment of Blazegraph (with 3 nodes) can be found at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/Documentation (the author of this post doesn\u0026rsquo;t take responsibility for your failed attempts or disasters \u0026ndash; if you succeed, I would like to hear and learn how you have managed it 😄).\n\rStill, the stand-alone version of Blazegraph is a really interesting option for working with an open source triple-store which can manage big data volumes while providing high performance and support for RDF, SPARQL 1.1 and Gremlin.\r\r\r https://blazegraph.com \u0026#x21a9;\u0026#xfe0e;\n https://aws.amazon.com/neptune \u0026#x21a9;\u0026#xfe0e;\n https://cassandra.apache.org \u0026#x21a9;\u0026#xfe0e;\n ","date":1598430388,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598430388,"objectID":"5a6a67b99261a867f82e6cdc648fdd8f","permalink":"https://ga11u.github.io/post/blazegraph-scale-out/","publishdate":"2020-08-26T10:26:28+02:00","relpermalink":"/post/blazegraph-scale-out/","section":"post","summary":"Blazegraph1 is a high-performance scale-out triple-store for big data which can support up to ~12.7B triples on a single machine (see previous post). Even though it is presented as an ultra high-performance graph database and designed to scale out, the scale-out feature has not been as well supported as developers would wish.","tags":["News Angler","Blazegraph","Software Architecture","Semantic technologies"],"title":"Scaling-out Blazegraph, is it possible?","type":"post"},{"authors":["Marc Gallofré"],"categories":["News Angler","Triple Store","Software Architecture","Semantic technologies"],"content":"Semantic technologies are really interesting from a Big Data perspective.
Yet, these technologies together with semantic web resources enhance Knowledge Graphs by providing richer means for representing and defining facts, concepts, properties, relations and logic rules, while facilitating knowledge graph understanding, integration and manipulation by using ontologies, standard vocabularies and linking their concepts to Linked Open Data (LOD) resources such as Wikidata, Schema.org or DBPedia. Nevertheless, Big Data comes with its known challenges, like the need for systems that are able to scale up and out (vertical and horizontal scaling). Thus, semantic technologies, and more precisely triple-stores which support RDF and SPARQL following the W3C and semantic web standards (i.e., the databases designed for storing knowledge graphs represented as RDF triples and querying them with SPARQL), must adapt and be redesigned to facilitate horizontal and vertical scaling.\nSo far, the scale-up seems to be solved with the large triple-stores\r[1] that can handle big data volumes by adding more resources. E.g., proprietary triple-stores like the Spatial and Graph features in Oracle Database1, AnzoGraph DB2 and AllegroGraph3, which can deal with more than ~1T (10^12) triples, or open-source triple-stores like the Virtuoso Open Source Edition4 (~58.58B = ~58.58x10^9), Blazegraph5 (~12.7B) and Jena TDB6 (~1.7B). As we can see, in terms of dealing with big data volumes, open-source solutions are behind those proprietary or licensed alternatives.\nAre those numbers big enough?\nAccording to Oracle\u0026rsquo;s white paper\r[2], 1T triples can represent:\n 1000 tweets for every one of the 1B Twitter users. 770 facts about every one of the 1.3B Facebook users. 400 metabolic readings for each of the 2.5 billion heartbeats over an average human lifetime. 12 facts about every one of the 86B neurons in the human brain. 5 facts about every one of the 200B stars in the Milky Way Galaxy. 7 facts about every one of the 150B galaxies in the universe. 10 facts about each of the 107B people who ever lived. On the other hand, when talking about scale-out solutions for large triple-stores, we find that they are offered mostly by proprietary platforms or only in licensed versions like Virtuoso Enterprise Edition, with the exception of Blazegraph, which is the only open source platform that offers scale-out.\nTo know more about Blazegraph scale-out possibilities, I recommend reading the following post: \rScaling-out Blazegraph, is it possible?\r\rSo what? What can we do if we need to scale out triple-stores and we want to use and support open-source projects?\nIf we don\u0026rsquo;t want to work with proprietary or licensed triple-stores but still need a scale-out configuration, then we have two options: (1) use some open source triple-store on top of a highly scalable database like Apache HBase7 or (2) use a graph database in combination with Gremlin. Both options have their pros and cons. Using a triple-store in combination with a highly scalable DB provides the benefits of both platforms, but the system maintenance and complexity are considerably increased.
Whereas, while most graph databases do not support RDF or SPARQL, do not have reasoning or inferencing services, and SPARQL queries have to be translated to other query languages like Gremlin8 (although not all SPARQL queries can be transformed to Gremlin, and Gremlin only supports SPARQL 1.0 and not 1.1), there is only one system to configure and maintain.\nExamples of such solutions are: the Jena+HBase triple-store, which combines Apache Jena with Apache HBase to provide a scalable triple-store using RDF and SPARQL; the Titan9 and JanusGraph10 graph databases, which run on top of Apache HBase and Cassandra11 and support Gremlin; or the Neo4j12 and ArangoDB13 graph databases, which also support Gremlin.\n\nBibliography [1] W3C. LargeTripleStores: https://www.w3.org/wiki/LargeTripleStores (last accessed 26/08/2020)\n[2] Oracle. Oracle Spatial Graph RDF graph 1 trillion Benchmark: https://download.oracle.com/otndocs/tech/semantic_web/pdf/OracleSpatialGraph_RDFgraph_1_trillion_Benchmark.pdf (last accessed 26/08/2020)\n https://www.oracle.com/database/technologies/spatialandgraph.html \u0026#x21a9;\u0026#xfe0e;\n https://www.cambridgesemantics.com/anzograph \u0026#x21a9;\u0026#xfe0e;\n https://allegrograph.com \u0026#x21a9;\u0026#xfe0e;\n https://virtuoso.openlinksw.com \u0026#x21a9;\u0026#xfe0e;\n https://blazegraph.com \u0026#x21a9;\u0026#xfe0e;\n https://jena.apache.org \u0026#x21a9;\u0026#xfe0e;\n https://hbase.apache.org \u0026#x21a9;\u0026#xfe0e;\n https://tinkerpop.apache.org \u0026#x21a9;\u0026#xfe0e;\n http://titan.thinkaurelius.com \u0026#x21a9;\u0026#xfe0e;\n https://janusgraph.org \u0026#x21a9;\u0026#xfe0e;\n https://cassandra.apache.org \u0026#x21a9;\u0026#xfe0e;\n https://neo4j.com \u0026#x21a9;\u0026#xfe0e;\n https://www.arangodb.com \u0026#x21a9;\u0026#xfe0e;\n ","date":1597999945,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1597999945,"objectID":"6f7a315425b5e7736fb68e278e25d318","permalink":"https://ga11u.github.io/post/triplestores-scale-out/","publishdate":"2020-08-21T10:52:25+02:00","relpermalink":"/post/triplestores-scale-out/","section":"post","summary":"Semantic technologies are really interesting from a Big Data perspective. Yet, these technologies together with semantic web resources enhance Knowledge Graphs by providing richer means for representing and defining facts, concepts, properties, relations and logic rules, while facilitating knowledge graph understanding, integration and manipulation by using ontologies, standard vocabularies and linking their concepts to Linked Open Data (LOD) resources such as Wikidata, Schema.","tags":["News Angler","Triple Store","Software Architecture","Semantic technologies"],"title":"Is It Possible to Scale-out Open Source Triple-Stores?","type":"post"}]