1: User Manual

Introduction

In this manual, we describe how to implement a Python application using Hecuba, and the main features that Hecuba provides to boost the performance of applications using Cassandra. Hecuba improves developer productivity by implementing all the code necessary to access the data: applications can access data as if it were in memory, and Hecuba translates this code at runtime into accesses to the underlying storage system. Hecuba also implements several optimizations to favor data locality and to reduce the number of interactions with the backing storage, thus speeding up data accesses.

Hecuba applications at a glance

One of the goals of Hecuba is to provide programmers with an easy and portable interface to access data. This interface is independent of the type of system and storage used to keep the data, enhancing the portability of applications. In short, with Hecuba applications can access data like regular objects stored in memory, and Hecuba translates the code at runtime into the proper code for the backing storage used in each scenario.

The current implementation of Hecuba handles in-memory objects, as well as persistent data stored in Apache Cassandra databases. This chapter explains how to create a Python application that uses the Hecuba object abstractions to handle data persistence.

We will start by defining a set of classes that represent the persistent data. The user must inherit from one of the main abstractions provided by Hecuba: StorageObj or StorageDict. Programmers can also use the Hecuba class StorageNumpy to instantiate persistent numpy ndarrays.

The StorageObj allows the user to define persistent attributes, accessed with the regular Python object protocol. On the other hand, the StorageDict behaves like a Python dictionary, accepting a key to identify the values. In both cases, the in-memory and persistent data are handled transparently.

Next, the user must define the data model with the concrete data types that will be stored in the persistent layer. The specification is written as a Python comment, and its structure differs depending on whether we inherit from StorageObj or StorageDict. For instance, to define a set of attributes we use @ClassField with a StorageObj, and to define a dictionary we use @TypeSpec with a StorageDict.

from hecuba import StorageObj
import numpy as np

class Dataset(StorageObj):
    '''
    @ClassField author str
    @ClassField open_access bool
    @ClassField injected_particles int
    @ClassField geometry numpy.ndarray
    '''

The user data model expresses that, for each instance of the class, there will be four attributes stored in the persistent layer. By adding more @ClassField lines the user can define any number of persistent attributes. Once the class is defined, the application can instantiate as many objects as needed. On the other hand, we can have a StorageDict to store some persistent results with the following definition:

from hecuba import StorageDict

class Results(StorageDict):
    '''
    @TypeSpec dict <<particle_id:int, step:float>, x:double, y:double, z:double>
    '''

experiment = Results("breathing.results")

And by mixing both definitions we can write a small application. Note that the Dataset class has an attribute particles referencing the class ParticlesPositions.

from hecuba import StorageObj, StorageDict
import numpy as np



class ParticlesPositions(StorageDict):
    '''
    @TypeSpec dict <<particle_id:int>, x:double, y:double, z:double>
    '''


class Dataset(StorageObj):
    '''
    @ClassField author str
    @ClassField open_access bool
    @ClassField injected_particles int
    @ClassField geometry numpy.ndarray
    @ClassField particles ParticlesPositions
    '''


dt1 = Dataset("breathing.case235")

dt1.author = "BSC"
dt1.open_access = True
dt1.injected_particles = 250000
dt1.geometry = np.load("./geom_file.npy")

for part_id in range(dt1.injected_particles):
    dt1.particles[part_id] = list(np.random.random_sample(3,))

By passing a name of type str to the initializer of a Hecuba class instance, the object becomes persistent and sends the data to the persistent storage. Said name will act as an identifier for its data and other objects created with the same name will access the same data. In this way, if we pass a name which was previously used to create an object we will retrieve the previously persisted data.
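
As a minimal sketch (assuming the Dataset example above has already been executed), creating a new instance with the same name retrieves the persisted data:

dt2 = Dataset("breathing.case235")   # same name: accesses the same data
print(dt2.author)                    # -> "BSC"
print(dt2.injected_particles)        # -> 250000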

Initializing an instance of a Hecuba class without a name results in a regular in-memory object. However, its data can be persisted at any moment by calling the instance method make_persistent, provided by all Hecuba classes. This method expects a str name, in the same way the initializer did, which will be used to identify the data in the future. It sends the data to the data store, marks the object as persistent, and future accesses will go to the data store when necessary.

class ParticlesPositions(StorageDict):
    '''
    @TypeSpec dict <<particle_id:int>, x:double, y:double, z:double>
    '''

r=ParticlesPositions()

r.make_persistent("OutputData")

Hecuba Deployment

The first step is to download the source code from the following repository:
https://github.com/bsc-dd/hecuba. The directory structure is the following:

  • main folder: contains the following files and sub-folders:

    • README.md: helpful information for the installation and execution.

    • setup.py: Hecuba installer.

    • requirements.txt: requirements needed by Hecuba.

    • hecuba_py folder: contains the Hecuba python code. Should not be modified by users.

    • hecuba_core folder: contains the Hecuba C++ code. Should not be modified by users.

    • tests folder: contains the Hecuba tests. Should not be modified by users.

Find the instructions for the Hecuba Installation Procedure.

The basics

In this chapter we describe in detail the interface provided by Hecuba. We also illustrate how the supported data types and operations can be used.

Supported Data Types and Collections

Immutable types supported:

Data types

  • str, bool, decimal, float, int, blob, tuple, buffer.
  • double: floating point numbers will be stored as double precision numbers.

Collections

  • numpy.ndarray.
  • frozenset: supported in StorageObj only.

Mutable collections supported:

  • dict.
  • set: subject to restrictions, supported only by StorageDict (development underway).
  • list: groups several values for a given key in a StorageDict, e.g. dict[0] = [1,2,3].

Hecuba Data Classes

Storage Object

The StorageObj is the simplest abstraction provided by Hecuba. It acts like a namedtuple, or a dataclass, where the user can define attributes and access them. However, in this case, the user can choose which attributes will be persisted to the data store.

To declare instances of the StorageObj, the user first needs to define a class inheriting from StorageObj, as well as the data model of the persistent attributes. The format of the data model is a Python comment with one line per attribute. Each line must start with the keyword @ClassField and continue with the name of the attribute and its data type specification.

class ClassName(StorageObj):
    '''
    @ClassField attribute_name attribute_type
    '''

For example, the following code shows the definition of a class containing an attribute of type integer.

class MyClass(StorageObj):
    '''
    @ClassField MyAttribute_1 int
    '''
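
As a minimal usage sketch (the instance name "my_app.instance1" is illustrative; persistent instantiation is detailed in section Hecuba Classes instantiation):

o = MyClass("my_app.instance1")   # persistent instance
o.MyAttribute_1 = 42              # the assignment is sent to the persistent storage
v = o.MyAttribute_1               # reads the value back, possibly from the Hecuba cache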

When the user needs to use collections as attributes, the syntax needs to be further elaborated. For example, to define a Python dictionary it is necessary to specify the type of the keys and the type of the values. In this case, after the attribute type we can find the rest of the specifications within angular brackets.

class ClassName(StorageObj):
    '''
    @ClassField attribute_name attribute_type <attribute_type_specification>
    '''

For example, the following code adds a dictionary attribute: the key is of type Integer and the value a str.

class MyClass(StorageObj):
    '''
    @ClassField MyAttribute_1 int
    @ClassField MyAttribute_2 dict <<int>, str>
    '''

Each additional level required to complete a specification type can be added within angle brackets. For example, the following code adds the specification of a dictionary that has a key of type tuple, which is composed of an Integer and a str, and that has a value of type Integer.

class MyClass(StorageObj):
    '''
    @ClassField MyAttribute_1 int
    @ClassField MyAttribute_2 dict <<int>, str>
    @ClassField MyAttribute_3 dict <<int, str>, int>
    '''

Attributes of type dict allow the programmer to assign a name to each component of the dictionary (keys and values). These names can help users to give semantic meaning to the data, for instance when accessing the results of a dictionary or when exploring the persistent data with external tools.

class MyClass(StorageObj):
    '''
    @ClassField MyAttribute_1 int
    @ClassField MyAttribute_2 dict <<int>, str>
    @ClassField MyAttribute_3 dict <<int, str>, int>
    @ClassField MyAttribute_4 dict <<mykey1:int, mykey2:str>, myvalue:int>
    '''
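
As a hedged sketch of how these names can be used (the instance name "my_app.mydata" is illustrative; named access to value components is detailed in section Access to Hecuba objects):

so = MyClass("my_app.mydata")        # illustrative name
so.MyAttribute_4[42, "label"] = 7    # keys mykey1=42 and mykey2="label", value myvalue=7
v = so.MyAttribute_4[42, "label"]    # retrieves the value stored for that key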

Storage Dictionary

The StorageDict abstracts the underlying data model and exposes the user interface of a python dict. The mechanism to create instances of a StorageDict is the same as the StorageObj. A class that inherits from the StorageDict must be defined, and an annotation describing the data model of the keys and values added.

The data model definition must start with the keyword @TypeSpec and continue with the type of the keys, and the values.

class ClassName(StorageDict):
    '''
    @TypeSpec dict<<keys_specification>, values_specification>
    '''

For example, the following code shows the definition of a dictionary with one key of type Integer and a str value.

class MyClass(StorageDict):
    '''
    @TypeSpec dict<<int>, str>
    '''

Also, the user can set names to the keys and values to give semantic meaning. It might be desirable to access the results of a dictionary by their name, or when exploring the persistent data with external tools.

class MyClass(StorageDict):
    '''
    @TypeSpec dict<<mykey1:int>, myvalue:str>
    '''

Additional keys or values can be added to a StorageDict definition.

class MyClass(StorageDict):
    '''
    @TypeSpec dict<<mykey1:int, mykey2:str>, myvalue1:int, myvalue2:int, myvalue3:str>
    '''
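
As an illustrative sketch (the instance name is hypothetical), an entry with several values is inserted by grouping the values, and each component can later be accessed by name (see section Access to Hecuba objects):

d = MyClass("my_app.results")      # hypothetical name
d[1, "run_a"] = [10, 20, "ok"]     # myvalue1, myvalue2, myvalue3
entry = d[1, "run_a"]
print(entry.myvalue3)              # -> "ok"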

Distributed sets inside a StorageDict

The usage of distributed sets is somewhat restricted. When a set is used in a StorageDict, the value cannot contain any attribute other than a single set. The set supports the persistent storage functionalities for sets, for example:

from hecuba import StorageDict
class DictWithSet(StorageDict):
    '''
    @TypeSpec dict<<k1:str, k2:int>, s1:set<int>>
    '''
    
my_data = DictWithSet("my_app.my_data")
my_data["1", 1] = {1}
my_data["2", 2] = {1, 2, 3}
my_data["2", 2].remove(2)
other_data = DictWithSet("my_app.other_data")
other_data["2", 2] = my_data["2", 2].union(my_data["1", 1])
for key, set_value in other_data.items():
    if not 2 in set_value:
        other_data[key].add(2)

Cross-class referencing

A previously defined class can be referenced in the definition of a newer class. For instance, a custom StorageObj can have an attribute of type "MyClass", where the latter is a custom class that inherits from StorageObj or StorageDict.

The same is possible the other way around, a StorageDict can have as value(s) other StorageDicts or StorageObjs. In order to do so, the programmer needs to specify the data model of both:

# file is named classes.py
from hecuba import StorageDict, StorageObj
class MyObj(StorageObj):
    '''
    @ClassField a int
    @ClassField b str
    '''

class MyDict(StorageDict):
    '''
    @TypeSpec dict<<key:int>, my_obj:classes.MyObj>
    '''
    
my_dict = MyDict("my_app.my_data")
obj1 = MyObj()
obj1.a = 2
obj1.b = "hello"
my_dict[0] = obj1
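
A sketch of retrieving the nested object later (assuming the code above has been executed):

my_dict2 = MyDict("my_app.my_data")   # same name: accesses the same data
obj = my_dict2[0]                     # returns the stored MyObj instance
print(obj.a, obj.b)                   # -> 2 hello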

Storage Numpy

With the StorageNumpy class programmers can instantiate numpy ndarrays that can eventually be persisted. Using the StorageNumpy class there is no need to define any additional class: the user can use this Hecuba class directly in the code to instantiate numpy ndarrays. The shape of the array is inferred from the data assigned. Programmers can instantiate volatile numpy ndarrays and make them persistent later, or can directly instantiate persistent numpy ndarrays. The initial value for the StorageNumpy must be passed as a parameter of the constructor. The following fragment of code shows the different options to instantiate a StorageNumpy:

from hecuba import StorageNumpy
import numpy as np
n = StorageNumpy(np.arange(10).reshape(2,5))                  # n is a volatile StorageNumpy
n = StorageNumpy(np.arange(10).reshape(2,5), "persistent")    # n is a persistent StorageNumpy

Once instantiated, the programmer can use the functions of the numpy library to manipulate the StorageNumpys. Hecuba retrieves from disk (if needed) the values of the numpy ndarrays:

from hecuba import StorageNumpy
import numpy as np
A=StorageNumpy(np.arange(10).reshape(2,5), "matrixA")
B=StorageNumpy(np.arange(10).reshape(5,2), "matrixB")
res=np.dot(A,B) # res is a volatile StorageNumpy that programmers can persist if needed
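
For instance, the volatile result can be persisted afterwards (a sketch; the name "matrixAB" is illustrative):

res.make_persistent("matrixAB")   # res becomes a persistent StorageNumpy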

Persistent StorageNumpys are stored distributed in the database. They are split into blocks, transparently to the programmer. Hecuba assigns to each block an identifier that acts as the key of the block and decides which node holds it.

Hecuba Classes instantiation

Hecuba provides two different constructors to instantiate StorageNumpys and classes that inherit from StorageObj and StorageDict. The first one is to instantiate new objects that have no persistent data associated; the data will be kept in memory until the instance method make_persistent is called.

The second constructor is to instantiate objects that will make use of the persistent storage. In this case, the constructor receives a string as the parameter, which is the identifier of the data inside the data store. Hecuba checks if there already exists some persistent data with that identifier, and if it doesn’t exist Hecuba creates it.

If the identifier is already used in the data store, then Hecuba checks if the schema of that existing object matches the object that the programmer is trying to instantiate. If this is the case, Hecuba assumes that the programmer wants to get access to that object and completes the instantiation: any access will be performed on the existing object. If the schema does not match, the user code will fail with an exception.

Hecuba allows dealing with hierarchical namespaces and specifying several levels of the hierarchy with just one identifier: for example, directory name and file name in the case of file systems, or keyspace name and table name in the case of Apache Cassandra. The format of the identifier specifies that a dot must separate the identifiers of each level in the namespace. If the identifier does not contain a dot, then Hecuba interprets that the identifier refers just to the name of the lowest level in the namespace hierarchy (there are default values for the rest of the components that the user can configure through environment variables; see section Hecuba Configuration Parameters).

o1=MyClassName() # o1 is a volatile object

o2=MyClassName("Table") # o2 is a persistent object: the name of the table is "Table" and the keyspace is the default name used in this execution

o3=MyClassName("Keyspace.Table") # o3 is a persistent object: the name of the table is "Table" and the name of the keyspace is "Keyspace"

It is also possible to use the static method get_by_alias to instantiate an already existing persistent object.

o4=MyClassName.get_by_alias("Keyspace.Table")

Notice that Hecuba registers the schema of user-defined classes and thus, it is not possible to reuse those class names for a different class definition. Accessing an instance of such a redefined class will fail due to schema mismatch.

Access to Hecuba objects

From the point of view of the programmers, both objects with persistent data and objects without persistent data are accessed in the same way: like regular Python objects. However, Hecuba intercepts all accesses to a Hecuba object and executes the suitable code to refer to the involved data. Notice that some accesses to persistent data may be solved without accessing the data store because Hecuba implements a cache to keep recently used persistent data and thus, save accesses to the data store.

o1=MySOClass()               # instantiation of an object without persistent data
o1.dict_attr[0]=1            # access to a regular Python object in memory
o2=MySOClass("Table")        # instantiation of a persistent object (may already exist)
value1=o2.dict_attr[0]       # access to the data store to retrieve the data
value2=o2.dict_attr[0]       # access to the Hecuba cache in memory
o2.dict_attr[1]=2            # saved to the Hecuba cache in memory, to be stored in the database later

Hecuba allows defining StorageDicts with more than one value. This is implemented as a named tuple and thus, each component of the value can be referred to with the name assigned in the class specification or with its positional index.

class MyClass(StorageDict):
    '''
    @TypeSpec dict<<mykey1:int, mykey2:str>, myvalue1:int, myvalue2:int, myvalue3:str>
    '''

d=MyClass("dictname")        # dictname represents an already existing persistent StorageDict
i=d[0,"value"].myvalue2      # access to the second attribute of the value corresponding with key (0,"value")
i=d[0,"value"][1]            # access to the second attribute of the value corresponding with key (0,"value")

Making volatile data persistent

All Hecuba volatile objects can become persistent at any point. The programmer only needs to use the make_persistent method, passing as a parameter the identifier of the object in the data store. If the volatile object already contains data, all the data is eventually sent to the data store. From this point on, all modifications of the object will be considered persistent.

o1 = MyObj()
o1.a = 2
o1.b = "hello"
o1.make_persistent("myksp.mytable")

If the identifier is already used in the data store, then Hecuba checks if the schema of that existing object matches the object that the programmer is trying to persist. If this is the case, the persisting operation concludes successfully and the data is sent to the data store. If the schema does not match, the user code fails with an exception.

Synchronizing with data store

Hecuba implements some optimizations in the interaction with the database, such as caching and prefetching. This means that, even if an object is defined to be persistent, its contents may still be in memory. Moreover, Hecuba implements asynchronous writes to allow overlapping a computing phase of the application with the access to the data store and to reduce the number of interactions with it. That is, it is possible that for some time the persistent content of an object exists only in memory. The programmer can force at any moment the actual sending of the data to the data store using the sync() method. Notice that when a persistent object is deallocated (by the garbage collector), the sync method is called automatically, so before the process ends the data is guaranteed to be coherently stored in the database.

o1 = MyClass("myname")
o1.myattr = 4
o1.sync()    # guarantees that the data is stored in the database, so any other process that instantiates the object will access up-to-date data

Methods for Iterating

In order to support data partitioning, Hecuba classes implement the method split. This method returns an iterator of partitions, where each partition is a new object of the same class containing just a part of the base object. When using the split method, no data is loaded from storage until the data in the partition is accessed.

Partitioning of a dataset was introduced to support the implementation of data-driven distributed applications: developers can define parallel tasks, each working on one of these chunks of data. Hecuba supports an additional level of iteration that allows iterating over each of these partitions, using the Python iteration methods.

The current implementation of the split method does not support partitioning of volatile objects.

The current criterion to create the partitions is oriented to favor load balancing between processors and to enhance data locality. Thus, the partition is considered the unit of work, and the split method creates enough partitions to facilitate a balanced work assignment between processors.

This method is only implemented for StorageDicts and StorageNumpys, as they are the classes intended to hold big collections of data. Notice that if a StorageObj contains some attributes of these classes, then it is possible to partition each of these collections using their own class method.

Next, we describe the specifics of this method for StorageDict and for StorageNumpy.

  • Iterating over a StorageDict: Hecuba takes into account the location of all data across the distributed storage system and assigns to each partition of a StorageDict only data that resides on the same node. This way the task scheduler has the opportunity to consider data location as a factor when taking task assignment decisions. Currently, the number of partitions is a parameter that the user can configure to tune the load balancing (see section Hecuba configuration parameters). As part of our future work, we plan to automate the calculation of this value.
# sd is the instance of a persistent *StorageDict*
for sd_partition in sd.split():             # Iterator on blocks
    # Here we'll have access to each partition of the StorageDict
    for key in sd_partition.keys():         # Iterator on elements of a block using the python method *keys*
        do_something(key)
  • Iterating over a StorageNumpy: by default, each partition of a StorageNumpy corresponds to a StorageNumpy block. In the current implementation, the size of the block is fixed, but in future releases it will be a configurable parameter. The distribution of blocks in the storage follows the z-order algorithm, to enhance a uniform distribution. In the case of 2-dimensional StorageNumpys, the split method supports a parameter (which is ignored for StorageDicts and for StorageNumpys with a different number of dimensions). This parameter is intended to support the two typical patterns of access to matrices: by rows and by columns. Using this parameter, each partition is either composed of a row of blocks (parameter cols=False) or of a column of blocks (parameter cols=True).

# sn is the instance of a persistent StorageNumpy
for sn_block in sn.split():                 # Iterator on blocks 
     for elt in sn_block:                   # Iterator on numpy ndarrays
          #do something with the numpy element     

# if sn is the instance of a persistent 2D-StorageNumpy it is possible to use the cols parameter
for sn_block in sn.split(cols=True):        # Iterator on blocks. Each partition is a column of StorageNumpy blocks
     for elt in sn_block:                   # Iterator on numpy ndarrays
          #do something with the numpy element     

Deleting data from the data store

In order to delete a persistent object from the data store, Hecuba provides the method del_persistent. This method deletes all data and metadata associated with the specified object.

o1.o1dict.del_persistent()
o1.del_persistent()

Support to automatic parallelization: PyCOMPSs

COMPSs (COMP Superscalar) is a framework which aims to ease the development and execution of applications for distributed infrastructures, such as clusters, grids and clouds. This framework implements several bindings to support applications written in different languages. The COMPSs runtime exports an interface to implement some optimizations regarding data storage that, for example, can be used to enhance data locality. Here we only introduce the main features implemented by Hecuba to interact with COMPSs and, in particular, with the Python binding (from now on, PyCOMPSs); more documentation about this can be found in the COMPSs Manual.

Tasks in PyCOMPSs

PyCOMPSs allows programmers to write sequential code and to indicate, through a decorator, which functions can be executed in parallel. The COMPSs runtime interprets this decorator and executes, transparently to the programmer, all the code necessary to schedule each task on a computing node, to manage dependencies between tasks, and to send and serialize the parameters and returns of the tasks.

When the input/output parameters of a task are persistent objects (i.e. their classes implement the Storage API defined to interact with PyCOMPSs), the runtime asks the storage system for data locality information and uses it to try to schedule the task on the node containing the data. This way no data sending or serialization is needed.

The following code shows an example of a PyCOMPSs task. The input parameter of the task could be an object resulting from splitting a StorageDict. In this example, the return of the task is a Python dictionary.

@task(returns = dict)
def wordcountTask(partition):
    partialResult = {}
    for word in partition.values():
        if word not in partialResult:
            partialResult[word] = 1
        else:
            partialResult[word] = partialResult[word] + 1
    return partialResult
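
As a hedged sketch of how such tasks could be launched over the partitions of a StorageDict (the dictionary name words is hypothetical; compss_wait_on belongs to the PyCOMPSs API):

from pycompss.api.api import compss_wait_on

# words is a hypothetical persistent StorageDict containing the input text
partial_results = [wordcountTask(partition) for partition in words.split()]
partial_results = compss_wait_on(partial_results)   # wait for the tasks and fetch their results
result = {}
for partial in partial_results:
    for word, count in partial.items():
        result[word] = result.get(word, 0) + count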

How to execute

Sequential applications

Running a sequential application with Hecuba is really simple once the requirements have been satisfied. We just need to execute the application with Python, and it will use the configured data store:

python3 myApp.py

This command assumes that you have a local Cassandra instance running. If you need to connect to a different Cassandra instance, you must set the variables CONTACT_NAMES and NODE_PORT to the suitable values (see Hecuba Configuration Parameters).
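
For example, a minimal sketch of connecting to a remote cluster before launching the application (the node names are illustrative):

export CONTACT_NAMES=node1,node2,node3
export NODE_PORT=9042
python3 myApp.py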

Parallel applications on queue based systems

To run a parallel Hecuba application using PyCOMPSs, you should execute the enqueue_compss command setting the options storage_props and storage_home. The storage_props option is mandatory and should contain the path of an existing file. This file can contain all the Hecuba configuration options that the user needs to set (it can be an empty file). The storage_home option contains the path to the Hecuba implementation of the Storage API required by COMPSs. Next, we show an example of how to use PyCOMPSs and Hecuba to run the Python application in the file myapp.py. In this example, we ask PyCOMPSs to allocate 4 nodes and to use the scheduler that enhances data locality for tasks using persistent objects. We assume that the variable HECUBA_ROOT contains the path to the installation directory of Hecuba.

    PATH_TO_COMPSS_INSTALLATION/enqueue_compss \
    --num_nodes=4 \
    --storage_props=storage_props.cfg \
    --storage_home=$HECUBA_ROOT/compss/ \
    --scheduler=es.bsc.compss.scheduler.fifodatanew.FIFODataScheduler \
    --lang=python \
    $(pwd)/myapp.py

Hecuba Configuration Parameters

There are several parameters that can be defined when running our application. The basic parameters are the following:

  • CONTACT_NAMES (default value: 'localhost'): list of the Storage System nodes separated by a comma (example: export CONTACT_NAMES=node1,node2,node3)

  • NODE_PORT (default value: 9042): Storage System listening port

  • EXECUTION_NAME (default value: 'my_app'): default name for the upper level in the app namespace hierarchy

  • CREATE_SCHEMA (default value: True): if set to True, Hecuba will create its metadata structures in the storage system. Notice that these metadata structures are kept from one execution to another, so it is only necessary to create them if you have deployed the storage system from scratch.

Hecuba Advanced Configuration Parameters

  • NUMBER_OF_BLOCKS (default value: 1024): Number of partitions in which the data will be divided for each node

  • CONCURRENT_CREATION (default value: False): you should set it to True if you need to support concurrent persistent object creation. Setting this variable slows down the creation task, so you should keep it set to False if only sequential creation is used or if the concurrent creation involves disjoint objects

  • LOAD_ON_DEMAND (default value: True): if set to True, data is retrieved only when it is accessed. If set to False, data is loaded when an instance of the object is created. It is necessary to set it to False if your code uses those functions of the numpy library that do not use the interface to access the elements of the numpy ndarray, as in that case the on-demand loading is never triggered.

  • DEBUG (default value: False): if set to True Hecuba shows during the execution of the application some output messages describing the steps performed

  • SPLITS_PER_NODE (default value: 32): number of partitions that the split method generates

  • MAX_CACHE_SIZE (default value: 1000): size of the cache. You should set it to 0 (thus deactivating the cache) if the persistent objects are small enough to be kept in memory while they are in use

  • PREFETCH_SIZE (default value: 10000): Number of elements read in advance when iterating on a persistent object

  • WRITE_BUFFER_SIZE (default value: 1000): size of the internal buffer used to group insertions to reduce the number of interactions with the storage system

  • WRITE_CALLBACKS_NUMBER (default value: 16): number of concurrent on-the-fly insertions that Hecuba can support

  • REPLICATION_STRATEGY (default value: 'SimpleStrategy'): strategy to follow in the Cassandra database

  • REPLICA_FACTOR (default value: 1): the number of replicas of each piece of data available in the Cassandra cluster

Hecuba Specific Configuration Parameters for the storage_props file of PyCOMPSs

  • CONTACT_NAMES (default value: empty): if this variable is set in the storage_props file, then COMPSs assumes that it contains the list of nodes of an already running Cassandra cluster. If this variable is not set in the storage_props file, then the enqueue_compss command will use the Hecuba scripts to deploy and launch a new Cassandra cluster using all the nodes assigned to workers.

  • RECOVER (default value: empty): if this variable is set in the storage_props file, then the enqueue_compss command will use the Hecuba scripts to deploy and launch a new Cassandra cluster starting from the snapshot identified by the variable. Notice that in this case, the number of nodes used to generate the snapshot should match the number of workers requested by the enqueue_compss command.

  • MAKE_SNAPSHOT (default value: 0): the user should set this variable to 1 in the storage_props file if a snapshot of the database should be generated and stored once the application ends the execution (this feature is still under development, users can currently generate snapshots of the database using the c4s tool provided as part of Hecuba).

Access to Hecuba from C++

We have added a layer that allows the interaction with the Hecuba core layer, without using the Python interface. This C++ interface is work in progress and currently does not cover all the functionality implemented by Hecuba. In addition, we expect to improve the usability of the interface in the next releases.

Class declaration

Currently the C++ interface of Hecuba supports three types of Hecuba objects:

  • StorageDict
  • StorageObject
  • StorageNumpy

Programmers can define any class inheriting from one of these classes. The method to define each class depends on the base class.

Classes derived from StorageDicts

StorageDict is a templatized class with a template composed of three elements: the class of the key, the class of the value and the name of the derived class.

The class of the key is the templatized class KeyClass defined by Hecuba, where the template is composed of a variable number of types: one for each element of the key. The following sentences define aliases for three different key classes with different numbers of attributes of different types:

using Key1 = KeyClass<int32_t>;
using Key2 = KeyClass<int32_t, float>;
using Key3 = KeyClass<int32_t, float, std::string>;

The first element of the key acts as the partition key and the rest act as clustering keys.

The class of the value is the class ValueClass also defined by Hecuba and also with a variadic template: one type for each element of the value. For example:

using Value1 = ValueClass<std::string>;
using Value2 = ValueClass<int32_t, int32_t, std::string>;

The types of both keys and values can be basic types or other Hecuba derived classes.

Thus, the following sentence defines a StorageDict that has one integer as key and a string as value:

class myDictClass: public StorageDict<Key1,Value1,myDictClass> {
    // can be empty
};

Classes derived from StorageObjects

Hecuba provides the macro HECUBA_ATTRS to define the persistent attributes of classes derived from StorageObject. The parameters of this macro are comma-separated pairs composed of the type of an attribute and the name of that attribute. The number of persistent attributes is currently limited to 1024. The following sentence defines a class derived from StorageObject with three attributes: attr1 of type uint64_t, attr2 of type std::string and attr3 of type int32_t:

class myObjClass: public StorageObject {
    HECUBA_ATTRS(
        uint64_t, attr1,
        std::string, attr2,
        int32_t, attr3
    )
};

Classes derived from StorageNumpy

At this moment, Hecuba does not support any specialization of the StorageNumpy; for this reason, the C++ interface of Hecuba does not support the definition of classes derived from StorageNumpy. Programmers can instantiate StorageNumpys without any additional class definition.

Supported basic types

Currently the basic types that C++ Hecuba supports are:

  • int32_t, float, double, std::string, bool, char, int64_t.

New objects instantiation

Objects are instantiated as any regular C++ object:

myObjClass o;
myDictClass d;
StorageNumpy sn;

The only consideration is that if the programmer implements in the derived class a constructor with parameters, that constructor should explicitly call the default constructor of the base class.

In the case of the instantiation of a StorageNumpy, Hecuba implements an additional constructor that allows setting the numpy data during the instantiation. The signature of this constructor is:

StorageNumpy::StorageNumpy(void *data, std::vector<uint32_t> metas);

And this constructor is invoked in a declaration like the following one:

StorageNumpy sn (data, metas);

Where data is a pointer to a memory region with the content of the numpy in C order, and metas is a vector that contains the size of each dimension of the numpy.ndarray. Keep in mind that the current implementation of StorageNumpy only supports numpy.ndarrays of elements of type float.

Persisting new objects

Once an object is instantiated, the first step is to persist it. The connection with the database is performed when the first Hecuba object is persisted, and this connection remains active during the whole execution of the process. The method to persist a Hecuba object is make_persistent:

void make_persistent(std::string name)

The parameter is the name of the persistent object (a string that will identify the persistent object in the database).

This operation will prepare in the database the tables that will contain the object data. The operation to persist a Hecuba object is the same for all types of Hecuba base object.

d.make_persistent("mydictname");

All the insertions in a persisted object are sent asynchronously to the database.

At this moment it is not possible to insert data in a volatile Hecuba object but in the next releases we will extend this functionality.

The method make_persistent also generates a Python file with the class definition of the object (for StorageDicts and StorageObjects), which can be used from any Python code that needs to access this persistent object.

Retrieving already existing persistent objects

Hecuba implements the method getByAlias to connect with a previously persisted Hecuba Object:

void getByAlias(std::string name)

The parameter of this method is the name of the object (the one used in the persisting operation).

myDictClass d;

d.getByAlias("mydictname");

Object Access

The interface to access Hecuba objects depends on the Hecuba base class.

Accessing StorageDicts

In the case of StorageDict, the access is implemented with the same operator as in C++ maps or vectors: Hecuba overrides the indexing operator ([]).

In the insertion operation, the user has to specify the element of type KeyClass that acts as the index, and the element of type ValueClass that needs to be inserted. For example, if Key is an alias for a KeyClass composed of two elements of type int32_t, and Value an alias for a ValueClass composed of three elements of type int32_t, std::string, and float, the following sentence represents a valid insertion in d:

d[Key(1,2)] = Value(3,"hi",(float)3.14)

In the read operation, the indexing operator returns a Value object. To facilitate the extraction of each element of the Value, we have implemented the same interface that the standard tuples of C++ offer. The following sentences extract each element of the Value:

Value v = d[Key(1,2)];
int32_t v1 = Value::get<0>(v);
std::string v2 = Value::get<1>(v);
float v3 = Value::get<2>(v);

The C++ interface of Hecuba also implements an iterator on the keys of a persistent StorageDict. The following loop accesses all the elements of a StorageDict:

Key k;
Value v;
for(auto it = d.begin(); it != d.end(); it++) {
    k=*it;
    v=d[k];
}

Accessing StorageObjects

In the case of StorageObjects, the operator to access the attributes is the same as the one used to access attributes of regular C++ objects: Hecuba overrides the accessing operator (.). If the user instantiates a StorageObject with one attribute named attr1 of type int32_t, then the following sentence will assign 1 to the attribute attr1:

o.attr1 = 1;

And the following sentence will read attr1 from o:

int32_t v_read = o.attr1;

Accessing StorageNumpys

In the case of StorageNumpy, the current implementation of the C++ interface of Hecuba only supports the insertion of the whole numpy data. The insertion can be performed in two ways:

  • During the instantiation of the StorageNumpy, using the constructor that receives both the data and the metadata.

  • Using the method setNumpy. The signature of this method is:

StorageNumpy::setNumpy(void *data, std::vector<uint32_t> metadata)

Where data is a pointer to a memory region with the content of the numpy.ndarray in C order, and metadata is a vector with the size of each dimension of the numpy.ndarray.

Synchronization with disk

All the insertions in persistent Hecuba objects are sent asynchronously to the database. Hecuba guarantees that at the end of the session all the data will be up to date in the database. However, Hecuba offers a method to explicitly synchronize an object with the database. The signature of this method is the following:

void sync()

And it can be used with any type of persistent Hecuba object:

d.sync();

Nested Objects

Both StorageDicts and StorageObjects can contain other Hecuba objects, using the class name to declare the attribute.

For example, a StorageDict indexed with integers and with values of type StorageNumpy could be defined as follows:

using Key = KeyClass<int32_t>;
using Value = ValueClass<StorageNumpy>;
class nestedDict: public StorageDict<Key,Value,nestedDict> {
};

And then used as follows:

nestedDict nd;
nd.make_persistent("nd_dict");
Key k(0);
StorageNumpy sn(data,metadata); // data and metadata are variables initialized properly
sn.make_persistent("mynumpy");
Value v(sn);
nd[k]=v;

We can use this dictionary as an attribute of an object:

class nestedObject: public StorageObject{
public:
    HECUBA_ATTRS(
        std::string, description,
        nestedDict, content
    )
};

And use it:

nestedObject no;
no.make_persistent("nested_object");
no.description = "numpys generated today";
no.content = nd;

Compiling C++ applications using Hecuba

Assuming that the variable HECUBA_ROOT contains the path where Hecuba is installed, applications using the C++ interface of Hecuba should be compiled using the following compiling command:

g++ -o  application \
	application.cpp \
	-std=c++11 \
	-I ${HECUBA_ROOT}/include \
	-I ${HECUBA_ROOT}/include/hecuba \
	-L${HECUBA_ROOT}/lib \
	-lhfetch \
	-Wl,-rpath,${HECUBA_ROOT}/lib

Lambda architecture in Hecuba

Hecuba facilitates the implementation of applications that combine on-line and off-line processing by implementing a lambda architecture. With this architecture, the data is streamed concurrently with its insertion in storage.

Currently we have implemented this feature for StorageDicts. The user has to define the class as a subclass of both the StorageDict and StorageStream Hecuba classes. With this class definition, every setitem sends the data asynchronously to the storage and synchronously through the stream.

Example: Python specification (to be used from Python code)

class mydict (StorageDict, StorageStream):
   '''
   @TypeSpec dict <<key:int>,val:numpy.ndarray>
   '''

Example: C++ class definition (to be used from C++ code)

class myDictClass: public StorageDict<Key,Value,myDictClass>,
                   public StorageStream {
};

The consumer should use the function poll to receive the streamed data. At this point of the implementation, poll is only supported by the Python interface of Hecuba.

def poll() # returns the key and the value of the streamed data

Example: consumer in Python

d=mydict("dictname")
k,v=d.poll() # k will be 42 and v will be sn, the values set by the producer below

The code for the producer is the same as when the streaming capability is not set.

Example: producer in Python

d=mydict("dictname")
d[42]=sn # sn is a StorageNumpy

Example: producer in C++

myDictClass d;
d.make_persistent("streamdict");

// add here the declaration of k and v

d[k] = v; //will store this item in the database
          // and will send it through the stream

Notice that the data is stored in the database as a regular StorageDict, so it is possible to implement applications to perform off-line analysis without activating the stream feature in the class definition.

A full example can be found in Hecuba Streaming Examples, containing a producer implemented both in Python and C++ and its corresponding consumer in Python.
