documentation

VicentePerezSoloviev · Jun 5, 2020 · ce53137
1 parent 4383b1e
commit ce53137

Showing 44 changed files with 17,633 additions and 69 deletions.
diff --git a/EDApy.egg-info/PKG-INFO b/EDApy.egg-info/PKG-INFO
@@ -0,0 +1,138 @@
+Metadata-Version: 2.1
+Name: EDApy
+Version: 0.0.1
+Summary: This is a package where some estimation of distribution algorithms are implemented.
+Home-page: https://github.com/VicentePerezSoloviev/EDApy
+Author: Vicente P. Soloviev
+Author-email: [email protected]
+License: UNKNOWN
+Description: # EDApy
+
+        ## Description
+
+        This package implements some Estimation of Distribution Algorithms (EDAs). EDAs are a type of evolutionary algorithm. Depending on the type of EDA, different dependencies among the variables can be considered.
+
+        Three EDAs have been implemented:
+        * Binary univariate EDA. It can serve as a simple example of an EDA, or be used for feature selection.
+        * Continuous univariate EDA. 
+        * Continuous multivariate EDA. 
+
+        ## Examples
+
+        #### Binary univariate EDA
+        It can serve as a simple example of an EDA, or be used for feature selection. The cost function to optimize is the metric of the model. An example follows.
+        ```python
+        from EDApy import EDA_discrete as EDAd
+        import pandas as pd
+
+        def check_solution_in_model(dictionary):
+            MAE = prediction_model(dictionary)  # prediction_model: user-supplied model that returns the MAE
+            return MAE
+
+        vector = pd.DataFrame(columns=['param1', 'param2', 'param3'])
+        vector.loc[0] = 0.5
+
+        EDA = EDAd(MAX_IT=200, DEAD_ITER=20, SIZE_GEN=30, ALPHA=0.7, vector=vector, 
+                    cost_function=check_solution_in_model, aim='minimize')
+
+        bestcost, solution, history = EDA.run(output=True)
+        print(bestcost, solution)
+        print(history)
+        ```
+
+        The example implements feature selection (FS) for a prediction model (prediction_model) that depends on some variables. The prediction model receives a dictionary with keys 'paramN' (N from 1 to the number of parameters) and values 1 or 0, depending on whether that variable should be included or not. The model returns an MAE, which we intend to minimize.
+
+        The EDA receives as input the maximum number of iterations, the number of iterations allowed without improvement of the best global cost, the generation size, the fraction of each generation to select as best individuals, the initial vector of probabilities for each variable, the cost function to optimize, and the aim ('minimize' or 'maximize') of the optimizer.
+
+        Vector probabilities are usually initialized to 0.5 to start from an equilibrium situation. The EDA returns the best cost found, the best combination of variables, and the history of costs found, which can be plotted.
+
+        #### Continuous univariate EDA
+
+        This EDA is used when some continuous parameters must be optimized. 
+        ```python
+        from EDApy import EDA_continuous as EDAc
+        import pandas as pd
+        import numpy as np
+
+        weights = [20, 10, -4]
+
+        def cost_function(dictionary):
+            function = weights[0]*dictionary['param1']**2 + weights[1]*(np.pi/dictionary['param2']) - 2 - weights[2]*dictionary['param3']
+            if function < 0:
+                return 9999999
+            return function
+
+        vector = pd.DataFrame(columns=['param1', 'param2', 'param3'])
+        vector['data'] = ['mu', 'std', 'min', 'max']
+        vector = vector.set_index('data')
+        vector.loc['mu'] = [5, 8, 1]
+        vector.loc['std'] = 20
+        vector.loc['min'] = 0
+        vector.loc['max'] = 100
+
+        EDA = EDAc(SIZE_GEN=40, MAX_ITER=200, DEAD_ITER=20, ALPHA=0.7, vector=vector, 
+                    aim='minimize', cost_function=cost_function)
+        bestcost, params, history = EDA.run()
+        print(bestcost, params)
+        print(history)
+        ```
+
+        In this case, the aim is to minimize a cost function. The three parameters to optimize are continuous. This EDA must be initialized with some initial values (mu) and an initial range to search (std). Optionally, a minimum and a maximum can be specified.
+
+        As in the binary EDA, the best cost found, the solution and the cost evolution are returned.
+
+        #### Continuous multivariate EDA
+
+        In this case, dependencies among the variables are considered and managed with a Gaussian Bayesian Network. This EDA must be initialized with historical records in order to try to find the optimum result. A parameter (beta) is included to control the influence of the historical records in the final solution. Some of the variables can be evidences (fixed values for which we want to find the optimum of the other variables). 
+        The optimizer will find the optimum values of the non-evidence-variables based on the value of the evidences. This is widely used in problems where dependencies among variables must be considered.
+
+        ```python
+        from EDApy import EDA_multivariate as EDAm
+        import pandas as pd
+
+        # forbidden arcs: one row per arc (from -> to)
+        aux = {'from': 'param1', 'to': 'param2'}
+        blacklist = pd.DataFrame([aux], columns=['from', 'to'])
+
+        data = pd.read_csv(path_CSV)  # columns param1 ... param5
+        evidences = [['param1', 2.0],['param5', 6.9]]
+
+        def cost_function(dictionary):
+            return dictionary['param1'] + dictionary['param2'] + dictionary['param3'] + dictionary['param4'] + dictionary['param5']
+
+        EDA = EDAm(MAX_ITER=200, DEAD_ITER=20, data=data, ALPHA=0.7, BETA=0.4, cost_function=cost_function,
+                   evidences=evidences, black_list=blacklist, n_clusters=6, cluster_vars=['param1', 'param5'])
+        output = EDA.run(output=True)
+
+        print('BEST', output.best_cost_global)
+        ```
+        This is the most complex EDA implemented. Bayesian networks are used to represent an abstraction of the search space of each iteration, from which new individuals are sampled. As a graph is a representation with nodes and arcs, some arcs can be forbidden by the black list (a pandas dataframe with the forbidden arcs).
+
+        In this case the cost function is a simple sum of the parameters. The evidences are variables that have fixed values and are not optimized. In this problem, the output would be the optimum value of the parameters which are not in the evidences list.
+        Because of the evidences, the data is clustered by similar values to help the structure learning algorithm find the arcs. Thus, the number of clusters is an input, as well as the variables that are considered in the clustering.
+
+        In this case, the output is the class instance itself, which can be saved as a pickle in order to explore its attributes. One of the attributes is the structure of the optimum generation, which can be plotted to observe the dependencies among the variables. The function to plot the structure is the following:
+        ```python
+        from EDApy import print_structure
+        print_structure(structure=structure, var2optimize=['param2', 'param3', 'param4'], evidences=['param1', 'param5'])
+        ```
+
+        ![Structure graph plot](/structure.PNG "Structure of the optimum generation found by the EDA")
+
+        ## Getting started
+
+        #### Prerequisites
+        R must be installed to use the multivariate EDA, along with the R libraries c("bnlearn", "dbnR", "data.table").
+        To call R from Python, the rpy2 package must also be installed.
+
+        #### Installing
+        ```
+        pip install git+https://github.com/vicenteperezsoloviev/EDApy.git#egg=EDApy
+        ```
+
+Platform: UNKNOWN
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
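The README text added in this PKG-INFO diff describes the binary univariate EDA in words: sample a generation from a vector of per-variable probabilities (initialized to 0.5), keep the best fraction of individuals, re-estimate the probabilities from them, and stop after a maximum number of iterations or after a run of iterations with no improvement. The loop below is a minimal, standard-library sketch of that idea so the parameters are easier to follow. It is not EDApy's implementation: the function name `binary_univariate_eda`, its lowercase argument names, and the `toy_cost` function are hypothetical and only mirror the roles of `SIZE_GEN`, `MAX_IT`, `DEAD_ITER` and `ALPHA` from the README example.

```python
import random


def binary_univariate_eda(cost_function, names, size_gen=30, max_iter=200,
                          dead_iter=20, alpha=0.7, seed=0):
    """Illustrative binary univariate EDA that minimizes cost_function."""
    rng = random.Random(seed)
    probs = {name: 0.5 for name in names}  # start from an equilibrium situation
    n_selected = max(1, int(alpha * size_gen))
    best_cost, best_solution = float("inf"), None
    history, stalled = [], 0

    for _ in range(max_iter):
        # Sample a generation: each variable is 1 with its current probability.
        generation = [{n: int(rng.random() < probs[n]) for n in names}
                      for _ in range(size_gen)]
        # Truncation: keep the alpha fraction with the lowest cost.
        generation.sort(key=cost_function)
        selected = generation[:n_selected]
        # Re-estimate each probability as the frequency of 1s among the selected.
        probs = {n: sum(ind[n] for ind in selected) / n_selected for n in names}

        local_cost = cost_function(generation[0])
        if local_cost < best_cost:
            best_cost, best_solution, stalled = local_cost, dict(generation[0]), 0
        else:
            stalled += 1
        history.append(best_cost)  # best cost found so far
        if stalled >= dead_iter:
            break

    return best_cost, best_solution, history


def toy_cost(individual):
    # Hypothetical stand-in for a model metric such as the MAE.
    return 3 - 2 * individual["param1"] - individual["param3"] + 0.5 * individual["param2"]


names = ["param1", "param2", "param3"]
best_cost, solution, history = binary_univariate_eda(toy_cost, names)
print(best_cost, solution)
```

Because each variable is modelled independently, this kind of EDA cannot capture interactions between features, which is the motivation the README gives for the multivariate variant.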
diff --git a/EDApy.egg-info/SOURCES.txt b/EDApy.egg-info/SOURCES.txt
@@ -0,0 +1,16 @@
+README.md
+setup.py
+EDApy/__init__.py
+EDApy.egg-info/PKG-INFO
+EDApy.egg-info/SOURCES.txt
+EDApy.egg-info/dependency_links.txt
+EDApy.egg-info/top_level.txt
+EDApy/optimization/__init__.py
+EDApy/optimization/multivariate/EDA_multivariate.py
+EDApy/optimization/multivariate/__BayesianNetwork.py
+EDApy/optimization/multivariate/__clustering.py
+EDApy/optimization/multivariate/__init__.py
+EDApy/optimization/multivariate/__matrix.py
+EDApy/optimization/univariate/__init__.py
+EDApy/optimization/univariate/continuous.py
+EDApy/optimization/univariate/discrete.py
diff --git a/EDApy.egg-info/dependency_links.txt b/EDApy.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/EDApy.egg-info/top_level.txt b/EDApy.egg-info/top_level.txt
@@ -0,0 +1 @@
+EDApy
diff --git a/EDApy/optimization/__init__.py b/EDApy/optimization/__init__.py
@@ -0,0 +1,6 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# __init__.py
+
+# empty
diff --git a/EDApy/optimization/multivariate/EDA_multivariate.py b/EDApy/optimization/multivariate/EDA_multivariate.py
@@ -16,6 +16,38 @@
 
 
 class EDAgbn:
+
+    """Multivariate Estimation of Distribution algorithm. Best individuals of each generation are selected and modelled
+    by a Gaussian Bayesian network, from where new individuals are sampled. Some of the variables might be evidences
+    (fixed values). The optimizer will find the optimum values of the non-evidences variables for those evidences.
+    It is also possible to control the influence of the historic (from which the EDA is initialized) in the selection
+    of the best individuals.
+
+    :param MAX_ITER: Maximum number of iterations of the algorithm
+    :type MAX_ITER: int
+    :param DEAD_ITER: Number of iterations with no best cost improvement, before stopping
+    :type DEAD_ITER: int
+    :param data: data of the historic
+    :type data: pandas dataframe
+    :param ALPHA: percentage of population to select in the truncation
+    :type ALPHA: float [0,1]
+    :param BETA: percentage of influence of the individual likelihood in the historic
+    :type BETA: float [0,1]
+    :param cost_function: cost function to minimize
+    :type cost_function: callable function which receives a dictionary as input and returns a numeric value
+    :param evidences: names of the evidence variables and their fixed values
+    :type evidences: list of two-element lists, each holding a variable name and its value [name, value]
+    :param black_list: forbidden arcs in the structures
+    :type black_list: pandas dataframe with two columns (from, to)
+    :param n_clusters: number of clusters into which the data can be grouped. The cluster is appended in each iteration
+    :type n_clusters: int
+    :param cluster_vars: list of names of variables to consider in the clustering
+    :type cluster_vars: list of strings
+
+    :raises Exception: cost function is not callable
+
+    """
+
     # initializations
     # structure = -1
 
@@ -29,19 +61,7 @@ class EDAgbn:
     def __init__(self, MAX_ITER, DEAD_ITER, data, ALPHA, BETA, cost_function,
                  evidences, black_list, n_clusters, cluster_vars):
 
-        """
-        Initialize the class
-        :param MAX_ITER: Maximum number of iterations of the algorithm
-        :param DEAD_ITER: Number of iterations with no best cost improvement, before stopping
-        :param data: pandas dataframe with the data of the historic
-        :param ALPHA: percentage of population to select in the truncation. Range [0,1]
-        :param BETA: percentage of influence of the individual likelihood in the historic. Range [0,1]
-        :param cost_function: cost function to minimize
-        :param evidences: list of two-fold-lists with name of variable, and fixed value. [[name, value], ...]
-        :param black_list: pandas dataframe with two columns (from, to), with the forbidden arcs in the structures
-        :param n_clusters: number of clusters in which, the data can be grouped. The cluster is appended in each
-        iteration
-        :param cluster_vars: list of names of variables to consider in the clustering
+        """Constructor method
         """
 
         self.MAX_ITER = MAX_ITER
@@ -92,9 +112,7 @@ def __init__(self, MAX_ITER, DEAD_ITER, data, ALPHA, BETA, cost_function,
 
     def __initialize_data__(self):
 
-        """
-        Initialize the dataset. Assign a column cost to each individual
-        :return: update initial generation
+        """Initialize the dataset. Assign a column cost to each individual
         """
 
         indexes = list(self.generation.index)
@@ -108,10 +126,8 @@ def __initialize_data__(self):
 
     def truncate(self):
 
-        """
-        Select the best individuals of the generation. In this case, not only the cost is considered. Also the
+        """Select the best individuals of the generation. In this case, not only the cost is considered. Also the
         likelihood of the individual in the initial generation. This influence is controlled by beta parameter
-        :return: update generation
         """
 
         likelihood_estimation = bnlearn_package.logLik_bn_fit
@@ -134,11 +150,12 @@ def truncate(self):
 
     def sampling_multivariate(self, fit):
 
-        """
-        Calculate the parameters mu and sigma to sample from a multivariate normal distribution.
+        """Calculate the parameters mu and sigma to sample from a multivariate normal distribution.
+
         :param fit: bnfit object from R of the generation (structure and data)
-        :return: name in order of the parameters returned. mu and sigma parameters of the multivariate
-        normal distribution
+        :type fit: bnfit object from R
+        :return: name in order of the parameters returned. mu and sigma parameters of the multivariate normal distribution
+        :rtype: list, float, float
         """
 
         hierarchical_order = bnlearn_package.node_ordering(self.structure)
@@ -225,9 +242,10 @@ def sampling_multivariate(self, fit):
 
     def new_generation(self):
 
-        """
-        Build a new generation from the parameters calculated.
-        :return: update the generation to the new group of individuals
+        """Build a new generation from the parameters calculated and update the generation to the new group of individuals
+
+        :return: mean and sigma of the individuals costs of the generation
+        :rtype: float, float
         """
 
         valid_individuals = 0
@@ -293,10 +311,10 @@ def new_generation(self):
 
     def soft_restrictions(self, NOISE):
 
-        """
-        Add Gaussian noise to the evidence variables
+        """Add Gaussian noise to the evidence variables
+
         :param NOISE: std of the normal distribution from where noise is sampled
-        :return: update generation variables
+        :type NOISE: float
         """
 
         number_samples = len(self.generation)
@@ -308,21 +326,21 @@ def soft_restrictions(self, NOISE):
 
     def __choose_best__(self):
 
-        """
-        Select the best individual of the generation
+        """Select the best individual of the generation
         :return: cost of the individual, and the individual
+        :rtype: float, pandas dataframe
         """
 
         minimum = self.generation['COSTE'].min()
         best_ind_local = self.generation[self.generation['COSTE'] == minimum]
         return minimum, best_ind_local
 
     def run(self, output=True):
-
-        """
-        Running method
+        """Running method
         :param output: if desired to print the output of each individual. True to print output
+        :type output: boolean
         :return: the class is returned, in order to explore all the parameters
+        :rtype: self python class
         """
 
         self.__initialize_data__()
@@ -376,24 +394,48 @@ def run(self, output=True):
 
     @property
     def best_cost_global(self):
+        """
+        :return: best cost found ever at the end of the execution
+        :rtype: float
+        """
         return self.__best_cost_global
 
     @property
     def best_ind_global(self):
+        """
+        :return: best individual found ever at the end of the execution
+        :rtype: pandas dataframe slice
+        """
         return self.__best_ind_global
 
     @property
     def best_structure(self):
+        """
+        :return: best generation structure found ever at the end of the execution
+        :rtype: bnlearn structure
+        """
         return self.__best_structure
 
     @property
     def history(self):
+        """
+        :return: best individuals from all generations
+        :rtype: pandas dataframe
+        """
         return self.__history
 
     @property
     def history_cost(self):
+        """
+        :return: list of best costs found along the execution
+        :rtype: list
+        """
         return self.__history_cost
 
     @property
     def dispersion(self):
+        """
+        :return: list of double tuples with mean and variance of each generation
+        :rtype: list
+        """
         return self.__dispersion
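The `truncate` docstring in this diff says that selection considers not only the cost of an individual but also its likelihood in the initial generation, with the influence controlled by the beta parameter, while `ALPHA` sets the fraction of the population kept. The diff does not show how the two quantities are combined, so the sketch below assumes one simple rule purely for illustration: rank individuals by `cost - beta * loglik`, so that individuals unlikely under the historic are penalized. The function, the dictionary keys and the sample values are all hypothetical and are not taken from EDApy.

```python
def truncate(generation, alpha, beta):
    """Keep the best alpha fraction of the generation (lower score is better).

    Assumed scoring rule (not from the source): cost minus beta times the
    log-likelihood of the individual under the historic data.
    """
    def score(individual):
        return individual["cost"] - beta * individual["loglik"]

    n_keep = max(1, int(alpha * len(generation)))
    return sorted(generation, key=score)[:n_keep]


# Hypothetical individuals: cost to minimize and log-likelihood in the historic.
generation = [
    {"name": "a", "cost": 10.0, "loglik": -1.0},
    {"name": "b", "cost": 9.0, "loglik": -8.0},
    {"name": "c", "cost": 12.0, "loglik": -0.5},
    {"name": "d", "cost": 20.0, "loglik": -0.2},
]

by_cost_only = truncate(generation, alpha=0.5, beta=0.0)
with_historic = truncate(generation, alpha=0.5, beta=1.0)
print([ind["name"] for ind in by_cost_only])
print([ind["name"] for ind in with_historic])
```

With `beta=0.0` the selection reduces to plain truncation by cost. With a positive beta, the cheapest individual `b` drops out here because it is very unlikely under the historic, which is the kind of trade-off the docstring attributes to the beta parameter.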