Commit

documentation
VicentePerezSoloviev committed Jun 5, 2020
1 parent 4383b1e commit ce53137
Showing 44 changed files with 17,633 additions and 69 deletions.
138 changes: 138 additions & 0 deletions EDApy.egg-info/PKG-INFO
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
Metadata-Version: 2.1
Name: EDApy
Version: 0.0.1
Summary: This is a package where some estimation of distribution algorithms are implemented.
Home-page: https://github.com/VicentePerezSoloviev/EDApy
Author: Vicente P. Soloviev
Author-email: [email protected]
License: UNKNOWN
Description: # EDApy

## Description

In this package some Estimation of Distribution Algorithms (EDAs) are implemented. EDAs are a type of evolutionary algorithms. Depending on the type of EDA, different dependencies among the variables can be considered.

Three EDAs have been implemented:
* Binary univariate EDA. It can serve as a simple example of an EDA, or be used for feature selection.
* Continuous univariate EDA.
* Continuous multivariate EDA.

## Examples

#### Binary univariate EDA
It can serve as a simple example of an EDA, or be used for feature selection. The cost function to optimize is the metric of the model. An example is shown.
```python
from EDApy import EDA_discrete as EDAd
import pandas as pd

def check_solution_in_model(dictionary):
    # prediction_model is the user's own model; it is not part of EDApy
    MAE = prediction_model(dictionary)
    return MAE

vector = pd.DataFrame(columns=['param1', 'param2', 'param3'])
vector.loc[0] = 0.5

EDA = EDAd(MAX_IT=200, DEAD_ITER=20, SIZE_GEN=30, ALPHA=0.7, vector=vector,
           cost_function=check_solution_in_model, aim='minimize')

bestcost, solution, history = EDA.run(output=True)
print(bestcost, solution)
print(history)
```

The example implements feature selection (FS) for a prediction model (prediction_model) that depends on some variables. The prediction model receives a dictionary with keys 'paramN' (N from 1 to the number of parameters) and values 1 or 0, depending on whether that variable should be included or not. The model returns a MAE, which we intend to minimize.

The EDA receives as input the maximum number of iterations, the number of iterations allowed with no improvement of the best global cost, the generation size, the fraction of each generation to select as best individuals, the initial vector of probabilities for each variable to be used, the cost function to optimize, and the aim ('minimize' or 'maximize') of the optimizer.

The probabilities in the vector are usually initialized to 0.5 to start from an equilibrium situation. The EDA returns the best cost found, the best combination of variables, and the history of costs found, which can be plotted.
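
To make the role of each input concrete, here is a minimal pure-Python sketch of a binary univariate EDA loop. It is an illustration under assumptions, not EDApy's actual code: the function `binary_umda`, its lower-case parameter names and the toy cost function are hypothetical, and the way the selected fraction and the stopping rule are applied is one plausible reading of the description above.

```python
import random

def binary_umda(cost_function, names, max_iter=200, dead_iter=20,
                size_gen=30, alpha=0.7, aim="minimize", seed=0):
    """Illustrative binary univariate EDA (hypothetical, not EDApy's code)."""
    rng = random.Random(seed)
    probs = {name: 0.5 for name in names}    # equilibrium start
    sign = 1 if aim == "minimize" else -1    # maximizing = minimizing -cost
    best_cost, best_ind, history, stale = None, None, [], 0

    for _ in range(max_iter):
        # Sample every variable independently from its own probability.
        generation = [{n: int(rng.random() < probs[n]) for n in names}
                      for _ in range(size_gen)]
        generation.sort(key=lambda ind: sign * cost_function(ind))
        elite = generation[:max(1, int(alpha * size_gen))]

        # Re-estimate each probability as the frequency of 1s in the elite.
        for n in names:
            probs[n] = sum(ind[n] for ind in elite) / len(elite)

        cost = cost_function(elite[0])
        history.append(cost)
        if best_cost is None or sign * cost < sign * best_cost:
            best_cost, best_ind, stale = cost, elite[0], 0
        else:
            stale += 1
            if stale >= dead_iter:           # no improvement for too long
                break
    return best_cost, best_ind, history

# Toy cost: the number of selected features (stands in for a model's MAE).
def toy_cost(ind):
    return ind["param1"] + ind["param2"] + ind["param3"]

best_cost, solution, history = binary_umda(toy_cost, ["param1", "param2", "param3"])
print(best_cost, solution)
```

Because every variable has its own independent probability, this kind of EDA cannot capture interactions between variables; that is the trade-off for its simplicity.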

#### Continuous univariate EDA

This EDA is used when some continuous parameters must be optimized.
```python
from EDApy import EDA_continuous as EDAc
import pandas as pd
import numpy as np

weights = [20, 10, -4]

def cost_function(dictionary):
    function = weights[0]*dictionary['param1']**2 + weights[1]*(np.pi/dictionary['param2']) - 2 - weights[2]*dictionary['param3']
    if function < 0:
        return 9999999
    return function

vector = pd.DataFrame(columns=['param1', 'param2', 'param3'])
vector['data'] = ['mu', 'std', 'min', 'max']
vector = vector.set_index('data')
vector.loc['mu'] = [5, 8, 1]
vector.loc['std'] = 20
vector.loc['min'] = 0
vector.loc['max'] = 100

EDA = EDAc(SIZE_GEN=40, MAX_ITER=200, DEAD_ITER=20, ALPHA=0.7, vector=vector,
           aim='minimize', cost_function=cost_function)
# run() must be called on the instance (EDA), not on the class (EDAc)
bestcost, params, history = EDA.run()
print(bestcost, params)
print(history)
```

In this case, the aim is to optimize a cost function which we want to minimize. The three parameters to optimize are continuous. This EDA must be initialized with some initial values (mu), and an initial range to search (std). Optionally, a minimum and a maximum can be specified.

As in the binary EDA, the best cost found, the solution and the cost evolution are returned.
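
The following pure-Python sketch shows how the mu, std, min and max values could drive a continuous univariate EDA. It is a hedged illustration only: `continuous_umda`, the dictionary layout of `vector` and the clipping to the bounds are assumptions made for this example, and EDApy itself takes a pandas dataframe instead.

```python
import random
import statistics

def continuous_umda(cost_function, vector, size_gen=40, max_iter=200,
                    dead_iter=20, alpha=0.7, seed=0):
    """Illustrative continuous univariate EDA, minimization only (hypothetical)."""
    rng = random.Random(seed)
    # Copy the initial values so the caller's vector is left untouched.
    model = {name: dict(p) for name, p in vector.items()}

    def sample():
        # One independent Gaussian per variable, clipped to [min, max].
        return {name: min(max(rng.gauss(p["mu"], p["std"]), p["min"]), p["max"])
                for name, p in model.items()}

    best_cost, best_ind, history, stale = float("inf"), None, [], 0
    for _ in range(max_iter):
        generation = sorted((sample() for _ in range(size_gen)), key=cost_function)
        elite = generation[:max(2, int(alpha * size_gen))]

        # Refit mu and std of every variable from the selected individuals.
        for name, p in model.items():
            values = [ind[name] for ind in elite]
            p["mu"] = statistics.mean(values)
            p["std"] = max(statistics.pstdev(values), 1e-6)

        cost = cost_function(elite[0])
        history.append(cost)
        if cost < best_cost:
            best_cost, best_ind, stale = cost, elite[0], 0
        else:
            stale += 1
            if stale >= dead_iter:
                break
    return best_cost, best_ind, history

initial = {"param1": {"mu": 5.0, "std": 20.0, "min": 0.0, "max": 100.0},
           "param2": {"mu": 8.0, "std": 20.0, "min": 0.0, "max": 100.0}}
best_cost, params, history = continuous_umda(
    lambda d: (d["param1"] - 3.0) ** 2 + (d["param2"] - 7.0) ** 2, initial)
print(best_cost, params)
```

The std acts as the initial search radius: a value that is too small makes the search stall near mu, while a large one spends the first generations exploring widely.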

#### Continuous multivariate EDA

In this case, dependencies among the variables are considered and managed with a Gaussian Bayesian Network. This EDA must be initialized with historical records in order to try to find the optimum result. A parameter (beta) is included to control the influence of the historical records in the final solution. Some of the variables can be evidences (fixed values for which we want to find the optimum of the other variables).
The optimizer will find the optimum values of the non-evidence-variables based on the value of the evidences. This is widely used in problems where dependencies among variables must be considered.

```python
from EDApy import EDA_multivariate as EDAm
import pandas as pd

# forbidden arc param1 -> param2 (built directly, since DataFrame.append was removed in pandas 2.0)
blacklist = pd.DataFrame([{'from': 'param1', 'to': 'param2'}], columns=['from', 'to'])

data = pd.read_csv(path_CSV) # columns param1 ... param5
evidences = [['param1', 2.0],['param5', 6.9]]

def cost_function(dictionary):
    return dictionary['param1'] + dictionary['param2'] + dictionary['param3'] + dictionary['param4'] + dictionary['param5']

# use a separate name for the instance so that the imported class EDAm is not shadowed
EDA = EDAm(MAX_ITER=200, DEAD_ITER=20, data=data, ALPHA=0.7, BETA=0.4, cost_function=cost_function,
           evidences=evidences, black_list=blacklist, n_clusters=6, cluster_vars=['param1', 'param5'])
output = EDA.run(output=True)

print('BEST', output.best_cost_global)
```
This is the most complex EDA implemented. Bayesian networks are used to represent an abstraction of the search space of each iteration, where new individuals are sampled. As a graph is a representation with nodes and arcs, some arcs can be forbidden by the black list (pandas dataframe with the forbidden arcs).
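
As a small illustration of what a black list of arcs means, the sketch below enumerates the candidate arcs between three variables and discards the forbidden one. This is plain Python written for this example; in EDApy the structure learning is delegated to R, so the names `variables`, `forbidden` and `allowed_arcs` are hypothetical.

```python
from itertools import permutations

variables = ["param1", "param2", "param3"]
forbidden = {("param1", "param2")}  # (from, to) pairs, as in the black list

# Every ordered pair of distinct variables is a candidate arc of the graph.
candidate_arcs = list(permutations(variables, 2))

# The structure learner may only choose among the arcs that are not forbidden.
allowed_arcs = [arc for arc in candidate_arcs if arc not in forbidden]
print(allowed_arcs)
```

Note that forbidding `param1 -> param2` does not forbid the reverse arc `param2 -> param1`; each direction has to be listed separately.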

In this case the cost function is a simple sum of the parameters. The evidences are variables that have fixed values and are not optimized. In this problem, the output would be the optimum value of the parameters which are not in the evidences list.
Due to the evidences, to help the structure learning algorithm to find the arcs, a clustering by the similar values is implemented. Thus, the number of clusters is an input, as well as the variables that are considered in the clustering.
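
The effect of fixing an evidence can be illustrated with the textbook formula for conditioning a bivariate Gaussian. This is a sketch under assumptions, not the package's implementation: `condition_on_evidence` and the numbers used are hypothetical, and only one free variable and one evidence variable are considered.

```python
import random

def condition_on_evidence(mu_x, mu_e, var_x, var_e, cov_xe, evidence):
    """Mean and variance of X given E = evidence, for a bivariate Gaussian."""
    gain = cov_xe / var_e                       # regression slope of X on E
    cond_mean = mu_x + gain * (evidence - mu_e)
    cond_var = var_x - gain * cov_xe
    return cond_mean, cond_var

# X has mean 5 and variance 37, E has mean 1 and variance 4, covariance 12.
mean, var = condition_on_evidence(mu_x=5.0, mu_e=1.0, var_x=37.0, var_e=4.0,
                                  cov_xe=12.0, evidence=2.0)
print(mean, var)  # 8.0 1.0

# New values of the free variable are then drawn around the conditional mean.
rng = random.Random(0)
samples = [rng.gauss(mean, var ** 0.5) for _ in range(5)]
```

Fixing the evidence both shifts the mean of the free variable and shrinks its variance, which is why the optimum found depends on the evidence values supplied.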

In this case, `run` returns the EDA object itself, which can be saved as a pickle in order to explore its attributes. One of these attributes is the structure of the best generation found, which can be plotted to observe the dependencies among the variables. The function to plot the structure is the following:
```python
from EDApy import print_structure
# output is the object returned by run(); best_structure is one of its properties
print_structure(structure=output.best_structure, var2optimize=['param2', 'param3', 'param4'], evidences=['param1', 'param5'])
```

![Structure graph plot](/structure.PNG "Structure of the optimum generation found by the EDA")

## Getting started

#### Prerequisites
R must be installed to use the multivariate EDA, together with the R libraries c("bnlearn", "dbnR", "data.table").
To call R from Python, the rpy2 package must also be installed.
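
A quick way to verify these prerequisites from Python is sketched below. The helper `missing_prerequisites` is hypothetical and not part of EDApy; it relies only on the standard library and on `rpy2.robjects.packages.isinstalled`, and it assumes that a failure to import rpy2's R interface means that R itself could not be loaded.

```python
import importlib.util

R_PACKAGES = ("bnlearn", "dbnR", "data.table")

def missing_prerequisites():
    """Return the names of the prerequisites that could not be found."""
    if importlib.util.find_spec("rpy2") is None:
        return ["rpy2"]
    try:
        from rpy2.robjects.packages import isinstalled
    except Exception:
        # rpy2 is present, but R itself could not be loaded (assumption).
        return ["R"]
    return [name for name in R_PACKAGES if not isinstalled(name)]

missing = missing_prerequisites()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
```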

#### Installing
```
pip install git+https://github.com/vicenteperezsoloviev/EDApy.git#egg=EDApy
```

Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
16 changes: 16 additions & 0 deletions EDApy.egg-info/SOURCES.txt
@@ -0,0 +1,16 @@
README.md
setup.py
EDApy/__init__.py
EDApy.egg-info/PKG-INFO
EDApy.egg-info/SOURCES.txt
EDApy.egg-info/dependency_links.txt
EDApy.egg-info/top_level.txt
EDApy/optimization/__init__.py
EDApy/optimization/multivariate/EDA_multivariate.py
EDApy/optimization/multivariate/__BayesianNetwork.py
EDApy/optimization/multivariate/__clustering.py
EDApy/optimization/multivariate/__init__.py
EDApy/optimization/multivariate/__matrix.py
EDApy/optimization/univariate/__init__.py
EDApy/optimization/univariate/continuous.py
EDApy/optimization/univariate/discrete.py
1 change: 1 addition & 0 deletions EDApy.egg-info/dependency_links.txt
@@ -0,0 +1 @@

1 change: 1 addition & 0 deletions EDApy.egg-info/top_level.txt
@@ -0,0 +1 @@
EDApy
6 changes: 6 additions & 0 deletions EDApy/optimization/__init__.py
@@ -0,0 +1,6 @@
#!/usr/bin/env python
# coding: utf-8

# __init__.py

# empty
110 changes: 76 additions & 34 deletions EDApy/optimization/multivariate/EDA_multivariate.py
@@ -16,6 +16,38 @@


class EDAgbn:

"""Multivariate Estimation of Distribution Algorithm. The best individuals of each generation are selected and modelled
by a Gaussian Bayesian network, from which new individuals are sampled. Some of the variables might be evidences
(fixed values). The optimizer will find the optimum values of the non-evidence variables for those evidences.
It is also possible to control the influence of the historical data (from which the EDA is initialized) on the selection
of the best individuals.
:param MAX_ITER: Maximum number of iterations of the algorithm
:type MAX_ITER: int
:param DEAD_ITER: Number of iterations with no best cost improvement, before stopping
:type DEAD_ITER: int
:param data: data of the historic
:type data: pandas dataframe
:param ALPHA: percentage of population to select in the truncation
:type ALPHA: float [0,1]
:param BETA: percentage of influence of the individual likelihood in the historic
:type BETA: float [0,1]
:param cost_function: cost function to minimize
:type cost_function: callable function which receives a dictionary as input and returns a numeric value
:param evidences: name of evidences variables, and fixed values.
:type evidences: list of two-element lists, each holding a variable name and its fixed value: [name, value]
:param black_list: forbidden arcs in the structures
:type black_list: pandas dataframe with two columns (from, to)
:param n_clusters: number of clusters into which the data can be grouped. The cluster is appended in each iteration
:type n_clusters: int
:param cluster_vars: list of names of variables to consider in the clustering
:type cluster_vars: list of strings
:raises Exception: cost function is not callable
"""

# initializations
# structure = -1

@@ -29,19 +61,7 @@ class EDAgbn:
def __init__(self, MAX_ITER, DEAD_ITER, data, ALPHA, BETA, cost_function,
evidences, black_list, n_clusters, cluster_vars):

"""
Initialize the class
:param MAX_ITER: Maximum number of iterations of the algorithm
:param DEAD_ITER: Number of iterations with no best cost improvement, before stopping
:param data: pandas dataframe with the data of the historic
:param ALPHA: percentage of population to select in the truncation. Range [0,1]
:param BETA: percentage of influence of the individual likelihood in the historic. Range [0,1]
:param cost_function: cost function to minimize
:param evidences: list of two-fold-lists with name of variable, and fixed value. [[name, value], ...]
:param black_list: pandas dataframe with two columns (from, to), with the forbidden arcs in the structures
:param n_clusters: number of clusters in which, the data can be grouped. The cluster is appended in each
iteration
:param cluster_vars: list of names of variables to consider in the clustering
"""Constructor method
"""

self.MAX_ITER = MAX_ITER
@@ -92,9 +112,7 @@ def __init__(self, MAX_ITER, DEAD_ITER, data, ALPHA, BETA, cost_function,

def __initialize_data__(self):

"""
Initialize the dataset. Assign a column cost to each individual
:return: update initial generation
"""Initialize the dataset. Assign a column cost to each individual
"""

indexes = list(self.generation.index)
@@ -108,10 +126,8 @@ def __initialize_data__(self):

def truncate(self):

"""
Select the best individuals of the generation. In this case, not only the cost is considered. Also the
"""Select the best individuals of the generation. In this case, not only the cost is considered. Also the
likelihood of the individual in the initial generation. This influence is controlled by beta parameter
:return: update generation
"""

likelihood_estimation = bnlearn_package.logLik_bn_fit
@@ -134,11 +150,12 @@ def truncate(self):

def sampling_multivariate(self, fit):

"""
Calculate the parameters mu and sigma to sample from a multivariate normal distribution.
"""Calculate the parameters mu and sigma to sample from a multivariate normal distribution.
:param fit: bnfit object from R of the generation (structure and data)
:return: name in order of the parameters returned. mu and sigma parameters of the multivariate
normal distribution
:type fit: bnfit object from R
:return: name in order of the parameters returned. mu and sigma parameters of the multivariate normal distribution
:rtype: list, float, float
"""

hierarchical_order = bnlearn_package.node_ordering(self.structure)
@@ -225,9 +242,10 @@ def sampling_multivariate(self, fit):

def new_generation(self):

"""
Build a new generation from the parameters calculated.
:return: update the generation to the new group of individuals
"""Build a new generation from the parameters calculated and update the generation to the new group of individuals
:return: mean and sigma of the individuals costs of the generation
:rtype: float, float
"""

valid_individuals = 0
@@ -293,10 +311,10 @@ def new_generation(self):

def soft_restrictions(self, NOISE):

"""
Add Gaussian noise to the evidence variables
"""Add Gaussian noise to the evidence variables
:param NOISE: std of the normal distribution from where noise is sampled
:return: update generation variables
:type NOISE: float
"""

number_samples = len(self.generation)
@@ -308,21 +326,21 @@ def soft_restrictions(self, NOISE):

def __choose_best__(self):

"""
Select the best individual of the generation
"""Select the best individual of the generation
:return: cost of the individual, and the individual
:rtype: float, pandas dataframe
"""

minimum = self.generation['COSTE'].min()
best_ind_local = self.generation[self.generation['COSTE'] == minimum]
return minimum, best_ind_local

def run(self, output=True):

"""
Running method
"""Running method
:param output: if desired to print the output of each individual. True to print output
:type output: boolean
:return: the instance itself, so that all its attributes can be explored
:rtype: EDAgbn
"""

self.__initialize_data__()
@@ -376,24 +394,48 @@ def run(self, output=True):

@property
def best_cost_global(self):
"""
:return: best cost found ever at the end of the execution
:rtype: float
"""
return self.__best_cost_global

@property
def best_ind_global(self):
"""
:return: best individual found ever at the end of the execution
:rtype: pandas dataframe slice
"""
return self.__best_ind_global

@property
def best_structure(self):
"""
:return: best generation structure found ever at the end of the execution
:rtype: bnlearn structure
"""
return self.__best_structure

@property
def history(self):
"""
:return: best individuals from all generations
:rtype: pandas dataframe
"""
return self.__history

@property
def history_cost(self):
"""
:return: list of best costs found along the execution
:rtype: list
"""
return self.__history_cost

@property
def dispersion(self):
"""
:return: list of (mean, variance) tuples, one per generation
:rtype: list
"""
return self.__dispersion
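
The `sampling_multivariate` docstring in this diff says that the mu and sigma of a multivariate normal distribution are derived from the fitted network, but the body of the method is collapsed here. As a hedged sketch of the standard derivation for a linear Gaussian Bayesian network, and not the code of this commit, the joint parameters can be accumulated node by node in topological order. The function `gbn_to_joint` and its dictionary arguments are hypothetical.

```python
def gbn_to_joint(order, intercepts, coefs, noise_vars):
    """Joint mean and covariance of a linear Gaussian Bayesian network.

    order      -- node names in topological order (parents before children)
    intercepts -- {node: intercept of the node's regression}
    coefs      -- {node: {parent: regression coefficient}}
    noise_vars -- {node: variance of the node's residual}
    """
    mu = {}
    cov = {a: {b: 0.0 for b in order} for a in order}
    for i, node in enumerate(order):
        parents = coefs.get(node, {})
        mu[node] = intercepts[node] + sum(b * mu[p] for p, b in parents.items())
        # Covariance with every earlier node is inherited through the parents.
        for earlier in order[:i]:
            c = sum(b * cov[p][earlier] for p, b in parents.items())
            cov[node][earlier] = cov[earlier][node] = c
        # Own variance: residual noise plus the variance carried by the parents.
        cov[node][node] = noise_vars[node] + sum(
            b * cov[node][p] for p, b in parents.items())
    return mu, cov

# Chain A -> B with A ~ N(1, 4) and B = 2 + 3*A + noise of variance 1.
mu, cov = gbn_to_joint(order=["A", "B"],
                       intercepts={"A": 1.0, "B": 2.0},
                       coefs={"B": {"A": 3.0}},
                       noise_vars={"A": 4.0, "B": 1.0})
print(mu, cov)
```

Processing the nodes in topological order guarantees that the mean and covariances of all parents are already known when a child is reached.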