Skip to content

Latest commit

 

History

History
151 lines (126 loc) · 5.19 KB

generate_data.md

File metadata and controls

151 lines (126 loc) · 5.19 KB

Generate data

from codeE.generate_data import ...

Function to use in order to generate data on the crowdsourcing scenario.

The available functions are:


Generate Gaussians

codeE.generate_data.do_gaussianEX(n=250,std=0.3)

Generate three bidimensional gaussians with n data points, centered on (0,1), (0,-1), *(1,0), all with sigma 0.3.

Parameters

  • n: int, default=250
    Ammount of data to generate on each gaussian
  • std: float, default=0.3
    The standard deviation of the gaussians to use in the generation.

Returns

  • X_data: array-like of shape (n3, 2)*
    The input patterns of the three gaussians shuffled.
  • Z_data: array-like of shape (n3, 1)*
    The output/target/labels of the three gaussians.
Examples
from codeE.generate_data import do_gaussianEX
X, Z = do_gaussianEX(n=1000, std=0.4)

Visualize the generated data

import matplotlib.pyplot as plt
plt.scatter(X[:,0], X[:,1], c=Z)
plt.show()

Simulate Multiple Annotations

class codeE.generate_data.SyntheticData(state=None)

UP

Generate synthetic multiple annotations for some data using the ground truth of it. The method needs confusion matrices of errors patterns to simulate the annotations.

Parameters

  • state: string, tuple or int, default=None
    The random state to set the generation seed. If None it uses the current random state of the machine.
Examples

Build generation parameters

Z_data = ... #data
Tmax = 100 
T_data = 5 
B = np.asarray([  #confusion matrices of groups
                [[0.9, 0.1,0.0],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.0, 0.9]] , #group 1 -- quite expert
                 [[0.1, 0.8, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.0, 0.9, 0.1]] #  biased for class 2
                ])
G = np.asarray([0.7, 0.3]) #groups probability: p(g)

Generate based on the previous parameters

from codeE.generate_data import SyntheticData
GenerateData = SyntheticData()
GenerateData.set_probas(B, G)
y_obs, groups_annot = GenerateData.synthetic_annotate_data(Z_data, Tmax, T_data)

Generate based on some fixed groups behavior that we create

from codeE.generate_data import SyntheticData
folder = "./data/synthetic/"
GenerateData = SyntheticData(state=folder+"datasim_state.pickle")
GenerateData.set_probas(folder+'/matrix_datasim_normal.csv', folder+'/groups_datasim_normal.csv', asfile=True)
y_obs, groups_annot = GenerateData.synthetic_annotate_data(Z_train, Tmax, T_data)
Class Methods
Function Description
set_probas function to set the confusion matrix behaviors of annotations errors
synthetic_annotate_data generate the annotations
save_annotations store the generated annotations into a file
set_probas(file_matrix, file_groups, asfile = False)

Set the confusion matrix behaviors of annotations errors, based on groups of behaviors.

Parameters

  • file_matrix: string or array-like of shape (n_groups, n_classes, n_classes) The groups confusion matrices, could be a string of the file if asfile is seted True. In a probabilistic format, it should sum one over the last axis.
  • file_groups: string or array-like of shape (n_groups, )
    The probabilistic presence of each group, could be a string of the file if asfile is seted True.
  • asfile: boolean, default=False
    The probabilistic presence of each group, could be a string of the file if asfile is seted True.
synthetic_annotate_data(self, Z, Tmax, T_data, deterministic=False, hard=True)

Perform the simulation of multiple annotations by multiple annotators, with the groups of behaviors based on confusion matrices.

Parameters

  • Z: array-like of shape (n_samples, 1)
    The ground truth labels of the data, could be also in one-hot format.
  • Tmax: int
    The number of annotators to be used in the generation
  • T_data: int
    The number of annotations per every input pattern
  • deterministic: boolean, default=False
    If each input pattern (data) has to have specifically T_data annotations. If set to False, each input pattern has to have on average T_data annotations.
  • hard: boolean, default=True
    If the assignment to the annotators to groups is hard (discrete) or soft (probabilistic).

Returns

  • Y_ann: array-like of shape (n_samples, n_annotators, n_classes)
    The generated annotations, with no label symbol =-1.
  • G_ann: array-like of shape

If hard=True: (n_annotators, 1)
If hard=False: (n_annotators, n_groups)
The group assigned to every annotator.

save_annotations(self, annotations, file_name='annotations',npy=True)

Store the annotations into a csv file or a npy file.

Parameters

  • annotations: array-like of shape (n_samples, n_annotators, n_classes) The generated annotations to be stored.
  • file_name: string, default='annotations' The file name to be stored.
  • npy: boolean, default=True
    If the annotation file will be numpy format of csv.