Generate data

from codeE.generate_data import ...

Function to use in order to generate data on the crowdsourcing scenario.

The available functions are:

Generate Gaussians
Simulate Annotations

Generate Gaussians

codeE.generate_data.do_gaussianEX(n=250,std=0.3)

Generate three bidimensional gaussians with n data points, centered on (0,1), (0,-1), *(1,0), all with sigma 0.3.

Parameters

n: int, default=250
Ammount of data to generate on each gaussian
std: float, default=0.3
The standard deviation of the gaussians to use in the generation.

Returns

X_data: array-like of shape (n3, 2)*
The input patterns of the three gaussians shuffled.
Z_data: array-like of shape (n3, 1)*
The output/target/labels of the three gaussians.

Examples

from codeE.generate_data import do_gaussianEX
X, Z = do_gaussianEX(n=1000, std=0.4)

Visualize the generated data

import matplotlib.pyplot as plt
plt.scatter(X[:,0], X[:,1], c=Z)
plt.show()

Simulate Multiple Annotations

class codeE.generate_data.SyntheticData(state=None)

UP

Generate synthetic multiple annotations for some data using the ground truth of it. The method needs confusion matrices of errors patterns to simulate the annotations.

Parameters

state: string, tuple or int, default=None
The random state to set the generation seed. If None it uses the current random state of the machine.

Examples

Build generation parameters

Z_data = ... #data
Tmax = 100 
T_data = 5 
B = np.asarray([  #confusion matrices of groups
                [[0.9, 0.1,0.0],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.0, 0.9]] , #group 1 -- quite expert
                 [[0.1, 0.8, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.0, 0.9, 0.1]] #  biased for class 2
                ])
G = np.asarray([0.7, 0.3]) #groups probability: p(g)

Generate based on the previous parameters

from codeE.generate_data import SyntheticData
GenerateData = SyntheticData()
GenerateData.set_probas(B, G)
y_obs, groups_annot = GenerateData.synthetic_annotate_data(Z_data, Tmax, T_data)

Generate based on some fixed groups behavior that we create

from codeE.generate_data import SyntheticData
folder = "./data/synthetic/"
GenerateData = SyntheticData(state=folder+"datasim_state.pickle")
GenerateData.set_probas(folder+'/matrix_datasim_normal.csv', folder+'/groups_datasim_normal.csv', asfile=True)
y_obs, groups_annot = GenerateData.synthetic_annotate_data(Z_train, Tmax, T_data)

Class Methods

Function	Description
set_probas	function to set the confusion matrix behaviors of annotations errors
synthetic_annotate_data	generate the annotations
save_annotations	store the generated annotations into a file

set_probas(file_matrix, file_groups, asfile = False)

Set the confusion matrix behaviors of annotations errors, based on groups of behaviors.

Parameters

file_matrix: string or array-like of shape (n_groups, n_classes, n_classes) The groups confusion matrices, could be a string of the file if asfile is seted True. In a probabilistic format, it should sum one over the last axis.
file_groups: string or array-like of shape (n_groups, )
The probabilistic presence of each group, could be a string of the file if asfile is seted True.
asfile: boolean, default=False
The probabilistic presence of each group, could be a string of the file if asfile is seted True.

synthetic_annotate_data(self, Z, Tmax, T_data, deterministic=False, hard=True)

Perform the simulation of multiple annotations by multiple annotators, with the groups of behaviors based on confusion matrices.

Parameters

Z: array-like of shape (n_samples, 1)
The ground truth labels of the data, could be also in one-hot format.
Tmax: int
The number of annotators to be used in the generation
T_data: int
The number of annotations per every input pattern
deterministic: boolean, default=False
If each input pattern (data) has to have specifically T_data annotations. If set to False, each input pattern has to have on average T_data annotations.
hard: boolean, default=True
If the assignment to the annotators to groups is hard (discrete) or soft (probabilistic).

Returns

Y_ann: array-like of shape (n_samples, n_annotators, n_classes)
The generated annotations, with no label symbol =-1.
G_ann: array-like of shape

If hard=True: (n_annotators, 1)
If hard=False: (n_annotators, n_groups)
The group assigned to every annotator.

save_annotations(self, annotations, file_name='annotations',npy=True)

Store the annotations into a csv file or a npy file.

Parameters

annotations: array-like of shape (n_samples, n_annotators, n_classes) The generated annotations to be stored.
file_name: string, default='annotations' The file name to be stored.
npy: boolean, default=True
If the annotation file will be numpy format of csv.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

generate_data.md

generate_data.md

Generate data

Generate Gaussians

Examples

Simulate Multiple Annotations

Examples

Class Methods

Files

generate_data.md

Latest commit

History

generate_data.md

File metadata and controls

Generate data

Generate Gaussians

Examples

Simulate Multiple Annotations

Examples

Class Methods