@software{jose_luis_garrido_labrador_2024_10623889,
  author    = {José Luis Garrido-Labrador},
  title     = {jlgarridol/sslearn: v1.0.4},
  month     = feb,
  year      = 2024,
  publisher = {Zenodo},
  version   = {1.0.4},
  doi       = {10.5281/zenodo.10623889},
  url       = {https://doi.org/10.5281/zenodo.10623889}
}
Fake predict_proba method for classifiers that do not have it.
When predict_proba is called, it will use one-hot encoding to fake the probabilities if base_estimator does not have a predict_proba method.

Examples

```python
from sklearn.svm import SVC
# SVC does not have a predict_proba method

from sslearn.base import FakedProbaClassifier
faked_svc = FakedProbaClassifier(SVC())
faked_svc.fit(X, y)
faked_svc.predict_proba(X)  # One-hot encoded probabilities
```
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
Data.

y ({array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)):
Multi-class targets. An indicator matrix turns on multilabel classification.

The returned estimates for all classes are ordered by label of classes.

Note that in the multilabel case, each sample can have any number of labels. This returns the marginal probability that the given sample has the label in question. For example, it is entirely consistent that two labels both have a 90% probability of applying to a given sample.

In the single label multiclass case, the rows of the returned matrix sum to 1.
\n\n
Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
Input data.

Returns

T (array-like of shape (n_samples, n_classes)):
Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Read a .csv file.

Parameters

path (str):
File path.

format (str, optional):
Object that will contain the data, it can be numpy or pandas, by default "pandas"

secure (bool, optional):
Guarantees that the dataset does not contain -1 as a valid class, so that it can be made semi-supervised afterwards, by default False

target_col ({str, int, None}, optional):
Column name or index of the class column; if None, use the default value stored in the file, by default None

Returns

X, y (array_like):
Dataset loaded.
Read a .dat file from KEEL (http://www.keel.es/).

Parameters

path (str):
File path.

format (str, optional):
Object that will contain the data, it can be numpy or pandas, by default "pandas"

secure (bool, optional):
Guarantees that the dataset does not contain -1 as a valid class, so that it can be made semi-supervised afterwards, by default False

target_col ({str, int, None}, optional):
Column name or index of the class column; if None, use the default value stored in the file, by default None

encoding (str, optional):
Encoding of the file, by default "utf-8"

Returns

X, y (array_like):
Dataset loaded.
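A minimal loading sketch based on the parameters above; the file names are hypothetical, and the loader functions are assumed to be exposed as sslearn.datasets.read_csv and sslearn.datasets.read_keel.

```python
from sslearn.datasets import read_csv, read_keel

# Hypothetical file paths; adjust to your own datasets.
X, y = read_keel("dataset.dat", format="numpy")      # "unlabeled" rows become -1 in y
X2, y2 = read_csv("dataset.csv", format="pandas",
                  secure=True, target_col=-1)         # explicitly use the last column as the class
```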
Create an artificial Semi-supervised dataset from a supervised dataset.
\n\n
Parameters
\n\n
\n
X (array-like of shape (n_samples, n_features)):
Training data, where n_samples is the number of samples and n_features is the number of features.

y (array-like of shape (n_samples,)):
The target variable for supervised learning problems.

label_rate (float, optional):
Proportion of labeled instances relative to unlabeled instances, by default 0.1

random_state (int or RandomState, optional):
Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls, by default None

force_minimum (int, optional):
Force a minimum number of instances of each class, by default None

indexes (bool, optional):
If True, return the indexes of the labeled and unlabeled instances, by default False

shuffle (bool, default=True):
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify (array-like, default=None):
If not None, data is split in a stratified fashion, using this as the class labels.
\n
\n\n
Returns
\n\n
\n
X (ndarray):
The feature set.

y (ndarray):
The label set, -1 for unlabeled instances.

X_unlabel (ndarray):
The feature set of the instances marked as unlabeled.

y_unlabel (ndarray):
The true label of each unlabeled instance, in the same order.

label (ndarray, optional):
The training set indexes of the instances marked as labeled.

unlabel (ndarray, optional):
The training set indexes of the instances marked as unlabeled.
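A short sketch of how the parameters and return values above fit together, assuming the function is exposed as sslearn.model_selection.artificial_ssl_dataset:

```python
from sklearn.datasets import load_iris
from sslearn.model_selection import artificial_ssl_dataset  # assumed module path

X, y = load_iris(return_X_y=True)
# Keep 10% of the labels; the rest are marked with -1 in the returned y.
X, y, X_unlabel, y_unlabel = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
# X_unlabel / y_unlabel hold the hidden ground truth of the unlabeled part,
# which is useful for evaluating a semi-supervised classifier afterwards.
```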
Stratified K-Folds cross-validator for semi-supervised learning.
\n\n
Provides labeled and unlabeled indices for each split, using the StratifiedKFold method from sklearn.
The test set is used as the labeled set and the train set as the unlabeled set.
Compute the conflict rate of a prediction, given a set of restrictions.

combine_predictions:
Combine the predictions of a group of instances to keep the restrictions.
All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).
Who is Who Classifier
Kuncheva, L. I., Rodriguez, J. J., & Jackson, A. S. (2017).
Restricted set classification: Who is there?. Pattern Recognition, 63, 158-170.
\n\n
Parameters
\n\n
\n
base_estimator (ClassifierMixin):
The base estimator to be used for training.

method (str, optional):
The method used to assign classes: "greedy" for a first-look assignment or "hungarian" to use the Hungarian algorithm, by default "hungarian"

conflict_weighted (bool, default=True):
Whether to weight the conflict rate by the number of instances in the same group.
Fit the model according to the given training data.
\n\n
Parameters
\n\n
\n
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
The input samples.

y (array-like of shape (n_samples,)):
The target values.

instance_group (array-like of shape (n_samples,)):
The group of each instance. Two instances with the same label are not allowed to be in the same group. If None, the group restriction will not be used in training.
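A small sketch of fitting with group restrictions; the module path sslearn.restricted for WhoIsWhoClassifier is assumed here, and the toy data is illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sslearn.restricted import WhoIsWhoClassifier  # assumed module path

X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])  # two instances with the same label never share a group

clf = WhoIsWhoClassifier(DecisionTreeClassifier(), method="hungarian")
clf.fit(X, y, instance_group=groups)
```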
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sslearn.subview import SubViewClassifier

# Mode 'include' will include all columns that contain `string`
clf = SubViewClassifier(DecisionTreeClassifier(), "sepal", mode="include")
clf.fit(X, y)

# Mode 'regex' will include all columns that match the regex
clf = SubViewClassifier(DecisionTreeClassifier(), "sepal.*", mode="regex")
clf.fit(X, y)

# Mode 'index' will include the columns at the index, useful for numpy arrays
clf = SubViewClassifier(DecisionTreeClassifier(), [0, 1], mode="index")
clf.fit(X, y)
```
Safely divide two numbers preventing division by zero.

confidence_interval:
Calculate the confidence interval of the predictions.

choice_with_proportion:
Choose the best predictions according to the proportion of each class.

calculate_prior_probability:
Calculate the prior probability of each label.

mode:
Calculate the mode of a list of values.

check_n_jobs:
Check the n_jobs parameter according to the scikit-learn convention.

check_classifier:
Check if the classifier is a ClassifierMixin or a list of ClassifierMixin.
David Yarowsky. (1995).
Unsupervised word sense disambiguation rivaling supervised methods.
In Proceedings of the 33rd annual meeting on Association for Computational Linguistics (ACL '95).
Association for Computational Linguistics, Stroudsburg, PA, USA, 189-196.
10.3115/981658.981684
Self-training. An adaptation of SelfTrainingClassifier from sklearn, compatible with the sslearn data loaders.

This class allows a given supervised classifier to function as a semi-supervised classifier, allowing it to learn from unlabeled data. It does this by iteratively predicting pseudo-labels for the unlabeled data and adding them to the training set.

The classifier will continue iterating until either max_iter is reached, or no pseudo-labels were added to the training set in the previous iteration.
\n\n
Parameters
\n\n
\n
base_estimator (estimator object):
An estimator object implementing fit and predict_proba. Invoking the fit method will fit a clone of the passed estimator, which will be stored in the base_estimator_ attribute.

threshold (float, default=0.75):
The decision threshold for use with criterion='threshold'. Should be in [0, 1). When using the 'threshold' criterion, a well calibrated classifier should be used.

criterion ({'threshold', 'k_best'}, default='threshold'):
The selection criterion used to select which labels to add to the training set. If 'threshold', pseudo-labels with prediction probabilities above threshold are added to the dataset. If 'k_best', the k_best pseudo-labels with the highest prediction probabilities are added to the dataset. When using the 'threshold' criterion, a well calibrated classifier should be used.

k_best (int, default=10):
The amount of samples to add in each iteration. Only used when criterion is 'k_best'.

max_iter (int or None, default=10):
Maximum number of iterations allowed. Should be greater than or equal to 0. If it is None, the classifier will continue to predict labels until no new pseudo-labels are added, or all unlabeled samples have been labeled.
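A usage sketch of the parameters above, assuming the wrapper is exposed as sslearn.wrapper.SelfTraining and that unlabeled instances are marked with -1, as in the rest of this documentation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import SelfTraining  # assumed module path

X, y = load_iris(return_X_y=True)
y_semi = y.copy()
y_semi[np.arange(len(y)) % 10 != 0] = -1  # keep every 10th label, hide the rest

clf = SelfTraining(base_estimator=DecisionTreeClassifier(),
                   criterion="threshold", threshold=0.75, max_iter=10)
clf.fit(X, y_semi)
print(clf.score(X, y))  # evaluate against the true labels
```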
Create a SETRED classifier. It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
The main steps are:

Train a classifier with the labeled data.

Create a pool of unlabeled data and select the most confident predictions.

Repeat until the maximum number of iterations is reached:
a. Select the most confident predictions from the unlabeled data.
b. Calculate the neighborhood graph of the labeled data and the selected instances from the unlabeled data.
c. Calculate the significance level of the selected instances.
d. Reject the instances that are not significant according to their position in the neighborhood graph.
e. Add the selected instances to the labeled data and retrain the classifier.
f. Add new instances to the pool of unlabeled data.

Return the classifier trained with the labeled data.
Li, Ming, and Zhi-Hua Zhou. (2005).
SETRED: Self-training with editing,
in Advances in Knowledge Discovery and Data Mining.
Pacific-Asia Conference on Knowledge Discovery and Data Mining,
LNAI 3518, Springer, Berlin, Heidelberg,
10.1007/11430919_71
Create a SETRED classifier.
It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
\n\n
Parameters
\n\n
\n
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default KNeighborsClassifier(n_neighbors=3)

max_iterations (int, optional):
Maximum number of iterations allowed. Should be greater than or equal to 0, by default 40

distance (str, optional):
The distance metric to use for the graph. The default metric is euclidean, which with p=2 is equivalent to the standard Euclidean metric. For a list of available metrics, see the documentation of DistanceMetric and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS. Note that the cosine metric uses cosine_distances, by default "euclidean"

poolsize (float, optional):
Maximum number of unlabeled instance candidates to pseudo-label, by default 0.25

rejection_threshold (float, optional):
Significance level, by default 0.1

graph_neighbors (int, optional):
Number of neighbors for each sample, by default 1

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of parallel jobs to run for the neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors, by default None
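A brief sketch of the parameters above (sslearn.wrapper.Setred assumed); X and y are a feature matrix and a target vector in which unlabeled instances are marked with -1, as in the self-training example above.

```python
from sslearn.wrapper import Setred  # assumed module path

clf = Setred(max_iterations=40, poolsize=0.25, rejection_threshold=0.1,
             graph_neighbors=1, random_state=0)
clf.fit(X, y)  # y uses -1 for unlabeled instances
```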
Avrim Blum and Tom Mitchell. (1998).
Combining labeled and unlabeled data with co-training,
in Proceedings of the eleventh annual conference on Computational learning theory (COLT '98).
Association for Computing Machinery, New York, NY, USA, 92-100.
10.1145/279943.279962
Create a CoTraining classifier.
Multi-view learning algorithm that uses two classifiers to label instances.
\n\n
Parameters
\n\n
\n
base_estimator (ClassifierMixin, optional):
The classifier that will be used in the co-training algorithm on the feature set, by default DecisionTreeClassifier()

second_base_estimator (ClassifierMixin, optional):
The classifier that will be used in the co-training algorithm on the second feature set; if None, a clone of base_estimator is used, by default None

max_iterations (int, optional):
The number of iterations, by default 30

poolsize (int, optional):
The size of the pool of unlabeled samples from which the classifier can choose, by default 75

threshold (float, optional):
The threshold for labeling instances, by default 0.5

force_second_view (bool, optional):
Whether the second classifier needs a different view of the data. If False, the second view will be the same as the first, by default True

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
Build a CoTraining classifier from the training set.
\n\n
Parameters
\n\n
\n
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
Array representing the data.

y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.

X2 ({array-like, sparse matrix} of shape (n_samples, n_features), optional):
Array representing the data from another view, not compatible with features, by default None

features ({list, tuple}, optional):
List or tuple of two arrays with the feature indexes for each subspace view, not compatible with X2, by default None

number_per_class (dict, optional):
Dict of class name to integer with the maximum amount of instances to label for that class in each iteration, by default None
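A sketch of co-training with two feature views defined by column indexes (sslearn.wrapper.CoTraining assumed); X and y are as in the previous examples, with -1 marking unlabeled instances.

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import CoTraining  # assumed module path

clf = CoTraining(base_estimator=DecisionTreeClassifier(), random_state=0)
# First view: columns 0 and 1; second view: columns 2 and 3.
clf.fit(X, y, features=[[0, 1], [2, 3]])
```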
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.
\n\n
Parameters
\n\n
\n
X (array-like of shape (n_samples, n_features)):
Test samples.

y (array-like of shape (n_samples,) or (n_samples, n_outputs)):
True labels for X.

sample_weight (array-like of shape (n_samples,), default=None):
Sample weights.

X2 ({array-like, sparse matrix} of shape (n_samples, n_features), optional):
Array representing the data from another view, by default None
\n
\n\n
Returns
\n\n
\n
score (float):
Mean accuracy of self.predict(X) w.r.t. y.
M. F. A. Hady and F. Schwenker,
Co-training by Committee: A New Semi-supervised Learning Framework,
in 2008 IEEE International Conference on Data Mining Workshops,
Pisa, 2008, pp. 563-572, 10.1109/ICDMW.2008.27
Create a committee trained by co-training based on the diversity of classifiers.
\n\n
Parameters
\n\n
\n
ensemble_estimator (ClassifierMixin, optional):
Ensemble method; without an ensemble it works as self-training with a pool, by default BaggingClassifier()

max_iterations (int, optional):
Number of iterations of training, -1 if there is no maximum number of iterations, by default 100

poolsize (int, optional):
Maximum number of unlabeled instance candidates to pseudo-label, by default 100

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.
\n\n
Parameters
\n\n
\n
X (array-like of shape (n_samples, n_features)):
Test samples.

y (array-like of shape (n_samples,) or (n_samples, n_outputs)):
True labels for X.

sample_weight (array-like of shape (n_samples,), optional):
Sample weights, by default None
\n
\n\n
Returns
\n\n
\n
score (float):
Mean accuracy of self.predict(X) w.r.t. y.
Democratic Co-learning. Ensemble of classifiers of different types.
\n\n
An iterative algorithm that uses an ensemble of classifiers to label instances.
The main process is:

Train each classifier with the labeled instances.

While any classifier is retrained:

Predict the instances from the unlabeled set.

Calculate the confidence interval for each classifier, to define the weights.

Calculate the weighted vote for each instance.

Calculate the majority vote for each instance.

Select the instances to label if the majority vote is the same as the weighted vote.

Select the instances to retrain each classifier: if only_mislabeled is False, select all instances, else select only the mislabeled instances of each classifier.

Retrain the classifier with the new instances if the error rate is lower than in the previous iteration.

Ignore the classifiers with a confidence interval lower than 0.5.

Combine the probabilities of each classifier.
\n\n\n
Methods
\n\n
\n
fit: Fit the model with the labeled instances.

predict: Predict the class for each instance.

predict_proba: Predict the probability for each class.

score: Return the mean accuracy on the given test data and labels.
Y. Zhou and S. Goldman, (2004).
Democratic co-learning,
in 16th IEEE International Conference on Tools with Artificial Intelligence,
pp. 594-602, 10.1109/ICTAI.2004.48
Democratic Co-learning. Ensemble of classifiers of different types.
\n\n
Parameters
\n\n
\n
base_estimator ({ClassifierMixin, list}, optional):
An estimator object implementing fit and predict_proba, or a list of ClassifierMixin, by default DecisionTreeClassifier()

n_estimators (int, optional):
Number of base estimators to use; None if base_estimator is a list, by default None

expand_only_mislabeled (bool, optional):
Expand only the instances mislabeled by each classifier itself, by default True

alpha (float, optional):
Confidence level, by default 0.95

q_exp (int, optional):
Exponent used in the estimation of the error rate, by default 2

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
\n
\n\n
Raises
\n\n
\n
AttributeError: If n_estimators is None and base_estimator is not a list
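A sketch of a heterogeneous committee passed as a list, with n_estimators left as None (sslearn.wrapper.DemocraticCoLearning assumed); X and y are as in the previous examples.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import DemocraticCoLearning  # assumed module path

clf = DemocraticCoLearning(
    base_estimator=[DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()],
    alpha=0.95, random_state=0)
clf.fit(X, y)  # y uses -1 for unlabeled instances
```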
Wang, J., Luo, S. W., & Zeng, X. H. (2008).
A random subspace method for co-training,
in 2008 IEEE International Joint Conference on Neural Networks
(IEEE World Congress on Computational Intelligence)
(pp. 195-200). IEEE. 10.1109/IJCNN.2008.4633789
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

max_iterations (int, optional):
Maximum number of iterations allowed. Should be greater than or equal to 0. If -1, it iterates until U is empty, by default 10

n_estimators (int, optional):
The number of base estimators in the ensemble, by default 30

subspace_size (int, optional):
The number of features for each subspace. If None, it is half the number of features, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
A variation of sslearn.wrapper.Rasco that uses the mutual information of each feature to select the random subspaces.
The training process is the same as in Rasco.
\n\n
Methods
\n\n
\n
fit: Fit the model with the labeled instances.

predict: Predict the class for each instance.

predict_proba: Predict the probability for each class.

score: Return the mean accuracy on the given test data and labels.
Yaslan, Y., & Cataltepe, Z. (2010).
Co-training with relevant random subspaces.
Neurocomputing, 73(10-12), 1652-1661.
10.1016/j.neucom.2010.01.018
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

max_iterations (int, optional):
Maximum number of iterations allowed. Should be greater than or equal to 0. If -1, it iterates until U is empty, by default 10

n_estimators (int, optional):
The number of base estimators in the ensemble, by default 30

subspace_size (int, optional):
The number of features for each subspace. If None, it is half the number of features, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of jobs to run in parallel. -1 means using all processors, by default None
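A sketch of the parameters above (sslearn.wrapper.RelRasco assumed); the subspaces are drawn using the mutual information of each feature, as described earlier, and X and y are as in the previous examples.

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import RelRasco  # assumed module path

clf = RelRasco(base_estimator=DecisionTreeClassifier(),
               n_estimators=30, subspace_size=None, random_state=0)
clf.fit(X, y)  # y uses -1 for unlabeled instances
```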
Li, M., & Zhou, Z.-H. (2007).
Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples.
IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans,
37(6), 1088-1098. 10.1109/tsmca.2007.904745
TriTraining. Trio of classifiers with bootstrapping.
\n\n
The main process is:
\n\n\n
Generate three classifiers using bootstrapping.

Iterate until convergence:

Calculate the error between two hypotheses.

If the error is less than the previous error, generate a dataset with the instances on which both hypotheses agree.

Retrain the classifiers with the new dataset and the original labeled dataset.

Combine the predictions of the three classifiers.
\n\n\n
Methods
\n\n
\n
fit: Fit the model with the labeled instances.

predict: Predict the class for each instance.

predict_proba: Predict the probability for each class.

score: Return the mean accuracy on the given test data and labels.
\n
\n\n
References
\n\n
Zhi-Hua Zhou and Ming Li,
Tri-training: exploiting unlabeled data using three classifiers,
in IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 11, pp. 1529-1541, Nov. 2005,
10.1109/TKDE.2005.186
TriTraining. Trio of classifiers with bootstrapping.
\n\n
Parameters
\n\n
\n
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

n_samples (int, optional):
Number of samples to generate. If left to None, this is automatically set to the first dimension of the arrays, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors, by default None
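A sketch of the parameters above (sslearn.wrapper.TriTraining assumed); X and y are as in the previous examples.

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import TriTraining  # assumed module path

clf = TriTraining(base_estimator=DecisionTreeClassifier(), random_state=0, n_jobs=-1)
clf.fit(X, y)  # three bootstrapped copies of the estimator are trained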
A variation of TriTraining in which the instances are depurated in each iteration: an instance is kept only if its neighbors share its class, otherwise it is removed.
At the end of the iterations, the instances are clustered and each cluster centroid is assigned a class.
\n\n
Methods
\n\n
\n
fit: Fit the model with the labeled instances.

predict: Predict the class for each instance.

predict_proba: Predict the probability for each class.

score: Return the mean accuracy on the given test data and labels.
\n
\n\n
References
\n\n
Deng C., Guo M.Z. (2006).
Tri-training and Data Editing Based Semi-supervised Clustering Algorithm,
in Gelbukh A., Reyes-Garcia C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006.
Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg.
10.1007/11925231_61
DeTriTraining - TriTraining with Depuration and Clustering.
Avoids the noise generated by the TriTraining algorithm by depurating the enlarged dataset and clustering the instances.
\n\n
Parameters
\n\n
\n
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

n_samples (int, optional):
Number of samples to generate. If left to None, this is automatically set to the first dimension of the arrays, by default None

k_neighbors (int, optional):
Number of neighbors used for the depuration classification. If at least k_neighbors/2+1 of them have a class other than the one predicted, the class is ignored, by default 3

mode (string, optional):
How to compute the cluster each instance belongs to. If "seeded", each instance belongs to the nearest cluster. If "constrained", each instance belongs to the nearest cluster unless the instance is in the enlarged dataset, in which case it belongs to the cluster of its class, by default "seeded"

max_iterations (int, optional):
Maximum number of iterations, by default 100

n_jobs (int, optional):
The number of parallel jobs to run for the neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Doesn't affect the fit method, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
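A sketch of the parameters above (sslearn.wrapper.DeTriTraining assumed); X and y are as in the previous examples.

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import DeTriTraining  # assumed module path

clf = DeTriTraining(base_estimator=DecisionTreeClassifier(),
                    k_neighbors=3, mode="seeded", max_iterations=100)
clf.fit(X, y)  # y uses -1 for unlabeled instances
```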
```python
# Source of sslearn/__init__.py
# Open README.md and add its contents to __doc__
import os

if os.path.exists("../README.md"):
    with open("../README.md", "r") as f:
        __doc__ = f.read()
elif os.path.exists("README.md"):
    with open("README.md", "r") as f:
        __doc__ = f.read()
else:
    __doc__ = "Semi-Supervised Learning (SSL) is a Python package that provides tools to train and evaluate semi-supervised learning models."


__version__ = '1.0.4.1'
__AUTHOR__ = "José Luis Garrido-Labrador"  # Author of the package
__AUTHOR_EMAIL__ = "jlgarrido@ubu.es"  # Author's email
__URL__ = "https://pypi.org/project/sslearn/"
```
sslearn.base API documentation
Summary of module sslearn.base:

Functions

get_dataset(X, y):
Check and divide dataset between labeled and unlabeled data.

Classes

FakedProbaClassifier:
Create a classifier that fakes predict_proba method if it does not exist.

OneVsRestSSLClassifier:
Adapted OneVsRestClassifier for SSL datasets.

def get_dataset(X, y):
Check and divide a dataset between labeled and unlabeled data.

Parameters

X (ndarray or DataFrame of shape (n_samples, n_features)):
Features matrix.

y (ndarray of shape (n_samples,)):
Target vector.

Returns

X_label (ndarray or DataFrame of shape (n_label, n_features)):
Labeled features matrix.

y_label (ndarray or Series of shape (n_label,)):
Labeled target vector.

X_unlabel (ndarray or DataFrame of shape (n_unlabel, n_features)):
Unlabeled features matrix.
FakedProbaClassifier(base_estimator)

Create a classifier that fakes predict_proba method if it does not exist.

Parameters

base_estimator (ClassifierMixin):
A classifier that implements fit and predict methods.

def fit(self, X, y):

Fit a FakedProbaClassifier.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)):
The input samples.

y ({array-like, sparse matrix} of shape (n_samples,)):
The target values.

Returns

self (FakedProbaClassifier):
Returns self.

def predict(self, X):

Predict the classes of X.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)):
Array representing the data.

Returns

y (ndarray of shape (n_samples,)):
Array with predicted labels.

def predict_proba(self, X):

Predict the probabilities of each class for X.
If the base estimator does not have a predict_proba method, it will be faked using one-hot encoding.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)):
Array representing the data.

Returns

y (ndarray of shape (n_samples, n_classes)):
Array with predicted probabilities.
class OneVsRestSSLClassifier(sklearn.multiclass.OneVsRestClassifier):
Adapted OneVsRestClassifier for SSL datasets.

Prevents the use of unlabeled data as an independent class in the classifier.

For more information on the OvR classifier, see the documentation of OneVsRestClassifier.
def __init__(self, estimator, *, n_jobs=None):

Parameters

estimator ({ClassifierMixin, list}):
An estimator object implementing fit and predict_proba, or a list of ClassifierMixin.

n_jobs (int, optional):
The number of jobs to run in parallel. -1 means using all processors, by default None
sslearn.datasets API documentation
This module contains functions to load and save datasets in different formats.

Functions

read_csv : Load a dataset from a CSV file.

read_keel : Load a dataset from a KEEL file.

secure_dataset : Secure the dataset by converting it into a secure format.

save_keel : Save a dataset in KEEL format.
def secure_dataset(X, y):
It guarantees that the dataset does not use -1 as a valid class, so that it can be turned into a semi-supervised dataset afterwards
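+
+A short behavioural sketch (the labels here are made up; the function simply raises if -1 is already a valid class):
+
+import numpy as np
+
+X = np.zeros((3, 2))
+X, y = secure_dataset(X, np.array([0, 1, 2]))    # fine, returned unchanged
+# secure_dataset(X, np.array([0, -1, 2]))        # raises ValueError: -1 is reserved for unlabeled instances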
70defartificial_ssl_dataset(X,y,label_rate=0.1,random_state=None,force_minimum=None,indexes=False,**kwards):
+ 71"""Create an artificial Semi-supervised dataset from a supervised dataset.
+ 72
+ 73 Parameters
+ 74 ----------
+ 75 X : array-like of shape (n_samples, n_features)
+ 76 Training data, where n_samples is the number of samples
+ 77 and n_features is the number of features.
+ 78 y : array-like of shape (n_samples,)
+ 79 The target variable for supervised learning problems.
+ 80 label_rate : float, optional
+ 81 Proportion between labeled instances and unlabel instances, by default 0.1
+ 82 random_state : int or RandomState, optional
+ 83 Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls, by default None
+ 84 force_minimum: int, optional
+ 85 Force a minimum of instances of each class, by default None
+ 86 indexes: bool, optional
+ 87 If True, return the indexes of the labeled and unlabeled instances, by default False
+ 88 shuffle: bool, default=True
+ 89 Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
+ 90 stratify: array-like, default=None
+ 91 If not None, data is split in a stratified fashion, using this as the class labels.
+ 92
+ 93 Returns
+ 94 -------
+ 95 X : ndarray
+ 96 The feature set.
+ 97 y : ndarray
+ 98 The label set, -1 for unlabel instance.
+ 99 X_unlabel: ndarray
+100 The feature set for each y mark as unlabel
+101 y_unlabel: ndarray
+102 The true label for each y in the same order.
+103 label: ndarray (optional)
+104 The training set indexes for split mark as labeled.
+105 unlabel: ndarray (optional)
+106 The training set indexes for split mark as unlabeled.
+107 """
+108assert(label_rate>0)and(label_rate<1),\
+109"Label rate must be in (0, 1)."
+110assert"test_size"notinkwardsand"train_size"notinkwards,\
+111"Test size and train size are illegal parameters in this method."
+112
+113indices=np.arange(len(y))
+114
+115ifforce_minimumisnotNone:
+116try:
+117selected=__random_select_n_instances(y,force_minimum,random_state)
+118exceptValueError:
+119raiseValueError("The number of instances of each class is less than force_minimum.")
+120
+121# Remove selected instances from indices
+122indices=np.delete(indices,selected,axis=0)
+123
+124# Train test split with indexes
+125label,unlabel=ms.train_test_split(indices,train_size=label_rate,
+126random_state=random_state,**kwards)
+127
+128ifforce_minimumisnotNone:
+129label=np.concatenate((selected,label))
+130
+131# Create the label and unlabel sets
+132X_label,y_label,X_unlabel,y_unlabel=X[label],y[label],\
+133X[unlabel],np.array([-1]*len(unlabel))
+134
+135# Create the artificial dataset
+136X=np.concatenate((X_label,X_unlabel),axis=0)
+137y=np.concatenate((y_label,y_unlabel),axis=0)
+138
+139ifindexes:
+140returnX,y,X_unlabel,y_unlabel,label,unlabel
+141
+142returnX,y,X_unlabel,y_unlabel
+143
+144
+145"""
+146 if force_minimum is not None:
+147 try:
+148 selected = __random_select_n_instances(y, force_minimum, random_state)
+149 except ValueError:
+150 raise ValueError("The number of instances of each class is less than force_minimum.")
+151 X_selected = X[selected]
+152 y_selected = y[selected]
+153
+154 # Remove selected instances from X and y
+155 X = np.delete(X, selected, axis=0)
+156 y = np.delete(y, selected, axis=0)
+157
+158 X_label, X_unlabel, y_label, true_label = \
+159 ms.train_test_split(X, y,
+160 train_size=label_rate,
+161 random_state=random_state, **kwards)
+162 X = np.concatenate((X_label, X_unlabel), axis=0)
+163 y = np.concatenate((y_label, np.array([-1] * len(true_label))), axis=0)
+164
+165 if force_minimum is not None:
+166 X = np.concatenate((X, X_selected), axis=0)
+167 y = np.concatenate((y, y_selected), axis=0)
+168
+169 if indexes:
+170 return X, y, X_unlabel, true_label, X_label, X_unlabel
+171
+172 return X, y, X_unlabel, true_label
+173 """
+
+
+
+
Create an artificial Semi-supervised dataset from a supervised dataset.
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+Training data, where n_samples is the number of samples
+and n_features is the number of features.
+
y (array-like of shape (n_samples,)):
+The target variable for supervised learning problems.
+
label_rate (float, optional):
+Proportion of labeled instances to unlabeled instances, by default 0.1
+
random_state (int or RandomState, optional):
+Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls, by default None
+
force_minimum (int, optional):
+Force a minimum of instances of each class, by default None
+
indexes (bool, optional):
+If True, return the indexes of the labeled and unlabeled instances, by default False
+
shuffle (bool, default=True):
+Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
+
stratify (array-like, default=None):
+If not None, data is split in a stratified fashion, using this as the class labels.
+
+
+
Returns
+
+
+
X (ndarray):
+The feature set.
+
y (ndarray):
+The label set, with -1 for unlabeled instances.
+
X_unlabel (ndarray):
+The feature set of the instances marked as unlabeled.
+
y_unlabel (ndarray):
+The true label of each unlabeled instance, in the same order.
+
label (ndarray (optional)):
+The training set indexes marked as labeled.
+
unlabel (ndarray (optional)):
+The training set indexes marked as unlabeled.
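+
+For example (mirroring the examples used elsewhere in this documentation):
+
+from sklearn.datasets import load_iris
+from sslearn.model_selection import artificial_ssl_dataset
+
+X, y = load_iris(return_X_y=True)
+# Keep 10% of the labels; the remaining instances get the label -1
+X_ssl, y_ssl, X_unlabel, y_unlabel = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+# With indexes=True, the labeled and unlabeled index arrays are also returned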
+
+
+
+
+
+
+
+
+
+ class
+ StratifiedKFoldSS:
+
+
+
+
+
+
7classStratifiedKFoldSS():
+ 8"""
+ 9 Stratified K-Folds cross-validator for semi-supervised learning.
+10
+11 Provides label and unlabel indices for each split. Using the `StratifiedKFold` method from `sklearn`.
+12 The `test` set is the labeled set and the `train` set is the unlabeled set.
+13 """
+14
+15
+16def__init__(self,n_splits=5,shuffle=False,random_state=None):
+17"""
+18 Parameters
+19 ----------
+20 n_splits : int, default=5
+21 Number of folds. Must be at least 2.
+22 shuffle : bool, default=False
+23 Whether to shuffle each class's samples before splitting into batches.
+24 random_state : int or RandomState instance, default=None
+25 When shuffle is True, random_state affects the ordering of the indices.
+26
+27 """
+28
+29self.K=ms.StratifiedKFold(n_splits=n_splits,shuffle=shuffle,
+30random_state=random_state)
+31self.n_splits=n_splits
+32self.shuffle=shuffle
+33self.random_state=random_state
+34
+35defsplit(self,X,y):
+36"""Generate a artificial dataset based on StratifiedKFold method
+37
+38 Parameters
+39 ----------
+40 X : array-like of shape (n_samples, n_features)
+41 Training data, where n_samples is the number of samples
+42 and n_features is the number of features.
+43 y : array-like of shape (n_samples,)
+44 The target variable for supervised learning problems.
+45
+46 Yields
+47 -------
+48 X : ndarray
+49 The feature set.
+50 y : ndarray
+51 The label set, -1 for unlabel instance.
+52 label : ndarray
+53 The training set indices for split mark as labeled.
+54 unlabel : ndarray
+55 The training set indices for split mark as unlabeled.
+56 """
+57fortrain,testinself.K.split(X,y):
+58# Inverse train and test because train is big dataset
+59label=test
+60unlabel=train
+61
+62X_label,y_label,X_unlabel,y_unlabel=X[label],y[label],\
+63X[unlabel],np.array([-1]*len(unlabel))
+64X_=np.concatenate((X_label,X_unlabel),axis=0)
+65y_=np.concatenate((y_label,y_unlabel),axis=0)
+66
+67yieldX_,y_,label,unlabel
+
+
+
+
Stratified K-Folds cross-validator for semi-supervised learning.
+
+
Provides labeled and unlabeled indices for each split, using the StratifiedKFold method from sklearn.
+The test set becomes the labeled set and the train set becomes the unlabeled set.
def __init__(self, n_splits=5, shuffle=False, random_state=None):
+    """
+    Parameters
+    ----------
+    n_splits : int, default=5
+        Number of folds. Must be at least 2.
+    shuffle : bool, default=False
+        Whether to shuffle each class's samples before splitting into batches.
+    random_state : int or RandomState instance, default=None
+        When shuffle is True, random_state affects the ordering of the indices.
+
+    """
+
+    self.K = ms.StratifiedKFold(n_splits=n_splits, shuffle=shuffle,
+                                random_state=random_state)
+    self.n_splits = n_splits
+    self.shuffle = shuffle
+    self.random_state = random_state
+
+
+
+
Parameters
+
+
+
n_splits (int, default=5):
+Number of folds. Must be at least 2.
+
shuffle (bool, default=False):
+Whether to shuffle each class's samples before splitting into batches.
+
random_state (int or RandomState instance, default=None):
+When shuffle is True, random_state affects the ordering of the indices.
+
+
+
+
+
+
+
+
+
+ def
+ split(self, X, y):
+
+
+
+
+
+
def split(self, X, y):
+    """Generate a artificial dataset based on StratifiedKFold method
+
+    Parameters
+    ----------
+    X : array-like of shape (n_samples, n_features)
+        Training data, where n_samples is the number of samples
+        and n_features is the number of features.
+    y : array-like of shape (n_samples,)
+        The target variable for supervised learning problems.
+
+    Yields
+    -------
+    X : ndarray
+        The feature set.
+    y : ndarray
+        The label set, -1 for unlabel instance.
+    label : ndarray
+        The training set indices for split mark as labeled.
+    unlabel : ndarray
+        The training set indices for split mark as unlabeled.
+    """
+    for train, test in self.K.split(X, y):
+        # Inverse train and test because train is big dataset
+        label = test
+        unlabel = train
+
+        X_label, y_label, X_unlabel, y_unlabel = X[label], y[label], \
+            X[unlabel], np.array([-1] * len(unlabel))
+        X_ = np.concatenate((X_label, X_unlabel), axis=0)
+        y_ = np.concatenate((y_label, y_unlabel), axis=0)
+
+        yield X_, y_, label, unlabel
+
+
+
+
Generate an artificial dataset based on the StratifiedKFold method
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+Training data, where n_samples is the number of samples
+and n_features is the number of features.
+
y (array-like of shape (n_samples,)):
+The target variable for supervised learning problems.
+
+
+
Yields
+
+
+
X (ndarray):
+The feature set.
+
y (ndarray):
+The label set, with -1 for unlabeled instances.
+
label (ndarray):
+The training set indices marked as labeled.
+
unlabel (ndarray):
+The training set indices marked as unlabeled.
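+
+A usage sketch (assuming the class is importable from sslearn.model_selection, like artificial_ssl_dataset):
+
+from sklearn.datasets import load_iris
+from sslearn.model_selection import StratifiedKFoldSS   # assumed import path
+
+X, y = load_iris(return_X_y=True)
+folds = StratifiedKFoldSS(n_splits=5, shuffle=True, random_state=0)
+for X_, y_, label, unlabel in folds.split(X, y):
+    # X_ and y_ stack the labeled fold first, then the unlabeled fold with its labels set to -1
+    pass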
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/sslearn/restricted.html b/docs/sslearn/restricted.html
new file mode 100644
index 0000000..e39f346
--- /dev/null
+++ b/docs/sslearn/restricted.html
@@ -0,0 +1,985 @@
+
+ sslearn.restricted API documentation
+
Compute the conflict rate of a prediction, given a set of restrictions.
+ combine_predictions:
+ Combine the predictions of a group of instances to keep the restrictions.
+
+
+
+
+
+
+
+
1"""Summary of module `sslearn.restricted`:
+ 2
+ 3This module contains classes to train a classifier using the restricted set classification approach.
+ 4
+ 5## Classes
+ 6
+ 7[WhoIsWhoClassifier](#WhoIsWhoClassifier):
+ 8> Who is Who Classifier
+ 9
+ 10## Functions
+ 11
+ 12[conflict_rate](#conflict_rate):
+ 13> Compute the conflict rate of a prediction, given a set of restrictions.
+ 14[combine_predictions](#combine_predictions):
+ 15> Combine the predictions of a group of instances to keep the restrictions.
+ 16
+ 17
+ 18"""
+ 19
+ 20importnumpyasnp
+ 21fromsklearn.baseimportClassifierMixin,MetaEstimatorMixin,BaseEstimator
+ 22fromscipy.optimizeimportlinear_sum_assignment
+ 23importwarnings
+ 24importpandasaspd
+ 25
+ 26__all__=["conflict_rate","combine_predictions","WhoIsWhoClassifier"]
+ 27
+ 28classWhoIsWhoClassifier(BaseEstimator,ClassifierMixin,MetaEstimatorMixin):
+ 29
+ 30def__init__(self,base_estimator,method="hungarian",conflict_weighted=True):
+ 31"""
+ 32 Who is Who Classifier
+ 33 Kuncheva, L. I., Rodriguez, J. J., & Jackson, A. S. (2017).
+ 34 Restricted set classification: Who is there?. <i>Pattern Recognition</i>, 63, 158-170.
+ 35
+ 36 Parameters
+ 37 ----------
+ 38 base_estimator : ClassifierMixin
+ 39 The base estimator to be used for training.
+ 40 method : str, optional
+ 41 The method to use to assing class, it can be `greedy` to first-look or `hungarian` to use the Hungarian algorithm, by default "hungarian"
+ 42 conflict_weighted : bool, default=True
+ 43 Whether to weighted the confusion rate by the number of instances with the same group.
+ 44 """
+ 45allowed_methods=["greedy","hungarian"]
+ 46self.base_estimator=base_estimator
+ 47self.method=method
+ 48ifmethodnotinallowed_methods:
+ 49raiseValueError(f"method {self.method} not supported, use one of {allowed_methods}")
+ 50self.conflict_weighted=conflict_weighted
+ 51
+ 52
+ 53deffit(self,X,y,instance_group=None,**kwards):
+ 54"""Fit the model according to the given training data.
+ 55 Parameters
+ 56 ----------
+ 57 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 58 The input samples.
+ 59 y : array-like of shape (n_samples,)
+ 60 The target values.
+ 61 instance_group : array-like of shape (n_samples)
+ 62 The group. Two instances with the same label are not allowed to be in the same group. If None, group restriction will not be used in training.
+ 63 Returns
+ 64 -------
+ 65 self : object
+ 66 Returns self.
+ 67 """
+ 68self.base_estimator=self.base_estimator.fit(X,y,**kwards)
+ 69self.classes_=self.base_estimator.classes_
+ 70ifinstance_groupisnotNone:
+ 71self.conflict_in_train=conflict_rate(self.base_estimator.predict(X),instance_group,self.conflict_weighted)
+ 72else:
+ 73self.conflict_in_train=None
+ 74returnself
+ 75
+ 76defconflict_rate(self,X,instance_group):
+ 77"""Calculate the conflict rate of the model.
+ 78 Parameters
+ 79 ----------
+ 80 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 81 The input samples.
+ 82 instance_group : array-like of shape (n_samples)
+ 83 The group. Two instances with the same label are not allowed to be in the same group.
+ 84 Returns
+ 85 -------
+ 86 float
+ 87 The conflict rate.
+ 88 """
+ 89y_pred=self.base_estimator.predict(X)
+ 90returnconflict_rate(y_pred,instance_group,self.conflict_weighted)
+ 91
+ 92defpredict(self,X,instance_group):
+ 93"""Predict class for X.
+ 94 Parameters
+ 95 ----------
+ 96 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 97 The input samples.
+ 98 **kwards : array-like of shape (n_samples)
+ 99 The group. Two instances with the same label are not allowed to be in the same group.
+100 Returns
+101 -------
+102 array-like of shape (n_samples, n_classes)
+103 The class probabilities of the input samples.
+104 """
+105
+106y_prob=self.predict_proba(X)
+107
+108y_predicted=combine_predictions(y_prob,instance_group,len(self.classes_),self.method)
+109
+110returnself.classes_.take(y_predicted)
+111
+112
+113defpredict_proba(self,X):
+114"""Predict class probabilities for X.
+115 Parameters
+116 ----------
+117 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+118 The input samples.
+119 Returns
+120 -------
+121 array-like of shape (n_samples, n_classes)
+122 The class probabilities of the input samples.
+123 """
+124returnself.base_estimator.predict_proba(X)
+125
+126
+127defconflict_rate(y_pred,restrictions,weighted=True):
+128"""
+129 Computes the conflict rate of a prediction, given a set of restrictions.
+130 Parameters
+131 ----------
+132 y_pred : array-like of shape (n_samples,)
+133 Predicted target values.
+134 restrictions : array-like of shape (n_samples,)
+135 Restrictions for each sample. If two samples have the same restriction, they cannot have the same y.
+136 weighted : bool, default=True
+137 Whether to weighted the confusion rate by the number of instances with the same group.
+138 Returns
+139 -------
+140 conflict rate : float
+141 The conflict rate.
+142 """
+143
+144# Check that y_pred and restrictions have the same length
+145iflen(y_pred)!=len(restrictions):
+146raiseValueError("y_pred and restrictions must have the same length.")
+147
+148restricted_df=pd.DataFrame({'y_pred':y_pred,'restrictions':restrictions})
+149
+150conflicted=restricted_df.groupby('restrictions').agg({'y_pred':lambdax:np.unique(x,return_counts=True)[1][np.unique(x,return_counts=True)[1]>1].sum()})
+151ifweighted:
+152returnconflicted.sum().y_pred/len(y_pred)
+153else:
+154rcount=restricted_df.groupby('restrictions').count()
+155return(conflicted.y_pred/rcount.y_pred).sum()
+156
+157defcombine_predictions(y_probas,instance_group,class_number,method="hungarian"):
+158y_predicted=[]
+159forgroupinnp.unique(instance_group):
+160
+161mask=instance_group==group
+162probas_matrix=y_probas[mask]
+163
+164
+165preds=list(np.argmax(probas_matrix,axis=1))
+166
+167iflen(preds)==len(set(preds))orprobas_matrix.shape[0]>class_number:
+168y_predicted.extend(preds)
+169ifprobas_matrix.shape[0]>class_number:
+170warnings.warn("That the number of instances in the group is greater than the number of classes.",UserWarning)
+171continue
+172
+173ifmethod=="greedy":
+174y=_greedy(probas_matrix)
+175elifmethod=="hungarian":
+176y=_hungarian(probas_matrix)
+177
+178y_predicted.extend(y)
+179returny_predicted
+180
+181def_greedy(probas_matrix):
+182
+183probas=probas_matrix.reshape(probas_matrix.size,)
+184order=probas.argsort()[::-1]
+185
+186y_pred_group=[Noneforiinrange(probas_matrix.shape[0])]
+187
+188instance_to_predict={iforiinrange(probas_matrix.shape[0])}
+189class_predicted=set()
+190foriteminorder:
+191class_=item%probas_matrix.shape[0]
+192instance=item//probas_matrix.shape[0]
+193ifinstanceininstance_to_predictandclass_notinclass_predicted:
+194y_pred_group[instance]=class_
+195instance_to_predict.remove(instance)
+196class_predicted.add(class_)
+197
+198returny_pred_group
+199
+200
+201def_hungarian(probas_matrix):
+202
+203costs=np.log(probas_matrix)
+204costs[costs==-np.inf]=0# if proba is 0, then the cost is 0
+205_,col_ind=linear_sum_assignment(costs,maximize=True)
+206col_ind=list(col_ind)
+207
+208returncol_ind
+
def conflict_rate(y_pred, restrictions, weighted=True):
+    """
+    Computes the conflict rate of a prediction, given a set of restrictions.
+    Parameters
+    ----------
+    y_pred : array-like of shape (n_samples,)
+        Predicted target values.
+    restrictions : array-like of shape (n_samples,)
+        Restrictions for each sample. If two samples have the same restriction, they cannot have the same y.
+    weighted : bool, default=True
+        Whether to weighted the confusion rate by the number of instances with the same group.
+    Returns
+    -------
+    conflict rate : float
+        The conflict rate.
+    """
+
+    # Check that y_pred and restrictions have the same length
+    if len(y_pred) != len(restrictions):
+        raise ValueError("y_pred and restrictions must have the same length.")
+
+    restricted_df = pd.DataFrame({'y_pred': y_pred, 'restrictions': restrictions})
+
+    conflicted = restricted_df.groupby('restrictions').agg({'y_pred': lambda x: np.unique(x, return_counts=True)[1][np.unique(x, return_counts=True)[1] > 1].sum()})
+    if weighted:
+        return conflicted.sum().y_pred / len(y_pred)
+    else:
+        rcount = restricted_df.groupby('restrictions').count()
+        return (conflicted.y_pred / rcount.y_pred).sum()
+
+
+
+
Computes the conflict rate of a prediction, given a set of restrictions.
+
+
Parameters
+
+
+
y_pred (array-like of shape (n_samples,)):
+Predicted target values.
+
restrictions (array-like of shape (n_samples,)):
+Restrictions for each sample. If two samples have the same restriction, they cannot have the same y.
+
weighted (bool, default=True):
+Whether to weight the conflict rate by the number of instances in the same group.
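+
+A small worked example (the values are made up):
+
+import numpy as np
+from sslearn.restricted import conflict_rate
+
+y_pred = np.array(["a", "a", "b", "c"])
+groups = np.array([1, 1, 1, 2])   # the two "a" predictions share group 1, which violates the restriction
+conflict_rate(y_pred, groups, weighted=True)   # 2 conflicting instances out of 4 -> 0.5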
29classWhoIsWhoClassifier(BaseEstimator,ClassifierMixin,MetaEstimatorMixin):
+ 30
+ 31def__init__(self,base_estimator,method="hungarian",conflict_weighted=True):
+ 32"""
+ 33 Who is Who Classifier
+ 34 Kuncheva, L. I., Rodriguez, J. J., & Jackson, A. S. (2017).
+ 35 Restricted set classification: Who is there?. <i>Pattern Recognition</i>, 63, 158-170.
+ 36
+ 37 Parameters
+ 38 ----------
+ 39 base_estimator : ClassifierMixin
+ 40 The base estimator to be used for training.
+ 41 method : str, optional
+ 42 The method to use to assing class, it can be `greedy` to first-look or `hungarian` to use the Hungarian algorithm, by default "hungarian"
+ 43 conflict_weighted : bool, default=True
+ 44 Whether to weighted the confusion rate by the number of instances with the same group.
+ 45 """
+ 46allowed_methods=["greedy","hungarian"]
+ 47self.base_estimator=base_estimator
+ 48self.method=method
+ 49ifmethodnotinallowed_methods:
+ 50raiseValueError(f"method {self.method} not supported, use one of {allowed_methods}")
+ 51self.conflict_weighted=conflict_weighted
+ 52
+ 53
+ 54deffit(self,X,y,instance_group=None,**kwards):
+ 55"""Fit the model according to the given training data.
+ 56 Parameters
+ 57 ----------
+ 58 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 59 The input samples.
+ 60 y : array-like of shape (n_samples,)
+ 61 The target values.
+ 62 instance_group : array-like of shape (n_samples)
+ 63 The group. Two instances with the same label are not allowed to be in the same group. If None, group restriction will not be used in training.
+ 64 Returns
+ 65 -------
+ 66 self : object
+ 67 Returns self.
+ 68 """
+ 69self.base_estimator=self.base_estimator.fit(X,y,**kwards)
+ 70self.classes_=self.base_estimator.classes_
+ 71ifinstance_groupisnotNone:
+ 72self.conflict_in_train=conflict_rate(self.base_estimator.predict(X),instance_group,self.conflict_weighted)
+ 73else:
+ 74self.conflict_in_train=None
+ 75returnself
+ 76
+ 77defconflict_rate(self,X,instance_group):
+ 78"""Calculate the conflict rate of the model.
+ 79 Parameters
+ 80 ----------
+ 81 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 82 The input samples.
+ 83 instance_group : array-like of shape (n_samples)
+ 84 The group. Two instances with the same label are not allowed to be in the same group.
+ 85 Returns
+ 86 -------
+ 87 float
+ 88 The conflict rate.
+ 89 """
+ 90y_pred=self.base_estimator.predict(X)
+ 91returnconflict_rate(y_pred,instance_group,self.conflict_weighted)
+ 92
+ 93defpredict(self,X,instance_group):
+ 94"""Predict class for X.
+ 95 Parameters
+ 96 ----------
+ 97 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 98 The input samples.
+ 99 **kwards : array-like of shape (n_samples)
+100 The group. Two instances with the same label are not allowed to be in the same group.
+101 Returns
+102 -------
+103 array-like of shape (n_samples, n_classes)
+104 The class probabilities of the input samples.
+105 """
+106
+107y_prob=self.predict_proba(X)
+108
+109y_predicted=combine_predictions(y_prob,instance_group,len(self.classes_),self.method)
+110
+111returnself.classes_.take(y_predicted)
+112
+113
+114defpredict_proba(self,X):
+115"""Predict class probabilities for X.
+116 Parameters
+117 ----------
+118 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+119 The input samples.
+120 Returns
+121 -------
+122 array-like of shape (n_samples, n_classes)
+123 The class probabilities of the input samples.
+124 """
+125returnself.base_estimator.predict_proba(X)
+
+
+
+
Base class for all estimators in scikit-learn.
+
+
Notes
+
+
All estimators should specify all the parameters that can be set
+at the class level in their __init__ as explicit keyword
+arguments (no *args or **kwargs).
31def__init__(self,base_estimator,method="hungarian",conflict_weighted=True):
+32"""
+33 Who is Who Classifier
+34 Kuncheva, L. I., Rodriguez, J. J., & Jackson, A. S. (2017).
+35 Restricted set classification: Who is there?. <i>Pattern Recognition</i>, 63, 158-170.
+36
+37 Parameters
+38 ----------
+39 base_estimator : ClassifierMixin
+40 The base estimator to be used for training.
+41 method : str, optional
+42 The method to use to assing class, it can be `greedy` to first-look or `hungarian` to use the Hungarian algorithm, by default "hungarian"
+43 conflict_weighted : bool, default=True
+44 Whether to weighted the confusion rate by the number of instances with the same group.
+45 """
+46allowed_methods=["greedy","hungarian"]
+47self.base_estimator=base_estimator
+48self.method=method
+49ifmethodnotinallowed_methods:
+50raiseValueError(f"method {self.method} not supported, use one of {allowed_methods}")
+51self.conflict_weighted=conflict_weighted
+
+
+
+
Who is Who Classifier
+Kuncheva, L. I., Rodriguez, J. J., & Jackson, A. S. (2017).
+Restricted set classification: Who is there?. Pattern Recognition, 63, 158-170.
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin):
+The base estimator to be used for training.
+
method (str, optional):
+The method used to assign classes; it can be greedy for a first-fit assignment or hungarian to use the Hungarian algorithm, by default "hungarian"
+
conflict_weighted (bool, default=True):
+Whether to weight the conflict rate by the number of instances in the same group.
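+
+A usage sketch (the train/test arrays and group arrays are placeholders; the base estimator must implement predict_proba):
+
+from sklearn.naive_bayes import GaussianNB
+from sslearn.restricted import WhoIsWhoClassifier
+
+clf = WhoIsWhoClassifier(GaussianNB(), method="hungarian")
+clf.fit(X_train, y_train, instance_group=groups_train)
+y_pred = clf.predict(X_test, instance_group=groups_test)   # no two instances in a group share a label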
54deffit(self,X,y,instance_group=None,**kwards):
+55"""Fit the model according to the given training data.
+56 Parameters
+57 ----------
+58 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+59 The input samples.
+60 y : array-like of shape (n_samples,)
+61 The target values.
+62 instance_group : array-like of shape (n_samples)
+63 The group. Two instances with the same label are not allowed to be in the same group. If None, group restriction will not be used in training.
+64 Returns
+65 -------
+66 self : object
+67 Returns self.
+68 """
+69self.base_estimator=self.base_estimator.fit(X,y,**kwards)
+70self.classes_=self.base_estimator.classes_
+71ifinstance_groupisnotNone:
+72self.conflict_in_train=conflict_rate(self.base_estimator.predict(X),instance_group,self.conflict_weighted)
+73else:
+74self.conflict_in_train=None
+75returnself
+
+
+
+
Fit the model according to the given training data.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
y (array-like of shape (n_samples,)):
+The target values.
+
instance_group (array-like of shape (n_samples)):
+The group. Two instances with the same label are not allowed to be in the same group. If None, group restriction will not be used in training.
77defconflict_rate(self,X,instance_group):
+78"""Calculate the conflict rate of the model.
+79 Parameters
+80 ----------
+81 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+82 The input samples.
+83 instance_group : array-like of shape (n_samples)
+84 The group. Two instances with the same label are not allowed to be in the same group.
+85 Returns
+86 -------
+87 float
+88 The conflict rate.
+89 """
+90y_pred=self.base_estimator.predict(X)
+91returnconflict_rate(y_pred,instance_group,self.conflict_weighted)
+
+
+
+
Calculate the conflict rate of the model.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
instance_group (array-like of shape (n_samples)):
+The group. Two instances with the same label are not allowed to be in the same group.
93defpredict(self,X,instance_group):
+ 94"""Predict class for X.
+ 95 Parameters
+ 96 ----------
+ 97 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 98 The input samples.
+ 99 **kwards : array-like of shape (n_samples)
+100 The group. Two instances with the same label are not allowed to be in the same group.
+101 Returns
+102 -------
+103 array-like of shape (n_samples, n_classes)
+104 The class probabilities of the input samples.
+105 """
+106
+107y_prob=self.predict_proba(X)
+108
+109y_predicted=combine_predictions(y_prob,instance_group,len(self.classes_),self.method)
+110
+111returnself.classes_.take(y_predicted)
+
+
+
+
Predict class for X.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
instance_group (array-like of shape (n_samples,)):
+The group. Two instances with the same label are not allowed to be in the same group.
+
+
+
Returns
+
+
+
array-like of shape (n_samples,): The predicted class of each input sample.
+
+
+
+
+
+
+
+
+
+ def
+ predict_proba(self, X):
+
+
+
+
+
+
114defpredict_proba(self,X):
+115"""Predict class probabilities for X.
+116 Parameters
+117 ----------
+118 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+119 The input samples.
+120 Returns
+121 -------
+122 array-like of shape (n_samples, n_classes)
+123 The class probabilities of the input samples.
+124 """
+125returnself.base_estimator.predict_proba(X)
+
+
+
+
Predict class probabilities for X.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
array-like of shape (n_samples, n_classes): The class probabilities of the input samples.
+
+
+
+
+
+
+
Inherited Members
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
sklearn.base.ClassifierMixin
+
score
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/sslearn/subview.html b/docs/sslearn/subview.html
new file mode 100644
index 0000000..2a3bd19
--- /dev/null
+++ b/docs/sslearn/subview.html
@@ -0,0 +1,580 @@
+
+ sslearn.subview API documentation
+
Train a sub-view classifier.
+ SubViewRegressor:
+ Train a sub-view regressor.
+
+
+
+
+
+
+
+
1"""
+ 2Summary of module `sslearn.subview`:
+ 3
+ 4This module contains classes to train a classifier or a regressor selecting a sub-view of the data.
+ 5
+ 6## Classes
+ 7
+ 8[SubViewClassifier](#SubViewClassifier):
+ 9> Train a sub-view classifier.
+10[SubViewRegressor](#SubViewRegressor):
+11> Train a sub-view regressor.
+12
+13
+14"""
+15
+16from._subviewimportSubViewClassifier,SubViewRegressor
+17
+18__all__=["SubViewClassifier","SubViewRegressor"]
+
+
+
+
+
+
+
+
+ class
+ SubViewClassifier(sslearn.subview._subview.SubView, sklearn.base.ClassifierMixin):
+
+
+
+
+
+
154classSubViewClassifier(SubView,ClassifierMixin):
+155
+156defpredict_proba(self,X):
+157"""Predict class probabilities using the base estimator.
+158
+159 Parameters
+160 ----------
+161 X : array-like of shape (n_samples, n_features)
+162 The input samples.
+163
+164 Returns
+165 -------
+166 p : array-like of shape (n_samples, n_classes)
+167 The class probabilities of the input samples.
+168 """
+169ifself.mode=="regex":
+170X=self._regex_subview(X)
+171elifself.mode=="index":
+172X=self._index_subview(X)
+173elifself.mode=="include":
+174X=self._include_subview(X)
+175
+176returnself.base_estimator_.predict_proba(X)
+
+
+
+
A classifier that uses a subview of the data.
+
+
Example
+
+
+
fromsklearn.model_selectionimporttrain_test_split
+fromsklearn.treeimportDecisionTreeClassifier
+fromsslearn.subviewimportSubViewClassifier
+
+# Mode 'include' will include all columns that contain `string`
+clf=SubViewClassifier(DecisionTreeClassifier(),"sepal",mode="include")
+clf.fit(X,y)
+
+# Mode 'regex' will include all columns that match the regex
+clf=SubViewClassifier(DecisionTreeClassifier(),"sepal.*",mode="regex")
+clf.fit(X,y)
+
+# Mode 'index' will include the columns at the index, useful for numpy arrays
+clf=SubViewClassifier(DecisionTreeClassifier(),[0,1],mode="index")
+clf.fit(X,y)
+
+
+
+
+
+
+
+
+
+ def
+ predict_proba(self, X):
+
+
+
+
+
+
def predict_proba(self, X):
+    """Predict class probabilities using the base estimator.
+
+    Parameters
+    ----------
+    X : array-like of shape (n_samples, n_features)
+        The input samples.
+
+    Returns
+    -------
+    p : array-like of shape (n_samples, n_classes)
+        The class probabilities of the input samples.
+    """
+    if self.mode == "regex":
+        X = self._regex_subview(X)
+    elif self.mode == "index":
+        X = self._index_subview(X)
+    elif self.mode == "include":
+        X = self._include_subview(X)
+
+    return self.base_estimator_.predict_proba(X)
+
+
+
+
Predict class probabilities using the base estimator.
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
p (array-like of shape (n_samples, n_classes)):
+The class probabilities of the input samples.
+
+
+
+
+
+
+
Inherited Members
+
+
sslearn.subview._subview.SubView
+
SubView
+
fit
+
predict
+
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
sklearn.base.ClassifierMixin
+
score
+
+
+
+
+
+
+
+
+
+ class
+ SubViewRegressor(sslearn.subview._subview.SubView, sklearn.base.RegressorMixin):
+
+
+
+
+
+
178classSubViewRegressor(SubView,RegressorMixin):
+179
+180defpredict(self,X):
+181"""Predict using the base estimator.
+182
+183 Parameters
+184 ----------
+185 X : array-like of shape (n_samples, n_features)
+186 The input samples.
+187
+188 Returns
+189 -------
+190 y : array-like of shape (n_samples,)
+191 The predicted values.
+192 """
+193returnsuper().predict(X)
+
+
+
+
A regressor that uses a subview of the data.
+
+
Example
+
+
+
fromsklearn.model_selectionimporttrain_test_split
+fromsklearn.treeimportDecisionTreeClassifier
+fromsslearn.subviewimportSubViewClassifier
+
+# Mode 'include' will include all columns that contain `string`
+clf=SubViewClassifier(DecisionTreeClassifier(),"sepal",mode="include")
+clf.fit(X,y)
+
+# Mode 'regex' will include all columns that match the regex
+clf=SubViewClassifier(DecisionTreeClassifier(),"sepal.*",mode="regex")
+clf.fit(X,y)
+
+# Mode 'index' will include the columns at the index, useful for numpy arrays
+clf=SubViewClassifier(DecisionTreeClassifier(),[0,1],mode="index")
+clf.fit(X,y)
+
+
+
+
+
+
+
+
+
+ def
+ predict(self, X):
+
+
+
+
+
+
def predict(self, X):
+    """Predict using the base estimator.
+
+    Parameters
+    ----------
+    X : array-like of shape (n_samples, n_features)
+        The input samples.
+
+    Returns
+    -------
+    y : array-like of shape (n_samples,)
+        The predicted values.
+    """
+    return super().predict(X)
+
+
+
+
Predict using the base estimator.
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
y (array-like of shape (n_samples,)):
+The predicted values.
+
+
+
+
+
+
+
Inherited Members
+
+
sslearn.subview._subview.SubView
+
SubView
+
fit
+
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
sklearn.base.RegressorMixin
+
score
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/sslearn/utils.html b/docs/sslearn/utils.html
new file mode 100644
index 0000000..f641167
--- /dev/null
+++ b/docs/sslearn/utils.html
@@ -0,0 +1,881 @@
+
+ sslearn.utils API documentation
+
Safely divide two numbers preventing division by zero.
+ confidence_interval:
+ Calculate the confidence interval of the predictions.
+ choice_with_proportion:
+ Choose the best predictions according to the proportion of each class.
+ calculate_prior_probability:
+ Calculate the prior probability of each label.
+ mode:
+ Calculate the mode of a list of values.
+ check_n_jobs:
+ Check n_jobs parameter according to the scikit-learn convention.
+ check_classifier:
+ Check if the classifier is a ClassifierMixin or a list of ClassifierMixin.
+
+
+
+
+
+
+
+
1"""
+ 2Some utility functions
+ 3
+ 4This module contains utility functions that are used in different parts of the library.
+ 5
+ 6## Functions
+ 7
+ 8[safe_division](#safe_division):
+ 9> Safely divide two numbers preventing division by zero.
+ 10[confidence_interval](#confidence_interval):
+ 11> Calculate the confidence interval of the predictions.
+ 12[choice_with_proportion](#choice_with_proportion):
+ 13> Choice the best predictions according to the proportion of each class.
+ 14[calculate_prior_probability](#calculate_prior_probability):
+ 15> Calculate the priori probability of each label.
+ 16[mode](#mode):
+ 17> Calculate the mode of a list of values.
+ 18[check_n_jobs](#check_n_jobs):
+ 19> Check `n_jobs` parameter according to the scikit-learn convention.
+ 20[check_classifier](#check_classifier):
+ 21> Check if the classifier is a ClassifierMixin or a list of ClassifierMixin.
+ 22
+ 23"""
+ 24
+ 25importnumpyasnp
+ 26importos
+ 27importmath
+ 28
+ 29importpandasaspd
+ 30
+ 31fromstatsmodels.stats.proportionimportproportion_confint
+ 32fromsklearn.treeimportDecisionTreeClassifier
+ 33fromsklearn.baseimportClassifierMixin
+ 34
+ 35__all__=["safe_division","confidence_interval","choice_with_proportion","calculate_prior_probability",
+ 36"mode","check_n_jobs","check_classifier"]
+ 37
+ 38
+ 39defsafe_division(dividend,divisor,epsilon):
+ 40"""Safely divide two numbers preventing division by zero
+ 41
+ 42 Parameters
+ 43 ----------
+ 44 dividend : numeric
+ 45 Dividend value
+ 46 divisor : numeric
+ 47 Divisor value
+ 48 epsilon : numeric
+ 49 Close to zero value to be used in case of division by zero
+ 50
+ 51 Returns
+ 52 -------
+ 53 result : numeric
+ 54 Result of the division
+ 55 """
+ 56ifdivisor==0:
+ 57returndividend/epsilon
+ 58returndividend/divisor
+ 59
+ 60
+ 61defconfidence_interval(X,hyp,y,alpha=.95):
+ 62"""Calculate the confidence interval of the predictions
+ 63
+ 64 Parameters
+ 65 ----------
+ 66 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 67 The input samples.
+ 68 hyp : classifier
+ 69 The classifier to be used for prediction
+ 70 y : array-like of shape (n_samples,)
+ 71 The target values
+ 72 alpha : float, optional
+ 73 confidence (1 - significance), by default .95
+ 74
+ 75 Returns
+ 76 -------
+ 77 li, hi: float
+ 78 lower and upper bound of the confidence interval
+ 79 """
+ 80data=hyp.predict(X)
+ 81
+ 82successes=np.count_nonzero(data==y)
+ 83trials=X.shape[0]
+ 84li,hi=proportion_confint(successes,trials,alpha=1-alpha,method="wilson")
+ 85returnli,hi
+ 86
+ 87
+ 88defchoice_with_proportion(predictions,class_predicted,proportion,extra=0):
+ 89"""Choice the best predictions according to the proportion of each class.
+ 90
+ 91 Parameters
+ 92 ----------
+ 93 predictions : array-like of shape (n_samples,)
+ 94 array of predictions
+ 95 class_predicted : array-like of shape (n_samples,)
+ 96 array of predicted classes
+ 97 proportion : dict
+ 98 dictionary with the proportion of each class
+ 99 extra : int, optional
+100 number of extra instances to be added, by default 0
+101
+102 Returns
+103 -------
+104 indices: array-like of shape (n_samples,)
+105 array of indices of the best predictions
+106 """
+107n=len(predictions)
+108for_each_class={c:int(n*j)forc,jinproportion.items()}
+109indices=np.zeros(0)
+110forcinproportion:
+111instances=class_predicted==c
+112to_add=np.argsort(predictions,kind="mergesort")[instances][::-1][0:for_each_class[c]+extra]
+113indices=np.concatenate((indices,to_add))
+114
+115returnindices.astype(int)
+116
+117
+118defcalculate_prior_probability(y):
+119"""Calculate the priori probability of each label
+120
+121 Parameters
+122 ----------
+123 y : array-like of shape (n_samples,)
+124 array of labels
+125
+126 Returns
+127 -------
+128 class_probability: dict
+129 dictionary with priori probability (value) of each label (key)
+130 """
+131unique,counts=np.unique(y,return_counts=True)
+132u_c=dict(zip(unique,counts))
+133instances=len(y)
+134foruinu_c:
+135u_c[u]=float(u_c[u]/instances)
+136returnu_c
+137
+138
+139defis_int(x):
+140"""Check if x is of integer type, but not boolean"""
+141# From sktime: BSD 3-Clause
+142# boolean are subclasses of integers in Python, so explicitly exclude them
+143returnisinstance(x,(int,np.integer))andnotisinstance(x,bool)
+144
+145
+146defmode(y):
+147"""Calculate the mode of a list of values
+148
+149 Parameters
+150 ----------
+151 y : array-like of shape (n_samples, n_estimators)
+152 array of values
+153
+154 Returns
+155 -------
+156 mode: array-like of shape (n_samples,)
+157 array of mode of each label
+158 count: array-like of shape (n_samples,)
+159 array of count of the mode of each label
+160 """
+161array=pd.DataFrame(np.array(y))
+162mode=array.mode(axis=0).loc[0,:]
+163count=array.apply(lambdax:x.value_counts().max())
+164returnmode.values,count.values
+165
+166
+167defcheck_n_jobs(n_jobs):
+168"""Check `n_jobs` parameter according to the scikit-learn convention.
+169 From sktime: BSD 3-Clause
+170 Parameters
+171 ----------
+172 n_jobs : int, positive or -1
+173 The number of jobs for parallelization.
+174
+175 Returns
+176 -------
+177 n_jobs : int
+178 Checked number of jobs.
+179 """
+180# scikit-learn convention
+181# https://scikit-learn.org/stable/glossary.html#term-n-jobs
+182ifn_jobsisNone:
+183return1
+184elifnotis_int(n_jobs):
+185raiseValueError(f"`n_jobs` must be None or an integer, but found: {n_jobs}")
+186elifn_jobs<0:
+187returnos.cpu_count()
+188else:
+189returnn_jobs
+190
+191
+192defcalc_number_per_class(y_label):
+193classes=np.unique(y_label)
+194proportion=calculate_prior_probability(y_label)
+195factor=1/min(proportion.values())
+196number_per_class=dict()
+197forcinclasses:
+198number_per_class[c]=math.ceil(proportion[c]*factor)
+199
+200returnnumber_per_class
+201
+202
+203defcheck_classifier(base_classifier,can_be_list=True,collection_size=None):
+204
+205ifbase_classifierisNone:
+206returnDecisionTreeClassifier()
+207elifcan_be_listand(type(base_classifier)==listortype(base_classifier)==tuple):
+208ifcollection_sizeisnotNone:
+209iflen(base_classifier)!=collection_size:
+210raiseAttributeError(f"base_classifier is a list of classifiers, but its length ({len(base_classifier)}) is different from expected ({collection_size})")
+211fori,bcinenumerate(base_classifier):
+212base_classifier[i]=check_classifier(bc,False)
+213returnlist(base_classifier)# Transform to list
+214else:
+215ifnotisinstance(base_classifier,ClassifierMixin):
+216raiseAttributeError(f"base_classifier must be a ClassifierMixin, but found {type(base_classifier)}")
+217returnbase_classifier
+
def safe_division(dividend, divisor, epsilon):
+    """Safely divide two numbers preventing division by zero
+
+    Parameters
+    ----------
+    dividend : numeric
+        Dividend value
+    divisor : numeric
+        Divisor value
+    epsilon : numeric
+        Close to zero value to be used in case of division by zero
+
+    Returns
+    -------
+    result : numeric
+        Result of the division
+    """
+    if divisor == 0:
+        return dividend / epsilon
+    return dividend / divisor
+
+
+
+
Safely divide two numbers preventing division by zero
+
+
Parameters
+
+
+
dividend (numeric):
+Dividend value
+
divisor (numeric):
+Divisor value
+
epsilon (numeric):
+Close to zero value to be used in case of division by zero
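+
+For example:
+
+from sslearn.utils import safe_division
+
+safe_division(10, 2, 1e-9)   # 5.0
+safe_division(10, 0, 1e-9)   # divides by epsilon instead of raising ZeroDivisionError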
def confidence_interval(X, hyp, y, alpha=.95):
+    """Calculate the confidence interval of the predictions
+
+    Parameters
+    ----------
+    X : {array-like, sparse matrix} of shape (n_samples, n_features)
+        The input samples.
+    hyp : classifier
+        The classifier to be used for prediction
+    y : array-like of shape (n_samples,)
+        The target values
+    alpha : float, optional
+        confidence (1 - significance), by default .95
+
+    Returns
+    -------
+    li, hi: float
+        lower and upper bound of the confidence interval
+    """
+    data = hyp.predict(X)
+
+    successes = np.count_nonzero(data == y)
+    trials = X.shape[0]
+    li, hi = proportion_confint(successes, trials, alpha=1 - alpha, method="wilson")
+    return li, hi
+
+
+
+
Calculate the confidence interval of the predictions
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
hyp (classifier):
+The classifier to be used for prediction
+
y (array-like of shape (n_samples,)):
+The target values
+
alpha (float, optional):
+confidence (1 - significance), by default .95
+
+
+
Returns
+
+
+
li, hi (float):
+lower and upper bound of the confidence interval
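+
+A usage sketch (the train/test arrays are placeholders):
+
+from sklearn.tree import DecisionTreeClassifier
+from sslearn.utils import confidence_interval
+
+clf = DecisionTreeClassifier().fit(X_train, y_train)
+li, hi = confidence_interval(X_test, clf, y_test, alpha=.95)   # Wilson interval for the accuracy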
89defchoice_with_proportion(predictions,class_predicted,proportion,extra=0):
+ 90"""Choice the best predictions according to the proportion of each class.
+ 91
+ 92 Parameters
+ 93 ----------
+ 94 predictions : array-like of shape (n_samples,)
+ 95 array of predictions
+ 96 class_predicted : array-like of shape (n_samples,)
+ 97 array of predicted classes
+ 98 proportion : dict
+ 99 dictionary with the proportion of each class
+100 extra : int, optional
+101 number of extra instances to be added, by default 0
+102
+103 Returns
+104 -------
+105 indices: array-like of shape (n_samples,)
+106 array of indices of the best predictions
+107 """
+108n=len(predictions)
+109for_each_class={c:int(n*j)forc,jinproportion.items()}
+110indices=np.zeros(0)
+111forcinproportion:
+112instances=class_predicted==c
+113to_add=np.argsort(predictions,kind="mergesort")[instances][::-1][0:for_each_class[c]+extra]
+114indices=np.concatenate((indices,to_add))
+115
+116returnindices.astype(int)
+
+
+
+
Choose the best predictions according to the proportion of each class.
+
+
Parameters
+
+
+
predictions (array-like of shape (n_samples,)):
+array of predictions
+
class_predicted (array-like of shape (n_samples,)):
+array of predicted classes
+
proportion (dict):
+dictionary with the proportion of each class
+
extra (int, optional):
+number of extra instances to be added, by default 0
+
+
+
Returns
+
+
+
indices (array-like of shape (n_samples,)):
+array of indices of the best predictions
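+
+A small sketch with made-up confidences (the exact selection depends on the class proportions and the extra parameter):
+
+import numpy as np
+from sslearn.utils import choice_with_proportion
+
+confidences = np.array([0.9, 0.6, 0.8, 0.4])   # confidence of each prediction
+classes = np.array(["a", "a", "b", "b"])       # predicted class of each instance
+idx = choice_with_proportion(confidences, classes, {"a": 0.5, "b": 0.5})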
+
+
+
+
+
+
+
+
+
+ def
+ calculate_prior_probability(y):
+
+
+
+
+
+
def calculate_prior_probability(y):
+    """Calculate the priori probability of each label
+
+    Parameters
+    ----------
+    y : array-like of shape (n_samples,)
+        array of labels
+
+    Returns
+    -------
+    class_probability: dict
+        dictionary with priori probability (value) of each label (key)
+    """
+    unique, counts = np.unique(y, return_counts=True)
+    u_c = dict(zip(unique, counts))
+    instances = len(y)
+    for u in u_c:
+        u_c[u] = float(u_c[u] / instances)
+    return u_c
+
+
+
+
Calculate the prior probability of each label
+
+
Parameters
+
+
+
y (array-like of shape (n_samples,)):
+array of labels
+
+
+
Returns
+
+
+
class_probability (dict):
+dictionary with the prior probability (value) of each label (key)
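+
+For example:
+
+import numpy as np
+from sslearn.utils import calculate_prior_probability
+
+calculate_prior_probability(np.array([0, 0, 0, 1]))   # {0: 0.75, 1: 0.25}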
+
+
+
+
+
+
+
+
+
+ def
+ mode(y):
+
+
+
+
+
+
def mode(y):
+    """Calculate the mode of a list of values
+
+    Parameters
+    ----------
+    y : array-like of shape (n_samples, n_estimators)
+        array of values
+
+    Returns
+    -------
+    mode: array-like of shape (n_samples,)
+        array of mode of each label
+    count: array-like of shape (n_samples,)
+        array of count of the mode of each label
+    """
+    array = pd.DataFrame(np.array(y))
+    mode = array.mode(axis=0).loc[0, :]
+    count = array.apply(lambda x: x.value_counts().max())
+    return mode.values, count.values
+
+
+
+
Calculate the mode of a list of values
+
+
Parameters
+
+
+
y (array-like of shape (n_samples, n_estimators)):
+array of values
+
+
+
Returns
+
+
+
mode (array-like of shape (n_samples,)):
+array of mode of each label
+
count (array-like of shape (n_samples,)):
+array of count of the mode of each label
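+
+A small sketch; the source takes the per-column mode, so each row is treated here as one estimator's predictions:
+
+import numpy as np
+from sslearn.utils import mode
+
+votes = np.array([[0, 1, 1],
+                  [0, 1, 1],
+                  [0, 0, 1]])
+values, counts = mode(votes)   # values -> [0, 1, 1], counts -> [3, 2, 3]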
+
+
+
+
+
+
+
+
+
+ def
+ check_n_jobs(n_jobs):
+
+
+
+
+
+
def check_n_jobs(n_jobs):
+    """Check `n_jobs` parameter according to the scikit-learn convention.
+    From sktime: BSD 3-Clause
+    Parameters
+    ----------
+    n_jobs : int, positive or -1
+        The number of jobs for parallelization.
+
+    Returns
+    -------
+    n_jobs : int
+        Checked number of jobs.
+    """
+    # scikit-learn convention
+    # https://scikit-learn.org/stable/glossary.html#term-n-jobs
+    if n_jobs is None:
+        return 1
+    elif not is_int(n_jobs):
+        raise ValueError(f"`n_jobs` must be None or an integer, but found: {n_jobs}")
+    elif n_jobs < 0:
+        return os.cpu_count()
+    else:
+        return n_jobs
+
+
+
+
Check n_jobs parameter according to the scikit-learn convention.
+From sktime: BSD 3-Clause
+
+
Parameters
+
+
+
n_jobs (int, positive or -1):
+The number of jobs for parallelization.
+
+
+
Returns
+
+
+
n_jobs (int):
+Checked number of jobs.
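+
+For example:
+
+from sslearn.utils import check_n_jobs
+
+check_n_jobs(None)   # 1
+check_n_jobs(4)      # 4
+check_n_jobs(-1)     # number of available CPU cores (os.cpu_count())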
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/sslearn/wrapper.html b/docs/sslearn/wrapper.html
new file mode 100644
index 0000000..bb2c4cf
--- /dev/null
+++ b/docs/sslearn/wrapper.html
@@ -0,0 +1,6054 @@
+
+ sslearn.wrapper API documentation
+
+ class
+ SelfTraining(sklearn.semi_supervised._self_training.SelfTrainingClassifier):
+
+
+
+
+
+
16classSelfTraining(SelfTrainingClassifier):
+ 17"""
+ 18 **Self Training Classifier with data loader compatible.**
+ 19 ----------------------------
+ 20
+ 21 Is the same `SelfTrainingClassifier` from sklearn but with `sslearn` data loader compatible.
+ 22 For more information, see the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html).
+ 23
+ 24 **Example**
+ 25 -----------
+ 26 ```python
+ 27 from sklearn.datasets import load_iris
+ 28 from sslearn.model_selection import artificial_ssl_dataset
+ 29 from sslearn.wrapper import SelfTraining
+ 30
+ 31 X, y = load_iris(return_X_y=True)
+ 32 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ 33
+ 34 clf = SelfTraining()
+ 35 clf.fit(X, y)
+ 36 clf.score(X_unlabel, y_unlabel)
+ 37 ```
+ 38
+ 39 **References**
+ 40 --------------
+ 41 David Yarowsky. (1995). <br>
+ 42 Unsupervised word sense disambiguation rivaling supervised methods.<br>
+ 43 In <i>Proceedings of the 33rd annual meeting on Association for Computational Linguistics (ACL '95).</i><br>
+ 44 Association for Computational Linguistics,<br>
+ 45 Stroudsburg, PA, USA, 189-196. <br>
+ 46 [10.3115/981658.981684](https://doi.org/10.3115/981658.981684)
+ 47 """
+ 48
+ 49_estimator_type="classifier"
+ 50
+ 51def__init__(self,
+ 52base_estimator,
+ 53threshold=0.75,
+ 54criterion='threshold',
+ 55k_best=10,
+ 56max_iter=10,
+ 57verbose=False):
+ 58"""Self-training. Adaptation of SelfTrainingClassifier from sklearn with data loader compatible.
+ 59
+ 60 This class allows a given supervised classifier to function as a
+ 61 semi-supervised classifier, allowing it to learn from unlabeled data. It
+ 62 does this by iteratively predicting pseudo-labels for the unlabeled data
+ 63 and adding them to the training set.
+ 64
+ 65 The classifier will continue iterating until either max_iter is reached, or
+ 66 no pseudo-labels were added to the training set in the previous iteration.
+ 67
+ 68 Parameters
+ 69 ----------
+ 70 base_estimator : estimator object
+ 71 An estimator object implementing ``fit`` and ``predict_proba``.
+ 72 Invoking the ``fit`` method will fit a clone of the passed estimator,
+ 73 which will be stored in the ``base_estimator_`` attribute.
+ 74
+ 75 threshold : float, default=0.75
+ 76 The decision threshold for use with `criterion='threshold'`.
+ 77 Should be in [0, 1). When using the 'threshold' criterion, a
+ 78 :ref:`well calibrated classifier <calibration>` should be used.
+ 79
+ 80 criterion : {'threshold', 'k_best'}, default='threshold'
+ 81 The selection criterion used to select which labels to add to the
+ 82 training set. If 'threshold', pseudo-labels with prediction
+ 83 probabilities above `threshold` are added to the dataset. If 'k_best',
+ 84 the `k_best` pseudo-labels with highest prediction probabilities are
+ 85 added to the dataset. When using the 'threshold' criterion, a
+ 86 :ref:`well calibrated classifier <calibration>` should be used.
+ 87
+ 88 k_best : int, default=10
+ 89 The amount of samples to add in each iteration. Only used when
+ 90 `criterion` is k_best'.
+ 91
+ 92 max_iter : int or None, default=10
+ 93 Maximum number of iterations allowed. Should be greater than or equal
+ 94 to 0. If it is ``None``, the classifier will continue to predict labels
+ 95 until no new pseudo-labels are added, or all unlabeled samples have
+ 96 been labeled.
+ 97
+ 98 verbose : bool, default=False
+ 99 Enable verbose output.
+100 """
+101super().__init__(base_estimator,threshold,criterion,k_best,max_iter,verbose)
+102
+103deffit(self,X,y):
+104"""
+105 Fits this ``SelfTrainingClassifier`` to a dataset.
+106
+107 Parameters
+108 ----------
+109 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+110 Array representing the data.
+111
+112 y : {array-like, sparse matrix} of shape (n_samples,)
+113 Array representing the labels. Unlabeled samples should have the
+114 label -1.
+115
+116 Returns
+117 -------
+118 self : SelfTrainingClassifier
+119 Returns an instance of self.
+120 """
+121y_adapted=y.copy()
+122ify_adapted.dtype.typeisstrory_adapted.dtype.typeisnp.str_:
+123y_adapted=y_adapted.astype(object)
+124y_adapted[y_adapted=='-1']=-1
+125returnsuper().fit(X,y_adapted)
+
+
+
+
Self Training Classifier with data loader compatible.
+
+
Is the same SelfTrainingClassifier from sklearn but with sslearn data loader compatible.
+For more information, see the sklearn documentation.
David Yarowsky. (1995).
+Unsupervised word sense disambiguation rivaling supervised methods.
+In Proceedings of the 33rd annual meeting on Association for Computational Linguistics (ACL '95).
+Association for Computational Linguistics,
+Stroudsburg, PA, USA, 189-196.
+10.3115/981658.981684
51def__init__(self,
+ 52base_estimator,
+ 53threshold=0.75,
+ 54criterion='threshold',
+ 55k_best=10,
+ 56max_iter=10,
+ 57verbose=False):
+ 58"""Self-training. Adaptation of SelfTrainingClassifier from sklearn with data loader compatible.
+ 59
+ 60 This class allows a given supervised classifier to function as a
+ 61 semi-supervised classifier, allowing it to learn from unlabeled data. It
+ 62 does this by iteratively predicting pseudo-labels for the unlabeled data
+ 63 and adding them to the training set.
+ 64
+ 65 The classifier will continue iterating until either max_iter is reached, or
+ 66 no pseudo-labels were added to the training set in the previous iteration.
+ 67
+ 68 Parameters
+ 69 ----------
+ 70 base_estimator : estimator object
+ 71 An estimator object implementing ``fit`` and ``predict_proba``.
+ 72 Invoking the ``fit`` method will fit a clone of the passed estimator,
+ 73 which will be stored in the ``base_estimator_`` attribute.
+ 74
+ 75 threshold : float, default=0.75
+ 76 The decision threshold for use with `criterion='threshold'`.
+ 77 Should be in [0, 1). When using the 'threshold' criterion, a
+ 78 :ref:`well calibrated classifier <calibration>` should be used.
+ 79
+ 80 criterion : {'threshold', 'k_best'}, default='threshold'
+ 81 The selection criterion used to select which labels to add to the
+ 82 training set. If 'threshold', pseudo-labels with prediction
+ 83 probabilities above `threshold` are added to the dataset. If 'k_best',
+ 84 the `k_best` pseudo-labels with highest prediction probabilities are
+ 85 added to the dataset. When using the 'threshold' criterion, a
+ 86 :ref:`well calibrated classifier <calibration>` should be used.
+ 87
+ 88 k_best : int, default=10
+ 89 The amount of samples to add in each iteration. Only used when
+ 90 `criterion` is k_best'.
+ 91
+ 92 max_iter : int or None, default=10
+ 93 Maximum number of iterations allowed. Should be greater than or equal
+ 94 to 0. If it is ``None``, the classifier will continue to predict labels
+ 95 until no new pseudo-labels are added, or all unlabeled samples have
+ 96 been labeled.
+ 97
+ 98 verbose : bool, default=False
+ 99 Enable verbose output.
+100 """
+101super().__init__(base_estimator,threshold,criterion,k_best,max_iter,verbose)
+
+
+
+
Self-training. Adaptation of SelfTrainingClassifier from sklearn with data loader compatible.
+
+
This class allows a given supervised classifier to function as a
+semi-supervised classifier, allowing it to learn from unlabeled data. It
+does this by iteratively predicting pseudo-labels for the unlabeled data
+and adding them to the training set.
+
+
The classifier will continue iterating until either max_iter is reached, or
+no pseudo-labels were added to the training set in the previous iteration.
+
+
Parameters
+
+
+
base_estimator (estimator object):
+An estimator object implementing fit and predict_proba.
+Invoking the fit method will fit a clone of the passed estimator,
+which will be stored in the base_estimator_ attribute.
+
threshold (float, default=0.75):
+The decision threshold for use with criterion='threshold'.
+Should be in [0, 1). When using the 'threshold' criterion, a
+well calibrated classifier should be used.
+
criterion ({'threshold', 'k_best'}, default='threshold'):
+The selection criterion used to select which labels to add to the
+training set. If 'threshold', pseudo-labels with prediction
+probabilities above threshold are added to the dataset. If 'k_best',
+the k_best pseudo-labels with highest prediction probabilities are
+added to the dataset. When using the 'threshold' criterion, a
+well calibrated classifier should be used.
+
k_best (int, default=10):
+The amount of samples to add in each iteration. Only used when
+criterion is 'k_best'.
+
max_iter (int or None, default=10):
+Maximum number of iterations allowed. Should be greater than or equal
+to 0. If it is None, the classifier will continue to predict labels
+until no new pseudo-labels are added, or all unlabeled samples have
+been labeled.
+
verbose (bool, default=False):
+Enable verbose output.
+
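The snippet below is a minimal usage sketch for this wrapper; it assumes the class is exposed as `sslearn.wrapper.SelfTraining` (not stated in this section) and mirrors the dataset setup used in the other examples on this page, with unlabeled samples marked as -1.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import SelfTraining
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
# artificial_ssl_dataset marks the unlabeled instances with -1
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)

# Keep only pseudo-labels predicted with probability above 0.8, for at most 20 rounds
clf = SelfTraining(base_estimator=DecisionTreeClassifier(), threshold=0.8, max_iter=20)
clf.fit(X, y)
print(clf.score(X_unlabel, y_unlabel))
```
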
+ class
+ Setred(sklearn.base.ClassifierMixin, sklearn.base.BaseEstimator):
+
+
+
+
+
+
128classSetred(ClassifierMixin,BaseEstimator):
+129"""
+130 **Self-training with Editing.**
+131 ----------------------------
+132
+133 Create a SETRED classifier. It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+134     The main process is:
+135 1. Train a classifier with the labeled data.
+136 2. Create a pool of unlabeled data and select the most confident predictions.
+137 3. Repeat until the maximum number of iterations is reached:
+138 a. Select the most confident predictions from the unlabeled data.
+139 b. Calculate the neighborhood graph of the labeled data and the selected instances from the unlabeled data.
+140 c. Calculate the significance level of the selected instances.
+141         d. Reject the instances that are not significant according to their position in the neighborhood graph.
+142         e. Add the selected instances to the labeled data and retrain the classifier.
+143 f. Add new instances to the pool of unlabeled data.
+144 4. Return the classifier trained with the labeled data.
+145
+146 **Example**
+147 -----------
+148 ```python
+149 from sklearn.datasets import load_iris
+150 from sslearn.model_selection import artificial_ssl_dataset
+151 from sslearn.wrapper import Setred
+152
+153 X, y = load_iris(return_X_y=True)
+154 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+155
+156 clf = Setred()
+157 clf.fit(X, y)
+158 clf.score(X_unlabel, y_unlabel)
+159 ```
+160
+161 **References**
+162 ----------
+163 Li, Ming, and Zhi-Hua Zhou. (2005)<br>
+164 SETRED: Self-training with editing,<br>
+165 in <i>Advances in Knowledge Discovery and Data Mining.</i> <br>
+166 Pacific-Asia Conference on Knowledge Discovery and Data Mining <br>
+167 LNAI 3518, Springer, Berlin, Heidelberg, <br>
+168 [10.1007/11430919_71](https://doi.org/10.1007/11430919_71)
+169
+170 """
+171
+172def__init__(
+173self,
+174base_estimator=KNeighborsClassifier(n_neighbors=3),
+175max_iterations=40,
+176distance="euclidean",
+177poolsize=0.25,
+178rejection_threshold=0.05,
+179graph_neighbors=1,
+180random_state=None,
+181n_jobs=None,
+182):
+183"""
+184 Create a SETRED classifier.
+185 It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+186
+187 Parameters
+188 ----------
+189 base_estimator : ClassifierMixin, optional
+190             An estimator object implementing fit and predict_proba, by default KNeighborsClassifier(n_neighbors=3)
+191 max_iterations : int, optional
+192 Maximum number of iterations allowed. Should be greater than or equal to 0., by default 40
+193 distance : str, optional
+194 The distance metric to use for the graph.
+195 The default metric is euclidean, and with p=2 is equivalent to the standard Euclidean metric.
+196 For a list of available metrics, see the documentation of DistanceMetric and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS.
+197 Note that the `cosine` metric uses cosine_distances., by default `euclidean`
+198 poolsize : float, optional
+199             Proportion of the unlabeled instances used as the candidate pool for pseudo-labeling, by default 0.25
+200         rejection_threshold : float, optional
+201             Significance level of the rejection test, by default 0.05
+202         graph_neighbors : int, optional
+203             Number of neighbors for each sample, by default 1
+204 random_state : int, RandomState instance, optional
+205 controls the randomness of the estimator, by default None
+206 n_jobs : int, optional
+207 The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors, by default None
+208 """
+209self.base_estimator=check_classifier(base_estimator,can_be_list=False)
+210self.max_iterations=max_iterations
+211self.poolsize=poolsize
+212self.distance=distance
+213self.rejection_threshold=rejection_threshold
+214self.graph_neighbors=graph_neighbors
+215self.random_state=random_state
+216self.n_jobs=n_jobs
+217
+218def__create_neighborhood(self,X):
+219# kneighbors_graph(X, 1, metric=self.distance, n_jobs=self.n_jobs).toarray()
+220returnkneighbors_graph(
+221X,self.graph_neighbors,metric=self.distance,n_jobs=self.n_jobs,mode="distance"
+222).toarray()
+223
+224deffit(self,X,y,**kwars):
+225"""Build a Setred classifier from the training set (X, y).
+226
+227 Parameters
+228 ----------
+229 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+230 The training input samples.
+231 y : array-like of shape (n_samples,)
+232 The target values (class labels), -1 if unlabeled.
+233
+234 Returns
+235 -------
+236 self: Setred
+237 Fitted estimator.
+238 """
+239random_state=check_random_state(self.random_state)
+240
+241X_label,y_label,X_unlabel=get_dataset(X,y)
+242
+243is_df=isinstance(X_label,pd.DataFrame)
+244
+245self.classes_=np.unique(y_label)
+246
+247each_iteration_candidates=X_label.shape[0]
+248
+249pool=int(len(X_unlabel)*self.poolsize)
+250self._base_estimator=skclone(self.base_estimator)
+251
+252self._base_estimator.fit(X_label,y_label,**kwars)
+253
+254y_probabilities=calculate_prior_probability(
+255y_label
+256)# Should probabilities change every iteration or may it keep with the first L?
+257
+258sort_idx=np.argsort(list(y_probabilities.keys()))
+259
+260ifX_unlabel.shape[0]==0:
+261returnself
+262
+263for_inrange(self.max_iterations):
+264U_=resample(
+265X_unlabel,replace=False,n_samples=pool,random_state=random_state
+266)
+267
+268ifis_df:
+269U_=pd.DataFrame(U_,columns=X_label.columns)
+270
+271raw_predictions=self._base_estimator.predict_proba(U_)
+272predictions=np.max(raw_predictions,axis=1)
+273class_predicted=np.argmax(raw_predictions,axis=1)
+274# Unless a better understanding is given, only the size of L will be used as maximal size of the candidate set.
+275indexes=predictions.argsort()[-each_iteration_candidates:]
+276
+277ifis_df:
+278L_=U_.iloc[indexes]
+279else:
+280L_=U_[indexes]
+281y_=np.array(
+282list(
+283map(
+284lambdax:self._base_estimator.classes_[x],
+285class_predicted[indexes],
+286)
+287)
+288)
+289
+290ifis_df:
+291pre_L=pd.concat([X_label,L_])
+292else:
+293pre_L=np.concatenate((X_label,L_),axis=0)
+294
+295weights=self.__create_neighborhood(pre_L)
+296# Keep only weights for L_
+297weights=weights[-L_.shape[0]:,:]
+298
+299idx=np.searchsorted(np.array(list(y_probabilities.keys())),y_,sorter=sort_idx)
+300p_wrong=1-np.asarray(np.array(list(y_probabilities.values())))[sort_idx][idx]
+301# Must weights be the inverse of distance?
+302weights=np.divide(1,weights,out=np.zeros_like(weights),where=weights!=0)
+303
+304weights_sum=weights.sum(axis=1)
+305weights_square_sum=(weights**2).sum(axis=1)
+306
+307iid_random=random_state.binomial(
+3081,np.repeat(p_wrong,weights.shape[1]).reshape(weights.shape)
+309)
+310ji=(iid_random*weights).sum(axis=1)
+311
+312mu_h0=p_wrong*weights_sum
+313sigma_h0=np.sqrt((1-p_wrong)*p_wrong*weights_square_sum)
+314
+315z_score=np.divide((ji-mu_h0),sigma_h0,out=np.zeros_like(sigma_h0),where=sigma_h0!=0)
+316# z_score = (ji - mu_h0) / sigma_h0
+317
+318oi=norm.sf(abs(z_score),mu_h0,sigma_h0)
+319to_add=(oi<self.rejection_threshold)&(z_score<mu_h0)
+320
+321ifis_df:
+322L_filtered=L_.iloc[to_add,:]
+323else:
+324L_filtered=L_[to_add,:]
+325y_filtered=y_[to_add]
+326
+327ifis_df:
+328X_label=pd.concat([X_label,L_filtered])
+329else:
+330X_label=np.concatenate((X_label,L_filtered),axis=0)
+331y_label=np.concatenate((y_label,y_filtered),axis=0)
+332
+333# Remove the instances from the unlabeled set.
+334to_delete=indexes[to_add]
+335ifis_df:
+336X_unlabel=X_unlabel.drop(index=X_unlabel.index[to_delete])
+337else:
+338X_unlabel=np.delete(X_unlabel,to_delete,axis=0)
+339
+340returnself
+341
+342defpredict(self,X,**kwards):
+343"""Predict class value for X.
+344 For a classification model, the predicted class for each sample in X is returned.
+345 Parameters
+346 ----------
+347 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+348 The input samples.
+349 Returns
+350 -------
+351 y : array-like of shape (n_samples,)
+352 The predicted classes
+353 """
+354returnself._base_estimator.predict(X,**kwards)
+355
+356defpredict_proba(self,X,**kwards):
+357"""Predict class probabilities of the input samples X.
+358 The predicted class probability depends on the ensemble estimator.
+359 Parameters
+360 ----------
+361 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+362 The input samples.
+363 Returns
+364 -------
+365 y : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
+366 The predicted classes
+367 """
+368returnself._base_estimator.predict_proba(X,**kwards)
+
+
+
+
Self-training with Editing.
+
+
Create a SETRED classifier. It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+The main process is:
+
+
+
Train a classifier with the labeled data.
+
Create a pool of unlabeled data and select the most confident predictions.
+
Repeat until the maximum number of iterations is reached:
+a. Select the most confident predictions from the unlabeled data.
+b. Calculate the neighborhood graph of the labeled data and the selected instances from the unlabeled data.
+c. Calculate the significance level of the selected instances.
+d. Reject the instances that are not significant according to their position in the neighborhood graph.
+e. Add the selected instances to the labeled data and retrain the classifier.
+f. Add new instances to the pool of unlabeled data.
+
Return the classifier trained with the labeled data.
+
+
References
+
Li, Ming, and Zhi-Hua Zhou. (2005)
+SETRED: Self-training with editing,
+in Advances in Knowledge Discovery and Data Mining.
+Pacific-Asia Conference on Knowledge Discovery and Data Mining
+LNAI 3518, Springer, Berlin, Heidelberg,
+10.1007/11430919_71
+
+
+
+
Create a SETRED classifier.
+It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
+An estimator object implementing fit and predict_proba, by default KNeighborsClassifier(n_neighbors=3)
+
max_iterations (int, optional):
+Maximum number of iterations allowed. Should be greater than or equal to 0., by default 40
+
distance (str, optional):
+The distance metric to use for the graph.
+The default metric is euclidean, and with p=2 is equivalent to the standard Euclidean metric.
+For a list of available metrics, see the documentation of DistanceMetric and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS.
+Note that the cosine metric uses cosine_distances., by default euclidean
+
poolsize (float, optional):
+Proportion of the unlabeled instances used as the candidate pool for pseudo-labeling, by default 0.25
+
rejection_threshold (float, optional):
+Significance level of the rejection test, by default 0.05
+
graph_neighbors (int, optional):
+Number of neighbors for each sample, by default 1
+
random_state (int, RandomState instance, optional):
+controls the randomness of the estimator, by default None
+
n_jobs (int, optional):
+The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors, by default None
+
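A construction sketch with non-default values for the parameters above; the specific values and the base estimator are illustrative, not recommendations from the library.

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import Setred

clf = Setred(
    base_estimator=DecisionTreeClassifier(),  # any classifier with predict_proba
    max_iterations=20,
    distance="manhattan",      # any metric accepted by kneighbors_graph
    poolsize=0.5,              # use half of the unlabeled set as the candidate pool
    rejection_threshold=0.05,  # significance level of the editing test
    graph_neighbors=3,
    random_state=0,
)
```
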
+
+
+
+
+
+
+
+
+ def
+ fit(self, X, y, **kwars):
+
+
+
+
+
+
+
+
+
+
Build a Setred classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
+The target values (class labels), -1 if unlabeled.
+
+
+
Returns
+
+
+
self (Setred):
+Fitted estimator.
+
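The editing step inside `fit` (steps 3.c-3.d of the process described above) reduces to a per-candidate significance test. The helper below is a hypothetical, simplified rewrite of that computation for a single candidate; the name `setred_keeps` does not exist in the library, and the library applies the same test vectorized over all candidates at once.

```python
import numpy as np
from scipy.stats import norm

def setred_keeps(weights_row, p_wrong, alpha=0.05, rng=None):
    """Decide whether one pseudo-labeled candidate survives the editing step.

    weights_row : inverse-distance weights to the candidate's graph neighbors
    p_wrong     : prior probability that its pseudo-label is wrong
    alpha       : the classifier's rejection_threshold
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Under H0, each neighbor disagrees with the pseudo-label with probability p_wrong
    disagree = rng.binomial(1, p_wrong, size=weights_row.shape)
    ji = float((disagree * weights_row).sum())   # weighted disagreement statistic
    mu_h0 = p_wrong * weights_row.sum()
    sigma_h0 = np.sqrt(p_wrong * (1 - p_wrong) * (weights_row ** 2).sum())
    if sigma_h0 == 0:                            # degenerate neighborhood, keep nothing
        return False
    z = (ji - mu_h0) / sigma_h0
    o_i = norm.sf(abs(z), mu_h0, sigma_h0)
    # Same acceptance condition as the `to_add` mask computed in fit()
    return bool((o_i < alpha) and (z < mu_h0))
```
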
+
+
+
+
+
+
+
+
+ def
+ predict(self, X, **kwards):
+
+
+
+
+
+
+
+
+
+
Predict class value for X.
+For a classification model, the predicted class for each sample in X is returned.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
y (array-like of shape (n_samples,)):
+The predicted classes
+
+
+
+
Predict class probabilities of the input samples X.
+The predicted class probability depends on the ensemble estimator.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
y (ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1):
+The predicted classes
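Continuing the class example above (where `clf` was fitted and `X_unlabel` holds the held-out unlabeled pool), a short prediction sketch:

```python
proba = clf.predict_proba(X_unlabel)   # shape (n_samples, n_classes)
labels = clf.predict(X_unlabel)
```
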
+
+
+
+
+
+
+
Inherited Members
+
+
sklearn.base.ClassifierMixin
+
score
+
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
+
+
+
+
+
+
+ class
+ CoTraining(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
453classCoTraining(BaseCoTraining):
+454"""
+455 **CoTraining classifier. Multi-view learning algorithm that uses two classifiers to label instances.**
+456 --------------------------------------------
+457
+458 The main process is:
+459 1. Train each classifier with the labeled instances and their respective view.
+460 2. While max iterations is not reached or any instance is unlabeled:
+461 1. Predict the instances from the unlabeled set.
+462 2. Select the instances that have the same prediction and the predictions are above the threshold.
+463 3. Label the instances with the highest probability, keeping the balance of the classes.
+464 4. Retrain the classifier with the new instances.
+465 3. Combine the probabilities of each classifier.
+466
+467 **Methods**
+468 -------
+469 - `fit`: Fit the model with the labeled instances.
+470 - `predict` : Predict the class for each instance.
+471 - `predict_proba`: Predict the probability for each class.
+472 - `score`: Return the mean accuracy on the given test data and labels.
+473
+474 **Example**
+475 -------
+476 ```python
+477 from sklearn.datasets import load_iris
+478 from sklearn.tree import DecisionTreeClassifier
+479 from sslearn.wrapper import CoTraining
+480 from sslearn.model_selection import artificial_ssl_dataset
+481
+482 X, y = load_iris(return_X_y=True)
+483 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+484 cotraining = CoTraining(DecisionTreeClassifier())
+485 X1 = X[:, [0, 1]]
+486 X2 = X[:, [2, 3]]
+487 cotraining.fit(X1, y, X2)
+488 # or
+489 cotraining.fit(X, y, features=[[0, 1], [2, 3]])
+490 # or
+491 cotraining = CoTraining(DecisionTreeClassifier(), force_second_view=False)
+492 cotraining.fit(X, y)
+493 ```
+494
+495 **References**
+496 ----------
+497 Avrim Blum and Tom Mitchell. (1998).<br>
+498 Combining labeled and unlabeled data with co-training<br>
+499 in <i>Proceedings of the eleventh annual conference on Computational learning theory (COLT' 98)</i>.<br>
+500 Association for Computing Machinery, New York, NY, USA, 92-100.<br>
+501 [10.1145/279943.279962](https://doi.org/10.1145/279943.279962)
+502
+503 Han, Xian-Hua, Yen-wei Chen, and Xiang Ruan. (2011). <br>
+504 Multi-Class Co-Training Learning for Object and Scene Recognition,<br>
+505 pp. 67-70 in. Nara, Japan. <br>
+506 [http://www.mva-org.jp/Proceedings/2011CD/papers/04-08.pdf](http://www.mva-org.jp/Proceedings/2011CD/papers/04-08.pdf)<br>
+507 """
+508
+509def__init__(
+510self,
+511base_estimator=DecisionTreeClassifier(),
+512second_base_estimator=None,
+513max_iterations=30,
+514poolsize=75,
+515threshold=0.5,
+516force_second_view=True,
+517random_state=None
+518):
+519"""
+520 Create a CoTraining classifier.
+521 Multi-view learning algorithm that uses two classifiers to label instances.
+522
+523 Parameters
+524 ----------
+525 base_estimator : ClassifierMixin, optional
+526 The classifier that will be used in the cotraining algorithm on the feature set, by default DecisionTreeClassifier()
+527 second_base_estimator : ClassifierMixin, optional
+528             The classifier that will be used in the cotraining algorithm on another feature set; if None, a clone of base_estimator is used, by default None
+529 max_iterations : int, optional
+530 The number of iterations, by default 30
+531 poolsize : int, optional
+532 The size of the pool of unlabeled samples from which the classifier can choose, by default 75
+533 threshold : float, optional
+534             The probability threshold for labeling instances, by default 0.5
+535         force_second_view : bool, optional
+536             Whether the second classifier needs a different view of the data. If False, the second view will be the same as the first, by default True
+537 random_state : int, RandomState instance, optional
+538 controls the randomness of the estimator, by default None
+539
+540 """
+541self.base_estimator=check_classifier(base_estimator,False)
+542ifsecond_base_estimatorisnotNone:
+543second_base_estimator=check_classifier(second_base_estimator,False)
+544self.second_base_estimator=second_base_estimator
+545self.max_iterations=max_iterations
+546self.poolsize=poolsize
+547self.threshold=threshold
+548self.force_second_view=force_second_view
+549self.random_state=random_state
+550
+551deffit(self,X,y,X2=None,features:list=None,number_per_class:dict=None,**kwards):
+552"""
+553 Build a CoTraining classifier from the training set.
+554
+555 Parameters
+556 ----------
+557 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+558 Array representing the data.
+559 y : array-like of shape (n_samples,)
+560 The target values (class labels), -1 if unlabeled.
+561 X2 : {array-like, sparse matrix} of shape (n_samples, n_features), optional
+562 Array representing the data from another view, not compatible with `features`, by default None
+563 features : {list, tuple}, optional
+564 list or tuple of two arrays with `feature` index for each subspace view, not compatible with `X2`, by default None
+565 number_per_class : {dict}, optional
+566             dict of class name:integer with the max amount of instances to label in this class in each iteration, by default None
+567
+568 Returns
+569 -------
+570 self: CoTraining
+571 Fitted estimator.
+572 """
+573rs=check_random_state(self.random_state)
+574
+575X_label,y_label,X_unlabel=get_dataset(X,y)
+576
+577is_df=isinstance(X_label,pd.DataFrame)
+578
+579ifX2isnotNone:
+580X2_label,_,X2_unlabel=get_dataset(X2,y)
+581eliffeaturesisnotNone:
+582ifis_df:
+583X2_label=X_label.iloc[:,features[1]]
+584X2_unlabel=X_unlabel.iloc[:,features[1]]
+585X_label=X_label.iloc[:,features[0]]
+586X_unlabel=X_unlabel.iloc[:,features[0]]
+587else:
+588X2_label=X_label[:,features[1]]
+589X2_unlabel=X_unlabel[:,features[1]]
+590X_label=X_label[:,features[0]]
+591X_unlabel=X_unlabel[:,features[0]]
+592self.columns_=features
+593elifself.force_second_view:
+594 raise AttributeError("Either X2 or features must be defined. CoTraining needs another view to train the second classifier")
+595else:
+596self.columns_=[list(range(X.shape[1]))]*2
+597X2_label=X_label.copy()
+598X2_unlabel=X_unlabel.copy()
+599
+600ifis_dfandX2_labelisnotNoneandnotisinstance(X2_label,pd.DataFrame):
+601raiseAttributeError("X and X2 must be both pandas DataFrame or numpy arrays")
+602
+603self.h=[
+604skclone(self.base_estimator),
+605skclone(self.base_estimator)ifself.second_base_estimatorisNoneelseskclone(self.second_base_estimator)
+606]
+607assert(
+608X2isNoneorfeaturesisNone
+609),"The list of features and X2 cannot be defined at the same time"
+610
+611self.classes_=np.unique(y_label)
+612ifnumber_per_classisNone:
+613number_per_class=calc_number_per_class(y_label)
+614
+615ifX_unlabel.shape[0]<self.poolsize:
+616warnings.warn(f"Poolsize ({self.poolsize}) is bigger than U ({X_unlabel.shape[0]})")
+617
+618permutation=rs.permutation(len(X_unlabel))
+619
+620self.h[0].fit(X_label,y_label)
+621self.h[1].fit(X2_label,y_label)
+622
+623it=0
+624whileit<self.max_iterationsandany(permutation):
+625it+=1
+626
+627get_index=permutation[:self.poolsize]
+628y1_prob=self.h[0].predict_proba(X_unlabel[get_index]ifnotis_dfelseX_unlabel.iloc[get_index,:])
+629y2_prob=self.h[1].predict_proba(X2_unlabel[get_index]ifnotis_dfelseX2_unlabel.iloc[get_index,:])
+630
+631predictions1=np.max(y1_prob,axis=1)
+632class_predicted1=np.argmax(y1_prob,axis=1)
+633
+634predictions2=np.max(y2_prob,axis=1)
+635class_predicted2=np.argmax(y2_prob,axis=1)
+636
+637 # If the two classifiers select the same instance but make different predictions, the instance is not labeled
+638candidates1=predictions1>self.threshold
+639candidates2=predictions2>self.threshold
+640aggreement=class_predicted1==class_predicted2
+641
+642full_candidates=candidates1^candidates2
+643medium_candidates=candidates1&candidates2&aggreement
+644true_candidates1=full_candidates&candidates1
+645true_candidates2=full_candidates&candidates2
+646
+647# Fill probas and candidate classes.
+648y_probas=np.zeros(predictions1.shape,dtype=predictions1.dtype)
+649y_class=class_predicted1.copy()
+650
+651temp_probas1=predictions1[true_candidates1]
+652temp_probas2=predictions2[true_candidates2]
+653temp_probasB=(predictions1[medium_candidates]+predictions2[medium_candidates])/2
+654
+655temp_classes2=class_predicted2[true_candidates2]
+656
+657y_probas[true_candidates1]=temp_probas1
+658y_probas[true_candidates2]=temp_probas2
+659y_probas[medium_candidates]=temp_probasB
+660y_class[true_candidates2]=temp_classes2
+661
+662# Select the best candidates
+663final_instances=list()
+664best_candidates=np.argsort(y_probas,kind="mergesort")[::-1]
+665forcinself.classes_:
+666final_instances+=list(best_candidates[y_class[best_candidates]==c])[:number_per_class[c]]
+667
+668# Fill the new labeled instances
+669pseudoy=y_class[final_instances]
+670y_label=np.append(y_label,pseudoy)
+671
+672index=permutation[0:self.poolsize][final_instances]
+673ifis_df:
+674X_label=pd.concat([X_label,X_unlabel.iloc[index,:]])
+675X2_label=pd.concat([X2_label,X2_unlabel.iloc[index,:]])
+676else:
+677X_label=np.append(X_label,X_unlabel[index],axis=0)
+678X2_label=np.append(X2_label,X2_unlabel[index],axis=0)
+679
+680permutation=permutation[list(map(lambdax:xnotinindex,permutation))]
+681
+682 # Increase the poolsize by twice the total per-class labeling budget:
+683self.poolsize+=sum(number_per_class.values())*2
+684
+685self.h[0].fit(X_label,y_label)
+686self.h[1].fit(X2_label,y_label)
+687
+688self.h_=self.h
+689
+690returnself
+691
+692defpredict_proba(self,X,X2=None,**kwards):
+693"""Predict probability for each possible outcome.
+694
+695 Parameters
+696 ----------
+697 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+698 Array representing the data.
+699 X2 : {array-like, sparse matrix} of shape (n_samples, n_features), optional
+700 Array representing the data from another view, by default None
+701 Returns
+702 -------
+703 class probabilities: ndarray of shape (n_samples, n_classes)
+704 Array with prediction probabilities.
+705 """
+706if"columns_"indir(self):
+707returnsuper().predict_proba(X,**kwards)
+708elif"h_"indir(self):
+709ys=[]
+710ys.append(self.h_[0].predict_proba(X,**kwards))
+711ys.append(self.h_[1].predict_proba(X2,**kwards))
+712y=sum(ys)/len(ys)
+713returny
+714else:
+715raiseNotFittedError("Classifier not fitted")
+716
+717defpredict(self,X,X2=None,**kwards):
+718"""Predict the classes of X.
+719 Parameters
+720 ----------
+721 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+722 Array representing the data.
+723 X2 : {array-like, sparse matrix} of shape (n_samples, n_features), optional
+724 Array representing the data from another view, by default None
+725
+726 Returns
+727 -------
+728 y : ndarray of shape (n_samples,)
+729 Array with predicted labels.
+730 """
+731if"columns_"indir(self):
+732result=super().predict(X,**kwards)
+733else:
+734predicted_probabilitiy=self.predict_proba(X,X2,**kwards)
+735result=self.classes_.take(
+736(np.argmax(predicted_probabilitiy,axis=1)),axis=0
+737)
+738returnresult
+739
+740defscore(self,X,y,sample_weight=None,**kwards):
+741"""
+742 Return the mean accuracy on the given test data and labels.
+743 In multi-label classification, this is the subset accuracy
+744 which is a harsh metric since you require for each sample that
+745 each label set be correctly predicted.
+746 Parameters
+747 ----------
+748 X : array-like of shape (n_samples, n_features)
+749 Test samples.
+750 y : array-like of shape (n_samples,) or (n_samples, n_outputs)
+751 True labels for `X`.
+752 sample_weight : array-like of shape (n_samples,), default=None
+753 Sample weights.
+754 X2 : {array-like, sparse matrix} of shape (n_samples, n_features), optional
+755 Array representing the data from another view, by default None
+756 Returns
+757 -------
+758 score : float
+759 Mean accuracy of ``self.predict(X)`` wrt. `y`.
+760 """
+761if"X2"inkwards:
+762returnaccuracy_score(y,self.predict(X,kwards["X2"]),sample_weight=sample_weight)
+763else:
+764returnsuper().score(X,y,sample_weight=sample_weight)
+
+
+
+
CoTraining classifier. Multi-view learning algorithm that uses two classifiers to label instances.
+
+
The main process is:
+
+
+
Train each classifier with the labeled instances and their respective view.
+
While max iterations is not reached or any instance is unlabeled:
+
+
Predict the instances from the unlabeled set.
+
Select the instances that have the same prediction and the predictions are above the threshold.
+
Label the instances with the highest probability, keeping the balance of the classes.
+
Retrain the classifier with the new instances.
+
+
Combine the probabilities of each classifier.
+
+
+
Methods
+
+
+
fit: Fit the model with the labeled instances.
+
predict: Predict the class for each instance.
+
predict_proba: Predict the probability for each class.
+
score: Return the mean accuracy on the given test data and labels.
+
+
+
Example
+
+
+
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import CoTraining
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
cotraining = CoTraining(DecisionTreeClassifier())
X1 = X[:, [0, 1]]
X2 = X[:, [2, 3]]
cotraining.fit(X1, y, X2)
# or
cotraining.fit(X, y, features=[[0, 1], [2, 3]])
# or
cotraining = CoTraining(DecisionTreeClassifier(), force_second_view=False)
cotraining.fit(X, y)
+
+
+
+
References
+
+
Avrim Blum and Tom Mitchell. (1998).
+Combining labeled and unlabeled data with co-training
+in Proceedings of the eleventh annual conference on Computational learning theory (COLT' 98).
+Association for Computing Machinery, New York, NY, USA, 92-100.
+10.1145/279943.279962
+
Han, Xian-Hua, Yen-wei Chen, and Xiang Ruan. (2011).
+Multi-Class Co-Training Learning for Object and Scene Recognition,
+pp. 67-70, Nara, Japan.
+http://www.mva-org.jp/Proceedings/2011CD/papers/04-08.pdf
+
+
+
+
Create a CoTraining classifier.
+Multi-view learning algorithm that uses two classifiers to label instances.
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
+The classifier that will be used in the cotraining algorithm on the feature set, by default DecisionTreeClassifier()
+
second_base_estimator (ClassifierMixin, optional):
+The classifier that will be used in the cotraining algorithm on another feature set; if None, a clone of base_estimator is used, by default None
+
max_iterations (int, optional):
+The number of iterations, by default 30
+
poolsize (int, optional):
+The size of the pool of unlabeled samples from which the classifier can choose, by default 75
+
threshold (float, optional):
+The probability threshold for labeling instances, by default 0.5
+
force_second_view (bool, optional):
+Whether the second classifier needs a different view of the data. If False, the second view will be the same as the first, by default True
+
random_state (int, RandomState instance, optional):
+controls the randomness of the estimator, by default None
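A construction sketch combining two different base learners, one per view; the estimator choices below are illustrative, not prescribed by the library.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import CoTraining

cotraining = CoTraining(
    base_estimator=DecisionTreeClassifier(),   # learner for the first view
    second_base_estimator=GaussianNB(),        # learner for the second view
    max_iterations=30,
    poolsize=75,
    threshold=0.5,
    random_state=0,
)
```
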
+
+
+
+
Build a CoTraining classifier from the training set.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+Array representing the data.
+
y (array-like of shape (n_samples,)):
+The target values (class labels), -1 if unlabeled.
+
X2 ({array-like, sparse matrix} of shape (n_samples, n_features), optional):
+Array representing the data from another view, not compatible with features, by default None
+
features ({list, tuple}, optional):
+list or tuple of two arrays with feature index for each subspace view, not compatible with X2, by default None
+
number_per_class ({dict}, optional):
+dict of class name:integer with the max amount of instances to label in this class in each iteration, by default None
+
+
+
Returns
+
+
+
self (CoTraining):
+Fitted estimator.
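As a sketch of the `number_per_class` budget, reusing the iris setup and imports from the class example above (the keys are the class labels that appear in `y`; the values are hypothetical):

```python
# Label at most 5 instances of class 0 and 3 of each remaining class per iteration
budget = {0: 5, 1: 3, 2: 3}
cotraining = CoTraining(DecisionTreeClassifier(), random_state=0)
cotraining.fit(X, y, features=[[0, 1], [2, 3]], number_per_class=budget)
```
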
+
+
+
+
Predict probability for each possible outcome.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+Array representing the data.
+
X2 ({array-like, sparse matrix} of shape (n_samples, n_features), optional):
+Array representing the data from another view, by default None
+
+
+
Returns
+
+
+
class probabilities (ndarray of shape (n_samples, n_classes)):
+Array with prediction probabilities.
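When the model was fitted with an explicit second view (`fit(X1, y, X2)`), that view must also be supplied at prediction time; when it was fitted with `features` or with `force_second_view=False`, the stored column indexes are reused and only `X` is needed. A short sketch continuing the class example:

```python
probas = cotraining.predict_proba(X1, X2)   # model fitted with two explicit views
labels = cotraining.predict(X1, X2)
```
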
+
+
+
+
Return the mean accuracy on the given test data and labels.
+In multi-label classification, this is the subset accuracy
+which is a harsh metric since you require for each sample that
+each label set be correctly predicted.
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+Test samples.
+
y (array-like of shape (n_samples,) or (n_samples, n_outputs)):
+True labels for X.
+
sample_weight (array-like of shape (n_samples,), default=None):
+Sample weights.
+
X2 ({array-like, sparse matrix} of shape (n_samples, n_features), optional):
+Array representing the data from another view, by default None
+
+
+
Returns
+
+
+
score (float):
+Mean accuracy of self.predict(X) wrt. y.
+
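Note that `score` accepts the second view only as a keyword argument (it is read from `**kwards`). A short sketch, where `y_true` is a hypothetical array with the true labels of the evaluated samples:

```python
# With two explicit views, pass the second one as a keyword argument
acc = cotraining.score(X1, y_true, X2=X2)
```
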
+
+
+
+
+
+
Inherited Members
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
+
+
+
+
+
+
+ class
+ CoTrainingByCommittee(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
1060classCoTrainingByCommittee(BaseCoTraining):
+1061"""
+1062 **Co-Training by Committee classifier.**
+1063 --------------------------------------------
+1064
+1065 Create a committee trained by co-training based on the diversity of the classifiers
+1066
+1067 The main process is:
+1068 1. Train a committee of classifiers.
+1069 2. Create a pool of unlabeled instances.
+1070 3. While max iterations is not reached or any instance is unlabeled:
+1071 1. Predict the instances from the unlabeled set.
+1072 2. Select the instances with the highest probability.
+1073 3. Label the instances with the highest probability, keeping the balance of the classes but ensuring that at least n instances of each class are added.
+1074 4. Retrain the classifier with the new instances.
+1075 4. Combine the probabilities of each classifier.
+1076
+1077 **Methods**
+1078 -------
+1079 - `fit`: Fit the model with the labeled instances.
+1080 - `predict` : Predict the class for each instance.
+1081 - `predict_proba`: Predict the probability for each class.
+1082 - `score`: Return the mean accuracy on the given test data and labels.
+1083
+1084 **Example**
+1085 -------
+1086 ```python
+1087 from sklearn.datasets import load_iris
+1088 from sslearn.wrapper import CoTrainingByCommittee
+1089 from sslearn.model_selection import artificial_ssl_dataset
+1090
+1091 X, y = load_iris(return_X_y=True)
+1092 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+1093 cotraining = CoTrainingByCommittee()
+1094 cotraining.fit(X, y)
+1095 cotraining.score(X_unlabel, y_unlabel)
+1096 ```
+1097
+1098 **References**
+1099 ----------
+1100 M. F. A. Hady and F. Schwenker,<br>
+1101 Co-training by Committee: A New Semi-supervised Learning Framework,<br>
+1102 in <i>2008 IEEE International Conference on Data Mining Workshops</i>,<br>
+1103 Pisa, 2008, pp. 563-572, [10.1109/ICDMW.2008.27](https://doi.org/10.1109/ICDMW.2008.27)
+1104 """
+1105
+1106
+1107def__init__(
+1108self,
+1109ensemble_estimator=BaggingClassifier(),
+1110max_iterations=100,
+1111poolsize=100,
+1112min_instances_for_class=3,
+1113random_state=None,
+1114):
+1115"""
+1116 Create a committee trained by cotraining based on
+1117 the diversity of classifiers.
+1118
+1119 Parameters
+1120 ----------
+1121 ensemble_estimator : ClassifierMixin, optional
+1122             Ensemble method; without an ensemble it works as
+1123             self-training with a pool, by default BaggingClassifier().
+1124 max_iterations : int, optional
+1125 number of iterations of training, -1 if no max iterations, by default 100
+1126 poolsize : int, optional
+1127 max number of unlabeled instances candidates to pseudolabel, by default 100
+1128 random_state : int, RandomState instance, optional
+1129 controls the randomness of the estimator, by default None
+1130
+1131
+1132 """
+1133self.ensemble_estimator=check_classifier(ensemble_estimator,False)
+1134self.max_iterations=max_iterations
+1135self.poolsize=poolsize
+1136self.random_state=random_state
+1137self.min_instances_for_class=min_instances_for_class
+1138
+1139deffit(self,X,y,**kwards):
+1140"""Build a CoTrainingByCommittee classifier from the training set (X, y).
+1141 Parameters
+1142 ----------
+1143 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1144 The training input samples.
+1145 y : array-like of shape (n_samples,)
+1146             The target values (class labels), -1 if unlabeled.
+1147 Returns
+1148 -------
+1149 self : CoTrainingByCommittee
+1150 Fitted estimator.
+1151 """
+1152self.ensemble_estimator=skclone(self.ensemble_estimator)
+1153random_state=check_random_state(self.random_state)
+1154
+1155X_label,y_prev,X_unlabel=get_dataset(X,y)
+1156
+1157is_df=isinstance(X_label,pd.DataFrame)
+1158
+1159self.label_encoder_=LabelEncoder()
+1160y_label=self.label_encoder_.fit_transform(y_prev)
+1161
+1162self.classes_=self.label_encoder_.classes_
+1163
+1164prior=calculate_prior_probability(y_label)
+1165permutation=random_state.permutation(len(X_unlabel))
+1166
+1167self.ensemble_estimator.fit(X_label,y_label,**kwards)
+1168
+1169ifX_unlabel.shape[0]==0:
+1170returnself
+1171
+1172for_inrange(self.max_iterations):
+1173iflen(permutation)==0:
+1174break
+1175raw_predictions=self.ensemble_estimator.predict_proba(
+1176X_unlabel[permutation[0:self.poolsize]]ifnotis_dfelseX_unlabel.iloc[permutation[0:self.poolsize]]
+1177)
+1178
+1179predictions=np.max(raw_predictions,axis=1)
+1180class_predicted=np.argmax(raw_predictions,axis=1)
+1181
+1182added=np.zeros(predictions.shape,dtype=bool)
+1183 # First, the n (or fewer) most confident instances of each class are selected
+1184forcinself.ensemble_estimator.classes_:
+1185condition=class_predicted==c
+1186
+1187candidates=predictions[condition]
+1188candidates_bool=np.zeros(predictions.shape,dtype=bool)
+1189candidates_sub_set=candidates_bool[condition]
+1190
+1191instances_index_selected=candidates.argsort(kind="mergesort")[
+1192-self.min_instances_for_class:
+1193]
+1194
+1195candidates_sub_set[instances_index_selected]=True
+1196candidates_bool[condition]+=candidates_sub_set
+1197
+1198added[candidates_bool]=True
+1199
+1200 # Under this interpretation, at least n elements of each class are guaranteed per iteration,
+1201 # but instances already chosen by the proportional selection are not duplicated.
+1202
+1203 # Under the alternative interpretation, the first n instances of each class would be ignored.
+1204to_label=choice_with_proportion(
+1205predictions,class_predicted,prior,extra=self.min_instances_for_class
+1206)
+1207added[to_label]=True
+1208
+1209index=permutation[0:self.poolsize][added]
+1210X_label=np.append(X_label,X_unlabel[index],axis=0)ifnotis_dfelsepd.concat(
+1211[X_label,X_unlabel.iloc[index,:]]
+1212)
+1213pseudoy=class_predicted[added]
+1214
+1215y_label=np.append(y_label,pseudoy)
+1216permutation=permutation[list(map(lambdax:xnotinindex,permutation))]
+1217
+1218self.ensemble_estimator.fit(X_label,y_label,**kwards)
+1219
+1220returnself
+1221
+1222defpredict(self,X):
+1223"""Predict class value for X.
+1224 For a classification model, the predicted class for each sample in X is returned.
+1225 Parameters
+1226 ----------
+1227 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1228 The input samples.
+1229 Returns
+1230 -------
+1231 y : array-like of shape (n_samples,)
+1232 The predicted classes
+1233 """
+1234check_is_fitted(self.ensemble_estimator)
+1235returnself.label_encoder_.inverse_transform(self.ensemble_estimator.predict(X))
+1236
+1237defpredict_proba(self,X):
+1238"""Predict class probabilities of the input samples X.
+1239 The predicted class probability depends on the ensemble estimator.
+1240 Parameters
+1241 ----------
+1242 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1243 The input samples.
+1244 Returns
+1245 -------
+1246 y : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
+1247 The predicted classes
+1248 """
+1249check_is_fitted(self.ensemble_estimator)
+1250returnself.ensemble_estimator.predict_proba(X)
+1251
+1252defscore(self,X,y,sample_weight=None):
+1253"""Return the mean accuracy on the given test data and labels.
+1254 In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
+1255 Parameters
+1256 ----------
+1257 X : array-like of shape (n_samples, n_features)
+1258 Test samples.
+1259 y : array-like of shape (n_samples,) or (n_samples, n_outputs)
+1260 True labels for X.
+1261 sample_weight : array-like of shape (n_samples,), optional
+1262 Sample weights., by default None
+1263 Returns
+1264 -------
+1265 score: float
+1266 Mean accuracy of self.predict(X) wrt. y.
+1267 """
+1268try:
+1269y=self.label_encoder_.transform(y)
+1270exceptValueError:
+1271if"le_dict_"notindir(self):
+1272self.le_dict_=dict(
+1273zip(
+1274self.label_encoder_.classes_,
+1275self.label_encoder_.transform(self.label_encoder_.classes_),
+1276)
+1277)
+1278y=np.array(list(map(lambdax:self.le_dict_.get(x,-1),y)),dtype=y.dtype)
+1279
+1280returnself.ensemble_estimator.score(X,y,sample_weight)
+
+
+
+
Co-Training by Committee classifier.
+
+
Create a committee trained by co-training based on the diversity of the classifiers
+
+
The main process is:
+
+
+
Train a committee of classifiers.
+
Create a pool of unlabeled instances.
+
While max iterations is not reached or any instance is unlabeled:
+
+
Predict the instances from the unlabeled set.
+
Select the instances with the highest probability.
+
Label the instances with the highest probability, keeping the balance of the classes but ensuring that at least n instances of each class are added.
+
Retrain the classifier with the new instances.
+
+
Combine the probabilities of each classifier.
+
+
+
References
+
M. F. A. Hady and F. Schwenker,
+Co-training by Committee: A New Semi-supervised Learning Framework,
+in 2008 IEEE International Conference on Data Mining Workshops,
+Pisa, 2008, pp. 563-572, 10.1109/ICDMW.2008.27
+
+
+
+
Create a committee trained by cotraining based on
+the diversity of classifiers.
+
+
Parameters
+
+
+
ensemble_estimator (ClassifierMixin, optional):
+Ensemble method; without an ensemble it works as
+self-training with a pool, by default BaggingClassifier().
+
max_iterations (int, optional):
+number of iterations of training, -1 if no max iterations, by default 100
+
poolsize (int, optional):
+Max number of unlabeled instance candidates to pseudolabel, by default 100
+
min_instances_for_class (int, optional):
+Minimum number of the most confident instances of each class to add in each iteration, by default 3
+
random_state (int, RandomState instance, optional):
+controls the randomness of the estimator, by default None
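A construction sketch with a different committee; any sklearn ensemble exposing `predict_proba` should work, and the choices below are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sslearn.wrapper import CoTrainingByCommittee

committee = CoTrainingByCommittee(
    ensemble_estimator=RandomForestClassifier(n_estimators=50),
    max_iterations=50,
    poolsize=100,
    min_instances_for_class=3,  # add at least 3 confident instances per class each round
    random_state=0,
)
```
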
+
+
+
+
+
+
+
+
+
+ def
+ fit(self, X, y, **kwards):
+
+
+
+
+
+
1139deffit(self,X,y,**kwards):
+1140"""Build a CoTrainingByCommittee classifier from the training set (X, y).
+1141 Parameters
+1142 ----------
+1143 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1144 The training input samples.
+1145 y : array-like of shape (n_samples,)
+1146 The target values (class labels), -1 if unlabel.
+1147 Returns
+1148 -------
+1149 self : CoTrainingByCommittee
+1150 Fitted estimator.
+1151 """
+1152self.ensemble_estimator=skclone(self.ensemble_estimator)
+1153random_state=check_random_state(self.random_state)
+1154
+1155X_label,y_prev,X_unlabel=get_dataset(X,y)
+1156
+1157is_df=isinstance(X_label,pd.DataFrame)
+1158
+1159self.label_encoder_=LabelEncoder()
+1160y_label=self.label_encoder_.fit_transform(y_prev)
+1161
+1162self.classes_=self.label_encoder_.classes_
+1163
+1164prior=calculate_prior_probability(y_label)
+1165permutation=random_state.permutation(len(X_unlabel))
+1166
+1167self.ensemble_estimator.fit(X_label,y_label,**kwards)
+1168
+1169ifX_unlabel.shape[0]==0:
+1170returnself
+1171
+1172for_inrange(self.max_iterations):
+1173iflen(permutation)==0:
+1174break
+1175raw_predictions=self.ensemble_estimator.predict_proba(
+1176X_unlabel[permutation[0:self.poolsize]]ifnotis_dfelseX_unlabel.iloc[permutation[0:self.poolsize]]
+1177)
+1178
+1179predictions=np.max(raw_predictions,axis=1)
+1180class_predicted=np.argmax(raw_predictions,axis=1)
+1181
+1182added=np.zeros(predictions.shape,dtype=bool)
+1183# First the n (or less) most confidence instances will be selected
+1184forcinself.ensemble_estimator.classes_:
+1185condition=class_predicted==c
+1186
+1187candidates=predictions[condition]
+1188candidates_bool=np.zeros(predictions.shape,dtype=bool)
+1189candidates_sub_set=candidates_bool[condition]
+1190
+1191instances_index_selected=candidates.argsort(kind="mergesort")[
+1192-self.min_instances_for_class:
+1193]
+1194
+1195candidates_sub_set[instances_index_selected]=True
+1196candidates_bool[condition]+=candidates_sub_set
+1197
+1198added[candidates_bool]=True
+1199
+1200# Under this interpretation, at least n instances of each class are guaranteed per iteration,
+1201# but instances already added by the proportional selection are not duplicated.
+1202
+1203# The alternative interpretation would ignore the first n instances of each class.
+1204to_label=choice_with_proportion(
+1205predictions,class_predicted,prior,extra=self.min_instances_for_class
+1206)
+1207added[to_label]=True
+1208
+1209index=permutation[0:self.poolsize][added]
+1210X_label=np.append(X_label,X_unlabel[index],axis=0)ifnotis_dfelsepd.concat(
+1211[X_label,X_unlabel.iloc[index,:]]
+1212)
+1213pseudoy=class_predicted[added]
+1214
+1215y_label=np.append(y_label,pseudoy)
+1216permutation=permutation[list(map(lambdax:xnotinindex,permutation))]
+1217
+1218self.ensemble_estimator.fit(X_label,y_label,**kwards)
+1219
+1220returnself
+
+
+
+
Build a CoTrainingByCommittee classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.
+
+
+
Returns
+
+
+
self (CoTrainingByCommittee):
+Fitted estimator.
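A minimal sketch of the expected label encoding: unlabeled instances carry the value -1 in `y`. The toy arrays below are purely illustrative.

```python
import numpy as np
from sslearn.wrapper import CoTrainingByCommittee

# Four labeled samples (classes 0 and 1) and two unlabeled samples (marked with -1)
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.4, 0.6]])
y = np.array([0, 0, 1, 1, -1, -1])

ctc = CoTrainingByCommittee(random_state=0)
ctc.fit(X, y)        # trains on the labeled part and pseudo-labels from the unlabeled pool
ctc.predict(X[-2:])  # predictions for the previously unlabeled samples
```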
+
+
+
+
+
+
+
+
+
+ def
+ predict(self, X):
+
+
+
+
+
+
1222defpredict(self,X):
+1223"""Predict class value for X.
+1224 For a classification model, the predicted class for each sample in X is returned.
+1225 Parameters
+1226 ----------
+1227 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1228 The input samples.
+1229 Returns
+1230 -------
+1231 y : array-like of shape (n_samples,)
+1232 The predicted classes
+1233 """
+1234check_is_fitted(self.ensemble_estimator)
+1235returnself.label_encoder_.inverse_transform(self.ensemble_estimator.predict(X))
+
+
+
+
Predict class value for X.
+For a classification model, the predicted class for each sample in X is returned.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
y (array-like of shape (n_samples,)):
+The predicted classes
+
+
+
+
+
+
+
+
+
+ def
+ predict_proba(self, X):
+
+
+
+
+
+
1237defpredict_proba(self,X):
+1238"""Predict class probabilities of the input samples X.
+1239 The predicted class probability depends on the ensemble estimator.
+1240 Parameters
+1241 ----------
+1242 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1243 The input samples.
+1244 Returns
+1245 -------
+1246 y : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
+1247 The predicted classes
+1248 """
+1249check_is_fitted(self.ensemble_estimator)
+1250returnself.ensemble_estimator.predict_proba(X)
+
+
+
+
Predict class probabilities of the input samples X.
+The predicted class probability depends on the ensemble estimator.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The input samples.
+
+
+
Returns
+
+
+
y (ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1):
The predicted class probabilities.
1252defscore(self,X,y,sample_weight=None):
+1253"""Return the mean accuracy on the given test data and labels.
+1254 In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
+1255 Parameters
+1256 ----------
+1257 X : array-like of shape (n_samples, n_features)
+1258 Test samples.
+1259 y : array-like of shape (n_samples,) or (n_samples, n_outputs)
+1260 True labels for X.
+1261 sample_weight : array-like of shape (n_samples,), optional
+1262 Sample weights., by default None
+1263 Returns
+1264 -------
+1265 score: float
+1266 Mean accuracy of self.predict(X) wrt. y.
+1267 """
+1268try:
+1269y=self.label_encoder_.transform(y)
+1270exceptValueError:
+1271if"le_dict_"notindir(self):
+1272self.le_dict_=dict(
+1273zip(
+1274self.label_encoder_.classes_,
+1275self.label_encoder_.transform(self.label_encoder_.classes_),
+1276)
+1277)
+1278y=np.array(list(map(lambdax:self.le_dict_.get(x,-1),y)),dtype=y.dtype)
+1279
+1280returnself.ensemble_estimator.score(X,y,sample_weight)
+
+
+
+
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.
+
+
Parameters
+
+
+
X (array-like of shape (n_samples, n_features)):
+Test samples.
+
y (array-like of shape (n_samples,) or (n_samples, n_outputs)):
+True labels for X.
+
sample_weight (array-like of shape (n_samples,), optional):
Sample weights, by default None
+
+
+
Returns
+
+
+
score (float):
+Mean accuracy of self.predict(X) wrt. y.
+
+
+
+
+
+
+
Inherited Members
+
+
sklearn.base.BaseEstimator
+
get_params
+
set_params
+
+
+
+
+
+
+
+
+
+ class
+ DemocraticCoLearning(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
118classDemocraticCoLearning(BaseCoTraining):
+119"""
+120 **Democratic Co-learning. Ensemble of classifiers of different types.**
+121 --------------------------------------------
+122
+123 A iterative algorithm that uses a ensemble of classifiers to label instances.
+124 The main process is:
+125 1. Train each classifier with the labeled instances.
+126 2. While any classifier is retrained:
+127 1. Predict the instances from the unlabeled set.
+128 2. Calculate the confidence interval for each classifier for define weights.
+129 3. Calculate the weighted vote for each instance.
+130 4. Calculate the majority vote for each instance.
+131 5. Select the instances to label if majority vote is the same as weighted vote.
+132 6. Select the instances to retrain the classifier, if `only_mislabeled` is False then select all instances, else select only mislabeled instances for each classifier.
+133 7. Retrain the classifier with the new instances if the error rate is lower than the previous iteration.
+134 3. Ignore the classifiers with confidence interval lower than 0.5.
+135 4. Combine the probabilities of each classifier.
+136
+137 **Methods**
+138 -------
+139 - `fit`: Fit the model with the labeled instances.
+140 - `predict` : Predict the class for each instance.
+141 - `predict_proba`: Predict the probability for each class.
+142 - `score`: Return the mean accuracy on the given test data and labels.
+143
+144
+145 **Example**
+146 -------
+147 ```python
+148 from sklearn.datasets import load_iris
+149 from sklearn.tree import DecisionTreeClassifier
+150 from sklearn.naive_bayes import GaussianNB
+151 from sklearn.neighbors import KNeighborsClassifier
+152 from sslearn.wrapper import DemocraticCoLearning
+153 from sslearn.model_selection import artificial_ssl_dataset
+154
+155 X, y = load_iris(return_X_y=True)
+156 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+157 dcl = DemocraticCoLearning(base_estimator=[DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(n_neighbors=3)])
+158 dcl.fit(X, y)
+159 dcl.score(X_unlabel, y_unlabel)
+160 ```
+161
+162 **References**
+163 ----------
+164 Y. Zhou and S. Goldman, (2004) <br>
+165 Democratic co-learning, <br>
+166 in <i>16th IEEE International Conference on Tools with Artificial Intelligence</i>,<br>
+167 pp. 594-602, [10.1109/ICTAI.2004.48](https://doi.org/10.1109/ICTAI.2004.48).
+168 """
+169
+170def__init__(
+171self,
+172base_estimator=[
+173DecisionTreeClassifier(),
+174GaussianNB(),
+175KNeighborsClassifier(n_neighbors=3),
+176],
+177n_estimators=None,
+178expand_only_mislabeled=True,
+179alpha=0.95,
+180q_exp=2,
+181random_state=None
+182):
+183"""
+184 Democratic Co-learning. Ensemble of classifiers of different types.
+185
+186 Parameters
+187 ----------
+188 base_estimator : {ClassifierMixin, list}, optional
+189 An estimator object implementing fit and predict_proba or a list of ClassifierMixin, by default DecisionTreeClassifier()
+190 n_estimators : int, optional
+191 number of base_estimators to use. None if base_estimator is a list, by default None
+192 expand_only_mislabeled : bool, optional
+193 expand only mislabeled instances by itself, by default True
+194 alpha : float, optional
+195 confidence level, by default 0.95
+196 q_exp : int, optional
+197 exponent for the estimation for error rate, by default 2
+198 random_state : int, RandomState instance, optional
+199 controls the randomness of the estimator, by default None
+200 Raises
+201 ------
+202 AttributeError
+203 If n_estimators is None and base_estimator is not a list
+204 """
+205
+206ifisinstance(base_estimator,ClassifierMixin)andn_estimatorsisnotNone:
+207estimators=list()
+208random_available=True
+209rand=check_random_state(random_state)
+210if"random_state"notindir(base_estimator):
+211warnings.warn(
+212"The classifier will not be able to converge correctly, there is not enough diversity among the estimators (learners should be different).",
+213ConvergenceWarning,
+214)
+215random_available=False
+216foriinrange(n_estimators):
+217estimators.append(skclone(base_estimator))
+218ifrandom_available:
+219estimators[i].random_state=rand.randint(0,1e5)
+220self.base_estimator=estimators
+221
+222elifisinstance(base_estimator,list):
+223self.base_estimator=base_estimator
+224else:
+225raiseAttributeError(
+226"If `n_estimators` is None then `base_estimator` must be a `list`."
+227)
+228self.base_estimator=check_classifier(self.base_estimator)
+229self.n_estimators=len(self.base_estimator)
+230self.one_hot=OneHotEncoder(sparse_output=False)
+231self.expand_only_mislabeled=expand_only_mislabeled
+232
+233self.alpha=alpha
+234self.q_exp=q_exp
+235self.random_state=random_state
+236
+237def__weighted_y(self,predictions,weights):
+238y_complete=np.sum(
+239[
+240self.one_hot.transform(p.reshape(-1,1))*wi
+241forp,wiinzip(predictions,weights)
+242],
+2430,
+244)
+245y_zeros=np.zeros(y_complete.shape)
+246y_zeros[np.arange(y_complete.shape[0]),y_complete.argmax(1)]=1
+247
+248returnself.one_hot.inverse_transform(y_zeros).flatten()
+249
+250def__calcule_last_confidences(self,X,y):
+251"""Calculate the confidence of each learner
+252
+253 Parameters
+254 ----------
+255 X : array-like
+256 Set of instances
+257 y : array-like
+258 Set of classes for each instance
+259 """
+260w=[]
+261w=[sum(confidence_interval(X,H,y,self.alpha))/2forHinself.h_]
+262self.confidences_=w
+263
+264deffit(self,X,y,estimator_kwards=None):
+265"""Fit Democratic-Co classifier
+266
+267 Parameters
+268 ----------
+269 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+270 The training input samples.
+271 y : array-like of shape (n_samples,)
+272 The target values (class labels), -1 if unlabel.
+273 estimator_kwards : {list, dict}, optional
+274 list of kwards for each estimator or kwards for all estimators, by default None
+275
+276 Returns
+277 -------
+278 self : DemocraticCoLearning
+279 fitted classifier
+280 """
+281
+282X_label,y_label,X_unlabel=get_dataset(X,y)
+283
+284is_df=isinstance(X_label,pd.DataFrame)
+285
+286self.classes_=np.unique(y_label)
+287self.encoder=LabelEncoder().fit(y_label)
+288y_label=self.encoder.transform(y_label)
+289
+290self.one_hot.fit(y_label.reshape(-1,1))
+291
+292L=[X_label]*self.n_estimators
+293Ly=[y_label]*self.n_estimators
+294# This variable prevents duplicate instances.
+295L_added=[np.zeros(X_unlabel.shape[0]).astype(bool)]*self.n_estimators
+296e=[0]*self.n_estimators
+297
+298ifestimator_kwardsisNone:
+299estimator_kwards=[{}]*self.n_estimators
+300
+301changed=True
+302iteration=0
+303whilechanged:
+304changed=False
+305iteration_dict={}
+306iteration+=1
+307
+308foriinrange(self.n_estimators):
+309self.base_estimator[i].fit(L[i],Ly[i],**estimator_kwards[i])
+310ifX_unlabel.shape[0]==0:
+311break
+312# Majority Vote
+313predictions=[H.predict(X_unlabel)forHinself.base_estimator]
+314majority_class=mode(np.array(predictions,dtype=predictions[0].dtype))[0]
+315# majority_class = st.mode(np.array(predictions, dtype=predictions[0].dtype), axis=0, keepdims=True)[
+316# 0
+317# ].flatten() # K in pseudocode
+318
+319L_=[[]]*self.n_estimators
+320Ly_=[[]]*self.n_estimators
+321
+322# Calculate confidence interval
+323conf_interval=[
+324confidence_interval(
+325X_label,
+326H,
+327y_label,
+328self.alpha
+329)
+330forHinself.base_estimator
+331]
+332
+333weights=[(li+hi)/2for(li,hi)inconf_interval]
+334iteration_dict["weights"]={
+335"cl"+str(i):(l,h,w)
+336fori,((l,h),w)inenumerate(zip(conf_interval,weights))
+337}
+338# weighted vote
+339weighted_class=self.__weighted_y(predictions,weights)
+340
+341# If `weighted_class` is equal as `majority_class` then
+342# the sum of classifier's weights of max voted class
+343# is greater than the max of sum of classifier's weights
+344# from another classes.
+345
+346candidates=weighted_class==majority_class
+347candidates_bool=list()
+348
+349ifnotself.expand_only_mislabeled:
+350all_same_list=list()
+351foriinrange(1,self.n_estimators):
+352all_same_list.append(predictions[i]==predictions[i-1])
+353all_same=np.logical_and(*all_same_list)
+354# new_instances = []
+355foriinrange(self.n_estimators):
+356
+357mispredictions=predictions[i]!=weighted_class
+358# An instance from U are added to Li' only if:
+359# It is a misprediction for i
+360# It is a candidate (weighted_class are same majority_class)
+361# It hasn't been added yet in Li
+362
+363candidates_temp=np.logical_and(mispredictions,candidates)
+364
+365ifnotself.expand_only_mislabeled:
+366candidates_temp=np.logical_or(candidates_temp,all_same)
+367
+368to_add=np.logical_and(np.logical_not(L_added[i]),candidates_temp)
+369
+370candidates_bool.append(to_add)
+371ifis_df:
+372L_[i]=X_unlabel.iloc[to_add,:]
+373else:
+374L_[i]=X_unlabel[to_add,:]
+375Ly_[i]=weighted_class[to_add]
+376
+377new_conf_interval=[
+378confidence_interval(L[i],H,Ly[i],self.alpha)
+379fori,Hinenumerate(self.base_estimator)
+380]
+381e_factor=1-sum([l_forl_,_innew_conf_interval])/self.n_estimators
+382fori,_inenumerate(self.base_estimator):
+383iflen(L_[i])>0:
+384
+385qi=len(L[i])*((1-2*(e[i]/len(L[i])))**2)
+386e_i=e_factor*len(L_[i])
+387# |Li|+|L'i| == |Li U L'i| because of to_add
+388q_i=(len(L[i])+len(L_[i]))*(
+3891-2*(e[i]+e_i)/(len(L[i])+len(L_[i]))
+390)**self.q_exp
+391ifq_i<=qi:
+392continue
+393L_added[i]=np.logical_or(L_added[i],candidates_bool[i])
+394ifis_df:
+395L[i]=pd.concat([L[i],L_[i]])
+396else:
+397L[i]=np.concatenate((L[i],np.array(L_[i])))
+398Ly[i]=np.concatenate((Ly[i],np.array(Ly_[i])))
+399
+400e[i]=e[i]+e_i
+401changed=True
+402
+403self.h_=self.base_estimator
+404self.__calcule_last_confidences(X_label,y_label)
+405
+406# Ignore hypothesis
+407self.h_=[Hforw,Hinzip(self.confidences_,self.h_)ifw>0.5]
+408self.confidences_=[wforwinself.confidences_ifw>0.5]
+409
+410self.columns_=[list(range(X.shape[1]))]*self.n_estimators
+411
+412returnself
+413
+414def__combine_probabilities(self,X):
+415
+416n_instances=X.shape[0]# uppercase X as it will be an np.array
+417sizes=np.zeros((n_instances,len(self.classes_)),dtype=int)
+418C=np.zeros((n_instances,len(self.classes_)),dtype=float)
+419Cavg=np.zeros((n_instances,len(self.classes_)),dtype=float)
+420
+421forw,Hinzip(self.confidences_,self.h_):
+422cj=H.predict(X)
+423factor=self.one_hot.transform(cj.reshape(-1,1)).astype(int)
+424C+=w*factor
+425sizes+=factor
+426
+427Cavg[sizes==0]=0.5# «voting power» of 0.5 for small groups
+428ne=(sizes!=0)# non empty groups
+429Cavg[ne]=(sizes[ne]+0.5)/(sizes[ne]+1)*C[ne]/sizes[ne]
+430
+431returnsoftmax(Cavg,axis=1)
+432
+433defpredict_proba(self,X):
+434"""Predict probability for each possible outcome.
+435
+436 Parameters
+437 ----------
+438 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+439 Array representing the data.
+440 Returns
+441 -------
+442 class probabilities: ndarray of shape (n_samples, n_classes)
+443 Array with prediction probabilities.
+444 """
+445if"h_"indir(self):
+446iflen(X)==1:
+447X=[X]
+448returnself.__combine_probabilities(X)
+449else:
+450raiseNotFittedError("Classifier not fitted")
+
+
+
+
Democratic Co-learning. Ensemble of classifiers of different types.

An iterative algorithm that uses an ensemble of classifiers to label instances.
The main process is:

1. Train each classifier with the labeled instances.
2. While any classifier is retrained:
    1. Predict the instances from the unlabeled set.
    2. Calculate the confidence interval of each classifier to define its weight.
    3. Calculate the weighted vote for each instance.
    4. Calculate the majority vote for each instance.
    5. Select the instances to label if the majority vote is the same as the weighted vote.
    6. Select the instances used to retrain each classifier: if expand_only_mislabeled is False select all instances, otherwise select only the instances mislabeled by that classifier.
    7. Retrain the classifier with the new instances if the error rate is lower than in the previous iteration.
3. Ignore the classifiers with a confidence lower than 0.5.
4. Combine the probabilities of each classifier.

A usage sketch is given after the reference below.
Y. Zhou and S. Goldman, (2004)
Democratic co-learning,
in 16th IEEE International Conference on Tools with Artificial Intelligence,
pp. 594-602, 10.1109/ICTAI.2004.48.
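The example from the class docstring, reproduced here next to the rendered description (three heterogeneous learners vote on the unlabeled pool):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sslearn.wrapper import DemocraticCoLearning
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
# Unlabeled instances are marked with -1
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
dcl = DemocraticCoLearning(base_estimator=[DecisionTreeClassifier(), GaussianNB(),
                                           KNeighborsClassifier(n_neighbors=3)])
dcl.fit(X, y)
dcl.score(X_unlabel, y_unlabel)
```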
170def__init__(
+171self,
+172base_estimator=[
+173DecisionTreeClassifier(),
+174GaussianNB(),
+175KNeighborsClassifier(n_neighbors=3),
+176],
+177n_estimators=None,
+178expand_only_mislabeled=True,
+179alpha=0.95,
+180q_exp=2,
+181random_state=None
+182):
+183"""
+184 Democratic Co-learning. Ensemble of classifiers of different types.
+185
+186 Parameters
+187 ----------
+188 base_estimator : {ClassifierMixin, list}, optional
+189 An estimator object implementing fit and predict_proba or a list of ClassifierMixin, by default DecisionTreeClassifier()
+190 n_estimators : int, optional
+191 number of base_estimators to use. None if base_estimator is a list, by default None
+192 expand_only_mislabeled : bool, optional
+193 expand only mislabeled instances by itself, by default True
+194 alpha : float, optional
+195 confidence level, by default 0.95
+196 q_exp : int, optional
+197 exponent for the estimation for error rate, by default 2
+198 random_state : int, RandomState instance, optional
+199 controls the randomness of the estimator, by default None
+200 Raises
+201 ------
+202 AttributeError
+203 If n_estimators is None and base_estimator is not a list
+204 """
+205
+206ifisinstance(base_estimator,ClassifierMixin)andn_estimatorsisnotNone:
+207estimators=list()
+208random_available=True
+209rand=check_random_state(random_state)
+210if"random_state"notindir(base_estimator):
+211warnings.warn(
+212"The classifier will not be able to converge correctly, there is not enough diversity among the estimators (learners should be different).",
+213ConvergenceWarning,
+214)
+215random_available=False
+216foriinrange(n_estimators):
+217estimators.append(skclone(base_estimator))
+218ifrandom_available:
+219estimators[i].random_state=rand.randint(0,1e5)
+220self.base_estimator=estimators
+221
+222elifisinstance(base_estimator,list):
+223self.base_estimator=base_estimator
+224else:
+225raiseAttributeError(
+226"If `n_estimators` is None then `base_estimator` must be a `list`."
+227)
+228self.base_estimator=check_classifier(self.base_estimator)
+229self.n_estimators=len(self.base_estimator)
+230self.one_hot=OneHotEncoder(sparse_output=False)
+231self.expand_only_mislabeled=expand_only_mislabeled
+232
+233self.alpha=alpha
+234self.q_exp=q_exp
+235self.random_state=random_state
+
+
+
+
Democratic Co-learning. Ensemble of classifiers of different types.

Parameters

base_estimator ({ClassifierMixin, list}, optional):
An estimator object implementing fit and predict_proba, or a list of ClassifierMixin, by default [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(n_neighbors=3)]

n_estimators (int, optional):
number of copies of base_estimator to use; None if base_estimator is a list, by default None

expand_only_mislabeled (bool, optional):
expand each classifier only with the instances it mislabeled, by default True

alpha (float, optional):
confidence level, by default 0.95

q_exp (int, optional):
exponent for the error rate estimation, by default 2

random_state (int, RandomState instance, optional):
controls the randomness of the estimator, by default None

Raises

AttributeError: If n_estimators is None and base_estimator is not a list
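A sketch of the alternative construction path described above: passing a single estimator together with `n_estimators` clones it, and each clone receives its own `random_state` when the estimator exposes one (an estimator without `random_state` triggers the diversity warning).

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import DemocraticCoLearning

# Three differently seeded copies of the same learner
dcl = DemocraticCoLearning(base_estimator=DecisionTreeClassifier(), n_estimators=3, random_state=0)

# Passing neither a list nor n_estimators raises the AttributeError documented above.
```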
264deffit(self,X,y,estimator_kwards=None):
+265"""Fit Democratic-Co classifier
+266
+267 Parameters
+268 ----------
+269 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+270 The training input samples.
+271 y : array-like of shape (n_samples,)
+272 The target values (class labels), -1 if unlabel.
+273 estimator_kwards : {list, dict}, optional
+274 list of kwards for each estimator or kwards for all estimators, by default None
+275
+276 Returns
+277 -------
+278 self : DemocraticCoLearning
+279 fitted classifier
+280 """
+281
+282X_label,y_label,X_unlabel=get_dataset(X,y)
+283
+284is_df=isinstance(X_label,pd.DataFrame)
+285
+286self.classes_=np.unique(y_label)
+287self.encoder=LabelEncoder().fit(y_label)
+288y_label=self.encoder.transform(y_label)
+289
+290self.one_hot.fit(y_label.reshape(-1,1))
+291
+292L=[X_label]*self.n_estimators
+293Ly=[y_label]*self.n_estimators
+294# This variable prevents duplicate instances.
+295L_added=[np.zeros(X_unlabel.shape[0]).astype(bool)]*self.n_estimators
+296e=[0]*self.n_estimators
+297
+298ifestimator_kwardsisNone:
+299estimator_kwards=[{}]*self.n_estimators
+300
+301changed=True
+302iteration=0
+303whilechanged:
+304changed=False
+305iteration_dict={}
+306iteration+=1
+307
+308foriinrange(self.n_estimators):
+309self.base_estimator[i].fit(L[i],Ly[i],**estimator_kwards[i])
+310ifX_unlabel.shape[0]==0:
+311break
+312# Majority Vote
+313predictions=[H.predict(X_unlabel)forHinself.base_estimator]
+314majority_class=mode(np.array(predictions,dtype=predictions[0].dtype))[0]
+315# majority_class = st.mode(np.array(predictions, dtype=predictions[0].dtype), axis=0, keepdims=True)[
+316# 0
+317# ].flatten() # K in pseudocode
+318
+319L_=[[]]*self.n_estimators
+320Ly_=[[]]*self.n_estimators
+321
+322# Calculate confidence interval
+323conf_interval=[
+324confidence_interval(
+325X_label,
+326H,
+327y_label,
+328self.alpha
+329)
+330forHinself.base_estimator
+331]
+332
+333weights=[(li+hi)/2for(li,hi)inconf_interval]
+334iteration_dict["weights"]={
+335"cl"+str(i):(l,h,w)
+336fori,((l,h),w)inenumerate(zip(conf_interval,weights))
+337}
+338# weighted vote
+339weighted_class=self.__weighted_y(predictions,weights)
+340
+341# If `weighted_class` is equal as `majority_class` then
+342# the sum of classifier's weights of max voted class
+343# is greater than the max of sum of classifier's weights
+344# from another classes.
+345
+346candidates=weighted_class==majority_class
+347candidates_bool=list()
+348
+349ifnotself.expand_only_mislabeled:
+350all_same_list=list()
+351foriinrange(1,self.n_estimators):
+352all_same_list.append(predictions[i]==predictions[i-1])
+353all_same=np.logical_and(*all_same_list)
+354# new_instances = []
+355foriinrange(self.n_estimators):
+356
+357mispredictions=predictions[i]!=weighted_class
+358# An instance from U are added to Li' only if:
+359# It is a misprediction for i
+360# It is a candidate (weighted_class are same majority_class)
+361# It hasn't been added yet in Li
+362
+363candidates_temp=np.logical_and(mispredictions,candidates)
+364
+365ifnotself.expand_only_mislabeled:
+366candidates_temp=np.logical_or(candidates_temp,all_same)
+367
+368to_add=np.logical_and(np.logical_not(L_added[i]),candidates_temp)
+369
+370candidates_bool.append(to_add)
+371ifis_df:
+372L_[i]=X_unlabel.iloc[to_add,:]
+373else:
+374L_[i]=X_unlabel[to_add,:]
+375Ly_[i]=weighted_class[to_add]
+376
+377new_conf_interval=[
+378confidence_interval(L[i],H,Ly[i],self.alpha)
+379fori,Hinenumerate(self.base_estimator)
+380]
+381e_factor=1-sum([l_forl_,_innew_conf_interval])/self.n_estimators
+382fori,_inenumerate(self.base_estimator):
+383iflen(L_[i])>0:
+384
+385qi=len(L[i])*((1-2*(e[i]/len(L[i])))**2)
+386e_i=e_factor*len(L_[i])
+387# |Li|+|L'i| == |Li U L'i| because of to_add
+388q_i=(len(L[i])+len(L_[i]))*(
+3891-2*(e[i]+e_i)/(len(L[i])+len(L_[i]))
+390)**self.q_exp
+391ifq_i<=qi:
+392continue
+393L_added[i]=np.logical_or(L_added[i],candidates_bool[i])
+394ifis_df:
+395L[i]=pd.concat([L[i],L_[i]])
+396else:
+397L[i]=np.concatenate((L[i],np.array(L_[i])))
+398Ly[i]=np.concatenate((Ly[i],np.array(Ly_[i])))
+399
+400e[i]=e[i]+e_i
+401changed=True
+402
+403self.h_=self.base_estimator
+404self.__calcule_last_confidences(X_label,y_label)
+405
+406# Ignore hypothesis
+407self.h_=[Hforw,Hinzip(self.confidences_,self.h_)ifw>0.5]
+408self.confidences_=[wforwinself.confidences_ifw>0.5]
+409
+410self.columns_=[list(range(X.shape[1]))]*self.n_estimators
+411
+412returnself
+
+
+
+
Fit the Democratic Co-learning classifier.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.
+
estimator_kwards ({list, dict}, optional):
a list of keyword-argument dicts, one per estimator, or a single dict applied to all estimators, by default None
+
+
+
Returns
+
+
+
self (DemocraticCoLearning):
+fitted classifier
+
+
+
+
+
+
+
+
+
+ def
+ predict_proba(self, X):
+
+
+
+
+
+
433defpredict_proba(self,X):
+434"""Predict probability for each possible outcome.
+435
+436 Parameters
+437 ----------
+438 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+439 Array representing the data.
+440 Returns
+441 -------
+442 class probabilities: ndarray of shape (n_samples, n_classes)
+443 Array with prediction probabilities.
+444 """
+445if"h_"indir(self):
+446iflen(X)==1:
+447X=[X]
+448returnself.__combine_probabilities(X)
+449else:
+450raiseNotFittedError("Classifier not fitted")
+
+
+
+
Predict probability for each possible outcome.
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+Array representing the data.
+
+
+
Returns
+
+
+
class probabilities (ndarray of shape (n_samples, n_classes)):
+Array with prediction probabilities.
+
+ class
+ Rasco(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
767classRasco(BaseCoTraining):
+768"""
+769 **Co-Training based on random subspaces**
+770 --------------------------------------------
+771
+772 Generate a set of random subspaces and train a classifier for each subspace.
+773
+774 The main process is:
+775 1. Generate a set of random subspaces.
+776 2. Train a classifier for each subspace.
+777 3. While max iterations is not reached or any instance is unlabeled:
+778 1. Predict the instances from the unlabeled set for each classifier.
+779 2. Calculate the average of the predictions.
+780 3. Select the instances with the highest probability.
+781 4. Label the instances with the highest probability, keeping the balance of the classes.
+782 5. Retrain the classifier with the new instances.
+783 4. Combine the probabilities of each classifier.
+784
+785 **Methods**
+786 -------
+787 - `fit`: Fit the model with the labeled instances.
+788 - `predict` : Predict the class for each instance.
+789 - `predict_proba`: Predict the probability for each class.
+790 - `score`: Return the mean accuracy on the given test data and labels.
+791
+792 **Example**
+793 -------
+794 ```python
+795 from sklearn.datasets import load_iris
+796 from sslearn.wrapper import Rasco
+797 from sslearn.model_selection import artificial_ssl_dataset
+798
+799 X, y = load_iris(return_X_y=True)
+800 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+801 rasco = Rasco()
+802 rasco.fit(X, y)
+803 rasco.score(X_unlabel, y_unlabel)
+804 ```
+805
+806 **References**
+807 ----------
+808 Wang, J., Luo, S. W., & Zeng, X. H. (2008).<br>
+809 A random subspace method for co-training,<br>
+810 in <i>2008 IEEE International Joint Conference on Neural Networks</i><br>
+811 IEEE World Congress on Computational Intelligence<br>
+812 (pp. 195-200). IEEE. [10.1109/IJCNN.2008.4633789](https://doi.org/10.1109/IJCNN.2008.4633789)
+813 """
+814
+815
+816def__init__(
+817self,
+818base_estimator=DecisionTreeClassifier(),
+819max_iterations=10,
+820n_estimators=30,
+821subspace_size=None,
+822random_state=None,
+823n_jobs=None,
+824):
+825"""
+826 Co-Training based on random subspaces
+827
+828 Parameters
+829 ----------
+830 base_estimator : ClassifierMixin, optional
+831 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+832 max_iterations : int, optional
+833 Maximum number of iterations allowed. Should be greater than or equal to 0.
+834 If is -1 then will be infinite iterations until U be empty, by default 10
+835 n_estimators : int, optional
+836 The number of base estimators in the ensemble., by default 30
+837 subspace_size : int, optional
+838 The number of features for each subspace. If it is None will be the half of the features size., by default None
+839 random_state : int, RandomState instance, optional
+840 controls the randomness of the estimator, by default None
+841 """
+842self.base_estimator=check_classifier(base_estimator,True,n_estimators)# C in paper
+843self.max_iterations=max_iterations# J in paper
+844self.n_estimators=n_estimators# K in paper
+845self.subspace_size=subspace_size# m in paper
+846self.n_jobs=check_n_jobs(n_jobs)
+847
+848self.random_state=random_state
+849
+850def_generate_random_subspaces(self,X,y=None,random_state=None):
+851"""Generate the random subspaces
+852
+853 Parameters
+854 ----------
+855 X : array like
+856 Labeled dataset
+857 y : array like, optional
+858 Target for each X, not needed on Rasco, by default None
+859
+860 Returns
+861 -------
+862 subspaces : list
+863 List of index of features
+864 """
+865random_state=check_random_state(random_state)
+866features=list(range(X.shape[1]))
+867idxs=[]
+868for_inrange(self.n_estimators):
+869idxs.append(random_state.permutation(features)[:self.subspace_size])
+870returnidxs
+871
+872def_fit_estimator(self,X,y,i,**kwards):
+873estimator=self.base_estimator
+874iftype(self.base_estimator)==list:
+875estimator=skclone(self.base_estimator[i])
+876returnskclone(estimator).fit(X,y,**kwards)
+877
+878deffit(self,X,y,**kwards):
+879"""Build a Rasco classifier from the training set (X, y).
+880
+881 Parameters
+882 ----------
+883 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+884 The training input samples.
+885 y : array-like of shape (n_samples,)
+886 The target values (class labels), -1 if unlabel.
+887
+888 Returns
+889 -------
+890 self: Rasco
+891 Fitted estimator.
+892 """
+893X_label,y_label,X_unlabel=get_dataset(X,y)
+894self.classes_=np.unique(y_label)
+895
+896is_df=isinstance(X_label,pd.DataFrame)
+897
+898random_state=check_random_state(self.random_state)
+899
+900self.classes_=np.unique(y_label)
+901number_per_class=calc_number_per_class(y_label)
+902
+903ifself.subspace_sizeisNone:
+904self.subspace_size=int(X.shape[1]/2)
+905idxs=self._generate_random_subspaces(X_label,y_label,random_state)
+906
+907cfs=Parallel(n_jobs=self.n_jobs)(
+908delayed(self._fit_estimator)(X_label[:,idxs[i]]ifnotis_dfelseX_label.iloc[:,idxs[i]],y_label,i,**kwards)
+909foriinrange(self.n_estimators)
+910)
+911
+912it=0
+913whileTrue:
+914if(self.max_iterations!=-1andit>=self.max_iterations)orlen(
+915X_unlabel
+916)==0:
+917break
+918
+919raw_predicions=[]
+920foriinrange(self.n_estimators):
+921rp=cfs[i].predict_proba(X_unlabel[:,idxs[i]]ifnotis_dfelseX_unlabel.iloc[:,idxs[i]])
+922raw_predicions.append(rp)
+923raw_predicions=sum(raw_predicions)/self.n_estimators
+924predictions=np.max(raw_predicions,axis=1)
+925class_predicted=np.argmax(raw_predicions,axis=1)
+926pseudoy=self.classes_.take(class_predicted,axis=0)
+927
+928final_instances=list()
+929best_candidates=np.argsort(predictions,kind="mergesort")[::-1]
+930forcinself.classes_:
+931final_instances+=list(best_candidates[pseudoy[best_candidates]==c])[:number_per_class[c]]
+932
+933Lj=X_unlabel[final_instances]ifnotis_dfelseX_unlabel.iloc[final_instances]
+934yj=pseudoy[final_instances]
+935
+936X_label=np.append(X_label,Lj,axis=0)ifnotis_dfelsepd.concat([X_label,Lj])
+937y_label=np.append(y_label,yj)
+938X_unlabel=np.delete(X_unlabel,final_instances,axis=0)ifnotis_dfelseX_unlabel.drop(index=X_unlabel.index[final_instances])
+939
+940cfs=Parallel(n_jobs=self.n_jobs)(
+941delayed(self._fit_estimator)(X_label[:,idxs[i]]ifnotis_dfelseX_label.iloc[:,idxs[i]],y_label,i,**kwards)
+942foriinrange(self.n_estimators)
+943)
+944
+945it+=1
+946
+947self.h_=cfs
+948self.columns_=idxs
+949
+950returnself
+
+
+
+
Co-Training based on random subspaces

Generate a set of random subspaces and train a classifier for each subspace.

The main process is:

1. Generate a set of random subspaces.
2. Train a classifier for each subspace.
3. While the maximum number of iterations is not reached and unlabeled instances remain:
    1. Predict the instances from the unlabeled set with each classifier.
    2. Calculate the average of the predictions.
    3. Select the instances with the highest probability.
    4. Label the instances with the highest probability, keeping the balance of the classes.
    5. Retrain the classifiers with the new instances.
4. Combine the probabilities of each classifier.

A usage sketch is given after the reference below.
Wang, J., Luo, S. W., & Zeng, X. H. (2008).
A random subspace method for co-training,
in 2008 IEEE International Joint Conference on Neural Networks
(IEEE World Congress on Computational Intelligence)
(pp. 195-200). IEEE. 10.1109/IJCNN.2008.4633789
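The example from the class docstring, reproduced here next to the rendered description:

```python
from sklearn.datasets import load_iris
from sslearn.wrapper import Rasco
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
# Unlabeled instances are marked with -1
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
rasco = Rasco()
rasco.fit(X, y)
rasco.score(X_unlabel, y_unlabel)
```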
816def__init__(
+817self,
+818base_estimator=DecisionTreeClassifier(),
+819max_iterations=10,
+820n_estimators=30,
+821subspace_size=None,
+822random_state=None,
+823n_jobs=None,
+824):
+825"""
+826 Co-Training based on random subspaces
+827
+828 Parameters
+829 ----------
+830 base_estimator : ClassifierMixin, optional
+831 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+832 max_iterations : int, optional
+833 Maximum number of iterations allowed. Should be greater than or equal to 0.
+834 If is -1 then will be infinite iterations until U be empty, by default 10
+835 n_estimators : int, optional
+836 The number of base estimators in the ensemble., by default 30
+837 subspace_size : int, optional
+838 The number of features for each subspace. If it is None will be the half of the features size., by default None
+839 random_state : int, RandomState instance, optional
+840 controls the randomness of the estimator, by default None
+841 """
+842self.base_estimator=check_classifier(base_estimator,True,n_estimators)# C in paper
+843self.max_iterations=max_iterations# J in paper
+844self.n_estimators=n_estimators# K in paper
+845self.subspace_size=subspace_size# m in paper
+846self.n_jobs=check_n_jobs(n_jobs)
+847
+848self.random_state=random_state
+
+
+
+
Co-Training based on random subspaces
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

max_iterations (int, optional):
Maximum number of iterations allowed. Should be greater than or equal to 0.
If it is -1, iterations continue until U is empty, by default 10

n_estimators (int, optional):
The number of base estimators in the ensemble, by default 30

subspace_size (int, optional):
The number of features of each subspace. If None, half of the number of features is used, by default None

random_state (int, RandomState instance, optional):
controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of jobs to run in parallel. -1 means using all processors, by default None
+
+
+
+
+
+
+
+
+
+ def
+ fit(self, X, y, **kwards):
+
+
+
+
+
+
878deffit(self,X,y,**kwards):
+879"""Build a Rasco classifier from the training set (X, y).
+880
+881 Parameters
+882 ----------
+883 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+884 The training input samples.
+885 y : array-like of shape (n_samples,)
+886 The target values (class labels), -1 if unlabel.
+887
+888 Returns
+889 -------
+890 self: Rasco
+891 Fitted estimator.
+892 """
+893X_label,y_label,X_unlabel=get_dataset(X,y)
+894self.classes_=np.unique(y_label)
+895
+896is_df=isinstance(X_label,pd.DataFrame)
+897
+898random_state=check_random_state(self.random_state)
+899
+900self.classes_=np.unique(y_label)
+901number_per_class=calc_number_per_class(y_label)
+902
+903ifself.subspace_sizeisNone:
+904self.subspace_size=int(X.shape[1]/2)
+905idxs=self._generate_random_subspaces(X_label,y_label,random_state)
+906
+907cfs=Parallel(n_jobs=self.n_jobs)(
+908delayed(self._fit_estimator)(X_label[:,idxs[i]]ifnotis_dfelseX_label.iloc[:,idxs[i]],y_label,i,**kwards)
+909foriinrange(self.n_estimators)
+910)
+911
+912it=0
+913whileTrue:
+914if(self.max_iterations!=-1andit>=self.max_iterations)orlen(
+915X_unlabel
+916)==0:
+917break
+918
+919raw_predicions=[]
+920foriinrange(self.n_estimators):
+921rp=cfs[i].predict_proba(X_unlabel[:,idxs[i]]ifnotis_dfelseX_unlabel.iloc[:,idxs[i]])
+922raw_predicions.append(rp)
+923raw_predicions=sum(raw_predicions)/self.n_estimators
+924predictions=np.max(raw_predicions,axis=1)
+925class_predicted=np.argmax(raw_predicions,axis=1)
+926pseudoy=self.classes_.take(class_predicted,axis=0)
+927
+928final_instances=list()
+929best_candidates=np.argsort(predictions,kind="mergesort")[::-1]
+930forcinself.classes_:
+931final_instances+=list(best_candidates[pseudoy[best_candidates]==c])[:number_per_class[c]]
+932
+933Lj=X_unlabel[final_instances]ifnotis_dfelseX_unlabel.iloc[final_instances]
+934yj=pseudoy[final_instances]
+935
+936X_label=np.append(X_label,Lj,axis=0)ifnotis_dfelsepd.concat([X_label,Lj])
+937y_label=np.append(y_label,yj)
+938X_unlabel=np.delete(X_unlabel,final_instances,axis=0)ifnotis_dfelseX_unlabel.drop(index=X_unlabel.index[final_instances])
+939
+940cfs=Parallel(n_jobs=self.n_jobs)(
+941delayed(self._fit_estimator)(X_label[:,idxs[i]]ifnotis_dfelseX_label.iloc[:,idxs[i]],y_label,i,**kwards)
+942foriinrange(self.n_estimators)
+943)
+944
+945it+=1
+946
+947self.h_=cfs
+948self.columns_=idxs
+949
+950returnself
+
+
+
+
Build a Rasco classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.
953classRelRasco(Rasco):
+ 954"""
+ 955 **Co-Training based on relevant random subspaces**
+ 956 --------------------------------------------
+ 957
+ 958 Is a variation of `sslearn.wrapper.Rasco` that uses the mutual information of each feature to select the random subspaces.
+ 959 The process of training is the same as Rasco.
+ 960
+ 961 **Methods**
+ 962 -------
+ 963 - `fit`: Fit the model with the labeled instances.
+ 964 - `predict` : Predict the class for each instance.
+ 965 - `predict_proba`: Predict the probability for each class.
+ 966 - `score`: Return the mean accuracy on the given test data and labels.
+ 967
+ 968 **Example**
+ 969 -------
+ 970 ```python
+ 971 from sklearn.datasets import load_iris
+ 972 from sslearn.wrapper import RelRasco
+ 973 from sslearn.model_selection import artificial_ssl_dataset
+ 974
+ 975 X, y = load_iris(return_X_y=True)
+ 976 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ 977 relrasco = RelRasco()
+ 978 relrasco.fit(X, y)
+ 979 relrasco.score(X_unlabel, y_unlabel)
+ 980 ```
+ 981
+ 982 **References**
+ 983 ----------
+ 984 Yaslan, Y., & Cataltepe, Z. (2010).<br>
+ 985 Co-training with relevant random subspaces.<br>
+ 986 <i>Neurocomputing</i>, 73(10-12), 1652-1661.<br>
+ 987 [10.1016/j.neucom.2010.01.018](https://doi.org/10.1016/j.neucom.2010.01.018)
+ 988 """
+ 989
+ 990def__init__(
+ 991self,
+ 992base_estimator=DecisionTreeClassifier(),
+ 993max_iterations=10,
+ 994n_estimators=30,
+ 995subspace_size=None,
+ 996random_state=None,
+ 997n_jobs=None,
+ 998):
+ 999"""
+1000 Co-Training with relevant random subspaces
+1001
+1002 Parameters
+1003 ----------
+1004 base_estimator : ClassifierMixin, optional
+1005 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+1006 max_iterations : int, optional
+1007 Maximum number of iterations allowed. Should be greater than or equal to 0.
+1008 If is -1 then will be infinite iterations until U be empty, by default 10
+1009 n_estimators : int, optional
+1010 The number of base estimators in the ensemble., by default 30
+1011 subspace_size : int, optional
+1012 The number of features for each subspace. If it is None will be the half of the features size., by default None
+1013 random_state : int, RandomState instance, optional
+1014 controls the randomness of the estimator, by default None
+1015 n_jobs : int, optional
+1016 The number of jobs to run in parallel. -1 means using all processors., by default None
+1017
+1018 """
+1019super().__init__(
+1020base_estimator,
+1021max_iterations,
+1022n_estimators,
+1023subspace_size,
+1024random_state,
+1025n_jobs,
+1026)
+1027
+1028def_generate_random_subspaces(self,X,y,random_state=None):
+1029"""Generate the relevant random subspcaes
+1030
+1031 Parameters
+1032 ----------
+1033 X : array like
+1034 Labeled dataset
+1035 y : array like, optional
+1036 Target for each X, only needed on Rel-Rasco, by default None
+1037
+1038 Returns
+1039 -------
+1040 subspaces: list
+1041 List of index of features
+1042 """
+1043random_state=check_random_state(random_state)
+1044relevance=mutual_info_classif(X,y,random_state=random_state)
+1045idxs=[]
+1046for_inrange(self.n_estimators):
+1047subspace=[]
+1048for__inrange(self.subspace_size):
+1049f1=random_state.randint(0,X.shape[1])
+1050f2=random_state.randint(0,X.shape[1])
+1051ifrelevance[f1]>relevance[f2]:
+1052subspace.append(f1)
+1053else:
+1054subspace.append(f2)
+1055idxs.append(subspace)
+1056returnidxs
+
+
+
+
Co-Training based on relevant random subspaces

A variation of sslearn.wrapper.Rasco that uses the mutual information of each feature to select the random subspaces.
The training process is the same as in Rasco.
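The example from the class docstring; only the wrapper class changes with respect to Rasco:

```python
from sklearn.datasets import load_iris
from sslearn.wrapper import RelRasco
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
# Unlabeled instances are marked with -1
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
relrasco = RelRasco()
relrasco.fit(X, y)
relrasco.score(X_unlabel, y_unlabel)
```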
990def__init__(
+ 991self,
+ 992base_estimator=DecisionTreeClassifier(),
+ 993max_iterations=10,
+ 994n_estimators=30,
+ 995subspace_size=None,
+ 996random_state=None,
+ 997n_jobs=None,
+ 998):
+ 999"""
+1000 Co-Training with relevant random subspaces
+1001
+1002 Parameters
+1003 ----------
+1004 base_estimator : ClassifierMixin, optional
+1005 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+1006 max_iterations : int, optional
+1007 Maximum number of iterations allowed. Should be greater than or equal to 0.
+1008 If is -1 then will be infinite iterations until U be empty, by default 10
+1009 n_estimators : int, optional
+1010 The number of base estimators in the ensemble., by default 30
+1011 subspace_size : int, optional
+1012 The number of features for each subspace. If it is None will be the half of the features size., by default None
+1013 random_state : int, RandomState instance, optional
+1014 controls the randomness of the estimator, by default None
+1015 n_jobs : int, optional
+1016 The number of jobs to run in parallel. -1 means using all processors., by default None
+1017
+1018 """
+1019super().__init__(
+1020base_estimator,
+1021max_iterations,
+1022n_estimators,
+1023subspace_size,
+1024random_state,
+1025n_jobs,
+1026)
+
+
+
+
Co-Training with relevant random subspaces
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

max_iterations (int, optional):
Maximum number of iterations allowed. Should be greater than or equal to 0.
If it is -1, iterations continue until U is empty, by default 10

n_estimators (int, optional):
The number of base estimators in the ensemble, by default 30

subspace_size (int, optional):
The number of features of each subspace. If None, half of the number of features is used, by default None

random_state (int, RandomState instance, optional):
controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of jobs to run in parallel. -1 means using all processors, by default None
+
+ class
+ CoForest(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
1283classCoForest(BaseCoTraining):
+1284"""
+1285 **CoForest classifier. Random Forest co-training**
+1286 ----------------------------
+1287
+1288 Ensemble method for CoTraining based on Random Forest.
+1289
+1290 The main process is:
+1291 1. Train a committee of classifiers using bootstrap.
+1292 2. While any base classifier is retrained:
+1293 1. Predict the instances from the unlabeled set.
+1294 2. Select the instances with the highest probability.
+1295 3. Label the instances with the highest probability
+1296 4. Add the instances to the labeled set only if the error is not bigger than the previous error.
+1297 5. Retrain the classifier with the new instances.
+1298 3. Combine the probabilities of each classifier.
+1299
+1300
+1301 **Methods**
+1302 -------
+1303 - `fit`: Fit the model with the labeled instances.
+1304 - `predict` : Predict the class for each instance.
+1305 - `predict_proba`: Predict the probability for each class.
+1306 - `score`: Return the mean accuracy on the given test data and labels.
+1307
+1308 **Example**
+1309 -------
+1310 ```python
+1311 from sklearn.datasets import load_iris
+1312 from sslearn.wrapper import CoForest
+1313 from sslearn.model_selection import artificial_ssl_dataset
+1314
+1315 X, y = load_iris(return_X_y=True)
+1316 X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+1317 coforest = CoForest()
+1318 coforest.fit(X, y)
+1319 coforest.score(X_unlabel, y_unlabel)
+1320 ```
+1321
+1322 **References**
+1323 ----------
+1324 Li, M., & Zhou, Z.-H. (2007).<br>
+1325 Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples.<br>
+1326 <i>IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans</i>,<br>
+1327 37(6), 1088-1098. [10.1109/tsmca.2007.904745](https://doi.org/10.1109/tsmca.2007.904745)
+1328 """
+1329
+1330def__init__(self,base_estimator=DecisionTreeClassifier(),n_estimators=7,threshold=0.75,bootstrap=True,n_jobs=None,random_state=None,version="1.0.3"):
+1331"""
+1332 Generate a CoForest classifier.
+1333 A SSL Random Forest adaption for CoTraining.
+1334
+1335 Parameters
+1336 ----------
+1337 base_estimator : ClassifierMixin, optional
+1338 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+1339 n_estimators : int, optional
+1340 The number of base estimators in the ensemble., by default 7
+1341 threshold : float, optional
+1342 The decision threshold. Should be in [0, 1)., by default 0.5
+1343 n_jobs : int, optional
+1344 The number of jobs to run in parallel for both fit and predict., by default None
+1345 bootstrap : bool, optional
+1346 Whether bootstrap samples are used when building estimators., by default True
+1347 random_state : int, RandomState instance, optional
+1348 controls the randomness of the estimator, by default None
+1349 **kwards : dict, optional
+1350 Additional parameters to be passed to base_estimator, by default None.
+1351 """
+1352self.base_estimator=check_classifier(base_estimator,collection_size=n_estimators)
+1353self.n_estimators=n_estimators
+1354self.threshold=threshold
+1355self.bootstrap=bootstrap
+1356self._epsilon=sys.float_info.epsilon
+1357self.n_jobs=n_jobs
+1358self.random_state=random_state
+1359self.version=version
+1360ifself.version=="1.0.2":
+1361warnings.warn("The version 1.0.2 is deprecated. Please use the version 1.0.3",DeprecationWarning)
+1362
+1363def__bootstraping(self,X,y,r_state):
+1364# It is necessary to bootstrap the data
+1365ifself.bootstrapandself.version=="1.0.3":
+1366is_df=isinstance(X,pd.DataFrame)
+1367columns=None
+1368ifis_df:
+1369columns=X.columns
+1370X=X.to_numpy()
+1371y=y.copy()
+1372# Get a reprentation of each class
+1373classes=np.unique(y)
+1374# Choose at least one sample from each class
+1375X_label,y_label=[],[]
+1376forcinclasses:
+1377index=np.where(y==c)[0]
+1378# Choose one sample from each class
+1379X_label.append(X[index[0],:])
+1380y_label.append(y[index[0]])
+1381# Remove the sample from the original data
+1382X=np.delete(X,index[0],axis=0)
+1383y=np.delete(y,index[0],axis=0)
+1384X,y=resample(X,y,random_state=r_state)
+1385X=np.concatenate((X,np.array(X_label)),axis=0)
+1386y=np.concatenate((y,np.array(y_label)),axis=0)
+1387ifis_df:
+1388X=pd.DataFrame(X,columns=columns)
+1389returnX,y
+1390
+1391def__estimate_error(self,hypothesis,X,y,index):
+1392ifself.version=="1.0.3":
+1393concomitants=[hfori,hinenumerate(self.hypotheses)ifi!=index]
+1394predicted=[h.predict(X)forhinconcomitants]
+1395predicted=np.array(predicted,dtype=y.dtype)
+1396# Get the majority vote
+1397predicted,_=mode(predicted)
+1398# predicted, _ = st.mode(predicted, axis=1)
+1399# Get the error rate
+1400return1-accuracy_score(y,predicted)
+1401else:
+1402probas=hypothesis.predict_proba(X)
+1403ei_t=0
+1404classes=list(hypothesis.classes_)
+1405forjinrange(y.shape[0]):
+1406true_y=y[j]
+1407true_y_index=classes.index(true_y)
+1408ei_t+=1-probas[j,true_y_index]
+1409ifei_t==0:
+1410ei_t=self._epsilon
+1411returnei_t
+1412
+1413def__confidence(self,h_index,X):
+1414concomitants=[hfori,hinenumerate(self.hypotheses)ifi!=h_index]
+1415
+1416predicted=[h.predict(X)forhinconcomitants]
+1417predicted=np.array(predicted,dtype=predicted[0].dtype)
+1418# Get the majority vote and the number of votes
+1419_,counts=mode(predicted)
+1420# _, counts = st.mode(predicted, axis=1)
+1421confidences=counts/len(concomitants)
+1422returnconfidences
+1423
+1424def_fit_estimator(self,X,y,i,beginning=False,**kwards):
+1425estimator=self.base_estimator
+1426iftype(self.base_estimator)==list:
+1427estimator=skclone(self.hypotheses[i])
+1428
+1429if"random_state"inestimator.get_params():
+1430r_state=estimator.random_state
+1431else:
+1432r_state=self.random_state
+1433ifr_stateisNone:
+1434r_state=np.random.randint(0,1000)
+1435r_state+=i
+1436# Only in the beginning
+1437ifbeginning:
+1438X,y=self.__bootstraping(X,y,r_state)
+1439
+1440returnskclone(estimator).fit(X,y,**kwards)
+1441
+1442deffit(self,X,y,**kwards):
+1443"""Build a CoForest classifier from the training set (X, y).
+1444
+1445 Parameters
+1446 ----------
+1447 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1448 The training input samples.
+1449 y : array-like of shape (n_samples,)
+1450 The target values (class labels), -1 if unlabel.
+1451
+1452 Returns
+1453 -------
+1454 self: CoForest
+1455 Fitted estimator.
+1456 """
+1457random_state=check_random_state(self.random_state)
+1458n_jobs=check_n_jobs(self.n_jobs)
+1459
+1460X_label,y_label,X_unlabel=get_dataset(X,y)
+1461
+1462is_df=isinstance(X_label,pd.DataFrame)
+1463
+1464self.classes_=np.unique(y_label)
+1465
+1466self.hypotheses=[]
+1467errors=[]
+1468weights=[]
+1469foriinrange(self.n_estimators):
+1470self.hypotheses.append(skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]))
+1471if"random_state"indir(self.hypotheses[-1]):
+1472self.hypotheses[-1].set_params(random_state=random_state.randint(0,2**32-1))
+1473errors.append(0.5)
+1474
+1475self.hypotheses=Parallel(n_jobs=n_jobs)(
+1476delayed(self._fit_estimator)(X_label,y_label,i,beginning=True,**kwards)
+1477foriinrange(self.n_estimators)
+1478)
+1479
+1480foriinrange(self.n_estimators):
+1481# The paper stablishes that the weight of each hypothesis is 0,
+1482# but it is not possible to do that because it will be impossible increase the training set
+1483ifself.version=="1.0.2":
+1484weights.append(np.max(self.hypotheses[i].predict_proba(X_label),axis=1).sum())# Version 1.0.2
+1485else:
+1486weights.append(self.__confidence(i,X_label).sum())
+1487
+1488changing=TrueifX_unlabel.shape[0]>0elseFalse
+1489whilechanging:
+1490changing=False
+1491foriinrange(self.n_estimators):
+1492hi,ei,wi=self.hypotheses[i],errors[i],weights[i]
+1493
+1494ei_t=self.__estimate_error(hi,X_label,y_label,i)
+1495
+1496ifei_t<ei:
+1497random_index_subsample=list(range(X_unlabel.shape[0]))
+1498random_index_subsample=random_state.permutation(
+1499random_index_subsample
+1500)
+1501cond=random_index_subsample[0:int(safe_division(ei*wi,ei_t,self._epsilon))]
+1502ifis_df:
+1503Ui_t=X_unlabel.iloc[cond,:]
+1504else:
+1505Ui_t=X_unlabel[cond,:]
+1506
+1507raw_predictions=hi.predict_proba(Ui_t)
+1508predictions=np.max(raw_predictions,axis=1)
+1509class_predicted=self.classes_.take(np.argmax(raw_predictions,axis=1),axis=0)
+1510
+1511to_label=predictions>self.threshold
+1512wi_t=predictions[to_label].sum()
+1513
+1514ifei_t*wi_t<ei*wi:
+1515changing=True
+1516ifis_df:
+1517x_temp=pd.concat([X_label,Ui_t.iloc[to_label,:]])
+1518else:
+1519x_temp=np.concatenate((X_label,Ui_t[to_label]))
+1520y_temp=np.concatenate((y_label,class_predicted[to_label]))
+1521hi.fit(
+1522x_temp,
+1523y_temp,
+1524**kwards
+1525)
+1526
+1527errors[i]=ei_t
+1528weights[i]=wi_t
+1529
+1530self.h_=self.hypotheses
+1531self.columns_=[list(range(X.shape[1]))]*self.n_estimators
+1532
+1533returnself
+
+
+
+
CoForest classifier. Random Forest co-training

Ensemble method for co-training based on Random Forest.

The main process is:

1. Train a committee of classifiers using bootstrap.
2. While any base classifier is retrained:
    1. Predict the instances from the unlabeled set.
    2. Select the instances with the highest probability.
    3. Label the instances with the highest probability.
    4. Add the instances to the labeled set only if the error is not larger than the previous error.
    5. Retrain the classifier with the new instances.
3. Combine the probabilities of each classifier.

A usage sketch is given after the reference below.
Li, M., & Zhou, Z.-H. (2007).
Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples.
IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans,
37(6), 1088-1098. 10.1109/tsmca.2007.904745
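The example from the class docstring, reproduced here next to the rendered description:

```python
from sklearn.datasets import load_iris
from sslearn.wrapper import CoForest
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
# Unlabeled instances are marked with -1
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
coforest = CoForest()
coforest.fit(X, y)
coforest.score(X_unlabel, y_unlabel)
```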
1330def__init__(self,base_estimator=DecisionTreeClassifier(),n_estimators=7,threshold=0.75,bootstrap=True,n_jobs=None,random_state=None,version="1.0.3"):
+1331"""
+1332 Generate a CoForest classifier.
+1333 A SSL Random Forest adaption for CoTraining.
+1334
+1335 Parameters
+1336 ----------
+1337 base_estimator : ClassifierMixin, optional
+1338 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+1339 n_estimators : int, optional
+1340 The number of base estimators in the ensemble., by default 7
+1341 threshold : float, optional
+1342 The decision threshold. Should be in [0, 1)., by default 0.5
+1343 n_jobs : int, optional
+1344 The number of jobs to run in parallel for both fit and predict., by default None
+1345 bootstrap : bool, optional
+1346 Whether bootstrap samples are used when building estimators., by default True
+1347 random_state : int, RandomState instance, optional
+1348 controls the randomness of the estimator, by default None
+1349 **kwards : dict, optional
+1350 Additional parameters to be passed to base_estimator, by default None.
+1351 """
+1352self.base_estimator=check_classifier(base_estimator,collection_size=n_estimators)
+1353self.n_estimators=n_estimators
+1354self.threshold=threshold
+1355self.bootstrap=bootstrap
+1356self._epsilon=sys.float_info.epsilon
+1357self.n_jobs=n_jobs
+1358self.random_state=random_state
+1359self.version=version
+1360ifself.version=="1.0.2":
+1361warnings.warn("The version 1.0.2 is deprecated. Please use the version 1.0.3",DeprecationWarning)
+
+
+
+
Generate a CoForest classifier.
+A SSL Random Forest adaption for CoTraining.
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

n_estimators (int, optional):
The number of base estimators in the ensemble, by default 7

threshold (float, optional):
The decision threshold. Should be in [0, 1), by default 0.75

n_jobs (int, optional):
The number of jobs to run in parallel for both fit and predict, by default None

bootstrap (bool, optional):
Whether bootstrap samples are used when building estimators, by default True

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None

**kwards (dict, optional):
Additional parameters to be passed to base_estimator, by default None.
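For illustration, two hedged construction sketches. The second assumes, from the `check_classifier(base_estimator, collection_size=n_estimators)` call in the source, that a list with one classifier per committee member is also accepted; that reading is an assumption, not documented behaviour.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import CoForest

# A single prototype estimator, cloned for each of the n_estimators members
cf = CoForest(base_estimator=DecisionTreeClassifier(max_depth=5),
              n_estimators=7, threshold=0.75, random_state=0)

# Assumption: one estimator per member, so the list length must equal n_estimators
cf_mixed = CoForest(base_estimator=[DecisionTreeClassifier(), GaussianNB(), DecisionTreeClassifier(max_depth=3)],
                    n_estimators=3, random_state=0)
```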
+
+
+
+
+
+
+
+
+
def fit(self, X, y, **kwards):
+
+
+
+
+
+
1442deffit(self,X,y,**kwards):
+1443"""Build a CoForest classifier from the training set (X, y).
+1444
+1445 Parameters
+1446 ----------
+1447 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+1448 The training input samples.
+1449 y : array-like of shape (n_samples,)
+1450 The target values (class labels), -1 if unlabel.
+1451
+1452 Returns
+1453 -------
+1454 self: CoForest
+1455 Fitted estimator.
+1456 """
+1457random_state=check_random_state(self.random_state)
+1458n_jobs=check_n_jobs(self.n_jobs)
+1459
+1460X_label,y_label,X_unlabel=get_dataset(X,y)
+1461
+1462is_df=isinstance(X_label,pd.DataFrame)
+1463
+1464self.classes_=np.unique(y_label)
+1465
+1466self.hypotheses=[]
+1467errors=[]
+1468weights=[]
+1469foriinrange(self.n_estimators):
+1470self.hypotheses.append(skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]))
+1471if"random_state"indir(self.hypotheses[-1]):
+1472self.hypotheses[-1].set_params(random_state=random_state.randint(0,2**32-1))
+1473errors.append(0.5)
+1474
+1475self.hypotheses=Parallel(n_jobs=n_jobs)(
+1476delayed(self._fit_estimator)(X_label,y_label,i,beginning=True,**kwards)
+1477foriinrange(self.n_estimators)
+1478)
+1479
+1480foriinrange(self.n_estimators):
+1481# The paper stablishes that the weight of each hypothesis is 0,
+1482# but it is not possible to do that because it will be impossible increase the training set
+1483ifself.version=="1.0.2":
+1484weights.append(np.max(self.hypotheses[i].predict_proba(X_label),axis=1).sum())# Version 1.0.2
+1485else:
+1486weights.append(self.__confidence(i,X_label).sum())
+1487
+1488changing=TrueifX_unlabel.shape[0]>0elseFalse
+1489whilechanging:
+1490changing=False
+1491foriinrange(self.n_estimators):
+1492hi,ei,wi=self.hypotheses[i],errors[i],weights[i]
+1493
+1494ei_t=self.__estimate_error(hi,X_label,y_label,i)
+1495
+1496ifei_t<ei:
+1497random_index_subsample=list(range(X_unlabel.shape[0]))
+1498random_index_subsample=random_state.permutation(
+1499random_index_subsample
+1500)
+1501cond=random_index_subsample[0:int(safe_division(ei*wi,ei_t,self._epsilon))]
+1502ifis_df:
+1503Ui_t=X_unlabel.iloc[cond,:]
+1504else:
+1505Ui_t=X_unlabel[cond,:]
+1506
+1507raw_predictions=hi.predict_proba(Ui_t)
+1508predictions=np.max(raw_predictions,axis=1)
+1509class_predicted=self.classes_.take(np.argmax(raw_predictions,axis=1),axis=0)
+1510
+1511to_label=predictions>self.threshold
+1512wi_t=predictions[to_label].sum()
+1513
+1514ifei_t*wi_t<ei*wi:
+1515changing=True
+1516ifis_df:
+1517x_temp=pd.concat([X_label,Ui_t.iloc[to_label,:]])
+1518else:
+1519x_temp=np.concatenate((X_label,Ui_t[to_label]))
+1520y_temp=np.concatenate((y_label,class_predicted[to_label]))
+1521hi.fit(
+1522x_temp,
+1523y_temp,
+1524**kwards
+1525)
+1526
+1527errors[i]=ei_t
+1528weights[i]=wi_t
+1529
+1530self.h_=self.hypotheses
+1531self.columns_=[list(range(X.shape[1]))]*self.n_estimators
+1532
+1533returnself
+
+
+
+
Build a CoForest classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.
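A toy sketch of the `-1` convention for `y`, assuming an integer-encoded target (the mask construction here is hypothetical, not part of the library):

```python
import numpy as np
from sklearn.datasets import load_iris
from sslearn.wrapper import CoForest

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) < 0.7] = -1   # roughly 70% of the instances become unlabeled

CoForest(random_state=0).fit(X, y_semi)
```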
+
class TriTraining(sslearn.wrapper._co.BaseCoTraining):
+
+
+
+
+
+
25classTriTraining(BaseCoTraining):
+ 26"""
+ 27 **TriTraining. Trio of classifiers with bootstrapping.**
+ 28
+ 29 The main process is:
+ 30 1. Generate three classifiers using bootstrapping.
+ 31 2. Iterate until convergence:
+ 32 1. Calculate the error between two hypotheses.
+ 33 2. If the error is less than the previous error, generate a dataset with the instances where both hypotheses agree.
+ 34 3. Retrain the classifiers with the new dataset and the original labeled dataset.
+ 35 3. Combine the predictions of the three classifiers.
+ 36
+ 37 **Methods**
+ 38 -------
+ 39 - `fit`: Fit the model with the labeled instances.
+ 40 - `predict` : Predict the class for each instance.
+ 41 - `predict_proba`: Predict the probability for each class.
+ 42 - `score`: Return the mean accuracy on the given test data and labels.
+ 43
+ 44 **References**
+ 45 ----------
+ 46 Zhi-Hua Zhou and Ming Li,<br>
+ 47 Tri-training: exploiting unlabeled data using three classifiers,<br>
+ 48 in <i>IEEE Transactions on Knowledge and Data Engineering</i>,<br>
+ 49 vol. 17, no. 11, pp. 1529-1541, Nov. 2005,<br>
+ 50 [10.1109/TKDE.2005.186](https://doi.org/10.1109/TKDE.2005.186)
+ 51
+ 52 """
+ 53
+ 54def__init__(
+ 55self,
+ 56base_estimator=DecisionTreeClassifier(),
+ 57n_samples=None,
+ 58random_state=None,
+ 59n_jobs=None,
+ 60):
+ 61"""TriTraining. Trio of classifiers with bootstrapping.
+ 62
+ 63 Parameters
+ 64 ----------
+ 65 base_estimator : ClassifierMixin, optional
+ 66 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+ 67 n_samples : int, optional
+ 68 Number of samples to generate.
+ 69 If left to None this is automatically set to the first dimension of the arrays., by default None
+ 70 random_state : int, RandomState instance, optional
+ 71 controls the randomness of the estimator, by default None
+ 72 n_jobs : int, optional
+ 73 The number of jobs to run in parallel for both `fit` and `predict`.
+ 74 `None` means 1 unless in a :obj:`joblib.parallel_backend` context.
+ 75 `-1` means using all processors., by default None
+ 76
+ 77 """
+ 78self._N_LEARNER=3
+ 79self.base_estimator=check_classifier(base_estimator,collection_size=self._N_LEARNER)
+ 80self.n_samples=n_samples
+ 81self._epsilon=sys.float_info.epsilon
+ 82self.random_state=random_state
+ 83self.n_jobs=n_jobs
+ 84
+ 85deffit(self,X,y,**kwards):
+ 86"""Build a TriTraining classifier from the training set (X, y).
+ 87 Parameters
+ 88 ----------
+ 89 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 90 The training input samples.
+ 91 y : array-like of shape (n_samples,)
+ 92 The target values (class labels), -1 if unlabeled.
+ 93 Returns
+ 94 -------
+ 95 self : TriTraining
+ 96 Fitted estimator.
+ 97 """
+ 98random_state=check_random_state(self.random_state)
+ 99self.n_jobs=min(check_n_jobs(self.n_jobs),self._N_LEARNER)
+100
+101X_label,y_label,X_unlabel=get_dataset(X,y)
+102
+103is_df=isinstance(X_label,pd.DataFrame)
+104
+105hypotheses=[]
+106e_=[0.5]*self._N_LEARNER
+107l_=[0]*self._N_LEARNER
+108
+109# Get a random instance for each class to keep class index
+110self.classes_=np.unique(y_label)
+111classes=set(self.classes_)
+112instances=list()
+113labels=list()
+114iteration=zip(X_label,y_label)
+115ifis_df:
+116iteration=zip(X_label.values,y_label)
+117forx_,y_initeration:
+118ify_inclasses:
+119classes.remove(y_)
+120instances.append(x_)
+121labels.append(y_)
+122iflen(classes)==0:
+123break
+124
+125foriinrange(self._N_LEARNER):
+126X_sampled,y_sampled=resample(
+127X_label,
+128y_label,
+129replace=True,
+130n_samples=self.n_samples,
+131random_state=random_state,
+132)
+133
+134ifis_df:
+135X_sampled=pd.DataFrame(X_sampled,columns=X_label.columns)
+136X_sampled=pd.concat([pd.DataFrame(instances,columns=X_label.columns),X_sampled])
+137else:
+138X_sampled=np.concatenate((np.array(instances),X_sampled),axis=0)
+139y_sampled=np.concatenate((np.array(labels),y_sampled),axis=0)
+140
+141hypotheses.append(
+142skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(X_sampled,y_sampled,**kwards)
+143)
+144
+145something_has_changed=TrueifX_unlabel.size>0elseFalse
+146whilesomething_has_changed:
+147something_has_changed=False
+148L=[[]]*self._N_LEARNER
+149Ly=[[]]*self._N_LEARNER
+150e=[]
+151updates=[False]*3
+152
+153foriinrange(self._N_LEARNER):
+154hj,hk=TriTraining._another_hs(hypotheses,i)
+155e.append(
+156self._measure_error(X_label,y_label,hj,hk,self._epsilon)
+157)
+158ife_[i]<=e[i]:
+159continue
+160y_p=hj.predict(X_unlabel)
+161validx=y_p==hk.predict(X_unlabel)
+162L[i]=X_unlabel[validx]
+163Ly[i]=y_p[validx]
+164
+165ifl_[i]==0:
+166l_[i]=math.floor(
+167safe_division(e[i],(e_[i]-e[i]),self._epsilon)+1
+168)
+169ifl_[i]>=len(L[i]):
+170continue
+171ife[i]*len(L[i])<e_[i]*l_[i]:
+172updates[i]=True
+173elifl_[i]>safe_division(e[i],e_[i]-e[i],self._epsilon):
+174L[i],Ly[i]=TriTraining._subsample(
+175(L[i],Ly[i]),
+176math.ceil(
+177safe_division(e_[i]*l_[i],e[i],self._epsilon)-1
+178),
+179random_state,
+180)
+181ifis_df:
+182L[i]=pd.DataFrame(L[i],columns=X_label.columns)
+183updates[i]=True
+184
+185hypotheses=Parallel(n_jobs=self.n_jobs)(
+186delayed(self._fit_estimator)(
+187hypotheses[i],X_label,y_label,L[i],Ly[i],updates[i],**kwards
+188)
+189foriinrange(self._N_LEARNER)
+190)
+191
+192foriinrange(self._N_LEARNER):
+193ifupdates[i]:
+194e_[i]=e[i]
+195l_[i]=len(L[i])
+196something_has_changed=True
+197
+198self.h_=hypotheses
+199self.columns_=[list(range(X.shape[1]))]*self._N_LEARNER
+200
+201returnself
+202
+203def_fit_estimator(self,hyp,X_label,y_label,L,Ly,update,**kwards):
+204ifupdate:
+205ifisinstance(L,pd.DataFrame):
+206_tempL=pd.concat([X_label,L])
+207else:
+208_tempL=np.concatenate((X_label,L))
+209_tempY=np.concatenate((y_label,Ly))
+210
+211returnhyp.fit(_tempL,_tempY,**kwards)
+212returnhyp
+213
+214@staticmethod
+215def_another_hs(hs,index):
+216"""Get the other hypotheses
+217 Parameters
+218 ----------
+219 hs : list
+220 hypotheses collection
+221 index : int
+222 base hypothesis index
+223 Returns
+224 -------
+225 classifiers: list
+226 Collection of other hypotheses
+227 """
+228another_hs=[]
+229foriinrange(len(hs)):
+230ifi!=index:
+231another_hs.append(hs[i])
+232returnanother_hs
+233
+234@staticmethod
+235def_subsample(L,s,random_state=None):
+236"""Randomly removes |L| - s number of examples from L
+237 Parameters
+238 ----------
+239 L : tuple of array-like
+240 Collection pseudo-labeled candidates and its labels
+241 s : int
+242 Equation 10 in paper
+243 random_state : int, RandomState instance, optional
+244 controls the randomness of the estimator, by default None
+245 Returns
+246 -------
+247 subsamples: tuple
+248 Collection of pseudo-labeled selected for enlarged labeled examples.
+249 """
+250to_remove=len(L[0])-s
+251select=len(L[0])-to_remove
+252
+253returnresample(*L,replace=False,n_samples=select,random_state=random_state)
+254
+255def_measure_error(
+256self,X,y,h1:ClassifierMixin,h2:ClassifierMixin,epsilon=sys.float_info.epsilon,**kwards
+257):
+258"""Calculate the error between two hypotheses
+259 Parameters
+260 ----------
+261 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+262 The training labeled input samples.
+263 y : array-like of shape (n_samples,)
+264 The target values (class labels).
+265 h1 : ClassifierMixin
+266 First hypothesis
+267 h2 : ClassifierMixin
+268 Second hypothesis
+269 epsilon : float
+270 A small number to avoid division by zero
+271 Returns
+272 -------
+273 error : float
+274 Division of the number of labeled examples on which both h1 and h2 make incorrect classification,
+275 by the number of labeled examples on which the classification made by h1 is the same as that made by h2.
+276 """
+277y1=h1.predict(X)
+278y2=h2.predict(X)
+279
+280error=np.count_nonzero(np.logical_and(y1==y2,y2!=y))
+281coincidence=np.count_nonzero(y1==y2)
+282returnsafe_division(error,coincidence,epsilon)
+
+
+
+
TriTraining. Trio of classifiers with bootstrapping.

The main process is:
1. Generate three classifiers using bootstrapping.
2. Iterate until convergence:
    1. Calculate the error between two hypotheses.
    2. If the error is less than the previous error, generate a dataset with the instances where both hypotheses agree.
    3. Retrain the classifiers with the new dataset and the original labeled dataset.
3. Combine the predictions of the three classifiers.

Methods
- fit: Fit the model with the labeled instances.
- predict: Predict the class for each instance.
- predict_proba: Predict the probability for each class.
- score: Return the mean accuracy on the given test data and labels.

References
Zhi-Hua Zhou and Ming Li,
Tri-training: exploiting unlabeled data using three classifiers,
in IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 11, pp. 1529-1541, Nov. 2005,
[10.1109/TKDE.2005.186](https://doi.org/10.1109/TKDE.2005.186)
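A minimal usage sketch, mirroring the example style used for the other wrappers in these docs (illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import TriTraining
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)

tri = TriTraining(base_estimator=DecisionTreeClassifier(), random_state=0)
tri.fit(X, y)
tri.score(X_unlabel, y_unlabel)
```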
54def__init__(
+55self,
+56base_estimator=DecisionTreeClassifier(),
+57n_samples=None,
+58random_state=None,
+59n_jobs=None,
+60):
+61"""TriTraining. Trio of classifiers with bootstrapping.
+62
+63 Parameters
+64 ----------
+65 base_estimator : ClassifierMixin, optional
+66 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+67 n_samples : int, optional
+68 Number of samples to generate.
+69 If left to None this is automatically set to the first dimension of the arrays., by default None
+70 random_state : int, RandomState instance, optional
+71 controls the randomness of the estimator, by default None
+72 n_jobs : int, optional
+73 The number of jobs to run in parallel for both `fit` and `predict`.
+74 `None` means 1 unless in a :obj:`joblib.parallel_backend` context.
+75 `-1` means using all processors., by default None
+76
+77 """
+78self._N_LEARNER=3
+79self.base_estimator=check_classifier(base_estimator,collection_size=self._N_LEARNER)
+80self.n_samples=n_samples
+81self._epsilon=sys.float_info.epsilon
+82self.random_state=random_state
+83self.n_jobs=n_jobs
+
+
+
+
TriTraining. Trio of classifiers with bootstrapping.
+
+
Parameters
+
+
+
base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

n_samples (int, optional):
Number of samples to generate.
If left to None this is automatically set to the first dimension of the arrays, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None

n_jobs (int, optional):
The number of jobs to run in parallel for both fit and predict.
None means 1 unless in a joblib.parallel_backend context.
-1 means using all processors, by default None
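Since the source passes `collection_size=self._N_LEARNER` (three) to `check_classifier`, `base_estimator` appears to accept a list of three different classifiers as well; a sketch under that assumption:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import TriTraining

# Assumption: exactly one estimator per learner, three in total
tri = TriTraining(base_estimator=[DecisionTreeClassifier(),
                                  GaussianNB(),
                                  KNeighborsClassifier(n_neighbors=3)],
                  random_state=0)
```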
+
+
+
+
+
+
+
+
+
def fit(self, X, y, **kwards):
+
+
+
+
+
+
85deffit(self,X,y,**kwards):
+ 86"""Build a TriTraining classifier from the training set (X, y).
+ 87 Parameters
+ 88 ----------
+ 89 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ 90 The training input samples.
+ 91 y : array-like of shape (n_samples,)
+ 92 The target values (class labels), -1 if unlabeled.
+ 93 Returns
+ 94 -------
+ 95 self : TriTraining
+ 96 Fitted estimator.
+ 97 """
+ 98random_state=check_random_state(self.random_state)
+ 99self.n_jobs=min(check_n_jobs(self.n_jobs),self._N_LEARNER)
+100
+101X_label,y_label,X_unlabel=get_dataset(X,y)
+102
+103is_df=isinstance(X_label,pd.DataFrame)
+104
+105hypotheses=[]
+106e_=[0.5]*self._N_LEARNER
+107l_=[0]*self._N_LEARNER
+108
+109# Get a random instance for each class to keep class index
+110self.classes_=np.unique(y_label)
+111classes=set(self.classes_)
+112instances=list()
+113labels=list()
+114iteration=zip(X_label,y_label)
+115ifis_df:
+116iteration=zip(X_label.values,y_label)
+117forx_,y_initeration:
+118ify_inclasses:
+119classes.remove(y_)
+120instances.append(x_)
+121labels.append(y_)
+122iflen(classes)==0:
+123break
+124
+125foriinrange(self._N_LEARNER):
+126X_sampled,y_sampled=resample(
+127X_label,
+128y_label,
+129replace=True,
+130n_samples=self.n_samples,
+131random_state=random_state,
+132)
+133
+134ifis_df:
+135X_sampled=pd.DataFrame(X_sampled,columns=X_label.columns)
+136X_sampled=pd.concat([pd.DataFrame(instances,columns=X_label.columns),X_sampled])
+137else:
+138X_sampled=np.concatenate((np.array(instances),X_sampled),axis=0)
+139y_sampled=np.concatenate((np.array(labels),y_sampled),axis=0)
+140
+141hypotheses.append(
+142skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(X_sampled,y_sampled,**kwards)
+143)
+144
+145something_has_changed=TrueifX_unlabel.size>0elseFalse
+146whilesomething_has_changed:
+147something_has_changed=False
+148L=[[]]*self._N_LEARNER
+149Ly=[[]]*self._N_LEARNER
+150e=[]
+151updates=[False]*3
+152
+153foriinrange(self._N_LEARNER):
+154hj,hk=TriTraining._another_hs(hypotheses,i)
+155e.append(
+156self._measure_error(X_label,y_label,hj,hk,self._epsilon)
+157)
+158ife_[i]<=e[i]:
+159continue
+160y_p=hj.predict(X_unlabel)
+161validx=y_p==hk.predict(X_unlabel)
+162L[i]=X_unlabel[validx]
+163Ly[i]=y_p[validx]
+164
+165ifl_[i]==0:
+166l_[i]=math.floor(
+167safe_division(e[i],(e_[i]-e[i]),self._epsilon)+1
+168)
+169ifl_[i]>=len(L[i]):
+170continue
+171ife[i]*len(L[i])<e_[i]*l_[i]:
+172updates[i]=True
+173elifl_[i]>safe_division(e[i],e_[i]-e[i],self._epsilon):
+174L[i],Ly[i]=TriTraining._subsample(
+175(L[i],Ly[i]),
+176math.ceil(
+177safe_division(e_[i]*l_[i],e[i],self._epsilon)-1
+178),
+179random_state,
+180)
+181ifis_df:
+182L[i]=pd.DataFrame(L[i],columns=X_label.columns)
+183updates[i]=True
+184
+185hypotheses=Parallel(n_jobs=self.n_jobs)(
+186delayed(self._fit_estimator)(
+187hypotheses[i],X_label,y_label,L[i],Ly[i],updates[i],**kwards
+188)
+189foriinrange(self._N_LEARNER)
+190)
+191
+192foriinrange(self._N_LEARNER):
+193ifupdates[i]:
+194e_[i]=e[i]
+195l_[i]=len(L[i])
+196something_has_changed=True
+197
+198self.h_=hypotheses
+199self.columns_=[list(range(X.shape[1]))]*self._N_LEARNER
+200
+201returnself
+
+
+
+
Build a TriTraining classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
+The target values (class labels), -1 if unlabeled.
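For intuition, a small NumPy sketch of the agreement-based error estimate that drives the retraining decision; it mirrors the `_measure_error` helper shown in the source above, with made-up toy arrays and `safe_division` replaced by a plain guard.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2])
y_hj   = np.array([0, 1, 1, 1, 2, 2])   # predictions of hypothesis hj
y_hk   = np.array([0, 1, 1, 0, 2, 2])   # predictions of hypothesis hk

agree = y_hj == y_hk                                     # where both hypotheses agree (5 instances)
wrong_together = np.logical_and(agree, y_hj != y_true)   # joint mistakes on those instances (2)
error = wrong_together.sum() / max(agree.sum(), 1)       # 2 / 5 = 0.4
```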
557classDeTriTraining(TriTraining):
+558"""
+559 **TriTraining with Data Editing.**
+560
+561 It is a variation of the TriTraining, the main difference is that the instances are depurated in each iteration.
+562 It means that the instances with their neighbors that have the same class are kept, the rest are removed.
+563 At the end of the iterations, the instances are clustered and the class is assigned to the cluster centroid.
+564
+565 **Methods**
+566 -------
+567 - `fit`: Fit the model with the labeled instances.
+568 - `predict` : Predict the class for each instance.
+569 - `predict_proba`: Predict the probability for each class.
+570 - `score`: Return the mean accuracy on the given test data and labels.
+571
+572 **References**
+573 ----------
+574 Deng C., Guo M.Z. (2006)<br>
+575 Tri-training and Data Editing Based Semi-supervised Clustering Algorithm, <br>
+576 in <i>Gelbukh A., Reyes-Garcia C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006.</i><br>
+577 Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg.<br>
+578 [10.1007/11925231_61](https://doi.org/10.1007/11925231_61)
+579 """
+580
+581def__init__(self,base_estimator=DecisionTreeClassifier(),k_neighbors=3,
+582n_samples=None,mode="seeded",max_iterations=100,n_jobs=None,random_state=None):
+583"""
+584 DeTriTraining - TriTraining with Depurated and Clustering.
+585 Avoid the noise generated by the TriTraining algorithm by depurating the enlarged dataset and clustering the instances.
+586
+587 Parameters
+588 ----------
+589 base_estimator : ClassifierMixin, optional
+590 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+591 n_samples : int, optional
+592 Number of samples to generate.
+593 If left to None this is automatically set to the first dimension of the arrays., by default None
+594 k_neighbors : int, optional
+595 Number of neighbors for depurate classification.
+596 If at least k_neighbors/2+1 have a class other than the one predicted, the class is ignored., by default 3
+597 mode : string, optional
+598 How to calculate the cluster each instance belongs to.
+599 If `seeded` each instance belong to nearest cluster.
+600 If `constrained` each instance belong to nearest cluster unless the instance is in to enlarged dataset,
+601 then the instance belongs to the cluster of its class., by default `seeded`
+602 max_iterations : int, optional
+603 Maximum number of iterations, by default 100
+604 n_jobs : int, optional
+605 The number of parallel jobs to run for neighbors search.
+606 None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
+607 Doesn't affect fit method., by default None
+608 random_state : int, RandomState instance, optional
+609 controls the randomness of the estimator, by default None
+610 """
+611super().__init__(base_estimator,n_samples,random_state)
+612self.k_neighbors=k_neighbors
+613self.mode=mode
+614self.max_iterations=max_iterations
+615self.n_jobs=n_jobs
+616ifmode!="seeded"andmode!="constrained":
+617raiseAttributeError("`mode` must be \"seeded\" or \"constrained\".")
+618
+619def_depure(self,S):
+620"""Depure the S dataset
+621
+622 Parameters
+623 ----------
+624 S : tuple (X, y)
+625 Enlarged dataset
+626
+627 Returns
+628 -------
+629 tuple : (X, y)
+630 Enlarged dataset with instances where at least k_neighbors/2+1 have the same class.
+631 """
+632init=time.time()
+633knn=KNeighborsClassifier(n_neighbors=self.k_neighbors,n_jobs=self.n_jobs)
+634valid=knn.fit(*S).predict(S[0])==S[1]
+635print(f"Depure time: {time.time()-init}")
+636returnS[0][valid],S[1][valid]
+637
+638def_clustering(self,S,X):
+639"""Clustering phase of the fitting
+640
+641 Parameters
+642 ----------
+643 S : tuple (X, y)
+644 Enlarged dataset
+645 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+646 Complete dataset, only features
+647
+648 Returns
+649 -------
+650 y: array-like of shape (n_samples,)
+651 class predicted for each instance
+652 """
+653centroids=dict()
+654clusters=set(S[1])
+655
+656# uses as numpy
+657ifisinstance(X,pd.DataFrame):
+658X=X.to_numpy()
+659ifisinstance(S[0],pd.DataFrame):
+660S=(S[0].to_numpy(),S[1])
+661
+662forkinclusters:
+663centroids[k]=np.mean(S[0][S[1]==k],axis=0)
+664
+665defseeded(X):
+666# For each instance, calculate the distance to each centroid
+667distances=np.linalg.norm(X[:,None,:]-np.array(list(centroids.values())),axis=2)
+668# Get the index of the nearest centroid
+669returnnp.argmin(distances,axis=1)
+670
+671defconstrained(X):
+672# Calculate the distances to centroids using broadcasting
+673distances=np.linalg.norm(X[:,None,:]-np.array(list(centroids.values())),axis=2)
+674# Get the index of the nearest centroid
+675nearest=np.argmin(distances,axis=1)
+676# Create a mask to find instances in X that belong to S[0]
+677mask=(S[0]==X[:,None])
+678# Find the row and column indices where all elements are True
+679i,j=np.where(mask.all(axis=2))
+680# Initialize cluster with -1
+681cluster=np.full(X.shape[0],-1,dtype=int)
+682# Update cluster for the instances found in S[0]
+683cluster[i]=S[1][j]
+684# Update cluster for instances not found in S[0]
+685cluster[cluster==-1]=nearest[cluster==-1]
+686
+687returncluster
+688
+689ifself.mode=="seeded":
+690op=seeded
+691elifself.mode=="constrained":
+692op=constrained
+693
+694changes=True
+695iterations=0
+696whilechangesanditerations<self.max_iterations:
+697changes=False
+698iterations+=1
+699# Need to vectorize
+700new_clusters=op(X)
+701new_centroids=dict()
+702forkinclusters:
+703ifnp.any(new_clusters==k):
+704new_centroids[k]=np.mean(X[new_clusters==k],axis=0)
+705ifnotnp.array_equal(new_centroids[k],centroids[k]):
+706changes=True
+707centroids=new_centroids
+708
+709returnnew_clusters
+710
+711deffit(self,X,y,**kwards):
+712"""Build a DeTriTraining classifier from the training set (X, y).
+713
+714 Parameters
+715 ----------
+716 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+717 The training input samples.
+718 y : array-like of shape (n_samples,)
+719 The target values (class labels), -1 if unlabel.
+720
+721 Returns
+722 -------
+723 self: DeTriTraining
+724 Fitted estimator.
+725 """
+726X_label,y_label,X_unlabel=get_dataset(X,y)
+727
+728self.label_encoder_=LabelEncoder()
+729self.label_encoder_.fit(y_label)
+730y_label=self.label_encoder_.transform(y_label)
+731
+732is_df=isinstance(X_label,pd.DataFrame)
+733
+734self.classes_=np.unique(y_label)
+735
+736classes=set(self.classes_)
+737instances=list()
+738labels=list()
+739iteration=zip(X_label,y_label)
+740ifis_df:
+741iteration=zip(X_label.values,y_label)
+742forx_,y_initeration:
+743ify_inclasses:
+744classes.remove(y_)
+745instances.append(x_)
+746labels.append(y_)
+747iflen(classes)==0:
+748break
+749
+750S_=[]
+751hypothesis=[]
+752foriinrange(self._N_LEARNER):
+753X_sampled,y_sampled= \
+754resample(X_label,y_label,replace=True,
+755n_samples=self.n_samples,
+756random_state=self.random_state)
+757ifis_df:
+758X_sampled=pd.DataFrame(X_sampled,columns=X_label.columns)
+759hypothesis.append(
+760skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(
+761X_sampled,y_sampled,**kwards)
+762)
+763
+764# Keep class order
+765ifnotis_df:
+766X_sampled=np.concatenate((np.array(instances),X_sampled),axis=0)
+767else:
+768X_sampled=pd.concat([pd.DataFrame(instances,columns=X_label.columns),X_sampled],axis=0)
+769
+770y_sampled=np.concatenate((np.array(labels),y_sampled),axis=0)
+771
+772S_.append((X_sampled,y_sampled))
+773
+774changes=True
+775last_addition=[0]*self._N_LEARNER
+776it=0ifX_unlabel.shape[0]>0elseself.max_iterations
+777whileit<self.max_iterations:
+778it+=1
+779changes=False
+780
+781# Enlarged
+782L=[[]]*self._N_LEARNER
+783
+784foriinrange(self._N_LEARNER):
+785hj,hk=TriTraining._another_hs(hypothesis,i)
+786y_p=hj.predict(X_unlabel)
+787validx=y_p==hk.predict(X_unlabel)
+788L[i]=(X_unlabel[validx]ifnotis_dfelseX_unlabel.iloc[validx,:],y_p[validx])
+789
+790fori,_inenumerate(L):
+791
+792iflen(L[i][0])>0:
+793S_[i]=np.concatenate((X_label,L[i][0]))ifnotis_dfelsepd.concat([X_label,L[i][0]]),np.concatenate((y_label,L[i][1]))
+794S_[i]=self._depure(S_[i])
+795
+796foriinrange(self._N_LEARNER):
+797iflen(S_[i][0])>len(X_label):
+798last_addition[i]=len(S_[i][0])
+799changes=True
+800hypothesis[i].fit(*S_[i],**kwards)
+801
+802ifnotchanges:
+803break
+804else:
+805warn.warn("Maximum number of iterations reached before convergence. Consider increasing max_iter to improve the fit.",ConvergenceWarning)
+806
+807S=np.concatenate([x[0]forxinS_])ifnotis_dfelsepd.concat([x[0]forxinS_]),np.concatenate([x[1]forxinS_])
+808S_0,index_=np.unique(S[0],axis=0,return_index=True)
+809S_1=S[1][index_]
+810S=S_0,S_1
+811S=self._depure(S)# Change, is S - L (only new)
+812
+813new_y=self._clustering(S,X)
+814
+815self.h_=[skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(X,new_y,**kwards)foriinrange(self._N_LEARNER)]
+816self.columns_=[list(range(X.shape[1]))]
+817
+818returnself
+
+
+
+
TriTraining with Data Editing.

It is a variation of TriTraining; the main difference is that the instances are depurated (edited) in each iteration:
only the instances whose neighbors share the same class are kept, the rest are removed.
At the end of the iterations, the instances are clustered and each instance is assigned the class of its cluster.
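The editing (depuration) step keeps only the instances whose k nearest neighbours vote for their current label; a standalone sketch of that idea, mirroring the `_depure` method in the source above (the iris data is just a stand-in for the enlarged dataset):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_enlarged, y_enlarged = load_iris(return_X_y=True)   # stand-in for the enlarged (pseudo-labeled) set

knn = KNeighborsClassifier(n_neighbors=3)
keep = knn.fit(X_enlarged, y_enlarged).predict(X_enlarged) == y_enlarged
X_edited, y_edited = X_enlarged[keep], y_enlarged[keep]   # only neighbourhood-consistent instances survive
```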
581def__init__(self,base_estimator=DecisionTreeClassifier(),k_neighbors=3,
+582n_samples=None,mode="seeded",max_iterations=100,n_jobs=None,random_state=None):
+583"""
+584 DeTriTraining - TriTraining with Depurated and Clustering.
+585 Avoid the noise generated by the TriTraining algorithm by depurating the enlarged dataset and clustering the instances.
+586
+587 Parameters
+588 ----------
+589 base_estimator : ClassifierMixin, optional
+590 An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()
+591 n_samples : int, optional
+592 Number of samples to generate.
+593 If left to None this is automatically set to the first dimension of the arrays., by default None
+594 k_neighbors : int, optional
+595 Number of neighbors for depurate classification.
+596 If at least k_neighbors/2+1 have a class other than the one predicted, the class is ignored., by default 3
+597 mode : string, optional
+598 How to calculate the cluster each instance belongs to.
+599 If `seeded` each instance belong to nearest cluster.
+600 If `constrained` each instance belong to nearest cluster unless the instance is in to enlarged dataset,
+601 then the instance belongs to the cluster of its class., by default `seeded`
+602 max_iterations : int, optional
+603 Maximum number of iterations, by default 100
+604 n_jobs : int, optional
+605 The number of parallel jobs to run for neighbors search.
+606 None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
+607 Doesn't affect fit method., by default None
+608 random_state : int, RandomState instance, optional
+609 controls the randomness of the estimator, by default None
+610 """
+611super().__init__(base_estimator,n_samples,random_state)
+612self.k_neighbors=k_neighbors
+613self.mode=mode
+614self.max_iterations=max_iterations
+615self.n_jobs=n_jobs
+616ifmode!="seeded"andmode!="constrained":
+617raiseAttributeError("`mode` must be \"seeded\" or \"constrained\".")
+
+
+
+
DeTriTraining - TriTraining with Depuration and Clustering.
Avoids the noise generated by the TriTraining algorithm by depurating the enlarged dataset and clustering the instances.

Parameters

base_estimator (ClassifierMixin, optional):
An estimator object implementing fit and predict_proba, by default DecisionTreeClassifier()

n_samples (int, optional):
Number of samples to generate.
If left to None this is automatically set to the first dimension of the arrays, by default None

k_neighbors (int, optional):
Number of neighbors used in the depuration step.
If at least k_neighbors/2+1 of them have a class other than the one predicted, the instance is discarded, by default 3

mode (string, optional):
How to calculate the cluster each instance belongs to.
If "seeded", each instance belongs to the nearest cluster.
If "constrained", each instance belongs to the nearest cluster unless the instance is in the enlarged dataset,
in which case it belongs to the cluster of its class, by default "seeded"

max_iterations (int, optional):
Maximum number of iterations, by default 100

n_jobs (int, optional):
The number of parallel jobs to run for the neighbors search.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
Doesn't affect the fit method, by default None

random_state (int, RandomState instance, optional):
Controls the randomness of the estimator, by default None
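A brief construction sketch of the two clustering modes, with parameter names taken from the signature above:

```python
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import DeTriTraining

# "seeded": every instance joins its nearest centroid
dt_seeded = DeTriTraining(DecisionTreeClassifier(), k_neighbors=3, mode="seeded", random_state=0)

# "constrained": instances in the enlarged dataset keep the cluster of their class
dt_constrained = DeTriTraining(DecisionTreeClassifier(), k_neighbors=5, mode="constrained", random_state=0)
```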
+
+
+
+
+
+
+
+
+
def fit(self, X, y, **kwards):
+
+
+
+
+
+
711deffit(self,X,y,**kwards):
+712"""Build a DeTriTraining classifier from the training set (X, y).
+713
+714 Parameters
+715 ----------
+716 X : {array-like, sparse matrix} of shape (n_samples, n_features)
+717 The training input samples.
+718 y : array-like of shape (n_samples,)
+719 The target values (class labels), -1 if unlabel.
+720
+721 Returns
+722 -------
+723 self: DeTriTraining
+724 Fitted estimator.
+725 """
+726X_label,y_label,X_unlabel=get_dataset(X,y)
+727
+728self.label_encoder_=LabelEncoder()
+729self.label_encoder_.fit(y_label)
+730y_label=self.label_encoder_.transform(y_label)
+731
+732is_df=isinstance(X_label,pd.DataFrame)
+733
+734self.classes_=np.unique(y_label)
+735
+736classes=set(self.classes_)
+737instances=list()
+738labels=list()
+739iteration=zip(X_label,y_label)
+740ifis_df:
+741iteration=zip(X_label.values,y_label)
+742forx_,y_initeration:
+743ify_inclasses:
+744classes.remove(y_)
+745instances.append(x_)
+746labels.append(y_)
+747iflen(classes)==0:
+748break
+749
+750S_=[]
+751hypothesis=[]
+752foriinrange(self._N_LEARNER):
+753X_sampled,y_sampled= \
+754resample(X_label,y_label,replace=True,
+755n_samples=self.n_samples,
+756random_state=self.random_state)
+757ifis_df:
+758X_sampled=pd.DataFrame(X_sampled,columns=X_label.columns)
+759hypothesis.append(
+760skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(
+761X_sampled,y_sampled,**kwards)
+762)
+763
+764# Keep class order
+765ifnotis_df:
+766X_sampled=np.concatenate((np.array(instances),X_sampled),axis=0)
+767else:
+768X_sampled=pd.concat([pd.DataFrame(instances,columns=X_label.columns),X_sampled],axis=0)
+769
+770y_sampled=np.concatenate((np.array(labels),y_sampled),axis=0)
+771
+772S_.append((X_sampled,y_sampled))
+773
+774changes=True
+775last_addition=[0]*self._N_LEARNER
+776it=0ifX_unlabel.shape[0]>0elseself.max_iterations
+777whileit<self.max_iterations:
+778it+=1
+779changes=False
+780
+781# Enlarged
+782L=[[]]*self._N_LEARNER
+783
+784foriinrange(self._N_LEARNER):
+785hj,hk=TriTraining._another_hs(hypothesis,i)
+786y_p=hj.predict(X_unlabel)
+787validx=y_p==hk.predict(X_unlabel)
+788L[i]=(X_unlabel[validx]ifnotis_dfelseX_unlabel.iloc[validx,:],y_p[validx])
+789
+790fori,_inenumerate(L):
+791
+792iflen(L[i][0])>0:
+793S_[i]=np.concatenate((X_label,L[i][0]))ifnotis_dfelsepd.concat([X_label,L[i][0]]),np.concatenate((y_label,L[i][1]))
+794S_[i]=self._depure(S_[i])
+795
+796foriinrange(self._N_LEARNER):
+797iflen(S_[i][0])>len(X_label):
+798last_addition[i]=len(S_[i][0])
+799changes=True
+800hypothesis[i].fit(*S_[i],**kwards)
+801
+802ifnotchanges:
+803break
+804else:
+805warn.warn("Maximum number of iterations reached before convergence. Consider increasing max_iter to improve the fit.",ConvergenceWarning)
+806
+807S=np.concatenate([x[0]forxinS_])ifnotis_dfelsepd.concat([x[0]forxinS_]),np.concatenate([x[1]forxinS_])
+808S_0,index_=np.unique(S[0],axis=0,return_index=True)
+809S_1=S[1][index_]
+810S=S_0,S_1
+811S=self._depure(S)# Change, is S - L (only new)
+812
+813new_y=self._clustering(S,X)
+814
+815self.h_=[skclone(self.base_estimatoriftype(self.base_estimator)isnotlistelseself.base_estimator[i]).fit(X,new_y,**kwards)foriinrange(self._N_LEARNER)]
+816self.columns_=[list(range(X.shape[1]))]
+817
+818returnself
+
+
+
+
Build a DeTriTraining classifier from the training set (X, y).
+
+
Parameters
+
+
+
X ({array-like, sparse matrix} of shape (n_samples, n_features)):
+The training input samples.
+
y (array-like of shape (n_samples,)):
The target values (class labels), -1 if unlabeled.
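And a minimal end-to-end sketch in the style of the other wrapper examples (illustrative only; `max_iterations` is lowered just to keep the toy run short):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sslearn.wrapper import DeTriTraining
from sslearn.model_selection import artificial_ssl_dataset

X, y = load_iris(return_X_y=True)
X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)

dtt = DeTriTraining(DecisionTreeClassifier(), max_iterations=10, random_state=0)
dtt.fit(X, y)
dtt.score(X_unlabel, y_unlabel)
```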
+
+
+
+
\ No newline at end of file
diff --git a/docs/sslearn_mini.svg b/docs/sslearn_mini.svg
new file mode 100644
index 0000000..f8b0b13
--- /dev/null
+++ b/docs/sslearn_mini.svg
@@ -0,0 +1,101 @@
+
+
+
+
diff --git a/docs/sslearn_mini.webp b/docs/sslearn_mini.webp
new file mode 100644
index 0000000..8539af9
Binary files /dev/null and b/docs/sslearn_mini.webp differ
diff --git a/setup.py b/setup.py
index 6182b0b..e483cc0 100644
--- a/setup.py
+++ b/setup.py
@@ -12,7 +12,7 @@ def get_version():
version = get_version()
-url = f"https://github.com/jlgarridol/sslearn/archive/refs/tags/f{version}.tar.gz"
+url = f"https://github.com/jlgarridol/sslearn/archive/refs/tags/{version}.tar.gz"
setuptools.setup(
name='sslearn',
@@ -25,12 +25,12 @@ def get_version():
url='https://github.com/jlgarridol/sslearn',
license='new BSD',
download_url=url,
- install_requires=["joblib==1.2.0",
- "numpy==1.23.3",
- "pandas==1.4.3",
- "scikit_learn==1.2.0",
- "scipy==1.9.3",
- "statsmodels==0.13.2"],
+ install_requires=["joblib>=1.2.0",
+ "numpy>=1.23.3",
+ "pandas>=1.4.3",
+ "scikit_learn>=1.2.0",
+ "scipy>=1.10.1",
+ "statsmodels>=0.13.2"],
packages=setuptools.find_packages(exclude=("tests", "experiments")),
include_package_data=True,
classifiers=[
@@ -42,5 +42,7 @@ def get_version():
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
+ 'Programming Language :: Python :: 3.11',
+ 'Programming Language :: Python :: 3.12',
]
)
diff --git a/sitemap.xml b/sitemap.xml
new file mode 100644
index 0000000..6e7fb7f
--- /dev/null
+++ b/sitemap.xml
@@ -0,0 +1,14 @@
+
+
+https://pdoc.dev/docs/
+https://pdoc.dev/docs/sslearn.html
+https://pdoc.dev/docs/sslearn/subview.html
+https://pdoc.dev/docs/sslearn/model_selection.html
+https://pdoc.dev/docs/sslearn/base.html
+https://pdoc.dev/docs/sslearn/datasets.html
+https://pdoc.dev/docs/sslearn/wrapper.html
+https://pdoc.dev/docs/sslearn/restricted.html
+https://pdoc.dev/docs/sslearn/utils.html
+
\ No newline at end of file
diff --git a/sslearn/__init__.py b/sslearn/__init__.py
index 6b0785a..f6bbaae 100644
--- a/sslearn/__init__.py
+++ b/sslearn/__init__.py
@@ -1,4 +1,18 @@
-__version__='1.0.4'
+# Open README.md and added to __doc__ for
+import os
+if os.path.exists("../README.md"):
+ with open("../README.md", "r") as f:
+ __doc__ = f.read()
+elif os.path.exists("README.md"):
+ with open("README.md", "r") as f:
+ __doc__ = f.read()
+else:
+ __doc__ = "Semi-Supervised Learning (SSL) is a Python package that provides tools to train and evaluate semi-supervised learning models."
+
+
+__version__='1.0.4.1'
__AUTHOR__="José Luis Garrido-Labrador" # Author of the package
__AUTHOR_EMAIL__="jlgarrido@ubu.es" # Author's email
__URL__="https://pypi.org/project/sslearn/"
+
+
diff --git a/sslearn/base.py b/sslearn/base.py
index 158010d..58c4323 100644
--- a/sslearn/base.py
+++ b/sslearn/base.py
@@ -1,3 +1,23 @@
+"""
+Summary of module `sslearn.base`:
+
+## Functions
+
+
+get_dataset(X, y):
+ Check and divide dataset between labeled and unlabeled data.
+
+## Classes
+
+
+[FakedProbaClassifier](#FakedProbaClassifier):
+> Create a classifier that fakes predict_proba method if it does not exist.
+
+[OneVsRestSSLClassifier](#OneVsRestSSLClassifier):
+> Adapted OneVsRestClassifier for SSL datasets
+
+"""
+
import array
import warnings
from abc import ABC, abstractmethod
@@ -19,7 +39,29 @@
from sklearn.ensemble._base import _set_random_states
from sklearn.utils import check_random_state
+__all__ = ["get_dataset", "FakedProbaClassifier", "OneVsRestSSLClassifier"]
+
+
+
def get_dataset(X, y):
+ """Check and divide dataset between labeled and unlabeled data.
+
+ Parameters
+ ----------
+ X : ndarray or DataFrame of shape (n_samples, n_features)
+ Features matrix.
+ y : ndarray of shape (n_samples,)
+ Target vector.
+
+ Returns
+ -------
+ X_label : ndarray or DataFrame of shape (n_label, n_features)
+ Labeled features matrix.
+ y_label : ndarray or Series of shape (n_label,)
+ Labeled target vector.
+ X_unlabel : ndarray or DataFrame of shape (n_unlabel, n_features)
+ Unlabeled features matrix.
+ """
is_df = False
if isinstance(X, pd.DataFrame):
@@ -42,7 +84,7 @@ def get_dataset(X, y):
return X_label, y_label, X_unlabel
-class BaseEnsemble(ABC, MetaEstimatorMixin):
+class BaseEnsemble(ABC, MetaEstimatorMixin, BaseEstimator):
@abstractmethod
def predict_proba(self, X):
@@ -71,20 +113,81 @@ def predict(self, X):
class FakedProbaClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
+ """
+ Fake predict_proba method for classifiers that do not have it.
+ When predict_proba is called, it will use one hot encoding to fake the probabilities if base_estimator does not have predict_proba method.
+
+ Examples
+ --------
+ ```python
+ from sklearn.svm import SVC
+ # SVC does not have predict_proba method
+
+ from sslearn.base import FakedProbaClassifier
+ faked_svc = FakedProbaClassifier(SVC())
+ faked_svc.fit(X, y)
+ faked_svc.predict_proba(X) # One hot encoding probabilities
+ ```
+ """
def __init__(self, base_estimator):
+ """Create a classifier that fakes predict_proba method if it does not exist.
+
+ Parameters
+ ----------
+ base_estimator : ClassifierMixin
+ A classifier that implements fit and predict methods.
+ """
self.base_estimator = base_estimator
def fit(self, X, y):
+ """Fit a FakedProbaClassifier.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ The input samples.
+ y : {array-like, sparse matrix} of shape (n_samples,)
+ The target values.
+
+ Returns
+ -------
+ self : FakedProbaClassifier
+ Returns self.
+ """
self.classes_ = np.unique(y)
self.one_hot = OneHotEncoder().fit(y.reshape(-1, 1))
self.base_estimator.fit(X, y)
return self
def predict(self, X):
+ """Predict the classes of X.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ Array representing the data.
+
+ Returns
+ -------
+ y : ndarray of shape (n_samples,)
+ Array with predicted labels.
+ """
return self.base_estimator.predict(X)
def predict_proba(self, X):
+ """Predict the probabilities of each class for X.
+ If the base estimator does not have a predict_proba method, it will be faked using one hot encoding.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix} of shape (n_samples, n_features)
+
+ Returns
+ -------
+ y : ndarray of shape (n_samples, n_classes)
+ Array with predicted probabilities.
+ """
if "predict_proba" in dir(self.base_estimator):
return self.base_estimator.predict_proba(X)
else:
@@ -122,6 +225,12 @@ def _predict_binary_ssl(estimator, X, **predict_params):
class OneVsRestSSLClassifier(OneVsRestClassifier):
+ """Adapted OneVsRestClassifier for SSL datasets
+
+ Prevents using unlabeled data as an independent class in the classifier.
+
+ For more information of OvR classifier, see the documentation of [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html).
+ """
def __init__(self, estimator, *, n_jobs=None):
"""Adapted OneVsRestClassifier for SSL datasets
diff --git a/sslearn/datasets/__init__.py b/sslearn/datasets/__init__.py
index bd13379..aca2e0e 100644
--- a/sslearn/datasets/__init__.py
+++ b/sslearn/datasets/__init__.py
@@ -1,3 +1,18 @@
+"""
+Summary of module `sslearn.datasets`:
+
+This module contains functions to load and save datasets in different formats.
+
+## Functions
+
+1. read_csv : Load a dataset from a CSV file.
+2. read_keel : Load a dataset from a KEEL file.
+3. secure_dataset : Secure the dataset by converting it into a secure format.
+4. save_keel : Save a dataset in KEEL format.
+
+
+"""
+
from ._loader import read_csv, read_keel
from ._writer import save_keel
from ._preprocess import secure_dataset
diff --git a/sslearn/model_selection/__init__.py b/sslearn/model_selection/__init__.py
index 5bc0185..95cde12 100644
--- a/sslearn/model_selection/__init__.py
+++ b/sslearn/model_selection/__init__.py
@@ -1,3 +1,21 @@
-from ._split import *
+"""
+Summary of module `sslearn.model_selection`:
-__all__ = ['StratifiedKFoldSS', 'artificial_ssl_dataset']
\ No newline at end of file
+This module contains functions to split datasets into training and testing sets.
+
+## Functions
+
+[artificial_ssl_dataset](#artificial_ssl_dataset):
+> Generate an artificial semi-supervised learning dataset.
+
+## Classes
+
+[StratifiedKFoldSS](#StratifiedKFoldSS):
+> Stratified K-Folds cross-validator for semi-supervised learning.
+
+
+"""
+
+from ._split import artificial_ssl_dataset, StratifiedKFoldSS
+
+__all__ = ['artificial_ssl_dataset', 'StratifiedKFoldSS']
\ No newline at end of file
diff --git a/sslearn/model_selection/_split.py b/sslearn/model_selection/_split.py
index d24c773..0463435 100644
--- a/sslearn/model_selection/_split.py
+++ b/sslearn/model_selection/_split.py
@@ -4,7 +4,26 @@
class StratifiedKFoldSS():
+ """
+ Stratified K-Folds cross-validator for semi-supervised learning.
+
+ Provides labeled and unlabeled indices for each split, using the `StratifiedKFold` method from `sklearn`.
+ The `test` set is the labeled set and the `train` set is the unlabeled set.
+ """
+
+
def __init__(self, n_splits=5, shuffle=False, random_state=None):
+ """
+ Parameters
+ ----------
+ n_splits : int, default=5
+ Number of folds. Must be at least 2.
+ shuffle : bool, default=False
+ Whether to shuffle each class's samples before splitting into batches.
+ random_state : int or RandomState instance, default=None
+ When shuffle is True, random_state affects the ordering of the indices.
+
+ """
self.K = ms.StratifiedKFold(n_splits=n_splits, shuffle=shuffle,
random_state=random_state)
@@ -29,9 +48,9 @@ def split(self, X, y):
The feature set.
y : ndarray
The label set, -1 for unlabel instance.
- label: ndarray
+ label : ndarray
The training set indices for split mark as labeled.
- unlabel: ndarray
+ unlabel : ndarray
The training set indices for split mark as unlabeled.
"""
for train, test in self.K.split(X, y):
diff --git a/sslearn/restricted.py b/sslearn/restricted.py
index 6dd5a36..011be28 100644
--- a/sslearn/restricted.py
+++ b/sslearn/restricted.py
@@ -1,9 +1,30 @@
+"""Summary of module `sslearn.restricted`:
+
+This module contains classes to train a classifier using the restricted set classification approach.
+
+## Classes
+
+[WhoIsWhoClassifier](#WhoIsWhoClassifier):
+> Who is Who Classifier
+
+## Functions
+
+[conflict_rate](#conflict_rate):
+> Compute the conflict rate of a prediction, given a set of restrictions.
+[combine_predictions](#combine_predictions):
+> Combine the predictions of a group of instances to keep the restrictions.
+
+
+"""
+
import numpy as np
from sklearn.base import ClassifierMixin, MetaEstimatorMixin, BaseEstimator
from scipy.optimize import linear_sum_assignment
import warnings
import pandas as pd
+__all__ = ["conflict_rate", "combine_predictions", "WhoIsWhoClassifier"]
+
class WhoIsWhoClassifier(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
def __init__(self, base_estimator, method="hungarian", conflict_weighted=True):
diff --git a/sslearn/subview/__init__.py b/sslearn/subview/__init__.py
index e112345..aaaf8ba 100644
--- a/sslearn/subview/__init__.py
+++ b/sslearn/subview/__init__.py
@@ -1,3 +1,18 @@
+"""
+Summary of module `sslearn.subview`:
+
+This module contains classes to train a classifier or a regressor selecting a sub-view of the data.
+
+## Classes
+
+[SubViewClassifier](#SubViewClassifier):
+> Train a sub-view classifier.
+[SubViewRegressor](#SubViewRegressor):
+> Train a sub-view regressor.
+
+
+"""
+
from ._subview import SubViewClassifier, SubViewRegressor
__all__ = ["SubViewClassifier", "SubViewRegressor"]
\ No newline at end of file
diff --git a/sslearn/subview/_subview.py b/sslearn/subview/_subview.py
index b1119b1..eca25d8 100644
--- a/sslearn/subview/_subview.py
+++ b/sslearn/subview/_subview.py
@@ -10,6 +10,29 @@
class SubView(BaseEstimator):
+ """
+ A classifier that uses a subview of the data.
+
+ Example
+ -------
+ ```python
+ from sklearn.model_selection import train_test_split
+ from sklearn.tree import DecisionTreeClassifier
+ from sslearn.subview import SubViewClassifier
+
+ # Mode 'include' will include all columns that contain `string`
+ clf = SubViewClassifier(DecisionTreeClassifier(), "sepal", mode="include")
+ clf.fit(X, y)
+
+ # Mode 'regex' will include all columns that match the regex
+ clf = SubViewClassifier(DecisionTreeClassifier(), "sepal.*", mode="regex")
+ clf.fit(X, y)
+
+ # Mode 'index' will include the columns at the index, useful for numpy arrays
+ clf = SubViewClassifier(DecisionTreeClassifier(), [0, 1], mode="index")
+ clf.fit(X, y)
+ ```
+ """
def __init__(self, base_estimator, subview, mode="regex"):
"""Create a classifier that uses a subview of the data.
diff --git a/sslearn/utils.py b/sslearn/utils.py
index 83951bb..352ca3d 100644
--- a/sslearn/utils.py
+++ b/sslearn/utils.py
@@ -1,3 +1,27 @@
+"""
+Some utility functions
+
+This module contains utility functions that are used in different parts of the library.
+
+## Functions
+
+[safe_division](#safe_division):
+> Safely divide two numbers preventing division by zero.
+[confidence_interval](#confidence_interval):
+> Calculate the confidence interval of the predictions.
+[choice_with_proportion](#choice_with_proportion):
+> Choice the best predictions according to the proportion of each class.
+[calculate_prior_probability](#calculate_prior_probability):
+> Calculate the priori probability of each label.
+[mode](#mode):
+> Calculate the mode of a list of values.
+[check_n_jobs](#check_n_jobs):
+> Check `n_jobs` parameter according to the scikit-learn convention.
+[check_classifier](#check_classifier):
+> Check if the classifier is a ClassifierMixin or a list of ClassifierMixin.
+
+"""
+
import numpy as np
import os
import math
@@ -8,14 +32,51 @@
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import ClassifierMixin
+__all__ = ["safe_division", "confidence_interval", "choice_with_proportion", "calculate_prior_probability",
+ "mode", "check_n_jobs", "check_classifier"]
+
def safe_division(dividend, divisor, epsilon):
+ """Safely divide two numbers preventing division by zero
+
+ Parameters
+ ----------
+ dividend : numeric
+ Dividend value
+ divisor : numeric
+ Divisor value
+ epsilon : numeric
+ Close to zero value to be used in case of division by zero
+
+ Returns
+ -------
+ result : numeric
+ Result of the division
+ """
if divisor == 0:
return dividend / epsilon
return dividend / divisor
def confidence_interval(X, hyp, y, alpha=.95):
+ """Calculate the confidence interval of the predictions
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix} of shape (n_samples, n_features)
+ The input samples.
+ hyp : classifier
+ The classifier to be used for prediction
+ y : array-like of shape (n_samples,)
+ The target values
+ alpha : float, optional
+ confidence (1 - significance), by default .95
+
+ Returns
+ -------
+ li, hi: float
+ lower and upper bound of the confidence interval
+ """
data = hyp.predict(X)
successes = np.count_nonzero(data == y)
@@ -25,6 +86,24 @@ def confidence_interval(X, hyp, y, alpha=.95):
def choice_with_proportion(predictions, class_predicted, proportion, extra=0):
+ """Choice the best predictions according to the proportion of each class.
+
+ Parameters
+ ----------
+ predictions : array-like of shape (n_samples,)
+ array of predictions
+ class_predicted : array-like of shape (n_samples,)
+ array of predicted classes
+ proportion : dict
+ dictionary with the proportion of each class
+ extra : int, optional
+ number of extra instances to be added, by default 0
+
+ Returns
+ -------
+ indices: array-like of shape (n_samples,)
+ array of indices of the best predictions
+ """
n = len(predictions)
for_each_class = {c: int(n * j) for c, j in proportion.items()}
indices = np.zeros(0)
diff --git a/sslearn/wrapper/__init__.py b/sslearn/wrapper/__init__.py
index f5336c9..7403aa9 100644
--- a/sslearn/wrapper/__init__.py
+++ b/sslearn/wrapper/__init__.py
@@ -1,7 +1,41 @@
+"""
+Summary of module `sslearn.wrapper`:
+
+This module contains classes to train semi-supervised learning algorithms using a wrapper approach.
+
+## Self-Training Algorithms
+
+* [SelfTraining](#SelfTraining):
+Self-training algorithm.
+* [Setred](#Setred):
+Self-training with redundancy reduction.
+
+## Co-Training Algorithms
+
+* [CoTraining](#CoTraining):
+Co-training
+* [CoTrainingByCommittee](#CoTrainingByCommittee):
+Co-training by committee
+* [DemocraticCoLearning](#DemocraticCoLearning):
+Democratic co-learning
+* [Rasco](#Rasco):
+Random subspace co-training
+* [RelRasco](#RelRasco):
+Relevant random subspace co-training
+* [CoForest](#CoForest):
+Co-Forest
+* [TriTraining](#TriTraining):
+Tri-training
+* [DeTriTraining](#DeTriTraining):
+Data Editing Tri-training
+
+"""
+
from ._co import (CoForest, CoTraining, CoTrainingByCommittee,
DemocraticCoLearning, Rasco, RelRasco)
from ._self import SelfTraining, Setred
-from ._tritraining import DeTriTraining, TriTraining, WiWTriTraining
+from ._tritraining import DeTriTraining, TriTraining
-__all__ = ['SelfTraining', 'CoTrainingByCommittee', 'Rasco', 'RelRasco', 'TriTraining', 'WiWTriTraining'
- "CoTraining", "DeTriTraining", "DemocraticCoLearning", "Setred", "CoForest"]
+__all__ = ["SelfTraining", "Setred", "CoTraining", "CoTrainingByCommittee",
+ "DemocraticCoLearning", "Rasco", "RelRasco", "CoForest",
+ "TriTraining", "DeTriTraining"]
diff --git a/sslearn/wrapper/_co.py b/sslearn/wrapper/_co.py
index 7b266f1..a01d67d 100644
--- a/sslearn/wrapper/_co.py
+++ b/sslearn/wrapper/_co.py
@@ -19,6 +19,7 @@
from sklearn.utils import check_array, check_random_state, resample
from sklearn.utils.validation import check_is_fitted
+
from sslearn.utils import check_n_jobs
from ..base import BaseEnsemble, get_dataset
@@ -26,7 +27,19 @@
choice_with_proportion, confidence_interval, mode, safe_division)
-class BaseCoTraining(BaseEstimator, ClassifierMixin, BaseEnsemble):
+class BaseCoTraining(BaseEnsemble):
+ """
+ Base class for CoTraining classifiers.
+
+ Include
+ -------
+ 1. `predict_proba` method that returns the probability of each class.
+ 2. `predict` method that returns the class of each instance by argmax of `predict_proba`.
+ 3. `score` method that returns the mean accuracy on the given test data and labels.
+ """
+
+ _estimator_type = "classifier"
+
@abstractmethod
def fit(self, X, y, **kwards):
pass
@@ -40,7 +53,7 @@ def predict_proba(self, X):
Array representing the data.
Returns
-------
- ndarray of shape (n_samples, n_features)
+ class probabilities: ndarray of shape (n_samples, n_classes)
Array with prediction probabilities.
"""
is_df = isinstance(X, pd.DataFrame)
@@ -69,9 +82,90 @@ def predict_proba(self, X):
return y
else:
raise NotFittedError("Classifier not fitted")
+
+ _estimator_type = "classifier"
+
+ def score(self, X, y, sample_weight=None):
+ """
+ Return the mean accuracy on the given test data and labels.
+
+ In multi-label classification, this is the subset accuracy
+ which is a harsh metric since you require for each sample that
+ each label set be correctly predicted.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Test samples.
+
+ y : array-like of shape (n_samples,) or (n_samples, n_outputs)
+ True labels for `X`.
+
+ sample_weight : array-like of shape (n_samples,), default=None
+ Sample weights.
+
+ Returns
+ -------
+ score : float
+ Mean accuracy of ``self.predict(X)`` w.r.t. `y`.
+ """
+ from .metrics import accuracy_score
+
+ return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
class DemocraticCoLearning(BaseCoTraining):
+ """
+ **Democratic Co-learning. Ensemble of classifiers of different types.**
+ --------------------------------------------
+
+ An iterative algorithm that uses an ensemble of classifiers to label instances.
+ The main process is:
+ 1. Train each classifier with the labeled instances.
+ 2. While any classifier is retrained:
+ 1. Predict the instances from the unlabeled set.
+ 2. Calculate the confidence interval for each classifier for define weights.
+ 3. Calculate the weighted vote for each instance.
+ 4. Calculate the majority vote for each instance.
+ 5. Select the instances to label if majority vote is the same as weighted vote.
+ 6. Select the instances to retrain the classifier, if `only_mislabeled` is False then select all instances, else select only mislabeled instances for each classifier.
+ 7. Retrain the classifier with the new instances if the error rate is lower than the previous iteration.
+ 3. Ignore the classifiers with confidence interval lower than 0.5.
+ 4. Combine the probabilities of each classifier.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sklearn.tree import DecisionTreeClassifier
+ from sklearn.naive_bayes import GaussianNB
+ from sklearn.neighbors import KNeighborsClassifier
+ from sslearn.wrapper import DemocraticCoLearning
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ dcl = DemocraticCoLearning(base_estimator=[DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(n_neighbors=3)])
+ dcl.fit(X, y)
+ dcl.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ Y. Zhou and S. Goldman, (2004)
+ Democratic co-learning,
+ in 16th IEEE International Conference on Tools with Artificial Intelligence,
+ pp. 594-602, [10.1109/ICTAI.2004.48](https://doi.org/10.1109/ICTAI.2004.48).
+ """
+
def __init__(
self,
base_estimator=[
@@ -86,9 +180,7 @@ def __init__(
random_state=None
):
"""
- Y. Zhou and S. Goldman, "Democratic co-learning,"
- 16th IEEE International Conference on Tools with Artificial Intelligence,
- 2004, pp. 594-602, doi: 10.1109/ICTAI.2004.48.
+ Democratic Co-learning. Ensemble of classifiers of different types.
Parameters
----------
@@ -182,7 +274,7 @@ def fit(self, X, y, estimator_kwards=None):
Returns
-------
- self
+ self : DemocraticCoLearning
fitted classifier
"""
@@ -346,7 +438,7 @@ def predict_proba(self, X):
Array representing the data.
Returns
-------
- ndarray of shape (n_samples, n_features)
+ class probabilities: ndarray of shape (n_samples, n_classes)
Array with prediction probabilities.
"""
if "h_" in dir(self):
@@ -359,15 +451,58 @@ def predict_proba(self, X):
class CoTraining(BaseCoTraining):
"""
- Avrim Blum and Tom Mitchell. 1998.
- Combining labeled and unlabeled data with co-training.
- In Proceedings of the eleventh annual conference on Computational learning theory (COLT' 98).
- Association for Computing Machinery, New York, NY, USA, 92–100.
- DOI:https://doi.org/10.1145/279943.279962
-
- Han, Xian-Hua, Yen-wei Chen, and Xiang Ruan. 2011.
- ‘Multi-Class Co-Training Learning for Object and Scene Recognition’.
- Pp. 67–70 in. Nara, Japan.
+ **CoTraining classifier. Multi-view learning algorithm that uses two classifiers to label instances.**
+ --------------------------------------------
+
+ The main process is:
+ 1. Train each classifier with the labeled instances and their respective view.
+ 2. While max iterations is not reached or any instance is unlabeled:
+ 1. Predict the instances from the unlabeled set.
+ 2. Select the instances for which both classifiers make the same prediction and the predictions are above the threshold.
+ 3. Label the instances with the highest probability, keeping the balance of the classes.
+ 4. Retrain the classifier with the new instances.
+ 3. Combine the probabilities of each classifier.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sklearn.tree import DecisionTreeClassifier
+ from sslearn.wrapper import CoTraining
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ cotraining = CoTraining(DecisionTreeClassifier())
+ X1 = X[:, [0, 1]]
+ X2 = X[:, [2, 3]]
+ cotraining.fit(X1, y, X2)
+ # or
+ cotraining.fit(X, y, features=[[0, 1], [2, 3]])
+ # or
+ cotraining = CoTraining(DecisionTreeClassifier(), force_second_view=False)
+ cotraining.fit(X, y)
+ ```
+
+ **References**
+ ----------
+ Avrim Blum and Tom Mitchell. (1998).
+ Combining labeled and unlabeled data with co-training
+ in Proceedings of the eleventh annual conference on Computational learning theory (COLT' 98).
+ Association for Computing Machinery, New York, NY, USA, 92-100.
+ [10.1145/279943.279962](https://doi.org/10.1145/279943.279962)
+
+ Han, Xian-Hua, Yen-wei Chen, and Xiang Ruan. (2011).
+ Multi-Class Co-Training Learning for Object and Scene Recognition,
+ pp. 67-70, Nara, Japan.
+ [http://www.mva-org.jp/Proceedings/2011CD/papers/04-08.pdf](http://www.mva-org.jp/Proceedings/2011CD/papers/04-08.pdf)
"""
def __init__(
@@ -380,7 +515,9 @@ def __init__(
force_second_view=True,
random_state=None
):
- """Create a CoTraining classifier
+ """
+ Create a CoTraining classifier.
+ Multi-view learning algorithm that uses two classifiers to label instances.
Parameters
----------
@@ -398,6 +535,7 @@ def __init__(
The second classifier needs a different view of the data. If False, the second view will be the same as the first, by default True
random_state : int, RandomState instance, optional
controls the randomness of the estimator, by default None
+
"""
self.base_estimator = check_classifier(base_estimator, False)
if second_base_estimator is not None:
@@ -561,7 +699,7 @@ def predict_proba(self, X, X2=None, **kwards):
Array representing the data from another view, by default None
Returns
-------
- ndarray of shape (n_samples, n_features)
+ class probabilities: ndarray of shape (n_samples, n_classes)
Array with prediction probabilities.
"""
if "columns_" in dir(self):
@@ -626,6 +764,54 @@ def score(self, X, y, sample_weight=None, **kwards):
class Rasco(BaseCoTraining):
+ """
+ **Co-Training based on random subspaces**
+ --------------------------------------------
+
+ Generate a set of random subspaces and train a classifier for each subspace.
+
+ The main process is:
+ 1. Generate a set of random subspaces.
+ 2. Train a classifier for each subspace.
+ 3. While max iterations is not reached or any instance is unlabeled:
+ 1. Predict the instances from the unlabeled set for each classifier.
+ 2. Calculate the average of the predictions.
+ 3. Select the instances with the highest probability.
+ 4. Label the instances with the highest probability, keeping the balance of the classes.
+ 5. Retrain the classifier with the new instances.
+ 4. Combine the probabilities of each classifier.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import Rasco
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ rasco = Rasco()
+ rasco.fit(X, y)
+ rasco.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ Wang, J., Luo, S. W., & Zeng, X. H. (2008).
+ A random subspace method for co-training,
+ in 2008 IEEE International Joint Conference on Neural Networks
+ IEEE World Congress on Computational Intelligence
+ (pp. 195-200). IEEE. [10.1109/IJCNN.2008.4633789](https://doi.org/10.1109/IJCNN.2008.4633789)
+ """
+
+
def __init__(
self,
base_estimator=DecisionTreeClassifier(),
@@ -638,12 +824,6 @@ def __init__(
"""
Co-Training based on random subspaces
- Wang, J., Luo, S. W., & Zeng, X. H. (2008, June).
- A random subspace method for co-training.
- In 2008 IEEE International Joint Conference on Neural Networks
- (IEEE World Congress on Computational Intelligence)
- (pp. 195-200). IEEE.
-
Parameters
----------
base_estimator : ClassifierMixin, optional
@@ -678,7 +858,7 @@ def _generate_random_subspaces(self, X, y=None, random_state=None):
Returns
-------
- list
+ subspaces : list
List of index of features
"""
random_state = check_random_state(random_state)
@@ -770,6 +950,42 @@ def fit(self, X, y, **kwards):
class RelRasco(Rasco):
+ """
+ **Co-Training based on relevant random subspaces**
+ --------------------------------------------
+
+ A variation of `sslearn.wrapper.Rasco` that uses the mutual information of each feature to select the random subspaces.
+ The training process is the same as in Rasco.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import RelRasco
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ relrasco = RelRasco()
+ relrasco.fit(X, y)
+ relrasco.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ Yaslan, Y., & Cataltepe, Z. (2010).
+ Co-training with relevant random subspaces.
+ Neurocomputing, 73(10-12), 1652-1661.
+ [10.1016/j.neucom.2010.01.018](https://doi.org/10.1016/j.neucom.2010.01.018)
+ """
+
def __init__(
self,
base_estimator=DecisionTreeClassifier(),
@@ -782,11 +998,6 @@ def __init__(
"""
Co-Training with relevant random subspaces
- Yaslan, Y., & Cataltepe, Z. (2010).
- Co-training with relevant random subspaces.
- Neurocomputing, 73(10-12), 1652-1661.
-
-
Parameters
----------
base_estimator : ClassifierMixin, optional
@@ -802,6 +1013,7 @@ def __init__(
controls the randomness of the estimator, by default None
n_jobs : int, optional
The number of jobs to run in parallel. -1 means using all processors., by default None
+
"""
super().__init__(
base_estimator,
@@ -824,7 +1036,7 @@ def _generate_random_subspaces(self, X, y, random_state=None):
Returns
-------
- list
+ subspaces : list
List of index of features
"""
random_state = check_random_state(random_state)
@@ -844,7 +1056,52 @@ def _generate_random_subspaces(self, X, y, random_state=None):
# Done and tested
-class CoTrainingByCommittee(ClassifierMixin, BaseEnsemble, BaseEstimator):
+class CoTrainingByCommittee(BaseCoTraining):
+ """
+ **Co-Training by Committee classifier.**
+ --------------------------------------------
+
+ Create a committee trained by co-training based on the diversity of the classifiers.
+
+ The main process is:
+ 1. Train a committee of classifiers.
+ 2. Create a pool of unlabeled instances.
+ 3. While max iterations is not reached or any instance is unlabeled:
+ 1. Predict the instances from the unlabeled set.
+ 2. Select the instances with the highest probability.
+ 3. Label the instances with the highest probability, keeping the balance of the classes but ensuring that at least a minimum number of instances of each class (`min_instances_for_class`) are added.
+ 4. Retrain the classifier with the new instances.
+ 4. Combine the probabilities of each classifier.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import CoTrainingByCommittee
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ cotraining = CoTrainingByCommittee()
+ cotraining.fit(X, y)
+ cotraining.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ M. F. A. Hady and F. Schwenker,
+ Co-training by Committee: A New Semi-supervised Learning Framework,
+ in 2008 IEEE International Conference on Data Mining Workshops,
+ Pisa, 2008, pp. 563-572, [10.1109/ICDMW.2008.27](https://doi.org/10.1109/ICDMW.2008.27)
+ """
+
def __init__(
self,
ensemble_estimator=BaggingClassifier(),
@@ -853,12 +1110,10 @@ def __init__(
min_instances_for_class=3,
random_state=None,
):
- """Create a committee trained by cotraining based on
+ """
+ Create a committee trained by co-training based on
the diversity of classifiers.
- M. F. A. Hady and F. Schwenker,
- "Co-training by Committee: A New Semi-supervised Learning Framework,"
- 2008 IEEE International Conference on Data Mining Workshops,
- Pisa, 2008, pp. 563-572, doi: 10.1109/ICDMW.2008.27.
+
Parameters
----------
ensemble_estimator : ClassifierMixin, optional
@@ -870,6 +1125,8 @@ def __init__(
max number of unlabeled instances candidates to pseudolabel, by default 100
random_state : int, RandomState instance, optional
controls the randomness of the estimator, by default None
+
+
"""
self.ensemble_estimator = check_classifier(ensemble_estimator, False)
self.max_iterations = max_iterations
@@ -887,7 +1144,7 @@ def fit(self, X, y, **kwards):
The target values (class labels), -1 if unlabel.
Returns
-------
- self: CoTrainingByCommittee
+ self : CoTrainingByCommittee
Fitted estimator.
"""
self.ensemble_estimator = skclone(self.ensemble_estimator)
@@ -969,7 +1226,7 @@ def predict(self, X):
The input samples.
Returns
-------
- y: array-like of shape (n_samples,)
+ y : array-like of shape (n_samples,)
The predicted classes
"""
check_is_fitted(self.ensemble_estimator)
@@ -984,7 +1241,7 @@ def predict_proba(self, X):
The input samples.
Returns
-------
- y: ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
+ y : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
The predicted classes
"""
check_is_fitted(self.ensemble_estimator)
@@ -1022,12 +1279,56 @@ def score(self, X, y, sample_weight=None):
# Done and tested
class CoForest(BaseCoTraining):
+ """
+ **CoForest classifier. Random Forest co-training**
+ ----------------------------
+
+ Ensemble method for CoTraining based on Random Forest.
+
+ The main process is:
+ 1. Train a committee of classifiers using bootstrap.
+ 2. While any base classifier is retrained:
+ 1. Predict the instances from the unlabeled set.
+ 2. Select the instances with the highest probability.
+ 3. Label the instances with the highest probability
+ 4. Add the instances to the labeled set only if the error is not greater than the previous error.
+ 5. Retrain the classifier with the new instances.
+ 3. Combine the probabilities of each classifier.
+
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
+ **Example**
+ -------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import CoForest
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ coforest = CoForest()
+ coforest.fit(X, y)
+ coforest.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ Li, M., & Zhou, Z.-H. (2007).
+ Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples.
+ IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans,
+ 37(6), 1088-1098. [10.1109/tsmca.2007.904745](https://doi.org/10.1109/tsmca.2007.904745)
+ """
+
def __init__(self, base_estimator=DecisionTreeClassifier(), n_estimators=7, threshold=0.75, bootstrap=True, n_jobs=None, random_state=None, version="1.0.3"):
"""
- Li, M., & Zhou, Z.-H. (2007).
- Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples.
- IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans,
- 37(6), 1088–1098. doi:10.1109/tsmca.2007.904745
+ Generate a CoForest classifier.
+ An SSL Random Forest adaptation for co-training.
Parameters
----------
diff --git a/sslearn/wrapper/_self.py b/sslearn/wrapper/_self.py
index 080db55..e18799d 100644
--- a/sslearn/wrapper/_self.py
+++ b/sslearn/wrapper/_self.py
@@ -13,106 +13,38 @@
class SelfTraining(SelfTrainingClassifier):
-
- """Self-training. Adaptation of SelfTrainingClassifier from sklearn with data loader compatible.
-
- This class allows a given supervised classifier to function as a
- semi-supervised classifier, allowing it to learn from unlabeled data. It
- does this by iteratively predicting pseudo-labels for the unlabeled data
- and adding them to the training set.
-
- The classifier will continue iterating until either max_iter is reached, or
- no pseudo-labels were added to the training set in the previous iteration.
-
- Read more in the :ref:`User Guide `.
-
- Parameters
- ----------
- base_estimator : estimator object
- An estimator object implementing ``fit`` and ``predict_proba``.
- Invoking the ``fit`` method will fit a clone of the passed estimator,
- which will be stored in the ``base_estimator_`` attribute.
-
- threshold : float, default=0.75
- The decision threshold for use with `criterion='threshold'`.
- Should be in [0, 1). When using the 'threshold' criterion, a
- :ref:`well calibrated classifier ` should be used.
-
- criterion : {'threshold', 'k_best'}, default='threshold'
- The selection criterion used to select which labels to add to the
- training set. If 'threshold', pseudo-labels with prediction
- probabilities above `threshold` are added to the dataset. If 'k_best',
- the `k_best` pseudo-labels with highest prediction probabilities are
- added to the dataset. When using the 'threshold' criterion, a
- :ref:`well calibrated classifier ` should be used.
-
- k_best : int, default=10
- The amount of samples to add in each iteration. Only used when
- `criterion` is k_best'.
-
- max_iter : int or None, default=10
- Maximum number of iterations allowed. Should be greater than or equal
- to 0. If it is ``None``, the classifier will continue to predict labels
- until no new pseudo-labels are added, or all unlabeled samples have
- been labeled.
-
- verbose : bool, default=False
- Enable verbose output.
-
- Attributes
- ----------
- base_estimator_ : estimator object
- The fitted estimator.
-
- classes_ : ndarray or list of ndarray of shape (n_classes,)
- Class labels for each output. (Taken from the trained
- ``base_estimator_``).
-
- transduction_ : ndarray of shape (n_samples,)
- The labels used for the final fit of the classifier, including
- pseudo-labels added during fit.
-
- labeled_iter_ : ndarray of shape (n_samples,)
- The iteration in which each sample was labeled. When a sample has
- iteration 0, the sample was already labeled in the original dataset.
- When a sample has iteration -1, the sample was not labeled in any
- iteration.
-
- n_iter_ : int
- The number of rounds of self-training, that is the number of times the
- base estimator is fitted on relabeled variants of the training set.
-
- termination_condition_ : {'max_iter', 'no_change', 'all_labeled'}
- The reason that fitting was stopped.
-
- - 'max_iter': `n_iter_` reached `max_iter`.
- - 'no_change': no new labels were predicted.
- - 'all_labeled': all unlabeled samples were labeled before `max_iter`
- was reached.
-
- Examples
- --------
- >>> import numpy as np
- >>> from sklearn import datasets
- >>> from sklearn.semi_supervised import SelfTrainingClassifier
- >>> from sklearn.svm import SVC
- >>> rng = np.random.RandomState(42)
- >>> iris = datasets.load_iris()
- >>> random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
- >>> iris.target[random_unlabeled_points] = -1
- >>> svc = SVC(probability=True, gamma="auto")
- >>> self_training_model = SelfTrainingClassifier(svc)
- >>> self_training_model.fit(iris.data, iris.target)
- SelfTrainingClassifier(...)
-
- References
- ----------
- David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling
- supervised methods. In Proceedings of the 33rd annual meeting on
- Association for Computational Linguistics (ACL '95). Association for
- Computational Linguistics, Stroudsburg, PA, USA, 189-196. DOI:
- https://doi.org/10.3115/981658.981684
"""
+ **Self Training Classifier compatible with the sslearn data loader.**
+ ----------------------------
+
+ It is the same `SelfTrainingClassifier` from sklearn, but compatible with the `sslearn` data loader.
+ For more information, see the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html).
+
+ **Example**
+ -----------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.model_selection import artificial_ssl_dataset
+ from sslearn.wrapper import SelfTraining
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+
+ clf = SelfTraining()
+ clf.fit(X, y)
+ clf.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ --------------
+ David Yarowsky. (1995).
+ Unsupervised word sense disambiguation rivaling supervised methods.
+ In Proceedings of the 33rd annual meeting on Association for Computational Linguistics (ACL '95).
+ Association for Computational Linguistics,
+ Stroudsburg, PA, USA, 189-196.
+ [10.3115/981658.981684](https://doi.org/10.3115/981658.981684)
+ """
+
_estimator_type = "classifier"
def __init__(self,
@@ -122,6 +54,49 @@ def __init__(self,
k_best=10,
max_iter=10,
verbose=False):
+ """Self-training. Adaptation of SelfTrainingClassifier from sklearn with data loader compatible.
+
+ This class allows a given supervised classifier to function as a
+ semi-supervised classifier, allowing it to learn from unlabeled data. It
+ does this by iteratively predicting pseudo-labels for the unlabeled data
+ and adding them to the training set.
+
+ The classifier will continue iterating until either max_iter is reached, or
+ no pseudo-labels were added to the training set in the previous iteration.
+
+ Parameters
+ ----------
+ base_estimator : estimator object
+ An estimator object implementing ``fit`` and ``predict_proba``.
+ Invoking the ``fit`` method will fit a clone of the passed estimator,
+ which will be stored in the ``base_estimator_`` attribute.
+
+ threshold : float, default=0.75
+ The decision threshold for use with `criterion='threshold'`.
+ Should be in [0, 1). When using the 'threshold' criterion, a
+ well calibrated classifier should be used.
+
+ criterion : {'threshold', 'k_best'}, default='threshold'
+ The selection criterion used to select which labels to add to the
+ training set. If 'threshold', pseudo-labels with prediction
+ probabilities above `threshold` are added to the dataset. If 'k_best',
+ the `k_best` pseudo-labels with highest prediction probabilities are
+ added to the dataset. When using the 'threshold' criterion, a
+ well calibrated classifier should be used.
+
+ k_best : int, default=10
+ The amount of samples to add in each iteration. Only used when
+ `criterion` is 'k_best'.
+
+ max_iter : int or None, default=10
+ Maximum number of iterations allowed. Should be greater than or equal
+ to 0. If it is ``None``, the classifier will continue to predict labels
+ until no new pseudo-labels are added, or all unlabeled samples have
+ been labeled.
+
+ verbose : bool, default=False
+ Enable verbose output.
+ """
super().__init__(base_estimator, threshold, criterion, k_best, max_iter, verbose)
def fit(self, X, y):
@@ -150,6 +125,49 @@ def fit(self, X, y):
class Setred(ClassifierMixin, BaseEstimator):
+ """
+ **Self-training with Editing.**
+ ----------------------------
+
+ Create a SETRED classifier. It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+ The main process is:
+ 1. Train a classifier with the labeled data.
+ 2. Create a pool of unlabeled data and select the most confident predictions.
+ 3. Repeat until the maximum number of iterations is reached:
+ a. Select the most confident predictions from the unlabeled data.
+ b. Calculate the neighborhood graph of the labeled data and the selected instances from the unlabeled data.
+ c. Calculate the significance level of the selected instances.
+ d. Reject the instances that are not significant according to their position in the neighborhood graph.
+ e. Add the selected instances to the labeled data and retrain the classifier.
+ f. Add new instances to the pool of unlabeled data.
+ 4. Return the classifier trained with the labeled data.
+
+ **Example**
+ -----------
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.model_selection import artificial_ssl_dataset
+ from sslearn.wrapper import Setred
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+
+ clf = Setred()
+ clf.fit(X, y)
+ clf.score(X_unlabel, y_unlabel)
+ ```
+
+ **References**
+ ----------
+ Li, Ming, and Zhi-Hua Zhou. (2005)
+ SETRED: Self-training with editing,
+ in Advances in Knowledge Discovery and Data Mining.
+ Pacific-Asia Conference on Knowledge Discovery and Data Mining
+ LNAI 3518, Springer, Berlin, Heidelberg,
+ [10.1007/11430919_71](https://doi.org/10.1007/11430919_71)
+
+ """
+
def __init__(
self,
base_estimator=KNeighborsClassifier(n_neighbors=3),
@@ -162,25 +180,24 @@ def __init__(
n_jobs=None,
):
"""
- Li, Ming, and Zhi-Hua Zhou. "SETRED: Self-training with editing."
- Pacific-Asia Conference on Knowledge Discovery and Data Mining.
- Springer, Berlin, Heidelberg, 2005. doi: 10.1007/11430919_71.
-
+ Create a SETRED classifier.
+ It is a self-training algorithm that uses a rejection mechanism to avoid adding noisy samples to the training set.
+
Parameters
----------
base_estimator : ClassifierMixin, optional
- An estimator object implementing fit and predict_proba,, by default DecisionTreeClassifier(), by default KNeighborsClassifier(n_neighbors=3)
+ An estimator object implementing fit and predict_proba, by default KNeighborsClassifier(n_neighbors=3)
max_iterations : int, optional
Maximum number of iterations allowed. Should be greater than or equal to 0., by default 40
distance : str, optional
The distance metric to use for the graph.
The default metric is euclidean, and with p=2 is equivalent to the standard Euclidean metric.
For a list of available metrics, see the documentation of DistanceMetric and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS.
- Note that the “cosine” metric uses cosine_distances., by default "euclidean"
+ Note that the `cosine` metric uses cosine_distances., by default `euclidean`
poolsize : float, optional
Max number of unlabel instances candidates to pseudolabel, by default 0.25
rejection_threshold : float, optional
- significance level, by default 0.1
+ significance level, by default 0.05
graph_neighbors : int, optional
Number of neighbors for each sample., by default 1
random_state : int, RandomState instance, optional
@@ -330,7 +347,7 @@ def predict(self, X, **kwards):
The input samples.
Returns
-------
- y: array-like of shape (n_samples,)
+ y : array-like of shape (n_samples,)
The predicted classes
"""
return self._base_estimator.predict(X, **kwards)
@@ -344,7 +361,7 @@ def predict_proba(self, X, **kwards):
The input samples.
Returns
-------
- y: ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
+ y : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
The predicted classes
"""
return self._base_estimator.predict_proba(X, **kwards)
diff --git a/sslearn/wrapper/_tritraining.py b/sslearn/wrapper/_tritraining.py
index 4370c7d..e11754e 100644
--- a/sslearn/wrapper/_tritraining.py
+++ b/sslearn/wrapper/_tritraining.py
@@ -22,6 +22,33 @@
class TriTraining(BaseCoTraining):
+ """
+ **TriTraining. Trio of classifiers with bootstrapping.**
+
+ The main process is:
+ 1. Generate three classifiers using bootstrapping.
+ 2. Iterate until convergence:
+ 1. Calculate the error between two hypotheses.
+ 2. If the error is less than the previous error, generate a dataset with the instances where both hypotheses agree.
+ 3. Retrain the classifiers with the new dataset and the original labeled dataset.
+ 3. Combine the predictions of the three classifiers.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
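+ **Example**
+ -------
+ An illustrative usage sketch, mirroring the examples of the other wrappers in this module and assuming the default `base_estimator`:
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import TriTraining
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ tritraining = TriTraining()
+ tritraining.fit(X, y)
+ tritraining.score(X_unlabel, y_unlabel)
+ ```
+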
+ **References**
+ ----------
+ Zhi-Hua Zhou and Ming Li,
+ Tri-training: exploiting unlabeled data using three classifiers,
+ in IEEE Transactions on Knowledge and Data Engineering,
+ vol. 17, no. 11, pp. 1529-1541, Nov. 2005,
+ [10.1109/TKDE.2005.186](https://doi.org/10.1109/TKDE.2005.186)
+
+ """
def __init__(
self,
@@ -30,12 +57,8 @@ def __init__(
random_state=None,
n_jobs=None,
):
- """TriTraining
- Zhi-Hua Zhou and Ming Li,
- "Tri-training: exploiting unlabeled data using three classifiers,"
- in IEEE Transactions on Knowledge and Data Engineering,
- vol. 17, no. 11, pp. 1529-1541, Nov. 2005,
- doi: 10.1109/TKDE.2005.186.
+ """TriTraining. Trio of classifiers with bootstrapping.
+
Parameters
----------
base_estimator : ClassifierMixin, optional
@@ -49,6 +72,7 @@ def __init__(
The number of jobs to run in parallel for both `fit` and `predict`.
`None` means 1 unless in a :obj:`joblib.parallel_backend` context.
`-1` means using all processors., by default None
+
"""
self._N_LEARNER = 3
self.base_estimator = check_classifier(base_estimator, collection_size=self._N_LEARNER)
@@ -67,7 +91,7 @@ def fit(self, X, y, **kwards):
The target values (class labels), -1 if unlabeled.
Returns
-------
- self: TriTraining
+ self : TriTraining
Fitted estimator.
"""
random_state = check_random_state(self.random_state)
@@ -197,7 +221,8 @@ def _another_hs(hs, index):
base hypothesis index
Returns
-------
- list
+ classifiers: list
+ Collection of other hypotheses
"""
another_hs = []
for i in range(len(hs)):
@@ -218,7 +243,7 @@ def _subsample(L, s, random_state=None):
controls the randomness of the estimator, by default None
Returns
-------
- tuple
+ subsamples: tuple
Collection of pseudo-labeled selected for enlarged labeled examples.
"""
to_remove = len(L[0]) - s
@@ -244,7 +269,7 @@ def _measure_error(
A small number to avoid division by zero
Returns
-------
- float
+ error : float
Division of the number of labeled examples on which both h1 and h2 make incorrect classification,
by the number of labeled examples on which the classification made by h1 is the same as that made by h2.
"""
@@ -257,6 +282,32 @@ def _measure_error(
class WiWTriTraining(TriTraining):
+ """
+ **Who-Is-Who TriTraining.**
+
+ Trio of classifiers with bootstrapping and restricted set classification.
+ It is the same as TriTraining but with restricted set classification.
+ Mainly, the conflict rate penalizes the ***measure error*** of basic TriTraining. It can be calculated over different subsamples of X:
+ * `labeled`: over the complete L,
+ * `labeled_plus`: over the complete L union L',
+ * `unlabeled`: over the complete U,
+ * `all`: over the complete X (L union U), and
+ * `none`: do not penalize the ***measure error***; only use the restrictions to avoid sharing classes in the same group.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances. Receives the instance group, an array-like of shape (n_samples) with the group of each instance. Two instances with the same label are not allowed to be in the same group.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
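+ **Example**
+ -------
+ An illustrative sketch of the expected call signature; the `groups` array below is synthetic (one group per sample), used only so the "no repeated class within a group" restriction holds trivially:
+ ```python
+ import numpy as np
+ from sklearn.datasets import load_iris
+ from sklearn.tree import DecisionTreeClassifier
+ from sslearn.wrapper import WiWTriTraining
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ # One group per sample, so no group contains two instances with the same label.
+ groups = np.arange(X.shape[0])
+
+ wiw = WiWTriTraining(base_estimator=DecisionTreeClassifier())
+ wiw.fit(X, y, instance_group=groups)
+ y_pred = wiw.predict(X, instance_group=groups)
+ ```
+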
+ **References**
+ ----------
+ Ludmila I. Kuncheva, Juan J. Rodríguez, Aaron S. Jackson, (2016)
+ Restricted set classification: Who is there?
+ Pattern Recognition, 63, 158-170,
+ [10.1016/j.patcog.2016.08.028](https://doi.org/10.1016/j.patcog.2016.08.028)
+ """
def __init__(
self,
@@ -315,7 +366,7 @@ def fit(self, X, y, instance_group=None, **kwards):
The group. Two instances with the same label are not allowed to be in the same group.
Returns
-------
- self: TriTraining
+ self : TriTraining
Fitted estimator.
"""
random_state = check_random_state(self.random_state)
@@ -456,7 +507,7 @@ def _measure_error(self, L, y, h1: ClassifierMixin, h2: ClassifierMixin, epsilon
A small number to avoid division by zero
Returns
-------
- float
+ error : float
Division of the number of labeled examples on which both h1 and h2 make incorrect classification,
by the number of labeled examples on which the classification made by h1 is the same as that made by h2.
"""
@@ -503,17 +554,34 @@ def predict(self, X, instance_group):
class DeTriTraining(TriTraining):
+ """
+ **TriTraining with Data Editing.**
+
+ It is a variation of TriTraining; the main difference is that the instances are depurated in each iteration.
+ This means that only the instances whose neighbors share the same class are kept; the rest are removed.
+ At the end of the iterations, the instances are clustered and each instance receives the class of its cluster centroid.
+
+ **Methods**
+ -------
+ - `fit`: Fit the model with the labeled instances.
+ - `predict` : Predict the class for each instance.
+ - `predict_proba`: Predict the probability for each class.
+ - `score`: Return the mean accuracy on the given test data and labels.
+
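+ **Example**
+ -------
+ An illustrative usage sketch, mirroring the other wrappers in this module and assuming the default `base_estimator`:
+ ```python
+ from sklearn.datasets import load_iris
+ from sslearn.wrapper import DeTriTraining
+ from sslearn.model_selection import artificial_ssl_dataset
+
+ X, y = load_iris(return_X_y=True)
+ X, y, X_unlabel, y_unlabel, _, _ = artificial_ssl_dataset(X, y, label_rate=0.1, random_state=0)
+ dt = DeTriTraining()
+ dt.fit(X, y)
+ dt.score(X_unlabel, y_unlabel)
+ ```
+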
+ **References**
+ ----------
+ Deng C., Guo M.Z. (2006)
+ Tri-training and Data Editing Based Semi-supervised Clustering Algorithm,
+ in Gelbukh A., Reyes-Garcia C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006.
+ Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg.
+ [10.1007/11925231_61](https://doi.org/10.1007/11925231_61)
+ """
def __init__(self, base_estimator=DecisionTreeClassifier(), k_neighbors=3,
n_samples=None, mode="seeded", max_iterations=100, n_jobs=None, random_state=None):
- """DeTriTraining
-
- Deng C., Guo M.Z. (2006)
- Tri-training and Data Editing Based Semi-supervised Clustering Algorithm.
- In: Gelbukh A., Reyes-Garcia C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006.
- Lecture Notes in Computer Science, vol 4293.
- Springer, Berlin, Heidelberg.
- https://doi.org/10.1007/11925231_61
+ """
+ DeTriTraining - TriTraining with Depuration and Clustering.
+ Avoids the noise generated by the TriTraining algorithm by depurating the enlarged dataset and clustering the instances.
Parameters
----------
@@ -535,7 +603,7 @@ def __init__(self, base_estimator=DecisionTreeClassifier(), k_neighbors=3,
n_jobs : int, optional
The number of parallel jobs to run for neighbors search.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Doesn’t affect fit method., by default None
+ Doesn't affect fit method., by default None
random_state : int, RandomState instance, optional
controls the randomness of the estimator, by default None
"""
@@ -557,7 +625,7 @@ def _depure(self, S):
Returns
-------
- tuple (X, y)
+ tuple : (X, y)
Enlarged dataset with instances where at least k_neighbors/2+1 have the same class.
"""
init = time.time()
@@ -578,7 +646,7 @@ def _clustering(self, S, X):
Returns
-------
- array-like of shape (n_samples,)
+ y : array-like of shape (n_samples,)
class predicted for each instance
"""
centroids = dict()
diff --git a/test/test_wrapper.py b/test/test_wrapper.py
index 85d7b11..a14092e 100644
--- a/test/test_wrapper.py
+++ b/test/test_wrapper.py
@@ -16,7 +16,7 @@
from sslearn.model_selection import artificial_ssl_dataset
from sslearn.wrapper import (
CoTraining, CoForest, CoTrainingByCommittee, DemocraticCoLearning, Rasco, RelRasco,
- SelfTraining, Setred, TriTraining, WiWTriTraining, DeTriTraining
+ SelfTraining, Setred, TriTraining, DeTriTraining
)
X, y = read_csv(os.path.join(os.path.dirname(os.path.realpath(__file__)), "example_files", "abalone.csv"), format="pandas")
@@ -238,40 +238,40 @@ def test_all_label(self):
groups = np.array(groups)
groups = groups[:X.shape[0]]
-class TestWiWTriTraining:
+# class TestWiWTriTraining:
- def test_basic(self):
- clf = WiWTriTraining(base_estimator=DecisionTreeClassifier())
- clf.fit(X, y, instance_group=groups)
- clf.predict(X, instance_group=groups)
- clf.predict_proba(X)
-
- clf = WiWTriTraining(DecisionTreeClassifier())
- clf.fit(X2, y2, instance_group=groups)
- clf.predict(X2, instance_group=groups)
- clf.predict_proba(X2)
-
- def test_multiple(self):
- clf = WiWTriTraining(base_estimator=[DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3)])
- clf.fit(X, y, instance_group=groups)
- clf.predict(X, instance_group=groups)
- clf.predict_proba(X)
+# def test_basic(self):
+# clf = WiWTriTraining(base_estimator=DecisionTreeClassifier())
+# clf.fit(X, y, instance_group=groups)
+# clf.predict(X, instance_group=groups)
+# clf.predict_proba(X)
+
+# clf = WiWTriTraining(DecisionTreeClassifier())
+# clf.fit(X2, y2, instance_group=groups)
+# clf.predict(X2, instance_group=groups)
+# clf.predict_proba(X2)
+
+# def test_multiple(self):
+# clf = WiWTriTraining(base_estimator=[DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3)])
+# clf.fit(X, y, instance_group=groups)
+# clf.predict(X, instance_group=groups)
+# clf.predict_proba(X)
- def test_random_state(self):
- for i in range(10):
- clf = WiWTriTraining(base_estimator=KNeighborsClassifier(), random_state=i)
- clf.fit(X, y, instance_group=groups)
- y1 = clf.predict(X, instance_group=groups)
-
- clf = WiWTriTraining(base_estimator=KNeighborsClassifier(), random_state=i)
- clf.fit(X, y, instance_group=groups)
- y2 = clf.predict(X, instance_group=groups)
-
- assert np.all(y1 == y2)
-
- def test_all_label(self):
- clf = WiWTriTraining(base_estimator=KNeighborsClassifier())
- clf.fit(X, y, instance_group=groups)
- clf.predict(X, instance_group=groups)
- clf.predict_proba(X)
+# def test_random_state(self):
+# for i in range(10):
+# clf = WiWTriTraining(base_estimator=KNeighborsClassifier(), random_state=i)
+# clf.fit(X, y, instance_group=groups)
+# y1 = clf.predict(X, instance_group=groups)
+
+# clf = WiWTriTraining(base_estimator=KNeighborsClassifier(), random_state=i)
+# clf.fit(X, y, instance_group=groups)
+# y2 = clf.predict(X, instance_group=groups)
+
+# assert np.all(y1 == y2)
+
+# def test_all_label(self):
+# clf = WiWTriTraining(base_estimator=KNeighborsClassifier())
+# clf.fit(X, y, instance_group=groups)
+# clf.predict(X, instance_group=groups)
+# clf.predict_proba(X)