Methodology for normalization with max value division #180
Hi @andtsouch - this is a good question. Just to call out the block of code you referenced: FeatureExtraction/R/Normalization.R, lines 162 to 172 at commit 5363f5f.
Reading this code, it looks like the intent is to make sure the covariateValue falls in [0, 1].
I'm not sure I have a good answer to this, so I'm going to tag @schuemie and @jreps to ask them to weigh in on this approach to normalization.
I agree - dividing by the max value is not the type of normalization commonly used. I used to subtract the min value and divide by (max - min). We could add an option for the type of normalization that defaults to the old way (so things are backwards compatible) but lets users pick a different approach?
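As a rough sketch of what such an option might look like, mirroring the package's dplyr pipeline (the normalizationType argument and the per-covariate minValue are hypothetical here, not part of the package today; only the maximum is currently tracked):

library(dplyr)

# Toy covariates table in the package's long format (one row per
# person/covariate pair).
covariates <- tibble(
  rowId = c(1, 2, 3),
  covariateId = c(2001, 2001, 2001),
  covariateValue = c(40, 55, 80)       # e.g. age in years
)
rangePerCovariateId <- covariates %>%
  group_by(covariateId) %>%
  summarise(minValue = min(covariateValue), maxValue = max(covariateValue))

normalizationType <- "minmax"          # hypothetical new argument
if (normalizationType == "minmax") {
  covariates <- covariates %>%
    inner_join(rangePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = (covariateValue - minValue) /
                            (maxValue - minValue)) %>%
    select(-minValue, -maxValue)
} else {
  # Default ("max"): the current, backwards-compatible behavior.
  covariates <- covariates %>%
    inner_join(rangePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = covariateValue / maxValue) %>%
    select(-minValue, -maxValue)
}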
The choice for this type of normalization was mainly a practical one: if, for example, you aim to make the mean equal to 0 and the SD equal to 1 (a common form of normalization), the covariates are no longer sparse; everyone will have a non-zero value for all covariates, which would blow up memory. Jenna's proposal would work. For almost all variables (where the min value equals 0) it would be identical to the current approach.
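To make the sparsity argument concrete, a small standalone illustration in plain R (not package code):

# A sparse covariate: most persons have value 0, stored as absent rows.
x <- c(0, 0, 0, 0, 5, 10)

x / max(x)
# 0.0 0.0 0.0 0.0 0.5 1.0   -- zeros stay zero, sparsity is preserved

(x - mean(x)) / sd(x)
# -0.60 -0.60 -0.60 -0.60 0.60 1.79   -- zeros become non-zero, so every
# person would need an explicit row for every covariate

(x - min(x)) / (max(x) - min(x))
# 0.0 0.0 0.0 0.0 0.5 1.0   -- identical to max scaling when min(x) == 0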
Why would you do anything to the binary covariates? This type of normalization is normally only done on continuous covariates.
But back to @andtsouch's question. I agree that min-max scaling and z-scoring are the most common in my experience, but I'm not aware of any literature showing the superiority of one method over the other. There are, though, some papers trying to answer this question, like this one. @andtsouch, since you reference the literature, I'd be very interested if there is a specific paper you had in mind when you say max-abs scaling is not recommended?
Many machine learning algorithms make no distinction between binary and continuous covariates. For example, for LASSO you would ideally set the mean to 0 and the SD to 1 for all covariates, so the (single) hyperparameter scales well across all of them. But if all you want to do is make sure all covariates are in the [0, 1] range, then yes, you don't need to touch the binary covariates.
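A sketch of that distinction, assuming a hypothetical isBinary flag per covariate (the package does not currently carry such a column):

library(dplyr)

# Scale only the continuous covariates into [0, 1]; leave binary ones
# untouched. The isBinary column is assumed for illustration only.
covariates <- tribble(
  ~rowId, ~covariateId, ~covariateValue, ~isBinary,
  1,      1001,         1,               TRUE,    # condition flag
  2,      1001,         1,               TRUE,
  1,      2001,         40,              FALSE,   # age in years
  2,      2001,         80,              FALSE
)

scaled <- covariates %>%
  group_by(covariateId) %>%
  mutate(covariateValue = if_else(isBinary,
                                  covariateValue,
                                  covariateValue / max(covariateValue))) %>%
  ungroup() %>%
  select(-isBinary)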
Hi @anthonysena, @egillax, @jreps, @schuemie! I am creating this issue to touch upon the way the tidyCovariateData function normalizes the data when using normalize = TRUE. From what I can see in the code, the data are normalized by dividing by the maximum value of each covariate:
https://github.com/OHDSI/FeatureExtraction/blob/main/R/Normalization.R (line 162)
if (normalize) {
    ParallelLogger::logInfo("Normalizing covariates")
    newCovariates <- newCovariates %>%
      inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
      mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
      select(-.data$maxValue)
    metaData$normFactors <- covariateData$maxValuePerCovariateId %>%
      collect()
  }
  newCovariateData$covariates <- newCovariates
}
I am interested in discussing the choice of this normalization method, since it is not really recommended in the literature. Have you found that it has some specific advantages for machine learning models? As far as I am aware, methods like min-max or z-score are suggested most of the time. Is this perhaps a feature for a potential future update, or would you recommend some way to provide a custom normalization function in the current version of the package?
Thank you in advance for your consideration!
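As a side note on that last question: one workaround that should be possible with the current version is to skip the built-in normalization and post-process the covariates table yourself after extraction. A minimal sketch, assuming covariateData came from getDbCovariateData(); the z-score logic below is purely illustrative, not a package feature:

library(dplyr)
library(FeatureExtraction)

# Pull the covariates table into memory and z-score per covariate.
covariates <- covariateData$covariates %>% collect()

normFactors <- covariates %>%
  group_by(covariateId) %>%
  summarise(meanValue = mean(covariateValue),
            sdValue = sd(covariateValue))

covariates <- covariates %>%
  inner_join(normFactors, by = "covariateId") %>%
  mutate(covariateValue = (covariateValue - meanValue) / sdValue) %>%
  select(-meanValue, -sdValue)

# Caveat (per the sparsity discussion above): this only sees the stored
# non-zero rows, so the implicit zeros are excluded from the mean/SD, and
# z-scored covariates are no longer sparse.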