allow for a narrow sigma

Once the model has figured out the right action, there is no point in continuing to explore, so the deviation should be really narrow. Clamping sigma to a high minimum value can cause the policy to collapse after it has figured out a good policy. At least that's the intuition.
MADEAPPS · Sep 16, 2024 · 0657a68
1 parent 486df49
commit 0657a68
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/newton-4.00/sdk/dBrain/ndBrainAgentContinuePolicyGradient_Trainer.cpp b/newton-4.00/sdk/dBrain/ndBrainAgentContinuePolicyGradient_Trainer.cpp
@@ -577,7 +577,7 @@ void ndBrainAgentContinuePolicyGradient_TrainerMaster::OptimizePolicy()
 						{
 							const ndBrainFloat mean = output[i];
 							ndAssert(ndExp(output[i + numberOfActions]) > 0.0f);
-							const ndBrainFloat sigma1 = ndMax (ndExp(output[i + numberOfActions]), ndFloat32(1.0e-2f));
+							const ndBrainFloat sigma1 = ndMax (ndExp(output[i + numberOfActions]), ndFloat32(1.0e-4f));
 							const ndBrainFloat sigma2 = sigma1 * sigma1;
 							const ndBrainFloat sigma3 = sigma2 * sigma1;
 							const ndBrainFloat num = (actions[i] - mean);
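
To make the effect of the lower floor concrete, here is a minimal standalone sketch of the same clamp using the standard library instead of the Newton helpers. `ClampedSigma` and `FloorIsActive` are hypothetical names introduced only for illustration; the sketch assumes, as the diff suggests, that the network emits log-sigma and that sigma is obtained by exponentiating it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the Newton SDK: turns a log-sigma network
// output into a standard deviation and clamps it from below, mirroring
// ndMax(ndExp(x), floor) in the hunk above.
inline float ClampedSigma(float logSigma, float minSigma)
{
    assert(minSigma > 0.0f);
    return std::max(std::exp(logSigma), minSigma);
}

// Hypothetical helper: true when the floor, rather than the network output,
// decides the value of sigma.
inline bool FloorIsActive(float logSigma, float minSigma)
{
    return std::exp(logSigma) < minSigma;
}
```

For a log-sigma output of -6, exp(-6) is roughly 0.0025. With the old floor of 1.0e-2 the clamp overrides the network and returns 0.01; with the new floor of 1.0e-4 the network's narrower value passes through unchanged. In log-sigma terms, the old floor engages below about -4.6 and the new one only below about -9.2.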