Better outlier filter for trainset #121
@L-M-Sherlock, what is your opinion on the suggestion? @Expertium, you may also be interested. |
I like this idea. |
@user1823, can you elaborate on how exactly the new condition should work? |
I was thinking of using […]. Other suggestions are welcome. |
But the value of stability depends on what data is available. Do you want a two-step procedure? |
Looks like you didn't notice this:
I am suggesting calculating S0 using the current filter and then using the two conditions to filter the reviews when training the other parameters. |
By the way, this statement holds true for any review, not just the first review. But, because filtering the outliers in each review is very difficult, we are filtering the outliers at the first review only. (First review = second rating) |
Ah, ok, I misunderstood. |
@L-M-Sherlock I suggest trying user1823's idea: #121 (comment) |
In case you guys have forgotten, I want to remind you of the following: in general, any change that makes the data more heterogeneous (such as the above proposed change) will tend to increase the RMSE. So, we shouldn't rely on RMSE to determine the effectiveness of this change.

It is possible that RMSE can decrease after this change, but it will happen only if the idea is VERY good. This happened in open-spaced-repetition/fsrs-optimizer#16.

In my opinion, the condition for implementing this idea should be that the change should not significantly increase the RMSE. |
@L-M-Sherlock, what is your opinion on this suggestion? If you are too busy to implement this now, it is fine. But you can at least give your opinion. |
Maybe also filter reviews of cards with initial delta_t > 7 × S irrespective of whether they are filtered by the pretrain filter or not? (delta_t = 7S implies R = 56.3%) |
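To make the two-step idea concrete, here is a minimal sketch (not from the thread) of what such a trainset filter could look like. The `CardSummary` type, the `s0_by_rating` parameter, and the function name are all hypothetical, and the formula assumed is the power forgetting curve R = (1 + t / (9S))^-1 that gives the 56.3% figure; the real fsrs-rs types and pretrain API differ.

```rust
// Hypothetical per-card summary used only for this sketch; the real
// fsrs-rs item types are different.
struct CardSummary {
    first_rating: usize, // 1 = Again, 2 = Hard, 3 = Good, 4 = Easy
    first_delta_t: f32,  // days between graduation and the first review
}

/// Step 1 (not shown): fit S0 per first rating with the existing pretrain
/// filter. Step 2: keep a card in the trainset unless its first interval is
/// an extreme outlier relative to that S0; delta_t > 7 * S0 corresponds to a
/// predicted R below ~56.3% under R = (1 + t / (9 * S))^-1.
fn filter_trainset(cards: Vec<CardSummary>, s0_by_rating: [f32; 4]) -> Vec<CardSummary> {
    cards
        .into_iter()
        .filter(|card| {
            let s0 = s0_by_rating[card.first_rating - 1];
            card.first_delta_t <= 7.0 * s0
        })
        .collect()
}
```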
@L-M-Sherlock, the issue that you highlighted is not caused by my suggestion; it exists in the current version too. So, it should not be the reason to not apply my suggestion.

Basically, the reason behind my suggestion is that the higher the amount of data available, the better the trained weights will be (provided the data is not complete rubbish). So, just because there are few cards for a particular delta_t, we should not discard all the ratings for those cards. The criteria to discard them should be stricter (as proposed above). Similarly, just because a card falls in a delta_t bin with R = 100% (#135), we shouldn't discard all the ratings for those cards.

For sure, the above two cases are not suitable for calculating S0, but their subsequent ratings can be useful. |
The second case is suitable though. We use additive smoothing, so R will never be exactly 0% or 100% in the pretrain data. I haven't looked at the Rust code (it's hard to understand), but I'm assuming additive smoothing is implemented. If not, it definitely should be.

```rust
let recall = {
    // Laplace smoothing
    // (real_recall * n + average_recall * 1) / (n + 1)
    // https://github.com/open-spaced-repetition/fsrs4anki/pull/358/files#diff-35b13c8e3466e8bd1231a51c71524fc31a945a8f332290726214d3a6fa7f442aR491
    let real_recall = Array1::from_iter(data.iter().map(|d| d.recall));
    let n = data.iter().map(|d| d.count).sum::<f32>();
    (real_recall * n + average_recall) / (n + 1.0)
};
```

Wait, Sherlock, then what's the point of removing cards with R=100%? |
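(A quick numeric illustration of the smoothing above, with made-up values: a bin of n = 5 reviews that were all recalled, with average_recall = 0.9, gives (1.0 × 5 + 0.9) / (5 + 1) ≈ 0.983, so a bin's recall indeed never comes out as exactly 100% or 0% after smoothing.)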
It's really weird that the retention is 100% when the interval >= 1 day. Removing these data could increase the pretraining's robustness. In this case, the number of removed cards is very small. So it's acceptable. |
Is this issue still necessary? |
Yes, the previous few changes were related to the pretrainset. This issue is related to the trainset. |
I want to keep the current outlier filter at the card level. All reviews of the same card are filtered out if the first review is filtered out in the pretrainset. |
But why? The pretrain and train methods are completely different. It is not necessary to use the same outlier filter for both. Also, how would you explain the following?
Here, the optimizer is saying that the initial stability is 14 days, but it completely ignores the cards whose first interval is more than 5 days, just because there are thousands of reviews with a first interval of 1 day (because of past Anki use). |
Their goals are the same: minimizing the loss.
The current outlier filter only filters out 5% of cards for each first rating. That's 1000 * 0.05 = 50 cards. When the number of cards whose first interval is more than 5 days exceeds 50, they will be taken into account. For more details, please see the code and comments at lines 183 to 201 in 3d0dd3a. |
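For readers without the code open, here is a rough sketch of that outlier filter's logic, reconstructed only from the descriptions in this thread (the 5% budget mentioned above and the "fewer than 6 reviews per delta_t" rule discussed a few comments below). The type and function names are made up; lines 183–201 in 3d0dd3a are the authoritative implementation.

```rust
use std::collections::HashMap;

// Hypothetical first-review summary; the real fsrs-rs types differ.
#[derive(Clone)]
struct FirstReview {
    rating: usize, // 1..=4 (Again, Hard, Good, Easy)
    delta_t: u32,  // days until the first review
}

/// Card-level outlier filter as described in this thread: within each first
/// rating, cards are grouped by delta_t; bins with fewer than 6 cards are
/// dropped, and the smallest remaining bins are dropped while the running
/// total of removed cards stays within ~5% of that rating's cards.
fn outlier_filter(cards: &[FirstReview]) -> Vec<FirstReview> {
    let mut kept = Vec::new();
    for rating in 1..=4 {
        let of_rating: Vec<&FirstReview> =
            cards.iter().filter(|c| c.rating == rating).collect();
        let total = of_rating.len();

        // Count cards per delta_t bin.
        let mut bins: HashMap<u32, usize> = HashMap::new();
        for c in &of_rating {
            *bins.entry(c.delta_t).or_insert(0) += 1;
        }

        // Visit bins from smallest to largest; drop bins with < 6 cards
        // unconditionally, and larger bins only while the removed total
        // stays within ~5% of this rating's cards.
        let mut sorted: Vec<(u32, usize)> = bins.into_iter().collect();
        sorted.sort_by_key(|&(_, count)| count);
        let mut removed = 0;
        let mut removed_bins = Vec::new();
        for (delta_t, count) in sorted {
            if count < 6 || removed + count <= total / 20 {
                removed += count;
                removed_bins.push(delta_t);
            }
        }

        kept.extend(
            of_rating
                .into_iter()
                .filter(|c| !removed_bins.contains(&c.delta_t))
                .cloned(),
        );
    }
    kept
}
```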
Yes, their goals are the same. But, the method of achieving the goal is very different. Pretrain uses binning (based on delta_t) while train doesn't.
In training, each card is treated independently of the others (because there is no binning). So, we don't need to look at the number of cards with a particular delta_t (except for some weird cases like a first review after 1000 days of graduating the card, which should be filtered out by the filter condition suggested by me). I agree that not a very large number of cards are affected. But, I can't think of any good reason for not including these cards in training. |
I don't have a strong opinion on this matter, but if I had to choose, I would choose to implement user1823's idea. |
How much benefit does including them bring? If the number of reviews of the same delta_t is more than 6 in the pretrainset, the current outlier filter keeps them as well. So it only removes a few cards whose first review is weird. In addition, filtering the trainset based on the initial stability is complex. In my opinion, the cost exceeds the benefit. |
No. If the number is less than 6, then the reviews are filtered out. But if the number is ≥ 6, then whether to keep them or not is decided by whether they fall in the bottom 5% of the total number of cards with that first rating.

This 5% can be quite large. For example, if I have 4000 cards with a first rating of Good, then the optimizer is ignoring 200 cards for no good reason.

This is not to say that the 5% limit should be adjusted. It is required for pretrain. But, when training, the filter should be slightly looser.

This is my last comment on this issue. If you still don't want to implement it, then we can forget about it. |
Originally posted by @user1823 in #119 (comment)