-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
transform_feature_names for decomposition and scaling #208
base: master
Are you sure you want to change the base?
Conversation
Present top linear weights as PCA2:=(x0*5.3+x1*77) etc.
In merging with master, I see that there has been an attempt to more thoroughly test every estimator with |
Codecov Report
@@ Coverage Diff @@
## master #208 +/- ##
=========================================
- Coverage 97.4% 86.2% -11.2%
=========================================
Files 41 41
Lines 2580 2617 +37
Branches 495 505 +10
=========================================
- Hits 2513 2256 -257
- Misses 35 329 +294
Partials 32 32
|
Hey @jnothman! I'll check this PR later today. As for excessive testing, I like testing of all (most) estimators explicitly because of two reasons:
For example, RandomizedLogisticRegression has SelectorMixin as a metaclass, not as a base class, so #177 didn't work for it; I found this when adding all these estimators (which I was thinking are supported) one-by-one. Another example is lightning support: when adding all estimators one-by-one I found several inconsistencies which required fixing in lightning: scikit-learn-contrib/lightning#95, scikit-learn-contrib/lightning#98. I can see how from a library author (sklearn, etc.) point of view these tests look unnecessary. But they require knowledge of inner workings of the library, and knowledge of development plans - e.g. SelectorMixin is not documented, it is not a part of public scikit-learn API; as a scikit-learn core developer you are comfortable relying on it, but as an outsider I like to be more explicit in what eli5 supports, and check one more time that the library interface is consistent and there are no gotchas and hidden issues. In case of #177 it was also a learning excercise for me - I was checking what this PR means for the user. |
The issue with randomized l1 is not that SelectorMixin is a metaclass; it
is not. It's that SelectorMixin was only made a base class of it recently
(i.e. not in a released version).
You're probably right. But in testing everything, you will also need to
manage deprecations and newcomers in order to support multiple versions. It
will be maintenance hell in time...
…On 24 May 2017 at 20:14, Mikhail Korobov ***@***.***> wrote:
Hey @jnothman <https://github.com/jnothman>! I'll check this PR later
today.
As for excessive testing, I like testing of all (most) estimators
explicitly because of two reasons:
1. this way we declare explicitly what's supported and what's not
supported;
2. it helps to find bugs and wrong assumptions.
For example, RandomizedLogisticRegression has SelectorMixin as a
metaclass, not as a base class, so #177
<#177> didn't work for it; I
found this when adding all these estimators (which I was thinking are
supported) one-by-one. Another example is lightning support: when adding
all estimators on-by-one I found several inconsistencies which required
fixing in lightning: scikit-learn-contrib/lightning#95
<scikit-learn-contrib/lightning#95>,
scikit-learn-contrib/lightning#98
<scikit-learn-contrib/lightning#98>.
I can see how from a library author (sklearn, etc.) point of view these
tests look unnecessary. But they require knowledge of inner workings of the
library, and knowledge of development plans - e.g. SelectorMixin is not
documented, it is not a part of public scikit-learn API; as a scikit-learn
core developer you are comfortable relying on it, but as an outsider I like
to be more explicit in what eli5 supports, and check one more time that the
library interface is consistent and there are no gotchas and hidden issues.
In case of #177 <#177> it was
also a learning excercise for me - I was checking what this PR means for
the user.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#208 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6-iO1tYb_juMiwbiKIlTao6DeK6kks5r9AMXgaJpZM4Nkan0>
.
|
A good catch about randomized l1! Yeah, deprecations can be a problem. We "solved" it once by dropping sklearn 0.17 support, and likely it'll cause problems again with 0.19. But to my taste it is a better problem to have: it is testing-only, and it is visible; if we don't test then problem can be in user applications, and it can be undetected by our testing suite. But yeah, explicit estimator support increases effort required for maintenance. |
I've not worked out how to fix
|
@jnothman yep, mypy still has a lot of rough edges - python/mypy#797. You can ignore such errors like this: transform_feature_names.register(TfidfTransformer)( # type: ignore
make_tfn_weighted('idf_', func_name='TFIDF', show_idx=False)) |
Hey @jnothman, I've tried it with LSI:
eli5.explain_weigths shows features like this now:
Unlike TfidfVectorizer, it shows idf weights, which is nice; I wonder if we should use idf_ as What I found confusing is |
I agree that it looks like function application, which is wrong. Square or
other brackets help a little. Not sure what alternatives are
On 29 May 2017 11:59 pm, "Mikhail Korobov" <[email protected]> wrote:
Hey @jnothman <https://github.com/jnothman>,
I've tried it with LSI:
vec = CountVectorizer()
tfidf = TfidfTransformer()
svd = TruncatedSVD(n_components=50)
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, tfidf, svd, clf)
eli5.explain_weigths shows features like this now:
SVD2:=(TFIDF(keith*3.75)*0.362+TFIDF(caltech*3.99)*0.27+TFIDF(livesey*4.55)*0.223)
Unlike TfidfVectorizer, it shows idf weights, which is nice; I wonder if we
should use idf_ as coef_scale by default for TfidfVectorizer.
What I found confusing is TFIDF(keith*3.75) notation: it reads like "take a
number of words "keith" in document, multiply by 3.75, then compute TF*IDF
for it (?)", while in fact IDF is 3.75, so it is TF(keith)*3.75. If the
goal is not to show a math formula, but to show transformer name, then what
do you think about using some other notation, e.g. TFIDF[...] or something
else. Ideas?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#208 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAEz66z_--jYjN1-GIOa1eGK6s3xs5A3ks5r-s9GgaJpZM4Nkan0>
.
|
FTR: tests are failing because of #217. |
6ee3bd5
to
0e34608
Compare
I'd really like to hear some feedback on this, as I feel linear transformation is a very common case, but it is admittedly hard to present nicely as a string. |
Otoh if you feel like it's stretching the scope of eli5 too wide, I'm happy to make a separate project of it. |
@jnothman this PR looks great, I'm looking forward to using it myself. My only reservation is |
well maybe that bracketing is wrong in all the cases. something less loaded
like angle brackets?
…On 29 Jun 2017 9:10 pm, "Mikhail Korobov" ***@***.***> wrote:
@jnothman <https://github.com/jnothman> this PR looks great, I'm looking
forward to using it myself. My only reservation is tf(...) which looks
like a function is applied. It seems the easiest way to fix it is to make
formatting in make_tfn_weighted customizable, and customize it for
TfidfTransformer.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#208 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz64WG-kWZ5ajqwkgzWq67u0eyMh9Yks5sI4YSgaJpZM4Nkan0>
.
|
@jnothman yes, sounds fine. I tried
Maybe spaces around
|
Does it make sense to keep
Also, should we indicate that some terms are missing (if they are missing), e.g. |
You're right < is going to break when we support discretizing.
…On 29 June 2017 at 22:50, Mikhail Korobov ***@***.***> wrote:
Does it make sense to keep () for := notation?
SVD2:=(TFIDF<keith*3.75>*0.362 + TFIDF<caltech*3.99>*0.27 + TFIDF<livesey*4.55>*0.223)
PCA2:=(x0*5.3 + x1*77)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#208 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz600V1a5eDwHXyueTfk_3asXQ38Tgks5sI52xgaJpZM4Nkan0>
.
|
What do you think about the following?
(4) is what I'm less certain in. Maybe we should just pick one of the options even if it is not perfect to move it forward; it can be fixed later if someone comes with a better idea. |
|
With the |
I've not put in ellipsis previously because I considered brevity to be a virtue in feature names. |
@jnothman Hm, I was thinking about both top and threshold. We shouldn't show "+ …" if the rest of features are zero though, and if you think of threshold as of a way to filter out near-zero components I see how it can be different from
it doesn't matter. It bothers me a bit that these are displaying options, so they'd be better handled at formatting level. For example, the rest of terms in SVD can be displayed on mouse hover in HTML. But it looks outside the scope of this pull request; it may require a structured representation of feature names, instead of plain strings everywhere. We had this issue with HashingVectorizer, where we needed to store signs of possible feature names, in addition to names themselves; that's why formatters accept feature names which are lists of {'name': name, 'sign': sign} dicts. |
Present top linear weights as
PCA2:=(x0*5.3+x1*77)
etc.Issues:
make_tfn_weighted
should probably have a doctest (but I'm not sure how to do this without modifying the global register oftransform_feature_names
implementations)hypothesis
. Suggestions welcome.