Elbow finding #250
Open
ShawnCodesABit wants to merge 48 commits into ddangelov:master from ShawnCodesABit:ElbowFinding
…ranslation into a matrix.
…ing on using to describe topics. Stubs for making a sparse matrix.
…before the sparse version.
… form of unit tests.
…s within an embedding. Now with elbows. Also found some strange behavior when the curve crosses over the line.
…lity to function on first elbow and also ensure that only positive values are returned.
…nd that adding in a maximum percent difference for the first bin catches some of the cases where elbow finding performs poorly.
…o flip when we are and are not inclusive from an elbow index. Great. Also more cases where running twice gives us more accurate results than running on all the data. This may be solved with vocabulary reduction, but maybe I should include a recursive option.
…s a combination of the 2nd derivative and the distance from the line.
…rent heuristics. This commit has a rather large performance hit as it saves the y-delta for each point in order to determine the sign of the curve.
…s to the cutoff heuristics. Lots of changes on documentation. Changed name from elbow_finding to cutoff_heuristics as there are multiple options now.
… y distance between the value and the line doesn't work as well for finding an index. Massive speed-ups and simplification due to chopping off various metrics.
…s of changes to start supporting cutoff_heuristics within Top2Vec class.
…the default operation is still the same.
…nit tests. Refactoring all of the heuristics into their own submodule for cleanliness.
…Using some more type hints and refactoring that into its own file. Updates to test to get a bit more coverage.
… provided values when computing a cutoff. Changed default heuristic to recursive_elbow after testing with live data.
…hanges to the printouts for plot_heuristic.
…perimenting with different data sets. Adding in some more functionality to the plot submodule to show how to visualize a heuristic determination.
…ated into plot and word cloud so that heuristic based cutoffs work a bit smoother.
…erformance increases that are about to be added. Making sure that files which are generated when running the test notebooks don't get added to git.
…array once, speed ups on the derivative calculations.
…and the ability to plot from arbitrary vectors.
…t the algorithm is doing.
…se tight oscillation cases it seems like the shifted second derivative works best.
Removing commented out code.
This PR looks really useful! I might dig into it a bit. Wonder if it'll be merged 🐙
Incorporates a variety of heuristics for determining what is and isn't similar. This functionality is optional and disabled by default. It is supported for all aspects of Top2Vec except the topic-by-term array, which is currently stored as a single 2D array rather than as a list of 1D arrays that could have different lengths.
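
For readers unfamiliar with elbow-style cutoffs, here is a minimal sketch of one common variant: pick the point on a descending similarity curve that lies farthest from the straight line joining its first and last values. This is an illustration only, not the code in this PR. The name `elbow_index` is hypothetical, the scores are made-up values, and keeping the elbow point itself (inclusive slicing) is an assumption.

```python
import numpy as np


def elbow_index(scores):
    """Hypothetical helper, not part of Top2Vec.

    Returns the index of the point farthest from the chord joining the
    first and last values of a curve sorted in descending order.
    """
    y = np.asarray(scores, dtype=float)
    n = y.size
    if n < 3:
        return n - 1  # too few points to have an elbow; keep everything
    x = np.arange(n, dtype=float)
    # Chord runs from (0, y[0]) to (n - 1, y[-1]).
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of every point from that chord.
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))


# Made-up similarity scores for two topics, sorted in descending order.
curves = [
    [0.9, 0.5, 0.3, 0.25, 0.22, 0.2],
    [0.9, 0.85, 0.8, 0.75, 0.2, 0.15],
]

# Assumption: the elbow point itself is kept (inclusive cutoff).
kept = [np.asarray(c[: elbow_index(c) + 1]) for c in curves]
```

With these inputs the two kept arrays have lengths 3 and 4. That is why per-topic cutoffs fit naturally in a list of 1D arrays but not in a single rectangular 2D array, which is the limitation noted above for the topic-by-term array.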
Other changes: