Diagonal elements of the transition matrix are non-zero and not-equal #3
Comments
@ellisonbg What do you think about #4?
yeah you could probably improve the analysis – feel free to give it a shot :)
I agree with @ellisonbg. The fix is not easy because it requires computing the rates of switching rather than the total number of switches. Say there's a 2x2 matrix:
Then, if there were 500 switches from Java to C per year and 5000 Java programmers who were also bloggers that year, the first row of the rate matrix should be [ 9/10, 1/10 ]. The left eigenvector of the matrix with eigenvalue 1 (use numpy.linalg.eig on the transpose, since the rows sum to 1) will then tell you the long-run language distribution of those bloggers. Different total numbers may distort those calculations. The rates can be estimated by limiting the counts to the past year or two. One way to "guesstimate" the proper diagonal elements could be to assume that the total number of bloggers is proportional to the total number of "I'm leaving this language" posts (with constant P across languages). For the test data, this would give:
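As a rough sketch of the eigenvector step described above: the first row below uses the 500-out-of-5000 Java figures from this comment, while the second row (C to Java) is a made-up assumption just to complete the 2x2 matrix. The helper name `stationary_distribution` is hypothetical and not part of the project's code.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (rows sum to 1)."""
    P = np.asarray(P, dtype=float)
    # A stationary distribution pi satisfies pi @ P = pi, i.e. it is a
    # left eigenvector of P, which is a right eigenvector of P.T.
    eigvals, eigvecs = np.linalg.eig(P.T)
    # Pick the eigenvalue closest to 1.
    idx = np.argmin(np.abs(eigvals - 1.0))
    v = np.real(eigvecs[:, idx])
    # Rescale so the entries sum to 1 (this also fixes an overall sign flip).
    return v / v.sum()

# Row 0 (Java): 500 of 5000 bloggers switch to C per year -> [9/10, 1/10].
# Row 1 (C): hypothetical 5% switch to Java, chosen only for illustration.
P = np.array([[0.90, 0.10],
              [0.05, 0.95]])

pi = stationary_distribution(P)
print(pi)
```

With the diagonal entries included, the long-run shares depend on the per-language leaving rates and not only on the raw switch counts.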
Hi, I love the overall approach of this work and how you have written it up as a blog post and posted the code on GitHub.
Your treatment sets the diagonal elements of the transition matrix to zero. You got questions about this assumption and replied in the blog comments, but I think you are still missing an important aspect of this.
The (omitted) diagonal components of the transition matrix give the probability of sticking with that language. In a given row, the sum of the non-diagonal components gives the total probability of leaving that language. The point that I want to make is that the total rates at which people are leaving individual languages are not the same. Thus, the omitted diagonal components are not simply a scalar times the identity matrix.
The impact that this has on the conclusions is likely dramatic.
The row-wise normalizations you have done are with respect to the total number of transitions in that row. For a popular language such as C/C++/Java/Python, those raw numbers represent an extremely small percentage of the overall language population. Thus for those rows, the omitted normalized diagonal element would be $1-\epsilon$, with the other elements of the row summing to a very small number $\epsilon$.
Likewise, for less popular languages such as Go (I am not picking on Go here, I love it as a language!) the quoted transition counts would represent a more significant fraction of the existing population. This would cause the diagonal component to be significantly smaller than 1.0 after normalization.
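To make the normalization argument concrete, here is a minimal sketch that divides each row of switch counts by that language's user population and puts the remainder on the diagonal. All of the numbers and the function name `transition_matrix` are hypothetical; the blog post's data does not include population figures.

```python
import numpy as np

def transition_matrix(switch_counts, populations):
    """Build a row-stochastic matrix from switch counts and per-language populations.

    switch_counts[i][j] is the number of users moving from language i to j;
    populations[i] is the total number of users of language i in the same period.
    """
    counts = np.asarray(switch_counts, dtype=float).copy()
    pops = np.asarray(populations, dtype=float)
    np.fill_diagonal(counts, 0.0)       # only off-diagonal switches are counted
    P = counts / pops[:, None]          # normalize by population, not by row total
    leaving = P.sum(axis=1)             # epsilon_i: total probability of leaving i
    if np.any(leaving > 1.0):
        raise ValueError("more switches than users in at least one row")
    np.fill_diagonal(P, 1.0 - leaving)  # probability of sticking with language i
    return P

# Hypothetical data: a popular language (100000 users, 50 leave) and a
# small one (500 users, 100 leave).
switches = [[0, 50],
            [100, 0]]
populations = [100_000, 500]

P = transition_matrix(switches, populations)
print(P)
```

Under these made-up numbers the popular language keeps a diagonal entry very close to 1, while the small language's diagonal drops well below 1, so the diagonal is not a constant multiple of the identity.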
I want to emphasize that I love the application of this type of model to these questions. It would be great to see it applied in a context where both the existing population of users (diagonal components) and all of the transitions were available. Cheers, Brian.