Diagonal elements of the transition matrix are non-zero and not-equal #3

Open
ellisonbg opened this issue Mar 17, 2017 · 3 comments

Comments

@ellisonbg

Hi, I love the overall approach of this work and how you have written it up as a blog post and posted the code on GitHub.

Your treatment sets the diagonal elements of the transition matrix to zero. You got questions about this assumption and replied in the blog comments, but I think you are still missing an important aspect of this.

The (omitted) diagonal components of the transition matrix give the probability of sticking with that language. In a given row, the sum of the non-diagonal components gives the total probability of leaving that language. The point that I want to make is that the total rate at which people leave is not the same for every language. Thus, the omitted diagonal components are not simply a scalar times the identity matrix.

The impact that this has on the conclusions is likely dramatic.

The row-wise normalizations you have done are with respect to the total number of transitions in that row. For a popular language such as C/C++/Java/Python, those raw numbers represent an extremely small percentage of the overall language population. Thus for those rows, the omitted normalized diagonal element would be $1-\epsilon$, with the other elements of the row summing to a very small number $\epsilon$.

Likewise, for less popular languages such as Go (I am not picking on Go here, I love it as a language!) the quoted transition counts would represent a more significant fraction of the existing population. This would cause the diagonal component to be significantly smaller than 1.0 after normalization.
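To make the contrast concrete, here is a minimal sketch with made-up numbers. The counts and populations below are hypothetical and are not taken from the blog data; the helper name `stay_probability` is also just for illustration. It assumes the diagonal element is one minus the fraction of a language's users who left during the period.

```python
def stay_probability(leavers, population):
    """Diagonal element: fraction of a language's users who did not switch away."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1.0 - leavers / population


# Hypothetical numbers, for illustration only.
# Same raw count of "leaving" events, very different user populations.
popular = stay_probability(leavers=1_000, population=1_000_000)
niche = stay_probability(leavers=1_000, population=10_000)

print(f"popular language diagonal: {popular:.3f}")
print(f"niche language diagonal:   {niche:.3f}")
```

Under these assumed numbers the same raw transition count gives a diagonal very close to 1 for the popular language and a noticeably smaller one for the niche language, which is the information a normalization by row totals alone cannot see.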

I want to emphasize that I love the application of this type of model to these questions. It would be great to see it applied in a context where both the existing population of users (diagonal components) and all of the transitions were available. Cheers, Brian.

@ellisonbg ellisonbg changed the title Diagonal elements of the transition matrix non-zero and not-equal Diagonal elements of the transition matrix are non-zero and not-equal Mar 17, 2017
@vmarkovtsev

@ellisonbg What do you think about #4 ?

@erikbern
Owner

yeah you could probably improve the analysis – feel free to give it a shot :)

@frobnitzem

I agree with @ellisonbg. The fix is not easy because it requires computing the rates of switching rather than the total number of switches. Say there's a 2x2 matrix:

(stay with Java) (Java -> C)
(C -> Java) (stay with C)

Then, if there were 500 switches from Java to C per year and 5000 Java programmers who were also bloggers that year, then the first row of the rate matrix should be [ 9/10, 1/10 ].
Now, suppose there were 100 switches from C to Java per year and 400 C programmers who were also bloggers that year, then the second row should be [ 1/4, 3/4 ].

The eigenvector of the matrix with eigenvalue 1 will then tell you the future language of those bloggers (use numpy.linalg.eig; since the rows sum to 1, it is the left eigenvector you want, i.e. an eigenvector of the transpose). Different total numbers may distort those calculations. The rates can be estimated by limiting the counts to the past year or two.
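A minimal sketch of that calculation for the 2x2 example, using the counts quoted above (500 of 5000 Java bloggers switching to C, 100 of 400 C bloggers switching to Java). The function names `rate_matrix` and `stationary` are my own, and the code assumes rows are "from" and columns are "to", so each row sums to 1.

```python
import numpy as np


def rate_matrix(switches, population):
    """Turn raw switch counts into per-row rates, with the diagonal = probability of staying."""
    rates = switches / population[:, None]             # fraction of each population that switched
    np.fill_diagonal(rates, 0.0)                       # ignore any self-transitions in the counts
    np.fill_diagonal(rates, 1.0 - rates.sum(axis=1))   # staying = 1 - total leaving
    return rates


def stationary(P):
    """Left eigenvector of a row-stochastic matrix for the eigenvalue closest to 1."""
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigenvalues - 1.0))
    pi = np.real(eigenvectors[:, idx])
    return pi / pi.sum()                               # normalize so the entries sum to 1


# Order of languages: [Java, C]
switches = np.array([[0.0, 500.0],
                     [100.0, 0.0]])
population = np.array([5000.0, 400.0])

P = rate_matrix(switches, population)   # rows: [0.9, 0.1] and [0.25, 0.75]
pi = stationary(P)
print(P)
print(dict(zip(["Java", "C"], pi)))
```

With these rates the long-run split works out to 5/7 Java and 2/7 C, since the balance condition is 0.1 × (Java share) = 0.25 × (C share).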

One way to "guesstimate" the proper diagonal elements could be to assume that the total number of bloggers is proportional to the total number of "I'm leaving this language" posts (with constant P across languages).

For the test data, this would give:

500/(500 + 500P) 500P/(500 + 500P)
100/(100 + 100P) 100P/(100 + 100P)
