improve median for pooled vectors #73

Open
bkamins opened this issue Oct 15, 2021 · 1 comment
bkamins commented Oct 15, 2021

If the length of the pool is much smaller than the number of entries, we can run the following (working code):

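using PooledArrays, Statistics  # `middle` comes from Statistics

function median_fast(x::PooledVector)
    n = length(x)
    n == 0 && throw(ArgumentError("median of an empty array is undefined"))
    p = sortperm(x.pool)            # pool indices ordered by pool value
    counts = zeros(Int, length(p))  # counts[r] = number of entries whose ref is r
    for v in x.refs
        counts[v] += 1
    end
    cum = 0                         # number of entries <= the current pool value
    for (j, i) in enumerate(p)
        cum += counts[i]
        if isodd(n)
            if cum >= div(n + 1, 2)
                return middle(x.pool[i])
            end
        else
            if cum >= div(n, 2)
                if cum == div(n, 2)
                    # The upper middle element is the next pool value that is
                    # actually used; skip pool entries with a zero count.
                    k = j + 1
                    while counts[p[k]] == 0
                        k += 1
                    end
                    return middle(x.pool[i], x.pool[p[k]])
                else
                    return middle(x.pool[i])
                end
            end
        end
    end
    error("unreachable reached")
end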

Example timings:

julia> x = PooledArray(rand(1:100, 10^8));

julia> @time median_fast(x);
  0.063734 seconds (4 allocations: 1.781 KiB)

julia> @time median_fast(x);
  0.070556 seconds (4 allocations: 1.781 KiB)

julia> @time median(x);
  0.931585 seconds (3 allocations: 762.940 MiB, 9.86% gc time)

julia> @time median(x);
  0.955555 seconds (3 allocations: 762.940 MiB, 14.29% gc time)
@nalimilan

Cool. BTW, it would also make sense to define an optimized method for sort like in CategoricalArrays: https://github.com/JuliaData/CategoricalArrays.jl/blob/058ee8bd73e32f947363978ca061f69f4a6328af/src/array.jl#L1059-L1092
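
For illustration, here is a minimal sketch of what such a `sort` method might look like, reusing the counting approach from `median_fast` above. The name `sort_fast` is hypothetical, the function returns a plain `Vector` rather than a pooled array, and it is not a copy of the CategoricalArrays implementation linked above; it only assumes the `pool` and `refs` fields already used in this issue.

```julia
using PooledArrays

# Hypothetical helper (not part of PooledArrays.jl): counting sort over the refs.
function sort_fast(x::PooledVector)
    p = sortperm(x.pool)              # pool indices ordered by pool value
    counts = zeros(Int, length(p))    # counts[r] = number of entries whose ref is r
    for r in x.refs
        counts[r] += 1
    end
    out = similar(x.pool, length(x))  # plain Vector with the pool's element type
    pos = 1
    for i in p
        c = counts[i]
        # write c copies of the i-th pool value; unused pool entries have c == 0
        fill!(view(out, pos:pos+c-1), x.pool[i])
        pos += c
    end
    return out
end

x = PooledArray(rand(1:100, 10^4))
sort_fast(x) == sort(x)
```

The cost is one pass over the refs plus a sort of the pool, so it should only pay off in the same regime as `median_fast`, i.e. when the pool is much smaller than the vector.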
