custom_statistics : format and storing stats in regression object #129

Closed
Gkreindler opened this issue Mar 16, 2023 · 4 comments · Fixed by #139

@Gkreindler

I find the custom_statistics option super helpful, for example for displaying the mean of the outcome variable, number of units (in a panel setting), etc.

I have two questions on features that would make this easier to use for me:

  1. Is it possible to specify a custom format for custom_statistics? Currently, I think they are all formatted with %0.3f.
  2. Currently, we need to define a NamedTuple. I define these stats after running my regression and carry them alongside the regression object. It would be more convenient to store these stats in the regression object itself and tell regtable what to print. Is this feasible? (I realize this might (also) be a question for the regression packages.)
@jmboehm
Owner

jmboehm commented Mar 16, 2023

  1. If the statistic is in numeric format, it should be formatted according to statisticformat (which defaults to %0.3f). I agree that this isn't always desirable, in particular if you want to show integers. If the "statistic" is a string, however, it will be displayed as is, so one possible workaround is to format the output before passing it to custom_statistics. To use the example from the test script:
using Statistics, Formatting
comments = ["Baseline", "Preferred"]
means = [sprintf1("%0.6f", Statistics.mean(df.SepalLength[rr1.esample])),
         sprintf1("%0.6f", Statistics.mean(df.SepalLength[rr2.esample]))]
mystats = NamedTuple{(:comments, :means)}((comments, means))
RegressionTables.regtable(rr1, rr2; renderSettings = RegressionTables.asciiOutput(),
    regression_statistics = [:nobs, :r2], custom_statistics = mystats,
    labels = Dict("__LABEL_CUSTOM_STATISTIC_comments__" => "Specification",
                  "__LABEL_CUSTOM_STATISTIC_means__" => "My custom mean"))

If you have an idea about what would be a good interface for the formatting of these additional statistics, let me know.

  2. Yeah, I agree. We had a similar discussion in the context of having the RegressionModel (or FixedEffectModel etc) store custom covariance matrices. Three options: 1) if you feel that the statistic should be part of every regression model, you could file a PR in StatsBase.jl to add the relevant statistic to the abstraction; 2) if you think it's something that the output from FixedEffectModels or GLFixedEffectModels should have, we could add that; 3) if it's very specific to your application, you could define your own struct that contains the RegressionModel (or whatever type your estimator is producing) as well as your custom statistics, and then write a short function that wraps regtable and fills in the relevant custom_statistics. If none of these sound satisfactory, we could have custom_statistics accept functions that take the RegressionModel as their argument and produce formatted output, something like this:
function mycustomstatistic(rr::RegressionModel)
	return 3.141592 	# or something that depends only on rr
end
mystats = NamedTuple{(:foo,)}((mycustomstatistic,))	# (:foo,) is a one-element tuple; (:foo) is just the symbol :foo
RegressionTables.regtable(rr1, rr2; renderSettings = RegressionTables.asciiOutput(), custom_statistics = mystats)
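
Until something like that is supported, the same effect can be approximated on the user side. The following is only a rough sketch: `evaluate_statistics`, `constant_stat`, `nobs_stat` and the toy "models" are hypothetical names made up for illustration, not part of RegressionTables.jl. It applies each statistic function to every model and returns a NamedTuple of string vectors, which is the shape custom_statistics takes in the example above.

```julia
# Hypothetical helper (not part of RegressionTables.jl): apply each statistic
# function to every model and collect the results, one vector per statistic.
function evaluate_statistics(statfuns::NamedTuple, models...)
    # map over a NamedTuple preserves its keys
    return map(f -> [f(m) for m in models], statfuns)
end

# Stand-ins for fitted models; real code would pass regression objects instead.
toy_models = ((nobs = 150,), (nobs = 120,))

# Statistic functions that return already-formatted strings.
constant_stat(m) = "3.14"
nobs_stat(m) = string(m.nobs)

statfuns = (foo = constant_stat, n = nobs_stat)
mystats = evaluate_statistics(statfuns, toy_models...)
```

The resulting `mystats` could then be passed as `custom_statistics = mystats`, assuming the functions only need information stored in the model itself.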

Let me know what you think.

@Gkreindler
Author

"pre"-formatting the statistics as strings is a convenient solution!

For the second issue, the ideal for my workflow would be for RegressionModel to have an attribute other_stats, a Dict that I could load anything application-specific into. After I estimate a model, I like to compute a few other statistics and store them in (attach them to) the rr object. These may also depend on the dataframe, etc., which, as far as I understand, is (for good reasons) not included in rr.

It is a great suggestion to do this via a struct that includes RegressionModel and a wrapper to regtable! I'll report back with an example if I implement that.
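
To illustrate the string pre-formatting workaround mentioned above, here is a minimal sketch that gives a mean two decimals and a count none. It uses the Printf standard library instead of Formatting, and the data are made-up numbers standing in for the estimation samples of two regressions.

```julia
using Printf, Statistics

# Made-up outcome values standing in for two estimation samples.
y1 = [5.0, 4.5, 4.0]
y2 = [6.0, 5.0]

# Format each statistic as a string before handing it to custom_statistics.
means  = [@sprintf("%.2f", mean(y)) for y in (y1, y2)]    # two decimals
counts = [@sprintf("%d", length(y)) for y in (y1, y2)]    # integers, no decimals

mystats = (means = means, counts = counts)
```

Because the entries are strings, they should be displayed as is rather than being passed through statisticformat.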

@Gkreindler
Author

Gkreindler commented Mar 18, 2023

Here is my code to wrap fixed effects models (linear and GL) so that they carry extra statistics, and to then include those statistics (with formatting) in a regression table. I'm sure that this can be much improved!

using FixedEffectModels
using GLFixedEffectModels
using RegressionTables
using DataFrames
import Formatting: sprintf1

### Define FixedEffectModel with additional statistics 
mutable struct FEmodel
    model::Union{FixedEffectModel, GLFixedEffectModel}
    stats::Dict{Symbol, Union{String, Number}}
end

function FEmodel(mymodel::Union{FixedEffectModel, GLFixedEffectModel}, stats::Union{Nothing, Dict{Symbol, Union{String, Number}}}) 
    if isnothing(stats)
        emptydict = Dict{Symbol, Union{String, Number}}()
        return FEmodel(mymodel, emptydict)
    end
    return FEmodel(mymodel, stats)
end

function regtable_stats(
    mymodels::Vararg{FEmodel}; 
    custom_statistics_order::Union{Nothing, Vector{Symbol}, Tuple{Vararg{Symbol}}}=nothing, # Tuple{Symbol} would only match a 1-tuple
    custom_statistics_format::Union{Nothing, Dict{Symbol, String}}=nothing, 
    kwargs...)
    
    # all statistics names
        allkeys = union([Set(keys(mymodel.stats)) for mymodel=mymodels]...)
        if isnothing(custom_statistics_order)
            custom_statistics_order = sort(allkeys |> collect)
        else
            @assert Set(custom_statistics_order) == allkeys
        end

    # formatting
        if isnothing(custom_statistics_format)
            custom_statistics_format = Dict{Symbol, String}()
        end

    custom_statistic_dict = Dict{Symbol, Any}()
    for mykey=custom_statistics_order
        stat_entries = Vector{String}(undef, length(mymodels))

        for (idx, mymodel) = enumerate(mymodels)
            if mykey ∈ keys(mymodel.stats)

                myentry = mymodel.stats[mykey]
                
                if mykey ∈ keys(custom_statistics_format)
                    myformat = custom_statistics_format[mykey]
                else
                    if isa(myentry, String) || isa(myentry, Bool)
                        myformat = "%s"
                    elseif isa(myentry, Int)
                        myformat = "%'i" # "%d"
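                    elseif isa(myentry, Real)
                        myformat = "%0.3f"
                    else
                        # other Number types (e.g. Complex) have no default; fail with a clear message
                        error("No default format for statistic $mykey of type $(typeof(myentry))")
                    end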
                end
                
                stat_entries[idx] = sprintf1(myformat, myentry)
            else
                stat_entries[idx] = ""
            end
        end 

        custom_statistic_dict[mykey] = stat_entries
    end

    # custom stats names
        custom_stats_vectors = [custom_statistic_dict[mykey] for mykey=custom_statistics_order]
        
        custom_statistics = NamedTuple{Tuple(custom_statistics_order)}(Tuple(custom_stats_vectors))

    # call regtable
        my_models = [mymodel.model for mymodel=mymodels]
        return regtable(my_models..., custom_statistics=custom_statistics; kwargs...)
end


    ### Fake data
    testdf = DataFrame("a" => [1,0,1,0.5], "b" => [1.2, 3.2, 1.1, 1.01], "fe" => [1, 1, 0, 0])
    testdf.b2 = testdf.b .^ 2

    ### Run some regressions
        r1 = reg(testdf, term(:a) ~ term(:b) + FixedEffectModels.fe(:fe))
        rr = FEmodel(r1,  Dict(:quadratic => 0.0, :linear => "a"))

        r2 = reg(testdf, term(:a) ~ term(:b2) + FixedEffectModels.fe(:fe))
        rr2 = FEmodel(r2, Dict(:quadratic => 1.99, :square => true))

        mymodels = [rr, rr2]

    ### Table -- minimal options
        regtable_stats(mymodels..., renderSettings = asciiOutput())

    ### Table -- full control
        custom_statistics_order = [:square, :quadratic, :linear] # Need to include ALL stats here, otherwise errors
        custom_statistics_format = Dict(:square => "%s", :quadratic => "%0.1f") # ok to only include some
        regtable_stats(mymodels..., 
                custom_statistics_order=custom_statistics_order,  
                custom_statistics_format=custom_statistics_format,
                renderSettings = asciiOutput())

@jmboehm
Owner

jmboehm commented Mar 19, 2023

Looks neat! I'm wondering whether it would make sense to implement this as a new parametric type in RegressionTables.jl, something like this:

struct AugmentedRegressionModel{T}
    model::T
    stats::Dict{Symbol, Union{String, Number}}
end

The advantage would be that it could work out-of-the-box with any output model type, including anything that's implementing the StatsBase abstraction. That could be a neat way to override the estimated VCov matrix with some custom one as well...
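
A rough sketch of how such a wrapper might forward the usual accessors to the wrapped model is below. This is an illustration under stated assumptions, not the package's implementation: it assumes the StatsAPI package is available, makes the wrapper a subtype of `StatsAPI.RegressionModel` (the struct above does not), forwards only three accessors, and uses a made-up `ToyModel` in place of a real estimator.

```julia
using StatsAPI

# Assumption: subtyping RegressionModel lets generic code accept the wrapper.
struct AugmentedRegressionModel{T<:StatsAPI.RegressionModel} <: StatsAPI.RegressionModel
    model::T
    stats::Dict{Symbol, Union{String, Number}}
end

# Forward a few accessors to the wrapped model; a full version would cover more.
StatsAPI.coef(m::AugmentedRegressionModel) = StatsAPI.coef(m.model)
StatsAPI.vcov(m::AugmentedRegressionModel) = StatsAPI.vcov(m.model)
StatsAPI.nobs(m::AugmentedRegressionModel) = StatsAPI.nobs(m.model)

# Toy model for illustration only; stands in for any fitted regression type.
struct ToyModel <: StatsAPI.RegressionModel
    beta::Vector{Float64}
end
StatsAPI.coef(m::ToyModel) = m.beta
StatsAPI.vcov(m::ToyModel) = [1.0 0.0; 0.0 1.0]
StatsAPI.nobs(m::ToyModel) = 10

aug = AugmentedRegressionModel(
    ToyModel([0.5, 1.5]),
    Dict{Symbol, Union{String, Number}}(:ymean => 3.2, :sample => "full"),
)
```

With forwarding in place, code that only calls the generic accessors could treat `aug` like the underlying model, while a table function could additionally read `aug.stats`.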
