Large redesign to add flexibility and user defaults and fewer dependencies #139

Merged on Dec 4, 2023 (73 commits)

junder873 (Collaborator) commented Jul 31, 2023

This is more or less a complete rewrite of the package to create more flexibility, allow the user to easily set defaults, and move toward the use of extensions with Julia 1.9. I did not originally set out to rewrite everything, but as I went for a few additions it became easier to do so. While this is not 100% complete (need a few more details fixed, documentation, make sure it is all tested), I wanted to post it to get some feedback.

In general, this PR does 4 main things:

  1. More flexibility.
    a. There were several things I wanted to do with regression tables that are currently difficult or impossible. The biggest is I wanted to be able to add an extraline that was between two columns, which is useful if you need to add a statistic that compares two values, e.g.:
rr1 = reg(df, @formula(Sales ~ NDI + Price + fe(State) + fe(Year)), Vcov.cluster(:State))
rr2 = reg(df, @formula(Sales ~ NDI + Price + fe(State) + fe(Decade)), Vcov.cluster(:State))
regtable(
    rr1, rr2;
    align=:c,
    extralines=[["New line", 0.55 => 2:3]]
)
# ------------------------------------------
#                               Sales
#                        -------------------
#                           (1)        (2)
# ------------------------------------------
# NDI                     -0.005**    -0.001
#                         (0.003)    (0.002)
# Price                  -0.823***   -0.273*
#                         (0.190)    (0.157)
# ------------------------------------------
# State Fixed Effects       Yes        Yes
# Year Fixed Effects        Yes
# Decade Fixed Effects                 Yes
# ------------------------------------------
# N                        1,380      1,380
# R2                       0.846      0.796
# Within-R2                0.227      0.148
# New line                      0.550
# ------------------------------------------

There is some clunkiness to this solution. If the user does not use the :c align, then the user can pass a DataRow, which has its own settable alignment for each cell.
    b. Another feature is adding statistics. Currently, the user is limited to those provided ($R^2$, Adj. $R^2$, etc.) or to manually adding extralines. I expanded those available (e.g., Pseudo $R^2$), and it is now also possible for the user to define any new statistic and use it just as if it were built into the package.
    c. A long-standing todo in this package is to enable custom block ordering (such as stats in front of fixed effects); this is now possible.
    d. The underlying type in this update is a vector of vectors. This means that if the user needs to combine tables in a somewhat unusual way to fit their needs, it should be more feasible (I want to do more with this).

  2. User settable defaults. I tend to want all my tables to look similar, but to do so in the current package, I need to make sure to change certain settings for each table. Other settings are difficult to change. This update tries to fix these issues.
    a. As an example of this, in my LaTeX tables, I almost always want to use tabular*, but to do so I need to define a new RenderSetting that would match, which takes ~53 lines of code, even though only 2 lines really need to change. Now, the user would only need those 2 lines to change (the package also now exports LatexTableStar which does this as well):
RegressionTables.tablestart(::RegressionTables.AbstractLatex, align) = "\\begin{tabular*}{\\textwidth}{$(align[1])@{\\extracolsep{\\fill}}$(align[2:end])}"
RegressionTables.tableend(::RegressionTables.AbstractLatex) = "\\end{tabular*}"

    b. As another example, I prefer using T-Stats in my tables. In the current package, I would need to set this in every table. Now, this default is settable by the user:

RegressionTables.default_below_statistic() = TStat
  3. With Julia 1.9, extensions are now possible. I think this is valuable for a package like RegressionTables to minimize dependencies. This part of the proposed update is definitely not complete (I have tested it quite a bit with FixedEffectModels.jl, but not with the others).
  4. As an added bonus, I wanted this package to be more friendly to other types of data. For example, I often create descriptive tables that need to make it into a paper, ideally with a similar style to my regression tables. The DataFrames.jl package provides a good setup for this, so I just needed to add a new function that works for this:
RegressionTables.RegressionTable(
    names(df_described),
    Matrix(df_described)
)
# ---------------------------------------------------------------------
# variable      mean        std         q25        median        q75
# ---------------------------------------------------------------------
# State         26.826      14.481      15.000      26.500       40.000
# Year          77.500       8.659      70.000      77.500       85.000
# Price         68.700      41.986      34.775      52.300       98.100
# Pop        4,537.113   4,828.836   1,053.000   3,174.000    5,280.250
# Pop16      3,366.616   3,641.847     781.175   2,315.300    3,914.325
# CPI           73.597      36.529      38.800      62.900      107.600
# NDI        7,525.023   4,747.859   3,327.869   6,281.201   11,024.110
# Sales        123.951      30.991     107.900     121.200      133.200
# Pimin         62.899      38.323      31.975      46.400       90.500
# ---------------------------------------------------------------------

With these changes, I also added a lot of other changes that I think are useful:

  1. There is now an order argument, which keeps all of the coefficients but changes the order (a drop argument is a work in progress, I think it will be pretty simple)
  2. Related to order and drop, these arguments are now more flexible. In the current package, you need to provide the full string of each coefficient you want to keep. This proposed update has 4 options: strings, integers, ranges, and regex. Integers and ranges are pretty straightforward; regex applies the occursin function, so any coefficient names that match the regex will be used (kept, dropped, or moved higher in the order).
  3. Every statistic type has its own custom formatting options. For example, if the user wants $R^2$ values to be displayed as a percentage while other statistics are still displayed in the old way, this is now possible.
  4. I changed how the renaming works for interactions and categorical variables. Before, an interaction was treated as a completely different variable name; now, each piece of the interaction has the name of the base variable. In other words, relabeling these variables is much simpler (similarly for categorical variables):
rr1 = reg(df, @formula(Sales ~ NDI * Price), Vcov.cluster(:State))
rr2 = reg(df, @formula(Sales ~ NDI + Price), Vcov.cluster(:State))
regtable(
    rr1, rr2;
    labels=Dict("NDI" => "Newspaper Advertising", "Price" => "Cigarette Price"),
    order=[r"Price", r"Adv"],
)
# -----------------------------------------------------------------
#                                                    Sales
#                                           -----------------------
#                                                  (1)          (2)
# -----------------------------------------------------------------
# Cigarette Price                            -0.813***    -0.938***
#                                              (0.251)      (0.173)
# Newspaper Advertising & Cigarette Price       -0.000
#                                              (0.000)
# Newspaper Advertising                       0.007***     0.007***
#                                              (0.001)      (0.002)
# (Intercept)                               133.068***   138.480***
#                                              (8.502)      (5.753)
# -----------------------------------------------------------------
# N                                              1,380        1,380
# R2                                             0.212        0.209
# -----------------------------------------------------------------
  5. Related to the change to variable naming, different table types now use different interaction symbols. For example, the above would be Newspaper Advertising $\times$ Cigarette Price if using LatexTable(). This prevents the "&" symbols from being a problem in LaTeX tables, but is also settable by the user if the user prefers \& or something similar.
  6. For fixed effect models, there is now a suffix applied to the names. I think this makes it a little more consistent from a display perspective.
  7. Several of the default display options are now dependent on what is passed. For example, in the current package the "estimator section" is always printed. Now, it is only printed (by default, which is again user settable) if multiple regression types are passed (e.g., IV and OLS). Another smaller example is that column numbers are only printed if more than 1 regression is passed.
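
To make the string / integer / range / regex selectors described in the list above concrete, here is a minimal sketch of how such selectors could be resolved against coefficient names. It is not the package's implementation: `select_indices` and `reorder` are hypothetical helpers written only for illustration, and the sketch assumes that selected coefficients move to the front while the rest keep their original order.

```julia
# Hypothetical helpers, not part of RegressionTables.jl: resolve one selector
# against a vector of coefficient names and return the matching indices.
select_indices(names::Vector{String}, s::AbstractString) = findall(==(s), names)
select_indices(names::Vector{String}, i::Integer) = [i]
select_indices(names::Vector{String}, r::AbstractRange{<:Integer}) = collect(r)
# Regex selectors use occursin, so every name containing a match is selected.
select_indices(names::Vector{String}, rx::Regex) = findall(n -> occursin(rx, n), names)

# Move the selected coefficients to the front (in selector order) and keep
# all remaining coefficients afterwards in their original order.
function reorder(names::Vector{String}, selectors)
    picked = Int[]
    for s in selectors
        for i in select_indices(names, s)
            i in picked || push!(picked, i)   # skip names already placed
        end
    end
    rest = setdiff(eachindex(names), picked)
    return names[vcat(picked, rest)]
end

coefs = ["NDI", "Price", "NDI & Price", "(Intercept)"]
reordered = reorder(coefs, [r"Price", "NDI"])
```

Because the regex is matched with occursin, `r"Price"` picks up both "Price" and the interaction term, which is the behavior the list item describes.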

I am sure there is something in there I forgot. I would very much appreciate any feedback. Where possible, I tried to stick with the current user interface, but obviously with such large changes the interface changes as well. This proposal is also not completely finished, particularly the extensions.

resolves #130, resolves #109, resolves #105, resolves #90, resolves #52, resolves #17, resolves #12

(It would also sort of solve #129 and #128)

@junder873

This comment was marked as resolved.

@junder873 junder873 marked this pull request as draft July 31, 2023 15:29
jmboehm (Owner) commented Jul 31, 2023

Thanks a lot for the PR. I haven't gone through all this yet, but overall it looks like a very sensible set of improvements.

  • Of course tests and documentation would need to be updated.
  • There are a bunch of breaking changes. I feel that the implicit contract with the user is that while it's ok to have some breaking changes every once in a while, we should also set out to minimize them (nobody enjoys having to update their code). So I think it's worth writing them all up and thinking about whether they're necessary and whether it's possible to mitigate them.
  • Finally, since you've rewritten most of the package (and since my ability to be involved in FOSS development is rapidly deteriorating) I'd suggest and invite you to take over maintaining the package (of course this needs to be ok with @greimel as well). It would be much more costly for me to maintain code that was largely written by you.

I wanted to check how comparable the new backend is to what already exists. There are a few settings that might change, but this is to show that the results are very comparable between the two. There are some minor spacing changes when dealing with multicolumn objects, but it is otherwise capable of producing very similar tables.
greimel (Collaborator) commented Aug 7, 2023

Thanks for the effort, @junder873.

I agree with @jmboehm. I think it would be important that the tests are updated (and pass) so that we can see more clearly what changes from a user's perspective. (Hopefully, most old code would still run.)

We can only merge this if you agree to maintain the new codebase.

From a maintainer's perspective, such a big rewrite is really hard to review. Not sure what the perfect solution is here. One option would be to split the PR into small chunks that we can actually review. Another would be to trust the tests and just merge if they look good.

junder873 (Collaborator, Author) commented

Thank you both for the input (and the past work on this package). Just to respond to the different comments: I am happy to maintain the package going forward, though I appreciate your input. I am trying to minimize the breaking changes that the user faces, though this isn't perfect (see below). I have focused less on maintaining compatibility on the backend pieces; the changes there are just too big to make that doable. I can try to think through creating multiple pull requests. It is possible that doing the backend first would work, but I would have to think more about how to do it.

I have been a little slow to work on the tests because I have been trying to make sure the front end works well and work on the backward compatibility. In the most recent set of changes, I added more backward compatibility and reran the tests to see where things stand. Before I get to those results, a few notes:

  • I focused on the tests that produce tables, so the label_transforms and decorations tests still need updating
  • With this set of updates, I am proposing to change some defaults to (hopefully) be useful. To create a comparable set of results, I undid these changes. Specifically, the following 4 defaults are different in the current proposal compared to the tests:
RegressionTables.default_fe_suffix(x::RegressionTables.AbstractRenderType) = ""
RegressionTables.default_print_control_indicator(x::RegressionTables.AbstractRenderType) = false
RegressionTables.default_regression_statistics(x::RegressionTables.AbstractRenderType, rrs::Tuple) = [Nobs, R2]
RegressionTables.default_print_estimator(x::RegressionTables.AbstractRenderType, rrs) = true

The first two are new features (adding a suffix after fixed effects and printing a yes/no if coefficients are omitted). The next two defaults vary based on conditions: a nonlinear regression will include the Pseudo R2, and the estimator section is only printed if more than one type of estimator is provided. Because these defaults did not exist or are different from the current version of the package, I changed them back to make the tests comparable.
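
As a rough illustration of how defaults written as functions can both depend on what is passed and be overridden by the user, here is a self-contained sketch. Everything in it is a toy stand-in: `ToyModel` and `toy_table` are hypothetical, the render types are redefined locally rather than taken from the package, and only the name `default_print_estimator` is borrowed from the lines above.

```julia
# Toy stand-ins for the package's render types (redefined here so the sketch runs alone).
abstract type AbstractRenderType end
struct AsciiTable <: AbstractRenderType end
struct LatexTable <: AbstractRenderType end

# Hypothetical model type that only records which estimator was used.
struct ToyModel
    estimator::String   # e.g. "OLS" or "IV"
end

# Package-side default: print the estimator section only when the estimators differ.
default_print_estimator(::AbstractRenderType, rrs) =
    length(unique(m.estimator for m in rrs)) > 1

# The table function looks the default up through dispatch unless the caller overrides it.
function toy_table(rrs...; render::AbstractRenderType = AsciiTable(),
                   print_estimator::Bool = default_print_estimator(render, rrs))
    return print_estimator ? "with estimator section" : "without estimator section"
end

same  = toy_table(ToyModel("OLS"), ToyModel("OLS"))
mixed = toy_table(ToyModel("OLS"), ToyModel("IV"))

# User-side override for a single render type: always print for LaTeX tables.
default_print_estimator(::LatexTable, rrs) = true
latex = toy_table(ToyModel("OLS"), ToyModel("OLS"); render = LatexTable())
```

The override is a one-line method definition, which is the same pattern the four `RegressionTables.default_*` lines above use to restore the old behavior for the tests.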

First, going through the actual table output, for the most part the results are similar. There are a few places where spacing is different, often connected to allowing the Estimator to be more than OLS, IV or NL. The HTML tables are also quite different since the padding information is moved into the style section of the table instead of between each cell.

From a user perspective, things are mostly similar. The biggest difference that shows up is that file is now a separate argument to renderSettings. I will discuss why this is and why I am not sure how to fix that next. Other differences are:

  • The regressors argument is now keep. This is one I can probably put back in with a deprecation warning; regressors seems inconsistent with the other arguments, drop and order, which work similarly to keep
  • custom_statistics is gone, replaced by extralines, along with how these work. Simply passing two vectors with the information (and the label in the first argument) works. This means that labels are not necessary there anymore
  • The decorator arguments (estim_decoration, number_regressions_decoration and below_decoration) are gone. The idea is that for most users, these would be "set and forget" type arguments, so these are changeable but the idea is to change it for a table type or all tables, not necessarily for one specific table. In the tests, I create new table types for the two tables where this matters, but from a user perspective, if those are the settings they want they would not need to create a type.

I don't see any other differences between the current package and this proposal, so I wanted to come back to the differences in renderSettings. The renderSettings argument now expects an AbstractRenderType. The idea behind this type system is to make more use of the Julia type system. The render type provided controls how every other type is rendered, including rounding, labels and decorators as well as the defaults used in the regression table. Importantly, this allows users to set up defaults on a per table basis. For example, if a user has two tables (e.g., a descriptive latex table and a regression latex table), the needs might be different for rounding, headings, etc, but those changes should not require a lot of work to create.

In order to keep the creation of these new types as simple as possible, I didn't want to include any actual information with the type, so a file does not fit well. One solution (that feels kind of hacky) is to use multiple dispatch to split the arguments, so something like:

const asciiOutput = AsciiTable
const latexOutput = LatexTable
const htmlOutput = HtmlTable
(::Type{T})(file::String) where {T<:AbstractRenderType} = T(), file # returns tuple
default_render(x::Nothing) = AsciiTable()
default_render(x::AbstractRenderType) = x
default_render(x::Tuple{<:AbstractRenderType, String}) = x[1]
default_file(rndr::AbstractRenderType, renderSettings) = nothing
default_file(rndr::AbstractRenderType, renderSettings::Tuple{<:AbstractRenderType, String}) = renderSettings[2]

function regtable(
    rrs...;
    renderSettings = nothing,
    rndr::AbstractRenderType = default_render(renderSettings),
    file= default_file(rndr, renderSettings),
    ...
)

This would use the old naming system and allow old code to continue to work (based on some simple testing), probably with a deprecation warning. It is obviously a little ugly and could create some confusion if somebody provides both renderSettings and rndr, since rndr would dominate.
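
The dispatch-based splitting proposed above can be tried in isolation with the following sketch. The render types are empty toy structs defined here so the block runs on its own, and `toy_regtable` is a hypothetical stand-in for regtable that only returns what it resolved; the remaining definitions follow the proposal.

```julia
# Toy render types carrying no data, as the proposal intends.
abstract type AbstractRenderType end
struct AsciiTable <: AbstractRenderType end
struct LatexTable <: AbstractRenderType end

# Old-style names kept as aliases of the new types.
const asciiOutput = AsciiTable
const latexOutput = LatexTable

# Calling any render type with a file name returns a (render, file) tuple.
(::Type{T})(file::String) where {T<:AbstractRenderType} = (T(), file)

default_render(::Nothing) = AsciiTable()
default_render(x::AbstractRenderType) = x
default_render(x::Tuple{<:AbstractRenderType, String}) = x[1]
default_file(::AbstractRenderType, renderSettings) = nothing
default_file(::AbstractRenderType, renderSettings::Tuple{<:AbstractRenderType, String}) = renderSettings[2]

# Hypothetical stand-in for regtable: returns the resolved render type and file.
function toy_regtable(; renderSettings = nothing,
                      rndr::AbstractRenderType = default_render(renderSettings),
                      file = default_file(rndr, renderSettings))
    return (rndr, file)
end

old_style = toy_regtable(renderSettings = latexOutput("table.tex"))
new_style = toy_regtable(rndr = LatexTable(), file = "table.tex")
plain     = toy_regtable()
# Both given: rndr wins for the render type, but the file still comes from renderSettings.
both      = toy_regtable(renderSettings = latexOutput("table.tex"), rndr = AsciiTable())
```

The last call shows the source of confusion mentioned above: the render type and the file can end up coming from different arguments.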

Once again, I appreciate the input and want to minimize the change for users.

greimel (Collaborator) left a comment

could you please remove these two lines? they lead to an error on CI.

Can you also change the minimum Julia version in this line

- '1.8' # Replace this with the minimum Julia version that your package supports. E.g. if your package requires Julia 1.5 or higher, change this to '1.5'.
to 1.9?

Hopefully we can then see the state of the tests.

test/RegressionTables.jl (outdated)
test/runtests.jl (outdated)
greimel (Collaborator) commented Aug 7, 2023

I am trying to minimize the breaking changes that the user faces

Thanks!

You are proposing the following (sets of) changes.

  1. Change defaults
  2. Separate file from renderSettings
  3. Change regressors to keep
  4. Remove custom_statistics
  5. Handle decorators differently
  6. Remove dependencies, introduce package extensions

Ideally, you prepare separate PRs, where each PR contains the minimal changes to introduce your proposed change.

However I can imagine that some of these changes would depend on the same fundamental backend work. So it might be beneficial to prepare an initial PR with these backend changes first. I think that's what you're contemplating.

I can try to think through creating multiple pull requests, it is possible that doing the backend first would work, but I would have to think more about how to do it.

greimel (Collaborator) commented Aug 7, 2023

I am trying to run CI again. But it doesn't - let me close and re-open this PR

@greimel greimel closed this Aug 7, 2023
This was linked to issues Sep 20, 2023
junder873 (Collaborator, Author) commented

I think I am finally happy overall with the state of this. Do you (@jmboehm and @greimel) have any other suggestions on things to add or change? I have tried to resolve as many of the outstanding pull requests as possible as a part of this.

If not, then I think it is ready to merge.

jmboehm (Owner) commented Sep 20, 2023

Looks great to me, thanks. No more suggestions from my side.

junder873 (Collaborator, Author) commented

@jmboehm For the documentation to work after merging this, there would need to be some setup under the GitHub settings for the project. The GitHub Pages option would need to be changed and a "DOCUMENTER_KEY" added so that TagBot can publish versions of the docs.

junder873 (Collaborator, Author) commented

@jmboehm Just following up: I can't change the setting that allows the documentation to work properly. Once that is done, though, I am happy to merge this request.

jmboehm (Owner) commented Nov 18, 2023

Hi @junder873 , apologies for the delay. I've now added the deploy key and the environment variable to the repo. Is there a way to test to see whether it works?

junder873 (Collaborator, Author) commented

As far as I can tell, there is no easy way to test the documenter key. However, as long as the GitHub Pages part works, it is possible to manually upload versions of documentation if there is an error with the key.

jmboehm (Owner) commented Dec 2, 2023

Great. Is this ready to be merged? I feel I've been holding this up longer than it should have been held up.

junder873 (Collaborator, Author) commented

I will go ahead and merge it, thank you so much for the suggestions and help.

@junder873 junder873 merged commit c80aea9 into jmboehm:master Dec 4, 2023
4 of 6 checks passed