Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model 7: Add GenDB class and incorporate method. #200

Merged
merged 4 commits into from
Sep 19, 2024

Conversation

emilyfertig
Copy link

This could use more thorough test coverage but I'd like to get the major pieces in and expand testing as time permits.

Copy link

@ThomasColthurst ThomasColthurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the large number of comments, but they are mostly nits. Overall, this pull request contains a bunch of really elegant code.

cxx/BUILD Outdated
@@ -54,6 +54,23 @@ cc_library(
],
)

cc_library(
name = "gendb_lib",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just call this "gendb" unless you think we are going to want a binary named gendb down the road.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (we probably want a binary but we can figure out names later).

cxx/gendb.cc Outdated
std::vector<std::string>::const_iterator class_path_start,
std::vector<std::string>::const_iterator class_path_end, int class_item) {
if (class_path_end - class_path_start == 1) {
// These are domains and we need to DFS-traverse the class's

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "These are domains" part of this comment is unclear, particularly what the "these" is referring to.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that was a nonsensical note to self. Updated to clarify.

cxx/gendb.cc Outdated

// This function walks the class_path of the query, populates the global
// reference_values table if necessary, and returns a sampled set of items
// for the query relation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which query relation? This function doesn't take any relation name as input.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query relation for the class path (updated to clarify).

// reference_values table if necessary, and returns a sampled set of items
// for the query relation.
std::vector<int> GenDB::sample_entities_relation(
std::mt19937* prng, const std::string& class_name,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth describing in the comments the relationship that is assumed between class_name and class_path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cxx/gendb.cc Outdated
// These are domains and we need to DFS-traverse the class's
// parents, similar to PCleanSchemaHelper::compute_domains_for.
return sample_class_ancestors(prng, class_name, class_item);
} else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: since the if block always returns, you can drop the else and regain an indent level.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cxx/gendb.cc Outdated
hirm->incorporate(prng, query_rel_name, items, value);
}

// Generates a vector of items from the clean relation domains, with the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called whenever class_path.size() == 1, and as I commented on the Model 7 design doc, when record_class_is_clean = False, you can have noisy relations in this case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant to respond on the doc -- this works for noisy relations too where class_path.size() == 1, since in that case the noisy relation domains/items are the same as those for the clean relation. I put "clean relation" in the method name in the doc, since the most common case is that the class ancestors are items of a clean relation (I updated the method name so now it doesn't refer to "clean relation").

cxx/gendb.hh Outdated
// This data structure contains entity sets and linkages. Semantics are
// map<class_name, map<primary_key, map<reference_field_name, ref_val>>>,
// where primary_key and ref_val are (integer) entity IDs.
std::map<std::string, std::map<int, std::map<std::string, int>>>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current design, I don't think you ever need to say look up all of the primary keys given a class_name or anything like that. So this could be replaced with a

std::map<std::tuple<std::string, std::string, int>, int>

and probably should, because that will be both more efficient (only one map lookup instead of three) and more informative to the reader.

It might also make sense to call this variable something like "foreign_keys", which foreign_keys[make_tuple(table_name, field_name, row_id)] giving you the foreign key that is entered into the row_id-th row of table_name in the field_name column. (Which equals to row number in the table of field_name's type.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I changed the structure as you suggested.

I agree we should be better about unifying terminology. This PR refers to "reference" in a lot of places to essentially mean foreign keys (which the schema refers to as class variables). I'd rather stick with "reference_values" for now instead of introducing the "foreign keys" term and maybe change it later as part of a bigger renaming.

cxx/gendb.hh Outdated
std::map<std::string, std::map<int, std::map<std::string, int>>>
reference_values;

HIRM* hirm;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth commenting here that this class owns hirm.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cxx/gendb.hh Outdated

HIRM* hirm;

// CRPs for latent entities, where the "tables" are entity IDs and the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reverse the order of these two sentences when explaining domain_crps: "domain_crps[class_name] gives the CRP used to generate entity ID's for that class. Inside that CRP, the tables are entity IDs and the customers are unique identifiers of observations of that class."

(Basically, explain the most important thing first and then explain the confusing details!)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done.

GenDB(std::mt19937* prng, const PCleanSchema& schema,
bool _only_final_emissions = false, bool _record_class_is_clean = true);

void incorporate(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these methods need descriptive comments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@emilyfertig
Copy link
Author

Thanks for the review!

@emilyfertig emilyfertig merged commit ef1ac21 into master Sep 19, 2024
2 checks passed
@emilyfertig emilyfertig deleted the emilyaf-091024-model7-incorporate branch September 19, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants