You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hey Zhe, in looking at the errors Poirot provides for ill-typed programs(to investigate localization of errors and recovering paths) I worked my way through some of the synthesize programs for section 6.3 of the evaluation.
In my opinion, there are some disappointing issues with this part of the evaluation though I understand this wasn't necessarily the focus of the paper. It's probably important to revisit this at least for the synthesis direction of current work.
I wasn't here when this was done, feel free to correct me where ever I am mistaken. I was running on the artifact branch of this repository with minor modifications on my end for debugging
Limited Structure: From the paper: "Since the space of these programs is potentially quite large (possibly infinite), we constrain the synthesizer to only generate programs with bounded function call depths; in our experiments, this bound was set to three.". In general, the corpus of programs used was disappointing. It's much more limited than the paper suggests with fewer components and only programs sharing identical structure. In particular, Figure 10 b is not represented. Your example of a sortedlist is also not represented and maybe not synthesizeable by Cobalt?
Duplicates: It's silly but some of the programs are duplicated and should be filtered out.
gt_eq_int_gen: I think you must of used a different specification for synthesis than what is currently used in the repository. It seems to just be int_gen with one extra step at the moment. This causes issues in a couple of places since it ends up violating: "The generator outputs all programs that are safe with respect to the specification.". However, you do seem to need this behavior to get some "complete" examples for both sizedlist and sortedlist. Are there somehow two specifications for this floating around?
I also maintained a list of example test cases for sizedlist and sorted list of programs that I thought were handled incorrectly for one reason or another.
SizedList: There are a bunch of obviously programs in terms of coverage like program 24 or 80 which uses some bug with singleton lists to get full coverage.
SizedList: Obviously wrong examples which get rejected for safety, many like 65/104 which creates double-sized lists. 10 which violates the recursion constraint.
SortedList: 2 or 11 which are just single element lists which get full coverage
SortedList: 3 which creates a list that doesn't use x0 at all which violates safety
SortedList: 9 passes the coverage check despite creating a sorted list of just the same element.
SortedList: 10 Might be a correct though disappointing solution except that I'm not sure what the correct spec of gt_eq_int_gen should be. With what my current expectation is, this should not pass.
SortedList: 12 is similar to previous cases except that the first element is probably too small no? This might be more of an issue with the spec of sortedlist instead.
As a general note, for atleast these two cases, I would say the false positive rate is high, over 50% at least.
It's interesting to think about timeouts here. The repository has this value set pretty high. Sometimes this helps with certain checks but in general this isn't going to work for synthesis. It's interesting to think about how changing that value my impact the synthesis algorithm.
The text was updated successfully, but these errors were encountered:
Pat-Lafon
changed the title
Issues with "Completeness of Synthesized Generators" evaulation
Issues with "Completeness of Synthesized Generators" evaluation
Jan 31, 2024
Hey Zhe, in looking at the errors Poirot provides for ill-typed programs(to investigate localization of errors and recovering paths) I worked my way through some of the synthesize programs for section 6.3 of the evaluation.
In my opinion, there are some disappointing issues with this part of the evaluation though I understand this wasn't necessarily the focus of the paper. It's probably important to revisit this at least for the synthesis direction of current work.
I wasn't here when this was done, feel free to correct me where ever I am mistaken. I was running on the artifact branch of this repository with minor modifications on my end for debugging
"Since the space of these programs is potentially quite large (possibly infinite), we constrain the synthesizer to only generate programs with bounded function call depths; in our experiments, this bound was set to three."
. In general, the corpus of programs used was disappointing. It's much more limited than the paper suggests with fewer components and only programs sharing identical structure. In particular, Figure 10 b is not represented. Your example of a sortedlist is also not represented and maybe not synthesizeable by Cobalt?"The generator outputs all programs that are safe with respect to the specification."
. However, you do seem to need this behavior to get some "complete" examples for both sizedlist and sortedlist. Are there somehow two specifications for this floating around?I also maintained a list of example test cases for sizedlist and sorted list of programs that I thought were handled incorrectly for one reason or another.
As a general note, for atleast these two cases, I would say the false positive rate is high, over 50% at least.
It's interesting to think about timeouts here. The repository has this value set pretty high. Sometimes this helps with certain checks but in general this isn't going to work for synthesis. It's interesting to think about how changing that value my impact the synthesis algorithm.
The text was updated successfully, but these errors were encountered: