Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iecodebook: data signature .txt file is a binary #250

Open
luizaandrade opened this issue Feb 6, 2023 · 11 comments
Open

iecodebook: data signature .txt file is a binary #250

luizaandrade opened this issue Feb 6, 2023 · 11 comments
Assignees
Labels
minor bug command still works, no calc errors

Comments

@luizaandrade
Copy link
Collaborator

image

Maybe this is intentional, but in that case, it should not have a .txt file format.

@luizaandrade luizaandrade added the minor bug command still works, no calc errors label Feb 6, 2023
@bbdaniels
Copy link
Contributor

bbdaniels commented Feb 6, 2023

Thanks for flagging, but I don't think this is a bug -- this is the intended functionality and I am aware of this behavior. The datasignature file is created with a .txt file extension so it syncs to git under our standard .gitignore settings, and can then be used on another clone to verify data contents as intended. The contents of the file appear as below -- I do not know why this is read as a binary file (other than that it is declared as one in the header; it opens just fine in plaintext editor as shown). Tracking file level changes using git is not necessary here either so I did not pursue a workaround to make it function with git compare features.

If you agree, close issue; otherwise, my suggestion would be to produce it with the .dtasig extension (edit: or .sig.txt, although I don't know how "double" extensions work on other OS). However, users would then need to update all their .gitignore files to include that extension and achieve the intended command functionality. I thought that was an unnecessary extra step when deciding on this approach to the feature, since it has effectively no technical cost other than the ugly aesthetics of this.

Screen Shot

@luizaandrade
Copy link
Collaborator Author

Interesting. I see this on Notepad
image

And this on VSC, where I can't open it
image

But it pointing out there there are changes in a binary file on Stata is enough for the intended workflow, then we can close this. But perhaps it's useful to explain in the help file what is the purpose of the -sig.txt file?

@bbdaniels
Copy link
Contributor

What would you want to add beyond the helpfile description here and the write-up in the SJ paper that were just accepted? We have just finished a very long review process for this feature set, so I am a bit confused why this is coming up now 😅

signature            This option requests that a datasignature be verified in the same destination folder as the
                            codebook and/or data are saved, and will return an error if a datasignature file is not found or
                            is different, guaranteeing data has not changed since the last reset of the signature.

@kbjarkefur
Copy link
Contributor

One never knows when a bug appears. I do not think this is major bug (as in requiring a new release), but it should be addressed.

My 2 cent is this multi-staged rocket.

  1. Do not allow any file extension other than the .dtasig intended. So if someone try to use the name auto-sig.txt make the name auto-sig.txt.dtasig. File extensions does not change what a file is, only what it tells the computer it should be. And to many software .txt is raw text, but when the file is not raw text it does not want to interact with it as it could be a corrupted file that potentially could harm the computer. This way users can distinguish this binary file from raw text .txt files in their .gitignore in case they do not want any binary files in their repo.

  2. This is much more work, but would be the best thing. Use the r() values from datasignature to write a real raw text file with the same information. That is what iesave does.

But since iesave does that maybe it is ok to not do that here. And then users who wants to track this type of info use iesave for this feature and the people who is ok that this is not tracked in git can use iecodebook for this feature. But then we should not allow the file from iecodebook to seem like a raw text file.

@kbjarkefur
Copy link
Contributor

kbjarkefur commented Feb 7, 2023

(edit: or .sig.txt, although I don't know how "double" extensions work on other OS)

The file extension is the text after the last .. So file kristoffer.luiza.txt.ben has the file extension .ben. This does not change if the file is binary or raw text or anything. It just tells the computer it should be the file type .ben but it says nothing about what .ben is or should be.

But since this is just text that is a part of a file name then a language can make use of "double extensions". If you end your file names in Spark with .csv.gz then Spark expect it to be a .gz compressed file that includes a single .csv file. Such a file can be read on the fly in Spark without decompress it first. But this is just a naming convention so Spark knows what to expect, which then again is what all file extensions to any computer are.

@bbdaniels
Copy link
Contributor

bbdaniels commented Feb 7, 2023 via email

@kbjarkefur
Copy link
Contributor

kbjarkefur commented Feb 7, 2023

Git can handle all files. Git ignore is just a string match (a simple version of regex or something) on the file name. So to ignore any file with extension .txt you do *.txt, and to ignore a file without extension you do * but that is not great as it ignore any files with a name, i.e. all files.

No extension does not tell the computer anything about what the file is or should be. So that is almost as bad as the wrong extension in my mind.

What is your reason to not use .dtasig?

@bbdaniels
Copy link
Contributor

bbdaniels commented Feb 7, 2023 via email

@kbjarkefur
Copy link
Contributor

Anyone not using a gitignore file will not ignore it. Most templates do not use allowlists so it would not ignore it. We can update our template that use allowlists to allow it (will only apply to new repos tho).

I think that is the proper long term solution. And whatever we save from not doing the proper long term solution I do not justify relying on a file without file extension. That is bad for the computer, but a file without extension also looks like a corrupted file to a human.

You can deduct my preference, but you are lead author so it is up to you.

@luizaandrade
Copy link
Collaborator Author

I am all for updating the gitignore. We will always have to add new things to it anyway, it's not a static file.

If I see a .txt file, I expect git to track changes to it well and to be able to see its content. maybe we already had this discussion and I forgot about it and got confused again. but if the intended format for this file is .dtasig, I think we can stick to it.

Agreed that this does not warrant a new release right now, we can add it to the next one, whenever it comes. The help file is consistent with using .dtasig too, so it seems like a very small change. But if you feel strongly that .txt works best with the intended workflow for the command, then I we can leave it as is and just add a line on the helpfile saying that the data signature will be saved using a .txt format by defaults instead of the standard .dtasig to facilitate tracking using git.

@bbdaniels
Copy link
Contributor

add a line on the helpfile saying that the data signature will be saved using a .txt format by defaults instead of the standard .dtasig to facilitate tracking using git.

I prefer this most, but as I mentioned I am also happy to, for example, use .sig.txt instead of -sig.txt as it is now to indicate that it is not a standard text file -- that that's just for the string matching in tracking. I have seen this style in other applications although it's rather uncommon...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor bug command still works, no calc errors
Projects
None yet
Development

No branches or pull requests

3 participants