Image file names shouldn’t assume JPEG #8

Open
dhouck opened this issue Jan 27, 2015 · 2 comments

Comments

Contributor

dhouck commented Jan 27, 2015

Currently, all the profile images are assumed to be JPEGs. However, there are a lot of PNGs and a few GIFs. I thought I had a fix for this, but unfortunately it assumes the images are already downloaded when extract_all.py runs, and that script is run before downloading images. As the chapter pages don’t give any details about image type, I think we will be unable to figure out the internal file name until after the images are downloaded.

There was previously some discussion here about using the MIME type the server provides or determining it ourselves. At the time, I thought the drawback to using the server-provided MIME type was adding another stage, but since that looks inevitable now, I think that is actually the best approach. Both add an extra dependency, but the one for using the MIME type is smaller; additionally, the server-provided MIME type is probably faster.
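
As a rough illustration of the server-provided approach, the sketch below maps a `Content-Type` header value to a file extension. The function name `extension_for_mime` and the lookup table are made up for this example and are not part of the project's code; only the three formats mentioned above are covered.

```python
from typing import Optional

# Hypothetical lookup table covering only the formats mentioned in this issue.
MIME_TO_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
}


def extension_for_mime(content_type: str) -> Optional[str]:
    """Return a file extension for a Content-Type value, or None if unknown."""
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return MIME_TO_EXT.get(mime)
```

Returning `None` for unknown types leaves it to the caller to decide whether to skip the image or fall back to some default.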

@ssafar ssafar assigned ssafar and unassigned ssafar Jan 27, 2015
Owner

ssafar commented Jan 27, 2015

A question: does the MIME type version support fetching only images that weren't present before? Just to be nice to dreamwidth :)

Contributor Author

dhouck commented Jan 27, 2015

No, the version I mentioned does not. However, if we create another folder, web_headers, we can have wget use the HEAD method and save headers there. The problem with that approach is it would require an extra request for each image, which might be less nice. Since HEAD requests are so small, though, I think that’s probably the best option.
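
A minimal sketch of how a later stage might read those saved headers, assuming each file in web_headers holds the raw response headers as plain text, one `Name: value` pair per line (the exact format wget would write is not confirmed here). `content_type_from_headers` and `needs_head_request` are hypothetical names. The second function shows one way to skip images whose headers were already saved, so repeat runs would send no extra requests.

```python
import os
from typing import Optional


def content_type_from_headers(raw_headers: str) -> Optional[str]:
    """Return the last Content-Type value in a raw header dump, or None."""
    found = None
    for line in raw_headers.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name.strip().lower() == "content-type":
            # Keep the last match: after redirects, the final response wins.
            found = value.split(";", 1)[0].strip().lower()
    return found


def needs_head_request(headers_dir: str, image_name: str) -> bool:
    """True if no header file has been saved for this image yet."""
    return not os.path.exists(os.path.join(headers_dir, image_name))
```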

I don’t know about the HTML mirror, but because epubs store the MIME type separately from the file extension, it would be possible to just use an image file name without an extension. If that works for the HTML mirror too, it would save time by avoiding the image_parse stage I said might become necessary.
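
To illustrate the epub side: each manifest `item` in an EPUB package document carries a `media-type` attribute alongside its `href`, so the type does not have to be inferred from the file name. The sketch below builds such an entry with the standard library. `manifest_item` and the example id and path are hypothetical, namespace handling is omitted for brevity, and whether every reading system copes with extension-less hrefs is untested here.

```python
import xml.etree.ElementTree as ET


def manifest_item(item_id: str, href: str, media_type: str) -> ET.Element:
    """Build a manifest <item> whose type comes from media-type, not the href."""
    return ET.Element(
        "item", {"id": item_id, "href": href, "media-type": media_type}
    )


# Hypothetical image stored without a file extension.
item = manifest_item("img-0001", "images/0001", "image/png")
print(ET.tostring(item, encoding="unicode"))
```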

@dhouck dhouck mentioned this issue Jan 29, 2015