Image file names shouldn’t assume JPEG #8

dhouck · 2015-01-27T17:50:52Z

Currently, all the profile images are assumed to be JPEGs. However, there are a lot of PNGs and a few GIFs. I thought I had a fix for this, but unfortunately it assumes the images have already been downloaded when extract_all.py runs, and that script is run before downloading images. As the chapter pages don’t give any details about image type, I think we will be unable to figure out the internal file name until after the images are downloaded.

There was previously some discussion here about using the MIME type the server provides or determining it ourselves. At the time, I thought the drawback to using the server-provided MIME type was adding another stage, but since that looks inevitable now, I think that is actually the best approach. Both add an extra dependency, but the one for using the MIME type is smaller; additionally, the server-provided MIME type is probably faster.
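To make the two options concrete, here is a rough sketch of what each could look like. It is only an illustration: the function names and the fallback to `.jpg` are assumptions, not code from this project, and it uses a hand-written table instead of whichever dependency ends up being chosen.

```python
# Hypothetical helpers; none of these names come from the project.
EXTENSION_BY_MIME = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
}


def extension_from_content_type(header_value, default=".jpg"):
    """Option 1: trust the Content-Type value the server sent."""
    # Drop parameters such as "; charset=..." and normalise case.
    mime = header_value.split(";", 1)[0].strip().lower()
    return EXTENSION_BY_MIME.get(mime, default)


def extension_from_bytes(data, default=".jpg"):
    """Option 2: determine the type ourselves from the leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    return default
```

The first function only needs the response headers, while the second needs the image contents, so it can only run after the download stage.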


ssafar · 2015-01-27T18:52:46Z

A question: does the MIME type version support fetching only images that weren't present before? Just to be nice to dreamwidth :)

dhouck · 2015-01-27T22:30:34Z

No, the version I mentioned does not. However, if we create another folder, web_headers, we can have wget use the HEAD method and save headers there. The problem with that approach is it would require an extra request for each image, which might be less nice. Since HEAD requests are so small, though, I think that’s probably the best option.
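As a rough sketch of the web_headers idea, assuming each image's response headers are saved as plain text in a file named `<image name>.headers` (that naming scheme and both function names are made up for illustration), reading the type back and skipping images we already have headers for could look like this:

```python
import os


def content_type_from_headers(header_text):
    """Return the last Content-Type found in saved response headers, or None.

    The last match is used so that, if redirects were followed, the type of
    the final response wins over the type of any intermediate one.
    """
    found = None
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=..." and normalise case.
            found = value.split(";", 1)[0].strip().lower()
    return found


def images_missing_headers(image_names, headers_dir="web_headers"):
    """List the images that have no saved header file yet (assumed naming)."""
    return [
        name for name in image_names
        if not os.path.exists(os.path.join(headers_dir, name + ".headers"))
    ]
```

Only the names returned by `images_missing_headers` would need a HEAD request, which keeps repeat runs from hitting the server again for images that were already checked.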

I don’t know about the HTML mirror, but because epubs store MIME type separately from just the file extension, it would be possible to just use an image file name without an extension. If that works for the HTML mirror too, it would save time by avoiding the image_parse stage I said might become necessary.
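For the epub side, the MIME type lives in the `media-type` attribute of each manifest `<item>` in the package document, so the `href` does not need an extension. A minimal sketch, where the helper name, the id, and the path are invented for illustration:

```python
import xml.etree.ElementTree as ET


def manifest_item(item_id, href, media_type):
    """Build one epub manifest <item>; the type comes from media-type, not href."""
    item = ET.Element(
        "item",
        {"id": item_id, "href": href, "media-type": media_type},
    )
    return ET.tostring(item, encoding="unicode")


# Hypothetical extension-less image entry.
print(manifest_item("img1", "images/avatar_1", "image/png"))
```

This only covers the epub manifest; whether a browser reading the HTML mirror handles extension-less files correctly would depend on how the files are served.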

ssafar assigned ssafar and unassigned ssafar Jan 27, 2015

dhouck mentioned this issue Jan 29, 2015

External links #9

Closed
