-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#42 import documents #57
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This MR has a lot in it which seems unrelated to the "import documents" ticket. Can we do something to make it smaller, so it can be fully reviewed?
cache = image_linker(cache) | ||
cache = document_linker(cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: It's interesting that we call this variable "cache", but that the functions in question call it html
, and take and return a string. The name "cache" confused me because it suggests either a dict or an API to a cache backend.
Nonetheless why don't we call it html
here?
Admittedly, I don't yet have the full picture.
@@ -288,3 +328,61 @@ def get_alignment_class(image): | |||
alignment = "right" | |||
|
|||
return alignment | |||
|
|||
|
|||
def document_linker(html): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: These functions, image_linker, and document_linker, both take and return a string, but parse it with BeautifulSoup. I wonder whether my suggestion the other day that a function should both take and return a soup object was misguided, because it leads to inconsistency within the app. Or whether these should change for the same reason (better performance if we do less parsing and exporting).
Ref. #48 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you OK with me adding a ticket for this. The first time we get some soup is with the lxml
parser and that's the soup we have just before it's passed to this function so we pass a string of the elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first time we get some soup is with the
lxml
parser.
Are you saying that the soups are different because one uses lxml
and another uses html.parser
? If so, then that's a strong argument for passing around string HTML objects instead of soup instances.
I favour readability and consistency at this stage, especially as we aren't yet aware of any performance concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#67 Where we should look at this in more detail.
string: the html with img tags modified | ||
|
||
BS4 performs a find and replace on all img tags found in the HTML. | ||
If the image can be retrived from the remote site and saved into a Wagtail ImageModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the image can be retrived from the remote site and saved into a Wagtail ImageModel | |
If the image can be retrieved from the remote site and saved into a Wagtail ImageModel |
self.assertEqual(get_image_file_name("folder/fakeimage.jpg"), "fakeimage.jpg") | ||
self.assertEqual( | ||
get_image_file_name( | ||
"http://www.example.com/folder1/folder2//fakeimage.jpg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is the double forward slash intentional?
return r, status, content_type | ||
|
||
|
||
def get_abolute_src(src, domain_prefix=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion Typo, used throughout this MR.
def get_abolute_src(src, domain_prefix=None): | |
def get_abolute_src(src, domain_prefix=None): |
|
||
|
||
def get_abolute_src(src, domain_prefix=None): | ||
src = re.sub("^\/+", "", src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is this just the same as src.lstrip("/")
? That would be more readable.
self.assertEqual( | ||
get_abolute_src("folder/fakeimage.jpg"), | ||
"folder/fakeimage.jpg", | ||
) # the test settings has no BASE_URL setting so try having no domain prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: These seem like distinct test cases. Can we configure them thus?
input = get_soup( | ||
'<img src="fakeimage.jpg" alt="image alt" class="align-left" />', | ||
"html.parser", | ||
).find("img") | ||
self.assertEqual(get_alignment_class(input), "left") | ||
input = get_soup( | ||
'<img src="fakeimage.jpg" alt="image alt" class="align-right" />', | ||
"html.parser", | ||
).find("img") | ||
self.assertEqual(get_alignment_class(input), "right") | ||
input = get_soup( | ||
'<img src="fakeimage.jpg" alt="image alt" />', | ||
"html.parser", | ||
).find("img") | ||
self.assertEqual(get_alignment_class(input), "fullwidth") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: These are three different tests.
Suggestion: Try to avoid using Python keywords as variables. Maybe call it "tag".
# work in progress | ||
def test_with_real_image(self): | ||
# but we need to test with mocked images if we can. | ||
raw_html_file = """ | ||
<p>Lorem <img src="https://dummyimage.com/600x400/000/fff" alt=""></p> | ||
""" | ||
self.builder = BlockBuilder(raw_html_file, None, None) | ||
self.builder.promote_child_tags() | ||
self.blocks = self.builder.build() | ||
self.assertTrue("<embed" in self.blocks[0]["value"]) | ||
|
||
# work in progress | ||
def test_with_real_pdf(self): | ||
# but we need to test with mocked files if we can. | ||
raw_html_file = """ | ||
<p> | ||
<a href="https://file-examples-com.github.io/uploads/2017/10/file-sample_150kB.pdf">A pdf</a> | ||
</p> | ||
""" | ||
self.builder = BlockBuilder(raw_html_file, None, None) | ||
self.builder.promote_child_tags() | ||
self.blocks = self.builder.build() | ||
self.assertTrue('linktype="document"' in self.blocks[0]["value"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: What's the plan with these work-in-progress test cases? Can we remove them if they're not ready yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because we should really be testing without needed to use external files. I opened a ticket to look at this at some point rather than hold the PR's up. https://projects.torchbox.com/projects/wordpress-to-wagtail-importer-package/tickets/76
docs/yoast.md
Outdated
@@ -0,0 +1,37 @@ | |||
- [Yoast plugin compatibility](#yoast-plugin-compatibility) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why is there a Yoast documentation file in this MR? It's not related.
d3b2574
to
d882e43
Compare
07c740a
to
6657bf7
Compare
testing using live url rather than local files. will need more work here
code review changes
6657bf7
to
abe9135
Compare
No description provided.