TIPS FOR IMPROVEMENT #978
Comments
Hi @aleksandar-devedzic, here is my fork
Oh, thanks
I hope that I helped you with this.
All best
On Thu, Nov 16, 2023 at 10:26 PM Andrei P. ***@***.***> wrote:
Hi @aleksandar-devedzic,
I forked newspaper3k, and your suggestions are implemented in the next version. The code is currently in the work-0.9.2 branch; if you need it now, you can pull it from there. Alternatively, you can wait for the release ;)
Here is my fork:
https://github.com/AndyTheFactory/newspaper4k
I had an issue with BBC, where they nest p tags in divs. Newspaper4k seems to work perfectly after installing typing-extensions. Thanks
One more improvement (about dates)...
On dates: why does a BBC article now prepend the date to the _text string? E.g.
Source: https://www.bbc.co.uk/news/uk-england-london-68511760
I have extracted some meta tags. You can try to identify the title, text, description, and date by substituting the tag names listed below into these selectors:
meta[property='{}']
meta[name='{}']
meta[itemprop='{}']
Meta tags for publication and modification date:
published_date
published_time
cXenseParse:publishtime
pubdate
publish_date
PublishDate
dcterms.created
rnews:datePublished
article:published_time
prism.publicationDate
displaydate
OriginalPublicationDate
og:published_time
datePublished
article_date_original
article.published
published_time_telegram
sailthru.date
date
Date
original-publish-date
DC.date.issued
dc.date
DC.Date
parsely-pub-date
publishtime
publication_date
uploadDate
coverageEndTime
publishdate
publish-date
publishedAtDate
dcterms.date
publishedDate
creationDateTime
pub_date
updated_time
og:updated_time
datemodified
last-modified
Last-Modified
DC.date.modified
article:modified_time
modified_time
modifiedDateTime
dc.dcterms.modified
lastmod
Meta tags for title:
dc.title
og:title
headline
articletitle
article-title
parsely-title
title
Meta tags for description:
description
og:description
Meta tags for body:
articleBody
articleText
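A minimal sketch of how the tag names above could be tried in order, using only the Python standard library rather than newspaper's own extractor. The names `MetaCollector` and `first_meta_content`, the shortened candidate lists, and the sample HTML are illustrative assumptions. The three attributes checked correspond to the `meta[property='{}']`, `meta[name='{}']`, and `meta[itemprop='{}']` selectors. Matching is case-sensitive, which is why the lists above carry variants such as `date` and `Date`.

```python
from html.parser import HTMLParser

# Attributes matching meta[property=...], meta[name=...], meta[itemprop=...].
META_ATTRS = ("property", "name", "itemprop")

# Shortened candidate lists taken from the tags above (illustrative subset).
PUBLISH_DATE_TAGS = ["article:published_time", "datePublished", "pubdate", "dc.date"]
TITLE_TAGS = ["og:title", "dc.title", "headline", "title"]


class MetaCollector(HTMLParser):
    """Collect key -> content pairs from every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        content = attr_map.get("content")
        if not content:
            return
        for attr in META_ATTRS:
            key = attr_map.get(attr)
            if key:
                # Keep the first occurrence of each key.
                self.found.setdefault(key, content)


def first_meta_content(html, candidates):
    """Return the content of the first candidate tag present, else None."""
    collector = MetaCollector()
    collector.feed(html)
    for name in candidates:
        if name in collector.found:
            return collector.found[name]
    return None


# Hypothetical page fragment, for illustration only.
sample_html = """
<html><head>
<meta property="og:title" content="Example headline">
<meta itemprop="datePublished" content="2023-11-16T22:26:00Z">
</head><body></body></html>
"""

print(first_meta_content(sample_html, TITLE_TAGS))
print(first_meta_content(sample_html, PUBLISH_DATE_TAGS))
```

The order of each candidate list acts as a priority: the first tag name found on the page wins, so more specific tags should come first.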
FYI
It would be good if you could fix, improve, or adapt the code so that it extracts full information from the websites below, since they are among the most popular news websites in the world.
By "full information" I mean the title, publication date, and article body.
CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/
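One way to check the sites above for "full information" is to report which of the three fields came back empty for a parsed article. This is only a sketch: the helper `missing_fields` is hypothetical, and it assumes the parsed article object exposes `title`, `publish_date`, and `text` attributes, as newspaper's `Article` does after `download()` and `parse()`. A stand-in object is used here so the example runs without network access.

```python
from types import SimpleNamespace

# The three fields that make up "full information" in this issue.
REQUIRED_FIELDS = ("title", "publish_date", "text")


def missing_fields(article):
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(article, name, None)]


# Stand-in for a parsed article (hypothetical values). With the library
# installed, this would instead be something like:
#   from newspaper import Article
#   article = Article(url); article.download(); article.parse()
parsed = SimpleNamespace(title="Example headline", publish_date=None, text="Body text")

print(missing_fields(parsed))  # → ['publish_date']
```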