Website crawler that identifies accessibility issues in HTML pages and PDF files of discovered pages. Leverages axe & Google Lighthouse and PDFDocument.

OpenConceptConsulting/perception

 
 

PERCEPTION

This tool combines various open source tools to give insight into accessibility and performance metrics for a list of URLs. It consists of the following parts:

  • This application requires at least one CSV with a column header labeled "Address" and one URL per line (other comma-delimited data is ignored).
  • A crawl can also be executed (currently using a licenced version of the ScreamingFrog SEO CLI tools: https://www.screamingfrog.co.uk/seo-spider/).
  • Runs Deque AXE for all URLs and produces both a detailed and a summary report (including updating the associated Google Sheet). See: https://pypi.org/project/axe-selenium-python/
  • Runs Lighthouse CLI for all URLs and produces both a detailed and a summary report (including updating the associated Google Sheet). See: https://github.com/GoogleChrome/lighthouse
  • Runs a PDF audit for all PDF URLs and produces both a detailed and a summary report (including updating the associated Google Sheet).

To get started, follow the installation instructions below. Once complete:

  1. Start the virtual environment ( python -m venv venv && source venv/bin/activate )
  2. Run start app.py or python app.py.
  3. Navigate to http://127.0.0.1:8888/reports/ or http://localhost/reports/ where the sample "DRUPAL" report will be visible.
  4. View the report by clicking on the report address or providing the link as such http://localhost/reports/?id=DRUPAL
  5. Here is a link to the sample data Google Sheet report: DRUPAL Google Sheet

NOTE: At the moment, no database is used because the initial interest was in CSV data only. For each report, the system creates the following folders (under /REPORTS/your_report_name):

  • /AXE (used to store AXE data)
  • /CSV (CSVs to analyse; PDF CSV requests are appended with a PDF qualifier)
  • /LIGHTHOUSE (used to store Lighthouse data)
  • /logs (tracks progress and requests)
  • /SPIDER (used to store crawl data)
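The folder layout above can be sketched in a few lines of Python. This is only an illustration of the structure described here, not the application's own code: the function name create_report_folders and the use of a temporary base directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Sub-folder names as listed in this README.
REPORT_SUBFOLDERS = ("AXE", "CSV", "LIGHTHOUSE", "logs", "SPIDER")


def create_report_folders(base_dir, report_name):
    """Create REPORTS/<report_name>/ with one sub-folder per tool (hypothetical helper)."""
    report_dir = Path(base_dir) / "REPORTS" / report_name
    for name in REPORT_SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat for an existing report.
        (report_dir / name).mkdir(parents=True, exist_ok=True)
    return report_dir


# Demonstrate in a throwaway directory so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    created = create_report_folders(tmp, "your_report_name")
    print(sorted(p.name for p in created.iterdir()))
```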

At this point, moving to a database and adding an "Export to CSV" function would make more sense.

Workflow

As mentioned, simply provide a CSV with a list of URLs (column header = "Address") and select the tests to run through the web form.
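As a rough illustration of the expected input, the sketch below reads the "Address" column from CSV text and ignores every other column. The function name read_addresses and the sample data are hypothetical; the application's own parsing may differ.

```python
import csv
import io


def read_addresses(csv_text, column="Address"):
    """Return the non-empty values of `column`, one per data row (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f'CSV has no "{column}" column header')
    # Rows that are short or blank in this column are skipped.
    return [row[column].strip() for row in reader if (row[column] or "").strip()]


# Sample data: the extra "Status Code" column is simply ignored.
sample = (
    "Address,Status Code\n"
    "https://example.com/,200\n"
    "https://example.com/about,200\n"
)
print(read_addresses(sample))
```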

The application is configured through environment variables. On startup, the application will also read environment variables from a .env file.

  • HOST (defaults to 127.0.0.1)
  • PORT (defaults to 8888)
  • SECRET_KEY (no default, used to sign the Flask session cookie. Use a cryptographically strong sequence of characters, like you might use for a good password.)
  • ALLOWED_EXTENSIONS (defaults to "csv", comma separated list)
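The variables above could be read as in the following sketch, which applies the documented defaults. The load_config function is an assumption made for illustration and is not taken from app.py; it also does not parse a .env file itself.

```python
import os
import secrets


def load_config(env=None):
    """Read the documented settings from a mapping, falling back to the README defaults."""
    env = os.environ if env is None else env
    extensions = env.get("ALLOWED_EXTENSIONS", "csv")
    return {
        "HOST": env.get("HOST", "127.0.0.1"),
        "PORT": int(env.get("PORT", "8888")),
        # No default: None signals that SECRET_KEY still has to be set.
        "SECRET_KEY": env.get("SECRET_KEY"),
        "ALLOWED_EXTENSIONS": {
            ext.strip().lower() for ext in extensions.split(",") if ext.strip()
        },
    }


print(load_config({"PORT": "5000", "ALLOWED_EXTENSIONS": "csv, txt"}))

# One way to generate a strong value for SECRET_KEY:
print(secrets.token_hex(32))
```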

Installation

To get all tests running, the following steps are required:

Linux Installation

sudo apt update

sudo apt install git

sudo apt-get install python3-pip

sudo apt-get install python3-venv

sudo apt-get update

sudo apt-get install software-properties-common

sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt-get install python3.6

Clone and install requirements

git clone https://github.com/soliagha-oc/perception.git

cd perception

python3 -m venv venv

source venv/bin/activate

pip install -r requirements.txt

Run the Python app and launch the browser

python app.py

Browse to http://127.0.0.1:8888/ (or alternatively to port 5000 if you didn't set 8888 in the .env file)

CLI-TOOLS

Install the following CLI tools for your operating system:

chromedriver

  1. Download and install the matching/required chromedriver

    https://chromedriver.chromium.org/downloads

  2. Download the latest version from the official website and unzip it (here, for instance, version 2.29 to ~/Downloads)

    wget https://chromedriver.storage.googleapis.com/2.29/chromedriver_linux64.zip

  3. Move to /usr/local/share (or any folder) and make it executable

    sudo mv -f ~/Downloads/chromedriver /usr/local/share/

    sudo chmod +x /usr/local/share/chromedriver

  4. Create symbolic links

    sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver

    sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

    OR

    export PATH=$PATH:/path-to-extracted-file/

    OR

    add to .bashrc

geckodriver

  1. Go to the geckodriver releases page. Find the latest version of the driver for your platform and download it. For example: https://github.com/mozilla/geckodriver/releases

    wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz

  2. Extract the file with:

    tar -xvzf geckodriver*

  3. Make it executable:

    chmod +x geckodriver

  4. Add the driver to your PATH so other tools can find it:

    export PATH=$PATH:/path-to-extracted-file/

    OR

    add to .bashrc

lighthouse

  1. Install node

    https://nodejs.org/en/download/

    curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -

    sudo apt-get install -y nodejs

  2. Install npm

    npm install npm@latest -g

    sudo npm install npm@latest -g

  3. Install lighthouse

    npm install -g lighthouse

    sudo npm install -g lighthouse

pdfimages

https://www.xpdfreader.com/download.html

To install this binary package:

  1. Copy the executables (pdfimages, xpdf, pdftotext, etc.) to /usr/local/bin.

  2. Copy the man pages (*.1 and *.5) to /usr/local/man/man1 and /usr/local/man/man5.

  3. Copy the sample-xpdfrc file to /usr/local/etc/xpdfrc. You'll probably want to edit its contents (as distributed, everything is commented out) -- see xpdfrc(5) for details.
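Once the CLI tools are installed, a quick check like the one below can confirm that they are reachable on the PATH. It is a minimal sketch: the tool list is taken from the headings in this section, and find_tools is a hypothetical helper, not part of the application.

```python
import shutil

# Executable names taken from the CLI-TOOLS section of this README.
REQUIRED_TOOLS = ("chromedriver", "geckodriver", "lighthouse", "pdfimages")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```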

Google APIs

See this "Quick Start" guide to enable the Drive API: https://developers.google.com/drive/api/v3/quickstart/python

Complete the steps described in the rest of this page to create a simple Python command-line application that makes requests to the Drive API.

nginx (optional)

See: https://www.nginx.com/

ScreamingFrog SEO

See: https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#commandlineoptions

ScreamingFrog SEO CLI tools provide the following data sets (the required ones are annotated with their use):

  • crawl_overview.csv (used to create report DASHBOARD)
  • external_all.csv
  • external_html.csv (used to audit external URLs)
  • external_pdf.csv (used to audit external PDFs)
  • h1_all.csv
  • images_missing_alt_text.csv
  • internal_all.csv
  • internal_flash.csv
  • internal_html.csv (used to audit internal URLs)
  • internal_other.csv
  • internal_pdf.csv (used to audit internal PDFs)
  • internal_unknown.csv
  • page_titles_all.csv
  • page_titles_duplicate.csv
  • page_titles_missing.csv

Note: There are spider config files located in the /conf folder. You will require a licence to alter the configurations.

Note: If a licence is not available, simply provide a CSV where at least one column has the header "Address". See the DRUPAL example.

Deque AXE

Installed via pip install -r .\requirements.txt

See: https://pypi.org/project/axe-selenium-python/ and https://github.com/dequelabs/axe-core
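To show what a summary report might be built from, the sketch below counts violations in an axe results dictionary. It assumes the result shape documented by axe-core (a "violations" list whose entries carry "impact" and "nodes"); with axe-selenium-python such a dictionary is returned by Axe(driver).run() after calling inject(). The function summarize_violations and the sample data are hypothetical and are not the application's own report logic.

```python
from collections import Counter


def summarize_violations(results):
    """Count violated rules per impact level and the total number of affected nodes."""
    by_impact = Counter()
    affected_nodes = 0
    for violation in results.get("violations", []):
        by_impact[violation.get("impact") or "unknown"] += 1
        affected_nodes += len(violation.get("nodes", []))
    return {
        "rules_violated": sum(by_impact.values()),
        "affected_nodes": affected_nodes,
        "by_impact": dict(by_impact),
    }


# Hand-written sample in the axe-core result shape (not real scan output).
sample_results = {
    "violations": [
        {"id": "image-alt", "impact": "critical", "nodes": [{}, {}]},
        {"id": "color-contrast", "impact": "serious", "nodes": [{}]},
    ]
}
print(summarize_violations(sample_results))
```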

Google Lighthouse

Lighthouse is an open-source, automated tool for improving the performance, quality, and correctness of your web apps.

When auditing a page, Lighthouse runs a barrage of tests against the page, and then generates a report on how well the page did. From here you can use the failing tests as indicators on what you can do to improve your app.
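For a single URL, invoking the Lighthouse CLI from Python might look like the sketch below. The flags used (--output, --output-path, --chrome-flags, --quiet) are standard Lighthouse CLI options, but the helper names and the choice of flags are assumptions for illustration; the application may call Lighthouse differently.

```python
import subprocess


def lighthouse_command(url, output_path):
    """Build the argument list for one headless Lighthouse run with JSON output."""
    return [
        "lighthouse",
        url,
        "--output=json",
        f"--output-path={output_path}",
        "--chrome-flags=--headless",
        "--quiet",
    ]


def run_lighthouse(url, output_path):
    """Run Lighthouse and return its exit code (requires the lighthouse CLI on the PATH)."""
    return subprocess.run(lighthouse_command(url, output_path), check=False).returncode


# Only print the command here; call run_lighthouse(...) to actually execute it.
print(" ".join(lighthouse_command("https://example.com/", "./example.report.json")))
```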

Google APIs

Authentication

While there is a /reports/ dashboard, the system can also write to a Google Sheet. To do this, set up credentials for Google API authentication here: https://console.developers.google.com/apis/credentials to get a valid "credentials.json" file.

Google Sheets Template

To facilitate branding and other report metrics, a "non-coder/sheet formula template" is used. Here is a sample template. When a report is run from the /reports/ route, the template is loaded (the template report and folder IDs are found in globals.py and need to be set up/updated once), and the Google Sheet is either created or updated (the unique report ID is auto-generated and found in /REPORTS/your_report_name/logs/_gdrive_logs.txt).

Running with sample data

If you have a licensed copy of Screaming Frog SEO Spider, be sure to add it to your PATH. Even if Screaming Frog SEO Spider is not installed, a CSV can be provided to guide the report tools. Once the application is installed, try running the sample CSV. To do this:

  • Visit http://127.0.0.1:8888/
  • Enter a report name and email. Leave URL blank.
  • Click on "Choose File" under "Spider SEO Reports" to upload a file with a list of URLs (column header = "Address").
  • Select the tests you wish to run.

NOTE: This would exclude PDFs which require a list of exclusively PDF URLs.

Running a sample can be accomplished in two ways: by using the samples provided in the "/REPORTS/DRUPAL/" folder, or by downloading and installing Screaming Frog SEO Spider and running a free crawl (500 URL limit and no configuration/CLI tool access). Once the crawl is completed or the file is created, create/save the following CSVs:

  • crawl_overview.csv (via "Reports >> Crawl Overview" in the ScreamingFrog menu) - used to create Report Overview. Without this CSV, the Report Overview will be missing (working on calculating the results to eliminate this report)
  • internal_html.csv (via "Export" button in the ScreamingFrog interface) - used to point the reporting tools to the desired URLs
  • internal_pdf.csv (via "Export" button in the ScreamingFrog interface) - used to point the reporting tools to the desired URLs
  • external_html.csv (via "Export" button in the ScreamingFrog interface) - used to point the reporting tools to the desired URLs
  • external_pdf.csv (via "Export" button in the ScreamingFrog interface) - used to point the reporting tools to the desired URLs

If another method is used to crawl a base URL, save the results in a CSV file where at least one header (first row) reads "Address" and which lists one or more web or PDF URLs. Name the file(s) as listed above and place them in the "/REPORTS/your_report_name/SPIDER/" folder. At least one *_html.csv file is required and must be in that folder.
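If the URLs come from another crawler, they can be split into web and PDF lists before saving, since the PDF audit expects a list of PDF URLs only. The sketch below does this by file extension, which is a simplifying assumption (a PDF served from a URL without a .pdf suffix would be missed). The helper names are hypothetical.

```python
import csv
import io
from urllib.parse import urlsplit


def is_pdf_url(url):
    """Heuristic: treat a URL as a PDF if its path ends in .pdf (case-insensitive)."""
    return urlsplit(url).path.lower().endswith(".pdf")


def split_urls(urls):
    """Return (html_urls, pdf_urls), preserving the input order."""
    html_urls, pdf_urls = [], []
    for url in urls:
        (pdf_urls if is_pdf_url(url) else html_urls).append(url)
    return html_urls, pdf_urls


def to_address_csv(urls):
    """Serialise URLs as CSV text with the single header 'Address'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Address"])
    writer.writerows([url] for url in urls)
    return buffer.getvalue()


crawled = [
    "https://example.com/",
    "https://example.com/docs/guide.PDF?download=1",
    "https://example.com/about",
]
html_urls, pdf_urls = split_urls(crawled)
print(to_address_csv(html_urls))  # contents for internal_html.csv
print(to_address_csv(pdf_urls))   # contents for internal_pdf.csv
```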

Cautions

Spider, scanning, and viruses

It is possible when crawling and scanning sites to encounter various security risks. Please be sure to have a virus scanner enabled to protect against JavaScript and other attacks or disable JavaScript in the configuration.
