Archive and Hiring Event #534

Merged: 31 commits, Oct 24, 2024

Changes from all commits

Commits
c9faf7d
Initial commit of workflow to test / archive site
wesley-dean-gsa Oct 17, 2024
0606d52
[MegaLinter] Apply linters fixes
wesley-dean-gsa Oct 17, 2024
9f0a6fc
Fix the name of the script YOU JUST WROTE
wesley-dean-gsa Oct 17, 2024
aad7e0a
Tweak directory creation, docs, and slugify
wesley-dean-gsa Oct 18, 2024
c23b4c5
[MegaLinter] Apply linters fixes
wesley-dean-gsa Oct 18, 2024
8b8f1b2
Adjust how sed gets the logs further
wesley-dean-gsa Oct 18, 2024
3aa07bd
Merge branch '491-verify-site-rendering-upon-build' of https://github…
wesley-dean-gsa Oct 18, 2024
1b7c8cc
[MegaLinter] Apply linters fixes
wesley-dean-gsa Oct 18, 2024
7493fee
Break sed and grep into separate commands
wesley-dean-gsa Oct 18, 2024
06c71b8
Resolve merge conflict
wesley-dean-gsa Oct 18, 2024
8fed852
[MegaLinter] Apply linters fixes
wesley-dean-gsa Oct 18, 2024
2fe90a7
Always push the logs -- I want to see what's happening
wesley-dean-gsa Oct 18, 2024
8b77c15
Attempt sitemap approach
wesley-dean-gsa Oct 18, 2024
44deb97
[MegaLinter] Apply linters fixes
wesley-dean-gsa Oct 18, 2024
818c0ca
Return delays so as to be kind to the Pages team
wesley-dean-gsa Oct 18, 2024
1ca6e81
Capture sitemap and use timestamp
wesley-dean-gsa Oct 18, 2024
015c57f
Merge branch '491-verify-site-rendering-upon-build' of https://github…
wesley-dean-gsa Oct 18, 2024
5ba6090
Add commit SHA to archive name
wesley-dean-gsa Oct 18, 2024
063f695
Fix SHA reference
wesley-dean-gsa Oct 18, 2024
61545f7
Merge branch 'staging' into 491-verify-site-rendering-upon-build
wesley-dean-gsa Oct 18, 2024
e8fc197
Merge branch 'staging' into 491-verify-site-rendering-upon-build
wesley-dean-gsa Oct 21, 2024
aada6bc
Add hiring event to join page
debjudy Oct 24, 2024
233db6d
Update hiring event info session URL
debjudy Oct 24, 2024
4200144
Merge branch 'main' into debjudy-patch-1
debjudy Oct 24, 2024
b8fd58e
Merge pull request #531 from GSA-TTS/debjudy-patch-1
debjudy Oct 24, 2024
9fb9d9e
Merge branch 'staging' into 491-verify-site-rendering-upon-build
wesley-dean-gsa Oct 24, 2024
77cc82d
Merge pull request #499 from GSA-TTS/491-verify-site-rendering-upon-b…
wesley-dean-gsa Oct 24, 2024
7e8ad8d
Add hiring event to jobs section
debjudy Oct 24, 2024
3a22d15
Add bold to text
debjudy Oct 24, 2024
f03e925
Remove hiring event
debjudy Oct 24, 2024
2a61f5a
Merge pull request #533 from GSA-TTS/530-hiring-event
debjudy Oct 24, 2024
74 changes: 74 additions & 0 deletions .github/workflows/archive_website.yml
@@ -0,0 +1,74 @@
---
name: Archive website

# yamllint disable-line rule:truthy
on:
  pull_request:
  workflow_dispatch:

permissions: read-all

concurrency:
  group: ${{ github.ref }}-${{ github.workflow }}
  cancel-in-progress: true
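  # (a newer run for the same ref cancels any archive run still in progress)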

jobs:
  archive:
    runs-on: ubuntu-latest

    permissions:
      issues: write
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # [email protected]

      - name: Setup custom variables
        id: customvars
        run: |
          ( echo -n "BASE_URL="
            if [ -n "${{ vars.BASE_URL }}" ] ; then
              echo "${{ vars.BASE_URL }}"
            else
              echo "https://federalist-a2423046-fe43-4e75-a2ef-2651e5e123ca.sites.pages.cloud.gov/preview/gsa-tts/tts.gsa.gov/"
            fi

            echo -n "URL_PATH="
            if [ -n "${{ vars.URL_PATH }}" ] ; then
              echo "${{ vars.URL_PATH }}/"
            elif [ -n "${GITHUB_HEAD_REF:-}" ] ; then
              echo "${GITHUB_HEAD_REF}/"
            else
              echo "staging/"
            fi

            echo -n "DELAY_SECONDS="
            if [ -n "${{ vars.DELAY_SECONDS }}" ] ; then
              echo "${{ vars.DELAY_SECONDS }}"
            else
              echo "60"
            fi

            echo "BUILD_TIMESTAMP=$(date +%Y%m%d%H%M)"
          ) >> "$GITHUB_OUTPUT"
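
      # For reference, the step above writes lines like these to $GITHUB_OUTPUT
      # (illustrative values; the BASE_URL, URL_PATH, and DELAY_SECONDS
      # repository variables are optional overrides):
      #   BASE_URL=https://federalist-a2423046-fe43-4e75-a2ef-2651e5e123ca.sites.pages.cloud.gov/preview/gsa-tts/tts.gsa.gov/
      #   URL_PATH=491-verify-site-rendering-upon-build/
      #   DELAY_SECONDS=60
      #   BUILD_TIMESTAMP=202410241200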

      - name: Delay while the preview URL is built
        run: "sleep ${{ steps.customvars.outputs.DELAY_SECONDS }}"

      - name: Build archive
        run: bin/archive_website.bash "${{ steps.customvars.outputs.BASE_URL }}${{ steps.customvars.outputs.URL_PATH }}"

      # Upload artifacts
      - name: Archive artifacts
        if: always()
        uses: actions/upload-artifact@834a144ee995460fba8ed112a2fc961b36a5ec5a # [email protected]
        with:
          name: "site-archive-${{ steps.customvars.outputs.BUILD_TIMESTAMP }}-${{ github.sha }}"
          path: |
            site-archive-*.tar.gz
            site-archive-*.log.txt
            site-archive-*.sitemap.txt
5 changes: 5 additions & 0 deletions _includes/layouts/jointts/jobs.html
@@ -1,3 +1,8 @@
<!-- Hiring Events -->
<h2>Hiring events</h2>
<p>Look for us at the <a href="https://www.techtogov.org/events">Tech to Gov Virtual Hiring Forum + Job Fair</a> on Thursday, October 31, 2024, from 12:00 PM-4:00 PM ET (9:00 AM-1:00 PM PT).</p>
<p><b>Not able to attend Tech to Gov? No worries, everyone can apply to our upcoming roles!</b> Please <a href="https://events.zoomgov.com/ev/ApHdaoDfDk7vEPBDjo38kHFgRZVgNT1hP8JoTVG4fbRxDsQhSYOB~Aheq_usTUGVJQDT3DiExAiT56KGCpzdTya3C-fUkwKU8Ztl9roEQGVePll7K8UMzH5PaFUciKmLt8fG0Jk68AaFnyQ">join our information session</a> on Tuesday, October 29, 2024, from 4:00 PM-5:00 PM ET (1:00 PM-2:00 PM PT) to hear more.</p>

<!-- Open Jobs -->
<section class="open-jobs">
<h2 id="open-positions">Open positions</h2>
193 changes: 193 additions & 0 deletions bin/archive_website.bash
@@ -0,0 +1,193 @@
#!/usr/bin/env bash

## @fn archive_website.bash
## @brief given the URL, attempt to create an archive of said site
## @details
## This will create an archive (a gzipped tar file, specifically) of a web
## site located at a provided URL.
##
## Internally, this tool uses 'wget' to make a mirror copy of the requested
## site. Any assets (images, CSS files, JavaScript, HTML, etc.) referenced
## along the way are downloaded locally, and the links to those assets are
## updated to point to the local copies.
##
## Once the mirror is complete, a compressed tarball is created containing
## all of the downloaded content. In essence, this creates a build artifact
## that can be examined offline at a later date.
##
## Additionally, if the tool encounters any 400 or 500 HTTP responses (client-
## and server-side errors, respectively), then the tool will return a 100
## result code.
##
## The original purpose of this tool was to determine if a site build was
## successful. There were instances where a site would be built with HTML
## that referenced incorrect asset filenames. Therefore, the homepage would
## return a 200 HTTP response and the text of the site would appear to be
## correct; however, the site would appear to be "broken" because the CSS
## and other assets wouldn't load. This ought to test for conditions like
## those.
##
## Unfortunately (for us), one of the first things wget does is look for a
## '/robots.txt' file. In the event that a site didn't include one, wget
## would issue a message saying that it was attempting to load robots.txt
## and to ignore any error messages, followed by several lines of log
## output and, finally, a 404 error saying that the robots.txt file
## couldn't be loaded. As a result, it involved some sed magic to look for
## messages like those and remove the resulting error messages so that we
## could scan the log file for actual 400 and 500 level HTTP responses.
##
## So, in the event that an asset is missing, this tool will return a non-zero
## result code; otherwise, it'll write the filename of the archive it created
## to STDOUT.
## @author The TTS Website Team
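##
## @par Examples
## @code
## # illustrative invocation: pass the site's base URL (with a trailing
## # slash, since "sitemap.xml" is appended to it) plus any extra wget options
## bin/archive_website.bash "https://tts.gsa.gov/"
## @endcode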

set -euo pipefail

## @fn slugify()
## @brief this will write a string suitable for use as a slug to STDOUT
## @details
## Many Content Management Systems (CMSs) refer to a modified path / filename
## that may be safely used without having to URL-escape characters as a
## "slug". Slugs are typically sanitized with problematic characters (e.g.,
## those that change directories, fork processes, separate commands, etc.)
## removed. Slugs are also limited in length, typically to 63 characters
## to prevent overflows. Strings of one or more non-alphanumeric characters
## are replaced by a single '-'.
##
## So, for example, "Hello, World!" would be slugified as "hello-world"
##
## This will accept zero or more strings as parameters and return
## "slugified" versions of those strings.
##
## The results of this are written to STDOUT, one string (argument) per line
## @param string[] strings to slugify
## @returns slugified strings via STDOUT, one per line
## @par Examples
## @code
## URL="http://tts.gsa.gov/"
## archive="$(slugify "$URL")"
## @endcode
slugify() {
  for string in "$@"; do
    echo "$string" \
      | iconv -c -t ascii//TRANSLIT \
      | tr '[:upper:]' '[:lower:]' \
      | sed \
        -Ee 's/[~^]+//g' \
        -Ee 's/[^[:alnum:]]+/-/g' \
        -Ee 's/^-+|-+$//g' \
        -Ee 's/.//63g'
  done
}
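
## For example (illustrative, exercising the pipeline above):
##   slugify "Hello, World!"          # -> hello-world
##   slugify "https://tts.gsa.gov/"   # -> https-tts-gsa-gov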

## @fn mirror_site()
## @brief given a URL, create a local copy of the site or return an error
## @details
## This will use wget to attempt to create a local copy -- a mirror -- of a
## site located at the requested URL. The archive is a gzip-compressed GNU
## tar file; its name and the name of the accompanying log file are written
## to STDOUT upon successful creation. See the detailed notes at the top of
## this file for more context and detail.
## @param URL the URL to download
## @param options[] these options are added to wget before the URL
## @returns archive and log filenames via STDOUT
## @retval 0 (True) if the download and archive creation were successful
## @retval 100 if the download resulted in failing HTTP responses
## @par Examples
## @code
## local url="https://tts.gsa.gov/"
## my_archive="$(mirror_site "$url")" || exit 1
## @endcode
mirror_site() {

  local URL="${1?Error: no URL passed}"
  shift

  local slugified_url now tarball logfile sitemapfile
  slugified_url="$(slugify "$URL")"
  now="$(date +%Y%m%d%H%M)"
  tarball="site-archive-${slugified_url}-${now}.tar.gz"
  logfile="site-archive-${slugified_url}-${now}.log.txt"
  sitemapfile="site-archive-${slugified_url}-${now}.sitemap.txt"

  ## perform some cleanup

  if [ -e "${tarball}" ]; then
    echo "Removing old tarball '${tarball}'" 1>&2
    rm -rf "${tarball}"
  fi

  if [ -e "${sitemapfile}" ]; then
    echo "Removing old sitemap '${sitemapfile}'" 1>&2
    rm -rf "${sitemapfile}"
  fi

  if [ -e "${slugified_url}" ]; then
    echo "Removing old directory '${slugified_url}'" 1>&2
    rm -rf "${slugified_url}"
  fi

  # make sure the destination directory exists

  if [ ! -d "${slugified_url}" ]; then
    echo "Creating '${slugified_url}'" 1>&2
    mkdir -p "${slugified_url}"
  fi

  ## acquire the sitemap

  touch "$logfile"
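
  ## Pull the list of page URLs from the site's sitemap.xml: keep only the
  ## <loc> lines, then strip the XML tags so that, e.g.,
  ## "<loc>https://tts.gsa.gov/join/</loc>" becomes "https://tts.gsa.gov/join/"
  ## (the example URL is illustrative).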

echo "Downloading sitemap" 1>&2
wget \
"${URL}sitemap.xml" \
--append-output="${logfile}" \
--output-document=- \
| sed -Ene "/<loc>/p" \
| sed -Ee "s/<[^>]*>//g" \
> "$sitemapfile"

  ## download the site

  echo "Beginning download" 1>&2
  wget \
    --wait 1 \
    --level=inf \
    --limit-rate=500K \
    --recursive \
    --user-agent=TTSSiteArchiver \
    --no-host-directories \
    --directory-prefix="${slugified_url}" \
    --no-clobber \
    --no-parent \
    --page-requisites \
    --convert-links \
    --execute "robots=off" \
    --input-file="$sitemapfile" \
    "$@" \
    2>&1 \
    | tee -a "${logfile}" \
    || true # wget exits nonzero on failed fetches; errors are detected below

  ## scan the results looking for failing HTTP responses
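  ## wget reports each fetch with a line like
  ## "HTTP request sent, awaiting response... 404 Not Found"; any 4xx or 5xx
  ## status in the log means a page or asset failed to load. The sed pass
  ## below first deletes the expected robots.txt "please ignore errors"
  ## message (plus the five lines that follow it) so it can't cause a false
  ## positive.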

echo "Examining results..." 1>&2
# shellcheck disable=SC2002
sed -i~ -Ee '/^Loading\s*.*;\s*please ignore errors/,+5d' "${logfile}"
grep -Eqe '^HTTP request sent.*\b[45][[:digit:]]{2}\b' \
< "${logfile}" \
&& return 100

echo "No 400 or 500 level errors found; creating archive." 1>&2
tar -czf "${tarball}" "${slugified_url}"

echo "Archive: ${tarball}"
echo "Logs: ${logfile}"
}

## @fn main()
## @brief the main function
main() {
  mirror_site "$@"
}

# if we're not being sourced and there's a function named `main`, run it
[[ "$0" == "${BASH_SOURCE[0]}" ]] && [ "$(type -t "main")" == "function" ] && main "$@"