If you run a Jekyll blog (like this one!), you might be interested in having your blog posts saved in a web archive like the Internet Archive Wayback Machine. In this post, I’ll show you how you can use an auto-generated sitemap to get a list of all URLs on your Jekyll blog, then feed those URLs to a web archiving process.
Adding a sitemap to your Jekyll blog or website is easy. Assuming your configuration is relatively straightforward, using the jekyll-sitemap plugin can be as simple as adding a line to your site’s _config.yml if you’re using GitHub Pages. Once you’ve done that, test that the URLs you’re generating in the resulting sitemap.xml are valid and you should be good to go. Generating a sitemap also has SEO benefits, as it allows search engines to crawl your site more easily.[1]
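One way to sanity-check the generated sitemap is sketched below. It assumes each `<loc>` element sits on a single line and that your `grep` supports `-o` (GNU and BSD grep both do). The function names `extract_locs` and `check_sitemap_base` are hypothetical helpers, not part of jekyll-sitemap. The check only confirms that every URL starts with the base URL you expect, which catches a common misconfiguration where the sitemap lists localhost addresses.

```shell
#!/bin/sh
# Hypothetical helper: print every <loc> value from a sitemap read on stdin.
# Assumes each <loc>...</loc> pair is on one line.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'
}

# Hypothetical helper: read a sitemap on stdin and print every URL that does
# NOT start with the expected base URL ($1). Returns 1 if any are found.
check_sitemap_base() {
  base="$1"
  bad=$(extract_locs | while IFS= read -r url; do
    case "$url" in
      "$base"*) ;;                      # starts with the base URL: fine
      *) printf '%s\n' "$url" ;;        # anything else is suspicious
    esac
  done)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}
```

After sourcing these functions, something like `curl -s https://mysite.github.io/sitemap.xml | check_sitemap_base "https://mysite.github.io/"` should print nothing and exit with status 0 when the sitemap looks right.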
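With the sitemap in place, you can pipe it through a tool such as sitemap-urls to print every URL it lists, one per line:
curl https://mysite.github.io/sitemap.xml | sitemap-urls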
Now we can use that list of URLs to drive our archiving process:
curl https://mysite.github.io/sitemap.xml | sitemap-urls | while read url; do \
  curl -g --fail --retry 3 -L -o/dev/null -s "http://web.archive.org/save/$url"; \
done
This should tell the Wayback Machine to save all the URLs from the sitemap. Wrap this up in a script you can put in a periodic cron job and you can rest easy knowing that your pages are being regularly archived. A similar process should work for archiving any (non-Jekyll) website that provides a sitemap.[3] You could also use the scripts from my web-archive-triage repository to do some more complicated things, such as only archiving pages that have no snapshot.
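For the cron job, a wrapper script might look roughly like the sketch below. The script name, the function names `save_url` and `archive_urls`, and the `CURL` and `DELAY` variables are all hypothetical. The pause between requests is my own assumption about being polite to the Wayback Machine, not a documented requirement. The curl flags are the same ones used in the one-liner above.

```shell
#!/bin/sh
# archive-sitemap.sh (hypothetical name): ask the Wayback Machine to save
# every URL listed in a sitemap. Requires curl and sitemap-urls on the PATH.
#
# Example crontab entry (path is a placeholder), weekly on Sunday at 03:00:
#   0 3 * * 0 /path/to/archive-sitemap.sh https://mysite.github.io/sitemap.xml

SAVE_ENDPOINT="http://web.archive.org/save/"

# Build the Wayback Machine save URL for one page URL.
save_url() {
  printf '%s%s\n' "$SAVE_ENDPOINT" "$1"
}

# Read page URLs on stdin (one per line) and request a snapshot of each.
# CURL and DELAY can be overridden; both are assumptions of this sketch.
# Returns non-zero if any request failed.
archive_urls() {
  failed=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    if ! ${CURL:-curl} -g --fail --retry 3 -L -o /dev/null -s "$(save_url "$url")"; then
      echo "failed: $url" >&2
      failed=$((failed + 1))
    fi
    sleep "${DELAY:-5}"   # assumed politeness delay between requests
  done
  [ "$failed" -eq 0 ]
}

# Only run when a sitemap URL is given as the first argument.
if [ "$#" -ge 1 ]; then
  curl -s "$1" | sitemap-urls | archive_urls
fi
```

Because failures are reported on stderr and reflected in the exit status, cron's usual mail-on-output behaviour should be enough to notice when a run goes wrong.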