I had to extract all titles and meta descriptions from a small website so they could be optimized. The HTML wasn't consistent, so I decided to write a small script that gets the job done using wget, curl, xmllint and XPath expressions.
#!/bin/bash
if [ -n "$1" ]; then
    rm -f urls.txt metadata.tsv

    # Mirror the site; wget's progress output contains one "--<timestamp>--  <URL>" line per request,
    # so the URL is the third whitespace-separated field.
    wget -m "$1" 2>&1 | grep '^--' | awk '{ print $3 }' > urls.txt

    echo -e "URL\tTitle\tMeta-Description" > metadata.tsv

    # Fetch each URL again and extract title and meta description from the HTML.
    while read -r url; do
        curl -s "$url" > tmp_file
        title=$(xmllint --html --xpath '/html/head/title/text()' tmp_file 2>/dev/null)
        metadescription=$(xmllint --html --xpath 'string(/html/head/meta[@name="description"]/@content)' tmp_file 2>/dev/null)
        echo -e "$url\t$title\t$metadescription" >> metadata.tsv
    done < urls.txt

    rm -f urls.txt tmp_file
else
    echo "Usage: ./extract.sh <URL>"
fi
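The two xmllint calls do the actual extraction. To check the XPath expressions in isolation, you can feed xmllint a throwaway snippet; the example HTML and the expected output here are just an illustration:

echo '<html><head><title>Hello</title><meta name="description" content="Demo page"></head></html>' > snippet.html
xmllint --html --xpath '/html/head/title/text()' snippet.html
# expected: Hello
xmllint --html --xpath 'string(/html/head/meta[@name="description"]/@content)' snippet.html
# expected: Demo page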
To use the script, make sure wget, curl and xmllint are installed. Make the script executable and run ./extract.sh https://example.com/.
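On a Debian-based system that boils down to something like the following (xmllint lives in the libxml2-utils package there; package names may differ on other distributions):

sudo apt install wget curl libxml2-utils
chmod +x extract.sh
./extract.sh https://example.com/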
The script mirrors the website with wget to generate a list of all URLs. In a second step it fetches every URL again with curl and extracts the title and meta description. This is not ideal, but I was too lazy to build a solution that makes only one request per URL. Feel free to send me such a solution and I will update this post.
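For anyone who wants to try: since wget -m already saves every page to disk, a single-request variant could parse the mirrored files instead of re-fetching them with curl. An untested sketch, assuming the default mirror layout where wget puts everything into a directory named after the host (example.com stands in for that directory here):

find example.com -name '*.html' | while read -r file; do
    title=$(xmllint --html --xpath '/html/head/title/text()' "$file" 2>/dev/null)
    metadescription=$(xmllint --html --xpath 'string(/html/head/meta[@name="description"]/@content)' "$file" 2>/dev/null)
    echo -e "$file\t$title\t$metadescription" >> metadata.tsv
done

The URL column would then have to be reconstructed from the file paths, and pages saved without an .html suffix would need extra handling, so treat this only as a starting point.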