Pup, web scraping

https://github.com/ericchiang/pup

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors. Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

Trying to get last few HN titles, i got this ‘dumb’ cli:

curl -sS https://news.ycombinator.com/news | pup span.titleline a json{} | jq -r '.[] | select(.text) | .text' | head

JQ is used because I could not figure out how to extract just the link text and not the link href for some reason.

And an attempt at version that also converts some of the html entities back to text:

curl -sS https://news.ycombinator.com/news | pup span.titleline a json{} | jq -r '.[] | select(.text) | .text' | head | perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_) ;'

could return:

CatalaLang/catala: Programming language for law specification
Browsing like it's 1994: Integrating a Mac SE, ImageWriter II into a modern LAN
Linear Book Scanner – Open-source automatic book scanner (2014)
Why Resumes Are Dead and How Indeed.com Keeps Killing the Job Market
LaTeX for publishing tabletop role-playing games
The Sound Proof Booths of Silence
Apocalypse Proof: 33 Thomas Street in New York City
How HTTPS Works (2018)
Grabbing Dinner
Hexadecimal clock counting down to 2038 problemspecification

Note: pup is an old tool, that doesn’t seem to be developed/managed anymore.