Screen scraping and robotic process automation (RPA) for fun (and profit?)

When I first started screen scraping, things were simple. Most data lived in plain HTML tables, and there was (almost) no JavaScript. You could just download pages with cURL, parse them with Perl (!), and get whatever data you wanted.

Today, things are different. Everything uses JavaScript, content loads asynchronously, filling in forms enables and disables elements based on validation rules, validation doesn’t always run when you expect it to, and the list goes on.

Here I’m capturing some of the problems I’ve run into and how to work around them:

  • Buttons not clickable until a form field is filled in
  • Form validation not running until a resource is loaded
  • Warnings about duplicate submission attempts
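For the first of these, the general fix is to wait for the element to become clickable rather than sleeping for a fixed time. A minimal plain-Java sketch of that polling pattern (the `WaitUtil` helper and its names are mine; in real scraping code Selenium's `WebDriverWait` or Selenide's built-in conditions do this for you):

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll a condition until it holds or a timeout expires.
// In a browser test the condition would be "the submit button is enabled".
class WaitUtil {
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // e.g. the button became clickable
            }
            if (System.nanoTime() >= deadline) {
                return false; // gave up: condition never held within the timeout
            }
            try {
                Thread.sleep(100); // poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The point is that the wait is condition-driven with an upper bound, instead of a hard-coded `Thread.sleep` that is either too short (flaky) or too long (slow).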

Notes:

  • Selenide - https://youtu.be/P-vureOnDWY?t=1062
    • No need to download WebDriver binaries by hand (Selenide manages them)
    • Helps write code that copes with AJAX-driven pages
    • No need to add explicit waits (conditions are retried until a timeout)
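For example (Selenide's fluent API; the URL and selectors here are made up), `shouldBe`/`shouldHave` retry the condition until it holds or the default timeout expires, so the AJAX-related waiting is implicit:

```java
import static com.codeborne.selenide.Selenide.*;
import static com.codeborne.selenide.Condition.*;

class LoginFlow {
    static void logIn() {
        open("https://example.com/login");  // made-up URL
        $("#username").setValue("alice");   // made-up selectors
        $("#password").setValue("secret");
        // shouldBe(enabled) retries until the button becomes clickable,
        // then click() runs -- no Thread.sleep, no explicit WebDriverWait
        $("#submit").shouldBe(enabled).click();
        // waits for the AJAX response to render the greeting
        $(".greeting").shouldHave(text("Welcome"));
    }
}
```

Running this for real requires the Selenide dependency and a browser; it is shown only to illustrate the implicit-wait style.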
  • Page object pattern - https://youtu.be/P-vureOnDWY?t=2084
    • IntelliJ → Tools → Open Selenium Page Object Playground
    • Separates tests from page-specific code
    • Single repository for the operations offered by a page
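A framework-agnostic sketch of the pattern (the `Browser` interface is a stand-in of my own for Selenide/WebDriver): selectors live inside the page object, and tests only call the operations it offers.

```java
// Stand-in for the real browser automation API (Selenide, WebDriver, ...).
interface Browser {
    void type(String selector, String text);
    void click(String selector);
    String textOf(String selector);
}

// The page object: one class per page, exposing operations, not selectors.
class LoginPage {
    private final Browser browser;

    LoginPage(Browser browser) {
        this.browser = browser;
    }

    // Tests call high-level operations; the selectors stay private here,
    // so a markup change means editing one class, not every test.
    public void loginAs(String user, String password) {
        browser.type("#username", user);
        browser.type("#password", password);
        browser.click("#submit");
    }

    public String errorMessage() {
        return browser.textOf(".error");
    }
}
```

This is also what makes page objects easy to unit-test: hand the page object a fake `Browser` and assert on the recorded operations.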
  • Selenoid
    • Run browsers in Docker containers
    • To install and run
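A minimal install-and-run sketch, as I remember it from the Aerokube docs (image names and flags may differ by version; check the current documentation before copying):

```shell
# Pull the Selenoid server and a browser image
docker pull aerokube/selenoid:latest-release
docker pull selenoid/chrome:latest

# browsers.json tells Selenoid which browser images it may launch
mkdir -p ~/selenoid
cat > ~/selenoid/browsers.json <<'EOF'
{
  "chrome": {
    "default": "latest",
    "versions": {
      "latest": { "image": "selenoid/chrome:latest", "port": "4444" }
    }
  }
}
EOF

# Selenoid needs the Docker socket so it can start browser containers itself
docker run -d --name selenoid -p 4444:4444 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ~/selenoid:/etc/selenoid:ro \
  aerokube/selenoid:latest-release

# Tests then point at Selenoid as a remote WebDriver:
#   http://localhost:4444/wd/hub
```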

Resources: