Sans Bytes

Daniel Heath blogs here

Learn browser automation from scratch

Browser automation with selenium-webdriver


Browser tests are super useful, but they’re famous for hard-to-debug, unreliable failures.

This reputation is largely the result of developers learning high-level frameworks without understanding the underlying abstractions.

In this guide I’ll show you some cool terminal tricks - including how to drive a browser directly.

I wouldn’t write my tests for a production app this way, but it illustrates the underlying concepts in a way most tools abstract away.

This will help you to develop a better understanding of how most web testing tools work.


You will need Chrome and curl, plus selenium-server-standalone, jq and chromedriver.

If you are using Homebrew on a Mac, you can install the last three that way:

brew install selenium-server-standalone
brew install jq
brew install chromedriver

Getting started

Selenium is the predominant way of writing automated browser tests. It can operate all major browsers.

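Selenium runs on a client/server model: a client (here, curl) sends HTTP requests to the selenium server, which passes them to a browser driver such as chromedriver, which in turn controls the browser.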

Open a terminal and run the standalone server:

# On OSX using homebrew:
/usr/local/bin/selenium-server -port 4444

# Everyone else - you'll need to be in the directory where selenium is downloaded to / installed:
java -jar selenium-server-standalone-3.0.1.jar -port 4444

Leave this running and open another terminal window. In the new window, run the following to ask the server to start a new Chrome session (copy & paste will work, or save it to a file and run source <file>):

`# Save the URL to the local selenium server`
`# to the variable SELENIUM_URL`
`# (assumes the default /wd/hub path on the port chosen above)`
SELENIUM_URL="http://localhost:4444/wd/hub/session"

`# Save the result of asking for a new browser session`
`# to the variable SELENIUM_SESSION_ID`
SELENIUM_SESSION_ID=$( \
  `# Use curl to make an HTTP POST asking the selenium` \
  `# server to open chrome` \
  curl "$SELENIUM_URL" \
       --data-ascii \
       '{"desiredCapabilities":{"browserName":"chrome"}}' \
       | jq -r .sessionId \
  )
You should see a new window running chrome, looking at a blank page.

The above command sets your current shell (terminal window) up with the variables SELENIUM_URL and SELENIUM_SESSION_ID, which are used in later commands.

If you open a new shell, those variables will not be present and the commands won’t work.
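
One way to catch a missing variable early is a small guard function. This is only a minimal sketch: the name require_session is made up for this example, and it checks that the two variables are non-empty, not that the browser session is still alive.

```shell
# Hypothetical helper (not part of selenium): check that this shell
# still has the two variables set by the setup commands.
require_session() {
  if [ -z "${SELENIUM_URL:-}" ] || [ -z "${SELENIUM_SESSION_ID:-}" ]; then
    echo "SELENIUM_URL or SELENIUM_SESSION_ID is not set in this shell" 1>&2
    return 1
  fi
}

# Usage: only continue if the guard passes
require_session && echo "session variables are set"
```

If the guard fails, rerun the setup commands in the shell you are working in.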

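Let’s make our browser go to Google.

# The address below is an assumption; any reachable URL works here
curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/url" \
     --data-ascii \
     '{"url": "https://www.google.com"}' \
     | jq .

You should see the server’s JSON response printed in the terminal, and the Chrome window should now show the Google home page.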

Filling in forms

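Let’s automate issuing a search. First, find the input element:

`# Save the element ID to a variable (called ELEMENT_ID here)`
ELEMENT_ID=$( \
  curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/element" \
       --data-ascii \
       '{"using": "css selector", "value": "input[name=q]"}' \
       `# Use 'jq' to extract part of the response. ` \
       `# -r to get the text without quotes. ` \
       | jq -r .value.ELEMENT \
  )

Now we’ll type something into the search box:

curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/element/$ELEMENT_ID/value" \
     --data-ascii \
     '{"value": ["s","e","l","e","n","i","u","m", "\n"]}' \
     | jq .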

Switching to the chrome window, you should be able to see the search results.
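
Spelling out one array entry per character gets tedious. Since jq is already in use here, a small function can build the same kind of payload from an ordinary string. The name keys_payload is made up for this sketch; it is not something selenium provides.

```shell
# Hypothetical helper: turn a string into the {"value": [...]} payload,
# one array entry per character. jq takes care of the JSON escaping.
keys_payload() {
  jq -cn --arg text "$1" '{value: ($text | split(""))}'
}

# Prints {"value":["h","i"]}
keys_payload "hi"
```

The command above ends with a "\n" entry to press Enter; with this helper, the newline would need to be part of the string passed in.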

Customizing your shell

This is getting repetitive. Let’s define some bash functions to make it easier:

function debug {
  if [ -n "$DEBUG" ] ; then
    # Print debug messages to STDERR
    echo "$@" 1>&2
  fi
}

function get {
  # Build the URL, requiring that a path was provided
  local URL="${SELENIUM_URL}/${SELENIUM_SESSION_ID}${1?Must provide URL}"

  debug "$URL"

  # Fetch the URL and format with jq.
  curl --silent "${URL}" | jq .
}

function post {
  # Build the URL, requiring that a path was provided
  local URL="${SELENIUM_URL}/${SELENIUM_SESSION_ID}${1?Must provide URL}"

  # Set a default value for the POST data
  local DATA="${2-{\}}"

  # Validate that the provided data is valid JSON
  echo "$DATA" | jq . > /dev/null || \
    ${ALLOW_INVALID_JSON?Second argument is not valid JSON}

  debug "$URL"

  # Fetch the URL and format the response with jq.
  curl --silent "${URL}" --data-ascii "${DATA}" | jq .
}
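
The ${1?Must provide URL} expansion is what enforces the path argument: if $1 is unset, the shell prints the message and aborts instead of building a URL. Here is a rough sketch that isolates that step. The name build_url is hypothetical, and the session values are placeholders rather than anything a real server returned.

```shell
# Hypothetical helper: the URL-building step of get/post on its own
build_url() {
  echo "${SELENIUM_URL}/${SELENIUM_SESSION_ID}${1?Must provide URL}"
}

# Try it in a subshell with placeholder values, so the real session
# variables in this shell are left alone
(
  SELENIUM_URL="http://localhost:4444/wd/hub/session"
  SELENIUM_SESSION_ID="abc123"
  build_url /screenshot
  # prints http://localhost:4444/wd/hub/session/abc123/screenshot
)

# Without a path the expansion fails; the subshell keeps that failure
# from ending the current shell
( build_url ) 2>/dev/null || echo "no path given"
```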


Let’s use these new functions to take a screenshot:

get /screenshot `# Ask for a screenshot` \
    | jq -r .value `# Fetch the raw (-r) string from the 'value'` \
    | base64 --decode `# Decode the base64 data (--decode works with both macOS and GNU base64)` \
    > screenshot.png `# Save the result to screenshot.png`

If you open ‘screenshot.png’ you should see the page screenshotted.
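
If the file does not open, it helps to check whether the decoded bytes are a PNG at all. Every PNG file starts with a fixed 8-byte signature whose second to fourth bytes spell PNG. A rough sketch of that check, using a made-up helper name is_png:

```shell
# Hypothetical helper: succeed if bytes 2-4 of the file are "PNG",
# which is part of the signature at the start of every PNG file.
is_png() {
  [ "$(dd if="$1" bs=1 skip=1 count=3 2>/dev/null)" = "PNG" ]
}

# Usage, assuming screenshot.png was written by the screenshot command
if is_png screenshot.png; then
  echo "screenshot.png looks like a PNG"
else
  echo "screenshot.png is not a PNG; inspect the raw output of get /screenshot"
fi
```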

Extracting text from the page

Let’s grab the URL of each search result.

Looking at the document using ‘inspect element’, I can see that search result URLs are all in cite elements and nothing else is. That sounds like a good way to identify those URLs.

for element_id in $( \
  post /elements \
       '{"using": "css selector", "value": "cite"}' \
       | jq -r '.value[].ELEMENT' \
  ) ; do
  get "/element/$element_id/text" | jq -r .value
done

I get one line of output per search result, including breadcrumb-style entries such as: › Geoscience › Topics
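
To see what the jq filter in the loop does without a browser, you can feed it a hand-written response. The JSON below is a made-up example of the shape returned by /elements; the element IDs are invented and real ones will look different.

```shell
# Made-up /elements response; real element IDs will differ
sample_response='{"status": 0, "value": [{"ELEMENT": "0.1-1"}, {"ELEMENT": "0.1-2"}]}'

# The same filter as in the loop: one element ID per line
element_ids() {
  jq -r '.value[].ELEMENT'
}

echo "$sample_response" | element_ids
# prints:
# 0.1-1
# 0.1-2
```

The for loop then runs once per printed line, which is why each ID can be dropped straight into the /element/<id>/text path.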

Further reading

I’ve referred frequently to the spec in writing this. In particular, the “List of Endpoints” and “Command Contexts” sections are useful. At the time of writing you can search the page for those terms to get there quickly.

Builtin methods

The supported actions (from the spec) are summarized below.

The spec makes for hard reading but each section is quite short.

Element interaction

These APIs operate on element handles, which are opaque IDs allocated by selenium when you find/get an element.

Use those IDs to interact with elements.


Inject arbitrary JS to run in the page.
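
As a rough sketch, the request body for running a script is a JSON object with the script source and a list of arguments. The helper name script_payload is made up for this example. The endpoint path in the final comment is an assumption: the legacy JSON wire protocol used /execute, while the W3C spec names it /execute/sync, and which one your server accepts depends on its version.

```shell
# Hypothetical helper: build the JSON body for a script-execution request.
# "script" is the JS source; "args" is left empty in this sketch.
script_payload() {
  jq -cn --arg script "$1" '{script: $script, args: []}'
}

# Prints {"script":"return document.title","args":[]}
script_payload 'return document.title'

# With a live session and the post function defined earlier, the body
# could then be sent like this (the endpoint path is an assumption):
# post /execute "$(script_payload 'return document.title')"
```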

Modals: ‘alert’, ‘confirm’ and ‘prompt’


Window management

Lets you open multiple tabs/windows in the same selenium session.

Timeout management

Lets you define how long a page is allowed to spend loading html/scripts.

I haven’t had a reason to use them to date so can’t describe them in much more detail than that.

Action API

This API lets you script relatively advanced sequences of user behaviors, e.g. by describing the touch events that form a gesture.

I haven’t had a reason to use it to date, so I can’t describe it in much more detail than that.

There is no comment box here.
If you have something to contribute I would be delighted to hear it via email.