Learn browser automation from scratch
Dec 10, 2016 Daniel Heath

Browser automation with selenium-webdriver

Why?

Browser tests are super useful, but they’re famous for hard-to-debug, unreliable failures.

This reputation is largely the result of developers learning high-level frameworks without understanding the underlying abstractions.

In this guide I’ll show you some cool terminal tricks - including how to drive a browser directly.

I wouldn’t write my tests for a production app this way, but it illustrates the underlying concepts in a way most tools abstract away.

This will help you to develop a better understanding of how most web testing tools work.

Setup

You will need:

Java
Google Chrome
A base64 decoder (preinstalled on Mac) to get screenshots
A terminal (preinstalled on Linux/Mac)
curl (installed on most Linux/Mac systems)
jq to extract fields from JSON documents
selenium-server-standalone.jar, the selenium-webdriver server (not the same as selenium-rc; referred to as ‘selenium’ below)
chromedriver

If you are using homebrew on mac, you can install the last three that way:

brew install selenium-server-standalone
brew install jq
brew install chromedriver

Getting started

Selenium is the predominant way of writing automated browser tests. It can operate all major browsers.

Selenium runs on a client/server model with the following components:

Some website(s)
A browser (chrome, in this example)
Selenium-server (to open and control the browser)
Your custom browser automation code (I’ll help you write some in this post)

Open a terminal and run the standalone server:

# On OSX using homebrew:
/usr/local/bin/selenium-server -port 4444

# Everyone else - you'll need to be in the directory where selenium is downloaded to / installed:
java -jar selenium-server-standalone-3.0.1.jar -port 4444

Leave this running and open another terminal window. In the new window, run the following to ask the server to start a new chrome session (copy & paste will work, or save it to a file and run source <file>):

`# Save the URL to the local selenium server`
`# to the variable SELENIUM_URL`
SELENIUM_URL="http://localhost:4444/wd/hub/session"

`# Save the result of asking for a new browser session`
`# to the variable SELENIUM_SESSION_ID`
SELENIUM_SESSION_ID=$( \
  `# Use curl to make a HTTP POST asking the selenium` \
  `# server to open chrome` \
  curl "$SELENIUM_URL" \
       --data-ascii \
       '{"desiredCapabilities":{"browserName":"chrome"}}' \
       | jq -r .sessionId \
)

You should see a new window running chrome, looking at a blank page.

The above commands set your current shell (terminal window) up with the variables SELENIUM_URL and SELENIUM_SESSION_ID, which are used in later commands.

If you open a new shell, those variables will not be present and the commands won’t work.
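One way around this is to write the two variables to a file and source that file from the new shell. This is only a sketch: the function name save_session and the default file name selenium-session.env are my own inventions, not part of selenium, and it assumes neither value contains a single quote.

```shell
# Hypothetical helper: save the session variables so another shell can load them.
# Assumes neither value contains a single quote.
save_session() {
  # The default file name is an assumption; pass another path as the first argument.
  local session_file="${1-selenium-session.env}"
  {
    printf "SELENIUM_URL='%s'\n" "${SELENIUM_URL?Must set SELENIUM_URL}"
    printf "SELENIUM_SESSION_ID='%s'\n" "${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}"
  } > "$session_file"
}
```

Run save_session in the shell that created the session, then run source selenium-session.env in the new one. The file only stores the ID; the session itself still lives in the selenium server, so this stops working once the server or the browser is closed.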

Navigating

Let’s make our browser go to Google.

curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/url" \
     --data-ascii \
     '{"url": "https://google.com"}' \
     | jq .

You should see:

A page of JSON, including {"state": "success"}
The chrome window is now on google.com.
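You can also confirm the navigation without looking at the window: the same /url path answers a GET with the address the browser is currently on (this is Get Current URL in the spec). A small sketch, assuming SELENIUM_URL and SELENIUM_SESSION_ID are still set from earlier; current_url is a made-up name.

```shell
# Hypothetical helper: ask the session which URL the browser is on.
current_url() {
  # Without --data, curl sends a GET to the same /url path we POSTed to above.
  curl --silent "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}/url" \
    | jq -r .value
}
```

Running current_url prints the address as plain text. It may differ from the one you sent if the site redirected the browser.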

Filling in forms

Let’s automate issuing a search. First, find the input element:

SEARCH_BOX_ELEMENT=$(
  curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/element" \
       --data-ascii \
       '{"using": "css selector", "value": "input[name=q]"}' \
       `# Use 'jq' to extract part of the response. ` \
       `# -r to get the text without quotes. ` \
       | jq -r .value.ELEMENT \
)

Now we’ll type something into the search box:

curl "$SELENIUM_URL/$SELENIUM_SESSION_ID/element/$SEARCH_BOX_ELEMENT/value" \
     --data-ascii \
     '{"value": ["s","e","l","e","n","i","u","m", "\n"]}' \
     | jq .

Switching to the chrome window, you should be able to see the search results.
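To run a second search you would first want to empty the box, which is what Element Clear (listed further down) is for. A rough sketch, with clear_element as a hypothetical name; it assumes the same two session variables as before.

```shell
# Hypothetical helper: empty a form field, given its element ID.
clear_element() {
  local element_id="${1?Must provide an element ID}"
  # The command takes no parameters; the empty JSON object is only there
  # so that curl sends a POST.
  curl --silent \
       "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}/element/${element_id}/clear" \
       --data-ascii '{}' \
    | jq .
}
```

Element IDs belong to the page they were found on. Once the results page has loaded, the ID saved in SEARCH_BOX_ELEMENT may no longer be valid, so find the input again before calling clear_element with the new ID.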

Customizing your shell

This is getting repetitive. Let’s define some bash functions to make it easier:

function debug {
  if [ -n "$DEBUG" ] ; then
    # Print debug messages to STDERR
    echo "$@" 1>&2
  fi
}

function get {
  # Build the URL, requiring that a path was provided
  local URL="${SELENIUM_URL}/${SELENIUM_SESSION_ID}${1?Must provide URL}"

  debug "$URL"

  # Fetch the URL and format with jq.
  curl --silent "${URL}" | jq .
}

function post {
  # Build the URL, requiring that a path was provided
  local URL="${SELENIUM_URL}/${SELENIUM_SESSION_ID}${1?Must provide URL}"

  # Set a default value for the POST data
  local DATA="${2-{\}}"

  # Validate that the provided data is valid JSON.
  # Set ALLOW_INVALID_JSON to any value to skip the error.
  echo "$DATA" | jq . > /dev/null || \
    : "${ALLOW_INVALID_JSON?Second argument is not valid JSON}"

  debug "$URL"

  # Fetch the URL and format the response with jq.
  curl --silent "${URL}" --data-ascii "${DATA}" | jq .
}

Screenshots

Let’s use these new functions to take a screenshot:

get /screenshot `# Ask for a screenshot` \
    | jq -r .value `# Fetch the raw (-r) string from the 'value'` \
    | base64 --decode `# Decode the base64 data from selenium (--decode works on both mac and linux)` \
    > screenshot.png `# Save the result to screenshot.png`

If you open ‘screenshot.png’ you should see a screenshot of the page.

Extracting text from the page

Let’s grab the URL of each search result.

Looking at the document using ‘inspect element’, I can see that search result URLs are all in cite elements and nothing else is. That sounds like a good way to identify those URLs.

for element_id in $( \
  post /elements \
       '{"using": "css selector", "value": "cite"}' \
       | jq -r '.value[].ELEMENT' \
  ) ; do
  get "/element/$element_id/text" | jq -r .value
done

I get the following output:

www.seleniumhq.org/
www.seleniumhq.org/
www.webmd.com/a-to-z-guides/supplement-guide-selenium
https://en.wikipedia.org/wiki/Selenium
https://ods.od.nih.gov/factsheets/Selenium-HealthProfessional/
https://www.nrv.gov.au/nutrients/selenium
https://cosmosmagazine.com › Geoscience › Topics
www.whfoods.com/genpage.php?dbid=95&tname=nutrient
selenium-python.readthedocs.io/
selenium-python.readthedocs.io/getting-started.html
https://seleniumhq.wordpress.com/

Further reading

Builtin methods

The list of supported actions (from the spec) is summarized below.

The spec makes for hard reading but each section is quite short.

New Session (POST) - Open a browser session. One of only two commands that don’t require a session ID (the other is Status)
Delete Session (DELETE) - Terminate a browser session
Status (GET) - Report whether the server is ready to accept new sessions
Go (POST) - Navigate the current tab
Get Current URL (GET)
Back (POST)
Forward (POST)
Refresh (POST)
Get Title (GET)
Get Page Source (GET)
Take Screenshot (GET) - Return a base64 encoded PNG of the page
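Delete Session is worth trying once you are done, since it closes the chrome window opened earlier. The get and post helpers don’t cover DELETE, so here is a standalone sketch; end_session is a made-up name and it assumes the session variables from earlier are still set.

```shell
# Hypothetical helper: terminate the current browser session.
end_session() {
  # --request DELETE changes the HTTP method; the URL is the session itself.
  curl --silent --request DELETE \
       "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}" \
    | jq .
}
```

After end_session the ID in SELENIUM_SESSION_ID refers to a session that no longer exists, so ask for a new session before running further commands.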

Element interaction

These APIs operate on element handles, which are opaque IDs allocated by selenium when you find/get an element.

Use those IDs to interact with elements.

Find Element (POST)
Find Element From Element (POST)
Find Elements (POST)
Find Elements From Element (POST)
Get Active Element (GET)
Is Element Selected (GET)
Get Element Attribute (GET)
Get Element Property (GET)
Get Element CSS Value (GET)
Get Element Text (GET)
Get Element Tag Name (GET)
Get Element Rect (GET) - get element position and size
Is Element Enabled (GET) - only for form controls
Element Click (POST)
Element Clear (POST) - Empties text fields, unchecks checkboxes, etc.
Element Send Keys (POST) - Can be used to fill in forms, hit enter, etc.
Take Element Screenshot (GET) - Return a base64 encoded PNG of the element
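The "From Element" variants search inside an element you already hold instead of the whole page. A hedged sketch of Find Element From Element: find_in_element is a hypothetical name, the session variables are assumed to be set, and the raw value is printed because the key that holds the new ID can differ between server versions.

```shell
# Hypothetical helper: find the first match for a CSS selector inside a parent element.
find_in_element() {
  local parent_id="${1?Must provide a parent element ID}"
  local selector="${2?Must provide a CSS selector}"
  local body
  # Let jq build the request body so quotes in the selector are escaped properly.
  body=$(jq -c -n --arg value "$selector" '{using: "css selector", value: $value}')
  curl --silent \
       "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}/element/${parent_id}/element" \
       --data-ascii "$body" \
    | jq .value
}
```

For example, find_in_element "$SOME_ELEMENT_ID" "a" looks for the first link inside the element whose ID is stored in SOME_ELEMENT_ID (a placeholder for an ID returned by an earlier find).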

Javascript

Inject arbitrary JS to run in the page.

Execute Script (POST) - The script is run as the body of a function, so use return to send a value back; the return value is serialized and included in the response
Execute Async Script (POST)
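As an illustration of Execute Script, here is a sketch that runs a snippet and prints whatever it returns. The name run_js is made up. The default path /execute is an assumption based on the older JSON wire protocol; the W3C spec names the endpoint /execute/sync, so the path can be passed as a second argument.

```shell
# Hypothetical helper: run a snippet of JavaScript in the current page.
# $1 is the script, written as a function body (use `return` to send a value back).
# $2 is the endpoint path; the "/execute" default is an assumption (see above).
run_js() {
  local script="${1?Must provide a script}"
  local endpoint="${2-/execute}"
  local body
  # Let jq build the request body so quotes in the script are escaped properly.
  body=$(jq -c -n --arg script "$script" '{script: $script, args: []}')
  curl --silent \
       "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}${endpoint}" \
       --data-ascii "$body" \
    | jq .value
}
```

For example, run_js 'return document.title' should print the page title as a JSON string. If the server answers with an unknown command error, try run_js 'return document.title' /execute/sync instead.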

Modals: ‘alert’, ‘confirm’ and ‘prompt’

Dismiss Alert (POST)
Accept Alert (POST)
Get Alert Text (GET)
Send Alert Text (POST) - Fill in a prompt

Cookies

Get All Cookies (GET)
Get Named Cookie (GET)
Add Cookie (POST)
Delete Cookie (DELETE)
Delete All Cookies (DELETE)
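Get All Cookies returns an array of cookie objects, which pairs nicely with jq. A minimal sketch that lists just the cookie names; cookie_names is a hypothetical name and the session variables are assumed to be set.

```shell
# Hypothetical helper: print the name of every cookie visible to the current page.
cookie_names() {
  # GET /cookie returns an array of cookie objects under "value".
  curl --silent "${SELENIUM_URL?Must set SELENIUM_URL}/${SELENIUM_SESSION_ID?Must set SELENIUM_SESSION_ID}/cookie" \
    | jq -r '.value[].name'
}
```

Swap the jq filter for '.value[]' to see every field of each cookie.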

Window management

Lets you open multiple tabs/windows in the same selenium session.

Get Window Handle (GET)
Close Window (DELETE)
Switch To Window (POST)
Get Window Handles (GET)
Fullscreen Window (POST)
Maximize Window (POST)
Set Window Size (POST)
Get Window Position (GET)
Set Window Position (POST)
Get Window Size (GET)
Switch To Frame (POST)
Switch To Parent Frame (POST)

Timeout management

Lets you define how long a page is allowed to spend loading html/scripts.

I haven’t had a reason to use them to date so can’t describe them in much more detail than that.

Get Timeouts (GET)
Set Timeouts (POST)

Action API

This API lets you script relatively advanced sequences of user behaviors, e.g. by describing the touch events that form a gesture.

I haven’t had a reason to use them to date so can’t describe them in much more detail than that.

Perform Actions (POST)
Release Actions (DELETE)
