HTTP

Terms defined: character encoding, concurrency, HTTP, header (of HTTP request or response), HTTP method, HTTP request, HTTP response, HTTP status code, JavaScript Object Notation, local server, localhost, MIME type, port, query parameter, refactor, resolve (a path), sandbox, static file, UTF-8, web scraping

Start with Something Simple

import requests

url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")

status code: 200
body:
Start where you are, use what you have, help who you can.

Use the requests module to send an HTTP request
The URL identifies the file we want
- Though as we’ll see, the server can interpret it differently
Response includes:
- HTTP status code such as 200 (OK) or 404 (Not Found)
- The text of the response
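
The status codes mentioned above are standardized, and Python's standard library lists them in `http.HTTPStatus`. A minimal sketch that turns a numeric code into its standard phrase; the helper name `describe` is made up for this example:

```python
from http import HTTPStatus


def describe(code):
    """Return the numeric status code followed by its standard phrase."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


print(describe(200))
print(describe(404))
```

The same enumeration is what the server code later in this section uses when it writes `HTTPStatus.OK` or `HTTPStatus.NOT_FOUND`.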

What Just Happened

Figure 5.1 shows what happened

Figure 5.1: Lifecycle of an HTTP request and response

Open a connection to the server
Send an HTTP request for the file we want
Server creates a response that includes the contents of the file
Sends it back
requests parses the response and creates a Python object for us

Request Structure

import requests
from requests_toolbelt.utils import dump

url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
data = dump.dump_all(response)
print(str(data, "utf-8"))

GET /safety-tutorial/site/motto.txt HTTP/1.1
Host: gvwilson.github.io
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

First line is method, URL, and protocol version
Every HTTP request can have headers with extra information
- And optionally data being uploaded
Yes, it’s all just text
- Except for uploaded data, which is just bytes
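
To make "it's all just text" concrete, here is a sketch that assembles a request like the one above by hand. `build_request` is a hypothetical helper (not part of requests), and it ignores uploaded data entirely:

```python
def build_request(method, path, host, headers=None):
    """Assemble the bytes of a simple HTTP/1.1 request with no body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}: {value}")
    # Lines end with CRLF, and a blank line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_request(
    "GET",
    "/safety-tutorial/site/motto.txt",
    "gvwilson.github.io",
    {"Accept": "*/*"},
)
print(request.decode("ascii"))
```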

Response Structure

import requests

url = "https://gvwilson.github.io/web-tutorial/site/motto.txt"
response = requests.get(url)
for key, value in response.headers.items():
    print(f"{key}: {value}")

Connection: keep-alive
Content-Length: 5142
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
ETag: W/"65c56fc7-239b"
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
Content-Encoding: gzip
X-GitHub-Request-Id: A08C:5A357:7CF794:9CEA06:65D13923
Accept-Ranges: bytes
Date: Sat, 17 Feb 2024 22:54:27 GMT
Via: 1.1 varnish
Age: 0
X-Served-By: cache-yyz4563-YYZ
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1708210467.301651,VS0,VE20
Vary: Accept-Encoding
X-Fastly-Request-ID: cb16df2dfa73aaf6de87924c743dd1e50a0ce570

Every HTTP response also has headers with extra information
- Does not include status code: that appears in the first line
Most important for now are:
- Content-Length: number of bytes in response data (i.e., how much to read)
- Content-Type: MIME type of data (e.g., text/plain)
From now on we will only show interesting headers
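
A response can be pulled apart the same way a request is put together. The sketch below is an illustration only: `parse_response` is a made-up helper, the sample bytes are invented, and real clients such as requests also handle cases this ignores (compressed or chunked bodies, repeated headers).

```python
def parse_response(raw):
    """Split raw response bytes into status code, reason, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()  # names are case-insensitive
    # Content-Length says how many bytes of body to read.
    length = int(headers.get("content-length", len(body)))
    return int(code), reason, headers, body[:length]


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
print(parse_response(raw))
```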

Exercise

Add a header called Studying with the value safety to the requests script shown above. Does it make a difference to the response? Should it?
What is the difference between the Content-Type and the Content-Encoding headers?

When Things Go Wrong

import requests

url = "https://gvwilson.github.io/web-tutorial/site/nonexistent.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body length: {len(response.text)}")

status code: 404
body length: 9115

The 404 status code tells us something went wrong
The 9 kilobyte response is an HTML page with an embedded image (the GitHub logo)
The page contains human-readable error messages
- But we have to know page format to pull them out
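
As a sketch of what "knowing the page format" involves, the code below pulls the title out of an error page with the standard library's `html.parser`. The sample HTML is invented for this example and is much simpler than GitHub's real 404 page; `TitleExtractor` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title.strip()


page = "<html><head><title>Page not found</title></head><body><h1>404</h1></body></html>"
print(extract_title(page))
```

If the site changes its page layout, code like this silently stops working, which is one reason the status code is the more reliable signal.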

Exercise

Look at this list of HTTP status codes.

What is the difference between status code 403 and status code 404?
What is status code 418 used for?
Under what circumstances would you expect to get a response whose status code is 505?

Getting JSON

import requests

url = "https://gvwilson.github.io/web-tutorial/site/motto.json"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body as text: {len(response.text)} bytes")
as_json = response.json()
print(f"body as JSON:\n{as_json}")

status code: 200
body as text: 107 bytes
body as JSON:
{'first': 'Start where you are', 'second': 'Use what you have', 'third': 'Help who you can'}

Parsing data out of HTML is called web scraping
- Painful and error prone
Better to have the server return data as data
- Preferred format these days is JSON
- So common that requests has built-in support
Unfortunately, there is no standard for representing tabular data as JSON (Figure 5.2)
- A list with one list with N column names + N lists of values?
- A list with N dictionaries, all with the same keys?
- A dictionary with column names and lists of values, all the same length?

Figure 5.2: Representing tables as JSON
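
The three layouts listed above can be built from the same small table. This is only an illustration: the two rows are taken from the bird data used later in this section, and the variable names are arbitrary.

```python
import json

columns = ["species_id", "num"]
rows = [["redcro", 3], ["rebnut", 1]]

# 1. A list of lists: the column names first, then one list per row.
as_lists = [columns] + rows

# 2. A list of dictionaries, all with the same keys.
as_records = [dict(zip(columns, row)) for row in rows]

# 3. A dictionary of column names and equal-length lists of values.
as_columns = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

for layout in (as_lists, as_records, as_columns):
    print(json.dumps(layout))
```

All three are valid JSON, so a client has to be told (or has to guess) which one a server is using.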

Exercise

Write a requests script that gets the current location and crew roster of the International Space Station.

Local Web Server

Pushing files to GitHub so that we can use them is annoying
And we want to show how to get things wrong so that we can then make them right
Use Python’s http.server module to run a local server

python -m http.server -d site

Host name is localhost
Uses port 8000 by default
- So URLs look like http://localhost:8000/path/to/file
-d site tells the server to use site as its root directory
Use this local server for the next few examples
- Build our own server later on to show how it works

Talk to Local Server

import requests

URL = "http://localhost:8000/motto.txt"

response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")

::ffff:127.0.0.1 - - [18/Feb/2024 09:12:24] "GET /motto.txt HTTP/1.1" 200 -
status code: 200
body:
Start where you are, use what you have, help who you can.

Concurrent systems are hard to debug
- Multiple streams of activity
- Order may change from run to run
- Usually easiest to run each process in its own terminal window
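
When a second terminal is inconvenient (for example, in an automated test), one alternative is to run the server in a background thread of the same script. The sketch below uses only the standard library and makes several assumptions that are not in the original: the helper name `fetch_from_local_server`, a temporary directory in place of `site`, port 0 so that the operating system picks a free port, and `http.client` in place of requests.

```python
import http.client
import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path
from tempfile import TemporaryDirectory


def fetch_from_local_server(filename, text):
    """Serve one file from a temporary directory and fetch it over HTTP."""
    with TemporaryDirectory() as root:
        Path(root, filename).write_text(text, encoding="utf-8")
        handler = partial(SimpleHTTPRequestHandler, directory=root)
        server = HTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
        port = server.server_address[1]
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", f"/{filename}")
            response = conn.getresponse()
            status = response.status
            body = response.read().decode("utf-8")
            conn.close()
        finally:
            server.shutdown()      # stop the serve_forever loop
            server.server_close()  # release the socket
            thread.join()
    return status, body


if __name__ == "__main__":
    print(fetch_from_local_server("motto.txt", "Start where you are"))
```

The server's log line and the client's output still interleave on the screen, so the ordering caveat above applies here too.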

Our Own File Server

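from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
import os
import sys

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Map the URL path onto a file below the working directory.
            url_path = self.path.lstrip("/")
            full_path = Path.cwd().joinpath(url_path)
            print(f"'{self.path}' => '{full_path}'")
            if not full_path.exists():
                raise ServerException(f"{self.path} not found")
            elif not full_path.is_file():
                raise ServerException(f"{self.path} not file")
            else:
                self.handle_file(self.path, full_path)
        except Exception as msg:
            # Report every failure to the client as an error page.
            self.handle_error(msg)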

Our RequestHandler handles a single GET request
Combine working directory with requested file path to get local path to file
Return that if it exists and is a file or raise an error

Support Code

Serve files

    def send_content(self, content, status):
        self.send_response(int(status))
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

Handle errors

ERROR_PAGE = """\
<html>
  <head><title>Error accessing {path}</title></head>
  <body>
    <h1>Error accessing {path}: {msg}</h1>
  </body>
</html>
"""

    def handle_error(self, msg):
        content = ERROR_PAGE.format(path=self.path, msg=msg)
        content = bytes(content, "utf-8")
        self.send_content(content, HTTPStatus.NOT_FOUND)

Define our own exceptions so we’re sure we’re only catching what we expect

class ServerException(Exception):
    pass

Running Our File Server

if __name__ == "__main__":
    os.chdir(sys.argv[1])
    serverAddress = ("", 8000)
    server = HTTPServer(serverAddress, RequestHandler)
    print(f"serving in {os.getcwd()}...")
    server.serve_forever()

And then get motto.txt as before

Built-in Safety

Modify requests script to take URL as command-line parameter

import requests
import sys

URL = sys.argv[1]

response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")

Add a sub-directory to site called sandbox with a file example.txt
- Called a sandbox because it’s a safe place to play
Serve that sub-directory

python src/file_server_unsafe.py site/sandbox

Can get files from that directory

python src/get_url.py http://localhost:8000/example.txt

'/example.txt' => '/tut/safety/site/sandbox/example.txt'
127.0.0.1 - - [21/Feb/2024 06:04:32] "GET /example.txt HTTP/1.1" 200 -

status code: 200
body:
example file

But not from parent directory (which isn’t part of sandbox)

python src/get_url.py http://localhost:8000/../motto.txt

'/motto.txt' => '/tut/safety/site/sandbox/motto.txt'
127.0.0.1 - - [21/Feb/2024 06:04:38] "GET /motto.txt HTTP/1.1" 404 -

status code: 404
body:
<html>
  <head><title>Error accessing /motto.txt</title></head>
  <body>
    <h1>Error accessing /motto.txt: /motto.txt not found</h1>
  </body>
</html>

requests strips the leading .. off the path before sending it
And if we try that URL in the browser, same thing happens
So we’re safe, right?

Introducing netcat

netcat (often just nc) is a computer networking tool
Open a connection, send exactly what the user types, and show exactly what is sent in response

nc localhost 8000

GET /example.txt HTTP/1.1

HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:37:37 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 13

example file

Let’s see what happens if we do send a URL with .. in it

GET ../motto.txt HTTP/1.1

HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:38:50 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 58

Start where you are, use what you have, help who you can.

We shouldn’t be able to see files outside the sandbox
But if someone doesn’t strip out the .. characters, users can escape
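
The escape can be reproduced without a server. The sketch below imitates how the unsafe handler builds a local path (strip the leading slash, then join onto the root) and then normalizes the result to show where it points. `naive_local_path` is a hypothetical helper, and the root directory is the one shown in the log output above.

```python
import posixpath
from pathlib import PurePosixPath


def naive_local_path(root, url_path):
    """Join the requested path onto the root without checking for '..'."""
    return PurePosixPath(root).joinpath(url_path.lstrip("/"))


root = "/tut/safety/site/sandbox"
inside = naive_local_path(root, "/example.txt")
escaped = naive_local_path(root, "../motto.txt")

# normpath collapses '..' textually, revealing the directory actually reached.
print(posixpath.normpath(str(inside)))
print(posixpath.normpath(str(escaped)))
```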

Exercise

The shortcut ~username means “the specified user’s home directory” in the shell, while ~ on its own means “the current user’s home directory”. Create a file called test.txt in your home directory and then try to get ~/test.txt using your browser, requests, and netcat. What happens with each and why?

A Safer File Server

    def handle_file(self, given_path, full_path):
        try:
            resolved_path = str(full_path.resolve())
            sandbox = str(Path.cwd().resolve())
            if not resolved_path.startswith(sandbox):
                raise ServerException(f"Cannot access {given_path}")
            with open(full_path, "rb") as reader:
                content = reader.read()
            self.send_content(content, HTTPStatus.OK)
        except FileNotFoundError:
            raise ServerException(f"Cannot find {given_path}")
        except IOError:
            raise ServerException(f"Cannot read {given_path}")

Resolve the constructed path
Then check that it’s below the current working directory (i.e., the sandbox)
And fail if it isn’t
- Using our own ServerException guarantees that all errors are handled the same way
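
One caveat with comparing resolved paths as strings: a sibling directory whose name merely starts with the sandbox's name (say `sandbox-private` next to `sandbox`) passes a `startswith` test. A possible alternative, assuming Python 3.9 or later, is `Path.is_relative_to`, which compares whole path components. The helper names below are made up for this sketch and are not part of the server shown above.

```python
from pathlib import Path


def is_inside(sandbox, candidate):
    """Check by path components whether candidate stays below sandbox."""
    sandbox = Path(sandbox).resolve()
    resolved = (sandbox / candidate).resolve()
    return resolved.is_relative_to(sandbox)


def prefix_check(sandbox, candidate):
    """The string comparison, kept here only for contrast."""
    sandbox = Path(sandbox).resolve()
    resolved = (sandbox / candidate).resolve()
    return str(resolved).startswith(str(sandbox))
```

For a sandbox at `site/sandbox`, both functions accept `example.txt` and reject `../motto.txt`, but only `is_inside` rejects `../sandbox-private/secret.txt`.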

Exercise

Refactor the do_GET and handle_file methods in RequestHandler so that all checks are in one place. Does this make the code easier to understand overall? Do you think making code easier to understand also makes it safer?

Serving Data

Rarely have JSON lying around as static files
More common to have either CSV or a database

head -n 10 site/birds.csv

loc_id,latitude,longitude,region,year,month,day,species_id,num
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,redcro,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,rebnut,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,comred,13.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,dowwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,bkcchi,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,haiwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,8,nobird,
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,rebnut,2.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,bkcchi,3.0

Modify server to generate it dynamically
Main program

def main():
    sandbox, filename = sys.argv[1], sys.argv[2]
    os.chdir(sandbox)
    df = pl.read_csv(filename)
    serverAddress = ("", 8000)
    server = BirdServer(df, serverAddress, RequestHandler)
    server.serve_forever()

Create our own server class because we want to pass the dataframe in the constructor

class BirdServer(HTTPServer):
    def __init__(self, data, server_address, request_handler):
        super().__init__(server_address, request_handler)
        self._data = data

do_GET converts the dataframe to JSON (will modify later to do more than this)

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        result = self.server._data.write_json(row_oriented=True)
        self.send_content(result, HTTPStatus.OK)
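
Row-oriented JSON means one object per row of the table. To show that shape without the dataframe library, the sketch below does the conversion with the standard library's `csv` and `json` modules instead; `csv_to_records_json` is a made-up name, and unlike a dataframe it leaves every value as a string.

```python
import csv
import io
import json


def csv_to_records_json(text):
    """Convert CSV text to a JSON list with one object per row."""
    reader = csv.DictReader(io.StringIO(text))
    return json.dumps(list(reader))


sample = "species_id,num\nredcro,3.0\nrebnut,1.0\n"
print(csv_to_records_json(sample))
```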

send_content encodes the JSON string as UTF-8 and sets the MIME type to application/json

    def send_content(self, content, status):
        content = bytes(content, "utf-8")
        self.send_response(int(status))
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

Can view in browser at http://localhost:8000 or use requests to fetch as before
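
The order of operations in send_content matters: Content-Length counts bytes, so it has to be computed after encoding. A small sketch of the difference, using a hypothetical helper `content_length`:

```python
def content_length(text, encoding="utf-8"):
    """Number of bytes the text occupies once encoded."""
    return len(text.encode(encoding))


word = "caf\u00e9"  # four characters; the last one needs two bytes in UTF-8
print(len(word), content_length(word))
```

For pure ASCII text the two numbers agree, which is why a mistake here can go unnoticed until the data contains accented or non-Latin characters.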

Slicing Data

URL can contain query parameters
Want http://localhost:8000/?year=2021&species=rebnut to select red-breasted nuthatches in 2021
Put slicing in a method of its own

    def do_GET(self):
        result = self.filter_data()
        as_json = result.write_json(row_oriented=True)
        self.send_content(as_json, HTTPStatus.OK)

Use urlparse and parse_qs from urllib.parse to get query parameters
- parse_qs returns a dictionary that maps each key to a list of values
Then filter data as requested

    def filter_data(self):
        # parse_qs maps each query parameter to a list of values.
        params = parse_qs(urlparse(self.path).query)
        result = self.server._data
        if "species" in params:
            species = params["species"][0]
            result = result.filter(pl.col("species_id") == species)
        if "year" in params:
            year = int(params["year"][0])
            result = result.filter(pl.col("year") == year)
        return result
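
To see the shape of the dictionary that the filtering code indexes into, the sketch below runs the same two library calls on the example URL from this section. `query_params` is a name chosen for this illustration.

```python
from urllib.parse import parse_qs, urlparse


def query_params(url):
    """Return the query parameters of a URL as a {name: [values]} dictionary."""
    return parse_qs(urlparse(url).query)


params = query_params("http://localhost:8000/?year=2021&species=rebnut")
print(params)
```

Every value is a list of strings because a parameter may appear more than once in a URL; that is why the server takes element `[0]` and converts the year with `int`.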

Exercise

Write a function that takes a URL as input and returns a dictionary whose keys are the query parameters’ names and whose values are lists of their values. Do you now see why you should use the library function to do this?
Modify the server so that clients can specify which columns they want returned as a comma-separated list of names. If the client asks for a column that doesn’t exist, ignore it.
Modify your solution to the previous exercise so that if the client asks for a column that doesn’t exist the server returns a status code 400 (Bad Request) and a JSON blob with two keys: status_code (set to 400) and error_message (set to something informative). Explain why the server should return JSON rather than HTML in the case of an error.