HTTP

Terms defined: character encoding, concurrency, HTTP, header (of HTTP request or response), HTTP method, HTTP request, HTTP response, HTTP status code, JavaScript Object Notation, local server, localhost, MIME type, port, query parameter, refactor, resolve (a path), sandbox, static file, UTF-8, web scraping

Start with Something Simple

import requests

url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
status code: 200
body:
Start where you are, use what you have, help who you can.

What Just Happened

Figure 5.1: Lifecycle of an HTTP request and response

Request Structure

import requests
from requests_toolbelt.utils import dump

url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
data = dump.dump_all(response)
print(str(data, "utf-8"))
GET /safety-tutorial/site/motto.txt HTTP/1.1
Host: gvwilson.github.io
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
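
The dump above shows the shape of every HTTP request: a request line with the method, path, and protocol version, then one "Name: value" header per line, then a blank line. As a minimal sketch using only the standard library, the hypothetical helper format_request below (it is not part of requests) rebuilds that text from a URL. Real clients such as requests add more headers than this.

```python
from urllib.parse import urlsplit


def format_request(url, headers=None):
    """Return the text of a simple GET request for url (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path = f"{path}?{parts.query}"
    # Request line first, then headers; Host is mandatory in HTTP/1.1.
    lines = [f"GET {path} HTTP/1.1", f"Host: {parts.netloc}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


text = format_request(
    "https://gvwilson.github.io/safety-tutorial/site/motto.txt",
    {"Accept": "*/*"},
)
print(text)
```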

Response Structure

import requests

url = "https://gvwilson.github.io/web-tutorial/site/motto.txt"
response = requests.get(url)
for key, value in response.headers.items():
    print(f"{key}: {value}")
Connection: keep-alive
Content-Length: 5142
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
ETag: W/"65c56fc7-239b"
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
Content-Encoding: gzip
X-GitHub-Request-Id: A08C:5A357:7CF794:9CEA06:65D13923
Accept-Ranges: bytes
Date: Sat, 17 Feb 2024 22:54:27 GMT
Via: 1.1 varnish
Age: 0
X-Served-By: cache-yyz4563-YYZ
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1708210467.301651,VS0,VE20
Vary: Accept-Encoding
X-Fastly-Request-ID: cb16df2dfa73aaf6de87924c743dd1e50a0ce570
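
The Content-Type header above packs two things into one value: a MIME type and a character encoding. One way to pull them apart, sketched here with the standard library's email.message.Message (HTTP headers follow the same syntax as email headers), is shown below. The helper name parse_content_type is made up for this example.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (MIME type, charset or None)."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


mime_type, charset = parse_content_type("text/html; charset=utf-8")
print(mime_type, charset)
```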

Exercise

  1. Add a header called Studying with the value safety to the requests script shown above. Does it make a difference to the response? Should it?

  2. What is the difference between the Content-Type and the Content-Encoding headers?

When Things Go Wrong

import requests

url = "https://gvwilson.github.io/web-tutorial/site/nonexistent.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body length: {len(response.text)}")
status code: 404
body length: 9115
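
As the output shows, requests.get returns normally for a 404: the status code is data to inspect, and the body is the server's error page. The first digit of the code says which family it belongs to. A small sketch using the standard library's HTTPStatus enumeration follows; the describe helper and its family labels are illustrative, not part of any library.

```python
from http import HTTPStatus


def describe(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return f"{status.value} {status.phrase} ({families[code // 100]})"


for code in (200, 404, 500):
    print(describe(code))
```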

Exercise

Look at a list of HTTP status codes, then answer the following.

  1. What is the difference between status code 403 and status code 404?

  2. What is status code 418 used for?

  3. Under what circumstances would you expect to get a response whose status code is 505?

Getting JSON

import requests

url = "https://gvwilson.github.io/web-tutorial/site/motto.json"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body as text: {len(response.text)} bytes")
as_json = response.json()
print(f"body as JSON:\n{as_json}")
status code: 200
body as text: 107 bytes
body as JSON:
{'first': 'Start where you are', 'second': 'Use what you have', 'third': 'Help who you can'}

Figure 5.2: Representing tables as JSON (three ways)
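
The figure is not reproduced here, so the three layouts below are an assumption about what it shows: they are three common ways to write the same table as JSON, using a small two-row table made up for illustration.

```python
import json

columns = ["species_id", "num"]
rows = [["redcro", 3], ["rebnut", 1]]

# 1. Row-oriented: a list of objects, one per row.
as_records = [dict(zip(columns, row)) for row in rows]

# 2. Column-oriented: one object mapping each column name to its values.
as_columns = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

# 3. Header plus rows: column names stated once, then lists of values.
as_header_rows = {"columns": columns, "rows": rows}

for layout in (as_records, as_columns, as_header_rows):
    print(json.dumps(layout))
```

The row-oriented form repeats every column name in every row, which makes it larger but easy to read one record at a time.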

Exercise

Write a requests script that gets the current location and crew roster of the International Space Station.

Local Web Server

python -m http.server -d site
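
The command above can be approximated from Python. This is a minimal sketch, not a description of what the command does internally: it serves a directory with the standard library's SimpleHTTPRequestHandler on a port chosen by the operating system, fetches one file, and shuts down. The helper name serve_once, the temporary directory, and the file hello.txt are all made up for the example.

```python
import functools
import tempfile
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path


def serve_once(directory, filename):
    """Serve directory on localhost, fetch filename once, return (status, body)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        url = f"http://127.0.0.1:{port}/{filename}"
        # Ignore any proxy settings so the request goes straight to the server.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


with tempfile.TemporaryDirectory() as site:
    Path(site, "hello.txt").write_bytes(b"hello\n")
    status, body = serve_once(site, "hello.txt")
    print(status, repr(body))
```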

Talk to Local Server

import requests

URL = "http://localhost:8000/motto.txt"

response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
::ffff:127.0.0.1 - - [18/Feb/2024 09:12:24] "GET /motto.txt HTTP/1.1" 200 -
status code: 200
body:
Start where you are, use what you have, help who you can.

Our Own File Server

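from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
import os
import sys


class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Drop the leading "/" so the path is joined relative to the
            # current directory instead of replacing it.
            url_path = self.path.lstrip("/")
            full_path = Path.cwd().joinpath(url_path)
            print(f"'{self.path}' => '{full_path}'")
            if not full_path.exists():
                raise ServerException(f"{self.path} not found")
            elif not full_path.is_file():
                raise ServerException(f"{self.path} not file")
            else:
                self.handle_file(self.path, full_path)
        except Exception as msg:
            # Every failure, expected or not, becomes an error page.
            self.handle_error(msg)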

Support Code

    def send_content(self, content, status):
        self.send_response(int(status))
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

ERROR_PAGE = """\
<html>
  <head><title>Error accessing {path}</title></head>
  <body>
    <h1>Error accessing {path}: {msg}</h1>
  </body>
</html>
"""

    def handle_error(self, msg):
        content = ERROR_PAGE.format(path=self.path, msg=msg)
        content = bytes(content, "utf-8")
        self.send_content(content, HTTPStatus.NOT_FOUND)

class ServerException(Exception):
    pass
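
send_content is handed bytes, so len(content) counts bytes, which is what Content-Length must report. A small sketch of why the text has to be encoded before it is measured; the sample string is illustrative.

```python
text = "caf\u00e9"  # "café": the last character is not ASCII
encoded = text.encode("utf-8")

print(len(text))     # number of characters
print(len(encoded))  # number of bytes sent over the wire
```

If the header reported the character count instead, a client would stop reading one byte early for this string.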

Running Our File Server

if __name__ == "__main__":
    os.chdir(sys.argv[1])
    server_address = ("", 8000)
    server = HTTPServer(server_address, RequestHandler)
    print(f"serving in {os.getcwd()}...")
    server.serve_forever()

Built-in Safety

import requests
import sys

URL = sys.argv[1]

response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
python src/file_server_unsafe.py site/sandbox
python src/get_url.py http://localhost:8000/example.txt
'/example.txt' => '/tut/safety/site/sandbox/example.txt'
127.0.0.1 - - [21/Feb/2024 06:04:32] "GET /example.txt HTTP/1.1" 200 -
status code: 200
body:
example file
python src/get_url.py http://localhost:8000/motto.txt
'/motto.txt' => '/tut/safety/site/sandbox/motto.txt'
127.0.0.1 - - [21/Feb/2024 06:04:38] "GET /motto.txt HTTP/1.1" 404 -
status code: 404
body:
<html>
  <head><title>Error accessing /motto.txt</title></head>
  <body>
    <h1>Error accessing /motto.txt: /motto.txt not found</h1>
  </body>
</html>

Introducing netcat

nc localhost 8000
GET /example.txt HTTP/1.1
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:37:37 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 13

example file
GET ../motto.txt HTTP/1.1
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:38:50 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 58

Start where you are, use what you have, help who you can.
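
The second request succeeds because of how the server builds its path: lstrip only removes leading slashes, so ../motto.txt is joined to the sandbox unchanged and then points at the parent directory. A sketch of that arithmetic follows, using the sandbox directory from the earlier log output and PurePosixPath so that no files are touched. The helper name joined is made up for the example.

```python
import posixpath
from pathlib import PurePosixPath

sandbox = PurePosixPath("/tut/safety/site/sandbox")


def joined(request_path):
    """Mimic the server: strip leading slashes, join to the sandbox, normalize."""
    url_path = request_path.lstrip("/")
    full_path = sandbox.joinpath(url_path)
    # normpath collapses ".." the way the file system would (ignoring symlinks).
    return posixpath.normpath(str(full_path))


print(joined("/example.txt"))
print(joined("../motto.txt"))
```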

Exercise

The shortcut ~username means “the specified user’s home directory” in the shell, while ~ on its own means “the current user’s home directory”. Create a file called test.txt in your home directory and then try to get ~/test.txt using your browser, requests, and netcat. What happens with each and why?

A Safer File Server

    def handle_file(self, given_path, full_path):
        try:
            resolved_path = full_path.resolve()
            sandbox = Path.cwd().resolve()
            # Compare whole path components: a plain string prefix test
            # would also accept a sibling such as "<sandbox>-other".
            if not resolved_path.is_relative_to(sandbox):
                raise ServerException(f"Cannot access {given_path}")
            with open(resolved_path, "rb") as reader:
                content = reader.read()
            self.send_content(content, HTTPStatus.OK)
        except FileNotFoundError:
            raise ServerException(f"Cannot find {given_path}")
        except IOError:
            raise ServerException(f"Cannot read {given_path}")
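
A containment check is easiest to get wrong at directory boundaries. The sketch below compares two ways of asking "is this path inside the sandbox?" on hypothetical paths: a string prefix test and a component-wise test with is_relative_to (available from Python 3.9). It uses PurePosixPath, which does not resolve ".." or symlinks, so real code still needs to call resolve() first.

```python
from pathlib import PurePosixPath


def inside_by_prefix(path, sandbox):
    """String comparison: true whenever the text starts the same way."""
    return str(path).startswith(str(sandbox))


def inside_by_parts(path, sandbox):
    """Component comparison: true only if sandbox is an ancestor of path."""
    return PurePosixPath(path).is_relative_to(PurePosixPath(sandbox))


sandbox = "/tut/sandbox"
sibling = "/tut/sandbox-other/secret.txt"  # outside the sandbox

print(inside_by_prefix(sibling, sandbox))
print(inside_by_parts(sibling, sandbox))
print(inside_by_parts("/tut/sandbox/example.txt", sandbox))
```

The prefix test accepts the sibling directory because its name merely begins with the sandbox's name, while the component-wise test rejects it.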

Exercise

Refactor the do_GET and handle_file methods in RequestHandler so that all checks are in one place. Does this make the code easier to understand overall? Do you think making code easier to understand also makes it safer?

Serving Data

head -n 10 site/birds.csv
loc_id,latitude,longitude,region,year,month,day,species_id,num
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,redcro,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,rebnut,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,comred,13.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,dowwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,bkcchi,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,haiwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,8,nobird,
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,rebnut,2.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,bkcchi,3.0
import os
import sys
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

import polars as pl


def main():
    sandbox, filename = sys.argv[1], sys.argv[2]
    os.chdir(sandbox)
    # Load the data once at startup so every request reuses it.
    df = pl.read_csv(filename)
    server_address = ("", 8000)
    server = BirdServer(df, server_address, RequestHandler)
    server.serve_forever()

class BirdServer(HTTPServer):
    def __init__(self, data, server_address, request_handler):
        super().__init__(server_address, request_handler)
        self._data = data

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Handlers reach the shared data through self.server.
        result = self.server._data.write_json(row_oriented=True)
        self.send_content(result, HTTPStatus.OK)

    def send_content(self, content, status):
        # Encode first so that Content-Length counts bytes.
        content = bytes(content, "utf-8")
        self.send_response(int(status))
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

Slicing Data

    # Requires: from urllib.parse import parse_qs, urlparse
    def do_GET(self):
        result = self.filter_data()
        # Serialize with Polars, as in the previous version of the server.
        as_json = result.write_json(row_oriented=True)
        self.send_content(as_json, HTTPStatus.OK)

    def filter_data(self):
        # parse_qs maps each parameter name to a list of string values.
        params = parse_qs(urlparse(self.path).query)
        result = self.server._data
        if "species" in params:
            species = params["species"][0]
            result = result.filter(pl.col("species_id") == species)
        if "year" in params:
            year = int(params["year"][0])
            result = result.filter(pl.col("year") == year)
        return result
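
filter_data depends on the shape of what parse_qs returns: a dictionary from parameter names to lists of strings, with one list entry per occurrence of the name. A short sketch with a made-up request path:

```python
from urllib.parse import parse_qs, urlparse

path = "/?species=rebnut&year=2021&species=bkcchi"
params = parse_qs(urlparse(path).query)
print(params)

# Values are always strings, so numbers need an explicit conversion.
year = int(params["year"][0])
print(year)
```

Because a name can appear more than once, taking element [0] as the server does silently ignores any later values.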

Exercise

  1. Write a function that takes a URL as input and returns a dictionary whose keys are the query parameters’ names and whose values are lists of their values. Do you now see why you should use the library function to do this?

  2. Modify the server so that clients can specify which columns they want returned as a comma-separated list of names. If the client asks for a column that doesn’t exist, ignore it.

  3. Modify your solution to the previous exercise so that if the client asks for a column that doesn’t exist, the server returns status code 400 (Bad Request) and a JSON blob with two keys: status_code (set to 400) and error_message (set to something informative). Explain why the server should return JSON rather than HTML in the case of an error.