HTTP
Terms defined: character encoding, concurrency, HTTP, header (of HTTP request or response), HTTP method, HTTP request, HTTP response, HTTP status code, JavaScript Object Notation, local server, localhost, MIME type, port, query parameter, refactor, resolve (a path), sandbox, static file, UTF-8, web scraping
Start with Something Simple
import requests
url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
status code: 200
body:
Start where you are, use what you have, help who you can.
- Use the requests module to send an HTTP request
- The URL identifies the file we want
- Though as we’ll see, the server can interpret it differently
- Response includes:
- HTTP status code such as 200 (OK) or 404 (Not Found)
- The text of the response
What Just Happened
- Figure 5.1 shows what happened
- Open a connection to the server
- Send an HTTP request for the file we want
- Server creates a response that includes the contents of the file
- Sends it back
- requests parses the response and creates a Python object for us
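The steps above can be sketched with nothing but a socket. This is a minimal illustration rather than what requests does internally: MottoHandler, fetch_raw, and demo are hypothetical names, and the throwaway server on 127.0.0.1 stands in for the real one so the example needs no outside network.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MOTTO = b"Start where you are, use what you have, help who you can.\n"


class MottoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that returns the same text for every GET."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(MOTTO)))
        self.end_headers()
        self.wfile.write(MOTTO)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


def fetch_raw(host, port, path):
    """Open a connection, send a request, and return the raw response bytes."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request.encode("utf-8"))
        chunks = []
        # The server closes the connection when it is done, which ends the loop.
        while chunk := conn.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)


def demo():
    """Run a throwaway server, fetch one file, and split the response."""
    server = HTTPServer(("127.0.0.1", 0), MottoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        raw = fetch_raw("127.0.0.1", server.server_port, "/motto.txt")
    finally:
        server.shutdown()
        server.server_close()
    # A blank line separates the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("utf-8").splitlines(), body


if __name__ == "__main__":
    head_lines, body = demo()
    print(head_lines[0])
    print(body.decode("utf-8"))
```

The first line of the response carries the status code; everything after the blank line is the file's contents.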
Request Structure
import requests
from requests_toolbelt.utils import dump
url = "https://gvwilson.github.io/safety-tutorial/site/motto.txt"
response = requests.get(url)
data = dump.dump_all(response)
print(str(data, "utf-8"))
GET /safety-tutorial/site/motto.txt HTTP/1.1
Host: gvwilson.github.io
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
- First line is method, URL, and protocol version
- Every HTTP request can have headers with extra information
- And optionally data being uploaded
- Yes, it’s all just text
- Except for uploaded data, which is just bytes
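Because the request is just text, it can be taken apart with ordinary string operations. The sketch below is illustrative only: parse_request is a hypothetical helper, it ignores uploaded data, and a real server handles many cases it does not.

```python
def parse_request(raw):
    """Split the text of an HTTP request into its parts.

    Returns (method, target, version, headers). Header names are lower-cased
    because HTTP treats them as case-insensitive.
    """
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


RAW = (
    "GET /safety-tutorial/site/motto.txt HTTP/1.1\r\n"
    "Host: gvwilson.github.io\r\n"
    "Accept: */*\r\n"
    "\r\n"
)

method, target, version, headers = parse_request(RAW)
print(method, target, version)
print(headers["host"])
```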
Response Structure
import requests
url = "https://gvwilson.github.io/web-tutorial/site/motto.txt"
response = requests.get(url)
for key, value in response.headers.items():
    print(f"{key}: {value}")
Connection: keep-alive
Content-Length: 5142
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
ETag: W/"65c56fc7-239b"
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
Content-Encoding: gzip
X-GitHub-Request-Id: A08C:5A357:7CF794:9CEA06:65D13923
Accept-Ranges: bytes
Date: Sat, 17 Feb 2024 22:54:27 GMT
Via: 1.1 varnish
Age: 0
X-Served-By: cache-yyz4563-YYZ
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1708210467.301651,VS0,VE20
Vary: Accept-Encoding
X-Fastly-Request-ID: cb16df2dfa73aaf6de87924c743dd1e50a0ce570
- Every HTTP response also has headers with extra information
- Does not include status code: that appears in the first line
- Most important for now are:
- Content-Length: number of bytes in response data (i.e., how much to read)
- Content-Type: MIME type of data (e.g., text/plain)
- From now on we will only show interesting headers
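A small sketch of how these two headers might be interpreted, using only the standard library. describe_content_type and content_length are hypothetical helper names; the point is that Content-Type can carry a character encoding as well as a MIME type, and that Content-Length counts bytes rather than characters.

```python
from email.message import Message


def describe_content_type(header_value):
    """Return (MIME type, charset) for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset is given.
    return msg.get_content_type(), msg.get_content_charset()


def content_length(text, charset="utf-8"):
    """Return the number of bytes `text` occupies once it is encoded."""
    return len(text.encode(charset))


print(describe_content_type("text/html; charset=utf-8"))
print(describe_content_type("application/json"))

word = "caf\u00e9"  # four characters, but the last needs two bytes in UTF-8
print(len(word), content_length(word))
```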
Exercise
- Add a header called Studying with the value safety to the requests script shown above. Does it make a difference to the response? Should it?
- What is the difference between the Content-Type and the Content-Encoding headers?
When Things Go Wrong
import requests
url = "https://gvwilson.github.io/web-tutorial/site/nonexistent.txt"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body length: {len(response.text)}")
status code: 404
body length: 9115
- The 404 status code tells us something went wrong
- The 9 kilobyte response is an HTML page with an embedded image (the GitHub logo)
- The page contains human-readable error messages
- But we have to know the page format to pull them out
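Since the status code is more reliable than the page contents, a program can decide what to do from the code alone. The sketch below uses the standard library's HTTPStatus; describe_status is a hypothetical helper, and the grouping by hundreds follows the usual convention for status codes.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes Python does not know
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirection"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"


for code in (200, 404, 500):
    print(describe_status(code))
```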
Exercise
Look at this list of HTTP status codes.
- What is the difference between status code 403 and status code 404?
- What is status code 418 used for?
- Under what circumstances would you expect to get a response whose status code is 505?
Getting JSON
import requests
url = "https://gvwilson.github.io/web-tutorial/site/motto.json"
response = requests.get(url)
print(f"status code: {response.status_code}")
print(f"body as text: {len(response.text)} bytes")
as_json = response.json()
print(f"body as JSON:\n{as_json}")
status code: 200
body as text: 107 bytes
body as JSON:
{'first': 'Start where you are', 'second': 'Use what you have', 'third': 'Help who you can'}
- Parsing data out of HTML is called web scraping
- Painful and error prone
- Better to have the server return data as data
- Preferred format these days is JSON
- So common that requests has built-in support
- Unfortunately, there is no standard for representing tabular data as JSON (Figure 5.2)
- A list with one list with N column names + N lists of values?
- A list with N dictionaries, all with the same keys?
- A dictionary with column names and lists of values, all the same length?
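The three layouts listed above can be compared on a tiny made-up table. The column names and values here are illustrative only; the variable names are hypothetical.

```python
import json

columns = ["species_id", "num"]
rows = [["redcro", 3], ["rebnut", 1]]

# 1. One list of column names followed by one list per row.
header_plus_rows = [columns] + rows

# 2. One dictionary per row, all with the same keys.
records = [dict(zip(columns, row)) for row in rows]

# 3. One dictionary mapping each column name to a list of values.
by_column = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

for layout in (header_plus_rows, records, by_column):
    print(json.dumps(layout))
```

All three carry the same information, so a client has to be told which one a server uses.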
Exercise
Write a requests script that gets the current location and crew roster of the International Space Station.
Local Web Server
- Pushing files to GitHub so that we can use them is annoying
- And we want to show how to make things wrong so that we can then make them right
- Use Python’s http.server module to run a local server
python -m http.server -d site
- Host name is localhost
- Uses port 8000 by default
- So URLs look like http://localhost:8000/path/to/file
- -d site tells the server to use site as its root directory
- Use this local server for the next few examples
- Build our own server later on to show how it works
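The same kind of server can also be started from inside a Python program, which is handy for experiments and tests. This is a sketch under a few assumptions: QuietHandler, serve_directory, and demo are hypothetical names, the server binds to 127.0.0.1 on whatever port is free instead of port 8000, and a temporary directory stands in for site.

```python
import http.client
import tempfile
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log each request."""

    def log_message(self, format, *args):
        pass


def serve_directory(root):
    """Start a background server whose root directory is `root`."""
    handler = partial(QuietHandler, directory=str(root))
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Serve a temporary directory and fetch one file from it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "motto.txt").write_bytes(b"Start where you are.\n")
        server = serve_directory(tmp)
        try:
            conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
            conn.request("GET", "/motto.txt")
            response = conn.getresponse()
            result = (response.status, response.read().decode("utf-8"))
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(demo())
```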
Talk to Local Server
import requests
URL = "http://localhost:8000/motto.txt"
response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
::ffff:127.0.0.1 - - [18/Feb/2024 09:12:24] "GET /motto.txt HTTP/1.1" 200 -
status code: 200
body:
Start where you are, use what you have, help who you can.
- Concurrent systems are hard to debug
- Multiple streams of activity
- Order may change from run to run
- Usually easiest to run each process in its own terminal window
Our Own File Server
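# Imports used by the server code in this section
import os
import sys
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Turn the URL path into a path below the working directory
            url_path = self.path.lstrip("/")
            full_path = Path.cwd().joinpath(url_path)
            print(f"'{self.path}' => '{full_path}'")
            if not full_path.exists():
                raise ServerException(f"{self.path} not found")
            elif not full_path.is_file():
                raise ServerException(f"{self.path} not file")
            else:
                self.handle_file(self.path, full_path)
        except Exception as msg:
            self.handle_error(msg)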
- Our RequestHandler handles a single GET request
- Combine working directory with requested file path to get local path to file
- Return that if it exists and is a file or raise an error
Support Code
- Serve files
    def send_content(self, content, status):
        self.send_response(int(status))
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)
- Handle errors
ERROR_PAGE = """\
<html>
<head><title>Error accessing {path}</title></head>
<body>
<h1>Error accessing {path}: {msg}</h1>
</body>
</html>
"""
    def handle_error(self, msg):
        content = ERROR_PAGE.format(path=self.path, msg=msg)
        content = bytes(content, "utf-8")
        self.send_content(content, HTTPStatus.NOT_FOUND)
- Define our own exceptions so we’re sure we’re only catching what we expect
class ServerException(Exception):
    pass
Running Our File Server
if __name__ == "__main__":
    os.chdir(sys.argv[1])
    serverAddress = ("", 8000)
    server = HTTPServer(serverAddress, RequestHandler)
    print(f"serving in {os.getcwd()}...")
    server.serve_forever()
- And then get motto.txt as before
Built-in Safety
- Modify requests script to take URL as command-line parameter
import requests
import sys
URL = sys.argv[1]
response = requests.get(URL)
print(f"status code: {response.status_code}")
print(f"body:\n{response.text}")
- Add a sub-directory to site called sandbox with a file example.txt
- Called a sandbox because it’s a safe place to play
- Serve that sub-directory
python src/file_server_unsafe.py site/sandbox
- Can get files from that directory
python src/get_url.py http://localhost:8000/example.txt
'/example.txt' => '/tut/safety/site/sandbox/example.txt'
127.0.0.1 - - [21/Feb/2024 06:04:32] "GET /example.txt HTTP/1.1" 200 -
status code: 200
body:
example file
- But not from parent directory (which isn’t part of sandbox)
python src/requests_local_url.py http://localhost:8000/../motto.txt
'/motto.txt' => '/tut/safety/site/sandbox/motto.txt'
127.0.0.1 - - [21/Feb/2024 06:04:38] "GET /motto.txt HTTP/1.1" 404 -
status code: 404
body:
<html>
<head><title>Error accessing /motto.txt</title></head>
<body>
<h1>Error accessing /motto.txt: /motto.txt not found</h1>
</body>
</html>
- requests strips the leading .. off the path before sending it
- And if we try that URL in the browser, same thing happens
- So we’re safe, right?
Introducing netcat
- netcat (often just nc) is a computer networking tool
- Open a connection, send exactly what the user types, and show exactly what is sent in response
nc localhost 8000
GET /example.txt HTTP/1.1
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:37:37 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 13
example file
- Let’s see what happens if we do send a URL with .. in it
GET ../motto.txt HTTP/1.1
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.1
Date: Thu, 22 Feb 2024 18:38:50 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 58
Start where you are, use what you have, help who you can.
- We shouldn’t be able to see files outside the sandbox
- But if someone doesn’t strip out the .. characters, users can escape
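The escape can be reproduced without running a server at all. The sketch below mimics the path handling shown earlier (strip leading slashes, join to the sandbox) and then resolves the result; local_path and demo are hypothetical names, and a temporary directory stands in for site.

```python
import tempfile
from pathlib import Path


def local_path(sandbox, url_path):
    """Mimic the unsafe server: strip leading slashes, then join to the sandbox."""
    return Path(sandbox).joinpath(url_path.lstrip("/"))


def demo():
    """Build a throwaway sandbox and show where each request path ends up."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        sandbox = root / "sandbox"
        sandbox.mkdir()
        for url_path in ["/example.txt", "../motto.txt"]:
            resolved = local_path(sandbox, url_path).resolve()
            # Report each result relative to the directory above the sandbox.
            results[url_path] = resolved.relative_to(root).as_posix()
    return results


if __name__ == "__main__":
    for url_path, where in demo().items():
        print(f"{url_path!r} => {where}")
```

The second path lands next to the sandbox directory rather than inside it, which is exactly what the netcat session showed.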
Exercise
The shortcut ~username means “the specified user’s home directory” in the shell, while ~ on its own means “the current user’s home directory”. Create a file called test.txt in your home directory and then try to get ~/test.txt using your browser, requests, and netcat. What happens with each and why?
A Safer File Server
    def handle_file(self, given_path, full_path):
        try:
            resolved_path = str(full_path.resolve())
            sandbox = str(Path.cwd().resolve())
            if not resolved_path.startswith(sandbox):
                raise ServerException(f"Cannot access {given_path}")
            with open(full_path, "rb") as reader:
                content = reader.read()
            self.send_content(content, HTTPStatus.OK)
        except FileNotFoundError:
            raise ServerException(f"Cannot find {given_path}")
        except IOError:
            raise ServerException(f"Cannot read {given_path}")
- Resolve the constructed path
- Then check that it’s below the current working directory (i.e., the sandbox)
- And fail if it isn’t
- Using our own ServerException guarantees that all errors are handled the same way
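One detail worth checking in any "is it below the sandbox?" test: comparing paths as strings treats a sibling directory whose name merely starts with the sandbox's name as if it were inside. The sketch below is an alternative, not the code used in this lesson; it compares path components with Path.is_relative_to, which needs Python 3.9 or later. is_inside and demo are hypothetical names.

```python
import tempfile
from pathlib import Path


def is_inside(sandbox, candidate):
    """Return True if `candidate` resolves to a location inside `sandbox`."""
    sandbox = Path(sandbox).resolve()
    candidate = Path(candidate).resolve()
    return candidate.is_relative_to(sandbox)  # compares whole path components


def demo():
    """Compare a string prefix test with a component-wise test."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        sandbox = root / "sandbox"
        inside = sandbox / "example.txt"
        sibling = root / "sandbox-private" / "secret.txt"
        prefix_says_inside = str(sibling.resolve()).startswith(str(sandbox))
        return prefix_says_inside, is_inside(sandbox, sibling), is_inside(sandbox, inside)


if __name__ == "__main__":
    print(demo())
```

The string test accepts the sibling directory's file while the component-wise test rejects it; both accept the file that really is in the sandbox.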
Exercise
Refactor the do_GET and handle_file methods in RequestHandler so that all checks are in one place. Does this make the code easier to understand overall? Do you think making code easier to understand also makes it safer?
Serving Data
- Rarely have JSON lying around as static files
- More common to have either CSV or a database
head -n 10 site/birds.csv
loc_id,latitude,longitude,region,year,month,day,species_id,num
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,redcro,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,rebnut,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,comred,13.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,dowwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,bkcchi,3.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,1,haiwoo,1.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,8,nobird,
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,rebnut,2.0
L13476859,60.8606726,-135.2015181,CA-YT,2021,2,15,bkcchi,3.0
- Modify server to generate it dynamically
- Main program
import polars as pl  # the pl alias is assumed to refer to polars

def main():
    sandbox, filename = sys.argv[1], sys.argv[2]
    os.chdir(sandbox)
    df = pl.read_csv(filename)
    serverAddress = ("", 8000)
    server = BirdServer(df, serverAddress, RequestHandler)
    server.serve_forever()
- Create our own server class because we want to pass the dataframe in the constructor
class BirdServer(HTTPServer):
    def __init__(self, data, server_address, request_handler):
        super().__init__(server_address, request_handler)
        self._data = data
- do_GET converts the dataframe to JSON (will modify later to do more than this)
class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        result = self.server._data.write_json(row_oriented=True)
        self.send_content(result, HTTPStatus.OK)

    def send_content(self, content, status):
        content = bytes(content, "utf-8")
        self.send_response(int(status))
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)
- Can view in browser at http://localhost:8000 or use requests to fetch as before
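The pattern of handing data to the server and reaching it from the handler through self.server can be tried without a dataframe library. In this sketch a list of dictionaries stands in for the dataframe; DataServer, JsonHandler, and demo are hypothetical names, and the server uses 127.0.0.1 with a free port instead of port 8000.

```python
import http.client
import json
import threading
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataServer(HTTPServer):
    """Server that carries some data for its handlers to use."""

    def __init__(self, data, server_address, request_handler):
        super().__init__(server_address, request_handler)
        self._data = data


class JsonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each handler can reach the server that created it via self.server.
        content = json.dumps(self.server._data).encode("utf-8")
        self.send_response(int(HTTPStatus.OK))
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


def demo(rows):
    """Serve `rows` as JSON and fetch them back."""
    server = DataServer(rows, ("127.0.0.1", 0), JsonHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", "/")
        response = conn.getresponse()
        content_type = response.getheader("Content-Type")
        body = json.loads(response.read())
        conn.close()
        return content_type, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo([{"species_id": "rebnut", "num": 1.0}]))
```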
Slicing Data
- URL can contain query parameters
- Want http://localhost:8000/?year=2021&species=rebnut to select red-breasted nuthatches in 2021
- Put slicing in a method of its own
    def do_GET(self):
        result = self.filter_data()
        as_json = result.write_json(row_oriented=True)
        self.send_content(as_json, HTTPStatus.OK)
- Use urlparse and parse_qs from urllib.parse to get query parameters
- (Key, list) dictionary
- Then filter data as requested
    def filter_data(self):
        # Assumes the data is a polars dataframe, as loaded with pl.read_csv
        params = parse_qs(urlparse(self.path).query)
        result = self.server._data
        if "species" in params:
            species = params["species"][0]
            result = result.filter(pl.col("species_id") == species)
        if "year" in params:
            year = int(params["year"][0])
            result = result.filter(pl.col("year") == year)
        return result
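To see why each parameter's value has to be pulled out with [0], it helps to look at what the two library functions return. The path below is made up for illustration and repeats species on purpose.

```python
from urllib.parse import parse_qs, urlparse

path = "/?year=2021&species=rebnut&species=bkcchi"

parts = urlparse(path)
params = parse_qs(parts.query)

print(parts.path)   # the part before the question mark
print(parts.query)  # the part after it, still a single string
print(params)       # every value is a list, even when a key appears once
```

Values arrive as strings, which is why the year is converted with int before it is compared.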
Exercise
- Write a function that takes a URL as input and returns a dictionary whose keys are the query parameters’ names and whose values are lists of their values. Do you now see why you should use the library function to do this?
- Modify the server so that clients can specify which columns they want returned as a comma-separated list of names. If the client asks for a column that doesn’t exist, ignore it.
- Modify your solution to the previous exercise so that if the client asks for a column that doesn’t exist the server returns a status code 400 (Bad Request) and a JSON blob with two keys: status_code (set to 400) and error_message (set to something informative). Explain why the server should return JSON rather than HTML in the case of an error.