Serving Web Pages
- The HyperText Transfer Protocol (HTTP) specifies one way to interact via messages over sockets.
- A minimal HTTP request has a method, a URL, and a protocol version.
- A complete HTTP request may also have headers and a body.
- An HTTP response has a status code, a status phrase, and optionally some headers and a body.
- HTTP is a stateless protocol: the application is responsible for remembering things between requests.
Terms defined: HTTP, body (of HTTP request or response), header (of HTTP request or response), HTTP method, HTTP protocol version, HTTP request, HTTP response, HTTP status code, path resolution, query parameter, throw low, catch high, Uniform Resource Locator
Copying files from one machine to another is useful (Chapter 21), but we want to do more. What we don’t want to do is create a new protocol for every application, any more than we create new file formats (Chapter 5).
The HyperText Transfer Protocol (HTTP) defines a way for programs to exchange data over the web. It is deliberately simple: the client sends a request specifying what it wants over a socket, and the server sends a response containing some data. Servers can construct responses however they want: they can copy files from disk, generate HTML dynamically, or do anything else a programmer can think of.
This chapter shows how to build a simple web server that understands the basics of HTTP and how to test programs of this kind. What we will build is much simpler than Apache, nginx, or other industrial-strength servers, but all the key ideas will be there.
Protocol
An HTTP request is just text: any program that wants to can create one or parse one. An absolutely minimal HTTP request has just the name of a method, a URL, and a protocol version on a single line separated by spaces:
GET /index.html HTTP/1.1
The HTTP method is almost always either GET (to fetch information) or POST (to submit form data or upload files). The URL specifies what the client wants: it is often a path to a file on disk, such as /index.html, but (and this is the crucial part) it’s completely up to the server to decide what to do with it. The HTTP version is usually “HTTP/1.0” or “HTTP/1.1”; the differences between the two don’t matter to us.
Most real requests have a few extra lines called headers, which are key-value pairs like the ones shown below:
GET /index.html HTTP/1.1
Accept: text/html
Accept-Language: en, fr
If-Modified-Since: 16-May-2023
Unlike the keys in hash tables, keys may appear any number of times in HTTP headers, so that (for example) a request can specify that it’s willing to accept several types of content.
Finally, the body of the request is any extra data associated with it, such as form data or uploaded files. There must be a blank line between the last header and the start of the body to signal the end of the headers, and if there is a body, the request must have a header called Content-Length that tells the server how many bytes are in the body.
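The layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete client: build_request is a hypothetical helper, the /submit URL and header values are made up, and the sketch assumes the carriage-return/line-feed line endings that HTTP uses on the wire.

```python
def build_request(method, url, headers=None, body=b""):
    """Format an HTTP/1.1 request as bytes (hypothetical helper for illustration)."""
    lines = [f"{method} {url} HTTP/1.1"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}: {value}")
    if body:
        # Content-Length counts bytes in the body, not characters.
        lines.append(f"Content-Length: {len(body)}")
    # Lines end with CRLF, and an empty line marks the end of the headers.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


body = "h\u00e9llo".encode("utf-8")  # 5 characters, 6 bytes
request = build_request(
    "POST",
    "/submit",
    {"Host": "localhost", "Content-Type": "text/plain; charset=utf-8"},
    body,
)
print(request.decode("utf-8"))
```

Because the body is handled as bytes from the start, the length placed in the header matches what is actually sent.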
An HTTP response is formatted like an HTTP request. Its first line has the protocol followed by a status code and a status phrase, such as “200 OK” or “404 Not Found”. There are then some headers (including Content-Length if the reply has a body), a blank line, and the body:
HTTP/1.1 200 OK
Date: Fri, 16 Jun 2023 12:28:53 GMT
Content-Type: text/html
Content-Length: 53

<html>
<body>
<h1>Hello, World!</h1>
</body>
</html>
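To make the structure of a response concrete, here is a rough sketch of how a client might take one apart. The function name parse_response and the sample 404 reply are invented for this illustration, and the code assumes a well-formed response with CRLF line endings. Header values are stored in lists because a key may appear more than once.

```python
def parse_response(raw):
    """Split raw response bytes into status code, phrase, headers, and body."""
    # The first blank line separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    _version, code, phrase = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers.setdefault(key.strip(), []).append(value.strip())
    return int(code), phrase, headers, body


raw = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"not here\n"
)
status, phrase, headers, body = parse_response(raw)
print(status, phrase)
print(headers)
print(body)
```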
Constructing HTTP requests is tedious, so most people use a library to do the repetitive work. The most popular one in Python is the requests module, which works like this:
import requests
response = requests.get("http://third-bit.com/test.html")
print("status code:", response.status_code)
print("content length:", response.headers["content-length"])
print(response.text)
status code: 200
content length: 103
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p>test page</p>
</body>
</html>
requests.get sends an HTTP GET request to a server and returns an object containing the response (Figure 22.1). That object’s status_code member is the response’s status code; the content-length entry in its headers member is the number of bytes in the response data, and text is the actual data (in this case, an HTML page that we can analyze or render).
Keep in mind that requests isn’t doing anything magical: it is just formatting some text, opening a socket connection (Chapter 21), sending that text through the connection, and then reading a response. We will implement some of this ourselves in the exercises.
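The four steps just listed can be sketched with nothing but the socket module. To keep the example self-contained it talks to a throwaway local server built with the standard library’s http.server module (described in the next section) instead of a real website. EchoPathHandler and fetch are hypothetical names used only here, and a real client would need timeouts and error handling that this sketch omits.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPathHandler(BaseHTTPRequestHandler):
    """Hypothetical handler so the client below has something to talk to."""

    def do_GET(self):
        content = f"you asked for {self.path}".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

    def log_message(self, format, *args):
        pass  # suppress the usual log line to keep the output short


def fetch(host, port, path):
    """Format a GET request, send it over a socket, and read the raw reply."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoPathHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
raw = fetch("127.0.0.1", server.server_port, "/test.html")
server.shutdown()
server.server_close()
print(raw.decode("utf-8"))
```

The request uses HTTP/1.0 so that the server closes the connection after replying, which lets the client simply read until there is nothing left.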
Hello, Web
We’re now ready to write a simple HTTP server that will:
1. wait for someone to connect and send an HTTP request;
2. parse that request;
3. figure out what to send back; and
4. reply with an HTML page.
Steps 1, 2, and 4 are the same from one application to another, so the Python standard library has a module called http.server to do most of the work. Here’s the entire server:
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<html><body><p>test page</p></body></html>"""

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        content = bytes(PAGE, "utf-8")
        self.send_response(200)
        self.send_header(
            "Content-Type", "text/html; charset=utf-8"
        )
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

if __name__ == "__main__":
    server_address = ("localhost", 8080)
    server = HTTPServer(server_address, RequestHandler)
    server.serve_forever()
Let’s start at the bottom and work our way up.
- server_address specifies the hostname and port of the server.
- The HTTPServer class takes care of parsing requests and sending back responses. When we construct it, we give it the server address and the name of the class we’ve written that handles requests the way we want (in this case, RequestHandler).
- Finally, we call the server’s serve_forever method, which runs until it crashes or we stop it with Ctrl-C.
So what does RequestHandler do?
- When the server receives a GET request, it looks in the request handler for a method called do_GET. (If it gets a POST, it looks for do_POST, and so on.)
- do_GET converts the text of the page we want to send back from characters to bytes; we’ll talk about this below.
- It then sends a response code (200), a couple of headers to say what the content type is and how many bytes the receiver should expect, and a blank line (produced by the end_headers method).
- Finally, do_GET sends the content of the response by calling self.wfile.write. self.wfile is something that looks and acts like a write-only file, but is actually sending bytes to the socket connection.
If we run this program from the command line, it doesn’t display anything:
python basic_http_server.py
but if we then go to http://localhost:8080 with our browser we see this:
test page
and this in our shell:
127.0.0.1 - - [16/Sep/2022 06:34:59] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [16/Sep/2022 06:35:00] "GET /favicon.ico HTTP/1.1" 200 -
The first line is straightforward: since we didn’t ask for a particular file, our browser has asked for ‘/’ (the root directory of whatever the server is serving). The second line appears because our browser automatically sends a second request for an image file called /favicon.ico, which it will display as an icon in the address bar if it exists.
Serving Files
Serving the same page for every request isn’t particularly useful, so let’s rewrite our simple server to return files. The basic logic looks like this:
class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            url_path = self.path.lstrip("/")
            full_path = Path.cwd().joinpath(url_path)
            if not full_path.exists():
                raise ServerException(f"{self.path} not found")
            elif full_path.is_file():
                self.handle_file(self.path, full_path)
            else:
                raise ServerException(f"{self.path} unknown")
        except Exception as msg:
            self.handle_error(msg)
We first turn the path in the URL into a local file path by removing the leading /. Translating filenames this way is called path resolution, and in doing it, we assume that all the files we’re supposed to serve live in or below the directory in which the server is running. If the resolved path corresponds to a file, we send it back to the client; if not, we generate and send an error message.
It might seem simpler to rewrite do_GET to use if/else instead of try/except, but using try/except has an advantage: we can handle errors that occur inside methods we’re calling (like handle_file) in the same place and in the same way as we handle errors that occur here. This approach is sometimes called throw low, catch high, which means that errors should be flagged in many places but handled in a few places high up in the code.
The method that handles files is an example of this:
    def handle_file(self, given_path, full_path):
        try:
            with open(full_path, 'rb') as reader:
                content = reader.read()
            self.send_content(content, HTTPStatus.OK)
        except IOError:
            raise ServerException(f"Cannot read {given_path}")
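The throw-low, catch-high pattern is not specific to web servers. The sketch below shows it in isolation with an invented example: ConfigError, parse_port, parse_address, and main are all hypothetical names. The low-level functions only raise exceptions, and a single place near the top decides how to report them.

```python
class ConfigError(Exception):
    """Hypothetical application-specific exception."""


def parse_port(text):
    # Throw low: flag the problem where it is detected.
    if not text.isdigit():
        raise ConfigError(f"port {text!r} is not a number")
    port = int(text)
    if not 0 < port < 65536:
        raise ConfigError(f"port {port} is out of range")
    return port


def parse_address(text):
    if ":" not in text:
        raise ConfigError(f"address {text!r} has no port")
    host, port_text = text.rsplit(":", 1)
    return host, parse_port(port_text)


def main(args):
    # Catch high: one handler deals with every kind of failure the same way.
    results = []
    for arg in args:
        try:
            results.append(parse_address(arg))
        except ConfigError as error:
            results.append(f"error: {error}")
    return results


results = main(["localhost:8080", "localhost", "localhost:http"])
for result in results:
    print(result)
```

Neither parse_port nor parse_address needs to know how errors are reported, so the reporting can change in one place.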
If there’s an error at any point in the processing cycle, we send a page with an error message and an error status code. The former gives human users something to read, while the latter gives software a meaningful value in a predictable place:
    def handle_error(self, msg):
        content = ERROR_PAGE.format(path=self.path, msg=msg)
        content = bytes(content, "utf-8")
        self.send_content(content, HTTPStatus.NOT_FOUND)
The error page is just HTML with some placeholders for the path and message:
ERROR_PAGE = """\
<html>
<head><title>Error accessing {path}</title></head>
<body>
<h1>Error accessing {path}: {msg}</h1>
</body>
</html>
"""
The code that actually sends the response is similar to what we’ve seen before:
    def send_content(self, content, status):
        self.send_response(int(status))
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)
This server works, but only for a very forgiving definition of “works”. We are careful not to show clients the actual paths to files on the server in our error messages, but if someone asks for http://localhost:8080/../../passwords.txt, this server will happily look two levels up from the directory where it’s running and try to return that file. The server machine’s passwords probably aren’t stored there, but with enough ..’s and some patience, an attacker could poke around large parts of our filesystem. We will tackle this in the exercises.
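A small experiment makes the problem visible without running the server. The sketch below mimics the server’s path resolution using purely textual path operations. The directory /srv/site and the helper names naive_resolve and stays_inside are assumptions made for this illustration.

```python
import posixpath
from pathlib import PurePosixPath


def naive_resolve(root, url_path):
    """Join the URL path onto the root as the server does, then normalize it."""
    joined = posixpath.join(root, url_path.lstrip("/"))
    return PurePosixPath(posixpath.normpath(joined))


def stays_inside(root, resolved):
    """Report whether the resolved path is below the root directory."""
    return PurePosixPath(root) in resolved.parents


root = "/srv/site"  # hypothetical directory the server is running in
for url_path in ["/index.html", "/../../passwords.txt"]:
    resolved = naive_resolve(root, url_path)
    print(url_path, "->", resolved, stays_inside(root, resolved))
```

The first URL resolves to a file below the root, while the second resolves to /passwords.txt, which is outside it.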
Another problem is that send_content always tells clients that it is returning an HTML file with the Content-Type header. It should instead look at the extension on the file’s name and set the content type appropriately, e.g., return image/png for a PNG-formatted image.
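As a hint of what that might look like, the standard library’s mimetypes module can guess a content type from a filename. The wrapper content_type_for and its fallback value are assumptions for this sketch, not part of the server shown above.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Guess a Content-Type value from the file's extension (hypothetical helper)."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for the type when the extension is unknown.
    return guessed or default


for name in ["logo.png", "index.html", "data.zzzunknown"]:
    print(name, "->", content_type_for(name))
```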
One thing the server is doing right is character encoding. The send_content method expects content to be a bytes object, not a string, because HTTP requires the content length to be the number of bytes. The server reads files in binary mode by using "rb" instead of just "r" when it opens files in handle_file, converts the internally-generated error page from characters to bytes using the UTF-8 encoding, and specifies charset=utf-8 as part of the content type.
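The difference between characters and bytes only shows up when the text contains something outside ASCII. The short sketch below uses a made-up snippet of HTML to show why the length has to be measured after encoding.

```python
text = "<p>caf\u00e9</p>"  # the accented e is written as an escape
encoded = text.encode("utf-8")

print(len(text))  # number of characters
print(len(encoded))  # number of bytes, which is what Content-Length needs
print(encoded)
```

In UTF-8 the accented character takes two bytes, so a Content-Length computed from the string would be one too small.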
Testing
As with the server in Chapter 21, we can work backward from a test we want to be able to write to create a testable server. We would like to create a file, simulate an HTTP GET request, and check that the status, headers, and content are correct. Figure 22.2 shows the final inheritance hierarchy:
- BaseHTTPRequestHandler comes from the Python standard library.
- MockRequestHandler defines replacements for its methods.
- ApplicationRequestHandler contains our server’s logic.
- RequestHandler combines our application code with Python’s request handler.
- MockHandler combines it with our mock request handler.
It’s a lot of work to test a single GET request, but we can re-use MockRequestHandler to test the application-specific code for other servers. Most libraries don’t provide helper classes like this to support testing, but programmers appreciate those that do.
MockRequestHandler is just a few lines of code, though it would be longer if our application relied on more methods from the library class we’re replacing:
from io import BytesIO

class MockRequestHandler:
    def __init__(self, path):
        self.path = path
        self.status = None
        self.headers = {}
        self.wfile = BytesIO()

    def send_response(self, status):
        self.status = status

    def send_header(self, key, value):
        if key not in self.headers:
            self.headers[key] = []
        self.headers[key].append(value)

    def end_headers(self):
        pass
The application-specific class contains the code we’ve already seen:
class ApplicationRequestHandler:
    def do_GET(self):
        try:
            url_path = self.path.lstrip("/")
            full_path = Path.cwd().joinpath(url_path)
            if not full_path.exists():
                raise ServerException(f"'{self.path}' not found")
            elif full_path.is_file():
                self.handle_file(self.path, full_path)
            else:
                raise ServerException(f"Unknown object '{self.path}'")
        except Exception as msg:
            self.handle_error(msg)

    # ...etc...
MockHandler handles the simulated request and also stores the values that the client would receive:
def test_existing_path(fs):
    content_str = "actual"
    content_bytes = bytes(content_str, "utf-8")
    fs.create_file("/actual.txt", contents=content_str)
    handler = MockHandler("/actual.txt")
    handler.do_GET()
    assert handler.status == int(HTTPStatus.OK)
    assert handler.headers["Content-Type"] == \
        ["text/html; charset=utf-8"]
    assert handler.headers["Content-Length"] == \
        [str(len(content_bytes))]
    assert handler.wfile.getvalue() == content_bytes
The main body of our runnable server combines the two classes to create what it needs:
if __name__ == '__main__':
    class RequestHandler(
            BaseHTTPRequestHandler,
            ApplicationRequestHandler
    ):
        pass

    server_address = ('', 8080)
    server = HTTPServer(server_address, RequestHandler)
    server.serve_forever()
Our tests, on the other hand, create a server with mocked methods:
class MockHandler(
        MockRequestHandler,
        ApplicationRequestHandler
):
    pass
Summary
Figure 22.3 summarizes the ideas introduced in this chapter. Given the impact the World-Wide Web has had, newcomers are often surprised by how simple HTTP actually is.
Exercises
Parsing HTTP Requests
Write a function that takes a list of lines of text as input and parses them as if they were an HTTP request. The result should be a dictionary with the request’s method, URL, protocol version, and headers.
Query Parameters
A URL can contain query parameters. Read the documentation for the urllib.parse module and then modify the file server example so that a URL containing a query parameter bytes=N (for a positive integer N) returns the first N bytes of the requested file.
Better Path Resolution
Modify the file server so that:
- it must be given the absolute path to a directory as a command-line argument when started; and
- it only serves files in or below that directory (so that paths containing .. and other tricks can’t be used to retrieve arbitrary files).
Better Content Types
Read the documentation for the mimetypes module and then modify the file server to return the correct content type for files that aren’t HTML (such as images).
Uploading Files
Modify the file server to handle POST requests.
- The URL must specify the name of the file being uploaded.
- The body of the request must be the bytes of the file.
- All uploaded files are saved in a single directory, i.e., upload paths cannot contain directory components.
Checking Content Length
Modify the file server so that:
- if the client sends more content than indicated in the Content-Length header, the extra bytes are read but ignored; and
- if the client sends less content, the server doesn’t wait indefinitely for the missing bytes.
What status code should the server return to the client in each case?
Directory Listing
- Modify the file server so that if the path portion of the URL identifies a directory, the server returns a plain text list of the directory’s contents.
- Write tests for this using the pyfakefs module.
Dynamic Results
Modify the file server so that if the client requests the “file” /time, the server returns an HTML page that reports the current time on the server’s machine.
Templated Results
Modify the file server to:
- turn the query parameters in the URL into a dictionary;
- use that dictionary to fill in a template page (Chapter 12); and
- return the resulting HTML page to the client.