Performance Profiling

  • Create abstract classes to specify interfaces.
  • Store two-dimensional data as rows or as columns.
  • Use reflection to match data to function parameters.
  • Measure performance to evaluate engineering tradeoffs.

Terms defined: batch processing, benchmark, column-wise storage, data engineer, dataframe, docstring, immutable, index (a database), join (tables), online analytical processing, online transaction processing, parameter sweeping, profiling, row-wise storage

One peril of publishing a book online is obsessing over analytics. How many people visited the site today? Which pages did they look at, and for how long? Whether we use Excel, SQL, or Python, we will almost certainly be analyzing tables with named columns and multiple rows. Such tables are called dataframes, and their performance is important when we are working with large data sets. This chapter therefore implements dataframes in two ways and shows how to compare their performance.

Options

To start, let’s create an abstract class that defines the methods our dataframe classes will support. This class requires concrete classes to implement the methods shown below:

class DataFrame:
    def ncol(self):
        """Report the number of columns."""

    def nrow(self):
        """Report the number of rows."""

    def cols(self):
        """Return the set of column names."""

    def eq(self, other):
        """Check equality with another dataframe."""

    def get(self, col, row):
        """Get a scalar value."""

    def select(self, *names):
        """Select a named subset of columns."""

    def filter(self, func):
        """Select a subset of rows by testing values."""

Docstrings Are Enough

Every method in Python needs a body, so many programmers will write pass (Python’s “do nothing” statement). However, a docstring also counts as a body, so if we write those (which we should) there’s no need to write pass.
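For example, this toy function (not part of the chapter’s code) has a docstring as its only body and runs without error:

```python
def greet():
    """A docstring counts as a function body, so pass is unnecessary."""

# calling it works; the body does nothing and returns None
assert greet() is None
```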

For our first usable implementation, we will derive a class DfRow that uses row-wise storage (Figure 15.1). The dataframe is stored as a list of dictionaries, each of which represents a row. All of the dictionaries must have the same keys so that the concept of “column” is meaningful, and the values associated with a particular key must all have the same type.

Row-wise storage
Figure 15.1: Storing a dataframe’s data in rows.

Our second implementation, DfCol, will use column-wise storage (Figure 15.2). Each column is stored as a list of values, all of which are of the same type. The dataframe itself is a dictionary of such lists, all of which have the same length so that there are no holes in any of the rows. How we store the data determines which methods are easy to implement and which are hard, and which are fast or slow.

Column-wise storage
Figure 15.2: Storing a dataframe’s data in columns.
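To make the two layouts concrete, here is one small table stored both ways (the values are invented for illustration):

```python
# the same 3-row, 2-column table in row-wise and column-wise form
row_wise = [
    {"red": 10, "green": 5},
    {"red": 0, "green": 7},
    {"red": 3, "green": 3},
]
col_wise = {
    "red": [10, 0, 3],
    "green": [5, 7, 3],
}

# every value appears in the same place in both layouts
for (i, row) in enumerate(row_wise):
    for name in row:
        assert row[name] == col_wise[name][i]
```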

Row-Wise Storage

We start by deriving DfRow from DataFrame and writing its constructor, which takes a list of dictionaries as an argument, checks that they’re consistent with each other, and saves them:

from df_base import DataFrame
from util import dict_match

class DfRow(DataFrame):
    def __init__(self, rows):
        assert len(rows) > 0
        assert all(dict_match(r, rows[0]) for r in rows)
        self._data = rows

The helper function to check that a bunch of dictionaries all have the same keys and the same types of values associated with those keys is:

def dict_match(d, prototype):
    if set(d.keys()) != set(prototype.keys()):
        return False
    return all(isinstance(d[k], type(prototype[k])) for k in d)

Notice that DfRow’s constructor compares all of the rows against the first row. Doing this means that we can’t create an empty dataframe, i.e., one that has no rows. This restriction wasn’t part of our original design: it’s an accident of implementation that might surprise our users. It’s OK not to fix this while we’re prototyping, but we will look at ways to address it in the exercises.
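Restated so that it runs on its own, the helper behaves like this (the example rows are invented):

```python
def dict_match(d, prototype):
    # same keys, and each value has the same type as the prototype's
    if set(d.keys()) != set(prototype.keys()):
        return False
    return all(isinstance(d[k], type(prototype[k])) for k in d)

assert dict_match({"a": 1, "b": "x"}, {"a": 9, "b": "y"})       # same keys, same types
assert not dict_match({"a": 1}, {"a": 9, "b": "y"})             # missing a key
assert not dict_match({"a": "oops", "b": "x"}, {"a": 9, "b": "y"})  # wrong value type
```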

Four of the methods required by DataFrame are easy to implement on top of row-wise storage, though once again our implementation assumes there is at least one row:

    def ncol(self):
        return len(self._data[0])

    def nrow(self):
        return len(self._data)

    def cols(self):
        return set(self._data[0].keys())

    def get(self, col, row):
        assert col in self._data[0]
        assert 0 <= row < len(self._data)
        return self._data[row][col]

Checking equality is also relatively simple. Two dataframes are the same if they have exactly the same columns and the same values in every column:

    def eq(self, other):
        assert isinstance(other, DataFrame)
        for (i, row) in enumerate(self._data):
            for key in row:
                if key not in other.cols():
                    return False
                if row[key] != other.get(key, i):
                    return False
        return True

Notice that we use other.cols() and other.get() rather than reaching into the other dataframe. We are planning to implement dataframes in several different ways, and we might want to compare instances of those different implementations. Since they might have different internal data structures, the only safe way to do this is to rely on the interface defined in the base class.

Our final operations are selection, which returns a subset of the original dataframe’s columns, and filtering, which returns a subset of its rows. Since we don’t know how many columns the user might want, we give the select method a single parameter *names that will capture zero or more positional arguments. We then build a new list of dictionaries that only contain the fields with those names (Figure 15.3):

    def select(self, *names):
        assert all(n in self._data[0] for n in names)
        rows = [{key: r[key] for key in names} for r in self._data]
        return DfRow(rows)
Row-wise select
Figure 15.3: Selecting columns from data stored as rows.

We now need to decide how to filter rows. Typical filtering conditions include, “Keep rows where red is non-zero,” “Keep rows where red is greater than green,” and, “Keep rows where red+green is within 10% of blue.” Rather than trying to anticipate every possible rule, we require users to define functions whose parameters match the names of the table’s columns. For example, if we have this test fixture:

def odd_even():
    return DfRow([{"a": 1, "b": 3}, {"a": 2, "b": 4}])

then we should be able to write this test:

def test_filter():
    def odd(a, b):
        return (a % 2) == 1

    df = odd_even()
    assert df.filter(odd).eq(DfRow([{"a": 1, "b": 3}]))

We can implement this by using ** to spread the row across the function’s parameters (Chapter 2). If there are keys in the row that don’t match parameters in the function or vice versa, Python will raise a TypeError, but that’s probably what we want. Using this, the implementation of DfRow.filter is:

    def filter(self, func):
        result = [r for r in self._data if func(**r)]
        return DfRow(result)

Notice that the dataframe created by filter re-uses the rows of the original dataframe (Figure 15.4). This is safe and efficient as long as dataframes are immutable, i.e., as long as their contents are never changed in place. Most dataframe libraries work this way: while recycling memory can save a little time, it usually also makes bugs much harder to track down.

Row-wise filtering
Figure 15.4: Filtering data stored as rows.
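A small standalone sketch of both ideas: ** spreading a row across the function’s parameters, and the filtered result sharing the original row objects:

```python
def odd(a, b):
    return (a % 2) == 1

rows = [{"a": 1, "b": 3}, {"a": 2, "b": 4}]

# ** spreads each row's key-value pairs across odd's parameters
kept = [r for r in rows if odd(**r)]
assert kept == [{"a": 1, "b": 3}]

# the filtered list reuses the original dictionaries instead of copying them
assert kept[0] is rows[0]
```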

Column-Wise Storage

Having done all of this thinking, our column-wise dataframe class is somewhat easier to write. We start as before with its constructor:

from df_base import DataFrame
from util import all_eq

class DfCol(DataFrame):
    def __init__(self, **kwargs):
        assert len(kwargs) > 0
        assert all_eq(*[len(kwargs[k]) for k in kwargs])
        for k in kwargs:
            assert all_eq(*[type(v) for v in kwargs[k]])
        self._data = kwargs

and use a helper function all_eq to check that all of a set of values are the same. The constructor uses it twice: once to check that every column has the same length, and once per column to check that all of a column’s values have the same type:

def all_eq(*values):
    return (not values) or all(v == values[0] for v in values)
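Because all_eq takes its values as separate arguments, callers must unpack lists or generators with *; restating the helper so the snippet runs on its own:

```python
def all_eq(*values):
    # vacuously true for no values, otherwise compare everything to the first
    return (not values) or all(v == values[0] for v in values)

assert all_eq(3, 3, 3)
assert not all_eq(3, 4)
assert all_eq()  # no values at all

# note: the values must be unpacked -- passing a list directly would be
# a single (trivially self-equal) argument
lengths = [2, 2, 2]
assert all_eq(*lengths)
```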

One Allowable Difference

Notice that DfCol’s constructor does not have the same signature as DfRow’s. At some point in our code we have to decide which of the two classes to construct. If we design our code well, that decision will be made in exactly one place and everything else will rely solely on the common interface defined by DataFrame. But since we have to type a different class name at the point of construction, it’s OK for the constructors to be different.

The four methods that were simple to write for DfRow are equally simple to write for DfCol, though once again our prototype implementation accidentally disallows empty dataframes:

    def ncol(self):
        return len(self._data)

    def nrow(self):
        n = list(self._data.keys())[0]
        return len(self._data[n])

    def cols(self):
        return set(self._data.keys())

    def get(self, col, row):
        assert col in self._data
        assert 0 <= row < len(self._data[col])
        return self._data[col][row]

As with DfRow, the method that checks equality relies on the internal details of its own class but uses the interface defined by DataFrame to access the other object:

    def eq(self, other):
        assert isinstance(other, DataFrame)
        for n in self._data:
            if n not in other.cols():
                return False
            for i in range(len(self._data[n])):
                if self.get(n, i) != other.get(n, i):
                    return False
        return True

To select columns, we pick the ones named by the caller and use them to create a new dataframe. Again, this recycles the existing storage (Figure 15.5):

    def select(self, *names):
        assert all(n in self._data for n in names)
        return DfCol(**{n: self._data[n] for n in names})
Column-wise selection
Figure 15.5: Column-wise selection.

Finally, we need to filter the rows of a column-wise dataframe. Doing this is complex: since values are stored in columns, we have to extract the ones belonging to each row to pass them into the user-defined filter function (Figure 15.6). And if that wasn’t enough, we want to do this solely for the columns that the user’s function needs.

Packing column values into rows
Figure 15.6: Extracting values from columns to create temporary rows.

For now, we will solve this problem by requiring the user-defined filter function to define parameters to match all of the dataframe’s columns regardless of whether they are used for filtering or not. We will then build a temporary dictionary with all the values in a “row” (i.e., the corresponding values across all columns) and use ** to spread it across the filter function. Appendix B looks at a safer, but more complex, way to do this.

    def filter(self, func):
        result = {n: [] for n in self._data}
        for i in range(self.nrow()):
            args = {n: self._data[n][i] for n in self._data}
            if func(**args):
                for n in self._data:
                    result[n].append(self._data[n][i])
        return DfCol(**result)

Time to write some tests. This one checks that we can construct a dataframe with some values:

def test_construct_with_two_pairs():
    df = DfCol(a=[1, 2], b=[3, 4])
    assert df.get("a", 0) == 1
    assert df.get("a", 1) == 2
    assert df.get("b", 0) == 3
    assert df.get("b", 1) == 4

while this one checks that filter works correctly:

def test_filter():
    def odd(a, b):
        return (a % 2) == 1

    df = DfCol(a=[1, 2], b=[3, 4])
    assert df.filter(odd).eq(DfCol(a=[1], b=[3]))

Performance

Our two implementations of dataframes have identical interfaces, so how can we choose which to use?

Transactions vs. Analysis

Regardless of data volumes, different storage schemes are better (or worse) for different kinds of work. Online transaction processing (OLTP) refers to adding or querying individual records, such as online sales. Online analytical processing (OLAP), on the other hand, processes selected columns of a table in bulk to do things like find averages over time. Row-wise storage is usually best for OLTP, but column-wise storage is better suited for OLAP. If data volumes are large, data engineers will sometimes run two databases in parallel, using batch processing jobs to copy new or updated records from the OLTP databases over to the OLAP database.

To compare the speed of these classes, let’s write a short program to create dataframes of each kind and time how long it takes to select their columns and filter their rows. To keep things simple, we will create dataframes whose columns are called label_1, label_2, and so on, and whose values are all integers in the range 0–9. A thorough set of benchmarks would create columns with other datatypes as well, but this example is enough to illustrate the technique.

RANGE = 10

def make_col(nrow, ncol):
    def _col(n, start):
        return [((start + i) % RANGE) for i in range(n)]
    fill = {f"label_{c}": _col(nrow, c) for c in range(ncol)}
    return DfCol(**fill)

def make_row(nrow, ncol):
    labels = [f"label_{c}" for c in range(ncol)]
    def _row(r):
        return {
            c: ((r + i) % RANGE) for (i, c) in enumerate(labels)
        }
    fill = [_row(r) for r in range(nrow)]
    return DfRow(fill)

To time filter, we arbitrarily decide to keep rows with an odd value in the first column:

FILTER = 2

def time_filter(df):
    def f(label_0, **args):
        return label_0 % FILTER == 1
    start = time.time()
    df.filter(f)
    return time.time() - start

Since DfCol and DfRow derive from the same base class, time_filter doesn’t care which we give it. Again, if we were doing this for real, we would look at actual programs to see what fraction of rows filtering usually kept and simulate that.

To time select, we arbitrarily decide to keep one-third of the columns:

SELECT = 3

def time_select(df):
    indices = [i for i in range(df.ncol()) if ((i % SELECT) == 0)]
    labels = [f"label_{i}" for i in indices]
    start = time.time()
    df.select(*labels)
    return time.time() - start

Finally, we write a function that takes a list of (nrow, ncol) pairs, parsed from command-line arguments like 3x3 or 100x20, creates dataframes of each size, times operations, and reports the results. We call this function sweep because executing code multiple times with different parameters to measure performance is called parameter sweeping:

def sweep(sizes):
    result = []
    for (nrow, ncol) in sizes:
        df_col = make_col(nrow, ncol)
        df_row = make_row(nrow, ncol)
        times = [
            time_filter(df_col),
            time_select(df_col),
            time_filter(df_row),
            time_select(df_row),
        ]
        result.append([nrow, ncol, *times])
    return result
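The time.time calls above are the simplest approach; for steadier numbers we could instead use the standard library’s timeit module, which repeats a call many times (a side note, not part of the chapter’s program):

```python
from timeit import timeit

# average twenty repetitions of a stand-in workload
reps = 20
per_call = timeit(lambda: sorted(range(10_000)), number=reps) / reps
assert per_call > 0.0
```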

The results are shown in Table 15.1 and Figure 15.7. For a \( 1000 \times 1000 \) dataframe, selection is over 250 times faster with column-wise storage than with row-wise, while filtering is 1.8 times slower.

Table 15.1: Dataframe timings.
nrow    ncol    filter col   select col   filter row   select row
10      10      8.87e-05     7.70e-05     4.41e-05     2.50e-05
100     100     0.00275      4.10e-05     0.00140      8.76e
1000    1000    0.146        0.000189     0.0787       0.0508
10000   10000   19.0         0.00234      9.97         5.57
Performance curves
Figure 15.7: Relative performance of row-wise and column-wise storage.

We can get much more insight by profiling our code using Python’s cProfile module, which collects detailed information on how long each function runs and reports the result:

python -m cProfile --sort=tottime \
  timing.py --silent 10x10 50x50 100x100 500x500 1000x1000
         3007281 function calls (3003108 primitive calls) in 2.120 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2319840    0.671    0.000    0.671    0.000 util.py:10(<genexpr>)
        5    0.271    0.054    0.521    0.104 df_col.py:50(filter)
     1660    0.261    0.000    0.261    0.000 timing.py:20(<dictcomp>)
8066/3916    0.213    0.000    1.056    0.000 {built-in method builtins.all}
     1660    0.191    0.000    0.191    0.000 df_col.py:53(<dictcomp>)
     4150    0.165    0.000    1.048    0.000 util.py:7(dict_match)
     1660    0.144    0.000    0.144    0.000 timing.py:13(<listcomp>)
        5    0.061    0.012    0.062    0.012 df_row.py:49(<listcomp>)
   631305    0.054    0.000    0.054    0.000 {method 'append' of 'list' objects}
     1660    0.049    0.000    0.049    0.000 df_row.py:43(<dictcomp>)
        1    0.008    0.008    2.120    2.120 timing.py:1(<module>)
     4165    0.004    0.000    1.052    0.000 df_row.py:8(<genexpr>)
       15    0.003    0.000    0.008    0.001 df_col.py:6(__init__)
        5    0.002    0.000    0.148    0.030 timing.py:14(<dictcomp>)
     1660    0.002    0.000    0.263    0.000 timing.py:19(_row)
     1660    0.002    0.000    0.146    0.000 timing.py:12(_col)
     3891    0.002    0.000    0.004    0.000 util.py:2(all_eq)
     3320    0.002    0.000    0.002    0.000 timing.py:31(f)
       10    0.001    0.000    0.244    0.024 timing.py:41(time_select)
       10    0.001    0.000    0.872    0.087 timing.py:30(time_filter)
        1    0.001    0.001    2.109    2.109 timing.py:50(sweep)
     7782    0.001    0.000    0.001    0.000 util.py:3(<genexpr>)
        5    0.001    0.000    0.264    0.053 timing.py:23(<listcomp>)
     8305    0.001    0.000    0.001    0.000 {method 'keys' of 'dict' objects}
        5    0.001    0.000    0.050    0.010 df_row.py:43(<listcomp>)
        1    0.000    0.000    0.000    0.000 {built-in method _imp.create_dynamic}
     3876    0.000    0.000    0.000    0.000 df_col.py:10(<genexpr>)
       10    0.000    0.000    0.000    0.000 timing.py:42(<listcomp>)
        6    0.000    0.000    0.000    0.000 {built-in method posix.getcwd}
        5    0.000    0.000    0.000    0.000 timing.py:18(<listcomp>)
        5    0.000    0.000    0.152    0.030 timing.py:11(make_col)
       10    0.000    0.000    0.000    0.000 timing.py:43(<listcomp>)
        5    0.000    0.000    0.000    0.000 df_col.py:51(<dictcomp>)
       22    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        5    0.000    0.000    0.000    0.000 {built-in method io.open_code}
        5    0.000    0.000    0.002    0.000 df_col.py:44(select)
        5    0.000    0.000    0.000    0.000 {built-in method marshal.loads}
      561    0.000    0.000    0.000    0.000 df_col.py:45(<genexpr>)
      561    0.000    0.000    0.000    0.000 df_row.py:42(<genexpr>)
       10    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        5    0.000    0.000    0.000    0.000 df_col.py:46(<dictcomp>)
       11    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1604(find_spec)
       40    0.000    0.000    0.000    0.000 {built-in method time.time}
        5    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        5    0.000    0.000    0.350    0.070 df_row.py:48(filter)
        6    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1054(_find_spec)
        5    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1007(get_code)
      6/3    0.000    0.000    0.003    0.001 <frozen importlib._bootstrap>:1165(_find_and_load)
       15    0.000    0.000    1.053    0.070 df_row.py:6(__init__)
        5    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
       56    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:128(<listcomp>)
       56    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
       42    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        5    0.000    0.000    0.240    0.048 df_row.py:41(select)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:437(cache_from_source)
        6    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1464(_get_spec)
        5    0.000    0.000    0.839    0.168 timing.py:17(make_row)
        1    0.000    0.000    0.000    0.000 {built-in method _imp.exec_dynamic}
       66    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:493(_init_module_attrs)
      6/3    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap>:666(_load_unlocked)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
        1    0.000    0.000    0.001    0.001 csv.py:1(<module>)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1127(get_data)
      6/3    0.000    0.000    0.003    0.001 <frozen importlib._bootstrap>:1120(_find_and_load_unlocked)
        5    0.000    0.000    0.000    0.000 df_col.py:18(nrow)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:778(spec_from_file_location)
      6/1    0.000    0.000    2.120    2.120 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 {built-in method _csv.register_dialect}
      5/3    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap_external>:934(exec_module)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1599(_get_spec)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:244(_verbose_message)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:132(_path_split)
        6    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:566(module_from_spec)
       66    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:100(acquire)
        6    0.000    0.000    0.000    0.000 __init__.py:89(find_spec)
       13    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1421(_path_importer_cache)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:125(release)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:567(_get_cached)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:727(_compile_bytecode)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:642(_classify_pyc)
      122    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
       10    0.000    0.000    0.000    0.000 {built-in method builtins.max}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:169(__enter__)
       15    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:84(_unpack_uint32)
     13/6    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
       24    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1030(__exit__)
       22    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:140(_path_stat)
       11    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:392(cached)
        6    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1496(find_spec)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:198(cb)
        1    0.000    0.000    0.001    0.001 df_col.py:1(<module>)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:134(<genexpr>)
       11    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:67(_relax_case)
        6    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
       24    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1026(__enter__)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:71(__init__)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:675(_validate_timestamp_pyc)
       33    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
       15    0.000    0.000    0.000    0.000 df_col.py:8(<genexpr>)
       23    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:599(_check_name_wrapper)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:173(__exit__)
       36    0.000    0.000    0.000    0.000 {built-in method _imp.release_lock}
       36    0.000    0.000    0.000    0.000 {built-in method _imp.acquire_lock}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:920(find_spec)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1146(path_stats)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:150(_path_is_mode_type)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:159(_path_isfile)
        6    0.000    0.000    0.000    0.000 {built-in method builtins.locals}
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:48(_new_module)
       23    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
       12    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        1    0.000    0.000    0.000    0.000 csv.py:80(DictReader)
        6    0.000    0.000    0.000    0.000 {built-in method _imp.is_builtin}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:405(parent)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:357(__init__)
        6    0.000    0.000    0.000    0.000 {built-in method _imp.find_frozen}
        1    0.000    0.000    0.000    0.000 df_row.py:1(<module>)
        1    0.000    0.000    0.000    0.000 csv.py:165(Sniffer)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1231(create_module)
       10    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
        1    0.000    0.000    0.000    0.000 timing.py:68(<listcomp>)
        5    0.000    0.000    0.000    0.000 df_row.py:13(ncol)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:180(_path_isabs)
        7    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 timing.py:67(<listcomp>)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:748(find_spec)
        6    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
       15    0.000    0.000    0.000    0.000 {built-in method from_bytes}
        1    0.000    0.000    0.000    0.000 df_row.py:5(DfRow)
        5    0.000    0.000    0.000    0.000 df_col.py:15(ncol)
       12    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.lock' objects}
        5    0.000    0.000    0.000    0.000 {built-in method _imp._fix_co_filename}
        1    0.000    0.000    0.000    0.000 <frozen io>:60(__getattr__)
        1    0.000    0.000    0.000    0.000 timing.py:66(setup)
       16    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:165(__init__)
       18    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1239(exec_module)
       12    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    0.000    0.000 df_col.py:5(DfCol)
        5    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 csv.py:23(Dialect)
        1    0.000    0.000    0.000    0.000 df_base.py:1(DataFrame)
        6    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1097(__init__)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:413(has_location)
        1    0.000    0.000    0.000    0.000 csv.py:130(DictWriter)
        1    0.000    0.000    0.000    0.000 df_base.py:1(<module>)
        1    0.000    0.000    0.000    0.000 csv.py:54(excel)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1122(get_filename)
        6    0.000    0.000    0.000    0.000 __init__.py:96(<lambda>)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:931(create_module)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1220(__init__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'setter' of 'property' objects}
        1    0.000    0.000    0.000    0.000 csv.py:69(unix_dialect)
        1    0.000    0.000    0.000    0.000 util.py:1(<module>)
        1    0.000    0.000    0.000    0.000 csv.py:64(excel_tab)

The profiler’s output tells us the number of times each function or method was called, the total time spent in those calls (which is what we care about most), the time spent per call, and the cumulative time spent in that call and all the things it calls. We can see right away that the dict_match function that checks the consistency of the rows in a row-oriented dataframe is eating up a lot of time. It’s only called in the constructor, but since we’re constructing a new dataframe for each filter and select, removing that safety check would speed things up.

Looking down a little further, the dictionary comprehension in DfCol.filter takes a lot of time as well. That isn’t surprising: we’re copying the values out of the columns into a temporary dictionary for every row when we filter, and building all those temporary dictionaries adds up to a lot of time.
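cProfile can also be driven from inside a program, which is handy when we only want to profile one part of it; a minimal sketch with a stand-in workload:

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the dataframe operations being profiled
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# sort by total time and print the top five entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
assert "function calls" in stream.getvalue()
```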

Summary

Figure 15.8 summarizes the key ideas introduced in this chapter. The most important is that experiments can help us decide how to implement key features of our software, but the results of those experiments depend on exactly what we measure. Good software designers collect and analyze data all the time to find out whether one website design works better than another [Kohavi2020] or to improve the performance of CPUs [Patterson2017]. A few simple experiments like these can save weeks or months of misguided effort.

Concept map for dataframes
Figure 15.8: Concepts for dataframes.

Exercises

More Efficient Filtering

Derive a class from DfCol and override its filter method so that the user-defined filtering functions take zero or more columns and a row index called i_row as parameters and return True or False to signal whether the row passes the test.

  1. How much faster does this make filtering?

  2. When would it be useful for filtering functions to take no columns at all as parameters?

Empty Dataframes

An empty dataframe is as reasonable and as useful as an empty string or an empty list. DfCol can represent this, but DfRow cannot: if the list of dictionaries is empty, we cannot ask for column names. Derive another dataframe class from DataFrame that uses row-wise storage but can represent a dataframe with no rows.

Unified Constructors

Modify the constructors of DfRow and DfCol to have the same signatures. Where and why might this be useful?

Fixture Functions

Read the documentation for the @fixture decorator in pytest and modify the tests in this chapter to use it.

Using Arrays

Derive another dataframe class from DataFrame that uses Python’s array module for column-wise storage. How does it perform compared to the other implementations?

Crossover

  1. At what ratio of filter operations to select operations are DfRow and DfCol equally fast? (Your answer may depend on the size of the dataframe.)

  2. How does the relative performance of the two classes change if tables have a fixed number of columns (such as 10 or 20) but an increasing number of rows? Is this scenario more realistic?

Conversion

Write a function to convert a DfRow into a DfCol and another to do the opposite. Which one is faster? How does the difference in performance depend on the size and shape of the dataframes being converted?

Filtering by Strings

Modify the comparison of filter and select to work with tables that contain columns of strings instead of columns of numbers and see how that changes performance. For testing, create random 4-letter strings using the characters A-Z and then filter by:

  • an exact match,
  • strings starting with a specific character, and
  • strings that contain a specific character.

Inspection

Rewrite DfCol.filter using Python’s inspect module so that users’ filtering functions only need to define parameters for the columns of interest.
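As a starting point, inspect.signature can report which parameters a user’s function declares (a hint, not a full solution):

```python
import inspect

def wanted_columns(func):
    # the parameter names tell us which columns the function needs
    return list(inspect.signature(func).parameters)

def odd(a):
    return (a % 2) == 1

assert wanted_columns(odd) == ["a"]
```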

Join Performance

A join combines data from two tables based on matching keys. For example, if the two tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

then the join is:

Key Left Right
A a1 a2
A a1 a3
B b1 b2

Write a test to compare the performance of row-wise vs. column-wise storage when joining two tables based on matching numeric keys. Does the answer depend on the fraction of keys that match?

Join Optimization

The simplest way to join two tables is to look for matching keys using a double loop. An alternative is to build an index for each table and then use it to construct matches. For example, suppose the tables are:

Key Left
A a1
B b1
C c1

and:

Key Right
A a2
A a3
B b2

The first step is to create a dictionary showing where each key is found in the first table:

{"A": [0], "B": [1], "C": [2]}

The second step is to create a similar dictionary for the second table:

{"A": [0, 1], "B": [2]}

We can then loop over the keys in one of the maps, look up values in the second map, and construct all of the matches.
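The index-building step described above might be sketched like this (the table layout is invented for illustration; this is a starting point rather than a full solution):

```python
def make_index(rows, key):
    # map each key value to the positions where it occurs
    index = {}
    for (i, row) in enumerate(rows):
        index.setdefault(row[key], []).append(i)
    return index

left = [{"key": "A", "left": "a1"}, {"key": "B", "left": "b1"}, {"key": "C", "left": "c1"}]
right = [{"key": "A", "right": "a2"}, {"key": "A", "right": "a3"}, {"key": "B", "right": "b2"}]

left_index = make_index(left, "key")
assert left_index == {"A": [0], "B": [1], "C": [2]}
assert make_index(right, "key") == {"A": [0, 1], "B": [2]}

# loop over one index and look keys up in the other to build the matches
joined = [
    {**left[i], **right[j]}
    for (k, positions) in make_index(right, "key").items()
    for j in positions
    for i in left_index.get(k, [])
]
assert len(joined) == 3
```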

Write a function that joins two tables this way. Is it faster or slower than using a double loop? How does the answer depend on the number of keys and the fraction that match?