Research Software Design by Example
Site:,GBY
site,date,lon,lat,reading
GBY,2023-05-08,-124.45981,48.92163,26.7
GBY,2023-05-08,-124.45932,48.92091,70.9
GBY,2023-05-08,-124.46036,48.92098,38.1
GBY,2023-05-08,-124.45743,48.92101,1.7
GBY,2023-05-08,-124.46048,48.92059,33.0
GBY,2023-05-08,-124.46061,48.92166,7.9
GBY,2023-05-08,-124.46004,48.92069,50.1
GBY,2023-05-08,-124.45828,48.92066,32.6
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1], skiprows=1)
print(df)
site date lon lat reading
0 GBY 2023-05-08 -124.45981 48.92163 26.7
1 GBY 2023-05-08 -124.45932 48.92091 70.9
2 GBY 2023-05-08 -124.46036 48.92098 38.1
3 GBY 2023-05-08 -124.45743 48.92101 1.7
4 GBY 2023-05-08 -124.46048 48.92059 33.0
5 GBY 2023-05-08 -124.46061 48.92166 7.9
6 GBY 2023-05-08 -124.46004 48.92069 50.1
7 GBY 2023-05-08 -124.45828 48.92066 32.6
Unnamed: 0 site date lon lat reading
0 NaN YOU 2023-05-01 -124.19699 48.87250 96.6
1 NaN YOU 2023-05-01 -124.19707 48.87281 70.0
2 NaN YOU 2023-05-01 -124.19808 48.87288 41.7
3 NaN YOU 2023-05-01 -124.19678 48.87271 87.6
4 NaN YOU 2023-05-01 -124.19725 48.87254 83.8
5 NaN YOU 2023-05-01 -124.19829 48.87155 5.4
6 NaN YOU 2023-05-01 -124.19696 48.87250 72.8
7 NaN YOU 2023-05-01 -124.19737 48.87231 74.9
8 NaN YOU 2023-05-01 -124.19669 48.87268 43.0
9 NaN YOU 2023-05-01 -124.19704 48.87257 94.2
10 NaN YOU 2023-05-01 -124.19734 48.87358 15.9
11 NaN YOU 2023-05-01 -124.19670 48.87378 4.5
Site:,COW
Analyst:,P. Srinath
SITE,DATE,LON,LAT,READING
COW,2023-04-27,-124.04518,48.82171,106.9
COW,2023-04-27,-124.045,48.8216,81.7
…
def main():
"""Main driver."""
args = parse_args()
if args.infile:
with open(args.infile, 'r') as reader:
df = load(reader)
else:
df = load(sys.stdin)
if args.outfile:
with open(args.outfile, 'w') as writer:
df.to_csv(writer, index=False)
else:
df.to_csv(sys.stdout, index=False)
argparse
module to parse argumentsdef parse_args():
"""Parse command-line arguments."""
parser = argparse.ArgumentParser()
parser.add_argument('--infile', type=str, default=None, help='input')
parser.add_argument('--outfile', type=str, default=None, help='output')
return parser.parse_args()
def load(reader):
"""Load messy data."""
lines = [row for row in csv.reader(reader)]
header, body = split(lines)
titles, data = normalize(body)
assert titles[0] == 'site'
return pd.DataFrame(data, columns=titles)
class State(Enum):
"""Enumerate possible parser states."""
HEADER = 'header'
SEARCHING = 'searching'
BODY = 'body'
DONE = 'done'
def split(rows):
"""Split header from body."""
header, body, state = ([], [], State.HEADER)
for row in rows:
if state == State.HEADER:
if is_empty(row):
state = State.SEARCHING
elif is_start_of_body(row):
state = State.BODY
body.append(row)
else:
header.append(row)
elif state == State.SEARCHING:
if is_start_of_body(row):
state = State.BODY
body.append(row)
elif state == State.BODY:
if is_empty(row):
state = State.DONE
else:
body.append(row)
else:
assert state == State.DONE
return (header, body)
def normalize(rows):
"""Remove leading spaces from rows if necessary."""
if rows[0][0] == '':
rows = [r[1:] for r in rows]
return ([r.lower() for r in rows[0]], rows[1:])
Check that all rows of body are indented the same amount.
Check that all expected columns are there.
Check consistency of site name in header with site names on rows.
Check longitude and latitude: what are reasonable bounds on these?
Check readings: what are reasonable values for these?
Were your “reasonable” bounds for the previous two exercises the same as other people’s? How can you make them easier to discover? Should they be changeable from the command line?
Modify the parser to take an optional filename as an argument. If one is provided, the parser reads a dataframe from that file and compares it to the dataframe loaded from the first file. How would you use this in testing?