tbl: testing round one

In the previous post, I rearranged the structure of the code to align it more closely with what we might expect for a Python package that can be installed via pip. It is never too early to begin arranging the structure of a project appropriately, and it is never too early to begin testing.

I have a clear idea of what I expect this project to be, because I’ve written it before, and written tests, documentation, and a draft paper on it. However, coming soon, I’ll need to get those ideas expressed here in something resembling a coherent design. For now, though, I’ll continue exploring a bit. But, I’ll do it properly.

My “driver” code right now looks like this:

import tbl

pets_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv"

a_tbl = tbl.tbl(url = pets_url)
a_tbl.show_columns()

And, when executed, it outputs this:

Name        : None
Weight      : None
Color       : None

That’s fine, because those are the contents of the header row of my spreadsheet. But, saying “it looks right” is no way to test software. Although I haven’t articulated a complete design, one thing I know my library will need to be able to do is read a CSV file from a URL and convert it into a tbl (a structure that is yet to be fully described).

So, for my next trick, I’ll put some light testing in place. Even though the structures might change, and this might require me to re-write some tests, there are two important things to committing to testing early: first, I can continue exploring with confidence that the code I’ve written so far is working the way I expect. Second, even if I change the structures over which I’m writing the tests, it should remain true that the tests themselves are “good.” Or, put another way, I may have to rewrite the tests, but the tests will be a framework that will remain constant regardless of how the structures change. This will again provide confidence in the library in the face of refactoring, and give me velocity both in terms of development and the development of future tests.

(Some googling suggests that there may be more than one way to do it, and that I may have some additional refactoring to do. But, I’ll proceed with documentation from pytest for now, knowing that a first step that is reasonable is better than no step at all.)

my friend pytest

There are many testing frameworks in many languages. I’m going to leverage pytest here because it is lightweight to leverage, and I’ll take speed over complexity early in any programming endeavor. And, in truth, I may never need anything more complex than pytest, because it really is quite capable.

first problem: not importable

Uh-oh.

(venv) jadudm@lego:~/git/pytbl$ pip install -e . 
Directory '.' is not installable. File 'setup.py' not found.

It looks like I need a setup.py. My initial setup looks like this:

from setuptools import setup

setup(name='tbl',
      version='0.1',
      description='A tabular way to think about data.',
      url='http://github.com/jadudm/pytbl',
      author='Matt Jadud',
      author_email='[email protected]',
      license='MIT',
      packages=['tbl'],
      zip_safe=False)

And, now:

(venv) jadudm@lego:~/git/pytbl$ pip install -e . 
Obtaining file:///home/jadudm/git/pytbl
Installing collected packages: tbl
  Running setup.py develop for tbl
Successfully installed tbl

That ‘pip install’ command creates a symlink to my package directory in the venv. This way, I can keep developing the code and running tests, and I will always be testing against the “live” code.

I can now run python3 -m pytest, and get:

(venv) jadudm@lego:~/git/pytbl$ python3 -m pytest
======================================= test session starts ========================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jadudm/git/pytbl
collected 0 items                                                                                  

====================================== no tests ran in 1.03s =======================================

This is good.

testing one function

At this point, I want to test the import of the CSV file. There’s a lot of tests I can run at this point, because (really) my first function is almost too complex.

What if the programmer using tbl

  • gives me a bad URL?
  • says there is a header, but there isn’t?
  • says there isn’t a header, but there is?
  • gives me a spreadsheet with a header and no data?
  • gives me a URL to more data than I can hold in memory?
  • gives me a URL to more data than I can store on disk?
  • gives me a URL to something that is not a CSV/spreadsheet?
  • gives me a spreadsheet with data, and a header, and says it has a header?

The last one is actually the easy/ideal case. The others are failure cases, some of which might be difficult to catch early. But, anywhere you give a programmer the ability to pull data in—especially over the network—you have to begin thinking in a really paranoid way. And, when dealing with novice programmers, they might be taking random stabs at things, or (more likely) really trying hard to figure things out, but this will still be in the space of “desperate guessing” in some cases.

So, time to write some tests.

a bad URL

What is a “bad” URL? In this case, we’ll call it a URL that does not point to a CSV file, or (worse) is simply not a URL. This could look like the following:

a_tbl = tbl.tbl(url = True)
a_tbl = tbl.tbl(url = 1)
a_tbl = tbl.tbl(url = "lobster")
a_tbl = tbl.tbl(url = [True])
a_tbl = tbl.tbl(url = [1])
a_tbl = tbl.tbl(url = ["lobster"])
a_tbl = tbl.tbl(url = ["https://lobster.org/northaven.csv"])
a_tbl = tbl.tbl(url = "http")
a_tbl = tbl.tbl(url = "https")
a_tbl = tbl.tbl(url = "https://lobster")
# Technically, this is a good URL, but we have no idea if 
# it serves up a CSV file. 
a_tbl = tbl.tbl(url = "https://lobster.org/")
a_tbl = tbl.tbl(url = "https://lobster.org/northhaven")
a_tbl = tbl.tbl(url = "https://lobster.org/northhaven.txt")

This has begun to suggest what we’re going to consider a “good” URL. This may not be obvious, but I’m going to bet money that validating URLs is hard. There’s whole specs on how to format a URL/URI, so… why would I want to try and write this myself? A bit of googling confirms that Python has what I want: a validation package for URLs (and other stuff). I found this from a Stack Overflow thread, which (had I followed the first recommendation), I would have ended up implementing my own. Not a good idea.

I’m going to have to do the checking inside of the call to the tbl constructor, but I’ll farm it out to a helper function. I’ve created a module called validation.py that will contain all of my validation code, so that the class doesn’t get too heavy. (Is this good OOP? Probably not.)

My first validation function looks like this:

import validators as val
from collections import namedtuple as NT
import re

OK = NT("OK", [])
KO = NT("KO", ["code", "message"])

# Error Codes
BAD_URL = 0
DOES_NOT_END_IN_CSV = 1
URL_NOT_A_STRING = 2

def _check_from_sheet(url, has_header):
  # These will "fail fast."
  # Make sure it is a string.
  if not isinstance(url, str):
    return KO(URL_NOT_A_STRING, "The URL you passed does not look like a string: {}".format(url))
  if not val.url(url):
    return KO(BAD_URL, "The URL '{}' appears to be invalid.".format(url))
  # Should the URL end in CSV? Am I guaranteed that a Google Sheets 
  # CSV URL will end this way? This might get tricky.
  # If it is a sheets URL, and the letters "csv" appear in the URL, it will be OK.
  if (re.search("docs.google.com", url)
      and re.search("spreadsheets", url)
      and re.search("csv", url)):
      return OK()
  # If it isn't a sheets URL, then perhaps it is a valid URL that 
  # just points to a CSV. Therefore, it should end in '.csv'.
  if not (re.search(".csv$", url) or re.search(".CSV$", url)):
    return KO(DOES_NOT_END_IN_CSV, "The file you linked to does not end in '.csv'.")
  return OK() 

I’ve created two unique types – OK and KO – and started defining some error codes. I don’t know how I’ll use them yet, but I do like the idea of being able to ask if something is validation.OK(). Now, I need to see if I can write test code for all of the above examples, and get back responses that I expect.

This has turned into the following test file:

import tbl
from tbl import validation as V

pets_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv"

def test_bool_url():
  a_tbl = tbl.tbl(url = True)
  assert type(a_tbl.fields.status) is V.KO

def test_int_url():
  a_tbl = tbl.tbl(url = 1)
  assert type(a_tbl.fields.status) is V.KO

def test_url_str_not_url():
  a_tbl = tbl.tbl(url = "lobster")
  assert type(a_tbl.fields.status) is V.KO

def test_url_list_bool():
  a_tbl = tbl.tbl(url = [True])
  assert type(a_tbl.fields.status) is V.KO

def test_url_list_int():
  a_tbl = tbl.tbl(url = [1])
  assert type(a_tbl.fields.status) is V.KO

def test_url_list_str():
  a_tbl = tbl.tbl(url = ["lobster"])
  assert type(a_tbl.fields.status) is V.KO

def test_list_good_url():
  a_tbl = tbl.tbl(url = ["https://lobster.org/northaven.csv"])
  assert type(a_tbl.fields.status) is V.KO

def test_protocol():
  a_tbl = tbl.tbl(url = "http")
  assert type(a_tbl.fields.status) is V.KO

def test_protocol_s():
  a_tbl = tbl.tbl(url = "https")
  assert type(a_tbl.fields.status) is V.KO

def test_partial_url():
  a_tbl = tbl.tbl(url = "https://lobster")
  assert type(a_tbl.fields.status) is V.KO

# Technically, this is a good URL, but we have no idea if 
# it serves up a CSV file. 
def test_good_url_not_csv():
  a_tbl = tbl.tbl(url = "https://lobster.org/")
  assert type(a_tbl.fields.status) is V.KO

def test_good_url_not_csv2():
  a_tbl = tbl.tbl(url = "https://lobster.org/northhaven")
  assert type(a_tbl.fields.status) is V.KO

def test_good_url_txt():  
  a_tbl = tbl.tbl(url = "https://lobster.org/northhaven.txt")
  assert type(a_tbl.fields.status) is V.KO

def test_goog_url_incomplete():
  a_tbl = tbl.tbl(url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true")
  assert type(a_tbl.fields.status) is V.KO

def test_complete_goog_url():
  a_tbl = tbl.tbl(url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv")
  assert type(a_tbl.fields.status) is V.OK

It feels repetitious. In fact, I now realize that I could create some tables/lists of input data, and do all of this testing in a loop. However, I’ll leave this for the moment: I now have good tests over the possible inputs a user might throw my way, and that makes me happy.

error codes, or exceptions?

What should this library do if a user provides a bad URL? Should a_tbl be an object that is in a bad state, but the object knows it, and therefore won’t do bad things? Or, should the object throw exceptions, causing the user’s code to crash out?

This question has answers that are less obvious than I would like. There are different schools of thought on language/class design around the topic of exceptions. Python likes exceptions, golang prefers error codes. This will require some thought and a bit more reading, as I prefer the latter, but wonder if it is more important to be Pythonic.

And, regardless of whether it is “Pythonic,” the question really is “what would be most usable to a novice programmer working with data?”

still not done testing…

And, if you’re still reading, you’ll realize that I’m not done testing. That is, I had imagined more ways the user might try and abuse my library than I actually tested for. So far, I’ve only handled the “bad URL” condition. What if the CSV they hand me is malformed? That’s another whole round of validation that has to come after I check if the CSV URL is even valid. Then, I have to check if I can fetch the URL, and if it is a reasonable size, and …

For tomorrow. For tonight, I’ve made progress.

posts in the series

2020 03 10 tbl: a slice of cake code
2020 03 09 tbl: testing round one code
2020 03 09 tbl: structuring the project code
2020 03 08 tbl: abstractions and imports code
2020 03 07 tbl: thinking about data code

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.