python: from the command line
If I was writing a Python command-line script today, how would I do it?
To start
To start, I’d create a folder for my work, put it in git, and make sure I had a license and README in place. Then, I’d create a venv, so that everything I was doing was local to my development directory. And, that venv
directory would need to get added to my .gitignore
.
cd cmdline_script_dir
python3 -m venv venv
cat venv >> .gitignore
source venv/bin/activate
Now, I’m about ready to get started. Because I’m going to use click
to structure the command-line program, I’ll be putting my requirements in a setup.py
file. I’ll want the following to start:
click
pandas
peewee
Peewee is an object relationship manager (ORM) that makes talking to databases a bit easier. Pandas is a power tool for tabular data manipulation in Python. Click is a really nice library for building command-line interfaces.
Setting up a CSV checker
By using Click, I’ll be able to implement a series of tools as single scripts. To start, I’m going to implement a tool that checks my input CSV. I’ll call it check.py
.
@click.command()
@click.argument('filename')
def cli(filename):
click.echo(filename)
Then, in a setup.py
, I would write the following:
from setuptools import setup
setup(
name='admin tools',
version='0.1.0',
py_modules=['check'],
install_requires=[
'click',
'pandas',
'peewee'
],
entry_points={
'console_scripts': [
'check = check:cli',
],
},
)
I’m leveraging the click documentation in this case. After creating check.py
and the setup.py
, I can then “install” it into my local venv.
pip install --editable .
The nice thing about this is that, when I set up my venv, and then do an install, is that I get a command-line tool called check
. Now, I can pass it a single parameter (a CSV filename), and I can use that filename to check things.
Checking the CSV
What am I going to check? First, lets assume my input looks like this:
name,age,color
Alice,3,red
Bob,2,red
Clarice,4,blue
Darren,1,yellow
Now, I might check the following:
- I want to make sure the file exists.
- I want to make sure the input filename ends with
.csv
- I want to make sure the file has a header file with the right headers.
- I want to make sure the
age
column contains only numbers. - I want to make sure the
color
column contains only valid colors.
This feels like a good start. Because I’m going to drive important processes with this data, I don’t want to operate on anything that is incorrect. If I have garbage in, I’ll get garbage out. And, because I need to know what is happening, I’m going to log everything from the start.
check.py
, handling the first and second conditions, looks like:
import click
import logging
import pandas
from pathlib import Path
import peewee
import sys
# Define the custom logger
logger = logging.getLogger(__name__)
# Set up a console and file logger
stream_handler = logging.StreamHandler()
file_handler = logging.FileHandler('check.log', mode='a')
# Send warnings and up to the console; send everything to the file.
stream_handler.setLevel(logging.WARN)
file_handler.setLevel(logging.DEBUG)
# Define our format
format = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s', datefmt='%d-%b-%y %H:%M:%S')
stream_handler.setFormatter(format)
file_handler.setFormatter(format)
# Add the handlers
logger.addHandler(stream_handler)
logger.addHandler(file_handler)
# https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-without-exceptions
def check_file_exists(filename):
file = Path(filename)
return file.is_file()
# https://stackoverflow.com/questions/10873777/in-python-how-can-i-check-if-a-filename-ends-in-html-or-files
def check_filename_ends_with(filename, ext):
return filename.endswith(ext)
@click.command()
@click.argument('filename')
def cli(filename):
does_file_exist = check_file_exists(filename)
if not does_file_exist:
logger.error("File '{}' does not exist.".format(filename))
sys.exit(-1)
does_filename_end_with = check_filename_ends_with(filename, "csv")
if not does_filename_end_with:
logger.error("{} does not end with CSV.".format(filename))
sys.exit(-1)
I’m pretty sure I’ll be able to pull that logging stuff into a separate file (because I’ll need it in every tool I develop). Now, I can do a quick test on the command-line:
(venv) jadudm@poke:~/git/python-example$ check lobsters.txt
24-Jan-23 15:21:52:ERROR:File 'lobsters.txt' does not exist.
(venv) jadudm@poke:~/git/python-example$ check lobsters.csv
24-Jan-23 15:22:04:ERROR:File 'lobsters.csv' does not exist.
(venv) jadudm@poke:~/git/python-example$ check check.py
24-Jan-23 15:22:15:CRITICAL:check.py does not end with CSV. (venv) jadudm@poke:~/git/python-example$ check lobsters1.csv
This is good. So far, my basic tests are doing their job. And, my logfile looks like this:
24-Jan-23 15:21:52:ERROR:File 'lobsters.txt' does not exist.
24-Jan-23 15:22:04:ERROR:check.py does not end with CSV.
24-Jan-23 15:22:15:ERROR:File 'lobsters.csv' does not exist.
Perhaps that CRITICAL
should also be an error. I’ll fix that.
Pull the logging code into a “library”
I’ll do that now. Don’t make the mistake of clobbering the names ‘logger’ or ‘logging’, as that will mess up Python something fierce. I’ll call this… lgr
. In the file lgr.py
, I’ll move some things over:
import logging
# Define the custom logger
logger = logging.getLogger(__name__)
# Set up a console and file logger
stream_handler = logging.StreamHandler()
file_handler = logging.FileHandler('check.log', mode='a')
# Send warnings and up to the console; send everything to the file.
stream_handler.setLevel(logging.WARN)
file_handler.setLevel(logging.DEBUG)
# Define our format
format = logging.Formatter('%(asctime)s:%(name)s:%(levelname)s:%(message)s', datefmt='%d-%b-%y %H:%M:%S')
stream_handler.setFormatter(format)
file_handler.setFormatter(format)
# Add the handlers
logger.addHandler(stream_handler)
logger.addHandler(file_handler)
And, in check.py
, I’ll add the line:
from lgr import logger
and remove all of that code. Things are neater now:
import click
from lgr import logger
import pandas
from pathlib import Path
import peewee
import sys
# https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-without-exceptions
def check_file_exists(filename):
file = Path(filename)
return file.is_file()
# https://stackoverflow.com/questions/10873777/in-python-how-can-i-check-if-a-filename-ends-in-html-or-files
def check_filename_ends_with(filename, ext):
return filename.endswith(ext)
@click.command()
@click.argument('filename')
def cli(filename):
does_file_exist = check_file_exists(filename)
if not does_file_exist:
logger.error("File '{}' does not exist.".format(filename))
sys.exit(-1)
does_filename_end_with = check_filename_ends_with(filename, "csv")
if not does_filename_end_with:
logger.error("{} does not end with CSV.".format(filename))
sys.exit(-1)
Checking headers
At this point, I know the file exists, and it seems to be a CSV. (Someone could sneak a non-CSV past us with the extension ‘.csv,’ but… we’ll assume they’re not trying to mess things up.) I can now read it in and check that the headers are right.
I have two files for testing: lobsters1.csv
and lobsters2.csv
.
lobsters1.csv
looks like:
name,age,color
Alice,3,red
Bob,2,red
Clarice,4,blue
Darren,1,yellow
and lobsters2.csv
looks like:
name,years,color
Alice,3,red
Bob,2,red
Clarice,4,blue
Darren,1,yellow
The important thing to note is that we want our headers to be name,age,color
, not name,years,color
. If I add a new check, I can make sure that everything is correct.
In cli()
, I add two new lines:
# Read in the CSV with headers
df = pandas.read_csv(filename, header=0)
# check_headers will throw specific errors for specific mismatches.
check_headers(df)
and the check_headers()
function looks like:
def check_headers(df):
actual_headers = list(df.columns.values)
for expected, actual in zip(EXPECTED_HEADERS, actual_headers):
if not (expected == actual):
logger.error("Expected header '{}', found '{}'".format(expected, actual))
sys.exit(-1)
assuming, of course, that I have defined EXPECTED_HEADERS
at the top of my code as follows:
EXPECTED_HEADERS = ['name', 'age', 'color']
When I run this on lobsters1.csv
, everything is fine. But, when I run it on lobsters2.csv
, I get this:
(venv) jadudm@poke:~/git/python-example$ check lobsters2.csv
24-Jan-23 16:09:09:ERROR:Expected header 'age', found 'years'
Now, if someone tries to feed us a datafile with the wrong data in the wrong places, we have a hope of catching the error.
Checking the age
column is only integers
This one is fun. I have a pandas dataframe in the variable df
, so I’ll pass that to a new check function as well. Because it is going to come in as a bunch of strings (and therefore become a bunch of pandas objects), we will do this by applying a conversion to the column.
In cli
:
all_ages_are_ints = check_ages(df)
if not all_ages_are_ints:
logger.error("One of the values in the column `age` is not an integer.")
sys.exit(-1)
and:
def check_ages(df):
try:
df['age'] = df['age'].astype(int)
except ValueError as ve:
logger.warn("bad value found -- {}".format(ve))
return df['age'].dtypes == 'int'
This attempts the conversion, and if we find something bad, it becomes a warning… but, we do then check whether the function as a whole succeeded. If it didn’t, we throw an error, and exit.
When I create a new file, lobsters3.csv
, and it contains:
name,age,color
Alice,3,red
Bob,2,red
Clarice,4,blue
Darren,one,yellow
my code fails as follows:
(venv) jadudm@poke:~/git/python-example$ check lobsters3.csv
24-Jan-23 16:22:55:WARNING:bad value found -- invalid literal for int() with base 10: 'one'
24-Jan-23 16:22:55:ERROR:One of the values in the column `age` is not an integer.
Checking for valid colors
Our final check: are all of the colors valid options?
This will be a value-by-value check, but a simple one.
In cli
, we call check_colors()
and pass it the dataframe.
def check_colors(df):
for c in df['color']:
if c not in VALID_COLORS:
logger.error("{} is not a valid color.".format(c))
sys.exit(-1)
The function checks every value, and makes sure we have valid colors in every row. It does assume that we have defined:
VALID_COLORS = ['red', 'blue', 'yellow']
With a new file, lobsters4.csv
, that looks like:
name,age,color
Alice,3,red
Bob,2,red
Clarice,4,blue
Darren,1,purple
I get:
(venv) jadudm@poke:~/git/python-example$ check lobsters4.csv
24-Jan-23 16:31:18:ERROR:purple is not a valid color.
Wrapping up input validation
This is just step one.
What I have done is implement the first part of a data processing pipeline. What, though, did I accomplish?
- I structured my code so that each function – checking, processing, etc. – can be broken out into its own command.
- I came up with a reasonable set of validations that will help assure me that my input data is well formed.
- I implemented each of my data validations.
- I have test files that fail each of my validations, and a test file that pass all of my validations.
What I do not yet have are unit tests that automatically run my tests. My next step needs to be to add unit tests, so that these tests run every time I modify my code. (And, ultimately, every time I commit my code to the repository.) At no point do I want to have a regression that allows bad data into my processing pipeline.
Why did I do this?
I want to make sure the input data (provided by a user) is correctly formatted/formed. From this point forward, I’m going to assume I can read in the CSV, that it will be well-formed, and everything is just fine. I will reuse my check code in other modules (because I don’t trust my users), but I also don’t want to be “looking over my shoulder” the whole time I’m developing my code. I need to know, once I’m deep into processing, that everything came in correctly the first time, and from then on, I can (in this case) rely on the contents of the Pandas dataframe as being, to a first approximation, correct.