tbl: abstractions and imports
There is a debate in the data science community (and, in particular, in the R community) as to whether one should learn libraries or the core language when working with data. For R programmers, it is a question of learning the dplyr family of libraries vs. working directly in the base language without those tools. This is, from what I can gather, a sometimes divisive argument.
As an educator and a developer, I’ve come to appreciate the power of a good abstraction and tools that support that abstraction. I want tools that help me map the way I think about a problem directly into code. Or, I want tools that will shape the way I think about problems, so that I can more concisely express solutions using those tools. Here, “tools” means “libraries” or “programming languages.”
My approach to working on `tbl` has therefore been to think about how to make it easy for beginners to work with interesting data. “Interesting” might mean data that is personally meaningful, and possibly small. “Interesting” might mean data that is large and complex… but important to the developer. This means `tbl` needs to support data both small and big, and it needs to be easy for a developer to get started.
Imports: The `tbl`
I want my beginner to be thinking about tabular data. So, I want a `tbl` to make it easy to turn a spreadsheet into something that they can do meaningful work with. In this way, the first abstraction that a programmer sees with `tbl` is the spreadsheet, and they can map that abstraction directly into the library. A `tbl` is, in the first instance, a spreadsheet.
Here, for example, is the spreadsheet I use to keep track of my pet lobsters.
When I have small but interesting data, it would be nice to have a GUI for entering and manipulating that data, and to be able to pull it quickly into a program I’m writing without jumping through lots of hoops. If I want a good GUI for manipulating tabular data, I should use a spreadsheet! As it happens, not only can I use Google Sheets for this, but Sheets will let me publish my data to the web for embedding, and it also makes it easy for me to pull the CSV directly. But I don’t want a programmer to have to know that there’s a CSV file waiting for them… I just want them to be able to import the data.
Something that might look like this:
```python
import tbl

pets_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv"
pets_tbl = tbl.tbl_from_sheet(pets_url)
```
To test this out, I’ll drop some code in `lobsters.py` and `tbl.py`.
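A minimal sketch of what `tbl.py` might contain at this stage, assuming the Requests library and the standard `csv` module (the actual code will likely grow beyond this):

```python
# tbl.py -- a minimal sketch; names and details are provisional.
import csv
import io

import requests


def tbl_from_sheet(url):
    """Fetch a published Google Sheet as CSV and return an iterable
    of rows, where each row is a list of strings and the first row
    holds the column headers."""
    response = requests.get(url)
    response.raise_for_status()
    return csv.reader(io.StringIO(response.text))
```

With that in place, `lobsters.py` only needs to loop over `tbl.tbl_from_sheet(pets_url)` and print each row.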
When I run `python lobsters.py`, I get the following:
```
(venv) jadudm@lego:~/git/pytbl$ python lobsters.py
['Name', 'Weight', 'Color']
['Bart', '0.75', 'Muddy']
['Flick', '1', 'Muddy']
['Bubbles', '1.2', 'Blue']
['Crabby', '0.5', 'Muddy']
```
Now, this doesn’t get us all the way, but it takes the first step: I’ve created a data table in Google Sheets, and I can pull it in via the Requests library as a CSV document that is parseable and iterable. So far, so good.
Abstractions
The next step is to design the abstraction for a `tbl`.
So that no one post gets too long, that design will be the subject of tomorrow’s explorations. The goal will be to avoid creating an abstraction that is overly leaky, to use Joel Spolsky’s terminology. I’m going to want a way of working with this data that (one possible shape is sketched after the list):
- Can store the data locally or remotely
- Can work with centralized and decentralized data
- Can leverage multiple concrete representations, invisibly
- Can operate on the data in conceptual terms rather than syntactic ones
- Can support programmers at multiple levels of experience and expertise
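To make those requirements a little more concrete, here is one hypothetical shape the interface could take. Every name below is provisional; this is a sketch, not a commitment. The point is that the user-facing `Tbl` never reveals which concrete representation it wraps:

```python
# A hypothetical sketch of a backend-agnostic tbl. The names are
# invented for illustration; nothing here is settled API.
from abc import ABC, abstractmethod


class TblBackend(ABC):
    """One concrete representation: in-memory, SQLite, remote DB, etc."""

    @abstractmethod
    def rows(self):
        """Yield each row as a dict keyed by column name."""


class ListBackend(TblBackend):
    """The simplest backend: a list of dicts held in memory."""

    def __init__(self, rows):
        self._rows = list(rows)

    def rows(self):
        return iter(self._rows)


class Tbl:
    """The user-facing table; it hides the backend entirely."""

    def __init__(self, backend):
        self._backend = backend

    def __iter__(self):
        return self._backend.rows()

    def where(self, predicate):
        """Return a new tbl of the rows satisfying predicate."""
        return Tbl(ListBackend(row for row in self if predicate(row)))
```

Swapping `ListBackend` for a SQLite- or network-backed implementation should require no change to code like `pets.where(lambda row: row["Color"] == "Muddy")`.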
This is a complex set of requirements, and I’ll miss the mark the first time. (This is actually the second time I’ve explored this idea; I’ve already done a deep dive in the programming language Racket, so in truth, I’ve got some ideas in my back pocket.)
For the exercise that this is, I’ll probably do the following:
- Explore SQLite for local and CockroachDB for remote/distributed data
- Use an ORM (SQLAlchemy, Peewee, or similar) to manage those relationships (a sketch follows this list)
- Use R, Python (pandas), Pyret, and other data languages/frameworks as inspiration
- Choose some authentic use-cases to drive the development (e.g. perhaps interface into some of my own research data to drive both the research and the development of `tbl` forward)
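As a taste of the ORM route mentioned above, here is a minimal Peewee sketch that stores the lobster data in SQLite. The schema, field types, and filename are all invented for illustration:

```python
# A minimal Peewee + SQLite sketch; schema invented for illustration.
from peewee import CharField, FloatField, Model, SqliteDatabase

db = SqliteDatabase("pets.db")


class Lobster(Model):
    name = CharField()
    weight = FloatField()
    color = CharField()

    class Meta:
        database = db


db.connect()
db.create_tables([Lobster])

# Insert a row pulled from the sheet...
Lobster.create(name="Bart", weight=0.75, color="Muddy")

# ...and query it conceptually, without hand-written SQL.
for lobster in Lobster.select().where(Lobster.color == "Muddy"):
    print(lobster.name, lobster.weight)
```

Whether something like this can hide cleanly behind the `tbl` abstraction is exactly the question tomorrow’s post will take up.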
Get the code!
It’s early days, but you can get the code. This work will be open (as all of my work is, whenever possible) on GitHub. I’ll call the project pytbl.