tbl: abstractions and imports

There is a debate in the data science community (and, in particular, in the R community) as to whether one should learn libraries or a core language when working with data. For R programmers, it is a question of learning the dyplr-family of libraries vs. working directly in the language without those tools. This is, from what I can gather, a sometimes divisive argument.

As an educator and a developer, I’ve come to appreciate the power of a good abstraction and tools that support that abstraction. I want tools that help me map the way I think about a problem directly into code. Or, I want tools that will shape the way I think about problems, so that I can more concisely express solutions using those tools. Here, “tools” means “libraries” or “programming languages.”

My approach to working on tbl has therefore been to think about how to make it easy for beginners to work with interesting data. “Interesting” might mean it is personally meaningful, and possibly a small amount of data. “Interesting” might mean it is large and complex data… but, important to the developer. This means tbl needs to support data that is both small and big, and it needs to be easy for a developer to get started.

Imports: The `tbl`

I want my beginner to be thinking about tabular data. So, I want a tbl to make it easy to turn a spreadsheet into something that they can do meaningful work with. In this way, the first abstraction that a programmer sees with tbl is the spreadsheet, and they can map that abstraction directly into the library. A tbl is, in the first instance, a spreadsheet.

Blue lobster photo by David Clode on Unsplash. Here, for example, is the spreadsheet I use to keep track of my pet lobsters.

In the case that I have small, but interesting data, it would be nice if I could have a GUI for manipulating/entering that data, and could quickly pull it into a program that I’m writing without having to go through lots of hoops. If I want a good GUI for manipulating tabular data, I should use a spreadsheet! As it happens, not only can I use Google Sheets for this, but Sheets will let me publish my data to the web for embedding, and Sheets also makes it easy for me to pull the CSV directly. But, I don’t want a programmer to know that there’s a CSV file waiting for them… I just want them to be able to import the data.

Something that might look like this:

import tbl

pets_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv"
pets_tbl = tbl_from_sheet(pets_url)

To test this out, I’ll drop some code in lobsters.py and tbl.py.

When I run

python lobsters.py

I get the following:

(venv) jadudm@lego:~/git/pytbl$ python lobsters.py 
['Name', 'Weight', 'Color']
['Bart', '0.75', 'Muddy']
['Flick', '1', 'Muddy']
['Bubbles', '1.2', 'Blue']
['Crabby', '0.5', 'Muddy']

Now, this doesn’t get us all the way, but it takes the first step: I’ve created a data table in Google Sheets, and I can pull it in via the Requests library as a CSV document that is parseable and iterable. So far, so good.

Abstractions

The next step is to design the abstraction for a tbl.

So that no one post gets too long, this will be the subject of tomorrow’s explorations. The goal here will be to avoid creating an abstraction that is overly leaky, to use Joel Spolsky’s terminology. I’m going to want a way to work with this data that:

Can store the data locally or remotely
Can work with centralized and decentralized data
Can leverage multiple concrete representations, invisibly
Can operate on the data closer conceptually rather than syntactically
Can support programmers at multiple levels of experience and expertise

These are going to be a complex set of requirements, and I’ll miss the first time. (This is actually the second time I’ve explored this idea; I’ve already done a deep dive in the programming language Racket, so in truth, I’ve got some ideas in my back pocket.)

For the exercise that this is, I’ll probably do the following:

Explore SQLite for local and CockroachDB for remote/distributed data
Use an ORM (SQLalchemy, Peewee, or similar) to manage those relationships
Use R, Python (pandas), Pyret, and other data languages/frameworks as inspiration
Choose some authentic use-cases to drive the development (e.g. perhaps interface into some of my own research data to drive both the research and the development of tbl forward)

Get the code!

It’s early days, but you can get the code. This work will be open (as all of my work is, whenever possible), at Github. I’ll call the project pytbl.

2020 03 10	tbl: a slice of cake	code
2020 03 09	tbl: testing round one	code
2020 03 09	tbl: structuring the project	code
2020 03 08	tbl: abstractions and imports	code
2020 03 07	tbl: thinking about data	code

Matt Jadud
cv | linkedin | github | bitbucket

tbl: abstractions and imports

Imports: The `tbl`

Abstractions

Get the code!

posts in the series

Matt Jadud cv | linkedin | github | bitbucket

tbl: abstractions and imports

Imports: The tbl

Abstractions

Get the code!

posts in the series

Matt Jadud
cv | linkedin | github | bitbucket

Imports: The `tbl`