python: metaprogramming marshmallow
tl;dr
I used Python’s metaprogramming features to auto-generate Marshmallow schemas that correspond to attrs-derived data classes.
If you like the thought of thinking about metaprogramming as much as I do, you’ll groove on this post.
a theme of metaprogramming…
Oddly, this is related to my explorations of tbl in Python, as well as to my look at GraphQL, but it’s still its own post…
It is hard to extend Python’s syntax, but that doesn’t mean you can’t engage in some dynamic metaprogramming in the language. While it isn’t always the first tool you should reach for, it can be nice for reducing boilerplate.
For example, I am staring down a bunch of JSON-y things. They come-and-go from the front-end to the back-end:
{ email: "[email protected]",
token: "89425abc-69f9-11ea-b973-a244a7b51496" }
Let’s pretend that the front-end is React, the storage layer is MongoDB, and the middleware is Flask (a Python web framework).
At the Flask layer, there’s a lot of work that needs to be done: the JSON comes in, and in the first instance, it comes in as a dictionary. This is not very nice. By “not very nice,” I mean “dictionaries convey no notion of types or the regularity of their contents, and therefore provide us with no notion of safety.” What I’d like is for the data coming from the front-end to be strongly typed and well described, the middleware to be aware of those types, and the database to help enforce them as well. (I’m thinking GraphQL starts to do things like this… almost.)
BUT, we have a RESTful web application sharing data in webby, untyped ways. This inspired me to do some digging. First, I found Flask-RESTful, which is a nice library. It lets you define a class, set up get, put, post, and other methods on endpoints, and register them with the app. Leaving a bunch of bits out, this looks like:
import uuid
from time import time

from flask_restful import Resource, Api
import db.models as M
import db.db as DB

class Tokens(Resource):
    def post(self, email):
        # Create a UUID string
        tok = str(uuid.uuid1())
        # Create a TimedToken object, with a current timestamp
        t = M.TimedToken(email=email, token=tok, created_at=time())
        # Grab the correct collection in Mongo for tokens
        collection = DB.get_collection(M.TimedToken.collection)
        # Save the token into Mongo by dumping the token through marshmallow
        as_json = t.dump()
        collection.insert(as_json)
        # Return the token as JSON to the client
        return as_json

mapping = [
    [Tokens, "/token/<string:email>"]
]

def add_api(api):
    for m in mapping:
        api.add_resource(m[0], m[1])
which is in a module called “API”, and at the top level of the app:
from flask_restful import Api
from flask import Flask
import hydra
import db.models as M
import db.db as DB
from api.api import add_api

app = Flask(__name__)

@hydra.main(config_path="config.yaml")
def init(cfg):
    # Dynamically define classes from the YAML config.
    M.create_classes(cfg)
    # Set the Mongo params from the config.
    DB.set_params(cfg.db.host, cfg.db.port, cfg.db.database)
    # Add the REST API to the app.
    A = Api(app)
    add_api(A)
This is a lot to take in, but I’m actually trying to get to the good bit. The top level has an init function that reads in a configuration file (more on that later), and uses that to build a whole bunch of classes dynamically at run time. (This is the cool bit.) Those are instantiated in the models submodule of db, and they get used throughout the application.
Looking back at the first code block, it’s possible to see some of those uses. For example, I’m creating a timed token (i.e., a random string associated with a user that will ultimately have a finite lifetime).
t = M.TimedToken(email=email, token=tok, created_at=time())
This class takes three parameters: email, token, and created_at. The whole purpose of the class is that I want it to serve as a struct (in Racket or C) or record (in… Pascal?). In Python, namedtuples, dataclasses, and classes decorated with attrs are all examples of what I’m aiming for.
But… BUT… I also want easy marshalling to-and-from JSON. The front-end speaks it, and Mongo speaks it… but, while I’m in the middle, I need to interact with it. I would like it to be typed (in as much as Python is typed) while I am working with it in the middleware. And, I’d rather not do the conversions myself. (Why would I write code if I wanted to do all the hard stuff by hand?)
To solve this, enter marshmallow. This Python library lets you define schemas for classes, and in doing so, leverage machinery to marshal JSON structures to-and-from those classes. For example, my TimedToken class looks like (er, used to look like):
import attr

@attr.s
class TimedToken:
    # email is a string, matching fields.Str() in the schema
    email = attr.ib(type=str)
    token = attr.ib(type=str)
    created_at = attr.ib(type=float)
To marshal this to-and-from JSON, I can use marshmallow. I need to create a schema first:
from marshmallow import Schema, fields

class TimedTokenSchema(Schema):
    email = fields.Str()
    token = fields.Str()
    created_at = fields.Number()
Once I have a schema, I can do things like this:
a_token = TimedToken(...)
schema = TimedTokenSchema()
as_json = schema.dump(a_token)
The machinery inside of marshmallow will take an object of type TimedToken, a schema describing it (TimedTokenSchema), and use the schema to walk through the TimedToken object to convert it to a JSON-ready dictionary (and back, if you want).
This is cool.
But, it’s not automatic. And, for every data structure I want to create in my app, I need to write a schema. This is duplicating code. If I change a structure, I need to remember to change the corresponding schema. That isn’t going to happen. What’s actually going to happen is that I’ll forget something, and everything will break.
enter metaprogramming!
I wanted to be able to declare my data structures as YAML, and then have Python generate both the attrs-based class as well as the marshmallow-based schema. Is that so much to ask? No, I don’t think it is.
Using Facebook’s Hydra, I created a config file. The important bit (for this discussion) looks like this:
models:
  - name: TimedToken
    fields:
      - email
      - token
      - created_at
    types:
      - String
      - UUID
      - Number
Then, the fun bit is the function create_classes. It takes a config that includes the models key, and does the following:
def create_classes(cfg):
    for c in cfg.models:
        make_classes(c.name, c.fields, c.types)
OK… so, make_classes must do the interesting work.
def make_classes(name, fs, ts):
    # Dynamically generate the marshmallow schema
    schema = make_schema(fs, ts)
    # Generate a base class, and wrap it with the attr.s decorator.
    base = attr.s(make_base(name, fs, ts, schema))
    # Insert the class into the namespace.
    globals()[name] = base
This is probably really bad. But, it’s fun, so I’ll keep going.
I pass in the name of the class as a string ("TimedToken"), and then I pass in the fields as a list of strings, and their types as a list of strings. (These are given in the YAML, above.) The last line here is where the evil happens. The function globals() returns the dictionary representing the current namespace. I proceed to overwrite the namespace; specifically, I insert a new class of the name TimedToken (in this example). (I hope the use of globals() is restricted to the module, and not the entire application… I have some more reading/experimenting to do in that regard. It seems like it is the module…)
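On the question of scope: globals() returns the namespace of the module in which the calling function was defined, not an application-wide one. Here is a minimal sketch that checks this with the standard library only. The module names fake_models and other_module and the helper register are made up for the demonstration and are not part of the app.

```python
import types

# Two throwaway modules (hypothetical names) standing in for db.models
# and for some other module in the application.
fake_models = types.ModuleType("fake_models")
other_module = types.ModuleType("other_module")

# Define a helper inside fake_models. Because it is defined there,
# its globals() is fake_models' namespace.
source = (
    "def register(name, cls):\n"
    "    globals()[name] = cls\n"
)
exec(source, fake_models.__dict__)

# Create a class dynamically and register it through the helper.
fake_models.register("TimedToken", type("TimedToken", (), {}))

print(hasattr(fake_models, "TimedToken"))   # the class landed in fake_models
print(hasattr(other_module, "TimedToken"))  # and did not leak elsewhere
```

So a class inserted this way is reachable as an attribute of the module that did the inserting, which is why the handler can later refer to it through the imported module.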
Backing up, I’ll start with make_schema(). It takes the fields and types, and does the following:
def make_schema(fs, ts):
    # Create an empty dictionary
    d = {}
    # Walk the fields and types together (using zip)
    for f, t in zip(fs, ts):
        # Convert each type into the appropriate fields.X from marshmallow
        # and insert it into the dictionary
        d[f] = get_field_type(t)
    # Use marshmallow's functionality to create a schema from a dictionary
    return Schema.from_dict(d)
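One thing to watch with the parallel fields and types lists: zip() stops at the shorter input, so a YAML entry with a missing type would silently drop a field. A small sketch of a guard follows. The helper name pair_fields is hypothetical and not part of the original code.

```python
def pair_fields(fs, ts):
    """Pair field names with type names, refusing mismatched lengths."""
    fs, ts = list(fs), list(ts)
    if len(fs) != len(ts):
        raise ValueError(
            "got {} fields but {} types".format(len(fs), len(ts))
        )
    return list(zip(fs, ts))

# The matching case from the YAML example.
pairs = pair_fields(
    ["email", "token", "created_at"],
    ["String", "UUID", "Number"],
)
print(pairs)
```

A helper like this could replace the bare zip() call, so that a malformed config fails at startup instead of producing a schema with fewer fields than expected.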
get_field_type() is pretty simple:
def get_field_type(t):
    if t == "Integer":
        return fields.Integer()
    if t == "Float":
        return fields.Float()
    if t == "String":
        return fields.String()
    if t == "UUID":
        return fields.UUID()
    if t == "Number":
        return fields.Number()
(No, there’s no error handling yet. Not even a default case… sigh.)
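One way the missing default case might be handled is a lookup table that fails loudly on an unknown name. This is only a sketch: the names FIELD_TYPES and lookup_field_type are hypothetical, and plain Python types stand in for the marshmallow field classes so that it runs without third-party packages.

```python
import uuid

# Stand-in table. In the real module the values would be marshmallow
# field classes; plain Python types are used here as placeholders.
FIELD_TYPES = {
    "Integer": int,
    "Float": float,
    "String": str,
    "UUID": uuid.UUID,
    "Number": float,
}

def lookup_field_type(t):
    """Return the entry for a type name, raising on unknown names."""
    try:
        return FIELD_TYPES[t]
    except KeyError:
        raise ValueError(
            "unknown field type in config: {!r}".format(t)
        ) from None

print(lookup_field_type("String"))
```

With the chain of if statements, a typo in the YAML makes the function return None, and the problem only surfaces later when the schema is built. With a table, the typo is reported at the point of lookup.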
The make_schema function literally returns a class that I can use to convert objects that match the layout of the dictionary that I built. That’s great… but what good is a TimedTokenSchema if I don’t have a TimedToken class in the first place? Hm…
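@attr.s
class Base:
    pass

def make_base(name, fs, ts, schema):
    # Create a new, empty class called `name` that inherits from Base
    cls = type(name, (Base,), {})
    # Attach the schema class, a dump() method, and the Mongo collection name
    setattr(cls, "schema", schema)
    setattr(cls, "dump", lambda self: self.schema().dump(self))
    setattr(cls, "collection", "{}s".format(name.lower()))
    # Add one (currently untyped) attrs attribute per field
    for f, t in zip(fs, ts):
        setattr(cls, f, attr.ib())
    return cls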
The function make_base() does some heavy lifting for me. First, it uses the type() function in Python to dynamically generate a class. In this case, it will create a class with the name TimedToken, it will use Base as a superclass, and it will attach no attributes at time of creation. (I actually do not want to overwrite anything, because attrs does a lot of invisible work.)
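For anyone who has not met the three-argument form of type() before, here is a minimal sketch of what that call does, next to the equivalent hand-written class. The Base class here is a plain stand-in without the attrs decorator, and HandWritten is a name made up for the comparison.

```python
class Base:
    pass

# type(name, bases, namespace) builds a new class at run time.
# An empty namespace dict means no attributes are attached at creation.
TimedToken = type("TimedToken", (Base,), {})

# For comparison, an equivalent class written out by hand.
class HandWritten(Base):
    pass

print(TimedToken.__name__)
print(issubclass(TimedToken, Base))
print(type(TimedToken) is type(HandWritten))
```

The only difference from a class statement is that the class name arrives as a string, which is what makes it possible to read it from a config file.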
The function setattr is, used casually, probably a bad thing. It literally reaches into a class (not an object, but a class) and attaches attributes to the class. If you’re not used to metaprogramming, this is like… writing the code for the class on-the-fly.
I attach three attributes:
schema is a field that will hold a marshmallow Schema class. (Because, in Python, classes are objects too! Wait…) If you look back, you can see that I pass it in after creating it in make_classes().
dump, which is a function of zero arguments. It takes a reference to self (because this class will get instantiated as an object), and it instantiates the schema that I’ve stored, and then invokes dump() on… itself. This feels metacircular, but fortunately marshmallow knows to only look for fields that are in the schema. Therefore, we don’t get an infinite traversal here.
collection, which is so I can map directly into Mongo. I take the name of the class, lowercase it, and add an ‘s’. So, TimedToken becomes timedtokens as a collection name. I like the idea of the object knowing where it should be stored, so I don’t have to think about it.
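The effect of attaching attributes to a class with setattr can be seen in a small standalone sketch. It uses an ordinary hand-written class named Token (hypothetical), and vars(self) stands in for the marshmallow call schema().dump(self), so it shows only the mechanism, not the real serialization.

```python
class Token:
    def __init__(self, email, token):
        self.email = email
        self.token = token

# Attach a class-level attribute: a collection name derived from the
# class name by lowercasing it and adding an "s".
setattr(Token, "collection", "{}s".format(Token.__name__.lower()))

# Attach a method. A function stored on the class receives the instance
# as self when called on an instance. vars(self) is a stand-in for
# schema().dump(self).
setattr(Token, "dump", lambda self: dict(vars(self)))

t = Token("user@example.com", "abc")
print(Token.collection)
print(t.dump())
```

Because the attributes live on the class, every instance sees them, including instances created before the setattr calls ran.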
Once I have these things set up, I walk the fields, and add them to the class. For each, I add a (currently) untyped attr.ib() to the field. This way, the TimedToken class will act like a proper attrs class.
Finally, I return this class, which then gets attached (back in make_classes()) to the globals() namespace.
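To see the whole idea end to end without installing anything, here is a rough standard-library analogue. It swaps in dataclasses.make_dataclass for attrs and dataclasses.asdict for the marshmallow schema, so it is a different technique with the same shape, not the code from this post. The names PY_TYPES and make_record_class are hypothetical.

```python
from dataclasses import asdict, make_dataclass

# Assumed mapping from the config's type names to Python types.
PY_TYPES = {"String": str, "UUID": str, "Number": float}

def make_record_class(name, fs, ts):
    """Build a dataclass from parallel lists of field and type names."""
    spec = [(f, PY_TYPES[t]) for f, t in zip(fs, ts)]
    cls = make_dataclass(name, spec)
    # Same extras as make_base(): a collection name and a dump() method.
    cls.collection = "{}s".format(name.lower())
    cls.dump = lambda self: asdict(self)
    return cls

TimedToken = make_record_class(
    "TimedToken",
    ["email", "token", "created_at"],
    ["String", "UUID", "Number"],
)

t = TimedToken(email="user@example.com", token="abc", created_at=1.5)
print(TimedToken.collection)
print(t.dump())
```

Unlike marshmallow, asdict does no validation or type conversion. It only copies the declared fields into a dictionary, which is why the attached collection and dump attributes do not show up in the output.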
what?
If you like the thought of thinking about metaprogramming as much as I do, you’re excited at this point. If you’re wondering why I would do this… well, I’ll go back to my REST handler for TimedTokens:
import uuid
from time import time

from flask_restful import Resource, Api
import db.models as M
import db.db as DB

class Tokens(Resource):
    def post(self, email):
        # Create a UUID string
        tok = str(uuid.uuid1())
        # Create a TimedToken object, with a current timestamp
        t = M.TimedToken(email=email, token=tok, created_at=time())
        # Grab the correct collection in Mongo for tokens
        collection = DB.get_collection(M.TimedToken.collection)
        # Save the token into Mongo by dumping the token through marshmallow
        as_json = t.dump()
        collection.insert(as_json)
        # Return the token as JSON to the client
        return as_json

mapping = [
    [Tokens, "/token/<string:email>"]
]

def add_api(api):
    for m in mapping:
        api.add_resource(m[0], m[1])
The function create_classes(cfg) is in the db.models module. I import that as M. Because I created classes in this module at the point that Flask was initialized, I now have a whole bunch of dynamically generated classes floating around in there. Those classes were generated from a YAML file, and can be used anywhere in the application.
models:
  - name: TimedToken
    fields:
      - email
      - token
      - created_at
    types:
      - String
      - UUID
      - Number
To add a new class to my application, I add it to the YAML file, and restart Flask. This will call create_classes as part of the init, and the new class will be generated in the db.models module. I can then use those classes just as if I had written them out, by hand, duplicating the effort of defining both the attrs class and the marshmallow Schema class.
In my REST handler, this is where the metaprogramming comes into play:
# Create a TimedToken object, with a current timestamp
t = M.TimedToken(email=email, token=tok, created_at=time())
# Grab the correct collection in Mongo for tokens
collection = DB.get_collection(M.TimedToken.collection)
# Save the token into Mongo by dumping the token through marshmallow
as_json = t.dump()
collection.insert(as_json)
# Return the token as JSON to the client
return as_json
I create the object. Then, I use the collection attribute to ask for a database connection to the collection that holds objects of this type (this is like a table in relational databases). Next, I convert the object to JSON by invoking the .dump() method, which was added dynamically. In fact, it is using a Schema class that was created dynamically as well, and then embedded in the enclosing object for later use. Finally, I insert this JSON into the Mongo database, and return it to the client, because both Mongo and the client speak JSON natively.
The result is that I’ve metaprogrammed my way around attrs and marshmallow to create a dynamic middleware layer that can marshal to-and-from JSON. In doing this, I’ve saved myself a large amount of boilerplate, and I have a single point of control/failure for all of my class definitions, which is external to the code itself. (I think I still need to add the marshalling from JSON, but that won’t be hard.)
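The missing direction, from JSON back to an object, could be attached the same way as dump. The sketch below shows only the mechanism: it uses a standard-library dataclass in place of the attrs class and json.loads with no validation in place of the schema, and the names _load and load are hypothetical. In the real version the parsing and validation would presumably go through the stored marshmallow schema instead.

```python
import json
from dataclasses import make_dataclass

# Stand-in for the generated class.
TimedToken = make_dataclass(
    "TimedToken",
    [("email", str), ("token", str), ("created_at", float)],
)

def _load(cls, text):
    """Parse a JSON string and build an instance of cls from it."""
    data = json.loads(text)
    return cls(**data)

# classmethod() makes the function receive the class, not an instance.
setattr(TimedToken, "load", classmethod(_load))

t = TimedToken.load(
    '{"email": "user@example.com", "token": "abc", "created_at": 1.5}'
)
print(t)
```

A classmethod fits here because loading has to work before any instance exists.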
what will you do with this, matt?
Personally, I haven’t found anything on the net that eliminates the boilerplate in marshmallow. In the world of open source, I’d say this is an “itch” that I scratched. It might be an itch other people have.
Perhaps my next post will be about packaging code for pip?