Tools for tiny teams

stats and snakeoil

2021-05-28

bootstrap, dbt, flask, postgres, sql, tools

Sometimes, tech teams are tiny. This seems to happen a lot in sports, with one or two people being expected to cover a lot of different functions from data analysis and reporting to data engineering and web development.

As a result, tiny teams often need to be very productive, returning great slabs of business value in order to stick around (or gain further buy-in and influence). There’s been plenty of discussion of the strategies people can use to provide value and win trust (see any one of the guest talks from the Opta Pro Forums, for example), and I don’t have much to add on that front. Instead, I’d like to talk about some of the tools that I’ve found useful when trying to cover a lot of ground quickly.

In general, a lot of the ideas here are covered more succinctly and eloquently by Dan McKinley in his talk/essay “Choose Boring Technology”. Hopefully, I can add some value by talking about some specific technologies and choices I’ve made.

Databases & SQL

I have found that having a database, even as a tiny team, is essential since it allows you to move from ad-hoc, manual processes to automated ones. This is important if you want to keep delivering new projects, rather than getting bogged down in manual and maintenance work.^[1]

And you can’t talk about databases without talking about SQL. SQL might be the most Boring Technology around. It’s everywhere, declarative code is really good, and when you use it, you’re standing on a giant heap of decades of accumulated knowledge. As a result, investing the time in learning to query and manage data with SQL can pay off massively.

In particular, I’ve gravitated towards two bits of software for managing analytics databases: Postgres and dbt.

Postgres

I’ve been using Postgres as my default choice for a few years now, and I haven’t ever felt the need to switch.

Here are some of the things I’ve been particularly pleased with:

Inlined CTEs (requires Postgres 12 or later) allow you to prioritise readability without paying a performance penalty
Good extensions
- PostGIS for spatial queries (e.g. on x,y event data) is very cool and always leaves me looking for more excuses to use it
- As an English-speaker, unaccent is a life-saver for searching accented player and team names
The official documentation is really good. It’s very rare that I’ve had to resort to googling or Stack Overflow instead of just navigating to the appropriate section of the docs.
It’s supported by dbt (see next point)
Easy deployment options - more or less every cloud provider has a managed Postgres offering.^[2]
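
To illustrate the point about CTEs from the list above, here is a minimal sketch comparing a CTE with the equivalent nested subquery. The `shots` table and its columns are hypothetical, and an in-memory SQLite database stands in for Postgres so that the example runs without a server. The inlining behaviour itself is specific to Postgres: from version 12, a side-effect-free CTE that is referenced once can be folded into the parent query, much like the subquery form.

```python
import sqlite3

# In-memory SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shots (player TEXT, xg REAL, is_goal INTEGER);
    INSERT INTO shots VALUES
        ('Player A', 0.50, 1),
        ('Player A', 0.25, 0),
        ('Player B', 0.75, 1),
        ('Player B', 0.25, 1);
""")

# CTE version: the intermediate step has a name and reads top to bottom.
cte_query = """
    WITH player_totals AS (
        SELECT player, SUM(xg) AS total_xg, SUM(is_goal) AS goals
        FROM shots
        GROUP BY player
    )
    SELECT player, goals - total_xg AS goals_above_expected
    FROM player_totals
    ORDER BY player
"""

# Nested-subquery version: same result, but the logic reads inside out.
subquery_query = """
    SELECT player, goals - total_xg AS goals_above_expected
    FROM (
        SELECT player, SUM(xg) AS total_xg, SUM(is_goal) AS goals
        FROM shots
        GROUP BY player
    ) AS player_totals
    ORDER BY player
"""

cte_rows = conn.execute(cte_query).fetchall()
subquery_rows = conn.execute(subquery_query).fetchall()
print(cte_rows == subquery_rows)
```

Both forms return the same rows; the CTE form simply keeps each step named as the query grows.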

I would consider moving to something designed specifically for data-warehousing (e.g. Snowflake) as part of a merely small team, but the advantages of Postgres make it more attractive to me in a tech team of 1-2.^[3]

dbt

The dbt docs are great, and state that:

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.
dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. [See this article for a breakdown of the difference between ETL and ELT.]
The role of dbt within a modern data stack is discussed in more detail here.
dbt also enables analysts to work more like software engineers, in line with the dbt Viewpoint.

In short, dbt enables you to manage your data using only SQL select statements with jinja templating.

I recommend investigating the links above for more detail and examples, but here are some of the reasons I love using it:

dbt…

Leverages extremely common tech in the form of SQL, so it’s relatively easy to debug, get advice and onboard new people onto a project compared to an equivalent ETL pipeline
Makes it easy to test your data, giving you confidence that your calculations are correct
Has an active and helpful community - if you’re stuck on something, it’s likely that someone else has an answer
Works with Postgres (but can be ported without too much hassle to a cloud data platform if you outgrow Postgres)
Uses Jinja, which has nice synergy with Flask, where it’s used for HTML templating. A small win, admittedly, but nice nonetheless
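
The data-testing point above rests on a simple idea: a test is a select statement that returns the rows violating an assumption, and it passes when no rows come back. The following is a rough sketch of that idea only, not dbt itself. The `players` table, the check names, and the `run_checks` helper are all hypothetical, and SQLite stands in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (player_id INTEGER, name TEXT);
    INSERT INTO players VALUES
        (1, 'Player A'),
        (2, 'Player B'),
        (2, 'Player B'),
        (3, NULL);
""")

# Each check is a select statement that returns the offending rows.
CHECKS = {
    "player_id_is_unique": """
        SELECT player_id, COUNT(*) AS n
        FROM players
        GROUP BY player_id
        HAVING COUNT(*) > 1
    """,
    "name_is_not_null": "SELECT player_id FROM players WHERE name IS NULL",
}


def run_checks(connection, checks):
    """Return {check name: failing rows}; an empty list means the check passed."""
    return {name: connection.execute(sql).fetchall() for name, sql in checks.items()}


results = run_checks(conn, CHECKS)
failed = sorted(name for name, rows in results.items() if rows)
print(failed)
```

Because the checks are plain SQL, the failing rows can be inspected with exactly the same tools used for any other query.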

But perhaps the most useful thing of all is how it flattens what I’ll call “the gradient of automation”^[4]. If you use SQL queries for all of…

One-off queries and ad-hoc analyses
Automated scripts, reports, etc
Data transformations (ELT)

… it makes it very easy to move from a query “up” a level of automation. A query used to validate a specific question as a one-off might prove useful enough to add to a weekly report. The information might then prove to be worth staging as a pre-calculated table so that it can be used in other analyses, or joined to other data sources for more insights.
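
As a sketch of that progression, the example below reuses one select statement at all three levels: as a one-off query, inside a report function, and staged as a view that other queries can build on. The `matches` table, the `weekly_report` function, and the view name are hypothetical. SQLite is used so the example is self-contained; in a dbt project the final step would be a model file rather than a hand-written `CREATE VIEW`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (team TEXT, goals_for INTEGER, goals_against INTEGER);
    INSERT INTO matches VALUES
        ('Team A', 2, 0),
        ('Team A', 1, 1),
        ('Team B', 0, 3),
        ('Team B', 2, 1);
""")

# Level 1: a one-off query written to answer a specific question.
GOAL_DIFFERENCE_SQL = """
    SELECT team, SUM(goals_for - goals_against) AS goal_difference
    FROM matches
    GROUP BY team
    ORDER BY team
"""
adhoc_rows = conn.execute(GOAL_DIFFERENCE_SQL).fetchall()


# Level 2: the same query, unchanged, feeding an automated report.
def weekly_report(connection):
    rows = connection.execute(GOAL_DIFFERENCE_SQL)
    return "\n".join(f"{team}: {gd:+d}" for team, gd in rows)


# Level 3: the same query staged as a view for use in other analyses.
conn.execute(f"CREATE VIEW team_goal_difference AS {GOAL_DIFFERENCE_SQL}")
staged_rows = conn.execute(
    "SELECT team, goal_difference FROM team_goal_difference WHERE goal_difference > 0"
).fetchall()

print(weekly_report(conn))
```

The query text never changes between levels, which is what keeps each step up the gradient cheap.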

Like Postgres, dbt is open-source, but there is a hosted service that you can use to manage your dbt deployment (scheduling and monitoring jobs, etc).

Overall, I have found dbt to be an incredibly productive tool that Just Works, and I am very happy using it.

Records/PugSQL

Remember how I said I liked using SQL for everything? That extends to using SQL for querying within apps, reports, and so on.

There’s a family of libraries out there that enable you to expose SQL queries as functions within your programming language of choice. In Python, there’s PugSQL and the more minimalist Records.

I appreciate it may not be everyone’s cup of tea, but maximising the use of SQL reduces the number of different things I need to know while being productive.
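
To show the general idea of exposing SQL queries as functions, here is a hand-rolled sketch using only the standard library. It is not the API of PugSQL or Records. The `Queries` class, the `scorers` table, and the query names are hypothetical, and the `-- :name` header is only loosely modelled on the comment convention PugSQL uses to name queries in its SQL files.

```python
import re
import sqlite3

# Hypothetical contents of a .sql file: each query is introduced by a name header.
QUERIES_SQL = """
-- :name top_scorers
SELECT player, goals FROM scorers WHERE goals >= :min_goals ORDER BY goals DESC;

-- :name goals_for_player
SELECT goals FROM scorers WHERE player = :player;
"""


class Queries:
    """Expose each named SQL statement as a callable taking keyword parameters."""

    def __init__(self, connection, sql_text):
        # Splitting on the header yields [preamble, name1, body1, name2, body2, ...].
        parts = re.split(r"^-- :name (\w+)\s*$", sql_text, flags=re.MULTILINE)
        for name, body in zip(parts[1::2], parts[2::2]):
            setattr(self, name, self._make_query(connection, body.strip()))

    @staticmethod
    def _make_query(connection, sql):
        def run(**params):
            # Named placeholders (:min_goals) are bound from the keyword arguments.
            return connection.execute(sql, params).fetchall()

        return run


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scorers (player TEXT, goals INTEGER);
    INSERT INTO scorers VALUES ('Player A', 5), ('Player B', 2), ('Player C', 1);
""")

queries = Queries(conn, QUERIES_SQL)
top = queries.top_scorers(min_goals=2)
player_c = queries.goals_for_player(player="Player C")
print(top, player_c)
```

The application code only sees function calls, while the queries themselves stay in plain SQL that can be copied straight into a database console.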

Learning SQL

If you’re looking for a place to start learning SQL, Execute Program has a course (first 16 lessons are free, after that there’s a paid subscription) which incorporates spaced repetition. I haven’t used it (yet!), but other people seem to like it.

I’ve also developed a small project for creating a database of Understat data, which might be worth checking out if you want to get a bit more experience querying and working with relational databases. This project uses both Postgres and dbt, and is designed to be altered, extended, and remixed.

Web applications

Being able to deploy to the web is extremely useful. Having a platform to expose reports, models, or input data, even if it’s just for your own use, can go a really long way.^[5]

However, I think web development can be a bit of a trap for non-experts (like me). For this reason, I tend to be extremely JavaScript-averse and choose the least exciting options when developing anything web-related^[6]. While my web apps will never have the sickest UIs, I can keep the maintenance cost manageable, and get more work done overall.

Flask/Django

As part of my Boring Web Apps, I tend to develop server-side rendered applications. In Python, the ones I have most familiarity with are Flask and Django.

Both are extremely widely used, which means there’s lots of support and documentation available online. Both use HTML templating, which is another old/Boring technology. Having decent knowledge of HTML can also come in handy with any web-scraping projects.

There are also many managed hosting options for app deployment that I have found very useful.
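
For a sense of what a server-side rendered page looks like in Flask, here is a minimal sketch. The route, the `FIXTURES` data, and the template are hypothetical; in a real app the data would more likely come from a SQL query and the template would live in its own file. It assumes Flask is installed, and it uses Flask's test client so the page can be rendered without starting a server.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Hypothetical data; a real app would typically load this from the database.
FIXTURES = [
    {"home": "Team A", "away": "Team B"},
    {"home": "Team C", "away": "Team D"},
]

# A Jinja template rendered on the server: the browser only ever receives HTML.
PAGE = """
<!doctype html>
<title>Fixtures</title>
<ul>
{% for fixture in fixtures %}
  <li>{{ fixture.home }} v {{ fixture.away }}</li>
{% endfor %}
</ul>
"""


@app.route("/")
def fixtures():
    return render_template_string(PAGE, fixtures=FIXTURES)


# Exercise the route with the built-in test client instead of running a server.
with app.test_client() as client:
    response = client.get("/")
    html = response.get_data(as_text=True)

print(html)
```

There is no JavaScript involved at any point: the whole page is assembled in Python and sent as plain HTML.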

Bootstrap

Bootstrap is “the world’s most popular framework for building responsive, mobile-first sites”.

I might not mind a website looking like it’s from 2010, but I would at least like it to look like a nice website from 2010. That’s what I use Bootstrap for.

While I’m sure there are myriad other options available that fill the same niche, Bootstrap is the one I’ve been using for years, and I haven’t felt a burning need to switch. If you think I’m really missing out on something, let me know!

General themes

I think there are a few themes running through the decisions I’ve made on these types of teams:

As part of a tiny tech team, time is likely to be your biggest constraint
- Paying attention to the time required to maintain your work is essential to remaining productive beyond the short-term
Embrace frameworks that give a high reward for the time and effort invested
- Tools that are good enough (even if not perfect) can allow you to spend more time on your core work
- Using popular frameworks can also make it easier to hand-off to and/or integrate new team members
Invest in core skills/tools that can be applied widely and robustly
- Established tech is more likely to be well tested, and it is easier to find learning resources, documentation, and help online ~if~ when something goes wrong
- Being comfortable with boring, ubiquitous tech like relational databases and simple web-applications allows you to have a big impact across a range of domains
- Or, to steal from Dan McKinley, choose boring technology. There’s so much good stuff in this talk/essay.
Learn to love problems, not solutions
- Focusing on problems to solve (as opposed to a specific solution) enables you to iterate effectively and not get stuck on getting an elegant idea to work, or endlessly refining a product that’s good enough for now

I think these things amount to okay advice. But I am still early on in my career, and the trade-offs for small, medium and large teams can be quite different. In an industry where lots of people are trying to get hired, we don’t often talk about the boring stuff. I think that’s a shame, because the boring stuff is often what makes you productive.

As the industry matures, and more organisations build enough trust to invest in full tech and data teams, I think we’ll see these kinds of “tech+data generalist” roles start to die out. However, for the time being, these roles exist and these are some of the bits and pieces I’ve been using to get (some) things done.

If you think I’m missing out on something great that you’ve been using, or if your experiences in a similar role have been different, you can reach me on Twitter - I’d love to hear from you!

[1] Knowing when not to bother automating something has been a pretty useful skill as well, to be fair.
[2] It can be cheaper to manage your own database, but I have generally opted for a managed solution. I’ve been really grateful for the ease of adding monitoring, alerting, backups, replicas etc and so I’m happy to pay a bit more for the privilege in a tiny-team environment.
[3] See some discussion on Postgres for data warehousing on SE Daily and narrator.ai.
[4] I promise I tried to think of a term for this. But I couldn’t do it - I’m sorry.
[5] If I were building a new tech department in a sports team, a web developer would probably be one of my first hires.
[6] In cases where interactivity has been really important, I have had great experiences with Elm. I’ve evangelised about Elm to plenty of people in-person and heartily recommend anyone reading to try it out. However, for truly tiny teams, I have generally opted for server-side rendering ahead of an Elm SPA (single-page app).