Prioritizing and Planning within Heroku Postgres

Over a year ago I blogged about Heroku’s approach to Teams and Tools. Since that time Heroku has grown from around 25 people to over 100, and we’ve continued to iterate and find new tools that work for how we do things. Yet for all the product management and software engineering books I’ve read, I’ve yet to find something that helps a team prioritize in a fashion that feels right.

One process emerged nearly a year ago from within the Heroku Postgres team and is now followed by many others. Within a team this process is now commonly conducted every 6 months. Let’s take a look at how this process works.

Fixing Database Connections in Django

If you’re looking to get better performance from your Django apps you can check out Pro Django, PostgreSQL High Performance, or read some of my earlier posts on Postgres performance. All of these are of course good things to do, but you can also start by correcting an incredibly common and painful performance issue that goes unaddressed in Django until 1.6.

Django’s current default behavior is to establish a new connection for each request within a Django application. In many cases, and particularly in distributed cloud environments, this is a large drain on your response time. An example application running on Heroku shows a typical connection time of 70ms. A large part of this is the SSL negotiation that occurs when connecting to your database, which is a good practice to ensure the security of your data. Regardless, that is a long time to spend simply establishing a connection. As a point of comparison, it’s commonly encouraged that most queries to your database run in under 10ms.

An example from a small, lightweight application shows the bulk of the request time being spent within the connection, as displayed by New Relic:

[New Relic chart: connection time dominating the request]
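
The good news is that Django 1.6 addresses this directly with persistent connections: set CONN_MAX_AGE on a database and connections are reused across requests rather than re-negotiated each time. A minimal sketch of the settings change, with the database name as a placeholder:

    # settings.py -- minimal sketch; the database name is a placeholder.
    # With Django 1.6+, CONN_MAX_AGE keeps connections open across requests,
    # avoiding the ~70ms connect/SSL handshake on every request.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'myapp',
            'CONN_MAX_AGE': 600,  # seconds to keep an idle connection around
        }
    }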

Simple database read scaling without sharding in Rails

In an earlier post I provided a high-level overview of sharding. Sharding, while a very solid approach to scaling capacity compared to relying on vertical scaling alone, can also be a time-intensive one. Additionally, in some cases sites may only need extra capacity for a short-lived period of time. Fortunately, there’s a nice middle-ground alternative for scaling capacity that works well in quite a few cases. It even has the benefit that it can potentially be used in place of sharding.

Getting more out of psql (The PostgreSQL CLI)

After my last post I had a variety of readers reach out about the different tweaks they’d made to their workflows using psql. One reader, Grégoire Hubert, had a wonderfully extensive list of items. Grégoire has been a freelance web developer and has worked with PostgreSQL for some time now, in addition to being the author of Pomm. Without further ado, here’s what he has to say about how he uses psql:

Get the most out of psql

psql, the PostgreSQL CLI client, is a powerful tool. Sadly, a lot of developers are not aware of its features and instead look for a GUI to provide what they need. Let’s fly over what psql can do for you.

Introducing django-db-tools

For any successful web application there is likely to come a time when you need to conduct a large migration on the backend. I don’t mean simply adding a column here or an index there, but rather truly sizeable migrations: going from MySQL to Postgres, or migrating from an older version of Postgres, such as a 32-bit instance, to a newer 64-bit instance. In these cases the default approach is to just schedule downtime, often throwing up a splash screen saying so.

For many sites this approach is simply wrong and lazy; with little effort you can improve the experience and thereby ease the burden of conducting these types of migrations. By having the ability to flip your site into a read-only mode, which Simon Willison talked about in his post on Lanyrd, you can continue to operate, just in a limited capacity. Andrew Godwin talks further about some of this in regards to the Lanyrd move, and even includes the script they used to migrate data from MySQL to Postgres. Yet in talking with Simon about this a week ago, it occurred to me that they had not released the code for their read-only mode.

Finally, on to the announcement: today I’m releasing django-db-tools. This is currently a very lightweight utility that allows you to flip your site between two modes.
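
I’ll leave the package’s actual API to its README, but purely as an illustration of the idea, a read-only mode can be as small as an old-style middleware that rejects writes while a flag is set. Everything below is hypothetical, not django-db-tools’ real interface:

    # Hypothetical sketch of a read-only flip -- not django-db-tools' actual API.
    import os
    from django.http import HttpResponse

    class ReadOnlyModeMiddleware(object):
        """Return a 503 for any write request while READ_ONLY is flipped on."""
        def process_request(self, request):
            if os.environ.get('READ_ONLY') == '1' and request.method not in ('GET', 'HEAD'):
                return HttpResponse('Down for maintenance: read-only mode.', status=503)
            return None

Wire something like this into MIDDLEWARE_CLASSES and flip the environment variable when the migration starts; reads keep flowing while writes are politely refused.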

More on Postgres Performance

If you missed my previous post on Understanding Postgres Performance, it’s a great starting point. In this particular post I’m going to dig into some real-life examples of optimizing queries and indexes.

It all starts with stats

I wrote about some of the great new features in Postgres 9.2 in the recent announcement of Postgres 9.2 support on Heroku. One of those awesome features is pg_stat_statements. It’s not commonly known how much information Postgres keeps about your database (beyond the data, of course), but in reality it keeps a great deal: everything from basic stuff like table size, to the cardinality of joins, to the distribution of indexes. And with pg_stat_statements it keeps a normalized record of the queries that are run.
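
To give a flavor of what that record looks like, here’s a small script pulling the ten costliest normalized queries. It assumes the extension is already enabled (shared_preload_libraries plus CREATE EXTENSION pg_stat_statements), and the connection string is a placeholder:

    # Sketch of reading pg_stat_statements from Python; the DSN is a placeholder.
    # Requires the extension to already be enabled on the database.
    import psycopg2

    conn = psycopg2.connect('dbname=myapp')
    cur = conn.cursor()
    # total_time is reported in milliseconds in this era of pg_stat_statements.
    cur.execute("""
        SELECT query, calls, total_time / calls AS avg_ms
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT 10;
    """)
    for query, calls, avg_ms in cur.fetchall():
        print("%d calls, %.1f ms avg: %s" % (calls, avg_ms, query))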

Using Postgres Arrays in Django

A few weeks back I did a brief feature highlight on Postgres arrays. Since that time I’ve found myself using them with increasing regularity on small side projects. Much of that time I’m using Django, and of course I’m not opting to write raw SQL just to be able to use arrays. Django actually makes it quite simple to work with arrays in Postgres via a package by Andrey Antukh. Let’s get started by installing two libraries:
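
As a sketch of where this is headed, here’s what a model can look like with Andrey’s djorm-pgarray package; the import path and field API are assumptions from memory, so verify them against the package’s README:

    # Assumed djorm-pgarray API -- verify against the package docs.
    from django.db import models
    from djorm_pgarray.fields import ArrayField

    class Post(models.Model):  # hypothetical model for illustration
        title = models.CharField(max_length=200)
        tags = ArrayField(dbtype='text')  # backed by a Postgres text[] column

Once that’s in place, usage feels like any other field: Post.objects.create(title='Arrays!', tags=['postgres', 'django']).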

Redis in my Postgres

Yesterday a post hit Hacker News that talked about using SQL to access Mongo. While this is powerful, I think much of the true value was missed within the post.

SQL is an expressive language, though people are often okay with accessing their Mongo data through Mongo’s own interface. The real value is that you could actually query the data from within Postgres and then join across your data stores, without having to run some ETL process to move data around. Think: joining sales data from Postgres with user reviews stored in Mongo, or matching visits to a website (retained in Redis) against purchases by user in Postgres.

The mechanism pointed out was a MongoDB Foreign Data Wrapper. A Foreign Data Wrapper, or FDW, essentially lets you connect to an external datastore from within a Postgres database. In addition to the Mongo FDW released the other day, there are many others. For example, Postgres 9.0 and up ships with a contrib module called dblink, which lets you query and join across two different Postgres databases. Beyond that there’s support for a variety of other data stores, including some you may never have expected:

Let’s look at actually getting the Redis one running, then see what some of its power really looks like. First we have to get the code and build it:
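
The build is the standard PGXS routine (make && make install against your pg_config). Once it’s installed, wiring it up is a handful of SQL statements; here’s a sketch run from Python, with the DSN and Redis address as placeholders and the server options recalled from the redis_fdw README, so treat them as assumptions:

    # Sketch of wiring up redis_fdw from Python; DSN and Redis address are
    # placeholders, and the OPTIONS names are assumptions from the README.
    import psycopg2

    conn = psycopg2.connect('dbname=myapp')
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS redis_fdw;")
    cur.execute("""
        CREATE SERVER redis_server
            FOREIGN DATA WRAPPER redis_fdw
            OPTIONS (address '127.0.0.1', port '6379');
    """)
    cur.execute("CREATE USER MAPPING FOR PUBLIC SERVER redis_server;")
    cur.execute("""
        CREATE FOREIGN TABLE redis_db0 (key text, value text)
            SERVER redis_server
            OPTIONS (database '0');
    """)
    conn.commit()

From there, redis_db0 behaves like any other table, so you can join it straight against your Postgres data, e.g. SELECT u.email, r.value FROM users u JOIN redis_db0 r ON r.key = 'visits:' || u.id.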

Postgres Pooling with Django

A feature that’s glaringly missing within Django, yet common in many other frameworks (including Rails and many Java frameworks), is connection pooling for your database connections. As most Django users are Postgres users, the default answer is to use something like pgPool or PgBouncer. These are tools you can run that will persist connections to your Postgres database, intended to offer:

  • Connection Pooling
  • Replication
  • Load Balancing

It’s of note that PgBouncer is intended very specifically for pooling, while pgPool does much more.
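
A third route is pooling in-process: django-postgrespool swaps Django’s Postgres engine for one backed by SQLAlchemy’s connection pool. A minimal sketch, with the settings names as I remember them from its README, so treat them as assumptions:

    # settings.py -- a sketch assuming django-postgrespool's documented setup.
    # The database name is a placeholder.
    DATABASES = {
        'default': {
            'ENGINE': 'django_postgrespool',  # in place of the stock postgresql_psycopg2 backend
            'NAME': 'myapp',
        }
    }

    # Assumed setting name: sizing handed through to SQLAlchemy's QueuePool.
    DATABASE_POOL_ARGS = {
        'pool_size': 5,       # connections held open
        'max_overflow': 10,   # extra connections allowed under load
        'recycle': 300,       # seconds before a connection is recycled
    }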