Using array_agg in Postgres – powerful and flexible

2013-04-17 [Last modified: 2024-10-21]

/2013/04/17/Using-array_agg-in-Postgres-powerful-and-flexible/ map[name:Craig Kerstiens]

Postgres

In almost any application it’s common to want to aggregate some set of values together, commonly in a comma separated form. Most developers do this by running a query to get much of the raw data, looping over the data and pushing it into a set, appending each new value to the appropriate key. Hopefully, it’s not a surprise that there’s a much better way to do this with PostgreSQL.

Postgres has a flexible and robust array datatype that comes with a variety of functions. Even without taking advantage of the array datatype in your application, you can still take advantage of some of the functions to get the functionality you need. Lets take a look at an example schema and use case.

An example

Given a project management application, you may have users who have projects that have tasks. An example piece of functionality might be to send an email with a list of all projects that have tasks that are past their due dates of completion. Your schema might look something like this:

 # \d users
             Table "public.users"
    Column   |            Type             | Modifiers
 ------------+-----------------------------+-----------
  id         | integer                     | not null
  email      | character varying(255)      |
  ...

# \d projects
             Table "public.projects"
    Column   |            Type             | Modifiers
 ------------+-----------------------------+-----------
  id         | integer                     | not null
  user_id    | integer                     | not null
  name       | character varying(255)      | not null
  ...

# \d tasks
             Table "public.tasks"
    Column     |            Type             | Modifiers
 --------------+-----------------------------+-----------
  id           | integer                     | not null
  project_id   | integer                     | not null
  completed_at | timestamp without time zone | 
  due_at       | timestamp without time zone | 
  ...

To get a list of all projects that have tasks that haven’t been completed, you would start with something like:

SELECT 
  projects.name
FROM
  projects,
  tasks
WHERE projects.id = tasks.project_id
  AND tasks.due_at > tasks.completed_at
  AND tasks.due_at > now()

This would give you a list of projects which you could then easily join this with users:

SELECT 
  users.email
  projects.name
FROM
  projects,
  tasks,
  users
WHERE projects.id = tasks.project_id
  AND tasks.due_at > tasks.completed_at
  AND tasks.due_at > now()
  AND users.id = projects.user_id

At this point you’ve got everything you need to pull this up into Ruby, Python, or other language of your choice and then build the full set. However if this is thousands or even hundreds of results you’ll be spending more time than necessary, grouping this data for a sensible email. With 3 other small changes you can have this already formatted for you to immediately send of in an email. The first is using a handy function called array_agg which will aggregate items and then you can format them how you wish. The second is just ensuring you’re grouping correctly. Finally you’ll want to unnest the array so it formats the data in a clean way for you.

Looking at it all put together:

SELECT 
  users.email,
  array_to_string(array_agg(projects.name), ',')) as projects
FROM
  projects,
  tasks,
  users
WHERE projects.id = tasks.project_id
  AND tasks.due_at > tasks.completed_at
  AND tasks.due_at > now()
  AND users.id = projects.user_id
GROUP BY 
  users.email

This would give you a nice clean result of projects that have overdue tasks that you could then send to the user in an email:

           email            |     projects       
 ---------------------------+-------------------
 craig.kerstiens@gmail.com  | blog, timetracker      
 craig@heroku.com           | foo, bar, baz