I’ve been working on Jam Buds, a social network for sharing music you like, on and off for the past few years. Jam Buds’s main feature is a feed of new music your friends have posted. The core idea is that (eventually) your feed will act as a social “playlist,” which you can listen back to at your leisure, more easily than tracking down songs your friends have posted on Twitter or Facebook.
One of the expected use patterns of Jam Buds is that when hot new releases come out, various users will immediately post them to their Jam Buds feed. When this happens at scale, it means that users may find their feed flooded with users posting the same songs.
This has a few problems:
To solve this, Jam Buds prevents the same song from appearing more than once in a user’s feed.
When a user posts a song to Jam Buds, we create a row in the
posts table mapping their
user_id to the
song_id. A song has its own normalized entry in a
songs table. Originally, the feed was simply iterating over all of the posts the current user’s followed users had made.
Now, to solve duplicate songs, we aggregate posts that are about the same song in a user’s feed. We handle this the same way Twitter handles retweets:
Doing this as a query is kind of tricky, but seems to work:
-- Note that this query, like the other examples in this post, are -- simplified to be easier to read. -- -- In reality, these queries filter the results to only show songs that -- have been posted by either the current user or the current user's -- followed users. SELECT songs.*, MIN(posts.created_at) as earliest_posted, array_agg(users.name) as user_names FROM songs JOIN posts ON songs.id=posts.song_id JOIN users ON users.id=posts.user_id GROUP BY songs.id ORDER BY earliest_posted DESC;
A few frontend tweaks (and a whole lot of API refactoring) later, and we’ve got our desired UX in the feed:
One tricky problem with feed aggregation is handling a user’s own posts. Basically, if three of my friends have posted a song, and then I post a song, what happens in my feed?
Without any special cases, there’s a couple problems that could arise:
The special case to solve these problems is that, rather than just getting the earliest posted time of a song as shown above, we instead prioritize the user’s own post of a song if it exists. This has several good outcomes:
To handle this, we use a neat Postgres function called
COALESCE(). This function takes multiple values, and returns the first non-null value from left to right. Thus, our query becomes more complicated:
SELECT songs.*, COALESCE( ( SELECT posts.created_at FROM posts WHERE user_id=? AND song_id=songs.id ), MIN(posts.created_at) ) as feed_timestamp, array_agg(users.name) as user_names FROM songs JOIN posts ON songs.id=posts.song_id JOIN users ON users.id=posts.user_id GROUP BY songs.id ORDER BY feed_timestamp DESC;
feed_timestamp will now either be the time you made a post, or the earliest time someone you followed posted it.
Now, I don’t expect too much from the performance of this query - from little what I know, the performance of a subquery like this can be kind of nasty, since you’re relying on the Postgres query planner to turn “iterate over all the things again” into something more performant.
There might be better ways to write this out, but in practice, the query is already even more complex than is here (as a feed has to be filtered to a user’s followed users, and a few extra fields are added), so I feel that this is an okay balance of query readability and performance.
One fun problem I discovered late in this implementation was pagination. Jam Buds has a UI much like Twitter, where when you reach the bottom of a page of posts, you can click a button and load the next page. To get the “next page” (older entries), you just needed to query posts earlier than the oldest item in the previous page:
SELECT songs.*, post_id FROM posts JOIN songs ON songs.id=posts.song_id WHERE posts.created_at < ? ORDER BY posts.created_at DESC LIMIT 20;
I ran into a snag implementing this in the new aggregated feed query, though. Given the query we wound up with in the last section, you’d think you could just add:
-- ... WHERE feed_timestamp < ? GROUP BY songs.id ORDER BY feed_timestamp DESC LIMIT 20;
However, if you try to do this, you’d run into a fun error from Postgres:
ERROR: column "feed_timestamp" does not exist
This is because, according to the Postgres documentation, you cannot use an output column’s name in the
HAVING clauses of a query.
Sadly, as far as I can tell, the only way to solve this without a large-scale query rewrite is to duplicate the query in the
-- ... WHERE COALESCE( ( SELECT posts.created_at FROM posts WHERE user_id=? AND song_id=songs.id ), MIN(posts.created_at) ) < ? GROUP BY songs.id ORDER BY feed_timestamp DESC LIMIT 20;
One last note is that even that query wouldn’t work, and will result in another Postgres error:
ERROR: aggregate functions are not allowed in WHERE
But it turns out the fix for that is as simple as using a
HAVING clause instead.
HAVING is literally just a
WHERE clause that can handle grouping and aggregate functions, as described by the docs. It goes after
GROUP BY in the statement:
-- ... GROUP BY songs.id HAVING COALESCE( ( SELECT posts.created_at FROM posts WHERE user_id=? AND song_id=songs.id ), MIN(posts.created_at) ) < ? ORDER BY feed_timestamp DESC LIMIT 20;
This duplication makes the final query somewhat more unwieldy:
SELECT songs.*, COALESCE( ( SELECT posts.created_at FROM posts WHERE user_id=? AND song_id=songs.id ), MIN(posts.created_at) ) as feed_timestamp, array_agg(users.name) as user_names FROM songs JOIN posts ON songs.id=posts.song_id JOIN users ON users.id=posts.user_id GROUP BY songs.id HAVING COALESCE( ( SELECT posts.created_at FROM posts WHERE user_id=? AND song_id=songs.id ), MIN(posts.created_at) ) < ? ORDER BY feed_timestamp DESC LIMIT 20;
Of course, because I’m writing TypeScript and not raw SQL queries, it’s fairly easily to just use the
COALESCE() call as a string partial and interpolate it into multiple places in the query, but it still feels awkward. I suspect there is some join-wizardry-magic that you could use to avoid the duplication, but I feel like this works well enough for now.
There are a couple follow-up tasks to consider here:
Caching. As I said, I expect the performance of this to be pretty gnarly, and some sort of caching would be great. I started to write out the specifics of what this would look like, and it got very difficult to reason about fast, so I think I’m going to punt on it. The hard part is figuring out what updates look like in a cached world - it’s basically a cache-busting problem, as most things tend to be.
Time-boxed aggregation. I think there should be some time period where songs are allowed to be duplicated in the feed - basically, after six months since the last time a song was posted, a new entry should be created. I believe Twitter does this (or has done this) with retweets, for example. This seems like a really hard query to write, though, and won’t be an issue for Jam Buds until it’s been running for a long time.
I’m hopeful these naive SQL queries will carry me to at least having a few hundred active users on Jam Buds, and some basic caching will carry me to a few thousand, and.. well, if I somehow had more than that, that’s a good problem to have, if a harder engineering one.