The Pipes That Make Up Spotify's 'Circulatory System,' Explained

"It’s rare that we get to talk about the exciting world of technical infrastructure -- the real power behind the music -- but today is special," wrote Nicholas Harteau, Spotify's vp of engineering and infrastructure, in a blog post yesterday announcing the company's move from a self-maintained backend to a Google-powered version of the same thing. While Harteau may be optimistic in his deployment of the word 'exciting,' the technological nervous system and spine of a global company like Spotify can't help but at least be interesting.

To test that theory, Billboard spoke with Harteau about the anatomy of what he calls Spotify's 'circulatory system.'

Can you give me a bird's-eye view of what this is?

We have a million bazillion users -- we're at a paid billion now [laughs]. Whenever anybody plays music or interacts with the service, we are collecting and processing and trying to turn those actions into useful insights. That could be insight into what music is popular right now, where to spend marketing dollars, answering questions like 'Is this week's Discover Weekly playlist performing better than last week's because of an algorithm tweak? Do people like this green versus the other green? What do we need to do to get more users engaged in Ecuador?'

We have a pretty big Rube Goldberg machine which turns those activities into insight. A user presses play, and that generates an event which winds its way through backend systems, where it is filtered, aggregated and compared against a schema. As Spotify has grown, the scale of this operation has grown quite a bit.
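[A rough illustration of the three steps Harteau names: checking an event against a schema, filtering and aggregating. This is only a sketch and not Spotify's code. The field names, the schema and the 30-second minimum play length are all assumptions made up for the example.]

```python
from collections import Counter

# Hypothetical schema: field name -> required Python type.
PLAY_EVENT_SCHEMA = {
    "user_id": str,
    "track_id": str,
    "country": str,
    "ms_played": int,
}


def matches_schema(event, schema=PLAY_EVENT_SCHEMA):
    """Return True if the event has every schema field with the expected type."""
    return all(isinstance(event.get(field), kind) for field, kind in schema.items())


def aggregate_plays(events, min_ms_played=30_000):
    """Drop malformed or very short plays, then count plays per track.

    The 30-second threshold is an assumption for illustration only.
    """
    counts = Counter()
    for event in events:
        if not matches_schema(event):
            continue  # step 1: compared against a schema
        if event["ms_played"] < min_ms_played:
            continue  # step 2: filtered
        counts[event["track_id"]] += 1  # step 3: aggregated
    return counts


events = [
    {"user_id": "u1", "track_id": "polka-1", "country": "EC", "ms_played": 180_000},
    {"user_id": "u2", "track_id": "polka-1", "country": "SE", "ms_played": 45_000},
    {"user_id": "u3", "track_id": "pop-9", "country": "US", "ms_played": 5_000},
    {"user_id": "u4", "track_id": "pop-9", "country": "US"},  # missing ms_played
]
play_counts = aggregate_plays(events)
```

[In this toy run only the two polka plays survive: one pop event is too short and the other is missing a field.]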

Today we run one of the largest Hadoop clusters [for an explanation of what that means, go here -- but be warned, it is not for the faint of heart] in Europe. We obviously have a pretty big team of engineers, analysts and scientists working on this -- reflected in our acquisitions of Seed Scientific and The Echo Nest.

'Why Google?' The answer is that we want to give those people the best platform and tools that we can to turn that activity into insight.

How much data is processed in a day, in a week?

I know that our Hadoop clusters, the data retained, is 90 petabytes.

Does the changeover give your teams more breathing room?

I don't know if breathing room is the right way to put it. We have to spend more and more time and energy and focus on maintaining the back end, the larger it gets. The more that we can get from Google, the more we can focus on higher-value stuff, like turning the data into insights instead of the system that supports the system.

What does this system look like?

It's like a big circulatory system -- you're feeding this information into our global data footprint [and from there] into a handful of places. Data that we gather, observe, collect from clients connecting all over the world, that all pipes back to London, primarily, for processing. That would be one of those few other places.

How can you assure users that Google or Amazon aren't snooping in on them?

Graham James: There are numerous safeguards in place -- we're not giving credit card information to Google -- it's the backend infrastructure data. It's not personal information.

How does this system compare between launch and now?

In 2008 it was non-existent. A few years in, when we first built a Hadoop cluster, we decommissioned the foosball room and installed a bunch of fans and A/C units and piled a bunch of machines into the foosball room to build it. It wasn't the big circulatory system that exists today -- the industrial-scale, weaponized version of that.

Do you worry about consolidation and concentration of these cloud computing services like Amazon and Google?

I would say right now the competition in that space is pretty fierce. Amazon and Microsoft and Google are the three big players, fighting pretty hard for business. We're actually a ways out from consolidation. Certainly I would be concerned if that space were a monopoly, but that's not the case and I don't see that happening in the foreseeable future.

Can you describe, literally, the geographical path that pressing play causes?

You'd have to start listening to something in the long tail. Say "Polka Hits of the '70s."

Essentially what happens is you press play. In the majority of cases, what happens is you send a play request, and in return you receive an access token that allows you to play the track you requested. The reason we do that is to prevent unauthorized playback of tracks. At the same time that you make that request, you make another request to go fetch Polka Hits of the '70s, in anticipation of being granted an access token -- there are cases where we wouldn't send one, but those are rare.
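[A minimal sketch of that two-request pattern, in which the token request and the audio fetch go out at the same time and playback happens only if a token comes back. Everything here is hypothetical: the function names, the `entitled` flag and the HMAC-signed token are illustrative assumptions, not a description of how Spotify actually issues tokens.]

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def request_access_token(user_id, track_id, entitled=True):
    """Stand-in for the play request: return a signed token, or None if refused."""
    if not entitled:
        return None
    message = f"{user_id}:{track_id}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def fetch_audio(track_id):
    """Stand-in for the content fetch: return placeholder audio bytes."""
    return f"<audio bytes for {track_id}>".encode()


def press_play(user_id, track_id, entitled=True):
    """Send both requests at once; play only if an access token was granted."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        token_future = pool.submit(request_access_token, user_id, track_id, entitled)
        audio_future = pool.submit(fetch_audio, track_id)  # speculative fetch
        token = token_future.result()
        audio = audio_future.result()
    if token is None:
        return None  # no token: the speculatively fetched audio is discarded
    return audio


played = press_play("listener-1", "polka-hits")
refused = press_play("listener-1", "polka-hits", entitled=False)
```

[The point of fetching speculatively is that, since refusals are rare, the listener almost never waits for the token before the audio starts arriving.]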

The request to actually fetch Polka Hits is going to one of a handful of CDN [Content Delivery Network] providers. Amazon is one of those. For content delivery, for hosting the catalog, we have a strategy: we use whatever CDN provider is going to provide the best performance for the user in the market that they're in. All the CDN providers out there have different strengths and weaknesses. Hypothetically, let's say maybe Amazon is great in South America, less great in the U.S. So we serve South American users from Amazon and the West Coast from, say, Akamai or Fastly, or a handful of other players.
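[To make the per-market choice concrete, here is a small sketch that picks the provider with the lowest measured latency in each market. The latency figures are invented to mirror Harteau's hypothetical, and using latency alone as the measure of "best performance" is an assumption of this example.]

```python
# Hypothetical latency measurements in milliseconds, per market and provider.
# These numbers are made up to match the hypothetical in the interview.
LATENCY_MS = {
    "south-america": {"amazon": 40, "akamai": 95, "fastly": 80},
    "us-west": {"amazon": 90, "akamai": 35, "fastly": 45},
}


def best_cdn(market, measurements=LATENCY_MS):
    """Return the provider with the lowest measured latency for a market."""
    providers = measurements[market]
    return min(providers, key=providers.get)


south_america_choice = best_cdn("south-america")
west_coast_choice = best_cdn("us-west")
```

[With these invented numbers, South American listeners would be served from Amazon and West Coast listeners from Akamai.]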

Your request for Polka Hits would go to whichever one of these is the best. If it goes to Akamai, it's going to hit their service, and they're going to say, 'Polka Hits? I don't have that! Where do I go?' It'll then go through a couple of tiers of that before it says, 'Fine, I don't know where that is, I'm going to fetch it directly from Spotify,' probably from us in one of two or three locations -- one in Europe, two in the U.S. As part of fetching Polka Hits, it's going to populate the caches of all those tiers in between, so when your neighbor wants to hear it you get as good a performance as Justin Bieber [who is not on the 'long tail' catalog of music]. That's essentially the path.
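[The tiered lookup can be sketched as a chain of caches that each ask their parent on a miss and keep a copy on the way back. The class, the two tier names and the dictionary standing in for Spotify's origin servers are hypothetical; real CDNs are considerably more involved.]

```python
class CacheTier:
    """One cache tier that falls back to its parent, then to the origin."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.store = {}

    def get(self, track_id, origin):
        """Return (data, where it was found), filling this tier on a miss."""
        if track_id in self.store:
            return self.store[track_id], self.name
        if self.parent is not None:
            data, source = self.parent.get(track_id, origin)
        else:
            data, source = origin[track_id], "origin"
        self.store[track_id] = data  # populate the cache for the next listener
        return data, source


origin = {"polka-hits": b"oompah"}  # stand-in for the origin servers
regional = CacheTier("regional")
edge = CacheTier("edge", parent=regional)

first_request = edge.get("polka-hits", origin)   # misses every tier
second_request = edge.get("polka-hits", origin)  # served from the edge
```

[The first request travels all the way to the origin and fills both tiers; the neighbor's request is then answered by the edge cache.]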

That's easy.

[Laughs]

Meanwhile, we record your play: we recognize that you played the track, and we store the record of that play in a couple of places, one of which is our data processing facility, so the next Discover Weekly playlist for you might have Polka Hits on it. We use that same data set to figure out who, and how much, we need to pay for that play, and a bunch of other stuff that might determine that 'Hey, some of the people we have building great playlists for us, we see this uptick in polka music. We're going to ask them to put together a curated polka playlist.' We might also go tell the sales team, 'Why don't you go sell some ads to some people selling... bowling equipment. We know there's an audience.'

And they say, 'Are you fucking kidding me?'

And we say 'No we're not.'