Case Study: Spotify Albums

Introduction

In this Case Study, we will analyze properties of popular music, based on measurements taken by the streaming service Spotify. For every song that is available on Spotify, automatic measurements are taken of certain musical and sound-based properties. These include:

  • mode: Mode indicates the modality (major = 1 or minor = 0) of a track, the type of scale from which its melodic content is derived.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • loudness: The overall loudness of a track in decibels (dB).
  • acoustic: Whether or not a track is acoustic (i.e., played without any background instruments or noises).
  • live: Whether or not the track was performed live.
  • duration: How long the song is. The sample rows below record this in milliseconds (for example, 217133 is about 3 minutes 37 seconds).
  • explicit: Whether the song contains explicit language.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
  • energy: Energy is a perceptual measure of intensity and activity.
  • valence: A measure describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • speechiness: How much of the song consists of spoken words.
  • time signature: The beat pattern of the song (0, 1, 2, 3, or 4).

We will use a smaller version of a dataset gathered from Spotify’s service by a former Cal Poly student. The full data, and information about it, can be found here: https://www.kaggle.com/danield2255/data-on-songs-from-billboard-19992019

The cases of this dataset are Albums, not individual songs. All measurements are averaged over the songs in the album, except:

  • Whether an album is considered Live or Acoustic depends on whether more than half the songs on the album are Live or Acoustic.
  • The Duration is the length in seconds of the full album.
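The aggregation rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code that was used to build the dataset: the function name `summarize_album`, the field names, and the three song rows are all hypothetical.

```python
from statistics import mean

# Hypothetical song-level rows for a single album (values invented for illustration).
songs = [
    {"tempo": 120.0, "duration": 200, "live": False, "acoustic": False},
    {"tempo": 100.0, "duration": 250, "live": True, "acoustic": False},
    {"tempo": 140.0, "duration": 150, "live": True, "acoustic": True},
]

def summarize_album(songs):
    """Collapse song-level rows into one album-level row."""
    n = len(songs)
    return {
        # Quantitative measurements are averaged over the songs.
        "tempo": mean(s["tempo"] for s in songs),
        # Duration is the exception: it is the total length of the album.
        "duration": sum(s["duration"] for s in songs),
        # Live and Acoustic are TRUE only if more than half the songs are.
        "live": sum(s["live"] for s in songs) > n / 2,
        "acoustic": sum(s["acoustic"] for s in songs) > n / 2,
    }

album = summarize_album(songs)
print(album)
```

In this toy album, two of the three songs are live, so the album counts as Live, while only one of three is acoustic, so the album does not count as Acoustic.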

A few rows (not all of them!) of the dataset are shown here:

| Name | Album | Artist | Acoustic | Danceability | Duration | Energy | Explicit | Live | Loudness | Mode | Speechiness | Tempo | TimeSignature | Valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P.L.U.C.K. | System Of A Down (Bonus Pack) | system of a down | FALSE | 0.267 | 217133 | 0.931 | TRUE | FALSE | -4.087 | 1 | 0.3050 | 159.759 | 4 | 0.1170 |
| Kim | The Marshall Mathers LP | eminem | FALSE | 0.587 | 377893 | 0.923 | TRUE | FALSE | -3.050 | 0 | 0.4410 | 142.075 | 4 | 0.0778 |
| Never Gonna Leave This Bed - Acoustic Version | Hands All Over (Deluxe) | maroon 5 | FALSE | 0.705 | 202320 | 0.688 | FALSE | FALSE | -5.773 | 1 | 0.0415 | 114.943 | 4 | 0.5980 |
| Waking Up In Vegas | One Of The Boys | katy perry | FALSE | 0.524 | 199187 | 0.878 | FALSE | FALSE | -3.108 | 0 | 0.0346 | 130.989 | 4 | 0.5900 |
| Intro - Main Version - Explicit | Idlewild | outkast | TRUE | 0.483 | 132027 | 0.506 | TRUE | FALSE | -14.899 | 0 | 0.3600 | 159.825 | 4 | 0.4660 |
| Ur So Gay - Live At MTV Unplugged, New York, NY/2009 | Unplugged (Live At MTV Unplugged, New York, NY/2009) | katy perry | FALSE | 0.536 | 264213 | 0.651 | FALSE | TRUE | -5.828 | 0 | 0.1480 | 159.833 | 4 | 0.2660 |
| Burn the Witch | A Moon Shaped Pool | radiohead | FALSE | 0.541 | 220609 | 0.847 | FALSE | FALSE | -6.520 | 1 | 0.0297 | 148.937 | 4 | 0.6200 |
| All In It | Purpose (Deluxe) | justin bieber | FALSE | 0.356 | 231413 | 0.707 | FALSE | FALSE | -7.546 | 0 | 0.0990 | 136.250 | 5 | 0.5060 |
| White As Snow | No Line On The Horizon | u2 | TRUE | 0.402 | 281067 | 0.308 | FALSE | FALSE | -11.615 | 0 | 0.0290 | 173.755 | 4 | 0.1400 |
| Roman's Revenge - Album Version (Edited) | Pink Friday (Deluxe) | nicki minaj | TRUE | 0.817 | 276173 | 0.947 | FALSE | FALSE | -2.881 | 1 | 0.2900 | 112.364 | 4 | 0.4650 |

Data Apps and Tables

To address the questions in this case study, you will need to calculate summary statistics and estimates from the dataset. Rather than giving you these summaries ahead of time or asking you to calculate them by hand, we have provided a web application: you choose which statistics to calculate, and the app does the calculation for you.

Access the Data App here

Since correlations are a bit more complicated, here is a table of the sample correlations between all quantitative variables:

| Variable | Danceability | Duration | Energy | Loudness | Speechiness | Tempo | Valence |
|---|---|---|---|---|---|---|---|
| Danceability | 1.00000000 | 0.048073809 | -0.07955758 | 0.07357243 | 0.18028556 | -0.220553226 | 0.32204309 |
| Duration | 0.04807381 | 1.000000000 | -0.00956553 | 0.01114260 | -0.07796373 | 0.004695732 | -0.17626606 |
| Energy | -0.07955758 | -0.009565530 | 1.00000000 | 0.73587747 | 0.05387733 | 0.169815148 | 0.38158658 |
| Loudness | 0.07357243 | 0.011142596 | 0.73587747 | 1.00000000 | -0.04918635 | 0.136233201 | 0.29612644 |
| Speechiness | 0.18028556 | -0.077963730 | 0.05387733 | -0.04918635 | 1.00000000 | 0.032884040 | 0.02115068 |
| Tempo | -0.22055323 | 0.004695732 | 0.16981515 | 0.13623320 | 0.03288404 | 1.000000000 | 0.06423906 |
| Valence | 0.32204309 | -0.176266061 | 0.38158658 | 0.29612644 | 0.02115068 | 0.064239061 | 1.00000000 |
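If you are curious how a single entry of a correlation table like this is computed, here is a minimal sketch of the sample (Pearson) correlation. The function name `pearson_r` is our own, and the inputs are the Energy and Loudness values from the ten sample rows shown earlier. Because it uses only those ten rows, its result will not match the Energy-Loudness entry in the table, which comes from the full dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Energy and Loudness from the ten sample rows shown earlier.
energy = [0.931, 0.923, 0.688, 0.878, 0.506, 0.651, 0.847, 0.707, 0.308, 0.947]
loudness = [-4.087, -3.050, -5.773, -3.108, -14.899,
            -5.828, -6.520, -7.546, -11.615, -2.881]

r = pearson_r(energy, loudness)
print(round(r, 3))
```

The correlation is symmetric (swapping the two variables gives the same value), which is why the table is mirrored across its diagonal of 1s.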

You will also be asked to make visualizations to explore the data and support your analyses. This will also be accomplished via a web application that lets you choose variables and plot types.

Access the Visualization App here

Summarize the dataset

As always, we begin by reporting the information about the dataset. Write one short paragraph that addresses the structure of the dataset (i.e. cases and variables), where it came from, and any ethical concerns you might have about this analysis.

(You do not need to state variable types in this paragraph; you will be asked about that later.)

Research Questions

Answer each of the following research questions, including all of these steps in each answer:

  1. State the variables involved in the question, and give their types.

  2. Make a visualization that addresses the research question.

  3. State the parameter(s) that would answer the research question. Then, compute and report summary statistic(s) that estimate these parameters.

  4. Calculate and interpret a z-score to address the question.

  5. Report a hypothesis test, including stating the null and alternative hypotheses and using the Empirical Rule to estimate your p-value.

  6. Summarize your results in real-world terms: what (if anything) do you conclude about Spotify songs based on this analysis?

Are musical artists less likely to use explicit language when an album is played live?

Are acoustic albums quieter than non-acoustic albums, on average?

Do explicit albums tend to be more danceable?

Do high energy albums have a faster tempo?
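Steps 4 and 5 can be sketched as follows for a question that compares two group means, such as the acoustic versus non-acoustic question. Treat this as one possible outline under stated assumptions: the summary statistics are invented (use the Data App for the real ones), the function names are our own, the standard error formula is the usual unpooled one for a difference in means (your course materials may use a different one), and the p-value ranges are two-sided readings of the 68-95-99.7 rule.

```python
from math import sqrt

def se_diff_means(sd1, n1, sd2, n2):
    """Unpooled standard error of a difference in two sample means (assumed formula)."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def z_score(estimate, null_value, standard_error):
    """How many standard errors the estimate lies from the null value."""
    return (estimate - null_value) / standard_error

def empirical_rule_p_value(z):
    """Rough two-sided p-value range from the 68-95-99.7 rule."""
    z = abs(z)
    if z < 1:
        return "greater than 0.32"
    if z < 2:
        return "between 0.05 and 0.32"
    if z < 3:
        return "between 0.003 and 0.05"
    return "less than 0.003"

# Hypothetical summary statistics for Loudness (dB); NOT taken from the dataset.
mean_acoustic, sd_acoustic, n_acoustic = -10.0, 3.0, 36
mean_other, sd_other, n_other = -7.0, 4.0, 64

se = se_diff_means(sd_acoustic, n_acoustic, sd_other, n_other)
# Null hypothesis: the true difference in mean loudness is 0.
z = z_score(mean_acoustic - mean_other, 0, se)

print(round(z, 2), empirical_rule_p_value(z))
```

For a one-sided question such as "quieter", where the observed difference is in the hypothesized direction, the one-sided p-value is about half of the two-sided range.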

Tell a story

Choose two artists in this dataset whom you would like to compare to each other.

Come up with three research questions:

  1. A question comparing a categorical variable across the two artists.

  2. A question comparing a quantitative variable across the two artists.

  3. A question comparing the correlation between two quantitative variables across artists. (That is, we want to ask “Is the correlation between Variable X and Variable Y different for Artist A than for Artist B?”)

Answer one of the research questions you wrote by performing a full hypothesis test.