COVID-19 Data Exploration Using SQL

COVID-19 Data Exploration Using SQL

Hello everyone ✨✨, This is my first blog on Hashnode. With this blog, I wish to start a journey of writing technical blogs.

In this blog, I describe my journey of working on a Portfolio Project, which was a part of The Data Analyst Bootcamp, created by Alex Freberg. The Project is focused on exploring the COVID-19 data using SQL Server.


Data

I have downloaded the dataset for this Project from Our World in Data. For downloading the dataset from the link, follow these steps:-

  1. After going to the link, scroll down to the "Explore the global data on confirmed COVID-19 deaths" section.

  2. You will notice a time series chart in the section, select the range of dates you wish to explore.

  3. Click on Download, then select the "Full data (CSV)" option. The dataset will be downloaded.


Basic Preprocessing

I created two different CSV files from this dataset.

  • CovidDeaths

  • CovidVaccinations

The first CSV file named CovidDeaths contained the following columns:-

The second CSV file named CovidVaccinations contained the following columns:-

I then imported these files as flat files to Microsoft SQL Server Management Studio.


Data Selection

I first looked at the data contained in these tables. I used the SELECT statement for this task.

select * from CovidDeaths order by 3,4;
select * from CovidVaccinations order by 3,4;

I ordered the data based on Location (column 3), followed by Date (column 4).

I then selected the columns which I was going to use from the CovidDeaths Table.

select location, date, total_cases, new_cases, total_deaths,population
from CovidDeaths
order by 1,2;

Data Exploration

I first looked at Total Cases vs Total Deaths from the CovidDeaths Table for a specific Country India.

select location, date, total_cases, total_deaths, ((total_deaths/total_cases)*100) as DeathPercentage
from CovidDeaths
where location like '%India%'
order by 1,2;

A snapshot of a few rows returned by the above query

I then looked at Total Cases vs Population to find out what percentage of Population in India got infected by COVID-19.

select location, date, population, total_cases, ((total_cases/population)*100) as CovidPercentage
from CovidDeaths
where location like '%India%'
order by 1,2;

A snapshot of a few rows returned by the above query

I then looked at the Top 10 countries with the Highest Infection Rate compared to Population.

select TOP 10 location, population, max(total_cases) as HighestInfectionCount, (max((total_cases/population))*100) as CovidPercentage
from CovidDeaths
group by location, population
order by CovidPercentage desc;

The query returned the following result:-

I then looked at the Top 10 countries with the Highest Death Count.

select TOP 10 location, max(total_deaths) as TotalDeathCount
from CovidDeaths
where continent is not null
group by location
order by TotalDeathCount desc;

The query returned the following result:-

select continent, max(total_deaths) as TotalDeathCount
from CovidDeaths
where continent is not null
group by continent
order by TotalDeathCount desc;

I then looked for continent-wise Total Death Count and retrieved the following result from the above query:-

select date, sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (sum(new_deaths)/sum(new_cases))*100 as DeathPercentage
from CovidDeaths
where continent is not null
group by date
order by 4 desc;

I then looked for Global Statistics aggregated around Date for Death Percentage and retrieved the following result from the above query:-

with PopvsVac (continent, location, date, population, new_vaccination, RollingPeopleVaccinated)
as
(select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations, sum(vac.new_vaccinations) over (partition by dea.location order by dea.location, dea.date) as RollingPeopleVaccinated
from CovidDeaths dea join CovidVaccinations vac
on dea.location = vac.location and dea.date = vac.date
where dea.continent is not null
--order by 2,3;
)

select *, (RollingPeopleVaccinated/population)*100 as VaccinationPercentage
From PopvsVac;

I have created a CTE named PopvsVac and used it to retrieve the following result:-

create view PercentPopulationVaccinated as
select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations, sum(vac.new_vaccinations) over (partition by dea.location order by dea.location, dea.date) as RollingPeopleVaccinated
from CovidDeaths dea join CovidVaccinations vac
on dea.location = vac.location and dea.date = vac.date
where dea.continent is not null

I then created a view called PercentPopulationVaccinated for future use

create view GlobalDeath as
select sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (sum(new_deaths)/sum(new_cases))*100 as DeathPercentage
from CovidDeaths
where continent is not null

At last, I created a view named GlobalDeath to store the Total Global Statistics of COVID-19.


The entire SQL code can be found below:-

select * from CovidDeaths order by 3,4;

select * from CovidVaccinations order by 3,4;


----Selecting Data that we will use From CovidDeaths
select location, date, total_cases, new_cases, total_deaths,population
from CovidDeaths
order by 1,2;


---- Looking at Total Cases vs Total Deaths
----Likelihood of Death if you contract covid in India
select location, date, total_cases, total_deaths, ((total_deaths/total_cases)*100) as DeathPercentage
from CovidDeaths
where location like '%India%'
order by 1,2;

----Looking at Total cases vs Population
----What percentage of population got Covid
select location, date, population, total_cases, ((total_cases/population)*100) as CovidPercentage
from CovidDeaths
where location like '%India%'
order by 1,2;

---- Looking at countries with Highest Infection Rate compared to Population
select TOP 10 location, population, max(total_cases) as HighestInfectionCount, (max((total_cases/population))*100) as CovidPercentage
from CovidDeaths
group by location, population
order by CovidPercentage desc;

---- Showing Countries with Highest Death Count per population
select TOP 10 location, max(total_deaths) as TotalDeathCount
from CovidDeaths
where continent is not null
group by location
order by TotalDeathCount desc;

---- Let's look continent wise Death count
select continent, max(total_deaths) as TotalDeathCount
from CovidDeaths
where continent is not null
group by continent
order by TotalDeathCount desc;


---- Let's look at Global Values

select date, sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (sum(new_deaths)/sum(new_cases))*100 as DeathPercentage
from CovidDeaths
where continent is not null
group by date
order by 1,2

---- Ordering by Death Percent
select date, sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (sum(new_deaths)/sum(new_cases))*100 as DeathPercentage
from CovidDeaths
where continent is not null
group by date
order by 4 desc;


---- Looking at Total Population vs Vaccinations
---- We will Use CTE for this task

with PopvsVac (continent, location, date, population, new_vaccination, RollingPeopleVaccinated)
as
(select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations, sum(vac.new_vaccinations) over (partition by dea.location order by dea.location, dea.date) as RollingPeopleVaccinated
from CovidDeaths dea join CovidVaccinations vac
on dea.location = vac.location and dea.date = vac.date
where dea.continent is not null
--order by 2,3;
)

select *, (RollingPeopleVaccinated/population)*100 as VaccinationPercentage
From PopvsVac;


---- Creating Views to store data for later visualizations

---- To store Percentage of Population Vaccinated
create view PercentPopulationVaccinated as
select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations, sum(vac.new_vaccinations) over (partition by dea.location order by dea.location, dea.date) as RollingPeopleVaccinated
from CovidDeaths dea join CovidVaccinations vac
on dea.location = vac.location and dea.date = vac.date
where dea.continent is not null

---- To store Total Deaths
create view GlobalDeath as
select sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (sum(new_deaths)/sum(new_cases))*100 as DeathPercentage
from CovidDeaths
where continent is not null

Now, can we create visualizations based on these insights, I guess that's a story for another Project. Goodbye until we meet again🎉🎉.