Data Sources
Challenges

The data wrangling challenges we faced were:

 

  1. Differences in country names among the different sites. For example, "South Korea" is the same as "Republic of Korea" and "North Korea" is the same as "Democratic People's Republic of Korea".

  2. Differences in player names among the different sites. For example, some records used only the player's last name, whereas others used the full name.

  3. Most of the data was embedded in web sites rather than available as an easily imported data file.

  4. The United Kingdom is represented by separate teams for England, Scotland, Wales, and Northern Ireland rather than by a single national team.

We used the rvest package in R to scrape data from the web sites. To match country names, we manually coded a handful of non-obvious translations and then used a string-distance matrix to pair each remaining name with its closest match across the different sites. To match players, we limited the search space to the same country and year and then used a distance matrix to find the closest player name.
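
The matching steps described above could be sketched roughly as follows. This is an illustration, not our exact code: the helper names (closest_match, match_countries, match_players), the column names, and the sample records are hypothetical, and base R's adist() (generalized Levenshtein distance) is assumed as the distance measure, since the specific metric is not stated here.

```r
# Hypothetical helper: for each name in `x`, return the entry of `candidates`
# with the smallest edit distance (adist() is an assumed choice of metric).
closest_match <- function(x, candidates) {
  # adist() returns a length(x) by length(candidates) matrix of distances
  d <- utils::adist(x, candidates, ignore.case = TRUE)
  candidates[apply(d, 1, which.min)]
}

# Country matching: manual translations take priority over the distance match.
match_countries <- function(x, candidates, overrides) {
  out <- closest_match(x, candidates)
  known <- x %in% names(overrides)
  out[known] <- unname(overrides[x[known]])
  out
}

# Player matching: only compare against players from the same country and year.
match_players <- function(goals, squads) {
  goals$matched <- NA_character_
  for (i in seq_len(nrow(goals))) {
    same_team <- squads$country == goals$country[i] & squads$year == goals$year[i]
    pool <- squads$player[same_team]
    if (length(pool) > 0) {
      goals$matched[i] <- closest_match(goals$player[i], pool)
    }
  }
  goals
}

# Illustrative inputs only; these are not rows from the scraped data.
overrides <- c(
  "South Korea" = "Republic of Korea",
  "North Korea" = "Democratic People's Republic of Korea"
)
site_a <- c("South Korea", "Brasil", "Germany ", "North Korea")
site_b <- c("Republic of Korea", "Democratic People's Republic of Korea",
            "Brazil", "Germany")
country_matches <- match_countries(site_a, site_b, overrides)

goals <- data.frame(
  player = c("Klose", "Neymar"),
  country = c("Germany", "Brazil"),
  year = c(2014, 2014),
  stringsAsFactors = FALSE
)
squads <- data.frame(
  player = c("Miroslav Klose", "Philipp Lahm", "Neymar", "Oscar"),
  country = c("Germany", "Germany", "Brazil", "Brazil"),
  year = c(2014, 2014, 2014, 2014),
  stringsAsFactors = FALSE
)
player_matches <- match_players(goals, squads)
print(country_matches)
print(player_matches)
```

Restricting the candidate pool to one country and year keeps each comparison small, which is what lets a last name such as "Klose" resolve to the right full name even though the raw edit distance between the two strings is large.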

 

DATA

We used three primary data sources for our analysis.

 

  • World Cup games and goals scored information:

http://soccerstats.us/c/fifa-world-cup/

 

  • World Cup team and detailed player information:

https://en.wikipedia.org/wiki/<year>_FIFA_World_Cup_squads

 

  • Continent/Region information:

R package: countrycode

Data Wrangling