How To Create Datasets From Wikipedia Tables

Manav Sehgal
Apr 4, 2017


Sometimes a data science project requires creating a dataset from public domain information. Wikipedia is a great source of semi-structured public domain data. This article walks you through a Python notebook I created to extract tables from Wikipedia as CSV datasets, which you can use in your data science projects.
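To get a feel for the end result, here is a minimal sketch (not the notebook itself) using pandas.read_html, which parses every HTML table on a page into DataFrames. The page URL and output filename are illustrative, and you will need lxml or html5lib installed alongside pandas:

```python
import pandas as pd

# Fetch all tables on the page; read_html returns a list of DataFrames.
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)

# Pick the table you want by its position on the page, then save it.
tables[0].to_csv("population.csv", index=False)
```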

You can download the Jupyter Notebook written in Python from GitHub.

The notebook is written so that you can follow a simple three-step workflow to grab any table from Wikipedia as a CSV dataset.

  1. Identify the Wikipedia page name and, if the page has multiple tables, count the target table's position from the top. Set these values in the second cell.
  2. Turn on the trace feature by setting the trace flag to True. This lets you test the data extraction logic without saving the dataset to a CSV file.
  3. Turn off the trace feature and run the notebook to create the CSV dataset file. A code sketch of this workflow follows the list.
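Here is a rough sketch of what that three-step workflow looks like in code. The variable names (WIKI_PAGE, TABLE_INDEX, TRACE) are my placeholders, not necessarily the notebook's own, and the page name is just an example:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Step 1: page name and the table's position on the page (0 = first table).
WIKI_PAGE = "List_of_countries_by_population_(United_Nations)"
TABLE_INDEX = 0

# Steps 2 and 3: with TRACE = True, print rows instead of writing the CSV.
TRACE = True

url = f"https://en.wikipedia.org/wiki/{WIKI_PAGE}"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all("table", class_="wikitable")[TABLE_INDEX]

rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

if TRACE:
    for row in rows[:5]:  # inspect a few rows before committing to a file
        print(row)
else:
    with open(f"{WIKI_PAGE}.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```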

The notebook has been tested on around ten different tables and datasets across Wikipedia, handling the following data variations automatically.

  • Extract feature names, including from multi-line table column headers.
  • Extract text features from links.
  • Ignore superscript text or reference text in square brackets (see the cleaning sketch after this list).
  • Extract coordinates in latitude and longitude.
  • Process numerical feature values.
  • Handle hidden values within table cells.
  • Handle images within table cells.
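To illustrate two of those variations, here is a hedged sketch of a cell-cleaning helper (my own illustration, not the notebook's actual code) that drops superscript reference markers like [4] and keeps the display text of links:

```python
import re
from bs4 import BeautifulSoup

REF_PATTERN = re.compile(r"\[[^\]]*\]")  # matches [1], [a], [note 2], ...

def clean_cell(td):
    """Extract readable text from a table cell, ignoring reference markers."""
    # Drop superscript reference tags (<sup>) before reading the text.
    for sup in td.find_all("sup"):
        sup.decompose()
    # Links render as their display text via get_text().
    text = td.get_text(" ", strip=True)
    # Remove any square-bracket references remaining in the plain text.
    return REF_PATTERN.sub("", text).strip()

# Example: a cell containing a link followed by a reference marker.
cell = BeautifulSoup(
    '<td><a href="/wiki/India">India</a><sup>[4]</sup></td>', "html.parser"
).td
print(clean_cell(cell))  # -> "India"
```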

If you find the notebook useful, please heart this article and share it forward.
