Level of Difficulty: Beginner – Senior.

GitHub Projects is nothing new – it has been around for a while now with a really cool API that allows you to pull data from it. For those who have made use of that functionality, automating and integrating tasks across different platforms has been a breeze. But… GitHub has launched an ‘all-new’ Projects experience that now allows you to do cool new things that were previously limited in the Classic Projects.


Now, from an automation and integration perspective, this addition is exciting, but getting the data out is notoriously tough. The addition to the platform is still relatively new and hasn’t been out of preview for all that long, so it is understandable that the API (which is based on GraphQL) is not quite ‘there’ yet. It’s not as easy as lifting and shifting from Classic Projects to the new Projects. So what now? How do we use this cool new (free) tech and still grab the data we need if not even front-end Robotic Process Automation (RPA) can do the job?
The answer didn’t seem all that ‘simple’ at the time – I had exhausted nearly every possible solution I could think of – but web-scraping definitely does the trick.
Due to the way that GitHub (and GitHub Projects more specifically) has been built, a lot of the data rendered on the front-end is actually embedded in the page’s HTML elements and is easily accessible, which is why the web-scraping option works so well. The only caveat is that you need to have access to the Project to scrape it.
Here are a few examples of the data you can pull:
- Project Views
- Project View Columns
- Project View Data
- Project Charts
You can pull individual elements and associated data out as well (like the list of values for a specific column). Let’s see how that works using Python.
Install Beautiful Soup
If you haven’t yet installed Beautiful Soup, do so using the following command (requests and pandas are also used below, so install them too if needed):
pip install beautifulsoup4 requests pandas
Import Libraries
Import the following libraries/packages, all of which are essential to successfully scraping the data:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup as bs
Use Beautiful Soup to Scrape the Project
Beautiful Soup is the package used to scrape a web page which, in this case, works quite well for scraping a GitHub Project. The ‘soup’ is the parsed content scraped from the web page and will be used for further manipulation to get the required data.
def GetProjectSoup(project_url):
    # load the project webpage content
    r = requests.get(project_url)
    # convert to beautiful soup (an explicit parser avoids a warning)
    soup = bs(r.content, 'html.parser')
    return soup
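The caveat mentioned earlier applies here: the request only returns the full page if you have access to the Project. For a private Project, one option is to reuse your browser’s logged-in session cookie. Here is a minimal sketch, assuming GitHub’s session cookie is named ‘user_session’ – the cookie name and this helper are assumptions you would verify against your own logged-in browser session, not an official API:
def GetPrivateProjectSoup(project_url, session_cookie):
    # hypothetical helper: pass your logged-in browser session cookie along
    # with the request ('user_session' is an assumption about GitHub's cookie
    # name - copy the value from your browser's dev tools)
    r = requests.get(project_url, cookies={'user_session': session_cookie})
    return bs(r.content, 'html.parser')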
Get Project Views
All the required data already lives in the ‘soup’ – all you need to do is grab the data living in the ‘memex-views’ element:
def GetViews(soup):
    # the views live as JSON inside the 'memex-views' script element
    view_data_text = soup.find(id='memex-views').get_text()
    json_object = json.loads(view_data_text)
    return pd.DataFrame.from_dict(json_object)
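As a quick sanity check, you can peek at what came back. A minimal sketch, assuming the views payload includes a ‘name’ field (the charts payload later in this post carries one, so it is a reasonable guess – verify by printing the columns first):
df_views = GetViews(soup)
# inspect the available fields first; the payload shape may change over time
print(df_views.columns.tolist())
# assuming a 'name' field exists, list the views by name
print(df_views['name'].tolist())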
Get Project View Columns
All the required data already lives in the ‘soup’ – all you need to do is grab the data living in the ‘memex-columns-data’ element:
def GetColumns(soup):
    # the column definitions live as JSON inside the 'memex-columns-data' script element
    column_data_text = soup.find(id='memex-columns-data').get_text()
    json_object = json.loads(column_data_text)
    return pd.DataFrame.from_dict(json_object)
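As mentioned earlier, you can also pull individual elements out – for example, the list of column names. A minimal sketch, again assuming each column definition carries a ‘name’ field:
df_columns = GetColumns(soup)
# assuming each column definition includes a 'name' field
column_names = df_columns['name'].tolist()
print(column_names)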
Get View Data
All the required data already lives in the ‘soup’ – all you need to do is grab the data living in the ‘memex-items-data’ element:
def GetData(soup):
    # the project items live as JSON inside the 'memex-items-data' script element
    data_text = soup.find(id='memex-items-data').get_text()
    json_object = json.loads(data_text)
    return pd.DataFrame.from_dict(json_object)
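The items payload is nested JSON, so cells in the resulting DataFrame may still contain lists or dictionaries. Here is a minimal sketch that flattens the top level with pandas’ json_normalize – the exact nesting depends on GitHub’s payload, so treat this as a starting point for exploring it:
items_text = soup.find(id='memex-items-data').get_text()
items = json.loads(items_text)
# flatten one level of nesting; deeply nested structures stay as raw objects
df_flat = pd.json_normalize(items)
print(df_flat.columns.tolist())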
Get Project Charts
All the required data already lives in the ‘soup’ – all you need to do is grab the data living in the ‘memex-charts-data’ element:
def GetCharts(soup):
    # the chart definitions live as JSON inside the 'memex-charts-data' script element
    charts_data_text = soup.find(id='memex-charts-data').get_text()
    json_object = json.loads(charts_data_text)
    # return just the chart names
    charts_list = [element['name'] for element in json_object]
    return charts_list
Bringing it all Together
It is totally possible to scrape more data from the site – all you would need to do is rescrape a new URL, which you can put together based on some of the info you’ve already scraped, like appending ‘/views/<index>’ to the project URL to get the data for a specific view (see the sketch after the code below).
project_url = 'https://github.com/users/JacquiM/projects/23/views/1'
# Get Soup from Website Scrape
soup = GetProjectSoup(project_url)
# Get Views and assign to new Dataframe
df_views = GetViews(soup)
# Get Columns and assign to new Dataframe
df_columns = GetColumns(soup)
# Get View Data and assign to new Dataframe
df_data = GetData(soup)
# Get Project Charts and return as List
charts_list = GetCharts(soup)
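And, as noted above, the per-view URLs can be assembled from data you have already scraped. A minimal sketch, assuming the views are addressable by a sequential 1-based index in the URL (check your own Project’s URLs before relying on this):
base_url = 'https://github.com/users/JacquiM/projects/23'
# assuming views are reachable at /views/1, /views/2, ... up to the view count
for index in range(1, len(df_views) + 1):
    view_url = f'{base_url}/views/{index}'
    view_soup = GetProjectSoup(view_url)
    df_view_data = GetData(view_soup)
    print(view_url, df_view_data.shape)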
All the code that you would need can be found in this GitHub Repo. Did this work for you? Pop a comment down below.