[Python] Scraping Context (Categories, Topics and Entities) From Websites Using TextRazor and Python

Level of Difficulty: Beginner – Senior.

Say you have a website that you’d like to scrape to find out what it is about. The good news is that you can do this in Python, using a library named TextRazor, which extracts categories, entities and topics from a page. To use the library you need an API key; after that, you can adjust the parameters the library accepts to extract the information you need.

Let’s start by getting an API key.

Get an API Key

To get an API key, visit the TextRazor site and click ‘Create Account’; you’ll need an account to get a key.

Once you’ve verified your account, you’ll be able to see your API key, which you should keep safe and warm (as Mary Lambert would say :p).

Now that we are all set up, let’s move over to Python and get started there. The documentation of the library is available here.

Install Library – TextRazor

We need the TextRazor library for this. If you don’t have it installed, install it using this command:

pip install textrazor

Import Libraries

Import the installed textrazor library before using it.

import textrazor

Initialise Library

You’ll need to initialise the variables needed to make the API work. For this you’ll need the URL of the site you want to scrape as well as the API key.

API_Key = '<apikey>'
URL = 'https://www.azlyrics.com/lyrics/marylambert/shekeepsmewarm.html'
textrazor.api_key = API_Key
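
Rather than hard-coding the key in the script, one option is to read it from an environment variable. The sketch below assumes a variable named TEXTRAZOR_API_KEY, which is a name chosen for this example and not something the library looks for itself; the helper load_api_key is likewise hypothetical.

```python
import os


def load_api_key(env_var="TEXTRAZOR_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your TextRazor API key"
        )
    return key
```

With this in place, the assignment above could become textrazor.api_key = load_api_key(), which keeps the key out of the script itself.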

Select the extractors, cleanup mode and classifiers you’d like to use.

  • Extractor options include: entities, topics, words, phrases, dependency-trees, relations, entailments, senses, spelling
  • Cleanup mode options include: stripTags and cleanHTML
  • Classifier options include: textrazor_iab, textrazor_iab_content_taxonomy, textrazor_newscodes, textrazor_mediatopics and the name of a custom classifier

client = textrazor.TextRazor(extractors=["entities", "topics"])
client.set_cleanup_mode("cleanHTML")
client.set_classifiers(["textrazor_newscodes"])

Assign the analysis of the site to the variable named response.

response = client.analyze_url(URL)

Get the Desired Results

Print out the entities information:

entities = list(response.entities())
entities.sort(key=lambda x: x.relevance_score, reverse=True)
seen = set()

for entity in entities:
    # Print each entity once, the first (most relevant) time its ID appears
    if entity.id not in seen:
        print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
        seen.add(entity.id)
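
The sort-then-skip-duplicates pattern above can be tried without an API call. The sketch below uses stand-in records with made-up scores; FakeEntity and unique_by_relevance are hypothetical names for this example, and only the id and relevance_score attributes from the snippet above are assumed.

```python
from collections import namedtuple

# Stand-in for the entity objects returned by the API (an assumption for this sketch)
FakeEntity = namedtuple("FakeEntity", ["id", "relevance_score"])


def unique_by_relevance(entities):
    """Return one entity per id, ordered from most to least relevant."""
    ranked = sorted(entities, key=lambda e: e.relevance_score, reverse=True)
    seen = set()
    unique = []
    for entity in ranked:
        # Because the list is sorted first, the first hit per id is the most relevant one
        if entity.id not in seen:
            seen.add(entity.id)
            unique.append(entity)
    return unique


# Made-up sample data: "Mary Lambert" appears twice with different scores
sample = [
    FakeEntity("Mary Lambert", 0.4),
    FakeEntity("Love", 0.7),
    FakeEntity("Mary Lambert", 0.9),
]
for entity in unique_by_relevance(sample):
    print(entity.id, entity.relevance_score)
```

Sorting before de-duplicating matters here: it means the copy that is kept for each ID is the one with the highest relevance score.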

Print out the topics information:

for topic in response.topics():
    print(topic.json)

Print out the categories information:

for category in response.categories():
    print(category.json)
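
If the raw JSON is more than you need, the results could be condensed into a small summary. This is a minimal sketch, assuming the topic and category objects expose label and score attributes; check the library documentation before relying on those names. The summarise helper and the stand-in response with made-up values are hypothetical, so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def summarise(response, top_n=5):
    """Collect the top-scoring topic and category labels into a plain dict."""
    # Assumes each topic/category has .label and .score (an assumption, see above)
    topics = sorted(response.topics(), key=lambda t: t.score, reverse=True)
    categories = sorted(response.categories(), key=lambda c: c.score, reverse=True)
    return {
        "topics": [(t.label, t.score) for t in topics[:top_n]],
        "categories": [(c.label, c.score) for c in categories[:top_n]],
    }


# Stand-in response with made-up values, mimicking response.topics() / response.categories()
fake_response = SimpleNamespace(
    topics=lambda: [
        SimpleNamespace(label="Music", score=0.8),
        SimpleNamespace(label="Love", score=1.0),
    ],
    categories=lambda: [
        SimpleNamespace(label="arts, culture and entertainment", score=0.6),
    ],
)
print(json.dumps(summarise(fake_response), indent=2))
```

With a real response, summarise(response) would be called on the object returned by client.analyze_url(URL).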

Do you like what you see? She says ‘people stare, because we look so good together’. A more complete implementation can be found in this GitHub repo.

Published by Jacqui Muller

I am an application architect and part-time lecturer by profession who enjoys dabbling in software development, RPA, IoT, advanced analytics, data engineering and business intelligence. I am aspiring to complete a PhD degree in Computer Science within the next three years. My competencies include a high level of computer literacy as well as programming in various languages. I am passionate about my field of study and occupation as I believe it has the ability and potential to impact lives, both drastically and positively. I come packaged with an ambition to succeed and make the world a better place.
