[Python] Scraping Context (Categories, Topics and Entities) From Websites Using TextRazor and Python

Level of Difficulty: Beginner – Senior.

Say you have a website and you’d like to find out what it is about. The good news is that you can do this in Python, using a library named TextRazor. TextRazor extracts categories, entities and topics from a web page. To use the library you need an API key; after that, you can adjust the parameters the library accepts to extract the information you need.

Let’s start by getting an API key.

Get an API Key

To get an API key visit the TextRazor site. You’ll need to create an account to get your API key.

Then click ‘Create Account’. Once you’ve verified your account, you’ll be able to see your API key, which you should keep safe and warm (as Mary Lambert would say :p).

Now that we are all set up, let’s move over to Python and get started there. The documentation of the library is available here.

Install Library – TextRazor

All of the extraction is done through the TextRazor library. If you don’t have it installed, install it using this command:

pip install textrazor

Import Libraries

Import the installed textrazor library before using it.

import textrazor

Initialise Library

You’ll need to initialise the variables needed to make the API work. For this you’ll need the URL of the site you want to scrape as well as the API key.

API_Key = '<apikey>'
URL = 'https://www.azlyrics.com/lyrics/marylambert/shekeepsmewarm.html'
textrazor.api_key = API_Key
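Hard-coding the key in the script makes it easy to leak by accident. Here is a minimal sketch of one alternative, assuming you have stored the key in an environment variable. The variable name TEXTRAZOR_API_KEY and the helper load_api_key are illustrative choices for this sketch, not something TextRazor requires:

```python
import os


def load_api_key(env_var="TEXTRAZOR_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable to your TextRazor API key.")
    return key
```

You could then write textrazor.api_key = load_api_key() instead of pasting the key into the file.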

Select the extractor you’d like to use.

Extractor options include: entities, topics, words, phrases, dependency-trees, relations, entailments, senses, spelling
Cleanup mode options include: stripTags and cleanHTML
Classifier options include: textrazor_iab, textrazor_iab_content_taxonomy, textrazor_newscodes, textrazor_mediatopics, or the name of a custom classifier

client = textrazor.TextRazor(extractors=["entities", "topics"])
client.set_cleanup_mode("cleanHTML")
client.set_classifiers(["textrazor_newscodes"])
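Since the extractor names are plain strings, a typo is easy to make. Below is a small sketch that checks a list of names against the extractor options listed in this post before they are handed to the client. The helper check_extractors and the constant KNOWN_EXTRACTORS are hypothetical names for this example and are not part of the textrazor library:

```python
# Extractor names as listed in this post.
KNOWN_EXTRACTORS = {
    "entities", "topics", "words", "phrases", "dependency-trees",
    "relations", "entailments", "senses", "spelling",
}


def check_extractors(requested):
    """Return the requested names as a list, or raise ValueError on unknown names."""
    unknown = sorted(set(requested) - KNOWN_EXTRACTORS)
    if unknown:
        raise ValueError("Unknown extractor(s): " + ", ".join(unknown))
    return list(requested)
```

For example, textrazor.TextRazor(extractors=check_extractors(["entities", "topics"])) would behave as before, while a misspelling such as "topic" would raise an error before any request is made.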

Assign the analysis of the site to the variable named response.

response = client.analyze_url(URL)

Get the Desired Results

Print out the entities information:

entities = list(response.entities())
entities.sort(key=lambda x: x.relevance_score, reverse=True)
seen = set()

for entity in entities:
    # Print each entity once, the first (most relevant) time its id appears.
    if entity.id not in seen:
        print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
        seen.add(entity.id)
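The sort-then-skip-duplicates pattern in the entity loop can be tried without an API key. The sketch below uses made-up stand-in objects that model only the id and relevance_score attributes; FakeEntity, unique_by_relevance and the sample values are illustrative and do not come from a real TextRazor response:

```python
from collections import namedtuple

# Stand-in for the entity objects TextRazor returns; only the two
# attributes used by the loop are modelled here.
FakeEntity = namedtuple("FakeEntity", ["id", "relevance_score"])


def unique_by_relevance(entities):
    """Return one entity per id, highest relevance first."""
    ranked = sorted(entities, key=lambda e: e.relevance_score, reverse=True)
    seen = set()
    unique = []
    for entity in ranked:
        if entity.id not in seen:
            seen.add(entity.id)
            unique.append(entity)
    return unique


# Made-up sample data: the same id appears twice with different scores.
sample = [
    FakeEntity("Mary Lambert", 0.4),
    FakeEntity("Seattle", 0.7),
    FakeEntity("Mary Lambert", 0.9),
]
for entity in unique_by_relevance(sample):
    print(entity.id, entity.relevance_score)
```

Because the list is sorted before duplicates are skipped, the copy that survives for each id is the one with the highest relevance score.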

Print out the topics information:

for topic in response.topics():
    print(topic.json)

Print out the categories information:

for category in response.categories():
    print(category.json)
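The raw JSON can be noisy, so you may want to keep only the stronger results. Here is a rough sketch, assuming each topic or category object exposes label and score attributes (check this against the TextRazor documentation for your version). The helper labels_above, the 0.5 threshold and the sample values are made up for illustration:

```python
from types import SimpleNamespace


def labels_above(items, min_score=0.5):
    """Return (label, score) pairs scoring at least min_score, best first."""
    kept = [(item.label, item.score) for item in items if item.score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up stand-ins so the sketch runs without calling the API.
sample_topics = [
    SimpleNamespace(label="Music", score=0.92),
    SimpleNamespace(label="Weather", score=0.12),
    SimpleNamespace(label="Songwriting", score=0.67),
]
print(labels_above(sample_topics))
```

With a real response you would pass response.topics() or response.categories() in place of the sample list.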

Do you like what you see? She says ‘people stare, because we look so good together’. A more complete implementation can be found in this GitHub repo.

Published by Jacqui Muller
