Level of Difficulty: Beginner – Senior.
Say you have a website that you’d like to scrape to find out what it is about. The good news is that you can do this in Python using a library named TextRazor. TextRazor lets you extract categories, entities and topics from a web page. To use the library you need an API key; after that, you can adjust the parameters the library accepts to extract the information you need.
Let’s start by getting an API key.
Get an API Key
To get an API key, visit the TextRazor site and click ‘Create Account‘. Once you’ve verified your account, you’ll be able to see your API key, which you should keep safe and warm (as Mary Lambert would say :p).
Now that we are all set up, let’s move over to Python and get started there. The documentation of the library is available here.
Install Library – TextRazor
We’ll use the TextRazor library to analyse the site. If you don’t have it installed, you’ll need to install it using this command:
pip install textrazor
Import the installed textrazor library before you use it:
import textrazor
You’ll need to initialise the variables needed to make the API work. For this you’ll need the URL of the site you want to scrape as well as the API key.
API_Key = '<apikey>'
URL = 'https://www.azlyrics.com/lyrics/marylambert/shekeepsmewarm.html'
textrazor.api_key = API_Key
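Hard-coding the key is fine for a quick test, but since the key should be kept safe, you may prefer to read it from an environment variable instead. Here is a minimal sketch of that idea. The function name load_api_key and the variable name TEXTRAZOR_API_KEY are my own choices for illustration, not something the TextRazor library defines.

```python
import os


def load_api_key(env_var="TEXTRAZOR_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical convention, not part of TextRazor.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your TextRazor API key."
        )
    return key


# Usage, assuming the variable has been set in your shell:
# API_Key = load_api_key()
# textrazor.api_key = API_Key
```

This way the key never ends up in your source code or in a public repo.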
Select the extractors, cleanup mode and classifiers you’d like to use.
- Extractor options include: entities, topics, words, phrases, dependency-trees, relations, entailments, senses, spelling
- Cleanup mode options include: stripTags and cleanHTML
- Classifier options include: textrazor_iab, textrazor_iab_content_taxonomy, textrazor_newscodes, textrazor_mediatopics and custom classifier name
client = textrazor.TextRazor(extractors=["entities", "topics"])
client.set_cleanup_mode("cleanHTML")
client.set_classifiers(["textrazor_newscodes"])
Assign the analysis of the site to the variable named response.
response = client.analyze_url(URL)
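The analysis is a network call, so it can fail now and then. Below is a rough sketch of a retry wrapper you could put around it. The helper analyze_with_retry and its parameters are hypothetical (they are not part of the textrazor library); it simply takes any callable, such as client.analyze_url, and tries again a few times before giving up.

```python
import time


def analyze_with_retry(analyze, url, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call analyze(url), retrying up to `attempts` times on failure.

    `analyze` is any callable that takes a URL, e.g. client.analyze_url.
    `retry_on` is the tuple of exception types that should trigger a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return analyze(url)
        except retry_on as error:
            last_error = error
            # Wait before the next try, but not after the final one
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Usage with the client from this post. The library's documentation shows
# TextRazorAnalysisException being raised for failed requests, so you can
# narrow the retry to that exception type:
# response = analyze_with_retry(
#     client.analyze_url, URL,
#     retry_on=(textrazor.TextRazorAnalysisException,),
# )
```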
Get the Desired Results
Print out the entities information:
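entities = list(response.entities())
# Sort by relevance so the most relevant entities come first
entities.sort(key=lambda x: x.relevance_score, reverse=True)
seen = set()
for entity in entities:
    # Print each entity only once, even if it appears several times on the page
    if entity.id not in seen:
        print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
        seen.add(entity.id)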
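If you only want the few most relevant entities, the sort-and-deduplicate logic can be wrapped in a small helper. This is just a sketch: top_entities and its limit parameter are names I made up, and the sample objects are stand-ins with invented scores so the snippet runs without an API key. It only relies on the id and relevance_score attributes used in the snippet before it.

```python
from types import SimpleNamespace


def top_entities(entities, limit=5):
    """Return up to `limit` entities, one per id, highest relevance first."""
    ranked = sorted(entities, key=lambda e: e.relevance_score, reverse=True)
    seen = set()
    unique = []
    for entity in ranked:
        if len(unique) >= limit:
            break
        if entity.id in seen:
            continue
        seen.add(entity.id)
        unique.append(entity)
    return unique


# Stand-in objects with made-up values, used only to show the behaviour
sample = [
    SimpleNamespace(id="A", relevance_score=0.9),
    SimpleNamespace(id="B", relevance_score=0.4),
    SimpleNamespace(id="A", relevance_score=0.7),
]
for entity in top_entities(sample, limit=2):
    print(entity.id, entity.relevance_score)
```

With a real response you would call it as top_entities(response.entities(), limit=10).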
Print out the topics information:
for topic in response.topics():
    print(topic.json)
Print out the categories information:
for category in response.categories():
    print(category.json)
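The raw JSON can get long, so you might want to keep only the results the API is reasonably confident about. The sketch below assumes each topic or category object exposes label and score attributes (check the documentation for the version you have installed). The helper name confident_labels and the 0.5 threshold are arbitrary choices of mine.

```python
def confident_labels(items, min_score=0.5):
    """Return (label, score) pairs with score >= min_score, best first.

    Assumes every item has `label` and `score` attributes.
    """
    kept = [(item.label, item.score) for item in items if item.score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Usage with the response from this post:
# for label, score in confident_labels(response.topics(), min_score=0.8):
#     print(label, score)
```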
Do you like what you see? She says ‘people stare, because we look so good together’. A more complete implementation can be found in this GitHub repo.