The Extractor Tool
The extractor tool does several things, but most importantly, it extracts. It’s a strong starting point for building SEO tools because it walks through the same steps that most SEO tools follow:
- Crawl a website and return raw HTML
- Process HTML into elements
- Transform raw text into data
Google’s indexing follows broadly the same process, then stores the results to power its search engine. The processing step is a large part of the mythical algorithm.
The tool relies mostly on the TextRank algorithm, which is now relatively old. It’s an unsupervised process, meaning you don’t have to train it like many machine learning systems. It works by breaking a text into sentences and then finding the one sentence that is most like the remaining content. We can then assume that this sentence is an excellent summary of the text body.
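To make the idea concrete, here is a minimal sketch of TextRank-style sentence ranking in plain Python. It is not Summa’s code or what the Extractor actually runs: the sentence splitter is naive, the overlap score is only loosely modelled on the one in the original TextRank paper, and the function names (split_sentences, similarity, rank_sentences, summarize) are made up for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(a, b):
    # Shared unique words, normalised by the log of each sentence's vocabulary size.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids dividing by log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank over the sentence similarity graph.
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    totals = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n)
                if totals[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores


def summarize(text):
    # Return the single highest-ranked sentence, or "" for empty input.
    sentences = split_sentences(text)
    if not sentences:
        return ""
    scores = rank_sentences(sentences)
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best]
```

Given a short text where one sentence shares words with most of the others, summarize should pick that sentence, because it collects the most weight from its neighbours in the graph.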
What it does
The Extractor should take a URL and return several extracted pieces of information.
- Polarity (read as % positivity, or % negativity)
- Subjectivity (read as % subjective or % objective)
- Keyword List
- Extracted Content
If it doesn’t, then something has gone wrong. It could be on my end. I pay for minimal servers. Alternatively, it could be that your website doesn’t have any content.
It might have words, but it needs paragraphs, descriptions, and context. Make sure your websites use sentences, paragraphs, and large bodies of text!
Tools I used
The Extractor is a Python tool that uses other libraries to do all the heavy lifting.
The Requests library is a simple way to fetch web content. It is far simpler than Google’s Puppeteer library, which drives a full headless browser.
urllib2 allows you to send headers, and User-Agent is one of the most critical headers when crawling. Since we aren’t using a real web browser to crawl, many websites become leery of requests that arrive without one. I use Fake User-Agent to generate a random one, which keeps the more security-conscious sites happy.
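As a rough sketch of the header trick, the snippet below builds a request with a randomly chosen User-Agent using only the standard library. A few assumptions: urllib2 is the Python 2 name, so this uses its Python 3 successor, urllib.request; the hard-coded USER_AGENTS list is a stand-in for what Fake User-Agent would supply; and build_request and fetch_html are hypothetical helper names, not part of the Extractor.

```python
import random
import urllib.request

# Stand-in list; a library such as Fake User-Agent would supply real, current strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_request(url, user_agents=USER_AGENTS):
    # Attach a random User-Agent so the request looks like it came from a browser.
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10):
    # Perform the request and decode the body using the declared charset.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping build_request separate from fetch_html makes it easy to check the headers without touching the network.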
Dragnet is a machine learning implementation for extracting primary content. It is possible to use BeautifulSoup4 to pull out what you need with an HTML parser. I did try that approach, but stray HTML kept slipping through in edge cases my rules didn’t cover. Dragnet doesn’t always get all the content, which is why I always suggest writing HTML with a single paragraph section for your keywords.
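For comparison, this is roughly what the parser-based route looks like. The sketch uses the standard library’s html.parser rather than BeautifulSoup4 or Dragnet, and the ParagraphExtractor class and extract_paragraphs function are invented for this example. It keeps only text found inside p tags, which hints at why the approach is brittle: anything the rule doesn’t anticipate is either lost or leaks through.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Content that lives in a div, a list, or a table never reaches the output, so a page written without real paragraphs comes back empty.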
Summa provides the TextRank algorithm, with functions for producing an extracted summary and extracted keywords. The Extractor uses both. As I understand it, the general idea is a similarity matrix: text is represented as vectors built from its words, and pairs of vectors are compared with cosine similarity.
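The similarity-matrix idea can be sketched in a few lines. This is an illustration of cosine similarity over simple word-count vectors, not a description of Summa’s internals, and the helper names (word_vector, cosine_similarity, similarity_matrix) are assumptions for the example.

```python
import math
import re
from collections import Counter


def word_vector(sentence):
    # Represent a sentence as a sparse vector of word counts.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    # Dot product over shared words, divided by the product of the vector lengths.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences):
    # matrix[i][j] is the similarity between sentence i and sentence j.
    vectors = [word_vector(s) for s in sentences]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]
```

Two four-word sentences that share three words score 0.75, while sentences with no words in common score 0.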
I ran into a problem with similar keywords. The list of words had multiple similar words that polluted the model. Lemmatization was the immediate solution.
Lemmatization is the idea that similar words should be reduced to their base word, the one a dictionary would list as the headword for all of them. I had a lousy experience lemmatizing with most of the existing implementations. Instead of getting the lemma (base word), I wanted to keep only the most relevant version of the keyword.
The current implementation uses FuzzyWuzzy and Itertools to iterate over all pairs of keywords, find pairs separated by a small Levenshtein distance, and remove the less relevant keyword from each matching pair.
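A minimal sketch of that pairwise clean-up might look like the following. To keep it dependency-free, difflib.SequenceMatcher stands in for FuzzyWuzzy’s ratio, so the scores are similar in spirit but not a true Levenshtein distance. The input shape (keyword, relevance) and the threshold of 80 are assumptions, as are the function names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_ratio(a, b):
    # 0-100 similarity score; a stand-in for a FuzzyWuzzy-style ratio.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def dedupe_keywords(scored_keywords, threshold=80):
    # scored_keywords is a list of (keyword, relevance) pairs.
    dropped = set()
    for (word_a, score_a), (word_b, score_b) in combinations(scored_keywords, 2):
        if word_a in dropped or word_b in dropped:
            continue
        if similarity_ratio(word_a, word_b) >= threshold:
            # Keep the more relevant keyword of the near-duplicate pair.
            dropped.add(word_b if score_a >= score_b else word_a)
    return [(word, score) for word, score in scored_keywords if word not in dropped]
```

With this approach "optimize", "optimized", and "optimizes" collapse to whichever of the three has the highest relevance, while unrelated keywords are left alone.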
TextBlob is used to gather sentiment. It’s a powerful library with functions for several everyday NLP tasks. I use it just for the sentiment analysis.
As far as I know, it’s another vector implementation that runs a simple average on the polarity or subjectivity of words. It’s useful for checking whether your content matches what its intended audience expects: emotion, or just the facts.
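The averaging described above can be illustrated with a toy example. This is not TextBlob’s code: the LEXICON entries and their values are invented for the sketch, and the sentiment function is a hypothetical name. Polarity runs from -1 (negative) to 1 (positive) and subjectivity from 0 (objective) to 1 (subjective).

```python
import re

# Toy lexicon with invented values: word -> (polarity, subjectivity).
LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "fast": (0.2, 0.6),
    "slow": (-0.3, 0.4),
}


def sentiment(text):
    # Average the scores of every word that appears in the lexicon.
    words = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[word] for word in words if word in LEXICON]
    if not hits:
        return 0.0, 0.0  # nothing recognised: neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity
```

Under this toy lexicon, "The crawler is great and fast" averages out to a polarity of 0.5, which the tool’s output would present as 50% positive.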
The tool still needs a queue status so that, under load, you can see how the server’s work is going.
Finally, it would be nice to store the request and provide a unique URL. Then users could leave the page and come back when the crawl and analysis are complete. Alternatively, SEO users could send the unique URL to friends to show off the results.
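One possible shape for those two features is sketched below: an in-memory queue that hands out a unique ID per request and reports each job’s place in line. Everything here is hypothetical, including the JobQueue class and the idea of building a results URL from the ID; a real deployment would need persistent storage.

```python
import uuid
from collections import deque


class JobQueue:
    """In-memory queue that tracks the status of each submitted URL."""

    def __init__(self):
        self._pending = deque()
        self._jobs = {}

    def submit(self, url):
        # The hex ID could double as the path of a shareable results URL.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"url": url, "status": "queued", "result": None}
        self._pending.append(job_id)
        return job_id

    def position(self, job_id):
        # 1-based place in line, or 0 if the job is no longer waiting.
        try:
            return self._pending.index(job_id) + 1
        except ValueError:
            return 0

    def complete_next(self, result):
        # Mark the job at the front of the queue as finished.
        job_id = self._pending.popleft()
        self._jobs[job_id].update(status="done", result=result)
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]["status"]
```

A visitor could bookmark the ID, leave, and check status later, and position gives the number to show on a waiting page.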
Do you have a suggestion on how the tool should work or features you want? Send me a tweet.