On July 21st, 2018, I can say confidently that I was writing some SEO content. I know because I tweeted about it, and I never tweet about anything. "Someone should make a corpus analysis tool," I mused to myself. Someone who understands that the web is a really abstract place: full of documents, pages, and links to constantly changing and reorganizing topics. Then I forgot all about it for a year.
Fast forward to late 2019, and the tool was well underway. I realized no one had read my tweet, and it was time to build what I wanted myself. Lemmatic solved several core problems that I had:
- Organizing web content into structured topics or project buckets.
- Extracting the real content from HTML-filled pages.
- Collecting real keywords from the real content.
Organizing web content
Organizing indexed keywords into topics is what makes a search engine fast. Topics, subtopics, and deeply nested keywords with intent all come down to categorizing words and what they mean. A topic is the key in keyword.
I needed a way to organize web URLs by topic. Not a hard task, but an important one.
The web is full of junk. Advertising, unrelated content, and markup. We just want the words. The useful content that provides ideas, context, and constructive thought.
Most keyword tools do one of three things: word density, contextual synonyms, or scraping Google's suggestions.
Counting words is the easiest way to collect keywords: the most common words must be important. Of course, you have to clean up the results, because inevitably "a", "the", and "it" become the top results. Word-density results can be improved further with an algorithm such as TF-IDF, which surfaces the most common topical keywords. TF-IDF, however, is an old tool. Search engines have moved on.
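To make that concrete, here's a minimal TF-IDF sketch over a toy corpus (pure Python, not Lemmatic's actual code): a term's score rises with its frequency in a document and falls with how many documents contain it, which pushes "a", "the", and "it" toward zero automatically.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by term frequency times
    inverse document frequency (terms rare across the corpus score higher)."""
    n = len(docs)
    # document frequency: how many docs contain each term at least once
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    ["seo", "content", "keywords", "content"],
    ["content", "links", "pages"],
    ["keywords", "topics", "intent"],
]
scores = tf_idf(docs)
# "content" appears in two of three docs, so its IDF is low;
# "seo" appears in only one, so it outranks "content" in doc 0
# despite appearing half as often there.
```

A term that appears in every document gets `log(n / n) = 0`, which is exactly why stop words fall out without a manual cleanup pass.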
Synonyms, Combinations, and Databases
This is a contextual generator that uses words with similar intents or connotations to produce a whole list of terms. Latent semantic indexing generates keywords that fit into this category. Latent semantic analysis is great at finding words that appear in the same contexts with the same meanings. However, it's been around since 1988. We can do better than that!
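For the curious, latent semantic analysis boils down to a truncated SVD of a term-document matrix: terms that co-occur in the same documents end up close together in the low-rank space. A toy sketch (the matrix and vocabulary here are made up for illustration):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["car", "auto", "engine", "recipe", "flour"]
X = np.array([
    [2, 1, 0, 0],  # car
    [1, 2, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 2, 1],  # recipe
    [0, 0, 1, 2],  # flour
], dtype=float)

# Truncated SVD projects terms into a low-rank "latent semantic" space.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]  # each row is a term's latent vector

def similarity(a, b):
    """Cosine similarity between two terms in the latent space."""
    va, vb = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "car" and "auto" co-occur, so they land close together;
# "car" and "recipe" never do, so they end up nearly orthogonal.
```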
Scraping Google's suggestions is one of the most common and effective ways to find related keywords these days. The results come straight from Google's RankBrain and related algorithms. HOWEVER! They have already been passed through the algorithmic black box. We don't know why certain pages appear on Google for a suggested keyword.
The Final Fourth Way
Lemmatic is introducing a new way to collect keywords. It uses eigenvector centrality. That might not mean anything to you, but what if I told you it's the same family of algorithm behind Google's PageRank? It measures a word's centrality and weight based on the attachments and weights of its dependent words. The document becomes a network of nodes and links where the core ideas rise to the top.
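As a rough sketch of the idea (not Lemmatic's implementation, and using a simple co-occurrence window rather than true dependency parsing): build a graph linking words that appear near each other, then run power iteration to find the dominant eigenvector of the adjacency matrix. Words attached to many other well-connected words score highest.

```python
from collections import defaultdict

def eigenvector_centrality(edges, iterations=100):
    """Power iteration: repeatedly set each node's score to the sum of
    its neighbours' scores, normalising each round, so scores converge
    to the dominant eigenvector of the adjacency matrix."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new = {n: sum(scores[m] for m in graph[n]) for n in graph}
        norm = max(new.values()) or 1.0
        scores = {n: v / norm for n, v in new.items()}
    return scores

def cooccurrence_edges(words, window=2):
    """Link every pair of words appearing within `window` positions."""
    edges = []
    for i, w in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != w:  # skip self-loops
                edges.append((w, other))
    return edges

text = ("search engines rank pages search engines "
        "index pages links rank pages").split()
scores = eigenvector_centrality(cooccurrence_edges(text))
top = max(scores, key=scores.get)
# "pages" co-occurs with every other word, so it rises to the top.
```

The normalisation step keeps the scores from blowing up between rounds; after convergence, the top-scoring node sits at 1.0 and everything else is relative to it.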
That's the modern way we collect keywords.