Not all backlinks are created equal. A database full of URLs looks impressive until you realize half of them are dead sites, nofollow, sponsored, or UGC links that won't pass authority. Here's how we at CheckForma built an AI agent that sifts through a URL database and automatically determines which links are truly "dofollow" for a backlink strategy.

The core goal of the agent is simple: given a database of URLs, determine whether each page contains a high-quality dofollow backlink pointing to your domain.
Under the hood, however, this requires a multi-step reasoning process. The agent needs to:
- Visit the target page.
- Locate the link pointing to your domain.
- Inspect its HTML attributes.
- Classify the link type (dofollow, nofollow, sponsored, or UGC).
- Store and score the result based on context.
Everything starts with your task queue. You'll need a database table containing:
- A list of URLs to check
- Your target domain(s)
- A status field (unchecked, valid, error, etc.)
- Relevant metadata (Domain Rating, estimated traffic, niche relevance)
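A minimal sketch of that task-queue table, using SQLite for illustration. The column names here are assumptions for demonstration, not CheckForma's actual schema:

```python
import sqlite3

# Illustrative task-queue table: URLs to check, target domain, status, and metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE backlink_tasks (
        id             INTEGER PRIMARY KEY,
        url            TEXT NOT NULL,
        target_domain  TEXT NOT NULL,
        status         TEXT DEFAULT 'unchecked',  -- unchecked / valid / error
        domain_rating  INTEGER,
        est_traffic    INTEGER,
        niche          TEXT
    )
""")
conn.execute(
    "INSERT INTO backlink_tasks (url, target_domain, domain_rating) VALUES (?, ?, ?)",
    ("https://example.com/blog/post", "checkforma.com", 72),
)
row = conn.execute("SELECT url, status FROM backlink_tasks").fetchone()
print(row)  # ('https://example.com/blog/post', 'unchecked')
```

Any relational store works here; the important part is the status field, which lets workers claim and re-queue URLs safely.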
The agent needs a reliable way to fetch page content. Because many modern sites inject links dynamically via JavaScript, you need to render the full HTML.
- Use a headless browser (like Playwright or Puppeteer) for JS-heavy or SPA (Single Page Application) sites.
- Use standard HTTP requests for simple, static pages to save compute resources.
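One way to decide between the two fetch strategies: try the cheap static fetch first and fall back to headless rendering only when the target link isn't in the raw HTML. The heuristic below is an assumption to tune for your corpus, not a guaranteed rule:

```python
import urllib.request

def fetch_static(url: str, timeout: int = 15) -> str:
    """Plain HTTP fetch: cheap, but misses JS-injected links."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def needs_js_rendering(html: str, target_domain: str) -> bool:
    """Heuristic: if the static HTML already mentions the target domain,
    skip the headless browser; otherwise the link may be injected client-side."""
    return target_domain not in html

# Headless fallback sketch (left as comments to keep this block self-contained):
#   if needs_js_rendering(html, "checkforma.com"):
#       from playwright.sync_api import sync_playwright
#       with sync_playwright() as p:
#           page = p.chromium.launch().new_page()
#           page.goto(url)
#           html = page.content()
```

This keeps the expensive browser pool reserved for the pages that actually need it.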
Once the page loads, the extraction engine takes over. The agent parses all <a> tags on the page and filters for links containing your target domain.
For every matched link, it extracts:
- The rel attribute
- The anchor text
- The surrounding text context
- The placement on the page (e.g., footer, body content, sidebar)
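The extraction step can be sketched with the standard library alone. This stdlib parser collects matching anchors with their rel attribute and anchor text; a real crawler would also record the surrounding text and page placement:

```python
from html.parser import HTMLParser

class BacklinkExtractor(HTMLParser):
    """Collects <a> tags whose href contains target_domain."""
    def __init__(self, target_domain: str):
        super().__init__()
        self.target_domain = target_domain
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            if self.target_domain in (d.get("href") or ""):
                self._current = {"href": d["href"], "rel": d.get("rel", ""), "anchor": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["anchor"] += data  # accumulate anchor text

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

parser = BacklinkExtractor("checkforma.com")
parser.feed('<p>Try <a href="https://checkforma.com" rel="nofollow">CheckForma</a> today.</p>')
print(parser.links)
# [{'href': 'https://checkforma.com', 'rel': 'nofollow', 'anchor': 'CheckForma'}]
```

In production you would likely swap in BeautifulSoup or lxml for resilience against malformed markup.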
Now, the hard-coded classification logic kicks in to categorize the link:
- Dofollow: No rel attribute present, OR the rel attribute lacks nofollow, ugc, or sponsored.
- Nofollow: rel="nofollow"
- UGC: rel="ugc"
- Sponsored: rel="sponsored"
This is where the system graduates from a basic scraper to a true AI agent. Instead of merely checking HTML attributes, the AI reasoning layer evaluates the context of the link. By feeding the extracted link data and surrounding text into a Large Language Model (LLM), the agent can score link quality based on:
- Content relevance: Does the surrounding paragraph make sense for your niche?
- Placement evaluation: Is this an editorial link inside a high-quality article, or a spammy directory drop in the footer?
- Link density: Is the page stuffed with thousands of outbound links?
- Anchor text naturalness: Does the anchor text read naturally, or is it overly optimized?
The LLM effectively answers the question: "Is this link editorial and contextually relevant, or a low-value placement?" This is a massive upgrade over traditional backlink checkers.
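The reasoning step boils down to assembling the right context for the model. The rubric and JSON schema below are illustrative assumptions, and the actual completion call is provider-specific (OpenAI, Anthropic, etc.), so only the prompt builder is shown:

```python
import json

def build_scoring_prompt(link: dict, surrounding_text: str) -> str:
    """Assemble the context the LLM needs to judge link quality."""
    return (
        "You are an SEO analyst. Given a backlink and its surrounding text, "
        "score it 0-100 and classify the placement.\n"
        f"Link: {json.dumps(link)}\n"
        f"Context: {surrounding_text}\n"
        "Consider: content relevance, editorial vs. spammy placement, "
        "outbound link density, anchor-text naturalness.\n"
        'Reply as JSON: {"score": <int>, '
        '"placement": "<editorial|directory|comment|footer>"}'
    )

prompt = build_scoring_prompt(
    {"href": "https://checkforma.com", "anchor": "CheckForma", "rel": ""},
    "We compared several form builders; CheckForma stood out for validation.",
)
```

Requesting structured JSON output keeps the downstream database update trivial to parse.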
After processing, each URL in your database is updated with a rich profile:
- Follow status (dofollow / nofollow)
- Link type (editorial, directory, comment, etc.)
- Context classification
- Custom SEO value score (based on the AI reasoning layer)
- Crawl timestamp
- Screenshot (optional, but incredibly powerful for visual audits)
Processing 100 URLs is easy; processing 100,000 requires solid engineering. To scale this infrastructure:
- Run crawler jobs asynchronously.
- Implement smart retry logic for timeouts and 500 errors.
- Always respect robots.txt to maintain ethical scraping practices.
- Utilize proxy rotation to avoid rate limits and IP bans.
- Log errors and JS rendering failures meticulously.
- For large datasets, deploy distributed background workers (e.g., Celery, Redis Queue).
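The retry logic in particular is easy to get wrong. A minimal exponential-backoff wrapper is sketched below with the standard library; in production this would live inside a Celery or RQ task rather than a bare loop, and the flaky stub stands in for a real fetch:

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0):
    """Retry a failing fetch with exponential backoff (1x, 2x, 4x, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the error for logging
            time.sleep(base_delay * 2 ** (attempt - 1))

# Demo with a flaky stub: fails twice (simulated timeouts), then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"

result = fetch_with_retries(flaky, "https://example.com", base_delay=0.01)
print(result)  # <html>ok</html>
```

A refinement worth adding: retry only on timeouts and 5xx responses, and mark 4xx URLs as permanent errors so they don't burn worker time.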
Now just copy-paste this into your AI dev agent to create it.
