
How to Turn AI Sources into an Automated Writing Pipeline
A workflow guide to AI source automation, covering fetching, filtering, deduplication, validation, and topic-card generation for writing systems.
Many AI content workflows start at the wrong end.
They begin with "generate the article automatically" before stabilizing the information flow that feeds the article. The result is predictable:
- too many duplicated links
- too many noisy items and too little prioritization
- mixed-quality sources with weak verification
- articles that feel assembled instead of selected
A better sequence is the reverse:
automate the source flow first, then automate the article flow.
This article uses a public AI newsletter fetcher as a concrete example. It does several things that matter:
- fetches 20+ RSS sources
- fetches Hacker News
- fetches GitHub Trending
- calls a HuggingFace paper fetcher
- filters by AI-related keywords
- validates GitHub results again with README content
Reference:
The core judgment: the bottleneck is usually filtering, not writing
If you already write AI-related WeChat content, the pattern becomes obvious quickly:
- getting a first draft is not the hard part
- filtering a large source pool into items worth writing about is harder
That means the first part worth standardizing is usually not the prose template. It is the upstream information flow.
A practical pipeline should include at least five layers:
- fetching
- filtering
- deduplication
- validation
- topic output
If you only solve layer one, you still end up with noise.
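The five layers above can be sketched as composable stages. This is an illustrative shape, not the reference script's code; every function and field name here is an assumption:

```python
# Illustrative five-layer pipeline: each stage is a plain function over a
# list of item dicts, so stages can be tested and swapped independently.

def fetch():
    # Stand-in for real RSS / Hacker News / GitHub / HuggingFace fetchers.
    return [
        {"title": "New LLM release", "url": "https://a.example/1"},
        {"title": "Gardening tips", "url": "https://b.example/2"},
        {"title": "New LLM release", "url": "https://a.example/1"},  # duplicate
    ]

def filter_ai(items):
    # First-pass keyword screen (layer two).
    keywords = ("ai", "llm", "gpt")
    return [i for i in items if any(k in i["title"].lower() for k in keywords)]

def dedupe(items):
    # URL-level deduplication (layer three).
    seen, out = set(), []
    for i in items:
        if i["url"] not in seen:
            seen.add(i["url"])
            out.append(i)
    return out

def validate(items):
    # Placeholder for body-level checks such as README content (layer four).
    return [i for i in items if i.get("url")]

def to_topics(items):
    # Convert clean entries into topic candidates (layer five).
    return [{"topic": i["title"], "source_url": i["url"]} for i in items]

def run_pipeline():
    return to_topics(validate(dedupe(filter_ai(fetch()))))
```

Keeping each layer a separate function makes the noise problem visible: you can count how many items each stage rejects.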
1. Fetching: do not depend on one interface type
One good design choice in the reference script is that it does not treat all sources as one category.
It separates:
- RSS for stable subscriptions
- Hacker News for community discussion
- GitHub Trending for project discovery
- HuggingFace papers for research input
That matters because the input layer is already multi-channel.
The immediate benefit is structural:
- release news does not get mixed with project momentum
- research updates are not drowned by discussion heat
- later ranking can apply different weights by source class
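One way to keep the input layer multi-channel is a registry that tags each source with its class. The source names and class labels below are illustrative, not taken from the reference script:

```python
# Illustrative source registry: each source carries a class tag so release
# news, community discussion, project momentum, and research stay separate,
# and later ranking can weight by class.
SOURCES = [
    {"name": "OpenAI Blog",        "kind": "rss"},
    {"name": "Hacker News",        "kind": "community"},
    {"name": "GitHub Trending",    "kind": "project"},
    {"name": "HuggingFace Papers", "kind": "research"},
]

def sources_of(kind):
    # Select sources by class, e.g. only research feeds for a papers digest.
    return [s["name"] for s in SOURCES if s["kind"] == kind]
```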
2. Filtering: keep irrelevant material out early
The script uses an AI keyword filter that includes terms such as:
- AI, LLM, GPT, Claude, Agent, RAG
- DeepSeek, Gemini, Llama, MCP
- Embedding, Vector DB, LoRA, vLLM, GGUF
That highlights an important rule:
the point of source automation is not to capture everything. It is to reject obviously irrelevant material early.
But keyword filtering alone is not enough.
It still has two weaknesses:
- some relevant items do not expose the right keywords directly
- some low-value items happen to match the keywords anyway
So keyword filtering works best as a first-pass screen, not a final relevance decision.
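A first-pass screen along those lines can be a single compiled pattern. This is a minimal sketch using the article's keyword list; the word-boundary matching is my addition, chosen to avoid false hits like "maintain" matching "AI":

```python
import re

# First-pass keyword screen over titles and descriptions.
AI_KEYWORDS = ["AI", "LLM", "GPT", "Claude", "Agent", "RAG",
               "DeepSeek", "Gemini", "Llama", "MCP",
               "Embedding", "Vector DB", "LoRA", "vLLM", "GGUF"]

# Word boundaries keep short keywords from matching inside ordinary words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in AI_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def looks_ai_related(text):
    return bool(_PATTERN.search(text))
```

Note that this is deliberately a screen, not a verdict: anything it passes still goes on to deduplication and validation.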
3. Deduplication: otherwise you mistake repetition for value
The Hacker News part of the script deduplicates by HN discussion URL. That is a small detail with real editorial value.
Without deduplication, the system creates a false signal:
- the same event appears in multiple places
- repeated visibility gets mistaken for topic value
- the output starts to look like trend chasing instead of topic selection
A steadier method is to split deduplication into two layers.
1. URL-level deduplication
The same link should only appear once.
2. Topic-level deduplication
The same launch, model update, or project release should have one primary entry plus a few supporting discussions, not a pile of near-duplicates.
That is how a source feed starts to look like editorial input rather than log output.
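The two layers can be sketched as follows. URL normalization and the token-based topic key are both my simplifications; a production system would use fuzzy title matching or embeddings for the topic layer:

```python
from urllib.parse import urlsplit, urlunsplit

# Layer 1: URL-level dedup after normalizing, so tracking parameters and
# fragments do not disguise the same link.
def normalize_url(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

# Layer 2: crude topic-level key from sorted title tokens; this only shows
# the shape of the idea, not a robust similarity measure.
def topic_key(title):
    return " ".join(sorted(title.lower().split()))

def dedupe(items):
    seen_urls, seen_topics, out = set(), set(), []
    for item in items:
        u, t = normalize_url(item["url"]), topic_key(item["title"])
        if u in seen_urls or t in seen_topics:
            continue
        seen_urls.add(u)
        seen_topics.add(t)
        out.append(item)
    return out
```

The first item for a given URL or topic wins, which implicitly makes fetch order an editorial decision: fetch the primary source class first.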
4. Validation: do not confuse "looks like AI" with "is worth tracking"
One of the most useful parts of the reference script is the GitHub validation step:
- check whether the repo name or short description matches AI terms
- fetch the README
- check the README content again
That logic is worth copying.
Many automated pipelines fail because they trust title-level signals too much:
- the name looks relevant
- the description sounds relevant
- yet the actual project still does not matter
README validation is really doing one thing:
do not stop at headline-level relevance. Verify body-level relevance.
The same principle applies beyond GitHub. Blog posts, newsletters, and release notes also benefit from a second-pass validation step.
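The two-pass idea can be sketched as a cheap title-level check followed by a body-level check over README text. The terms, threshold, and function names below are illustrative, and the actual README fetching (HTTP, branch name, rate limits) is deliberately left out:

```python
# Pass 1: headline-level relevance over name and short description.
AI_TERMS = ("llm", "agent", "rag", "embedding", "inference", "transformer")

def title_level_match(name, description):
    text = f"{name} {description}".lower()
    return any(term in text for term in AI_TERMS)

# Pass 2: body-level relevance over README content. Requiring repeated
# hits means a single buzzword mention is not enough.
def body_level_match(readme_text, min_hits=2):
    body = readme_text.lower()
    return sum(body.count(term) for term in AI_TERMS) >= min_hits

def is_worth_tracking(name, description, readme_text):
    return title_level_match(name, description) and body_level_match(readme_text)
```

Splitting the passes also keeps the expensive step (fetching the README) behind the cheap one, so most irrelevant repositories never trigger a second request.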
5. Topic output: do not stop at raw entries
Many pipelines stop once they have clean JSON.
That is not enough for a writing workflow.
A WeChat article does not need raw entries. It needs topic candidates.
The output should answer questions such as:
- what category the item belongs to
- who it matters to
- why it matters now
- whether it should become a brief, a commentary piece, a tutorial, or a selection article
In other words, the raw entry still needs one more transformation into a topic card.
A stronger output structure for writing systems
If the goal is to feed a writing workflow, useful output fields include:
- source
- category
- title
- url
- time
- summary
- why_it_matters
- best_for
- story_angle
The last three fields are the important ones.
why_it_matters
Why this item is worth attention.
best_for
Which audience or account type it is most relevant to.
story_angle
Whether the item fits better as:
- a brief
- a commentary
- a tool recommendation
- a tutorial
- a selection or comparison piece
Once the pipeline produces that layer, it starts serving writing instead of just collection.
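A topic card with those fields might look like the following. The field names follow the article; the dataclass representation and the sample values are my own illustration:

```python
from dataclasses import dataclass, asdict

@dataclass
class TopicCard:
    source: str
    category: str
    title: str
    url: str
    time: str
    summary: str
    why_it_matters: str      # why this item is worth attention
    best_for: str            # which audience or account type it fits
    story_angle: str         # e.g. "brief", "commentary", "tutorial", "selection"

# Hypothetical example of a filled card.
card = TopicCard(
    source="GitHub Trending",
    category="project",
    title="example/agent-kit",
    url="https://github.com/example/agent-kit",
    time="2025-01-01",
    summary="A toolkit for building tool-using agents.",
    why_it_matters="Lowers the barrier to shipping agent workflows.",
    best_for="Developer-focused accounts.",
    story_angle="tool recommendation",
)
```

Serializing cards with `asdict` keeps the writing layer decoupled: the drafting agent only ever sees this shape, never raw feed entries.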
If you want this to work in production, add three more rules
1. Source weighting
For example:
- official blogs should usually outrank second-hand retellings
- GitHub items that pass README validation should outrank name-only matches
- topics with sustained discussion should outrank one-off spikes
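Those three rules translate directly into a scoring function. All the numbers below are placeholder weights, not values from the reference script:

```python
# Base weight by source class: official blogs above second-hand retellings.
BASE_WEIGHT = {"official_blog": 1.0, "research": 0.9, "project": 0.8,
               "community": 0.7, "secondhand": 0.4}

def score(item):
    s = BASE_WEIGHT.get(item["kind"], 0.5)
    if item.get("readme_validated"):         # body-level relevance confirmed
        s += 0.3
    if item.get("discussion_days", 0) >= 2:  # sustained, not a one-off spike
        s += 0.2
    return s

def rank(items):
    return sorted(items, key=score, reverse=True)
```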
2. Time windows
The reference script exposes --hours, which is more important than it first appears.
For WeChat writing, the goal is not always "newest possible."
The goal is often:
- keep items fresh enough to matter
- avoid dragging old items back into the daily queue
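A freshness window in the spirit of `--hours` is a one-line predicate; this implementation is a sketch, and only the flag name comes from the reference script:

```python
from datetime import datetime, timedelta, timezone

def within_window(published_at, hours=24, now=None):
    # Keep items newer than the cutoff; pass `now` explicitly for testing.
    now = now or datetime.now(timezone.utc)
    return now - published_at <= timedelta(hours=hours)
```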
3. Output limits
The script also exposes --limit. This is not just an engineering detail. It is editorial discipline.
If the daily candidate set is too large, the pipeline produces clutter. A better pattern is:
- collect broadly
- shortlist narrowly
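That collect-broadly, shortlist-narrowly pattern is where a `--limit`-style cap belongs: at the end, after scoring, never at fetch time. A minimal sketch, with an assumed `score` field on each item:

```python
def shortlist(items, limit=10):
    # Collect broadly upstream; cap only the final candidate set,
    # keeping the top-N items by score.
    return sorted(items, key=lambda i: i.get("score", 0), reverse=True)[:limit]
```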
A practical path inside the current project context
If this logic is applied to the md2wechat Agent API content flow, the order becomes:
- fetch AI sources
- apply source-type and keyword filtering
- validate projects and articles again
- convert the result into topic cards
- ask the agent to draft the article
- then continue into formatting, drafts, and publishing
That order is much steadier than asking the model to write first and justify later.
Closing thought
Turning AI sources into an automated writing pipeline is not mainly a collection problem. It is an editorial problem moved upstream.
The real goal is not only to fetch content. The goal is to make source material:
- filterable
- verifiable
- rankable
- convertible into topics
Once that part is stable, prompts, drafts, formatting, and publishing become much easier to automate without losing quality.
If you want the adjacent pieces next, continue with:
- obsidian-md2wechat: Convert Obsidian Notes into WeChat Layouts
- feishu-md2wechat: Convert Feishu Docs into WeChat Layouts
- How to Create a WeChat Draft from Generated Content