
How to Turn AI Sources into an Automated Writing Pipeline
A workflow guide to AI source automation, covering fetching, filtering, deduplication, validation, and topic-card generation for writing systems.
Many AI content workflows start at the wrong end.
They begin with "generate the article automatically" before stabilizing the information flow that feeds the article. The result is predictable:
- too many duplicated links
- too many noisy items and too little prioritization
- mixed-quality sources with weak verification
- articles that feel assembled instead of selected
A better sequence is the reverse:
automate the source flow first, then automate the article flow.
This article uses a public AI newsletter fetcher as a concrete example. It does several things that matter:
- fetches 20+ RSS sources
- fetches Hacker News
- fetches GitHub Trending
- calls a HuggingFace paper fetcher
- filters by AI-related keywords
- validates GitHub results again with README content
Reference:
The core judgment: the bottleneck is usually filtering, not writing
If you already write AI-related WeChat content, the pattern becomes obvious quickly:
- getting a first draft is not the hard part
- filtering a large source pool into items worth writing about is harder
That means the first part worth standardizing is usually not the prose template. It is the upstream information flow.
A practical pipeline should include at least five layers:
- fetching
- filtering
- deduplication
- validation
- topic output
If you only solve layer one, you still end up with noise.
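The five layers above can be sketched as composable stages. This is an illustrative shape, not the reference script's code; every function and field name here is an assumption:

```python
# Illustrative five-layer pipeline: each stage is a plain function over a
# list of item dicts, so stages can be tested and swapped independently.

def fetch():
    # Stand-in for real RSS / Hacker News / GitHub / HuggingFace fetchers.
    return [
        {"title": "New LLM release", "url": "https://a.example/1"},
        {"title": "Gardening tips", "url": "https://b.example/2"},
        {"title": "New LLM release", "url": "https://a.example/1"},  # duplicate
    ]

def filter_ai(items):
    # First-pass keyword screen (layer two).
    keywords = ("ai", "llm", "gpt")
    return [i for i in items if any(k in i["title"].lower() for k in keywords)]

def dedupe(items):
    # URL-level deduplication (layer three).
    seen, out = set(), []
    for i in items:
        if i["url"] not in seen:
            seen.add(i["url"])
            out.append(i)
    return out

def validate(items):
    # Placeholder for body-level checks such as README content (layer four).
    return [i for i in items if i.get("url")]

def to_topics(items):
    # Convert clean entries into topic candidates (layer five).
    return [{"topic": i["title"], "source_url": i["url"]} for i in items]

def run_pipeline():
    return to_topics(validate(dedupe(filter_ai(fetch()))))
```

Keeping each layer a separate function makes the noise problem visible: you can count how many items each stage rejects.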
1. Fetching: do not depend on one interface type
One good design choice in the reference script is that it does not treat all sources as one category.
It separates:
- RSS for stable subscriptions
- Hacker News for community discussion
- GitHub Trending for project discovery
- HuggingFace papers for research input
That matters because the input layer is already multi-channel.
The immediate benefit is structural:
- release news does not get mixed with project momentum
- research updates are not drowned by discussion heat
- later ranking can apply different weights by source class
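One way to keep the input layer multi-channel is a registry that tags each source with its class. The source names and class labels below are illustrative, not taken from the reference script:

```python
# Illustrative source registry: each source carries a class tag so release
# news, community discussion, project momentum, and research stay separate,
# and later ranking can weight by class.
SOURCES = [
    {"name": "OpenAI Blog",        "kind": "rss"},
    {"name": "Hacker News",        "kind": "community"},
    {"name": "GitHub Trending",    "kind": "project"},
    {"name": "HuggingFace Papers", "kind": "research"},
]

def sources_of(kind):
    # Select sources by class, e.g. only research feeds for a papers digest.
    return [s["name"] for s in SOURCES if s["kind"] == kind]
```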
2. Filtering: keep irrelevant material out early
The script uses an AI keyword filter that includes terms such as:
- AI, LLM, GPT, Claude, Agent, RAG
- DeepSeek, Gemini, Llama, MCP
- Embedding, Vector DB, LoRA, vLLM, GGUF
That highlights an important rule:
the point of source automation is not to capture everything. It is to reject obviously irrelevant material early.
But keyword filtering alone is not enough.
It still has two weaknesses:
- some relevant items do not expose the right keywords directly
- some low-value items happen to match the keywords anyway
So keyword filtering works best as a first-pass screen, not a final relevance decision.
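A first-pass screen along those lines can be a single compiled pattern. This is a minimal sketch using the article's keyword list; the word-boundary matching is my addition, chosen to avoid false hits like "maintain" matching "AI":

```python
import re

# First-pass keyword screen over titles and descriptions.
AI_KEYWORDS = ["AI", "LLM", "GPT", "Claude", "Agent", "RAG",
               "DeepSeek", "Gemini", "Llama", "MCP",
               "Embedding", "Vector DB", "LoRA", "vLLM", "GGUF"]

# Word boundaries keep short keywords from matching inside ordinary words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in AI_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def looks_ai_related(text):
    return bool(_PATTERN.search(text))
```

Note that this is deliberately a screen, not a verdict: anything it passes still goes on to deduplication and validation.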
3. Deduplication: otherwise you mistake repetition for value
The Hacker News part of the script deduplicates by HN discussion URL. That is a small detail with real editorial value.
Without deduplication, the system creates a false signal:
- the same event appears in multiple places
- repeated visibility gets mistaken for topic value
- the output starts to look like trend chasing instead of topic selection
A steadier method is to split deduplication into two layers.
1. URL-level deduplication
The same link should only appear once.
2. Topic-level deduplication
The same launch, model update, or project release should have one primary entry plus a few supporting discussions, not a pile of near-duplicates.
That is how a source feed starts to look like editorial input rather than log output.
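The two layers can be sketched as follows. URL normalization and the token-based topic key are both my simplifications; a production system would use fuzzy title matching or embeddings for the topic layer:

```python
from urllib.parse import urlsplit, urlunsplit

# Layer 1: URL-level dedup after normalizing, so tracking parameters and
# fragments do not disguise the same link.
def normalize_url(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

# Layer 2: crude topic-level key from sorted title tokens; this only shows
# the shape of the idea, not a robust similarity measure.
def topic_key(title):
    return " ".join(sorted(title.lower().split()))

def dedupe(items):
    seen_urls, seen_topics, out = set(), set(), []
    for item in items:
        u, t = normalize_url(item["url"]), topic_key(item["title"])
        if u in seen_urls or t in seen_topics:
            continue
        seen_urls.add(u)
        seen_topics.add(t)
        out.append(item)
    return out
```

The first item for a given URL or topic wins, which implicitly makes fetch order an editorial decision: fetch the primary source class first.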
4. Validation: do not confuse "looks like AI" with "is worth tracking"
One of the most useful parts of the reference script is the GitHub validation step:
- check whether the repo name or short description matches AI terms
- fetch the README
- check the README content again
That logic is worth copying.
Many automated pipelines fail because they trust title-level signals too much:
- the name looks relevant
- the description sounds relevant
- yet the actual project still does not matter
README validation is really doing one thing:
do not stop at headline-level relevance. Verify body-level relevance.
The same principle applies beyond GitHub. Blog posts, newsletters, and release notes also benefit from a second-pass validation step.
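The two-pass idea can be sketched as a cheap title-level check followed by a body-level check over README text. The terms, threshold, and function names below are illustrative, and the actual README fetching (HTTP, branch name, rate limits) is deliberately left out:

```python
# Pass 1: headline-level relevance over name and short description.
AI_TERMS = ("llm", "agent", "rag", "embedding", "inference", "transformer")

def title_level_match(name, description):
    text = f"{name} {description}".lower()
    return any(term in text for term in AI_TERMS)

# Pass 2: body-level relevance over README content. Requiring repeated
# hits means a single buzzword mention is not enough.
def body_level_match(readme_text, min_hits=2):
    body = readme_text.lower()
    return sum(body.count(term) for term in AI_TERMS) >= min_hits

def is_worth_tracking(name, description, readme_text):
    return title_level_match(name, description) and body_level_match(readme_text)
```

Splitting the passes also keeps the expensive step (fetching the README) behind the cheap one, so most irrelevant repositories never trigger a second request.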
5. Topic output: do not stop at raw entries
Many pipelines stop once they have clean JSON.
That is not enough for a writing workflow.
A WeChat article does not need raw entries. It needs topic candidates.
The output should answer questions such as:
- what category the item belongs to
- who it matters to
- why it matters now
- whether it should become a brief, a commentary piece, a tutorial, or a selection article
In other words, the raw entry still needs one more transformation into a topic card.
A stronger output structure for writing systems
If the goal is to feed a writing workflow, useful output fields include:
- source
- category
- title
- url
- time
- summary
- why_it_matters
- best_for
- story_angle
The last three fields are the important ones.
why_it_matters
Why this item is worth attention.
best_for
Which audience or account type it is most relevant to.
story_angle
Whether the item fits better as:
- a brief
- a commentary
- a tool recommendation
- a tutorial
- a selection or comparison piece
Once the pipeline produces that layer, it starts serving writing instead of just collection.
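A topic card with those fields might look like the following. The field names follow the article; the dataclass representation and the sample values are my own illustration:

```python
from dataclasses import dataclass, asdict

@dataclass
class TopicCard:
    source: str
    category: str
    title: str
    url: str
    time: str
    summary: str
    why_it_matters: str      # why this item is worth attention
    best_for: str            # which audience or account type it fits
    story_angle: str         # e.g. "brief", "commentary", "tutorial", "selection"

# Hypothetical example of a filled card.
card = TopicCard(
    source="GitHub Trending",
    category="project",
    title="example/agent-kit",
    url="https://github.com/example/agent-kit",
    time="2025-01-01",
    summary="A toolkit for building tool-using agents.",
    why_it_matters="Lowers the barrier to shipping agent workflows.",
    best_for="Developer-focused accounts.",
    story_angle="tool recommendation",
)
```

Serializing cards with `asdict` keeps the writing layer decoupled: the drafting agent only ever sees this shape, never raw feed entries.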
If you want this to work in production, add three more rules
1. Source weighting
For example:
- official blogs should usually outrank second-hand retellings
- GitHub items that pass README validation should outrank name-only matches
- topics with sustained discussion should outrank one-off spikes
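Those three rules translate directly into a scoring function. All the numbers below are placeholder weights, not values from the reference script:

```python
# Base weight by source class: official blogs above second-hand retellings.
BASE_WEIGHT = {"official_blog": 1.0, "research": 0.9, "project": 0.8,
               "community": 0.7, "secondhand": 0.4}

def score(item):
    s = BASE_WEIGHT.get(item["kind"], 0.5)
    if item.get("readme_validated"):         # body-level relevance confirmed
        s += 0.3
    if item.get("discussion_days", 0) >= 2:  # sustained, not a one-off spike
        s += 0.2
    return s

def rank(items):
    return sorted(items, key=score, reverse=True)
```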
2. Time windows
The reference script exposes --hours, which is more important than it first appears.
For WeChat writing, the goal is not always "newest possible."
The goal is often:
- keep items fresh enough to matter
- avoid dragging old items back into the daily queue
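A freshness window in the spirit of `--hours` is a one-line predicate; this implementation is a sketch, and only the flag name comes from the reference script:

```python
from datetime import datetime, timedelta, timezone

def within_window(published_at, hours=24, now=None):
    # Keep items newer than the cutoff; pass `now` explicitly for testing.
    now = now or datetime.now(timezone.utc)
    return now - published_at <= timedelta(hours=hours)
```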
3. Output limits
The script also exposes --limit. This is not just an engineering detail. It is editorial discipline.
If the daily candidate set is too large, the pipeline produces clutter. A better pattern is:
- collect broadly
- shortlist narrowly
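That collect-broadly, shortlist-narrowly pattern is where a `--limit`-style cap belongs: at the end, after scoring, never at fetch time. A minimal sketch, with an assumed `score` field on each item:

```python
def shortlist(items, limit=10):
    # Collect broadly upstream; cap only the final candidate set,
    # keeping the top-N items by score.
    return sorted(items, key=lambda i: i.get("score", 0), reverse=True)[:limit]
```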
A practical path inside the current project context
If this logic is applied to the md2wechat Agent API content flow, the order becomes:
- fetch AI sources
- apply source-type and keyword filtering
- validate projects and articles again
- convert the result into topic cards
- ask the agent to draft the article
- then continue into formatting, drafts, and publishing
That order is much steadier than asking the model to write first and justify later.
Closing thought
Turning AI sources into an automated writing pipeline is not mainly a collection problem. It is an editorial problem moved upstream.
The real goal is not only to fetch content. The goal is to make source material:
- filterable
- verifiable
- rankable
- convertible into topics
Once that part is stable, prompts, drafts, formatting, and publishing become much easier to automate without losing quality.
If you want the adjacent pieces next, continue with:
- obsidian-md2wechat: Convert Obsidian Notes into WeChat Layouts
- feishu-md2wechat: Convert Feishu Docs into WeChat Layouts
- How to Create a WeChat Draft from Generated Content