Social Content to Markdown Pipeline 2026

June 16, 2026 · 8 min read · Guide

If you create or consume social content across X, Bluesky, and LinkedIn, you have a data fragmentation problem: threads live inside walled gardens, posts vanish behind rate limits, and your best research is scattered across six tabs.

The fix is a Markdown-first content pipeline. Capture everything as Markdown, process it with standard tools, and publish or archive it anywhere. This guide builds a real pipeline using ThreadGrab for capture, the Markdown ecosystem for transformation, and knowledge-base tools for storage.

TL;DR. Use ThreadGrab to pull X threads, Bluesky posts, and LinkedIn articles as Markdown. Process with standard tools (Pandoc, jq, grep). Route to any destination: Obsidian vault, Notion, newsletter, or LLM dataset. The entire pipeline runs as a single cron job, and the capture step needs no API keys.

Why Markdown Is the Universal Intermediate Format

Markdown occupies a unique position in the content ecosystem: it is human-readable, version-controllable, LLM-friendly, and convertible to virtually any output format. Major note-taking apps (Obsidian, Notion, Logseq), publishing platforms (Substack, Ghost, Dev.to), and AI frameworks (LangChain, LlamaIndex) can all take Markdown as input, either natively or through an import or conversion step.

By making Markdown the intermediate format in your pipeline, you decouple content capture from content consumption. You can archive today, publish next week, and feed into an LLM six months from now — using the same .md files.

Content type	Source	Markdown via ThreadGrab	Destination
X Threads	twitter.com	Yes (profile API)	Obsidian, newsletter
X Articles	x.com/articles	Yes (article API)	LLM training, blog
Bluesky Posts	bsky.app	Yes (AT Protocol)	Archive, research
LinkedIn Newsletter	linkedin.com	Yes (web scrape)	Knowledge base

Step 1: Capture — ThreadGrab as the Universal Collector

ThreadGrab acts as the ingestion layer. A single API endpoint handles all three major platforms:

# Save X Articles as Markdown
curl -s "https://threadgrab.com/api/profile/paulg" \
  | jq -r '.[] | select(.type == "article") | .text' > x-paulg-article.md

# Save Bluesky long-form posts
curl -s "https://threadgrab.com/api/profile/jack.bsky.social" \
  | jq -r '.[] | .text' > bsky-jack.md

# Save LinkedIn newsletter articles
curl -s "https://threadgrab.com/api/profile/jasonxmai-newsletter" \
  | jq -r '.[] | .text' > linkedin-jason.md

No API keys required. ThreadGrab handles authentication, rate limits, and JavaScript rendering transparently. For Bluesky, the AT Protocol is public by default. For X, ThreadGrab rotates proxies to avoid CAPTCHAs. For LinkedIn, it renders the newsletter page server-side.
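
For readers who would rather filter in Python than in jq, the sketch below roughly mirrors the two jq filters used above. It assumes, as those filters do, that the endpoint returns a JSON array of objects with `type` and `text` fields; the function name `extract_texts` and the sample records are illustrative only, not real API output.

```python
import json
from typing import Optional


def extract_texts(payload: str, wanted_type: Optional[str] = None) -> list[str]:
    """Return the `text` field of each record, optionally filtered by `type`.

    Roughly equivalent to `jq -r '.[] | select(.type == "article") | .text'`.
    The response shape (a JSON array of objects) is an assumption.
    """
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return [
        rec.get("text", "")
        for rec in records
        if wanted_type is None or rec.get("type") == wanted_type
    ]


# Hypothetical sample response, used only to exercise the function.
sample = json.dumps([
    {"type": "article", "text": "# Essay\n\nBody."},
    {"type": "post", "text": "Short post."},
])
articles = extract_texts(sample, wanted_type="article")
everything = extract_texts(sample)
```

Raising on a non-array payload makes an error response fail loudly instead of producing an empty Markdown file.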

Step 2: Structure — Organize Your Markdown Archive

Raw Markdown from different platforms needs consistent structure. Use a standard frontmatter schema so every file is self-describing:

---
title: "Article Title"
author: "@username"
platform: "x"  # one of "x", "bluesky", "linkedin"
url: "https://..."
captured: "2026-06-16"
tags: [tech, AI, productivity]
---

## Article Body

Captured via ThreadGrab at https://threadgrab.com

A simple script can post-process raw API output into this format. The jq tool is your friend here — extract fields from the API response and inject them as YAML frontmatter before saving.
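
One way to sketch that post-processing step in Python is shown here. The record keys (`title`, `author`, `url`, `tags`, `text`) are assumed field names, not a documented ThreadGrab response format, and `to_frontmatter_doc` is a hypothetical helper; adjust both to whatever your capture step actually returns.

```python
import json
from datetime import date


def to_frontmatter_doc(record: dict, platform: str, captured: date) -> str:
    """Wrap one captured record in the frontmatter schema shown above.

    The keys read from `record` are assumptions for illustration.
    """
    if platform not in {"x", "bluesky", "linkedin"}:
        raise ValueError(f"unknown platform: {platform}")

    def q(value: str) -> str:
        # A JSON string is also a valid YAML double-quoted scalar, so
        # titles containing quotes or colons stay parseable.
        return json.dumps(value, ensure_ascii=False)

    tags = ", ".join(q(t) for t in record.get("tags", []))
    lines = [
        "---",
        f"title: {q(record.get('title', 'Untitled'))}",
        f"author: {q(record.get('author', ''))}",
        f"platform: {q(platform)}",
        f"url: {q(record.get('url', ''))}",
        f"captured: {q(captured.isoformat())}",
        f"tags: [{tags}]",
        "---",
        "",
        record.get("text", "").strip(),
        "",
    ]
    return "\n".join(lines)


# Hypothetical record, used only to show the output shape.
doc = to_frontmatter_doc(
    {
        "title": 'On "Taste"',
        "author": "@example",
        "url": "https://example.com/p/1",
        "tags": ["tech"],
        "text": "## Article Body\n\nHello.",
    },
    platform="x",
    captured=date(2026, 6, 16),
)
```

Quoting every value through `json.dumps` avoids hand-escaping, at the cost of quoting tags that the schema above leaves bare.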

Step 3: Route — Send Markdown Anywhere

Once your content is structured Markdown, the routing options are endless:

To a Knowledge Base (Obsidian / Notion)

Obsidian reads a local directory of .md files directly. Point it at your archive folder. For Notion, use Notion's Markdown import or the API:

# Sync Markdown vault to Notion (one-way)
# Uses notion-md-sync, a lightweight Python tool
pip install notion-md-sync
notion-md-sync --input ~/archive/social-content/ \
  --notion-database YOUR_DATABASE_ID

To a Newsletter (Substack / LinkedIn / Ghost)

Most newsletter platforms accept Markdown directly or HTML converted from it:

# Convert Markdown to HTML for newsletter pasting
pandoc article.md -f markdown -t html -o article.html

# Ghost: rather than a CLI import, create the post from the converted HTML
# through the Ghost Admin API (POST /ghost/api/admin/posts/?source=html).
# This route needs an Admin API key.

To an LLM Training Dataset

Structured Markdown files are excellent training data because they preserve content hierarchy:

# Concatenate a week of captures into a single training file
cat ~/archive/social-content/*.md > training-data-2026-06-week3.md

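# Write one JSONL record per captured file
python3 -c "
import glob, json, os
# glob does not expand '~', so resolve the home directory first
pattern = os.path.expanduser('~/archive/social-content/*.md')
for f in sorted(glob.glob(pattern)):
    with open(f, encoding='utf-8') as fh:
        print(json.dumps({'text': fh.read(), 'source': f}))
" > training-2026-06-week3.jsonl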
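
The command above writes one record per file. If you want smaller records that follow the content hierarchy, one possible approach is to split each file at its headings. The sketch below is a simplification under stated assumptions: `chunk_by_heading` is a hypothetical helper, it only recognizes `#`-style headings, and it does not special-case `#` lines inside code blocks or frontmatter.

```python
import json
import re


def chunk_by_heading(markdown: str, source: str) -> list[dict]:
    """Split a Markdown document into one record per heading-delimited section.

    Text before the first heading becomes its own chunk.
    """
    sections: list[list[str]] = [[]]
    for line in markdown.splitlines():
        # A line starting with 1-6 '#' characters and a space opens a new section.
        if re.match(r"#{1,6}\s", line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    records: list[dict] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        if text:  # drop sections that contain only blank lines
            records.append({"text": text, "source": source, "chunk": len(records)})
    return records


# Hypothetical input, used only to show the chunk boundaries.
sample = "Intro line.\n\n# Title\n\nPara.\n\n## Sub\n\nMore."
records = chunk_by_heading(sample, "x-example.md")
jsonl = "\n".join(json.dumps(r) for r in records)
```

Each record keeps its `source` and a running `chunk` index, so a chunk can be traced back to the file it came from.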

The Complete Pipeline: One Cron Job

Here is the full pipeline as a daily cron job. It captures content from all three platforms and routes it to both an Obsidian vault and a newsletter draft folder; the frontmatter step from Step 2 is omitted to keep the script short:

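#!/bin/bash
# daily-social-content-pipeline.sh
# Run daily at 07:00 via cron
set -o pipefail  # a failed curl makes the whole curl | jq pipeline fail

TODAY="$(date +%Y-%m-%d)"  # computed once so every path uses the same date
OUTPUT_DIR="$HOME/archive/social-content/$TODAY"
VAULT_INBOX="$HOME/obsidian-vault/inbox"
DRAFT_DIR="$HOME/newsletter-drafts"
mkdir -p "$OUTPUT_DIR" "$VAULT_INBOX" "$DRAFT_DIR"

# Fetch one profile, apply a jq filter, and save the result.
# On failure, warn and drop the partial file instead of archiving it.
capture() {
  local profile="$1" filter="$2" out="$3"
  if ! curl -fsS "https://threadgrab.com/api/profile/$profile" \
      | jq -r "$filter" > "$out"; then
    echo "WARN: capture failed for $profile" >&2
    rm -f "$out"
  fi
}

# Step 1: Capture from X
echo "=== Capturing X Articles ==="
for user in paulg kelseyhightower levelsio; do
  capture "$user" '.[] | select(.type == "article") | .text' \
    "$OUTPUT_DIR/x-$user-article.md"
done

# Step 2: Capture from Bluesky
echo "=== Capturing Bluesky Posts ==="
for user in jack.bsky.social tante.bsky.social; do
  capture "$user" '.[] | .text' "$OUTPUT_DIR/bsky-$user.md"
done

# Step 3: Capture from LinkedIn Newsletter
echo "=== Capturing LinkedIn Newsletters ==="
capture "paulg-newsletter" '.[] | .text' "$OUTPUT_DIR/linkedin-paulg.md"

# Stop here if nothing was captured
shopt -s nullglob
files=("$OUTPUT_DIR"/*.md)
if [ "${#files[@]}" -eq 0 ]; then
  echo "No files captured; nothing to route" >&2
  exit 1
fi

# Step 4: Route to Obsidian vault
# Prefix the date so same-named captures from earlier days are not overwritten
echo "=== Syncing to Obsidian ==="
for f in "${files[@]}"; do
  cp "$f" "$VAULT_INBOX/$TODAY-$(basename "$f")"
done

# Step 5: Build the newsletter draft (a local digest file)
echo "=== Building newsletter draft ==="
cat "${files[@]}" > "$DRAFT_DIR/daily-digest-$TODAY.md"

echo "Pipeline complete: ${#files[@]} files archived"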
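
The digest step joins files with plain `cat`. If the captured files carry frontmatter as in Step 2, the result has several `---` blocks in the middle of one document and no marker showing where each source starts. A possible alternative, sketched here with hypothetical helpers `strip_frontmatter` and `build_digest`, drops each file's frontmatter and adds a heading per source file.

```python
def strip_frontmatter(markdown: str) -> str:
    """Drop a leading `---` ... `---` frontmatter block, if present."""
    lines = markdown.splitlines()
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[i + 1:]).lstrip("\n")
    # No frontmatter, or an unterminated block: leave the text untouched.
    return markdown


def build_digest(files: dict[str, str], day: str) -> str:
    """Join captured files into one digest with a heading per source file."""
    parts = [f"# Daily digest {day}"]
    for name in sorted(files):
        body = strip_frontmatter(files[name]).strip()
        if body:  # skip empty captures
            parts.append(f"## {name}\n\n{body}")
    return "\n\n".join(parts) + "\n"


# Hypothetical file contents keyed by file name.
files = {
    "x-example.md": '---\ntitle: "A"\n---\n\nFirst body.',
    "bsky-example.md": "Second body.\n",
    "empty.md": "",
}
digest = build_digest(files, "2026-06-16")
```

In the cron script, the dictionary would be filled by reading each file in the day's output directory before the draft is written.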

Platform-Specific Considerations

Platform	Rate Limits	Best Capture Method	Ideal Destination
X (Twitter)	~100 views/15min/IP	ThreadGrab proxy rotation	Daily newsletter draft
Bluesky	Generous (AT Protocol)	Direct API + ThreadGrab	LLM training dataset
LinkedIn	Moderate	ThreadGrab server-side render	Research archive

FAQ

Do I need an API key for any of these platforms?

No. ThreadGrab handles X scraping without requiring an X API subscription. Bluesky's AT Protocol is fully public. LinkedIn newsletters are captured via server-side rendering — no OAuth needed.

Can I run this pipeline on a headless server?

Yes. The entire pipeline runs in a shell script with curl and jq. No browser, no GUI, no interactive authentication. Schedule it with cron and forget it.

What if a post is deleted from the original platform?

Your local Markdown copy is permanent. The pipeline captures a snapshot; once saved, deletion on the source has no effect on your archive. This is the core advantage of a local-first pipeline.

How much disk space does a daily archive use?

Approximately 50-200 KB per day for 10-15 posts. At the upper end that is 200 KB × 365 ≈ 73 MB, so a year of daily archives fits in under 100 MB. Markdown is extremely space-efficient.

Can I feed captured Markdown directly into an LLM prompt?

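Yes. Most LLMs handle Markdown well, since headings and lists carry the document's structure into the prompt. You can pipe a captured file straight into a command-line client, for example the llm CLI (the model name available to you depends on which plugins are installed): cat article.md | llm -m claude-sonnet-4 "summarize this"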

Start building your Markdown content pipeline today. Capture, structure, and route your social content in one workflow.

Markdown First, Platform Agnostic

The social content landscape will keep changing: new platforms emerge, APIs change, rate limits tighten. But Markdown stays the same. By building a Markdown-first pipeline, you future-proof your content archive against platform lock-in. ThreadGrab handles the capture layer; your Markdown vault is the permanent record.

Start with one platform, one user, one cron job. Add more sources as you go. The pipeline scales horizontally: more users, more platforms, more destinations. The only constant is Markdown.