Microsoft markitdown: Social Content to Markdown Workflow
In June 2026, Microsoft open-sourced markitdown — a Python library that converts Office documents (DOCX, PPTX, XLSX), PDFs, HTML pages, and even images into clean Markdown. For social content creators, this is the missing piece in the content pipeline puzzle.
You already have ThreadGrab to capture X threads, Bluesky posts, and LinkedIn newsletters as Markdown. Now you can convert your offline content — slide decks, spreadsheets, PDF reports, and meeting notes — into the same unified format.
TL;DR. ThreadGrab handles social content capture. Microsoft markitdown handles Office and file conversion. Together they create a complete content-to-Markdown pipeline. Use ThreadGrab for X, Bluesky, and LinkedIn. Use markitdown for DOCX, PPTX, XLSX, PDF, HTML, and images. Both output clean Markdown for any downstream use — knowledge bases, AI training, newsletters.
What Is Microsoft markitdown?
Markitdown is a Python library released on GitHub by Microsoft under an MIT license. It takes a file path or URL and returns Markdown. Behind the scenes, it picks a parser based on the input format:
| Input Format | Engine | Output Quality |
|---|---|---|
| DOCX (Word) | python-docx | Excellent — preserves headings, bold, lists |
| PPTX (PowerPoint) | python-pptx | Good — extracts slide notes and text |
| XLSX (Excel) | openpyxl | Good — outputs tables as Markdown tables |
| PDF | pypdf / pdfminer | Good — extracts text with basic structure |
| HTML | html2text / readability | Excellent — strips styling, preserves hierarchy |
| Images (JPG, PNG) | OCR (pytesseract) | Basic — extracts visible text |
| ZIP / Archives | Built-in extraction | Recursively processes contained files |
pip install 'markitdown[all]'   # the [all] extra installs the optional per-format dependencies
# Basic usage — convert a file to Markdown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
# Convert a web page
result = md.convert("https://example.com/report")
print(result.text_content)
Why Markitdown Matters for Social Content
Your content workflow spans multiple sources: social threads, email newsletters, slide decks, research papers, and meeting notes. Each lives in a different format. Without a unified pipeline, you waste hours copying and formatting across tools.
| Content Source | Conversion Tool | Output | Downstream Use |
|---|---|---|---|
| X Threads & Articles | ThreadGrab API | Markdown | Blog, newsletter, archive |
| Bluesky Posts | ThreadGrab API | Markdown | Knowledge base, LLM training |
| LinkedIn Newsletters | ThreadGrab API | Markdown | Research archive |
| Slide Decks (PPTX) | Markitdown | Markdown | Obsidian notes, blog drafts |
| PDF Reports | Markitdown | Markdown | LLM context, research |
| Spreadsheets (XLSX) | Markitdown | Markdown tables | Comparison articles |
| Web Pages (HTML) | Markitdown | Markdown | Content research, archiving |
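The table above amounts to a routing rule: social URLs go to ThreadGrab, while files and other web pages go to markitdown. A minimal sketch of that rule follows. The function name `pick_tool`, the host list, and the extension set are illustrative assumptions, not part of either tool.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed hosts for the social platforms named in the table (illustrative only).
SOCIAL_HOSTS = {"x.com", "twitter.com", "bsky.app", "linkedin.com"}
# File types the article routes to markitdown.
FILE_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".pdf", ".html", ".jpg", ".png", ".zip"}


def pick_tool(source: str) -> str:
    """Return "threadgrab" or "markitdown" for a URL or file path."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Social platforms go to ThreadGrab; any other web page is treated as HTML.
        return "threadgrab" if host in SOCIAL_HOSTS else "markitdown"
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in FILE_EXTENSIONS:
        return "markitdown"
    raise ValueError(f"no converter known for {source!r}")
```

Raising on unknown inputs, rather than guessing, keeps unsupported files from silently entering the archive.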
Building the Combined Pipeline
#!/usr/bin/env python3
# unified-content-pipeline.py
# Combines ThreadGrab (social) + markitdown (files)
import subprocess
from pathlib import Path

from markitdown import MarkItDown

OUTPUT = Path.home() / "content-archive" / "2026-06-19"
OUTPUT.mkdir(parents=True, exist_ok=True)
md_converter = MarkItDown()

# Step 1: Social content via ThreadGrab
social_users = ["paulg", "jack.bsky.social", "levelsio"]
for user in social_users:
    url = f"https://threadgrab.com/api/profile/{user}"
    # -f makes curl exit non-zero on HTTP errors instead of saving an error page
    result = subprocess.run(
        ["curl", "-sf", url], capture_output=True, text=True
    )
    if result.returncode == 0 and result.stdout:
        path = OUTPUT / f"social-{user.replace('.', '-')}.md"
        path.write_text(result.stdout, encoding="utf-8")
        print(f"Captured: {user}")
    else:
        print(f"Skipped: {user}")

# Step 2: File conversion via markitdown
file_sources = ["meeting.pptx", "research.pdf", "budget.xlsx"]
for filepath in file_sources:
    if Path(filepath).exists():
        result = md_converter.convert(filepath)
        out_name = Path(filepath).stem + ".md"
        (OUTPUT / out_name).write_text(result.text_content, encoding="utf-8")
        print(f"Converted: {filepath}")

print("Pipeline complete.")
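One gap in a script like this is that a single unreadable file raises an exception and stops the whole run. The sketch below isolates failures per file and reports them at the end. `convert_all` is a hypothetical helper, not part of markitdown; it accepts any callable that maps a path to Markdown text, so the real converter can be passed in.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    sources: Iterable[str],
    out_dir: Path,
    to_markdown: Callable[[str], str],
) -> tuple[list[Path], dict[str, str]]:
    """Convert each source; return (written files, {source: error message})."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    failed: dict[str, str] = {}
    for source in sources:
        try:
            text = to_markdown(source)
        except Exception as exc:  # keep going and record the failure
            failed[source] = f"{type(exc).__name__}: {exc}"
            continue
        target = out_dir / (Path(source).stem + ".md")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failed
```

With the pipeline's own names, a call might look like `convert_all(file_sources, OUTPUT, lambda p: md_converter.convert(p).text_content)`.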
Real-World Use Cases
1. Research Knowledge Base
Capture X threads on a topic, convert related PDF reports and slide decks with markitdown, and store everything as Markdown in an Obsidian vault. The unified Markdown format lets you search, tag, and link across both social and offline sources.
# Import all Markdown files into Obsidian
cp ~/content-archive/2026-06-19/*.md ~/obsidian-vault/inbox/
# Now Obsidian treats social posts and PDF extracts
# as first-class notes with full search and backlinks
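Tagging across sources is easier when each note records where it came from. Obsidian reads YAML front matter at the top of a note, so one option is to prepend a small block before copying files into the vault. This is a sketch under assumptions: `add_front_matter` and the `source` field are hypothetical names chosen for illustration.

```python
import json


def add_front_matter(markdown: str, source: str, tags: list[str]) -> str:
    """Prepend a YAML front matter block unless the note already has one."""
    if markdown.startswith("---\n"):
        return markdown
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines = ["---", f"source: {json.dumps(source)}", "tags:"]
    lines += [f"  - {json.dumps(tag)}" for tag in tags]
    lines += ["---", ""]
    return "\n".join(lines) + "\n" + markdown
```

Because the function leaves notes that already start with front matter untouched, it can be run repeatedly over the same archive.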
2. Newsletter Content Factory
Use ThreadGrab to capture social content and markitdown to convert internal documents. Combine everything into a weekly newsletter:
# Assemble a newsletter from mixed sources
cat ~/content-archive/2026-06-19/social-*.md > draft.md
cat ~/content-archive/2026-06-19/meeting.md >> draft.md
# Convert to HTML for newsletter paste
pandoc draft.md -o draft.html
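Plain concatenation has two rough edges: a file without a trailing newline runs straight into the next one, and nothing marks where one source ends. A Python sketch of the same assembly step, with a horizontal rule between sections, could look like this. `assemble_draft` is a hypothetical helper name.

```python
from pathlib import Path


def assemble_draft(parts: list[Path], out_file: Path) -> str:
    """Join Markdown files into one draft, separated by horizontal rules."""
    sections = []
    for part in parts:
        body = part.read_text(encoding="utf-8").strip()
        if body:  # skip empty captures
            sections.append(body)
    draft = "\n\n---\n\n".join(sections) + "\n"
    out_file.write_text(draft, encoding="utf-8")
    return draft
```

The resulting file can be passed to pandoc in the same way as a draft built with cat.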
3. LLM Training Dataset
Build a training dataset from curated social content and supplementary documents.
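As a rough illustration, the archive of Markdown files can be flattened into JSON Lines, one record per file. The record fields (`source`, `kind`, `text`) and the helper name `build_jsonl` are assumptions made for this sketch; the right schema depends on the training tooling you use.

```python
import json
from pathlib import Path


def build_jsonl(archive_dir: Path, out_file: Path) -> int:
    """Write one JSON object per Markdown file; return the number of records."""
    count = 0
    with out_file.open("w", encoding="utf-8") as fh:
        for path in sorted(archive_dir.glob("*.md")):
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip empty files
            # Files named "social-*" are assumed to come from the ThreadGrab step.
            kind = "social" if path.name.startswith("social-") else "document"
            record = {"source": path.name, "kind": kind, "text": text}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the source file name in each record makes it possible to trace a training example back to the thread or document it came from.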
FAQ
Markitdown needs no account, API key, or cloud service. It is an MIT-licensed open-source library that you install with pip and run locally.
For images, markitdown extracts text using OCR but does not preserve image positions. For most workflows, the extracted text is sufficient.
ThreadGrab and markitdown work together. The typical workflow is side-by-side: ThreadGrab for capture, markitdown for conversion, then merge the outputs.
Markitdown can be used on business documents. Microsoft publishes it as an open-source project rather than a supported product, so test it on your own files with complex formatting or multi-language content before relying on it.
Start building your unified content pipeline today.
Three Tools, One Pipeline
Your content lives in many places: social platforms, slide decks, PDF reports, spreadsheets, and web pages. ThreadGrab handles the social capture layer. Microsoft markitdown handles the file conversion layer. Markdown is the common language that ties them together.
Install markitdown today — pip install markitdown — and start converting your offline files into the same universal format.