
Tensorlake

@tensorlake.ai

41 Followers · 7 Following · 56 Posts · Joined 07.11.2024

Latest posts by Tensorlake @tensorlake.ai

Amazing!!! We always want you to try it out with the "hardest" documents - So glad to hear it worked!

The goal of an effective document parser SHOULD be to help with the hardest problems. A low-quality scan of a Norwegian document with statistical tables from 1926 is the perfect test doc 🔥

06.11.2025 00:10 👍 1 🔁 0 💬 0 📌 0

The Tensorlake playground was, unlike AWS Textract and every other tool I have tried, able to parse my angled, low-quality scan of Norwegian pay statistics from 1926.

Not that 1926 Norwegian statistical tables are a generally useful benchmark…

05.11.2025 19:52 👍 3 🔁 2 💬 1 📌 0
Benchmarking the Most Reliable Document Parsing API | Tensorlake Learn how Tensorlake built the most reliable document parsing API by measuring what actually matters: structural preservation, reading order accuracy, and downstream usability. See benchmark results c...

We published:
✓ Full methodology
✓ Corrected OCRBench v2 ground truth
✓ Comparative analysis across all major providers

Read the full benchmark: tlake.link/benchmarks

Stop benchmarking vanity metrics. Start measuring what breaks.

05.11.2025 17:05 👍 0 🔁 0 💬 0 📌 0

The results were clear:

Tensorlake: 86.8% TEDS, 91.7% F1
AWS Textract: 80.7% TEDS, 88.4% F1
Azure: 78.1% TEDS, 88.1% F1
Docling: 63.8% TEDS, 68.9% F1

The gap? 670 fewer manual reviews per 10k documents.

05.11.2025 17:05 👍 0 🔁 0 💬 1 📌 0

We evaluated on OCRBench v2, OmniDocBench, and 100 real enterprise docs using two metrics that predict production success:

TEDS (Tree-Edit-Distance-based Similarity): Measures if tables stay tables
JSON F1: Measures if downstream systems can use the output

Not "is the text similar?" but "can automation actually work?"

05.11.2025 17:05 👍 0 🔁 0 💬 1 📌 0
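As a rough illustration of the JSON F1 idea in the post above (this is a sketch of the metric's logic, not Tensorlake's benchmark code), field-level F1 can be computed by flattening predicted and ground-truth extractions into path/value pairs:

```python
# Hedged sketch of a field-level JSON F1 metric: flatten both JSON
# objects into (path, value) pairs, then score precision/recall/F1.
# Illustrative only; not the actual benchmark implementation.

def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for a nested dict/list."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield (prefix.rstrip("."), obj)

def json_f1(pred, gold):
    """F1 over exact (path, value) matches between two JSON objects."""
    p, g = set(flatten(pred)), set(flatten(gold))
    if not p or not g:
        return 0.0
    tp = len(p & g)
    precision, recall = tp / len(p), tp / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"invoice": {"total": "120.00", "currency": "EUR"}}
pred = {"invoice": {"total": "120.00", "currency": "USD"}}
print(json_f1(pred, gold))  # one of two fields matches -> 0.5
```

This is why the metric "measures if downstream systems can use the output": a wrong value in one field costs that field, even when the raw text is nearly identical.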

Traditional benchmarks test on clean PDFs and measure text accuracy.

But your production failures come from:
- Collapsed tables
- Jumbled reading order
- Missing visual content
- Hallucinated extractions

None of this shows up in your scores.

05.11.2025 17:05 👍 0 🔁 0 💬 1 📌 0
Two dense document pages flank a skeptical person’s sticker-style portrait against a green gradient, link text centered below.

Document parsing benchmarks have been measuring the wrong thing.

We tested every major parser on real enterprise documents.

The results will change how you think about OCR accuracy 🧵

05.11.2025 17:05 👍 3 🔁 3 💬 1 📌 3
Promotional banner for the Qdrant Essentials Course featuring Tensorlake. Text reads: ‘Improve collection querying with knowledge graphs.’ On the left is the Qdrant logo and course title; on the right is a smiling woman with long curly brown hair wearing a cream-colored top, set against a purple gradient grid background.

Want to build scalable data lakes w/ Tensorlake + @qdrant.bsky.social?

In the free Qdrant Essentials Course, learn how to:
- Architect vector-powered data lakes
- Optimize ETL pipelines
- Create knowledge graphs
- Integrate @langchain.bsky.social agents for natural language queries

t.co/OoPZswrL7z

23.10.2025 19:37 👍 2 🔁 3 💬 0 📌 1
New: Vision Language Models for Document Processing Tensorlake now uses Vision Language Models (VLMs) across multiple features including page classification, figure/table summarization, and structured extraction, enabling faster and more intelligent do...

Try it yourself with our SEC filing analysis notebook:
tlake.link/notebooks/vl...

Shows how to extract cryptocurrency metrics from 10-Ks and 10-Qs using page classification

Full changelog: tlake.link/changelog/vlm

What would you build with this?

16.10.2025 21:44 👍 0 🔁 0 💬 0 📌 0

Where we leverage VLM support:
📄 Page Classification: Large docs, specific sections needed
📊 Table/Figure Summarization: Visual data in reports
⚡ skip_ocr=True: When reading order is complex and for diagrams and scanned docs

Text extraction still uses OCR for best quality

16.10.2025 21:44 👍 0 🔁 0 💬 1 📌 0

Real results from analyzing 8 SEC filings:
- 1,500+ total pages
- 427 relevant pages identified by VLM
- Processing time: 5 minutes → 45 seconds per document

All without sacrificing accuracy

16.10.2025 21:44 👍 0 🔁 0 💬 1 📌 0

Our solution: VLMs understand document structure visually
Example: Extracting crypto holdings from SEC filings
1. VLM classifies which pages contain financial data (~50 out of 200 pages)
2. Extract only from relevant pages
3. Skip 70% of processing

Result: 80-90% faster ⚡

16.10.2025 21:44 👍 0 🔁 0 💬 1 📌 0
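The classify-then-extract pattern above can be sketched generically. Here `classify_page` is a toy keyword stand-in for the VLM page classifier, and the page list is invented; this is the shape of the workflow, not the Tensorlake SDK:

```python
# Sketch of the classify-then-extract pattern: run a cheap classifier
# over every page, then do expensive extraction only on the hits.
# classify_page is a stand-in for a real VLM call.

def classify_page(page_text: str) -> bool:
    """Toy classifier: flag pages that mention financial keywords."""
    keywords = ("digital assets", "crypto", "fair value")
    return any(k in page_text.lower() for k in keywords)

def relevant_pages(pages: list[str]) -> list[int]:
    """Return indices of pages worth running full extraction on."""
    return [i for i, text in enumerate(pages) if classify_page(text)]

pages = [
    "Item 1. Business overview ...",
    "The company holds digital assets measured at fair value ...",
    "Exhibit index ...",
    "Crypto holdings increased during the quarter ...",
]
hits = relevant_pages(pages)
print(hits)  # pages 1 and 3 are classified as relevant
print(f"skipped {100 * (1 - len(hits) / len(pages)):.0f}% of pages")
```

Swapping the keyword check for a real classifier keeps the structure identical: the savings come entirely from never extracting the irrelevant pages.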

The problem: Processing 200-page documents when you only need specific information is slow and expensive

Traditional approach:
OCR everything → Convert to text → Search → Extract

This wastes time processing irrelevant pages

16.10.2025 21:44 👍 0 🔁 0 💬 1 📌 0

New: Vision Language Models now power key document processing features
We're using VLMs for:
- Page classification in large documents
- Table/figure summarization
- Fast structured extraction (skip_ocr mode)

Here's what this means for document processing 🧵

16.10.2025 21:44 👍 1 🔁 1 💬 1 📌 0
Tensorlake Transform Data Into Knowledge

The company I work for, @tensorlake.ai, is hiring a couple of roles remote within the US: tensorlake.ai/careers

You might be a great fit if you like working with Rust, Python, K8s, me?, and you enjoy building products for developers.

14.10.2025 22:21 👍 8 🔁 2 💬 0 📌 0

Build approval workflows that trigger on specific feedback. Extract complete edit history for regulatory compliance. Route documents based on flagged sections, all programmatically.

Live now in our API, SDK, and Cloud.

10.10.2025 17:25 👍 1 🔁 0 💬 0 📌 0

Now you can parse .docx files with tracked changes preserved as clean, structured HTML:
- <del> tags for deletions
- <ins> tags for insertions
- <span class="comment"> for reviewer notes

10.10.2025 17:25 👍 1 🔁 0 💬 1 📌 0
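Output in the `<del>`/`<ins>`/`<span class="comment">` form described above can be consumed with nothing but the standard library. A minimal sketch (the sample HTML is invented for illustration):

```python
# Sketch: pull insertions, deletions, and reviewer comments out of
# HTML that uses <del>, <ins>, and <span class="comment"> markup.
# Uses only the Python standard library.
from html.parser import HTMLParser

class TrackedChanges(HTMLParser):
    def __init__(self):
        super().__init__()
        self.edits = {"ins": [], "del": [], "comment": []}
        self._bucket = None  # which list the next text node belongs to

    def handle_starttag(self, tag, attrs):
        if tag in ("ins", "del"):
            self._bucket = tag
        elif tag == "span" and ("class", "comment") in attrs:
            self._bucket = "comment"

    def handle_endtag(self, tag):
        self._bucket = None

    def handle_data(self, data):
        if self._bucket:
            self.edits[self._bucket].append(data)

parser = TrackedChanges()
parser.feed(
    'Claim amount <del>$5,000</del><ins>$7,500</ins> '
    '<span class="comment">Verify with adjuster</span>'
)
print(parser.edits["ins"])      # ['$7,500']
print(parser.edits["comment"])  # ['Verify with adjuster']
```

From here, routing on flagged edits is a dictionary lookup: if `edits["del"]` touches a monetary amount, send the document to review.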
Tensorlake interface showing parsed Word document with tracked changes preserved as HTML tags, displaying an insurance claim report

Most parsers strip all tracked changes when you extract the text.

That means:
❌ Lost audit trails
❌ Manual review of revision history
❌ No programmatic access to reviewer comments
❌ Workflows that can't route based on specific edits

10.10.2025 17:25 👍 1 🔁 1 💬 1 📌 0
Try it in Colab No description

Perfect for:
→ RAG pipelines (better chunking)
→ Knowledge graphs (accurate trees)
→ Document navigation
→ Table of contents generation

Changelog: tlake.link/changelog/he...
Try it: tlake.link/notebooks/he...

02.10.2025 16:21 👍 0 🔁 0 💬 0 📌 0

Every section header now returns:
- level: 0 for #, 1 for ##, 2 for ###, etc.
- content: clean text
- proper nesting for up to 6 levels

Enable with:
cross_page_header_detection=True

That's it.

02.10.2025 16:21 👍 0 🔁 0 💬 1 📌 0
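Given headers in the `level`/`content` shape described above, building a nested table of contents is a few lines. A sketch, assuming a simplified `(level, content)` tuple form rather than the exact SDK response:

```python
# Sketch: turn a flat list of (level, content) headers, as produced
# with header correction enabled, into an indented table of contents.
# The input shape here is illustrative, not the exact API response.

def toc(headers: list[tuple[int, str]]) -> str:
    """level 0 = '#', level 1 = '##', etc.; indent two spaces per level."""
    return "\n".join("  " * level + text for level, text in headers)

headers = [
    (0, "1 Introduction"),
    (1, "1.1 Background"),
    (0, "2 Methods"),
    (1, "2.1 Data"),
    (1, "2.2 Models"),
]
print(toc(headers))
```

This is exactly where corrected levels pay off: with 2.2 misclassified as a top-level header, the rendered tree puts it outside section 2.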

Tensorlake analyzes numbering patterns (1, 1.1, 1.2) and visual structure across the ENTIRE document.

Then corrects misidentified header levels automatically.

Works even when headers span page breaks.

02.10.2025 16:21 👍 0 🔁 0 💬 1 📌 0
Comparison of document header detection. Left side "Just OCR" shows incorrect hierarchy with section 2.2 at wrong indent level. Right side "Header Correction" shows proper nesting where 2.2 is correctly indented under section 2. Bottom shows Python code: doc_ai.parse_and_wait() with cross_page_header_detection=True parameter. Green gradient background with Tensorlake logo.

OCR engines constantly mess up document hierarchy.

Section 2.2 becomes a top-level header (##) instead of nested (###).

We just shipped automatic header correction.

🧵 How it works:

02.10.2025 16:21 👍 2 🔁 1 💬 1 📌 1
Citation-Aware RAG: How to add Fine Grained Citations in Retrieval and Response Synthesis | Tensorlake Learn how to build citation-aware RAG systems that link AI responses back to exact source locations in documents. This technical guide covers document parsing with spatial metadata, chunking strategie...

There's no reason your applications should not be citation-ready.

Dive deeper and try out the Colab notebook linked at the bottom of the blog

19.09.2025 17:44 👍 0 🔁 0 💬 0 📌 0
RAG citation workflow diagram on dark green background showing document processing pipeline: Document (PDF/Image) → Tensorlake Document AI → Parsed Elements (Text, Tables, Figures, and Bounding Box) → merge and insert anchors → Chunks and Anchors (Clean text and citation IDs) → splits to Citation Metadata (page, bounding box, citation IDs) and Vector DB (embeddings, text, and metadata). URL: https://tlake.link/blog/rag-citations

Step 3: Generating AI responses with verifiable citations

Once your chunks carry anchors, retrieval doesn't change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors in the prose shown to users while keeping them in the raw output, with the citation IDs rendered as clickable links.

19.09.2025 17:44 👍 0 🔁 0 💬 1 📌 0
Before and after comparison of document chunking on dark green background. Top panel "Without Contextualized Chunking" shows plain text: "SMOTE creates a broader decision region for the minority class...". Bottom panel "With Contextualized Chunking" shows same text with citation anchor "<c>2.1</c>" and metadata: {"2.1": {"page": 23, "bbox": {...}}}. URL: https://tlake.link/blog/rag-citations

Step 2: Create contextualized chunks

Iterate through page fragment objects and create appropriately sized chunks by combining them. As you create the chunks, you can create contextualized metadata to help during retrieval.

19.09.2025 17:44 👍 0 🔁 0 💬 1 📌 0
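The chunking step above can be sketched as follows. The fragment dicts (page, bbox, content) and the `<c>` anchor format mirror the pattern in this thread, but the exact field names are illustrative assumptions, not the real API response:

```python
# Sketch: combine page fragments into chunks with <c> citation anchors
# and a side table of citation metadata (page + bounding box).
# Fragment shape and anchor IDs are illustrative assumptions.

def build_chunks(fragments, max_chars=200):
    chunks, metadata, current, size = [], {}, [], 0
    for i, frag in enumerate(fragments):
        anchor = f"c{i}"
        metadata[anchor] = {"page": frag["page"], "bbox": frag["bbox"]}
        piece = f'{frag["content"]} <c>{anchor}</c>'
        if size + len(piece) > max_chars and current:
            chunks.append(" ".join(current))  # flush the full chunk
            current, size = [], 0
        current.append(piece)
        size += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks, metadata

fragments = [
    {"page": 23, "bbox": [50, 100, 500, 140],
     "content": "SMOTE creates a broader decision region."},
    {"page": 23, "bbox": [50, 150, 500, 190],
     "content": "The minority class gains synthetic samples."},
]
chunks, meta = build_chunks(fragments)
print(chunks[0])
print(meta["c0"]["page"])  # 23
```

The chunk text (with anchors) goes into the vector DB; the metadata table stays alongside it, so a cited anchor resolves back to a page number and bounding box at answer time.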
Tensorlake Document AI interface showing document layout analysis with JSON output on left displaying fragment types, content, and bounding box coordinates, and PDF preview on right with highlighted text regions and yellow bounding boxes overlaid on research paper content

Step 1: Parse docs with bounding boxes

Using our Document AI API you get the full document layout. For each page fragment you have access to the page number, fragment type, content, and bounding box, making it easy to add metadata and anchor points to chunks before embedding.

19.09.2025 17:44 👍 0 🔁 0 💬 1 📌 0

Citations.

When users ask "where did this come from?" your system should point to the exact page fragment...not just "file_name.pdf".

We built citation-aware RAG with spatial metadata:
→ Parse docs with bounding boxes
→ Embed citation anchors in chunks
→ Return page numbers + coordinates

A 🧵

19.09.2025 17:44 👍 1 🔁 1 💬 1 📌 1

Job update: a couple of weeks ago, I joined @tensorlake.ai full time. I’m having a lot of fun building the product with @diptanu.bsky.social and the rest of this wonderful team.

We have a few open positions if you’d like to work with us: www.linkedin.com/jobs/search/...

15.09.2025 19:29 👍 8 🔁 4 💬 1 📌 1
Parse and Retrieve Dense Tables Accurately with Tensorlake | Tensorlake Learn how Tensorlake preserves structure in dense, multi-page tables—returning DataFrames with summaries and bounding boxes for accurate, explainable retrieval.

Trust in retrieval comes from evidence. Tensorlake ties every lookup back to the original table cell.

Read the blog and try it out for yourself 👇

11.09.2025 20:00 👍 0 🔁 0 💬 0 📌 0
Side-by-side comparison of a dense healthcare data table in PDF format and its structured DataFrame output. A green background with the Tensorlake logo shows an arrow pointing from the PDF to the DataFrame. The caption reads “Parse Dense Tables Reliably” with the link “tlake.link/blog/dense-tables” at the bottom.

In finance, clinical trials, or performance benchmarks, dense tables contain mission-critical data.

But flatten that data like most parsers do and trust is lost.

Tensorlake restores trust by preserving structure, generating summaries for effective embeddings, and attaching evidence via bounding boxes.

11.09.2025 20:00 👍 1 🔁 1 💬 1 📌 0