
How I Used Document AI and Python to Turn 197 PDF Pages into Clean Data
October 8, 2025 · 6 min read
Many organizations struggle to turn stacks of government or business PDFs into usable Excel files. The task often involves inconsistent layouts, broken tables, and hours of mind-numbing copy-paste work. If you've ever been told to "get them into Excel," you know the pain.
This isn't just inefficient; it's prone to errors that can ruin your analysis.
In a recent project, I faced this exact challenge: transforming 197 pages of unstructured EMS service PDFs into a clean, searchable Excel dataset. I didn't just brute-force it. I used a mix of automation, document AI, and careful data cleaning to deliver a complete package, including separate contact sheets and formatted summaries that were ready for immediate use.
This post walks through the phase-by-phase approach I used and the tools that made it possible. The client received comprehensive, ready-to-use files that saved them time and made their data analysis more accurate, which in turn supported faster decision-making.

Phase 1: Data Extraction — Capturing the Raw Text
The first goal was simple: capture every possible field/value pair from all 197 pages. I needed a tool smarter than a simple screen scraper.
I built a custom Python workflow connected to Google Document AI (Form Parser). This tool is designed to understand document structure, recognizing field labels, text blocks, and table cells even in low-quality or inconsistent documents. For example, it identified "987654" as the "Service Code" and "BLS NON EMERGENCY ONLY" as the "Highest Level of Service" for one record.

💡 Crucial Tip: Always preserve the raw extracted data first. You can always refine and clean it later, but you can’t recover what wasn't initially captured. This raw data log is your safety net.
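One simple way to keep that safety net is an append-only JSON Lines file, written before any cleaning happens. The sketch below assumes that format; the file names and record fields are illustrative choices, not the project's actual log layout.

```python
import json
import tempfile
from pathlib import Path


def append_raw(log_path, source_file, page, label, value):
    """Append one extracted pair to the log exactly as it was captured."""
    record = {"source": source_file, "page": page, "label": label, "value": value}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_raw(log_path):
    """Load every record back in the order it was written."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory; "ems_services.pdf" is a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "raw_pairs.jsonl"
    append_raw(log, "ems_services.pdf", 1, "Service Code", "987654")
    append_raw(log, "ems_services.pdf", 1, "Highest Level of Service",
               "BLS NON EMERGENCY ONLY")
    records = read_raw(log)
```

Because the file is only ever appended to, later cleaning steps can read from it freely without any risk of overwriting what was captured.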

Phase 2 - Cleaning & Normalization — Making Sense of the Mess
Once the client confirmed the data's initial accuracy, the real work began. AI is great, but it's never perfect, especially with legacy documents. During this phase, we collected client feedback to flag errors and ambiguities, then discussed and resolved each discrepancy so the final output met the expected standards. Involving the client in quality assurance built trust and meant issues were addressed quickly. The cleaning itself focused on standardizing the structure.

The entire process was logged step by step. If any correction looked suspicious, I could trace it back to the original raw data and quickly roll back the change.
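A change log like this can be sketched in a few lines of Python. The version below is an assumption about how such logging could work, not the project's script: each cleaning rule records the before and after value of every cell it touches, so a single rule can be undone later. The `Change` class, rule name, and helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    row: int
    column: str
    before: str
    after: str
    rule: str


def normalize_whitespace(value):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(value.split())


def apply_rule(rows, column, rule_name, fn, log):
    """Apply fn to one column and record every value that actually changed."""
    for i, row in enumerate(rows):
        before = row.get(column)
        if before is None:
            continue
        after = fn(before)
        if after != before:
            log.append(Change(i, column, before, after, rule_name))
            row[column] = after


def rollback(rows, log, rule_name):
    """Restore the pre-rule value of every cell the named rule changed."""
    for change in reversed(log):
        if change.rule == rule_name:
            rows[change.row][change.column] = change.before
    log[:] = [c for c in log if c.rule != rule_name]


rows = [{"Highest Level of Service": "  BLS  NON EMERGENCY ONLY "}]
log = []
apply_rule(rows, "Highest Level of Service", "normalize_whitespace",
           normalize_whitespace, log)
cleaned = rows[0]["Highest Level of Service"]

rollback(rows, log, "normalize_whitespace")
restored = rows[0]["Highest Level of Service"]
```

This simple rollback is only safe when rules are undone in reverse order of application; if a later rule touched the same cell, undoing an earlier one would discard the later change too.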

Phase 3 - Merging & Structuring — Building the Usable Database
After cleaning, the data still lived in a sparse structure with multiple rows per record. To transform it into a usable database, I built a script that merged split fields, reconstructed name pairs, and attached missing details like medical directors and license numbers. In the final result, each EMS service is represented as a single, complete record with no duplicates and no missing context. This phase produced the clean, row-based dataset that became the project's primary deliverable.
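The merge step might look something like the sketch below. It assumes the service code uniquely identifies a record and that names arrive split across "First Name" and "Last Name" columns; both are guesses made for illustration, and the contact name is a placeholder.

```python
from collections import defaultdict


def merge_sparse_rows(rows, key):
    """Collapse rows sharing `key` into one record; first non-empty value wins."""
    merged = defaultdict(dict)  # dicts keep insertion order, so output is stable
    for row in rows:
        record = merged[row[key]]
        for column, value in row.items():
            if value not in (None, "") and column not in record:
                record[column] = value
    return list(merged.values())


def join_name(record, first="First Name", last="Last Name", target="Contact Name"):
    """Rebuild a full name when both halves are present (column names assumed)."""
    if record.get(first) and record.get(last):
        record[target] = f"{record[first]} {record[last]}"
    return record


# Three sparse rows describing the same service; "Jane Doe" is a placeholder.
sparse = [
    {"Service Code": "987654", "Highest Level of Service": "BLS NON EMERGENCY ONLY",
     "First Name": "", "Last Name": ""},
    {"Service Code": "987654", "Highest Level of Service": "",
     "First Name": "Jane", "Last Name": ""},
    {"Service Code": "987654", "Highest Level of Service": "",
     "First Name": "", "Last Name": "Doe"},
]
records = [join_name(r) for r in merge_sparse_rows(sparse, "Service Code")]
```

"First non-empty value wins" is the simplest possible conflict policy. If two rows disagree on a field, a real pipeline would likely want to flag the conflict rather than silently keep the first value.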

Phase 4 - Formatting & Delivery — Professional Presentation
Data isn't truly finished until it’s presented well. Our final step was focused on making the work product professional and immediately usable, which involved several layers of presentation refinement.
This started with usability: we auto-fitted columns and applied a standard font across all sheets for readability.
We also documented the work by including a dedicated README sheet that summarized key project metrics, such as the total service count and contact count, and outlined the automated process used.
The final package was delivered as a clean ZIP file containing the Master Tracker (.xlsx), individual Contact Workbooks, and the process README (.txt). The result was a final deliverable of 141 fully structured service records across 8 distinct EMS providers, all packaged in a single, searchable file ready for immediate analysis.
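For readers who want to try the formatting and packaging steps, here is a hedged sketch. It assumes `openpyxl` for the workbook (the post does not say which library was used), estimates column widths from the longest cell in each column, and bundles files with the standard-library `zipfile` module. The font, padding, and file names are arbitrary choices.

```python
import tempfile
import zipfile
from pathlib import Path


def column_widths(rows, padding=2, max_width=60):
    """Width per column: longest cell text plus padding, capped at max_width."""
    widths = []
    for row in rows:
        for i, cell in enumerate(row):
            length = len(str(cell)) if cell is not None else 0
            if i >= len(widths):
                widths.append(0)
            widths[i] = max(widths[i], length)
    return [min(w + padding, max_width) for w in widths]


def write_workbook(path, rows, font_name="Calibri"):
    """Write rows to an .xlsx sheet with fitted columns and one uniform font."""
    # Lazy import: openpyxl is a third-party package (pip install openpyxl).
    from openpyxl import Workbook
    from openpyxl.styles import Font
    from openpyxl.utils import get_column_letter

    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)
    for i, width in enumerate(column_widths(rows), start=1):
        ws.column_dimensions[get_column_letter(i)].width = width
    for row_cells in ws.iter_rows():
        for cell in row_cells:
            cell.font = Font(name=font_name, size=11)
    wb.save(path)


def package(paths, zip_path):
    """Bundle the deliverables into one ZIP, stored flat by file name."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=Path(p).name)


table = [
    ["Service Code", "Highest Level of Service"],
    ["987654", "BLS NON EMERGENCY ONLY"],
]
widths = column_widths(table)

# Package a stand-in README to show the ZIP step without needing openpyxl.
with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.txt"
    readme.write_text("Automated extraction summary\n", encoding="utf-8")
    bundle = Path(tmp) / "deliverable.zip"
    package([readme], bundle)
    with zipfile.ZipFile(bundle) as zf:
        names = zf.namelist()
```

Character counts are only an approximation of rendered width, so proportional fonts may still need a little extra padding.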

Key Takeaways - Discipline and Transparency
Working on this automation project proved to be a masterclass in the value of discipline and consistency. We quickly learned that predictable data structures, starting with clear column names and uniform layouts, were non-negotiable. This foundation helped the entire automation run smoothly and saved us significant time by preventing the kind of one-off errors that often require hours of manual fixing.
We also prioritized detailed logging, recognizing that it’s the backbone of a maintainable system. Comprehensive logs made it far easier to track changes, debug issues, and streamline both internal updates and external client reviews.
A major structural lesson was adopting a modular approach. Breaking the automation into distinct, manageable phases (Extract, Clean, Merge, and Format) made the overall complexity significantly easier to handle than trying to build a single, sprawling script.
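In code, that modular shape can be as simple as a list of named phase functions run in order. The sketch below uses toy stand-ins for each phase; the function names and the `on_phase` hook are hypothetical and not taken from the project.

```python
def run_pipeline(data, phases, on_phase=None):
    """Run each (name, function) phase in order, passing results forward."""
    for name, phase in phases:
        data = phase(data)
        if on_phase:
            on_phase(name, data)  # e.g. log a preview or write a checkpoint
    return data


# Toy stand-ins for the four phases (illustration only).
def extract(_):
    return [{"Service Code": " 987654 "}]


def clean(rows):
    return [{k: v.strip() for k, v in row.items()} for row in rows]


def merge(rows):
    return rows  # nothing to merge in this one-row example


def fmt(rows):
    header = list(rows[0])
    return [header] + [list(row.values()) for row in rows]


seen = []
result = run_pipeline(
    None,
    [("extract", extract), ("clean", clean), ("merge", merge), ("format", fmt)],
    on_phase=lambda name, data: seen.append(name),
)
```

Keeping each phase as a plain function means any one of them can be rerun or tested on its own, using the previous phase's saved output as input.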
Finally, we found that transparency builds trust. By proactively sharing diagnostic logs and data previews with our clients, we provided them with a clearer window into the process, which solidified their confidence in the final, automated results.

Wrap-Up - Turn Repetitive Work into Repeatable Systems
This project is a powerful example of how small, consistent automations can transform hours of manual, repetitive work into a scalable, repeatable system. The difference between a copy-paste job and a clean, AI-structured dataset is the difference between a high-value analyst and a data entry clerk.
If your organization still handles manual document entry, or if you want to integrate AI extraction into your existing workflow to create robust data solutions, let's talk.
You can automate many repetitive office tasks with powerful tools like Microsoft Power Automate, which is user-friendly and integrates with hundreds of apps like Outlook, SharePoint, Teams, and Excel.
Power Automate is excellent for managing emails and files, streamlining approval processes, and automating reports and notifications. It's a key tool for saving time and boosting productivity with little to no coding.