Converting PDF into XML can be maddening. Here’s how to escape messy files and build structured, usable documents that actually work.

At Yonder, we specialize in digital documentation.

Awwww, digital documentation. How nerdy.

Perhaps, but it remains highly relevant in today’s world.

Because, contrary to popular belief, most companies haven’t digitized their documentation yet. Seriously.

Think about it. Digital documentation started some 20 years ago with PDFs, which added little value over paper documents other than eliminating their weight.

And that’s where many companies left the digitization journey. They still think they are on a full-digital track, but they are just administrators of huge piles of outdated and dumb PDF documents.

Forms of Digital Crap

Don’t believe me? Time for some real-life examples.

The biggest worry of our prospective customers is the amount of effort required to migrate their legacy documents into our documentation software. This is because we strictly use the XML format for documents in our system. Importing documents that are already structured, such as XML files from Airbus or Boeing, is easy. Similarly, importing properly laid out Microsoft Word files is easy.

But what about all the unstructured PDF documents out there? Some of those documents are scanned paper. Some are in a two-column layout, and others have approval stamps from an authority. None of them have working links or a dynamically generated table of contents. All of these features are straightforward to handle in the XML world, but getting from PDF to XML is a cumbersome and laborious task. Often, it involves much more than just data transformation: It means making documentation user-centric, not just compliant.
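To make the table-of-contents point concrete, here is a minimal sketch in Python. It assumes a made-up XML layout with nested `section` elements, `id` attributes, and `title` children; none of these names come from our system or from any real schema. Once the structure exists, a linked table of contents is a short tree walk:

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; element and attribute names are made up.
SOURCE = """
<document>
  <section id="s1"><title>General</title>
    <section id="s1-1"><title>Scope</title></section>
  </section>
  <section id="s2"><title>Procedures</title></section>
</document>
"""


def build_toc(element, depth=0):
    """Walk nested <section> elements and collect (depth, id, title) entries."""
    entries = []
    for section in element.findall("section"):  # direct children only
        entries.append((depth, section.get("id"), section.findtext("title")))
        entries.extend(build_toc(section, depth + 1))
    return entries


toc = build_toc(ET.fromstring(SOURCE))
for depth, section_id, title in toc:
    # Indent by nesting depth and show the link target next to the title.
    print("  " * depth + f"{title} -> #{section_id}")
```

The same walk over a PDF is not possible, because a PDF does not record which lines are section titles or how sections nest.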

Let’s increase the level of difficulty: Regulations. Most of our customers operate in regulated environments. As a consequence, they need reliable access to a wide range of regulations. But guess what? Most of those regulations are only available in PDF format, and many of them haven’t been updated for years. Converting regulations into a structured format is cumbersome and error-prone. Yet you need structured regulations if you want to properly link them to your internal documents and keep up with changed regulations.

Some of our competitors still only allow exports from their system in PDF format. It looks like they see it as a moat to defend against customer churn. We follow a different approach: Our customers can export their documents themselves in both PDF and XML format, because we believe that the future of digital documentation is structured. It’s the structure of their documentation that companies struggle with, not the content.

Ways Out of the Trap

What can you do?

We advocate a three-fold approach: You need to combine human, deterministic, and statistical methods to properly structure documents.

Sounds difficult and technical? Let’s break it down.

Let’s start with the human element. When a human documentation specialist looks at a PDF document, it’s easy to determine whether the document is neat or messy. It’s also easy to spot where in the document the titles, the bullet point lists, and the images are. But it’s not so easy for humans to convert the PDF into XML — unless you love re-typing lengthy documents or have a fetish for the copy-paste function (disclaimer: our team has done both, due to lack of alternatives).

That’s where the deterministic element comes in. In plain terms, it means using a rules-based piece of software to help convert that PDF document into XML. We often call these pieces of software transformations. There is a plethora of frameworks out there that help you deal with the various degrees of messiness in PDF documents. But there is no one-size-fits-all solution, because there are so many different PDFs out there. That’s why the deterministic element doesn’t do the trick alone, not even when you combine it with human experience.
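What does a rules-based transformation look like? Here is a minimal sketch in Python, not our actual transformation. It assumes the text has already been extracted from the PDF line by line, and the two rules and the element names (`document`, `title`, `list`, `item`, `para`) are hypothetical choices for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: each pattern decides which XML element a line becomes.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")  # e.g. "2.1 Fuel Policy"
BULLET = re.compile(r"^[-•*]\s+(.+)$")              # e.g. "- Check fuel"


def lines_to_xml(lines):
    """Convert already-extracted text lines into a small XML tree."""
    root = ET.Element("document")
    current_list = None  # the open <list>, if the previous line was a bullet
    for raw in lines:
        line = raw.strip()
        if not line:
            current_list = None
            continue
        heading = HEADING.match(line)
        bullet = BULLET.match(line)
        if heading:
            current_list = None
            title = ET.SubElement(root, "title", number=heading.group(1))
            title.text = heading.group(2)
        elif bullet:
            if current_list is None:
                current_list = ET.SubElement(root, "list")
            ET.SubElement(current_list, "item").text = bullet.group(1)
        else:
            current_list = None
            ET.SubElement(root, "para").text = line
    return root


sample = [
    "1 General",
    "Crew must complete the following checks:",
    "- Fuel quantity",
    "- Weather briefing",
    "1.1 Exceptions",
    "None.",
]
print(ET.tostring(lines_to_xml(sample), encoding="unicode"))
```

The sketch also shows why rules alone fall short: a paragraph that happens to start with a number, such as "3 engines were inspected", would be mistaken for a title, and every new PDF layout needs its own set of rules.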

Enter the statistical element. Although there are so many different PDFs out there, there are similarities between many PDF documents: Think of an authority issuing hundreds of different regulations, all published in the same PDF format. Once you know where the relevant elements are in one of those documents, you likely know where they are in all the other documents from that authority. AI models excel at such tasks.
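To illustrate the idea of learning from one document and reusing that knowledge on its siblings, here is a deliberately tiny sketch in Python. It is a nearest-centroid classifier standing in for a real AI model, and it assumes each line comes with two made-up features (font size and left indent, both in points) that some earlier extraction step would have to provide:

```python
from collections import defaultdict
from statistics import mean


def learn_profile(labelled_lines):
    """Average the features per label from one hand-labelled document."""
    grouped = defaultdict(list)
    for features, label in labelled_lines:
        grouped[label].append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(features, profile):
    """Return the label whose learned centroid is closest to the features."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(profile, key=lambda label: distance(profile[label]))


# Hypothetical features per line: (font size in points, left indent in points).
labelled = [
    ((16.0, 72.0), "title"),
    ((15.5, 72.0), "title"),
    ((10.0, 72.0), "para"),
    ((10.0, 74.0), "para"),
    ((10.0, 90.0), "item"),
    ((10.0, 92.0), "item"),
]
profile = learn_profile(labelled)

# Lines from another document published by the same (hypothetical) authority.
print(classify((16.0, 71.0), profile))  # title
print(classify((10.0, 91.0), profile))  # item
print(classify((10.0, 73.0), profile))  # para
```

A human labels one document, the profile is learned from it, and every other document in the same layout can then be classified without writing new rules.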

Conclusion

It’s like working in a team: Solutions are better when you combine different skills.

So what are you waiting for? Combine human, deterministic, and statistical methods to finally take the leap into the world of fully digital documentation.