Introducing XML Hacking in Microsoft Office

Created: Wednesday, May 13, 2020, at 9:30 am

By John Korchok

With the introduction of Office 2007, Microsoft changed the basic file format that underlies Word, PowerPoint, and Excel. Instead of the proprietary and mostly undocumented format that ruled from Office 97 to Office 2003, Microsoft made a smart decision and switched to XML. This is tagged text, similar in structure and concept to HTML code with which you may already be familiar.

XML opens up a world of possibilities for automated document construction, but that’s a topic for another day. The everyday relevance is that if a Word or PowerPoint file isn’t doing what you need it to do and there are no tools in the program for the job, we can now dive in and edit the file ourselves. If you’re a point-and-click user, this is probably not thrilling. But if you’re a hacker at heart, a midnight coder, or just a curious tinkerer, you can do some cool stuff.

Inside a simple Word file. The document text is stored in XML

The main tool you’re going to need is a text editor. You can get away with Notepad or TextEdit for a while, but those simple text editors lack the tools that get the job done efficiently. On Mac, I use BBEdit and on Windows, I reach for Notepad++. BBEdit is reasonably priced shareware and Notepad++ is freeware. They have a similar style of operation, so if you’re a cross-platform hacker it’s easy to switch between them. Notepad++ uses a plugin system, so you can add tools. For this job, you’re definitely going to want the free XML plugin.

Word, Excel, and PowerPoint files in the new format are actually simple ZIP files with a different file extension. Getting into them couldn’t be easier: if you’re using Windows, append .zip to the end of the file name (work on a copy if the file is at all important). You’ll get a warning from your OS, but you know what you’re doing! Now unzip it. Out pop several folders of XML, plus a top-level file or two.
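
If you would rather script this step than rename files by hand, here is a minimal sketch in Python using only the standard library. The function name unpack_office_file and the file names in the usage note are my own placeholders, not part of any Office tooling. Python's zipfile module ignores the file extension, so no renaming is needed, and extracting into a separate folder leaves the original file untouched.

```python
import zipfile
from pathlib import Path


def unpack_office_file(office_path, out_dir):
    """Extract a .docx, .pptx, or .xlsx file into out_dir and return its part names.

    Hypothetical helper for illustration only.
    """
    office_path = Path(office_path)
    out_dir = Path(out_dir)
    if not zipfile.is_zipfile(office_path):
        # Pre-2007 formats (.doc, .ppt, .xls) are not ZIP packages.
        raise ValueError(f"{office_path} is not a ZIP-based Office file")
    with zipfile.ZipFile(office_path) as package:
        # The source file is only read here, never modified.
        package.extractall(out_dir)
        return package.namelist()
```

A call such as unpack_office_file("Sample.pptx", "Sample_unzipped") should leave you with the same folders of XML you would get from renaming and unzipping manually.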

For macOS Users

macOS requires somewhat more care with handling expanded Office files, or they won’t open. Please see this article for the best procedure on a Mac.

Select one of the files and open it in your text editor. All the files have been linearized to minimize file size. This is where your XML tools come into play.

  • In Notepad++, choose Plugins | XML Tools | Pretty Print (XML Only – with line breaks).
  • If you’re using BBEdit, choose Markup | Tidy | Reflow Document.

Now you have a nicely indented, easy-to-read page to edit. When you’re done, it’s not necessary to re-linearize. Word, PowerPoint, or Excel will do that for you later.
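
If you don't have either editor handy, the same reformatting can be approximated with a short script. This is a rough sketch using Python's built-in xml.dom.minidom, not what the editor plugins do internally. The function name pretty_print_xml is a placeholder of mine. One assumption to be aware of: the rewritten file gets a plain XML declaration, so anything extra in the original declaration is not carried over. Try it on a copy first.

```python
from xml.dom import minidom


def pretty_print_xml(xml_path):
    """Rewrite an XML file in place with line breaks and two-space indentation.

    Hypothetical helper for illustration only.
    """
    dom = minidom.parse(str(xml_path))
    # With an encoding argument, toprettyxml returns bytes ready to write out.
    # Elements that contain only text are kept on a single line.
    pretty = dom.toprettyxml(indent="  ", encoding="UTF-8")
    with open(xml_path, "wb") as handle:
        handle.write(pretty)
```

For example, pretty_print_xml("Sample_unzipped/word/document.xml") would reformat the main document part of an unzipped Word file, where the path is again just a placeholder.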

A plain vanilla PowerPoint file: more complex than Word.

For people using Windows’ built-in ZIP utility, there is an easy mistake to watch out for. By default, unzipping a file in Windows creates a new folder named for the file being expanded. If you include this top-level folder when you’re re-assembling the file, PowerPoint may raise an error about unreadable content in the presentation. To avoid this error, first open the folder that Windows created. Select the _rels, docProps, and ppt folders, plus the [Content_Types].xml file, then create a ZIP file from them and change its extension back to .pptx.
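
The same rule, zip the contents rather than the containing folder, can be sketched in Python. This is only an illustration with the standard library; repack_office_file and the example paths are names I made up. It stores every path relative to the unzipped folder, so _rels, docProps, ppt, and [Content_Types].xml end up at the top level of the archive. Write the output file somewhere outside the unzipped folder.

```python
import zipfile
from pathlib import Path


def repack_office_file(unzipped_dir, office_path):
    """Zip the contents of unzipped_dir (not the folder itself) into office_path.

    Hypothetical helper for illustration only.
    """
    root = Path(unzipped_dir)
    if not (root / "[Content_Types].xml").is_file():
        # If this file is not directly inside root, root is probably the wrong level.
        raise ValueError(f"[Content_Types].xml not found directly inside {root}")
    # Collect the file list before creating the archive.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with zipfile.ZipFile(office_path, "w", zipfile.ZIP_DEFLATED) as package:
        for path in files:
            # Paths are stored relative to root, so no top-level wrapper folder is added.
            package.write(path, path.relative_to(root).as_posix())
```

Calling repack_office_file("Sample_unzipped", "Sample_edited.pptx") writes the archive directly with its final extension, so there is no rename step afterward.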

XML hacking is useful for Excel or Word when you want to add additional color themes or when you need to rescue a corrupt document. But it really shines with PowerPoint, where you can create custom table formats, add extra custom colors that don’t fit into a theme, set the default text size for tables and charts, and much more. This technique separates the PowerPoint pros from the wannabes.

Check out text editors and XML tools so you’re ready to hack! If you want to learn more about how to put some of this into practice, read my next article on XML Hacking: Default Table Text.

John Korchok
John Korchok has been creating reliable branded Office templates and web sites for more than 20 years. He is certified as a Microsoft Office Specialist Master, is an award-winning technical writer and is skilled in programming VBA, JavaScript for PDF and web, HTML, CSS, and PHP. John is a moderator for Microsoft Answers and a volunteer at Stack Overflow, providing answers for Word and PowerPoint for Windows and macOS. He currently works for Brandwares in metropolitan New York.

The views and opinions expressed in this blog post or content are those of the authors or the interviewees and do not necessarily reflect the official policy or position of any other agency, organization, employer, or company.


Microsoft and the Office logo are trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries.


© 2000-2023, Geetesh Bajaj - All rights reserved.
