Regex: find every HTML tag in a document

The following Regular Expression (regex) finds all HTML tags in a document:

<[^<>]+>
Screenshot of the regular expression used in Visual Studio Code

How is finding all HTML tags in a document useful?

Use case example

Let’s say you receive a bunch of documents from a client. They have been exported from some outdated CMS, and they’re full of old (and perhaps invalid) HTML tags. The content is great and timeless, but the documents need a good cleaning, followed by some new formatting (typography). Perhaps you want to reformat them using Markdown (.md).

It would suck to have to remove all the HTML manually, right? Even with a typical Find and Replace it would be tedious, because there are so many different HTML tags.

By running the regular expression <[^<>]+> inside your text editor or IDE you can find all the HTML tags at once. You can then immediately replace them by clicking on the Replace All button in your editor.
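If you'd rather try the pattern outside an editor first, here is a minimal Python sketch of the same find-and-replace. The sample string is made up for illustration, and the names TAG_PATTERN and sample are my own, not something your editor uses.

```python
import re

# The pattern from this article: "<", then one or more characters
# that are neither "<" nor ">", then ">".
TAG_PATTERN = re.compile(r"<[^<>]+>")

# Hypothetical sample input, invented for this example.
sample = '<div class="old"><p>Great, <b>timeless</b> content.</p></div>'

# "Find": list every tag the pattern matches.
tags = TAG_PATTERN.findall(sample)
print(tags)
# ['<div class="old">', '<p>', '<b>', '</b>', '</p>', '</div>']

# "Replace All" with an empty string.
cleaned = TAG_PATTERN.sub("", sample)
print(cleaned)
# Great, timeless content.
```

Note that the pattern matches anything that merely looks like a tag, so it's worth skimming the list of matches before replacing.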

Video demo:

In the demonstration above you see me doing the following:

  • Open VSCode’s search bar
  • Paste the regular expression <[^<>]+>
  • Enable the Use Regular Expression function
  • Hit enter to begin searching
  • Replace all the HTML tags with an empty string
  • Select all my text, right-click, and reformat it (not necessary, I just did it for the looks).
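If you have many files, the steps above can be roughly sketched as a small Python script instead of clicking through the editor. This is only a sketch under my own assumptions: the function name strip_tags_in_file and the .bak suffix are hypothetical choices, and the files are assumed to be UTF-8 text.

```python
import re
import shutil
from pathlib import Path

# Same pattern as in the editor's search bar.
TAG_PATTERN = re.compile(r"<[^<>]+>")


def strip_tags_in_file(path: Path) -> Path:
    """Copy `path` to `<name>.bak`, then remove every tag match in place.

    Returns the path of the backup copy.
    """
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the untouched original next to the file

    text = path.read_text(encoding="utf-8")  # assumption: files are UTF-8
    path.write_text(TAG_PATTERN.sub("", text), encoding="utf-8")
    return backup
```

You could call it as strip_tags_in_file(Path("page.html")), where page.html stands in for one of your own documents.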

Be careful: always make a backup of any file before you make big changes to it. The documents you’re cleaning may contain valuable URLs inside their tags (in links, for example), and those will be removed too if you replace the HTML tags with an empty string, as in the example above.
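One possible way to keep those URLs is to convert links to Markdown before stripping everything else. The sketch below is not part of the demo: LINK_PATTERN and strip_tags_keep_links are hypothetical names, and it assumes links are written as <a ... href="URL">text</a> with a double-quoted href. Anything more unusual is better handled by a real HTML parser.

```python
import re

# The pattern from this article.
TAG_PATTERN = re.compile(r"<[^<>]+>")

# Assumption: every link has a double-quoted href attribute and a closing </a>.
LINK_PATTERN = re.compile(
    r'<a\b[^<>]*?\bhref\s*=\s*"([^"<>]*)"[^<>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def _link_to_markdown(match: re.Match) -> str:
    url = match.group(1)
    # The link text may itself contain tags such as <b>; strip those too.
    text = TAG_PATTERN.sub("", match.group(2))
    return f"[{text}]({url})"


def strip_tags_keep_links(html: str) -> str:
    """Turn links into Markdown first, then remove all remaining tags."""
    with_links = LINK_PATTERN.sub(_link_to_markdown, html)
    return TAG_PATTERN.sub("", with_links)


# Hypothetical sample input.
sample = '<p>See <a class="x" href="https://example.com/a">the <b>docs</b></a>.</p>'
print(strip_tags_keep_links(sample))
# See [the docs](https://example.com/a).
```

The order matters here: once the tags are gone, the href values are gone with them, so the links have to be rewritten first.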


