Learning to Love Regex

You create a table of information somewhere and decide to transfer it somewhere else in markdown format.
[Image: a table in Whimsical with values highlighted for copying]
[Image: the values from the table pasted into a Repl, with the formatting all wrong]

Only, when you go to copy the values into your code editor, you realize that the formatting is all wrong! Verdammt! You spent all day compiling this information and you really don't want to spend the rest of the day fiddling around with the formatting to turn it into a markdown table.

Never fear, friend. Regex to the rescue.

Step 1: Know how to write a markdown table

The basic format is this:

  • The heading row needs a pipe (|) on either side of each column.
  • Between the heading and the table body, there needs to be a line where each column has a pipe on either side and contains three or more hyphens.
  • Each row of the table body uses the same pipe format as the heading.

Example:

| Heading1 | Heading2 |
| --- | --- |
| The most | Basic table ever |

Ends up looking like:

Heading1    Heading2
The most    Basic table ever
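
The format rules above can be sketched in a few lines of Python. The helper name `to_markdown_table` is made up for illustration and is not part of any library.

```python
def to_markdown_table(headers, rows):
    """Build a markdown table from a list of headers and a list of rows."""
    def fmt(cells):
        # A pipe on either side of each column, padded with single spaces.
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(headers)]
    # Separator line: three hyphens per column (three is the minimum).
    lines.append(fmt(["---"] * len(headers)))
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


print(to_markdown_table(["Heading1", "Heading2"], [["The most", "Basic table ever"]]))
```

Running it prints the same three lines as the example table above.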

Step 2: Remove newlines

To make subsequent regular expressions easier, replace every newline in the pasted text with a single space.

[Image: Repl of the pasted text, with find/replace set to find \n (regular expression option selected) and replace with a space]
[Image: result of running the find/replace operation; all of the text is now on one line]
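
If you would rather script this step than use an editor, here is a minimal Python sketch of the same replacement. The `pasted` string is a made-up sample, and it assumes the pasted text arrives with one value per line.

```python
import re

# A small stand-in for the pasted text, one value per line (made-up sample).
pasted = "Alias\nCanonical property name\n\\p{Script=Adlm}\n\\p{Script=Adlam}\nAdlam"

# Replace each newline with a single space, as in the find/replace above.
one_line = re.sub(r"\n", " ", pasted)
print(one_line)
```

Everything ends up on one line, with a single space where each line break used to be.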

Step 3: Create your table header

This step requires a regular expression that matches the headers and captures each one individually, so that you can manipulate what surrounds it (namely, add the pipes). Capture groups make this possible: each capture group is numbered internally, and that number can then be used in the replace operation.

In this example, the table header should include Alias, Canonical property name, and Matches letters and written signs belonging to ____ script, so we need a way to match those three. With regular expressions there are many ways to match the same text, so this is by no means the only way to go about it.

Find:
(\w+) ([\w\s]+(?= Matches)) ([\w\s]+)

Replace:
| $1 | $2 | $3 |\n| --- | --- | --- |\n

Only the text that we want to be the header of our table is highlighted when we try out our find regex.

The find regex:

  • Creates a capture group of one or more alphanumeric characters (including underscore)
  • Matches a space
  • Creates a second capture group of one or more of either alphanumeric characters (including underscore) or whitespace characters, but only if it is followed by a space and the word 'Matches'. Since the third heading begins with 'Matches', this lookahead ensures that the second capture group ends at the right spot.
  • Matches a space
  • Creates a third capture group of one or more of either alphanumeric characters (including underscore) or whitespace characters

  • Alias becomes capture group 1
  • Canonical property name becomes capture group 2
  • Matches letters and written signs belonging to ____ script becomes capture group 3
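
To see the capture groups outside of an editor, here is a minimal Python sketch that runs the same find regex. The sample `text` is a shortened, hand-typed stand-in for the one-line pasted text (an assumption; the real text carries on with the remaining rows).

```python
import re

header_pattern = re.compile(r"(\w+) ([\w\s]+(?= Matches)) ([\w\s]+)")

# Shortened stand-in for the one-line text from Step 2 (assumed sample).
text = (
    "Alias Canonical property name "
    "Matches letters and written signs belonging to ____ script "
    r"\p{Script=Adlm} \p{Script=Adlam} Adlam"
)

match = header_pattern.search(text)
# Print each capture group with its number; repr() makes whitespace visible.
for number, value in enumerate(match.groups(), start=1):
    print(number, repr(value))
```

Group 3 comes out with a trailing space, because `[\w\s]+` keeps matching whitespace until it hits the backslash that starts the first row.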

The replacement regex:

  • Adds a pipe (|) and a space before capture group 1
  • Adds a space and a pipe and a space before capture group 2
  • Adds a space and a pipe and a space before capture group 3
  • Adds a space and a pipe after the third capture group
  • Adds a new line
  • Adds a pipe
  • Adds a space
  • Adds three hyphens
  • Adds a space
  • Adds a pipe
  • Adds a space
  • Adds three hyphens
  • Adds a space
  • Adds a pipe
  • Adds a space
  • Adds three hyphens
  • Adds a space
  • Adds a pipe
  • Adds a new line

After applying the find/replace regex, a header is created for the table
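
The same find/replace can be sketched in Python. One difference to watch for: Python's `re.sub` writes the editor's `$1`, `$2`, `$3` as `\1`, `\2`, `\3`. The sample `text` is again a shortened stand-in for the real pasted text.

```python
import re

# Shortened stand-in for the one-line text from Step 2 (assumed sample).
text = (
    "Alias Canonical property name "
    "Matches letters and written signs belonging to ____ script "
    r"\p{Script=Adlm} \p{Script=Adlam} Adlam"
)

find = r"(\w+) ([\w\s]+(?= Matches)) ([\w\s]+)"
# Python spells the editor's $1, $2, $3 as \1, \2, \3.
replace = r"| \1 | \2 | \3 |\n| --- | --- | --- |\n"

# count=1 limits the substitution to the first match (the header).
with_header = re.sub(find, replace, text, count=1)
print(with_header)
```

The first two lines of the output are the header and the separator; the rest of the text is left untouched for the next step.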

Step 4: Create the table body

This is much like the routine we went through to create the header for the table -- we need to come up with a regular expression that will match what we want to match and ensure that our replacement regular expression converts it into the format we're looking for.

I know from having input all of this data that the pattern for the table is:

  • The first column starts with \p{Script=, is followed by a variable number of letters, followed by }
  • The second column starts with \p{Script=, is followed by a variable number of letters (and/or underscores), followed by }
  • The third column is a variable number of letters and can include multiple words (so can include whitespace)

Find:
(\\p{Script=\w+}) (\\p{Script=\w+}) ([\w\s]+)

Replace:
| $1 | $2 | $3 |\n

The text other than the newly created header is highlighted to indicate that the regex is matching. There are 142 matches to this find, which is what we expect (one per table row).

The find regex:

  • Creates a capture group of the value \p{Script= followed by one or more alphanumeric characters (including underscore) followed by a }
  • Matches a space
  • Creates a second capture group of the value \p{Script= followed by one or more alphanumeric characters (including underscore) followed by a }
  • Matches a space
  • Creates a third capture group of one or more alphanumeric or whitespace characters

For the first row of the table:

  • \p{Script=Adlm} becomes capture group 1
  • \p{Script=Adlam} becomes capture group 2
  • Adlam becomes capture group 3

The replacement regex:

  • Adds a pipe and a space before capture group 1
  • Adds a space, a pipe, and a space before capture group 2
  • Adds a space, a pipe, and a space before capture group 3
  • Adds a space and a pipe after capture group 3
  • Adds a new line

The text now looks like it's in markdown format!
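
Here is a minimal Python sketch of the body step, run on just two rows as a stand-in for the full text. Two details are my own choices rather than part of the editor version: the braces are escaped (`\{`, `\}`) to make it explicit that they are literal characters, and `re.subn` is used because it also reports how many matches were replaced, which mirrors the match-count check above.

```python
import re

# Two rows of the one-line text, as a stand-in for all 142 (assumed sample).
body = (
    r"\p{Script=Adlm} \p{Script=Adlam} Adlam "
    r"\p{Script=Hluw} \p{Script=Anatolian_Hieroglyphs} Anatolian Hieroglyphs"
)

# Same find regex, with the braces escaped so they are clearly literal.
find = r"(\\p\{Script=\w+\}) (\\p\{Script=\w+\}) ([\w\s]+)"
# Python spells the editor's $1, $2, $3 as \1, \2, \3.
replace = r"| \1 | \2 | \3 |\n"

# subn returns the new text and the number of replacements made.
table_body, match_count = re.subn(find, replace, body)
print(match_count)
print(table_body)
```

The third capture group picks up the space that separates one row from the next, so most rows end with two spaces before the closing pipe. That extra space should be harmless, since markdown renderers generally trim the padding inside table cells.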

Copying that newly formatted text here results in (moment of truth....)

| Alias | Canonical property name | Matches letters and written signs belonging to ____ script |
| --- | --- | --- |
| \p{Script=Adlm} | \p{Script=Adlam} | Adlam |
| \p{Script=Ahom} | \p{Script=Ahom} | Ahom |
| \p{Script=Hluw} | \p{Script=Anatolian_Hieroglyphs} | Anatolian Hieroglyphs |
| \p{Script=Arab} | \p{Script=Arabic} | Arabic |
| \p{Script=Armn} | \p{Script=Armenian} | Armenian |
| \p{Script=Avst} | \p{Script=Avestan} | Avestan |
| \p{Script=Bali} | \p{Script=Balinese} | Balinese |
| \p{Script=Bamu} | \p{Script=Bamum} | Bamum |
| \p{Script=Bass} | \p{Script=Bassa_Vah} | Bassa Vah |
| \p{Script=Batk} | \p{Script=Batak} | Batak |
| \p{Script=Beng} | \p{Script=Bengali} | Bengali |
| \p{Script=Bhks} | \p{Script=Bhaiksuki} | Bhaiksuki |
| \p{Script=Bopo} | \p{Script=Bopomofo} | Bopomofo |
| \p{Script=Brah} | \p{Script=Brahmi} | Brahmi |
| \p{Script=Brai} | \p{Script=Braille} | Braille |
| \p{Script=Bugi} | \p{Script=Buginese} | Buginese |
| \p{Script=Buhd} | \p{Script=Buhid} | Buhid |
| \p{Script=Cans} | \p{Script=Canadian_Aboriginal} | Canadian Aboriginal |
| \p{Script=Cari} | \p{Script=Carian} | Carian |
| \p{Script=Aghb} | \p{Script=Caucasian_Albanian} | Caucasian Albanian |
| \p{Script=Cakm} | \p{Script=Chakma} | Chakma |
| \p{Script=Cher} | \p{Script=Cherokee} | Cherokee |
| \p{Script=Zyyy} | \p{Script=Common} | Common |
| \p{Script=Copt} | \p{Script=Coptic} | Coptic |
| \p{Script=Qaac} | \p{Script=Coptic} | Coptic |
| \p{Script=Xsux} | \p{Script=Cuneiform} | Cuneiform |
| \p{Script=Cprt} | \p{Script=Cypriot} | Cypriot |
| \p{Script=Cyrl} | \p{Script=Cyrillic} | Cyrillic |
| \p{Script=Dsrt} | \p{Script=Deseret} | Deseret |
| \p{Script=Deva} | \p{Script=Devanagari} | Devanagari |
| \p{Script=Dupl} | \p{Script=Duployan} | Duployan |
| \p{Script=Egyp} | \p{Script=Egyptian_Hieroglyphs} | Egyptian Hieroglyphs |
| \p{Script=Elba} | \p{Script=Elbasan} | Elbasan |
| \p{Script=Ethi} | \p{Script=Ethiopic} | Ethiopic |
| \p{Script=Geor} | \p{Script=Georgian} | Georgian |
| \p{Script=Glag} | \p{Script=Glagolitic} | Glagolitic |
| \p{Script=Goth} | \p{Script=Gothic} | Gothic |
| \p{Script=Gran} | \p{Script=Grantha} | Grantha |
| \p{Script=Grek} | \p{Script=Greek} | Greek |
| \p{Script=Gujr} | \p{Script=Gujarati} | Gujarati |
| \p{Script=Guru} | \p{Script=Gurmukhi} | Gurmukhi |
| \p{Script=Hani} | \p{Script=Han} | Han |
| \p{Script=Hang} | \p{Script=Hangul} | Hangul |
| \p{Script=Hano} | \p{Script=Hanunoo} | Hanunoo |
| \p{Script=Hatr} | \p{Script=Hatran} | Hatran |
| \p{Script=Hebr} | \p{Script=Hebrew} | Hebrew |
| \p{Script=Hira} | \p{Script=Hiragana} | Hiragana |
| \p{Script=Armi} | \p{Script=Imperial_Aramaic} | Imperial Aramaic |
| \p{Script=Zinh} | \p{Script=Inherited} | Inherited |
| \p{Script=Qaai} | \p{Script=Inherited} | Inherited |
| \p{Script=Phli} | \p{Script=Inscriptional_Pahlavi} | Inscriptional Pahlavi |
| \p{Script=Prti} | \p{Script=Inscriptional_Parthian} | Inscriptional Parthian |
| \p{Script=Java} | \p{Script=Javanese} | Javanese |
| \p{Script=Kthi} | \p{Script=Kaithi} | Kaithi |
| \p{Script=Knda} | \p{Script=Kannada} | Kannada |
| \p{Script=Kana} | \p{Script=Katakana} | Katakana |
| \p{Script=Kali} | \p{Script=Kayah_Li} | Kayah Li |
| \p{Script=Khar} | \p{Script=Kharoshthi} | Kharoshthi |
| \p{Script=Khmr} | \p{Script=Khmer} | Khmer |
| \p{Script=Khoj} | \p{Script=Khojki} | Khojki |
| \p{Script=Sind} | \p{Script=Khudawadi} | Khudawadi |
| \p{Script=Laoo} | \p{Script=Lao} | Lao |
| \p{Script=Latn} | \p{Script=Latin} | Latin |
| \p{Script=Lepc} | \p{Script=Lepcha} | Lepcha |
| \p{Script=Limb} | \p{Script=Limbu} | Limbu |
| \p{Script=Lina} | \p{Script=Linear_A} | Linear A |
| \p{Script=Linb} | \p{Script=Linear_B} | Linear B |
| \p{Script=Lisu} | \p{Script=Lisu} | Lisu |
| \p{Script=Lyci} | \p{Script=Lycian} | Lycian |
| \p{Script=Lydi} | \p{Script=Lydian} | Lydian |
| \p{Script=Mahj} | \p{Script=Mahajani} | Mahajani |
| \p{Script=Mlym} | \p{Script=Malayalam} | Malayalam |
| \p{Script=Mand} | \p{Script=Mandaic} | Mandaic |
| \p{Script=Mani} | \p{Script=Manichaean} | Manichaean |
| \p{Script=Marc} | \p{Script=Marchen} | Marchen |
| \p{Script=Gonm} | \p{Script=Masaram_Gondi} | Masaram Gondi |
| \p{Script=Mtei} | \p{Script=Meetei_Mayek} | Meetei Mayek |
| \p{Script=Mend} | \p{Script=Mende_Kikakui} | Mende Kikakui |
| \p{Script=Merc} | \p{Script=Meroitic_Cursive} | Meroitic Cursive |
| \p{Script=Mero} | \p{Script=Meroitic_Hieroglyphs} | Meroitic Hieroglyphs |
| \p{Script=Plrd} | \p{Script=Miao} | Miao |
| \p{Script=Modi} | \p{Script=Modi} | Modi |
| \p{Script=Mong} | \p{Script=Mongolian} | Mongolian |
| \p{Script=Mroo} | \p{Script=Mro} | Mro |
| \p{Script=Mult} | \p{Script=Multani} | Multani |
| \p{Script=Mymr} | \p{Script=Myanmar} | Myanmar |
| \p{Script=Nbat} | \p{Script=Nabataean} | Nabataean |
| \p{Script=Talu} | \p{Script=New_Tai_Lue} | New Tai Lue |
| \p{Script=Newa} | \p{Script=Newa} | Newa |
| \p{Script=Nkoo} | \p{Script=Nko} | Nko |
| \p{Script=Nshu} | \p{Script=Nushu} | Nushu |
| \p{Script=Ogam} | \p{Script=Ogham} | Ogham |
| \p{Script=Olck} | \p{Script=Ol_Chiki} | Ol Chiki |
| \p{Script=Hung} | \p{Script=Old_Hungarian} | Old Hungarian |
| \p{Script=Ital} | \p{Script=Old_Italic} | Old Italic |
| \p{Script=Norb} | \p{Script=Old_North_Arabian} | Old North Arabian |
| \p{Script=Perm} | \p{Script=Old_Permic} | Old Permic |
| \p{Script=Xpeo} | \p{Script=Old_Persian} | Old Persian |
| \p{Script=Sarb} | \p{Script=Old_South_Arabian} | Old South Arabian |
| \p{Script=Orkh} | \p{Script=Old_Turkic} | Old Turkic |
| \p{Script=Orya} | \p{Script=Oriya} | Oriya |
| \p{Script=Osge} | \p{Script=Osage} | Osage |
| \p{Script=Osma} | \p{Script=Osmanya} | Osmanya |
| \p{Script=Hmng} | \p{Script=Pahawh_Hmong} | Pahawh Hmong |
| \p{Script=Palm} | \p{Script=Palmyrene} | Palmyrene |
| \p{Script=Pauc} | \p{Script=Pau_Cin_Hau} | Pau Cin Hau |
| \p{Script=Phag} | \p{Script=Phags_Pa} | Phags Pa |
| \p{Script=Phnx} | \p{Script=Phoenician} | Phoenician |
| \p{Script=Phlp} | \p{Script=Psalter_Pahlavi} | Psalter Pahlavi |
| \p{Script=Rjng} | \p{Script=Rejang} | Rejang |
| \p{Script=Runr} | \p{Script=Runic} | Runic |
| \p{Script=Samr} | \p{Script=Samaritan} | Samaritan |
| \p{Script=Saur} | \p{Script=Saurashtra} | Saurashtra |
| \p{Script=Shrd} | \p{Script=Sharada} | Sharada |
| \p{Script=Shaw} | \p{Script=Shavian} | Shavian |
| \p{Script=Sidd} | \p{Script=Siddham} | Siddham |
| \p{Script=Sgnw} | \p{Script=SignWriting} | SignWriting |
| \p{Script=Sinh} | \p{Script=Sinhala} | Sinhala |
| \p{Script=Sora} | \p{Script=Sora_Sompeng} | Sora Sompeng |
| \p{Script=Soyo} | \p{Script=Soyombo} | Soyombo |
| \p{Script=Sund} | \p{Script=Sundanese} | Sundanese |
| \p{Script=Sylo} | \p{Script=Syloti_Nagri} | Syloti Nagri |
| \p{Script=Syrc} | \p{Script=Syriac} | Syriac |
| \p{Script=Tglg} | \p{Script=Tagalog} | Tagalog |
| \p{Script=Tagb} | \p{Script=Tagbanwa} | Tagbanwa |
| \p{Script=Tale} | \p{Script=Tai_Le} | Tai Le |
| \p{Script=Lana} | \p{Script=Tai_Tham} | Tai Tham |
| \p{Script=Tavt} | \p{Script=Tai_Viet} | Tai Viet |
| \p{Script=Takr} | \p{Script=Takri} | Takri |
| \p{Script=Taml} | \p{Script=Tamil} | Tamil |
| \p{Script=Tang} | \p{Script=Tangut} | Tangut |
| \p{Script=Telu} | \p{Script=Telugu} | Telugu |
| \p{Script=Thaa} | \p{Script=Thaana} | Thaana |
| \p{Script=Thai} | \p{Script=Thai} | Thai |
| \p{Script=Tibt} | \p{Script=Tibetan} | Tibetan |
| \p{Script=Tfng} | \p{Script=Tifinagh} | Tifinagh |
| \p{Script=Tirh} | \p{Script=Tirhuta} | Tirhuta |
| \p{Script=Ugar} | \p{Script=Ugaritic} | Ugaritic |
| \p{Script=Vaii} | \p{Script=Vai} | Vai |
| \p{Script=Wara} | \p{Script=Warang_Citi} | Warang Citi |
| \p{Script=Yiii} | \p{Script=Yi} | Yi |
| \p{Script=Zanb} | \p{Script=Zanabazar_Square} | Zanabazar Square |

So if you find yourself in a situation where you need to format data and you really don't want to manually go through the repetitive work involved.... look for patterns, embrace the regex, and save yourself some time.