[Question] Sed and regex string manipulation

N0x0n · edit-2 1 day ago

harsh3466 · 1 day ago

Don’t worry or apologize about your English. I’m having no trouble understanding. :)

I’m going to take the second part first and come back with another comment to address the %20 and https bits.

So these variations, like [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles), are where you would start to craft a new expression. Trying to catch every variation in a single expression would get too complicated and would be more likely to fail and/or modify text you don’t want modified.

So in this case, here’s the expression I’d use:

sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

And the breakdown:

sed -ri - Calls sed with extended regular expressions (the -r) and tells it to edit the file in place (the -i)

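's| - Begins the substitute expression, which has the form s|pattern|replacement|. The s is sed’s substitute command, and the | is the delimiter that separates the pattern match from the modification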

( - This very first opening parenthesis is a special metacharacter that is used to group a sub-expression within the larger expression. By doing this we can create variables that we can refer to in the modification portion of the command.

]\( - Find the closing bracket character followed by an opening parenthesis character, which we know will be the beginning of a markdown url. The backslash precedes the opening parenthesis to escape it and indicate that sed needs to look for the actual opening parenthesis character

.+ - Find any character (indicated by the .) one or more times (indicated by the +). The + is greedy, so this grabs as many characters as it can while still letting the rest of the expression match

[0-9]+ - This is two parts. The first part is [0-9]. The brackets are metacharacters in regex that enclose a character set to match from. In this case the character set is the digits zero to nine. On its own, this means sed will look for one occurrence of any digit between zero and nine. The + tells sed to find one or more of those digits in a row before it moves on to the next portion of the pattern. I did this because I don’t know the upper bounds of the numbering you’re working with in the links. If all the links only contain single-digit numbers before the decimal, you can remove the +.

) - This closing parenthesis marks the end of the subexpression that we want to refer to. In this case, the subexpression is capturing from the closing bracket up to (but not including) the decimal in the number.

\. - This tells sed to find the period/dot/decimal character in the number. It’s preceded by the backslash because the period/dot/decimal character is a metacharacter in regular expressions.

( - This is the beginning of a new subexpression

[0-9]+ - The numeral capture repeats to find the number after the period/dot/decimal. Similarly to the number before the decimal, if the number after the decimal is only ever single digit, the + can be removed.

.+ - Find any character (indicated by the .) one or more times (indicated by the +). This will find any characters until it gets to the next specified character in the expression, taking us to the end of the url

\) - Find the closing parenthesis of the url. The backslash precedes the closing parenthesis to escape it and indicate that sed needs to look for the actual closing parenthesis character.

) - This closes our second subexpression, which captures everything from the number after the decimal to the closing parenthesis of the link.

| - Indicates the end of the pattern matching portion of the expression/command and the beginning of the modification part of the command/expression.

\1 - This is how we refer to or call the subexpressions. The syntax is a backslash followed by a number, and the number indicates the sequential position of the subexpression. So \1 refers to this portion of the regex in the command above: (]\(.+[0-9]+). This section of the expression is capturing everything from the closing bracket up to (but not including) the period/dot/decimal character. By using it in this position in the substitution/modification, we’re just using it as a variable, so in the substitution, it’s going to put everything it finds in the first subexpression first in the new/modified string of text.

- - This tells sed to put a dash immediately after the first subexpression in this new/modified string of text, effectively replacing the period/dot/decimal in the number portion of the url.

\2 - This is calling the second subexpression, which is this portion of the pattern matching regex: [0-9]+.+\). This captures everything in the url from the number after the period/dot/decimal (not including the decimal) to the closing parenthesis of the markdown url. Used in this position of the substitution, it tells sed to place it after the dash in the new/modified text.

|' - This indicates the end of the modification portion of the command and closes the match|substitution expression.

somefile - The file to be worked on
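Since -i changes the file directly, it can help to try the command on a scratch copy first. Here is a rough sketch of that, assuming GNU sed. The temporary file from mktemp and the .bak suffix are my own choices for illustration and are not part of your setup; the sample line is the link quoted above.

```shell
# Scratch file standing in for "somefile" (hypothetical path created by mktemp).
tmpfile=$(mktemp)
printf '%s\n' '[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)' > "$tmpfile"

# -i.bak edits the file in place and keeps the untouched original as "$tmpfile.bak".
sed -r -i.bak 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' "$tmpfile"

# Read both versions back so they can be compared.
edited=$(cat "$tmpfile")
backup=$(cat "$tmpfile.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

# Clean up the scratch files.
rm -f "$tmpfile" "$tmpfile.bak"
```

If the edited line looks right, the same command can be pointed at the real file, with or without the backup suffix.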

Here is the full command again: sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

Altogether, what this does is: begin the first subexpression, which finds a closing bracket followed by an opening parenthesis, then any characters, then one or more digits between zero and nine, stopping just before the decimal. Close that subexpression and remember what it found (not including the decimal). Match the decimal itself outside of either subexpression. Then begin the second subexpression, which finds one or more digits between zero and nine followed by one or more of any character, up to a closing parenthesis. Close and remember what was found in this subexpression. Finally, replace everything that was matched with subexpression one, followed by a dash, followed by subexpression two.
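To see the whole thing in action without touching a file, you can pipe a line into sed and leave off the -i. This is only a small sketch, assuming GNU sed. The first sample is the link quoted earlier; the second one (notes.md with a 12.34 heading) is a made-up example to show the multi-digit case, and the variable names are just for illustration.

```shell
# Same expression as above, stored once so both samples use identical input to sed.
expr='s|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|'

# Sample 1: the link quoted earlier in this comment.
line1='[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)'
# Sample 2: hypothetical link with multi-digit numbers on both sides of the decimal.
line2='[12.34 Audio](notes.md#12.34%20Audio)'

# Without -i, sed reads stdin and prints the result instead of editing a file.
result1=$(printf '%s\n' "$line1" | sed -r "$expr")
result2=$(printf '%s\n' "$line2" | sed -r "$expr")

printf '%s\n' "$result1"
# prints: [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3%20Subtitles)
printf '%s\n' "$result2"
# prints: [12.34 Audio](notes.md#12-34%20Audio)
```

Note that the visible text in the square brackets keeps its decimal, because the match only starts at the closing bracket.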

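If you also need the link target (the part of the markdown link inside the parentheses) converted to lowercase, just add \L to the replacement section before the \1. Note that \L is a GNU sed extension, and it only affects the text this expression matches, so the visible text in the square brackets is left alone: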

sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|' somefile
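Here is a quick sketch of what the \L version does to the same sample link, again assuming GNU sed (other sed implementations may not understand \L). The variable names are only for illustration.

```shell
# The link quoted earlier in this comment.
line='[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)'

# \L lowercases everything that follows it in the replacement (here both \1 and \2).
# No -i, so the result is printed instead of written to a file.
lowered=$(printf '%s\n' "$line" | sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|')

printf '%s\n' "$lowered"
# prints: [1.3 Subtitles](bdmv_svt-av1_encode_anime.md#1-3%20subtitles)
```

Keep in mind this also lowercases the file name part of the target, so it only makes sense if the file on disk is named in lowercase as well.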
