Regular Expressions in RGSS (Regexp)

Started by Zeriab, January 08, 2008, 06:29:53 am


Zeriab


Preface

This article is intended to give basic insight into the use of regular expressions in RGSS and to serve as a future reference.
I assume that you, the reader, have scripting knowledge. If you run into scripting problems while reading or testing the tutorial that are not directly related to regular expressions, it would probably be a good idea to seek out basic scripting tutorials before you continue reading; otherwise you will most likely not get as much out of this tutorial.
This is not an in-depth tutorial on regular expressions. It is more of an appetizer, a help to get you started (along with the reference part, naturally).
I hope you will be able to figure out how to use regular expressions to your advantage with what I have provided, and to figure out what I have not run through.

I am sure I will have made at least one error. Please do tell me if you find it.
If you have any problems understanding anything of the material here, ask. Worst case scenario is me not answering. Best case is you getting an answer that will help you understand the stuff here ^_^
Worth the risk? I would say so.

Enjoy reading
- Zeriab

Contents

  • Preface
  • Contents
  • Reference
  • What are regular expressions/languages?
  • How to create regular expressions
  • How to use regular expressions
  • Non-regular languages accepted
  • Examples
  • Exercises
  • Credits and thanks
  • Sources and useful links

Reference

Reference from ZenSpider:

Quote from: Reference
Regexen

/normal regex/iomx[neus]
%r|alternate form|

options:

/i         case insensitive
/o         only interpolate #{} blocks once
/m         multiline mode - '.' will match newline
/x         extended mode - whitespace is ignored
/[neus]    encoding: none, EUC, UTF-8, SJIS, respectively

regex characters:

.          any character except newline
[ ]        any single character of set
[^ ]       any single character NOT of set
*          0 or more previous regular expression
*?         0 or more previous regular expression(non greedy)
+          1 or more previous regular expression
+?         1 or more previous regular expression(non greedy)
?          0 or 1 previous regular expression
|          alternation
( )        grouping regular expressions
^          beginning of a line or string
$          end of a line or string
{m,n}      at least m but most n previous regular expression
{m,n}?     at least m but most n previous regular expression(non greedy)
\A         beginning of a string
\b         backspace(0x08)(inside[]only)
\B         non-word boundary
\b         word boundary(outside[]only)
\d         digit, same as[0-9]
\D         non-digit
\S         non-whitespace character
\s         whitespace character[ \t\n\r\f]
\W         non-word character
\w         word character[0-9A-Za-z_]
\z         end of a string
\Z         end of a string, or before newline at the end
(?# )      comment
(?: )      grouping without backreferences
(?= )      zero-width positive look-ahead assertion
(?! )      zero-width negative look-ahead assertion
(?ix-ix)   turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.
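
A few of the less obvious entries in the reference can be tried out directly on strings. This is only a small sketch; the sample strings are made up for illustration and do not come from the ZenSpider reference.

```ruby
# All sample strings below are invented for illustration.

# {m,n}: at least m but at most n repetitions (greedy, so it takes 2 here)
p "aaaa"[/a{1,2}/]               # => "aa"

# (?= ): positive look-ahead, checks what follows without consuming it
p "100 gold"[/\d+(?= gold)/]     # => "100"

# (?! ): negative look-ahead, the match fails if the given text follows
p "foobar"[/foo(?!bar)/]         # => nil

# (?: ): grouping without creating a back reference,
# so the first (and only) captured group is the (c)
"ababc" =~ /(?:ab)+(c)/
p $1                             # => "c"
```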


What are regular expressions/languages?
You can skip this section if you want since it is a bit of theory without too much practical application.

A regular language is a language that can be expressed with a regular expression.
What do I mean by language?
I mean a set of words: every word is either in the language or not in the language.
Let us, for simplicity, consider an alphabet with just a and b. This means that you only have the symbols a and b to make words from when using this alphabet.
A regular language with that alphabet will define which combinations of a and b are words in the language. When I say a regular language accepts a word I mean that it is an actual word in the language.
We could define a regular language to be an a followed by any number of bs and ended by one a. Zero bs is a possibility.
E.g. aa, aba and abbba would be in the language.
b, abbbbab and bba would on the other hand not be in the language.
We can quickly see that there are infinitely many words belonging to the language ^_^

A regular expression is the definition of a regular language.
Let us start with a simple regular expression:
aba
This means that an accepted word must consist of an a followed by a b followed by an a.
This particular regular expression defines a regular language that only accepts one word, aba.

The previous regular language (an a, any number of bs, then an a) can be expressed as:
a(b)*a

Notice the new notation, the star *.
It means any number of repetitions of the symbol it applies to: 0, 1, 2, ...
As long as just one of those counts fits, the word is accepted.
The regular expression defines a regular language that accepts any word that consists of an a followed by any number of bs followed by an a, and nothing more.
Note that a(b)*(b)*a expresses the same regular language. This is easy to see, since one of the (b)* can match 0 bs all the time and you are left with the same regular expression.
We can conclude that there are infinitely many ways to express a regular language.

If we wanted the words just to start with a(b)*a, with whatever comes after being irrelevant, we can use this regular expression: a(b)*a(a|b)*

Note: The () are not really needed when there is only 1 letter, I just put them on for clarity. It is perfectly fine to write ab*a(a|b)*

Let us define a new regular expression:
abba|baba

We have new notation again. The | simply means or, like in normal RGSS.
So this regular expression defines a language that accepts abba and baba. It is pretty straightforward.

The ? means either 0 or 1 of the symbol. ab?a accepts aa and aba.

One final, often used notation is the + operator: a(b)+a
It is basically the same as the star * operator, with the difference that there must be at least one b.
(x)+ is the same as x(x)*, where x could be anything, i.e. any regular expression.

Note (a)* also accepts the empty word. I.e. "" in Ruby syntax.
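
The notation from this section can be checked in Ruby. Here is a minimal sketch: accepts? is a made-up helper (it is not part of RGSS), and it uses \A and \z from the reference so that the whole word has to fit, because a Ruby regular expression otherwise matches anywhere inside a string.

```ruby
# accepts? is a hypothetical helper, not part of RGSS.
# It answers: is the whole word in the language described by pattern?
# \A and \z tie the match to the start and end of the string, and (?: )
# keeps a | inside the pattern from splitting the anchors apart.
def accepts?(pattern, word)
  !(word =~ /\A(?:#{pattern})\z/).nil?
end

p accepts?("ab*a", "abbba")          # => true
p accepts?("ab*a", "abbbbab")        # => false, a b follows the second a
p accepts?("ab*a(a|b)*", "abbbbab")  # => true, anything may follow now
p accepts?("abba|baba", "baba")      # => true
p accepts?("ab?a", "abba")           # => false, at most one b
p accepts?("ab+a", "aa")             # => false, at least one b is required
p accepts?("a*", "")                 # => true, the empty word
```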

I will end this section with an example of a non-regular language:
Every word in this language has a number of as at the start followed by the same number of bs.
I.e. ab, aabb, aaabbb and so on.

I will not trouble you with the proof because I think you will find it as boring as I do.
If you really want it I am sure you can manage to search for it yourself. ;)
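
In practice a language like this is easy to check once you let ordinary Ruby do the counting. The sketch below is only an illustration: the method name is made up, and it assumes at least one a, as in the examples ab, aabb and aaabbb. $1 and $2 hold what the first and second ( ) group matched.

```ruby
# Hypothetical helper. Assumes at least one a (ab, aabb, aaabbb, ...).
def as_then_equally_many_bs?(word)
  # The regular expression only checks the shape: some as followed by some bs.
  return false unless word =~ /\A(a+)(b+)\z/
  # The counting is done in plain Ruby on the two captured groups.
  $1.length == $2.length
end

p as_then_equally_many_bs?("aaabbb")   # => true
p as_then_equally_many_bs?("aab")      # => false, the counts differ
p as_then_equally_many_bs?("abab")     # => false, wrong shape
```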


How to create regular expressions

You can create a regular expression with one of the following syntaxes: (I will use the first one in my examples)

/.../ol
%r|...|ol
Regexp.new('...', options, language)

The dots (...) symbolize the actual regular expression. You can see the syntax up in the reference section.
The o marks the place for options, which are optional. You do not have to write anything here.
o can be any of i, o, m and x. You can use several of them, or all of them if you want. From the reference section:
Quote
/i         case insensitive
/o         only interpolate #{} blocks once
/m         multiline mode - '.' will match newline
/x         extended mode - whitespace is ignored
/[neus]    encoding: none, EUC, UTF-8, SJIS, respectively

The /[neus] is for the l. You can choose either n, e, u or s. Only one encoding is possible at a time.

The options block in Regexp.new is optional. This is a little different from the options part in the previous syntaxes.
Here you can put:
Regexp::EXTENDED - Newlines and spaces are ignored.
Regexp::IGNORECASE - Case insensitive.
Regexp::MULTILINE - Newlines are treated as any other character.

If you want more than one option, let us say ignore case and multiline, you OR them together with | like this: Regexp::IGNORECASE | Regexp::MULTILINE
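
As a small sketch of the syntaxes above, the following builds the same pattern in three ways and compares the results. The pattern itself is just an example. The language argument of Regexp.new is left out here on purpose, since how it behaves depends on the Ruby version.

```ruby
# Three ways to write the same case insensitive pattern.
p(/ab*a/i == %r|ab*a|i)                                 # => true
p(/ab*a/i == Regexp.new('ab*a', Regexp::IGNORECASE))    # => true

# Several options are combined with |
both = Regexp.new('a.a', Regexp::IGNORECASE | Regexp::MULTILINE)
p both == /a.a/im     # => true
p "A\nA" =~ both      # => 0, '.' matches the newline because of MULTILINE
p "A\nA" =~ /a.a/i    # => nil, without m the '.' does not match a newline
```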


We move on to what you actually write instead of the dots (...)
Notice that . means any character (except a newline, unless the m option is used).
If you want to find just a dot use \.

Let us take the example where you want to accept words that start with either a, b or c. What comes after does not matter.
/(a|b|c).*/ is a possibility, and so are /[abc].*/ and /[a-c].*/
This illustrates the use of the []. If you want all letters use [a-zA-Z], or just [a-z] if you have ignore case on.
If you have unusual letters, i.e. letters outside a-z, then you probably have to enter them manually.
Example, any word:
/[a-z]*/i gives the same as /[a-zA-Z]*/
The first form makes everything that follows case insensitive too, though. I.e. /[a-z]*g/i is not the same as /[a-zA-Z]*g/.
In the first case words that end with a capital G are accepted, while in the latter case they are not.
Numbers are [0-9]
Just use \w to get numbers, letters and _
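
The character classes above can be tried out like this. It is only a sketch with made-up strings. It also shows one thing worth knowing: a match may start anywhere in the string, so \A from the reference is needed if the word really has to start with a, b or c.

```ruby
p "cat"[/[abc].*/]        # => "cat"
p "zebra"[/[abc].*/]      # => "bra", the match may start anywhere in the string
p "zebra"[/\A[abc].*/]    # => nil, \A insists on the start of the string

p "BIG" =~ /[a-z]*g/i     # => 0, /i also applies to the final g
p "BIG" =~ /[a-zA-Z]*g/   # => nil, the final g must be lower case here

p "Level 42"[/[0-9]+/]    # => "42"
p "hp_max2 + 5"[/\w+/]    # => "hp_max2", letters, digits and _
```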


Let us take a slightly more difficult example.
We have used Dir to get the files in a directory. We want to remove the file extension from the strings. Exactly how we will do that is shown in the next section.
Here we will write the regular expression that accepts the file extension with its dot, and only the file extension.
If there are more dots, as in FileName.txt.dat, we will only consider the last extension. I.e. only .dat.
If you have enough self-discipline, this would be a good time to try to figure out how to do it on your own. Just consider every extension.

My answer is /\.[^\.]*\Z/
I will go through this step by step.
First we have \. which means a dot, as I mentioned earlier. It is the dot in the file name.
Next we have the [^\.]* bit. The [] means one of the characters in the brackets. [^] means any character that is not in the brackets. Since you have the dot \. in the brackets it means any character but the dot.
The star simply means any number, so any number of non-dot characters is accepted.
Finally we have \Z which means at the end of the string. It will be explained in the next section.
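
To see which part of a file name the expression picks out, you can index a string with it. A minimal sketch; EXTENSION is just a name chosen here, and the file names other than FileName.txt.dat are made up.

```ruby
# EXTENSION is a name picked for this example only.
EXTENSION = /\.[^\.]*\Z/

p "FileName.txt.dat"[EXTENSION]   # => ".dat", only the last extension
p "Read.me.txt"[EXTENSION]        # => ".txt"
p "README"[EXTENSION]             # => nil, there is no dot at all
p "archive."[EXTENSION]           # => ".", the star also allows zero characters
```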


How to use regular expressions

You may have wondered why there are both a * and a *? operator when they basically do the same thing.
I have also avoided other operators that are specific to how a regular expression is used.
These relate to how regular expressions are applied to strings.

You now know how to create regular expressions, and in this section I will tell you how to actually use them.

I will continue the example from the previous section.
We have the string:
str = "Read.me.txt"

We want to remove the .txt which can be done this way:
str.gsub!(/\.[^\.]*\Z/) {|s| ""}

We use gsub!, which modifies the string itself. The result is "Read.me".
The \Z means that the match has to end at the end of the string (or just before a newline that ends the string). It does not take .me because there is a dot after it, so it is not at the end of the string. (Remember that [^\.] does not accept dots.)
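
Two details about gsub! are easy to trip over, so here is a short sketch of them. The second string is made up. A plain replacement string can be given instead of a block, and gsub! returns nil when nothing matched, while gsub without the ! leaves the original string alone.

```ruby
str = "Read.me.txt"
p str.gsub!(/\.[^\.]*\Z/, "")    # => "Read.me", same effect as the block {|s| ""}
p str                            # => "Read.me", the string itself was changed

plain = "README"
p plain.gsub!(/\.[^\.]*\Z/, "")  # => nil, nothing matched so nothing was changed
p plain                          # => "README"

original = "Read.me.txt"
p original.gsub(/\.[^\.]*\Z/, "")  # => "Read.me", a new string
p original                         # => "Read.me.txt", untouched
```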
Here is a list of the methods on strings where you can apply the regular expression.
Look here for how to use them: http://www.rubycentral.com/ref/ref_c_string.html

  • =~
  • [ ]
  • [ ]=
  • gsub
  • gsub!
  • index
  • rindex
  • scan
  • slice
  • slice!
  • split
  • sub
  • sub!

Basically whenever you see aRegexp or pattern as an argument you can apply your regular expression.

The effects vary from method to method, but I am sure you can figure them out, since the principle behind the regular expressions is the same.
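
As a rough guide to how the effects differ, here is a sketch that applies several of the methods from the list to the same string. Only a few of them are shown.

```ruby
name = "Read.me.txt"

p name =~ /\./            # => 4, position of the first match (nil if none)
p name[/\.[^\.]*\Z/]      # => ".txt", the matched text itself
p name.index(/\./)        # => 4
p name.rindex(/\./)       # => 7, searching from the end
p name.scan(/[a-z]+/i)    # => ["Read", "me", "txt"], every match
p name.split(/\./)        # => ["Read", "me", "txt"], cut at every match
p name.sub(/\./, "_")     # => "Read_me.txt", only the first match is replaced
p name.gsub(/\./, "_")    # => "Read_me_txt", all matches are replaced
```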

Another fishy thing you might have noticed is the non-greedy variants of * and +.
To illustrate the difference, try this code:
"boobs".gsub(/bo*/) {|s| p s}     # 1
"boobs".gsub(/bo*?/) {|s| p s}    # 2
"boobs".gsub(/bo+/) {|s| p s}     # 3
"boobs".gsub(/bo+?/) {|s| p s}    # 4

The first one (greedy) will print out "boo" and "b". It takes all the os it can.
The next one (non greedy) will print out "b" and "b". It takes as few os as possible.
That is basically the difference. The non-greedy version will still take os if that is necessary for a match. In the following code "boob" is printed out in both examples.
"boobs".gsub(/bo*b/) {|s| p s}
"boobs".gsub(/bo*?b/) {|s| p s}


The + operator is similar except that there has to be at least 1 o.
In the 3rd case you will get "boo" and in the 4th case you will get "bo".
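
If you would rather see the matches as arrays than as printed lines, scan collects them without replacing anything. This is just an alternative way of looking at the same six cases.

```ruby
p "boobs".scan(/bo*/)     # => ["boo", "b"]   greedy star
p "boobs".scan(/bo*?/)    # => ["b", "b"]     non-greedy star
p "boobs".scan(/bo+/)     # => ["boo"]        greedy plus
p "boobs".scan(/bo+?/)    # => ["bo"]         non-greedy plus
p "boobs".scan(/bo*b/)    # => ["boob"]
p "boobs".scan(/bo*?b/)   # => ["boob"], the os are needed to reach the b
```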

All in all: the longest possible string is taken unless you use non-greedy operators. The non-greedy operators will try to get the shortest possible string.
"boobs".gsub(/o*?/) {|s| p s} will print nothing but ""s.

As a finale I will talk about escaping characters. You may have wondered how to search for the characters that have a special meaning, like *, | and /.
The trick is to escape them. That is what it is called. You basically just have to put a \ in front of them:
\*, \| and \/. To get \ just use \\.
I have already shown an example where I escape the dot. (\.)
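
Here is a small sketch of escaping in action, with made-up strings. Regexp.escape is a standard Ruby method that is not covered in this tutorial; it puts the backslashes in for you, which is handy when the text to search for comes from a variable.

```ruby
p "2*3|4/5".scan(/\*|\||\//)    # => ["*", "|", "/"]
p 'C:\Games' =~ /\\/            # => 2, position of the backslash

# Regexp.escape returns the text with the special characters escaped
p Regexp.escape("1.5*2")                       # => "1\\.5\\*2"
p "x = 1.5*2" =~ /#{Regexp.escape("1.5*2")}/   # => 4
p "x = 1.5*2" =~ /1.5*2/                       # => nil, unescaped it means something else
```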

There are still loads of operators I have not shown. Some are a bit advanced, some are not. They will not be included in this tutorial, except for the back reference shown in the next section.
Until I make another tutorial or extend this one you can have fun discovering how they work on your own ^_^

Non-regular languages accepted

It is a bit ironic that the regular expression implementation in Ruby also accepts some non-regular languages.
It is the back-references I am talking about.
Look at this example: /(wii|psp)\1/
It accepts wiiwii and psppsp, but neither wiipsp nor pspwii.
You can use back references to make non-regular languages.
I am not going to supply either proof or example. You can google it yourself if you are in doubt ^_^
One problem with back references is speed. Matching can go from polynomial time to exponential time. If the regular expressions have been implemented just a little sensibly, the slowdown will only affect regular expressions that use the extra functionality.
It should not be too much of a problem with short regular expressions, but it is still something to consider.
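
The back reference example can be tested like this. Note that this sketch adds \A and \z from the reference so that the whole string has to fit; the original expression would also match in the middle of a longer string. DOUBLED is just a name chosen here.

```ruby
# DOUBLED is a name picked for this example only.
# \1 must repeat exactly what the first ( ) group matched.
DOUBLED = /\A(wii|psp)\1\z/

p "wiiwii" =~ DOUBLED   # => 0
p "psppsp" =~ DOUBLED   # => 0
p "wiipsp" =~ DOUBLED   # => nil
p "pspwii" =~ DOUBLED   # => nil

# After a successful match, $1 holds what the first group captured.
"psppsp" =~ DOUBLED
p $1                    # => "psp"
```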

Examples

A couple of examples for reference and guidance ^_^

Example 1
I will start by giving some code:
files = Dir["Data/*"]
files.each {|s| s.gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""}}

The filenames of all the files in the Data directory are stored in the files variable as strings in an Array. (subdirectories and their contents not included)
That is what the Dir["Data/*"] command returns.
The next line calls gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""} on each filename. (Remember that it is a string)
Now we finally come to the big regular expression: /\.[^\.]*$|[^\/]*\//
Notice the |. This means that we accept strings that fulfill either what comes before it or what comes after it. If a string is accepted in both cases it will still be accepted.
Let us look at \.[^\.]*$ first. This looks like something we have seen before. Since $ means end of string/line it basically does the same thing as \Z. We have already run through this example: it removes the extension.
Next we will look at [^\/]*\/ (the final / in the code just closes the regular expression). Remember the bit about escaping?
[^\/] means any character but /. The star means any number of them.
It is followed by \/ which means the character /. The last character MUST be /
Since the greedy version of the star is used it will try to find the longest possible string which ends with /.
So this basically finds the directory bit of the path.
This means that a match has to be either the extension or the path before the filename. We remove these parts by substituting the found strings with "".

This can be used if you for some reason want to draw the names of the files in a location without the path and extension. (You might keep those elsewhere.)
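
To try the expression without depending on what is actually in your Data directory, you can feed it a few strings by hand. A sketch only: the helper name and the paths are made up, and gsub is used instead of gsub! so the input strings are left alone. File.basename is a standard Ruby method that gives the same result for these paths.

```ruby
# strip_path_and_extension is a hypothetical helper name.
def strip_path_and_extension(path)
  path.gsub(/\.[^\.]*$|[^\/]*\//, "")
end

# Made-up paths standing in for what Dir might return.
files = ["Data/Actors.rxdata", "Data/Map001.rxdata", "Data/Sub/Read.me.txt"]

p files.map {|s| strip_path_and_extension(s) }
# => ["Actors", "Map001", "Read.me"]

# The built-in way: ".*" tells File.basename to drop any extension.
p files.map {|s| File.basename(s, ".*") }
# => ["Actors", "Map001", "Read.me"]
```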

Exercises

I have made up a couple of exercises you can try your hand at. I have not provided the answers and I do not plan to provide them.
I consider them more of a stimulant, a way to help you learn regular expressions.
If you really want to check your answers then PM me. Do not post your answers here.

Exercise 1
Let str be an actual string. What does this code do?
str.gsub!(/.*\..*/) {|s| ""}


Exercise 2
You have one long string that contains a certain number of words. Whitespace characters separate the words.
These words have a structure. They all start with either "donkey_" or "monkey_".
What comes after the "_" differs from word to word.
What you want is to separate the monkeys from the donkeys.
You want to put the donkeys in a donkey array and the monkeys in a monkey array.
Create the 2 arrays beforehand and use just one gsub call to separate the monkeys from the donkeys.

Credits and thanks
This article has been made by Zeriab, please give credits.
Credits to ZenSpider.com for the reference list.

Thanks to:
Ragnarok Rob for the example word "boobs". (An amazingly good word ^_^)
SephirothSpawn for getting me to do this tutorial.
Nitt (jimme reashu) for bug reporting.

Sources and useful links

ZenSpider - http://www.zenspider.com/Languages/Ruby/QuickRef.html#11
Ruby Central - http://www.rubycentral.com/ref/ref_c_regexp.html
Regular-Expressions.info - http://www.regular-expressions.info/

Fantasist

Zeriab,  <3 you! This is among my 'most wanted' list, now I can do lots of stuff I planned :D
Do you like ambient/electronic music? Then you should promote a talented artist! Help out here. (I'm serious. Just listen to his work at least!)


The best of freeware reviews: Gizmo's Freeware Reviews




Sally

this only works in rpgmakerxp? not the pke edition?

Nortos

gj Zeriab I always love your tuts :) gonna give you +energy for this it's real useful

Sally

zeriab is going to have the most energy XD

Nortos


Blizzard

Because nobody is giving me energy. D:

But I know why: I actually don't care about how much I have. If I did, I could cheat and add myself daily 1 energy which would be unfair. >_>
Check out Daygames and our games:

King of Booze 2      King of Booze: Never Ever
Drinking Game for Android      Never have I ever for Android
Drinking Game for iOS      Never have I ever for iOS


Quote from: winkio
I do not speak to bricks, either as individuals or in wall form.

Quote from: Barney Stinson
When I get sad, I stop being sad and be awesome instead. True story.

Nortos

I actually tried to give you some for your game cos u deserve it :P but for some reason it didn't work hey it's 2hrs later I'll do it now :)

Fantasist

There are a lot of people who I want to give energy to, and I'm just too lazy to say the least XP




Zeriab

*hugs Fantasist*
I <3 you too.

@Susys:
Chances are that it works in pke version since the RegExp is built-in in Ruby. I have not tested so I don't know.
Otherwise you just have to test it ;)

@Nortos:
I believe I have the most energy at the time of this post.
That will probably change though. Then again. I am more lazy than Blizzard, so I won't spend as much energy as Blizzard.

Nortos

January 10, 2008, 09:00:25 pm #10 Last Edit: January 10, 2008, 09:01:07 pm by Nortos
How can you use energy, or did you just mean it jokingly, as in being lazy?
EDIT:  :o someone just gave me 1 energy  :D

j.a

im gonna try this right now








p.s.
Quote from: Susys on January 08, 2008, 04:12:38 pm
this only works in rpgmakerxp? not the pke edition?
pke edition means postality knights edition edition the e is edition so dont put two

Fantasist

January 19, 2008, 10:41:41 am #12 Last Edit: January 19, 2008, 10:45:56 am by Fantasist
No, e is 'Enhanced'. It's Postality Knights Enhanced 2.0 to be precise, I doubt anyone has the older one, it's just too buggy.

On topic, RegExp will most likely work in PKE, it depends on the RGSS dll and as far as I know, no new modules or classes were added from 100J to 102E. Only problems with 100J I know are:
- error when Win32API is messed (game crashes)
- in-built modules can't be overwritten (overwriting them causes the game to crash)
- Defaults are not set in the Font class. That's why fonts don't appear if the legal scripts are used, there's no $fontface or $fontsize, so the font just doesn't show up, where as in 102E,
Quote
Font.default_name = 'Arial'
is in-built. This is the secret of the "omg fonts don't appear!" phenomenon ^_^

EDIT: Not to mention, the conditioning bug (evaluating 'and's and 'or's) which is not corrected in any RGSS dll till date (RGSS2xx/VX dlls excluded, I don't know about them).




Sally

yea some scripts get rid of my text so what i have to do is use the UMS to get it done. it works too :) on some of the scripts :P

j.a

Quote from: Fantasist on January 19, 2008, 10:41:41 am
No, e is 'Enhanced'. It's Postality Knights Enhanced 2.0 to be precise, I doubt anyone has the older one, it's just too buggy.

2.0 is old and buggy i use 2.0 and it works just fine?

Blizzard

1.0 is old and buggy. It's also called Dynaemu version but PKE isn't the best either. xD

Ryex

July 17, 2009, 09:14:41 pm #16 Last Edit: July 17, 2009, 09:37:12 pm by Ryexander
necro post! all I can say is wow I finally had a need to learn regular expressions so I took a look at this tut. before they confused the hell out of me, but the ten minuets it took to read this tut explained everything! thanks!
I no longer keep up with posts in the forum very well. If you have a question or comment, about my work, or in general I welcome PM's. if you make a post in one of my threads and I don't reply with in a day or two feel free to PM me and point it out to me.

DropBox, the best free file syncing service there is.

Aqua

Quote from: Ryexander on July 17, 2009, 09:14:41 pm
necro post! all I can say is wow I finally had a need to learn regular excretions so I took a look at this tut. before they confused the hell out of me, but the ten minuets it took to read this tut explained everything! thanks!


I'm sorry Ry... but that made me LOL!

(*gasp* Aqua made a spam post.)

Ryex

 :O_O:*edits*

What ARE you talking about Aqua?  :whistle:

Zeriab

Thanks for the nice words :shy:
I am happy that you now know about regular expressions :D

And there is nothing like a good laugh either (even if Aqua actually made a spam post!)

Just ask if you have any questions.

*hugs*