Introduction

In this blog post I will show you that the “Log Out” button does not actually log the user out.

The Log Out Button Does Not Log You Out

If you click on “Sair”, which translates to “Log Out” you’re not actually logged out. It doesn’t matter if you try to login straight away, try to login in a new tab or try to login in a new window. The only way to actually log out is to close the browser and open it again.

The video below demonstrates the issue in action.

I’ve tested this in Google Chrome on Android and the problem was present there too. This is an issue. Let’s say that the user logs on a device that’s not his. After he’s finished with his tasks he clicks on “Log Out”, thinking that his session has been terminated. This means that the attacker can simply open the website, click on “Log In” and obtain access to the victim’s account.

A little bit of exploring reveled the problem. The only think that clicking “Log Out” does is redirecting the user’s browser to https://www.autenticacao.gov.pt/c/portal/logout. This page, in turn, redirects the user back to the home page and sets some cookies to an empty string, namely COMPANY_ID, ID, LOGIN, PASSWORD and REMEMBER ME. I’m not sure what they are, since I haven’t seen them used anywhere else on the site. I assume it’s some code that was ported for somewhere, but was never adjusted to a different authentication mechanism (I just hope they weren’t storing the password in plain text there).

Figure 1: Cookies Set By The Log Out Page
Cookies Set By The Log Out Page

What the page fails to do, is clean the ASPXAUTH and ASP.Net_SessionId cookies. In ASP.NET ASPXAUTH is a cookie used to identity if the user is authenticated and the ASP.Net_SessionId is a cookie used to identify the user’s session on the server.

Those cookie’s expiration date is not set, which means that they will have the life time of a session cookie, meaning that they will get removed once the client is shut down. This explains why closing and opening the browser again actually logs the user out.

The solution to this problem is to simply clear the ASPXAUTH and ASP.Net_SessionId on the log out page.

Conclusion

In this blog post I demonstrated that the “Log Out” button does not actually log the user out, since it fails to clean the authentication cookie. This opens a window to unauthorized account access.

This blog post ends the series of blog posts related to the security weaknesses and vulnerabilities found in the Portuguese government’s authentication system autenticacao.gov.pt and Chave Móvel Digital. The issues described throughout these blog posts were not a result of a thorough analysis, but rather what I found in about 30 minutes. This leads me to believe that a more detailed study of the system’s security is likely to reveal many more problems.

I hope that the described issues will be fixed very soon, since they put millions of Portuguese citizens at risk. The agency responsible for these systems should take security more seriously and not as an afterthought.

Introduction

In this blog post I will be exploring the information leakage regarding if a phone number is registered or not. The system has a protection measure which can be easily bypassed. I will also provide proof-of-concept code.

Determining If A Phone Number Is Registered With An Account

If you enter a wrong phone number and PIN combination, you will be presented with the following message:

Figure 1: The System Ties To Hide Information
The System Ties To Hide Information

The error message reads:

O número de telemóvel ou o PIN estão errados ou registo inexistente

Which translates to:

Either the phone number or the PIN are incorrect or the account is not registered.

This might seem like a good thing: the system tries to give the attacker the least information possible. The problem is that this is useless, since it can be easily bypassed.

The issue is that if the phone number is registered, after 3 failed attempts, your account will be temporary locked. If the phone number is not registered, no lock in will occur.

Figure 2: Error Message When The Number Is Registered
Error Message When The Number Is Registered

The good news is that if you enter the wrong PIN is entered again, the temporary lockout period will be extended. Sometimes, you might be asked to solve a captcha. My personal experience, however, suggests that some simple machine learning can be used to bypass it. A better solution here would be to use a more sophisticated solution, like Google’s ReCaptcha.

Figure 3: A Simple Captcha Is Used
A Simple Captcha Is Used

Once you know that a phone number is registered in the system, you could apply social engineering attacks to obtain access to the user’s account. For example, in Portugal, you can send SMS messages with any Sender ID. Services like Amazon SNS are cheap and easy to use. When you receive your second-factor authentication code, it comes from the sender CMD (it stands for Chave Móvel Digital, or Digital Mobile Key, in English). What an attacker could do is send the user an SMS message with the sender id CMD saying that there is an issue with the user’s account and that he needs to follow the provided link to solve it. The link could perhaps exploit the XSS vulnerability described in the previous post to get the user’s PIN.

While not being a critical issue, it renders the existing protection system useless. What they could instead do is treat all of the numbers the same: lock the access to all of them, regardless of the fact if they are registered or not. This should, however, be done with some care. If you follow the naive approach of inserting every entered phone number into the database and counting the number of login attempts, the attacker might be able to flood the database with numbers. Careful system design is needed here to prevent DoS attacks.

The same trick works with e-mail and Twitter authentication. These two authentication methods are, however, not enabled by default.

Proof-Of-Concept

I wrote a simple proof-of-concept code.

A quick reverse-engineering of the authentication process showed that in order to access the login page, you need to go thought a specific sequence of steps, you can’t just go to the login page directly.

In that sequence of steps, your browser exchanges cookies with the server. Among those, there are 3 different session ids that you get through this process and a request that checks if your browser supports JavaScript.

With every POST request a humanCheck value is sent. This value is set on document load:

$(document).ready(function () {
            $('#humanCheck').val('8FBB298A-DE46-4657-88E8-95F1F1224784');
            $("#inputMobile").intlTelInput();
        });

I initially thought that that was a unique value generated on every request, acting as a nonce. Further analysis showed that this value is fixed. I’ve tried with different IP’s, different browsers and the value remained the same.

Conclusion

As demonstrated in this blog post, while the system tries to be ambiguous by not exposing if the phone number is registered with an account or not, it can be easily bypassed.

In the next and final blog post of this series I will explore another issue: the fact that the “Log Out” button does not actually log you out.

Introduction

In this blog post, I will be exploring the Reflected Cross-Site-Scripting (XSS) vulnerability present in the Portuguese government’s authentication system’s website www.autenticacao.gov.pt.

The XSS vulnerability is present in the Chave Móvel Digital (CMD) login page.

The XSS Vulnerability

At the moment of writing this post, there are 3 ways that you can authenticate yourself:

  • the second-factor code is sent to your mobile phone number
  • the second-factor code is sent to your e-mail
  • the second-factor code is sent to your as a private Twitter message
Figure 1: Chave Movel Digital Authentication Options
Chave Movel Digital Authentication Options

I’m going to choose the phone number authentication method, since this is the only one that’s required. The other two are not activated by default. The XSS vulnerability seems to only work on the mobile phone number and Twitter authentication methods. E-mail authentication seems to be doing input sanitization correctly. There are two separate places where you can login: the “Personal Area”, which only allows mobile phone number login, and the regular one, which allows the 3 authentication methods described above. The XSS vulnerability works in both places.

This is how the main login page looks like:

Figure 2: Digital Mobile Key login page
Digital Mobile Key login page

To login to the portal, you need to enter your mobile phone number and a numeric PIN. The Mobile phone number field is the one that will be used to exploit the XSS vulnerability. Throughout this section I will be using 1234 as the PIN number.

Now let’s see what happens when I try to enter a phone number that consists only of letters (Hello, World"):

Figure 3: Digital Mobile Key login page
Digital Mobile Key login page

And now let’s click on “Authenticate”:

Figure 4: User input gets copied without validation
User input gets copied without validation

Even though the page says The phone field only allows numbers, at the bottom, it still copies the input that I provided verbatim to the same field. At this point I was pretty sure that the page is vulnerable to XSS. So I looked at the source code to see how it can be exploited:

Figure 5: The input is copied to the value attribute
The input is copied to the value attribute

Okay, so the input is you provided initially is copied to the value attribute of the inputMobile element. So let’s try to show a very simple XSS that would inject a new span element into the page and would comment out some of the HTML. For this, in the Mobile phone number field, we’ll input the following text: " <span> Hello, World </span> <!--.

Figure 1: A simple XSS input
A simple XSS input

And here is the result:

Figure 6: Successful result of an XSS injection
Successful result of an XSS injection

As you can see, our custom HTMLwas successfully injected into the page. The rest of the page wasn’t commented out only because there is another closing HTML comment tag (-->) present in the page’s source.

Figure 1: Resulting HTML from the XSS attack
Resulting HTML from the XSS attack

Below is a video demonstration of how this vulnerability can be exploited to inject arbitrary JavaScript code into the page.

Whenever the text in the “Mobile phone number” field is changed, the ShowPrint() function is called. The only thing that it does is copies the value from the inputMobile element to the MainContent_hiddenMobile attribute:

function ShowPrint() {
            var mobileValue = document.getElementById("inputMobile").value;
            document.getElementById("MainContent_hiddenMobile").value = mobileValue;
            return true;
        }

Why this is done is not obvious, since both of the fields get POSTed to the the backend when you click on “Authenticate”:

Figure 1: Both of the attribute values get sent
Both of the attribute values get sent

However, if we manually change the value of the MainContent_hiddenMobile field to Hello, Reader and leave Hello, World in the inputMobile field, we can see that the value that that’s placed in the “Mobile phone number” box in the failed login page is the one from MainContent_hiddenMobile.

Figure 7: Manually changing the value of the hidden input
Manually changing the value of the hidden input
Figure 8: The hidden input value is the one that gets echoed back
The hidden input value is the one that gets echoed back

Conclusion

In this blog post I showed you the POST XSS vulnerability present in the Portuguese government’s autenticacao.gov.pt Chave Móvel Digital login page.

In the next blog post I will show you how you can find out if a mobile phone number is registered with Chave Móvel Digital or not, making the existing ambiguous “Either the mobile number or the PIN are incorrect” message useless and opening the door for various social engineering attacks.

Introduction

Back in January 2018, after about half an hour of exploring the “secure” authentication system provided by the Portuguese government I found various security issues and vulnerabilities. I will briefly explore them in a series of blog posts.

In this first blog post I will explain what is autenticacao.gov.pt and Chave Móvel Digital, as well as the sensitive information that they are supposed to be restricting the access too. I will also explore the overall weak security that the system has.

This whole series of blog posts is a result of my findings in a short period of time. It’s a compilation of the notes that I took, as a draft to be submitted to responsible agency, as part of responsible disclosure. A more thorough security audit is likely to reveal a lot more problems. If about 30 minutes is all to took to find those problems, what more would a malicious, motivated attacker would find.

I’m hoping that the issues will be fixed as soon as possible.

Responsible Disclosure

On January 21st I emailed ama@ama.pt (email of the responsible agency), describing the vulnerabilities and security issues, as well as offering a more detailed description draft and Proof-Of-Concept code, but obtained no answer.

What Is Autenticacao.gov.pt and Chave Movel Digital (CMD)?

Autenticacao.gov.pt’s homepage has the following text:

O sítio oficial dos meios de identificação eletrónica, assinatura digital e autenticação segura do Estado.

which translates to:

The official site of the means of electronic identification, digital signature and state's secure authentication.

Basically, it’s an authentication system that allows the Portuguese citizens to authenticate to administrative public services. Once authenticated, you can, among other things:

  • register a birth of a child
  • register a vehicle
  • register a company
  • request a birth certificate
  • obtain the criminal record of an individual or a company
  • submit/consult revenue of an individual or a company
  • sign documents

New functionalities are constantly added. As you can see, it is a very useful system, since it allows you to save a lot of time. It allows you to do many things from the comfort of your home and in minutes, instead of spending hours at the pubic administrative services.

Despite being very useful, it does expose some sensitive personal information and allows you to do some sensitive actions. If this gets into the wrong hands, it can lead to many problems, such as identity theft.

For this reason, it is very important to have a secure authentication system. This is what Chave Móvel Digital (CMD), which translates to Digital Mobile Key, supposedly does. The main problem with it is that its security is questionable.

Bad Security From The Ground Up

So now let’s briefly talk about the security of the Chave Móvel Digital authentication system. In December of 2017 me and my girlfriend went to a Christmas fair. Among other things, there were various company booths. One of them was VR-related, so we decided to go in and check it out. After playing around with the Microsoft’s HoloLens for a little bit, we were approached by a lady there that asked us if we have already activated the Chave Móvel Digital system. After explaining to me what that was, she told me that it soon would be mandatory to have it activated. I could either do it there now or go to the public administrative organs, wait in queue for hours and do it there.

Motivated by the interest in the security of this new system and the potential future time savings (bureaucracy is a big issue in Portugal), I decided to activate it there.

The Authentication Process

Let’s look at the authentication process. To login using the Chave Móvel Digital system you have to enter your mobile phone number and your PIN. If the entered details are correct, a code is sent to your mobile number, which you have to enter to login.

The issues here are the following:

  • PIN can only be 4-8 numeric digits
  • SMS as the only second factor authentication option

The first one is the most serious one. PIN is essentially your password. The issue here is that it only be numeric its length is limited to be between 4 and 8 characters.

Let’s look at the numbers. 8 digit numeric-only passwords means that there are 10^8 = 100000000 possible combinations. This is nothing for modern computer. For example, let’s see how long will it take to generate all of the possible combinations on my machine. Assuming that the file gen.py contains the following code:

# gen.py file contents
for i in range(0, 99999999):
    pass

Here is what time python gen.py gives us:

$ time python gen.py python gen.py 3.20s user 0.01s system 99% cpu 3.208 total

While this is not the most accurate way of measuring the process execution time, it’s good enough to transmit the idea of how quick modern computers are at generating the entire key space. This is to give you an intuition of how small the number of possible PINs is. It’s 2018 and the Portuguese government seems to think that 4 to 8 digit passwords are secure.

Let’s compare this to an 8 character long password that allows numbers (10 different: 0-9), letters (52 different: a-z and A-Z) and special characters (32 different). For every position we would have 10 + 52 + 32 = 94 possible values. This means that there would be a total of 94^8=6095689385000000 possible combinations. If we use the rule of three to estimate how long it would take my computer to generate the entire key space, we would see that it would come out to more than 6 years.

As an interesting note, when I was signing up with the system, the lady told me that from her experience people usually choose 4 digit PINs, so that they’re the same as their phone PINs. This seems like a reasonable assumption to take. In this case, the simple task is made even simpler, since the number of possible combinations is lowered to 10^4 = 10000:

# gen.py file contents
for i in range(0, 9999):
    pass

Here is what time python gen.py gives us:

$ time python gen.py python gen.py 0.01s user 0.01s system 98% cpu 0.015 total

I presented the time measurements just to give you an idea of how small the possible key space is. Of course, we can’t conclude that the PIN will be bruteforced in 3 seconds. The system limits the number of attempts that you can perform. However, if you think that this makes bruteforcing impossible, you’re wrong. I will now demonstrate why.

Let’s assume that the number of login attempts is limited to 3 (in reality, autenticacao.gov.pt allows more than 3 attempts). Let’s also assume that a lot of people will in fact use a 4 digit long PIN. This means that for each individual user there will be a 3/10000 chance of guessing his PIN. Pretty unlikely, right? Well, this number also means that for every 10000 accounts the attacker will guess the PIN of 3 of them.

To make the things worse, the numbers above are assuming a totally random distribution of PINs, which is not true. A data analysis of almost 3.4 million four digit passwords showed that 1234, 0000 and 1111 make up almost 20% of used passwords. Considering this, the probability of guessing the PIN of a single account would be about 2000/10000. Therefore, for every 10000 accounts the attacker would guess the PIN of 2000 of them.

Why doesn’t the system allow passwords that contain numbers, letters and special characters? Maybe it comes down to a legacy issue, but it is definitely possible to write middleware that would take care of that. The system is supposed to protect very sensitive data, after all.

As an example, let me present to you a very simple solution that would make the existing system more secure. It would do it so by allowing the users to use a password that consists of numbers, letters and special characters, while keeping the core legacy backend unchanged. I’m going to take the following assumptions:

  • the backend requires a 4-8 digit PIN
  • all of the PINs are stored in plain text

A very simple solution would be to do the following:

  1. When the user is registering/changing his password: hash the password provided by the user with the phone number and store it in the database. Generate an 8 digit PIN and store it. This PIN will never be revealed to the user.
  2. When the users tries to log in: hash the provided password with the phone number and compare this result to the one that you have stored in the database.
    • if they are equal, send an “OK” to the backend, which will then obtain the PIN from the database and use it as needed
    • if they are not equal, fail the authentication

I am in no way suggesting that this is the best solution, because it is not. It is, however, a very simple one, that would require very little changes to the overall system and keeping the core legacy backend unchanged. All you would have to do is add a new table to the database (this way you will keep your existing tables unchanged) for the password hash and change the authentication code in the middlware between the client and the backend system. Even if the system is different from the one in my assumptions, you can adopt the solution with relative ease.

The second point relates to only allowing SMS as the second-factor authentication. While being a lot better than not using two-factor authentication at all, it’s not the most secure option. Let me give you two examples. By using some social engineering, if an attacker can get access to your personal information, they can contact your phone company and move your phone number to a new SIM. By exploiting the vulnerabilities in the SS7 protocol the attacker can read all of your text messages.

Allowing the use of other authentication options, such as Google Authenticator would make the second-factor authentication a lot more secure. In general, I have a feeling that the engineers assumed that there was no need for strong passwords, since two-factor authentication is required and only the user has the access to his mobile phone number. This assumption is, of course, incorrect. While it is true, that if you’re using a good second-factor authentication, it’s safe to weaken your password, this does not apply to making your password a trivial 4-8 digit one.

Now, this just smelled bad security. I was certain that I could more problems. I was correct. The security issues that I’m going to be describing in this series of blog posts were found after about 30 minutes of exploring.

There is now an option to receive your second-factor code as a private message on Twitter or be sent to your e-mail address. Those are not enabled by default and you have to do it yourself. I am not sure when those options were added. In any case, when logging in into your “Personal Area”, you only have the option of using your mobile phone number.

Conclusion

In this first post of the four-part series I described why the security of the authentication system is flawed. This is also what motivated me to explore further: I was sure that it wouldn’t be hard to find more issues, since the “security” of this “secure authentication system” did seem like an afterthought.

In the next blog post, I will be describing the multiple XSS vulnerabilities present in the authentication page.

Introduction

A while back I was researching the most efficient way to check if a number is prime. This lead me to find the following piece of code:

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

I was intrigued. While this might not be the most efficient way, it’s certainly one of the less obvious ones, so my curiosity kicked in. How on Earth could a match for the .?|(..+?)\1+ regular expression tell that a number is not prime (once it’s converted to its unary representation)?

If you’re interested, read on, I’ll try to dissect this regular expression and explain what’s really going on. The explanation will be programming language agnostic, I will, however, provide Python, JavaScript and Perl versions of the Java code above and explain why they are slightly different.

I will explain how the regular expression ^.?$|^(..+?)\1+$ can filter out any prime numbers. Why this one and not .?|(..+?)\1+ (the one used in Java code example above)? Well, this has to do with the way String.matches() works, which I’ll explain later.

While there are some blog posts on this topic, I found them to not go deep enough and give just a high level overview, not explaining some of the important details well enough. Here, I’ll try to lay it out with enough detail so that anyone can follow and understand. The goal is to make it simple to understand for any one - whether you are a regular expression guru or this is the first time you’ve heard about them, anyone should be able to follow along.

1. Prime Numbers and Regular Expressions - The Theory

Let’s start at a higher level. But wait, first, let’s get every one on the same page and begin with some definitions. If you know know what a prime number is and are familiar with regular expression, feel free to skip this section. I will try to explain how every bit of the regular expression works, so that even people who are new or unfamiliar with them can follow along.

Prime Numbers

First, a prime number is any natural number greater than 1 that is only divisible by 1 and the number itself, without leaving a remainder. Here’s a list of the fist 8 prime numbers: 2, 3, 5, 7, 11, 13, 17, 19. For example, 5 is prime because you can only divide it by 1 and 5 without leaving a remainder. Sure we can divide it by 2, but that would leave a remainder of 1, since 5 = 2*2 + 1. The number 4, on the other hand, is not prime, since we can divide it by 1, 2 and 4 without leaving a remainder.

Regular Expressions

Okay, now let’s get to the regular expression (A.K.A. regex) syntax. Now, there are quite a few regex flavors, I’m not going to focus on any specific one, since that is not the point of this post. The concepts described here work in a similar manner in all of the most common flavors, so don’t worry about it. If you want to learn more about regular expressions, check out Regular-Expressions.info, it’s a great resource to learn regex and later use it as a reference.

Here’s a cheatsheet with the concepts that will be needed for the explanation that follows:

  • ^ - matches the position before the first character in the string
  • $ - matches the position right after the last character in the string
  • . - matches any character, except line break characters (for example, it does not match \n)
  • | - matches everything that’s either to the left or the right of it. You can think of it as an or operator.
  • ( and ) delimit a capturing group. By placing a part of a regular expression between parentheses, you’re grouping that part of the regular expression together. This allows you to apply quantifiers (like +) to the entire group or restrict alternation (i.e. “or”: |) to part of the regular expression. Besides that, parentheses also create a numbered capturing group, which you can refer to later with backreferences (more on that below)
  • \<number_here> - backreferences match the same text as previously matched by a capturing group. The <number_here> is the group number (remember the discussion above? The one that says that parentheses create a numbered capturing group? That’s where it comes in). I’ll give an example to clarify things in a little bit, so if you’re confused, hang on!
  • + - matches the preceding token (for example, it can be a character or a group of characters, if the preceding token is a capturing group) one or more times
  • * - matches the preceding token zero or more times
  • if ? is used after + or * quantifiers, it makes that quantifier non-greedy (more on that below)

Capturing Groups and Backreferences

As promised, let’s clarify how capturing groups and backreferences work together.

As I mentioned, parentheses create numbered capturing groups. What do I mean by that? Well, that means that when you use parentheses, you create a group that matches some characters and you can refer to those matched characters later on. The numbers are given to the groups in the order they appear in the regular expression, beginning with 1. For example, let’s say you have the following regular expression: ^aa(bb)cc(dd)$. Note, that in this case, we have 2 groups. They are numbered as follows:

Regular Expression capturing group numbers

This means that we can refer to the characters matched by them later using backreferences. If we want to refer to what is matched by (bb), we use \1 (we use 1 because we’re referring to the capturing group #1). To refer to the characters matched by (dd) we use \2. Putting that together, the the regular expression ^aa(bb)cc(dd)\1$ matches the string aabbccddbb. Note how we used \1 to refer to the last bb. \1 refers to what was matched by the group (bb), which in this case, was the sting bb.

Now note that I emphasize on what was matched. I really mean the characters that were matched and not ones that can be matched. This means, that the regular expression ^aa(.+)cc(dd)\1$ does match the sting aaHELLOccddHELLO, but does not match the sting aaHELLOccddGOODBYE, since it cannot find what was matched by the group #1 (in this case it’s the character sequence HELLO) after the character sequence dd (it finds GOODBYE there).

Greedy and Non-Greedy Quantifiers

If you remember correctly, in the cheatseheet above, I mentioned that ? can be used to make the preceding quantifier non-greedy. Well, okay, but what does that actually mean? + is greedy quantifier, this means that it will try to repeat the preceding token as many times as possible, i.e. it will try to consume as much input as it can. The same is true for the * quantifier.

For example, let’s say we have the string <p>The Documentary</p> (2005) and the regular expression <.+>. Now, you might think that it will match <p>, but that’s not true. The matched string will actually be <p>The Documentary</p>. Why is that? Well, that has to do with the fact mentioned above: the + will try to consume as much input as it can, so that means that it will not stop at the first >, but rather at the last one.

Now how do we go about making a quantifier non-greedy? Well, you might be already tired of hearing that (since I’ve already mentioned it twice), but in order to make a greedy quantifier non-greedy, you put a question mark (?) in front of it. It’s really as simple as that. In case you’re still confused, don’t worry, let’s see an example.

Suppose we have the same string: <p>The Documentary</p> (2005), but this time, we only want to match what is between the first < and >. How would we go about that? Well, all we have to do is add ? in front of the +. This will lead us to the <.+?> regex. “Uhhh, okay…”, you might wonder, “But what does that actually do?”. Well, it will make the + quantifier non-greedy. This means that it will make the quantifier consume as little input as possible. Well, in our case, the “as little as possible” is <p>, which is exactly what we want! To be precise, it will match both of the p’s: <p> and </p>, but we can easily get what we want by asking for the fist match (<p>).

A Little Note On ^ and $

Since we’re on it, I’ll take a moment to quickly explain what the ^ and $ actually do. If you remember correctly, ^ matches the position right before the first character in the string and $ matches the position right after the last character in the string. Note how in both of the regular expressions above (<.+> and <.+?>) we did not use them. What does that mean? Well, that means that the match does not have to begin at the start of the string and end at the end of the string. Taking the second, non-greedy, regex (<.+?>) and the sting The Game - <p>The Documentary</p> (2005), we would still obtain our expected matches (<p> and </p>), since we’re not forcing it to begin at the beginning of the string and end at the end of the string.

2. The Regular Expression That Tells If A Number Is Prime

Phew, so we’re finally done with the theoretical introduction and now, since we’ve already have everything we need under the belt, we’re ready to dive into the analysis of how the ^.?$|^(..+?)\1+$ regular expression can match non-prime numbers (in their unary form).

You can ignore the ? in the regular expression, it’s there for performance reasons (explained below) - it makes the + non-greedy. If it confuses you, just ignore it and consider that the regex is actually ^.?$|^(..+)\1+$, it works as well, but it’s slower (with some exceptions, like when the number is prime, where the ? makes no difference whatsoever). After explaining how this regular expression works, I’ll also explain what that ? does there, you shouldn’t have any trouble understanding it after you understand the inner workings of this regex.

All of the discussion below assumes that we have the number represented in its unary form (or base-1, if you prefer). It doesn’t actually have to be represented as a sequence of 1s, it can be a sequence of any characters that are matched by .. This means that 5 does not have to be represented as 11111, it might as well be represented as fffff or BBBBB. As long as there are five characters, we’re good to go. Please note, that the characters have to be the same, no mixtures of characters are allowed, this means that we cannot represent 5 as ffffB, since here we have a mixture of two different characters.

High Level Overview

Let’s begin with a high level overview and then dive into the details. Our ^.?$|^(..+?)\1+$ regular expression consists of two parts: ^.?$ and ^(..+?)\1+$.

As a heads-up, I just want to say that I’m lying a little in the explanation in the paragraph about the ^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine checks for multiples, it actually starts with the highest number and goes to the lowest, and not how I explain it here. But feel free to ignore that distinction here, since the regular expression still matches the same thing, it just does it in more steps (so I’ll actually be explaining how ^.?$|^(..+?)\1+?$ works: notice the extra ? after the +.

I’m doing this because I believe this explanation is less verbose and easier to understand. And don’t worry, I explain how I lied and reveal the shocking truth later on, so keep on reading. Well, maybe it’s not really that shocking, but I wanna keep you engaged, so I’ll stick to that naming.

The regex engine will first try to match ^.?$, then, if it fails, it will try to match ^(..+?)\1+$. Note that the number of characters matched corresponds to the matched number, i.e. if 3 characters are matched, that means that number 3 was matched, if 26 characters are matched, that means that the number 26 was matched.

^.?$ matches strings with zero or one characters (corresponds to the numbers 0 and 1, respectively).

^(..+?)\1+$ first tries to match 2 characters (corresponds to the number 2), then 4 characters (corresponds to the number 4), then 6 characters, then 8 characters and so on. Basically it will try to match multiples of 2. If that fails, it will try to first match 3 characters (corresponds to the number 3), then 6 characters (corresponds to the number 6), then 9 characters, then 12 characters and so on. This means that it will try to match multiples of 3. If that fails, it proceeds to try match multiples of 4, then if that fails it will try to match multiples of 5 and so on, until the number whose multiple it tries to match is the length of the string (failure case) or there is a successful match (success case).

Diving Deeper

Note, that both of parts of the regular expression begin with a ^ symbol and end with a $ symbol, this forces to what’s in between those symbols (.? in the first case and (..+)\1+ in the second case) to start at the beginning of the string and end at the end of the string. In our case that string is the unary representation of the number. Both of the parts are separated separated by an alternation operator, this means that either only one of them will be matched or neither will. If the number is prime, a match will not occur. If the number is not prime a match will occur. To summarize, we concluded that:

  • either ^.?$ or ^(..+?)\1+$ will be matched
  • the match has to be on the whole string, i.e. start at the beginning of the string and end at the end of the string

Okay, but what does each one those parts matches? Keep in mind that if a match occurs, it means that the number is not prime.

How The ^.?$ Regular Expression Works

^.?$ will match 0 or 1 characters. This match will be successful if:

  • the string contains only 1 character - this means that we’re dealing with number 1 and, by definition, 1 is not prime.
  • the string contains 0 characters - this means that we’re dealing with number 0, and 0 is certainly not prime, since we can divide 0 by anything we want, except for 0 itself, of course.

If we’re given the sting 1, ^.?$ will match it, since we have only one character in our string (1). The match will also occur if we provide an empty string, since, as explained before, ^.?$ will match either an empty string (0 characters) or a string with only 1 character.

Okay, so far so so good, we certainly want our regex to recognize 0 and 1 as non-primes. But that’s not enough, since there are numbers other than 0 and 1 that are not prime. This is where the second part of the regular expression comes in.

How The ^(..+?)\1+$ Regular Expression Works

^(..+?)\1+$ will first try to match multiples of 2, then multiples of 3, then multiples of 4, then multiples 5, then multiples of 6 and so on, until the multiple of the number it tries to match is the length of the string or there is a successful match.

But how does it actually work? Well, let’s dissect it!

Let’s focus on the parentheses now, here we have (..+?) (remember, ? just makes this expression non-greedy). Notice that we have a + here, which means “one or more of the preceding token”. This regex will first try to match (..) (2 characters), then (...) (3 characters), then (....) (4 characters), and so on, until the length of the string we’re matching against is reached or there is a successful match.

After matching for some number of characters (let’s call that number x, the regular expression will try to see if the string’s length is multiple of x. How does it do that? Well, there’s a backreference. This takes us to the second part of the regex: \1+. Now, as explained before this will try to repeat the match in capturing group #1 one or more times (actually it’s more “more or one times, I’m lying a little bit”) This means that first, it will try to match x * 2 characters in the string, then x * 3, then x * 4, and so on. If it succeeds in any of those matches, it returns it (and this means that the number is not prime). If it fails (it will fail when x * <number> exceeds the length of the string we’re matching against), it will try the same thing, but with x+1 characters, i.e, first (x+1) * 2, then (x+1) * 3, then (x+1) * 4 and so on (because now the \1+ backreference refers to x+1 characters). If the number of characters matched by (..+?) reaches the length of the string we’re matching against, the regex matching process will stop and return a failure. If there is a successful match, it will be returned.

Example Time

Now, I’ll sketch some examples to make sure you got everything. I will provide one example where a regular expression succeeds to match and one where it fails to match. Again, I’m lying in the order of sub-steps (the nested ones, i.e the ones that have a ., like 2.1, 3.2, etc), just a little.

As an example of where a match succeeds, let’s consider the string 111111. The length of the string we’re matching against is 6. Now, 6 is not a prime number, so we expect the regex to succeed with the match. Let’s see a sketch of how it will work:

1. It will try to match ^.?$. No luck. The left side of | returns a failure 2. It try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:

  • 2.1 The backreference \1+ will try to match 11 twice (i.e 1111). No luck.
  • 2.2 The backreference \1+ will try to match 11 trice (i.e 111111). Success!. Right side of | returns success

Woah, that was fast! Since the right side of | succeeded, our regular expression succeeds with the match, which means our number is not prime.

As an example of where a match fails, let’s consider the string 11111. The length of the string we’re matching against is 5. Now, 5 is a prime number, so we expect the regex to fail to match anything. Let’s see a sketch of how it will work:

1. It will try to match ^.?$. No luck. The left side of | returns a failure 2. It try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:

  • 2.1 The backreference \1+ will try to match 11 twice (i.e 1111). No luck.
  • 2.2 The backreference \1+ will try to match 11 trice (i.e 111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.

3. (..+?) now matches 111:

  • 3.1 The backreference \1+ will try to match 111 twice (i.e 111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.

4. (..+?) now matches 1111:

  • 4.1 The backreference \1+ will try to match 1111 twice (i.e 11111111). No luck. Length of string exceeded (8 > 5). Backreference returns a failure.

5. (..+?) now matches 11111:

  • 5.1 The backreference \1+ will try to match 11111 twice (i.e 1111111111). No luck. Length of string exceeded (10 > 5). Backreference returns a failure.

5. (..+?) will try to match 1111111. No luck. Length of string exceeded (6 > 5). (..+?) returns a failure. The right side of | returns a failure

Now since both sides of | failed to match anything, the regular expression fails to match anything, which means our number is prime.

What About The ?

Well, I mentioned that you can ignore the ? symbol in the regular expression, since it’s there only for performance reasons, and that’s true, but there is no need to keep its purpose a mystery, so I’ll explain what it actually does there.

As mentioned before, ? makes the preceding + non-greedy. What does it mean in practice? Let’s say our string is 111111111111111 (corresponds to the number 15). Let’s call L the length of the string. In our case, L=15.

With the ? present there, + will try to match its preceding token (in this case .) as few times as possible. This means that first (..+?) will try to match .., then ..., then .... and then ....., after which our whole regex (^.?$|^(..+?)\1+$) would succeed. So first, we’ll be testing the divisibility by 2, then by 3, then by 4 and then by 5, after which we would have a match. Notice that the number of steps in (..+?) was 4 (first it matches 2, then 3, then 4 and then 5).

If we omitted the ?, i.e if we had (..+), then it would go the other way around: first it would try to match ............... (the number 15, which is our L), then .............. (the number 14, i.e L-1), and so on until ....., after which the whole regex would succeed. Notice that even though the result was the same as in (..+?), in (..+) the number of steps was 11 instead of 4. By definition, any divisor of L must be no greater than L/2, so that means that means that 8 steps were absolutely wasted computation, since first we tested the divisibility by 15, then 14, then 13, and so on until 5 (we could only hope for a match from number 7 and downwards, since L/2 = 15/2 = 7.5 and the first integer smaller than 7.5 is 7).

The Shocking Lie

As I mentioned before, I actually lied in the explanation of how the multiples of a number are matched. Let’s say we have the string 111111111111111 (number 15).

The way I explained it before was that the regular expression would begin to test for divisibility by 2. It would do so by first trying to match 2*2 characters, then 2*3, then 2*4, then 2*5, then 2*6, then 2*7, after which it would fail to match 2*8, so it would try its luck with testing for divisibility by 3, by first trying to match for 3*2 characters, then for 3*3 characters, then for 3*4 and then for 3*5, where it would succeed. This is actually what would happen if the regular expression was ^.?$|^(..+?)\1+?$ (notice the ? at the end), i.e., if the +following the backreference was non-greedy.

What actually happens is the opposite. It would still try to test for the divisibility by 2, first, but instead of trying to match for 2*2 characters, it would begin with trying to match for 2*7, then for 2*6, then for 2*5, then for 2*4, then for 2*3 and then for 2*2, after which it would fail and, once again, try its luck with divisibility by 3, by first trying to match for 3*5 characters, where it would succeed right away.

Notice, that in the second case, which is what happens in reality, less steps are required: 11 in the first case vs 7 in the second (in reality, both of the cases would require more steps than presented here, the goal of this explanation is not count them all, but to transmit the idea of what’s happening in both cases, it’s just a sketch of what’s going on under the hood). While both versions are equivalent, the one explained in this blog post, is more efficient.

3. The Java Case

Here’s the piece of Java code that started all of this:

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

If you remember correctly, I said that due to the peculiarities of the way String.matches works in Java, the regular expression that matches non-prime numbers is not the one in the code example above (.?|(..+?)\1+), but it’s actually ^.?$|^(..+?)\1+$. Why? Well, turns out String.matches() matches on the whole string, not on any substring of the string. Basically, it “automatically inserts” all of the ^ and $ present in the regex I explained in this post.

If you’re looking for a way not to force the match on the whole string in Java, you can use Pattern, Matcher and Matcher.find() method.

Other than that, it’s pretty much self explanatory: if the match succeeds, then the number is not prime. In case of a successful match, String.matches() returns true (number is not prime), otherwise, it return false (number is prime), so to obtain the desired functionality we negate what the method returns.

new String(new char[n]) returns a Stringof n null characters (the . in our regex matches them).

4. Code Examples

Now, as promised, it’s time for some code examples!

Java

Although I already presented this code example twice in this post, I’ll do it here again, just to keep it organized.

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

Python

I’ve expressed my sympathy for Python before, so of course I have to include this one here.

def is_prime(n):
    return not re.match(r'^.?$|^(..+?)\1+$', '1'*n)

JavaScript

Java is to JavaScript as car is to carpet.

That’s a joke I like. I didn’t come up with it and I don’t really know its first source, so I don’t know whom to credit. Anyways, I’m actually going to give you two versions here, one which works in ES6 and one that works in previous versions.

First, the ECMAScript 6 version:

function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test('1'.repeat(n));
}

The feature that’s only available in ECMAScript 6 is the String.prototype.repeat() method.

If you gotta use previous versions of ES, you can always fall back to Array.prototype.join(). Note, however, that we’re passing n+1 to join(), since it actually places those characters in between array elements. So if we have, let’s say, 10 array elements, there are only 9 “in-betweens”. Here’s the version that will work in versions prior to ECMAScript 6:

function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test(Array(n+1).join('1'));
}

Perl

Last, but not least, it’s time for Perl. I’m including this here because the regular expression we’ve been exploring in this blog post has been popularized by Perl. I’m talking about the one-liner perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number> (replace <number> with an actual number).

Also, since I haven’t played around with Perl before, this seemed like a good opportunity to do so. So here we go:

sub is_prime {
    return !((1x$_[0]) =~ /^.?$|^(..+?)\1+$/);
}

Since Perl isn’t the most popular language right now, it might happen that you’re not familiar with its syntax. Now, I’ve had about 15 mins with it, so I’m pretty much an expert, so I’ll take the liberty to briefly explain the syntax above:

  • sub - defines a new subroutine (function)
  • $_[0] - we’re accessing the first parameter passed in to our subroutine
  • 1x<number> - here we’re using the repetition operator x, this will basically repeat the number 1 <number> of times and return the result as a string. This is similar to what '1'*<number> would do in Python or '1'.repeat(<number>) in JavaScript.
  • =~ is the match test operator, it will return true if the regular expression (its right-hand side) has a match on the string (its left-hand side).
  • ! is the negation operator

I included this brief explanation, because, I myself, don’t like being left in mystery about what a certain passage of code does and the explanation didn’t take up much space anyways.

Conclusion

That’s all folks! Hopefully, you’re now demystified about how a regular expression can check if a number is prime. Keep in mind, that this is far from efficient, there are a lot more efficient algorithms for this task, but it is, nonetheless, a fun and interesting thing.

I encourage you to go to a website like regex101 and play around, specially if you’re still not 100% clear about how everything explained here works. One of the cool things about this website is that it includes an explanation of the regular expression (column on the right), as well as the number of steps the regex engine had to make (rectangle right above the modifiers box) - it’s a good way to see the performance differences (through the number of steps taken) in the greedy and non-greedy cases.

If you have any questions or suggestions, feel free to post them in the comment section below or get in touch with me via a different medium.

EDIT:

  • Thanks to joshuamy for pointing out a typo in Perl code
  • Thanks to Keen for pointing out a typo in the post
  • Thanks to Russel for submitting a Swift 2 code example
  • I didn’t want to get into the topic of regular/non-regular languages and related, since it’s theory that isn’t crucial for the topic of this post, but as lanzaa pointed out, there is a difference between “regex” and “regular expression”. What was covered in this blog post wasn’t a regular expression, but rather a regex. In the “real world”, however (outside of academia), those terms are used interchangeably