Finding Patterns in π

π is an interesting number. It’s not only irrational, it’s transcendental. Its decimal representation “looks random” (it isn’t random, it’s precisely defined, but it seems to pass statistical tests for randomness). It seems likely that any pattern of digits you want can be found somewhere in that representation; that would follow if π is normal, which is widely believed but still unproven. It may take a while to find a given pattern, but with enough digits of π you can expect to find any number you like. For example, Jenny’s number, 8675309, is found at the 9,202,591st position (counting the initial 3 in the number as being at position 0).

I found that out using a cloud application I created to search for any string of decimal digits in π. The approach I took to building it demonstrates some useful approaches to cloud application development and architecture. That’s what this post is about: not a comprehensive cloud architecture framework, just an example of how to think about architecting applications for the cloud. This post doesn’t have a detailed description of the application, but there is example code available at https://github.com/engelke/where-in-pi-is/. With a bit of work you could use that code to build and deploy your own π digit searcher. Or you can get inspiration on how to build applications to solve other problems you encounter.

The Data

If I’m going to find strings in the digits of π, I’ll have to have those digits available somehow. In theory, a program could calculate more and more digits as it searched for a string, stopping when the string is found. But that would be ridiculous; calculating those digits is an intensive, slow process.

Lucky for me, someone already calculated quite a few digits of π. On March 14, 2019, Google announced that Emma Haruka Iwao had calculated more than 31.4 trillion digits of π using Google Cloud Platform tools. That was a new world record when announced. And Google set up a web API for returning selected digits at pi.delivery. To get the seven digits starting at position 9,202,591 (using the same numbering as above), make a web request to https://api.pi.delivery/v1/pi?start=9202591&numberOfDigits=7. You should see:

{"content":"8675309"}

So one way to search for a seven digit number in π would be to just request the seven digits starting at position 0, then at position 1, and so on, until the number being looked for is returned.

Don’t try that. Really, don’t.

The service is pretty fast, taking about 75 ms per request of that size when I tried it out. But even at that rate, stepping through the digits one position at a time would take more than a week of non-stop requests to reach position 9,202,591, and an unlucky search could take years (unless Google blocked you first for abusing the API). I want faster answers.

So let’s figure out a faster way by building a cloud application to find digit strings in π. When we design for the cloud, we don’t think of a single computer as the application platform. No, the entire collection of available web services is our platform. And we will commonly use several of those services, communicating with each other in various ways, to build an application. For the π searcher, we start by identifying each distinct action that needs to be done, and then implement each action as an independent piece. Then we’ll use cloud technology to get those pieces to work together.

First Piece: Searching

Something has to store the digits of π and search through them. There are links on the pi.delivery site to download the digits to your own storage for processing. The simplest fast way to stream through data is to put it in a file attached to a virtual machine, so that’s what I did: launched a virtual machine on Google Compute Engine and put the digits into a text file named pi.txt. I wrote a Python program that memory-mapped that file to a Python byte string and used the built-in find method for the search. I’d look for other approaches if this turned out to be too slow.
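
Here is a minimal sketch of that approach (the names are mine, and the repository’s where_is function differs in its details), assuming pi.txt holds the digits as one long ASCII string with no decimal point:

import mmap

def where_is(digits, path="pi.txt"):
    """Return the 0-based position of digits in pi, or -1 if not found."""
    with open(path, "rb") as f:
        # Map the file into memory; the OS pages it in as needed, so even
        # a very large file doesn't have to fit in RAM all at once.
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return data.find(digits.encode())
        finally:
            data.close()

print(where_is("8675309"))   # 9202591, counting the leading 3 as position 0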

A search for the seven-digit string 8675309 took only 20 ms. That’s much better than the week or more of repeated calls to the API site. In fact, it’s good enough for me, for now. For every extra digit in length, we can expect the search to take about ten times as long. So looking for an 8-digit number (like a date) should take about a fifth of a second, a 10-digit number (like a phone number with area code) about 20 seconds, and so on. And though this is fairly fast, many searches will take too long for the requester to wait for a response the way they would for a web page. Users will have to ask for a search and then somehow get the result later.

That’s the first piece of the application: a search box that can find a requested string of digits reasonably fast. See the where_is function in https://github.com/engelke/where-in-pi-is/blob/master/computeengine/find-in-pi.py for a minimal solution. Now let’s look at what else is needed.

Second Piece: Requesting a Search

The next problem to be solved is how someone can request a search. I completely control the search box virtual machine, so I can just SSH into it and run Python code. That’s not going to work for anybody else. There needs to be some way people can ask for a search to be run. I can think of a few ways:

  1. Fill out a form on a web page
  2. Send an email to a special address
  3. Post a tweet, mentioning a particular Twitter account
  4. Send a text via SMS
  5. Make a voice phone call and say a number (or use a keypad)

We will have to build one or more of those ways, but before getting into that, let’s figure out how the front end that a user interacts with will get the request to the search box. We want the method used to be asynchronous (so the requesting program doesn’t have to wait and possibly get overloaded) and authenticated (so malicious attackers can’t flood the searcher with fake requests). Cloud platforms offer services tailored exactly to that need: message passing.

There are different flavors of message passing available; Google offers PubSub. An application can create one or more topics (for example, one called search_requests); some components publish messages to a topic and other components subscribe to it, hence the name. Google PubSub guarantees that published messages will be delivered at least once, and only rarely delivers a message more than once. That’s okay for our use, since the worst case is that we (very rarely) search for the same requested string more than once.

The front end getting a user request for a search is going to publish a message to the search_requests topic, and the search box will subscribe to that same topic. A PubSub subscription can be configured as push messaging or pull messaging. Push messaging forces each message on the subscriber as soon as it is available, and keeps pushing each message until the subscriber acknowledges it or a retry or timeout limit is exceeded. Pull messaging waits for the subscriber to ask for available messages before delivering them.

Pull messaging is a good fit for this use, since the search box can have a loop that asks for messages, performs the searches being requested one at a time, and then asks for more messages. The search box can’t be overloaded this way, since it’s only running one search at a time. Of course, the backlog of messages will grow if the search box doesn’t consume them fast enough, so that will have to be watched. If the box has the capacity to perform more than one search at a time, we can just have multiple processes, each fetching and processing messages on their own. If we need to scale way up, we can deploy multiple search boxes.
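
That loop on the search box could look roughly like the sketch below, using the google-cloud-pubsub client. The project ID, subscription name, and result publishing here are assumptions of mine, and exact call signatures vary between library versions:

import json
from google.cloud import pubsub_v1

PROJECT = "your-project-id"           # assumption: your GCP project
SUBSCRIPTION = "search-requests-sub"  # assumption: a pull subscription on search_requests

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

while True:
    # Ask for one message at a time, so the box only ever runs one search at a time
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
    for received in response.received_messages:
        req = json.loads(received.message.data.decode())
        position = where_is(req["digits"])   # the where_is search from the first piece
        # ... publish the position and delivery details to the send_result topic ...
        subscriber.acknowledge(request={"subscription": sub_path,
                                        "ack_ids": [received.ack_id]})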

Third Piece: User Request Front End

I listed five possible ways for a user to request a search earlier. We need to implement at least one of them, or there’s no point to having a search box. We will just do one of them for this example, and the easiest seems to be the first: give the user a web page with a form to fill out, requesting the search. If we needed to we could always implement more front ends later, just by having them each publish to the search_requests PubSub topic.

There are lots of ways to provide a web page with a form and then accept the filled in form in return. If you are used to managing your own computing resources, you’ll likely think of setting up a virtual machine with web server software for that. That will work just fine, but requires doing our own system administration and maintenance. Serverless computing products let you write your application code and leave everything else, including scaling, monitoring, and logging, up to the platform.

We need a component that responds to web requests with a page containing a form for a GET request, and by publishing a message to the search_requests topic for a POST request. The simplest way that I know to do that is with a Cloud Function. Providing this web user interface just requires writing a Python function that the platform will route web requests to. The code needs to see if the request is a GET, in which case it returns a web page with a form, or a POST, in which case it reads the values submitted with the form, and asks for a search by publishing a message to the search_requests PubSub topic. You can see how easy it is to implement this as a Cloud Function at https://github.com/engelke/where-in-pi-is/blob/master/function/main.py.
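
In outline, that function might look like the sketch below. The project ID, form field names, function name, and message contents are my assumptions; the repository’s main.py differs in details:

import json
from google.cloud import pubsub_v1

PROJECT = "your-project-id"   # assumption
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "search_requests")

FORM = """<form method="POST">
  Digits to find: <input name="digits">
  Email for the result: <input name="email">
  <button>Search</button>
</form>"""

def where_in_pi(request):
    """HTTP Cloud Function entry point."""
    if request.method == "GET":
        return FORM
    # POST: queue the search and respond right away
    message = {"digits": request.form["digits"], "email": request.form["email"]}
    publisher.publish(topic_path, json.dumps(message).encode())
    return "Your search has been queued; the result will be emailed to you."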

The overall application now looks like:

Architectural diagram

Fourth Piece: Pushing a Result

One thing the front end does not do is return the result of the search to the user. Since a search might take a while, or need to wait for earlier searches to finish before it can start, the front end function is just responsible for triggering a search. The user is going to have to get the answer somewhere else. Possible “somewhere else” choices include a web page (either requiring user login or giving the user a unique link for each request) or an email message (to an address provided by the user).

For the moment, we’re going to postpone figuring out how to deliver a result to a user. Instead, we will set up another PubSub topic: send_result. When the search box gets an answer, it will publish a message to that topic and trust that there’s a subscriber that will do the job of delivery. The message is going to have to include information on how to deliver that result, which will vary depending on how the request came in. For many requests, the natural way to send a result will be via email. For those situations, the front end will have to collect an email address and include it in the message to the search_requests topic, and the search box will then have to include it in the message to the send_result topic. Other forms of requests might need responses via other methods, needing a Twitter username or a phone number to send a text to. The messages can simply include the type of response and the appropriate address. Until we get to the next step, nothing needs to care much what those are.
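
For example (the field names here are purely illustrative; nothing in PubSub or the repository fixes them), the two messages for an email-based request might carry payloads like these:

# Published by the front end to search_requests
search_request = {"digits": "8675309", "reply_via": "email",
                  "address": "user@example.com"}

# Published by the search box to send_result
search_result = {"digits": "8675309", "position": 9202591,
                 "reply_via": "email", "address": "user@example.com"}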

Final Piece: Sending an Email Result

The final piece of the application is a component that subscribes to the send_result topic and does the delivery. If we have multiple delivery methods, we could either have a single delivery component that handles them all, or separate components each subscribing to the topic and skipping any messages that they can’t handle. We will keep it simple here and handle every message by sending an email.

How do you send email using Google Cloud Platform? Well, there’s no easy way. Unlike some other cloud providers, GCP has no email-sending API. If you set up a virtual machine with email software, its outgoing mail will probably be blocked to prevent possible spam. There are ways around this, and third-party services you can use, but I remembered that Google App Engine used to have an email-sending API. The latest version of App Engine no longer provides that API, but the previous App Engine version is still available, so that’s what we will use.

That’s right: we will set up an App Engine instance just to send emails based on PubSub messages sent to it.

Code for the App Engine for Python 2.7 application is at https://github.com/engelke/where-in-pi-is/tree/master/appengine. A PubSub push subscription sends it an HTTP request whenever a message is published to the send_result topic. Each incoming message includes the email address to send the result to and the result itself, along with a shared secret token configured in the push subscription so that App Engine can be sure the request isn’t forged.
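
The heart of that application is a push handler along these lines. This is a rough sketch for the Python 2.7 runtime; the URL path, token handling, and payload field names are my assumptions, and the repository code differs in details:

import base64
import json
import os

import webapp2
from google.appengine.api import mail

class SendResult(webapp2.RequestHandler):
    def post(self):
        # The push subscription URL carries ?token=..., our shared secret
        if self.request.get('token') != os.environ.get('PUSH_TOKEN'):
            self.abort(403)
        envelope = json.loads(self.request.body)
        payload = json.loads(base64.b64decode(envelope['message']['data']))
        mail.send_mail(
            sender='results@your-project-id.appspotmail.com',  # assumption
            to=payload['address'],
            subject='Your search of the digits of pi',
            body='The digits {} start at position {}.'.format(
                payload['digits'], payload['position']))
        self.response.set_status(204)   # acknowledge the pushed message

app = webapp2.WSGIApplication([('/push', SendResult)])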

Wrapping It All Up

Our complete app looks like this:

The user interacts with a web form provided by a Cloud Function, which publishes a message to the search_requests topic. The searching is done by a Compute Engine virtual machine that pulls search requests, finds the answer, and publishes the answer to the send_result topic. Finally, an App Engine instance gets those results via a push subscription to that topic and emails them to the address originally provided by the user.

The code for the various pieces is available at https://github.com/engelke/where-in-pi-is. For the time being, there’s a live version of this application available at https://us-central1-engelke-pi-blog.cloudfunctions.net/where-in-py. I don’t promise to keep it running, but while it is live you can try it out. It will only search the first couple billion digits of π because storing the full 31.4 trillion digits that Emma calculated would be pretty expensive for a demo like this. Still, that will find most 7 or 8 digit numbers you ask for.

Give this a try, or build your own serverless cloud application using some of these ideas!

Building and Publishing the Site

This post is fifth (and for now, last) in a series, beginning with Automating Web Site Updates.

We’ve been narrowing the “miracle” step in our solution. This post fills in the remaining gap.

We have a Cloud Function ready to kick off the missing step that will build and deploy the updated site. At a high level, this step will need to:

  1. Fetch a repository with static web content plus source directories that each need to be converted to web pages
  2. Convert each source directory to web pages in the desired format
  3. Build new static website structure
  4. Deploy the pages to a Firebase hosting project

We need a place to run our code that uses some high-level tools: git, the Firebase CLI, and whatever converts the source into web pages. And we need a file system to build the new static site in. Plus, this process may take longer than a cloud function is allowed to run (or at least longer than the GitHub webhook is willing to wait for a response). Those requirements are why we couldn’t just do these steps in the cloud function that responds to the GitHub webhook. We need something more general purpose than that.

Cloud Run looks like a possible solution. The managed version is a lot like Cloud Functions in that it takes your code, runs it in response to HTTP requests, and only charges for resources while the code is running. But instead of just providing source code in a supported language, you provide Cloud Run with a container. That container could run any supporting software you need, not just the supported language environment of a cloud function. Cloud Run will even build the container for you, from your specifications.

Any negatives to using Cloud Run? There are several for this use case, though they can possibly be worked around:

  • Cloud Run is still in beta, so it is subject to change before becoming final.
  • Containers require maintenance. If a security update is needed for any software, the container needs to be rebuilt with the new versions.
  • The container runs only so long as it is serving a web request, so if the requesting program is only willing to wait a short time (for example, 10 seconds for a GitHub webhook or 10 minutes for a Google Cloud Pub/Sub push subscription) we have to be able to build and deploy our site in that amount of time.
  • In any case, each run is limited to no more than 15 minutes at this time.
  • The file system size is limited by the memory allocated to the service (no more than 2GB).
  • Invocations of the service can be concurrent, so if you are building a site in the file system you have to be sure concurrent invocations don’t step on each other, and don’t use up all the memory.

Despite these negatives, I find Cloud Run an intriguing approach to this kind of problem. I’m not going to use it here, but I’ll keep thinking about how it can solve problems like this one.

So what is the solution for the current problem? I’m going to go old school, and use a virtual machine for this. In Google terms, I’m going to use a Compute Engine instance. At first look this may seem to go against my goal to use “services that require little or no customization or coding on our part”. And I also said I don’t want to maintain any servers. But the way I’m going to use Compute Engine will not require any coding other than that specifically aimed at our business logic, and won’t need any server maintenance, either.

We will launch a virtual machine with a startup script that will:

  • Install the standard tools we need (git, Firebase CLI, etc.)
  • Fetch the source from a GitHub repository
  • Build the web pages using existing tools specific to our needs
  • Deploy the site to Firebase hosting
  • Destroy itself when done

The first and last steps are key: this virtual machine installs what it needs when run and then deletes itself when it has finished the task. That way we aren’t paying for an idle machine standing by waiting for work to do; we’re only paying for what we really use. Further, by creating a new machine for each task and then throwing it away, we don’t need to worry about updates – we always launch and then install the latest versions of the tools we’re using.

For more background on this technique of creating, using, then deleting Compute Engine instances, see this tutorial by Laurie White and me.

The virtual machine’s actions all need to be scripted in advance, so they can run without human intervention. Once the script is in place, we can enhance the cloud function from the last blog post to create a new Compute Engine instance that will run that script. The script needs to end by deleting the instance it runs on.

We aren’t going to build the whole solution here, just give the outline. Here’s what the script will look like:

#!/bin/sh
apt update; apt install -y git
#
# Install other tools, get code from GitHub, run business
# logic to build web site pages, deploy to Firebase hosting
# -- Not included in this post
#
# Instance deletes itself below (see tutorial for details)
export METADATA=metadata.google.internal/computeMetadata/v1/instance
export NAME=$(curl -X GET http://$METADATA/name -H 'Metadata-Flavor: Google')
export ZONE=$(curl -X GET http://$METADATA/zone -H 'Metadata-Flavor: Google')
gcloud --quiet compute instances delete $NAME --zone=$ZONE

If we launch a machine with the startup script above (filled in with all the business logic specific details) it will pull our source content from GitHub, build website pages from that, then deploy it to our Firebase hosting site. Which leaves us with one more question: how do we launch such a machine from our Cloud Function (that unfinished update_the_site() function from the last post)? We use the google-api-python-client library. It’s pretty low-level, but there’s good sample code available you can adapt to do this.
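
Roughly, the relevant part of update_the_site() could look like the sketch below. The machine type, names, and boot image are assumptions of mine, and STARTUP_SCRIPT is the shell script shown above; a real version would also need error handling:

import googleapiclient.discovery

STARTUP_SCRIPT = open('startup-script.sh').read()   # the script shown above

def update_the_site():
    compute = googleapiclient.discovery.build('compute', 'v1')
    project, zone = 'your-project-id', 'us-central1-a'   # assumptions
    config = {
        'name': 'site-builder',
        'machineType': 'zones/{}/machineTypes/n1-standard-1'.format(zone),
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {
                'sourceImage': 'projects/debian-cloud/global/images/family/debian-10',
            },
        }],
        'networkInterfaces': [{
            'network': 'global/networks/default',
            'accessConfigs': [{'type': 'ONE_TO_ONE_NAT'}],
        }],
        # The service account needs permission to delete the instance at the end
        'serviceAccounts': [{
            'email': 'default',
            'scopes': ['https://www.googleapis.com/auth/cloud-platform'],
        }],
        'metadata': {'items': [{'key': 'startup-script', 'value': STARTUP_SCRIPT}]},
    }
    compute.instances().insert(project=project, zone=zone, body=config).execute()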

So that’s the pipeline now:

I’m going to put this topic to rest for a while, but there are tips and tricks regarding secrets and permissions I’ll probably talk about soon.

Responding to GitHub Updates

This post is fourth in a series, beginning with Automating Web Site Updates.

Most of the picture of the process we need is filled in now. We have to deal with what happens between a GitHub PR being merged and an updated website being deployed on Firebase Hosting. This post is going to just deal with responding to a GitHub PR merge.

We need to know when a PR is merged so we can kick off the rest of the update process. Lucky for us, GitHub has a feature that will tell us that: webhooks. At its core it’s a really simple idea: when an event you care about happens, GitHub will make a web request to a URL of your choosing with information about the event in its body. You just need to provide a web request handler to receive it. So before we set up the webhook, let’s figure out what we will use to receive those requests. We need:

  • to run our own custom code
  • when triggered by an HTTP (actually HTTPS) request
  • containing information about a merged PR
  • without costing much when nothing is happening (which in this case, is probably 99% or more of the time)

That sounds tailor-made for a Cloud Function. We can write code in Go, Node, or Python and say we want it triggered by an HTTPS request. Cloud Functions gives us a URL and runs our code whenever a request is sent to that URL. We don’t pay for anything except time and memory while our code is running, not while it is idle waiting for a notification. The only problem is that functions are limited in what they can do. They can’t run long jobs, they have only a small file system available, and they only provide a few language options. We can’t install other software in them, either. But none of that is a problem because we are not trying to handle the website update in the cloud function, we just need to kick that off when it’s appropriate (a subject for the next blog post).

So we will use the Google Cloud Platform console to create a new HTTP-triggered Python Cloud Function. For now, we’ll leave the default sample code in it; we just want to know the URL for the next step: setting up a GitHub webhook.

Authorized GitHub repository users can set up a GitHub webhook in the repository’s Settings section. There’s a section just for Webhooks, and a button to add a new webhook. After you click that, there are some choices to be made:

  • The Payload URL is the address that GitHub will send the request to. That’s our cloud function’s URL from the step above.
  • The Content type specifies the format of the body of the request GitHub will send. The default is application/x-www-form-urlencoded, which is what a web page might send when a user submits a form. Since we want to get a possibly complicated data structure from GitHub, the second option, application/json, is a better choice for us.
  • The Secret is a string (shared between GitHub and your receiving application) that GitHub will use to create a signature for each web request. This is a non-standard way to check that a request really comes from the GitHub webhook you created, and not somewhere else. I created a long random password with a password manager for this.
  • We finally reach the question “Which events would you like to trigger this webhook?” We can choose “just the push event”, but we don’t much care about pushes, we want to know about merges. The second option, “send me everything,” would certainly include the merge events, but we don’t want to be bothered about the vast majority of events we’d be told about then. So we can say “Let me select individual events” and just hear about Pull Requests. That choice still includes a lot of events we don’t care about (creating PRs, closing them unmerged, labeling them, and so on) but it seems to be the narrowest choice that includes PR merges.
  • And we’re going to want to make this webhook Active.

When we click the Add Webhook button, GitHub will send a test request to the URL to see if there’s something at that address accepting incoming data. That should pass, since we already created a cloud function there, but if not, that’s okay for now. We need it to work once the Cloud Function is finished.

Now that we have a webhook, every action on a PR on this repository will cause GitHub to send a JSON object to our cloud function. We need to verify that this request describes a PR merge, since we will get notification of all sorts of other PR events, too. And we need to make sure that this information really comes from our GitHub webhook, and not somebody trying to fool us into thinking a merge happened. Here’s an outline of what we need to do:

  • Load the JSON data in the request body into a Python object
    notification = request.get_json()
  • Check to see that this is a PR that is closed by a merge; if not, just exit, nothing to do here
    if notification.get('action') != 'closed':
        return 'OK', 200
    elif notification.get('pull_request') is None:
        return 'OK', 200
    elif not notification['pull_request'].get('merged'):
        return 'OK', 200
  • Check that the signature is valid; if not, just return a Forbidden response and exit, we aren’t going to deal with fake requests
    import hashlib, hmac, os
    secret = os.environ.get('SECRET', 'Missing!').encode()
    signature = request.headers.get('X-Hub-Signature', '')
    body = request.get_data()
    calc_sig = 'sha1={}'.format(hmac.new(secret, body, hashlib.sha1).hexdigest())
    # compare_digest avoids leaking information through timing differences
    if not hmac.compare_digest(signature, calc_sig):
        return 'Forbidden', 403
  • Kick off the next step that will actually publish an updated website based on the contents of the repository
    update_the_site()

Notice that the secret for the signature, which was provided to GitHub when creating the webhook, is fetched from an environment variable. That returns a string, and it needs to be converted to bytes in order to send to the hash function. You can set up the necessary environment variable when creating or redeploying a cloud function. This is a better option than keeping the secret in the source code itself, which might be available to others in a source repository at some point.
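
Putting those pieces together, the whole Cloud Function is only about twenty lines. Here is a sketch (it checks the signature first, which is a bit safer than the order in the outline above, and update_the_site is still a stub; the next post fills it in):

import hashlib
import hmac
import os

def github_webhook(request):
    """HTTP Cloud Function called by the GitHub webhook."""
    # Reject any request that doesn't carry a valid signature
    secret = os.environ.get('SECRET', 'Missing!').encode()
    signature = request.headers.get('X-Hub-Signature', '')
    expected = 'sha1={}'.format(
        hmac.new(secret, request.get_data(), hashlib.sha1).hexdigest())
    if not hmac.compare_digest(signature, expected):
        return 'Forbidden', 403

    # Only act on a pull request that was closed by being merged
    notification = request.get_json(silent=True) or {}
    pull_request = notification.get('pull_request') or {}
    if notification.get('action') == 'closed' and pull_request.get('merged'):
        update_the_site()
    return 'OK', 200

def update_the_site():
    pass   # the subject of the next post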

Which leaves us with one big piece left to build, update_the_site(). That will be covered in the next post. Spoiler alert: the cloud function won’t be doing the update, it will just kick off some other tool to handle that.

So, our update process picture is nearly complete:

Jumping to the Other End

This post is third in a series, beginning with Automating Web Site Updates.

Readers are going to use web browsers to look at content, so we need some kind of web server to deliver the final formatted web pages. It’s static content, so there are lots of choices:

  1. A virtual machine running web server software like Apache or NGINX
  2. A web hosting service
  3. Cloud storage set for public access
  4. Serverless platforms

Option 1 is right out. I do not want to configure, manage, patch, and monitor a server. The second option might be okay, but most of them aren’t amenable to full automation. The third choice could be okay, but the cloud storage I’d prefer to use (Google Cloud Storage) doesn’t offer the ability to use HTTPS on a custom domain. That leaves a serverless solution.

Which serverless solution? I’m going to stick with Google Cloud Platform, but other cloud providers offer many similar services. GCP’s serverless offerings include Cloud Functions, Cloud Run, App Engine, and Firebase. I’d have to write code to respond to web requests for Cloud Functions or Cloud Run, so they’re out. Both App Engine and Firebase can serve static web pages without my writing any code, so they’re still looking good.

We want static web page hosting, via HTTPS, on a custom domain, and we want it cheap. We’d rather have it free (hey, is there an option to have them pay us?). Well, both App Engine and Firebase Hosting have free tiers available. So how to choose? I’ve used them both and they’d both work for this. We have to pick one, and I found Firebase Hosting to be easy, scalable, and affordable.

The solution will use Firebase Hosting for the last step.

We will need to build a static copy of the desired website, and use Firebase tools to deploy it to the service. Other than that, we don’t need to do anything to have the pages served in a scalable and reliable manner.

The picture is beginning to be filled in:

Contributors → GitHub → ? → Firebase Hosting → Readers

The unknown center is shrinking. Next time we will jump back to the other side of that unknown and expand on what we need GitHub to do.

Working from the Outside In

This post is second in a series, beginning with Automating Web Site Updates.

Let’s search for a solution by starting at the edges, then filling in the middle. That is, “where does this process begin, and where does it end up?”

Well, it begins with contributors uploading their content in markdown format, and ends up with web pages delivered to readers. Here’s a vague picture:

Contributors → ? → Readers

That picture looks kind of familiar:

"Then a miracle occurs" Cartoon by S. Harris. Copyright ScienceCartoonsPlus.com, used with permission.

We need to fill in the middle of this picture, ideally with services that require little or no customization or coding on our part.

Let’s start with the contributors. They produce markdown files with their content, and need to send them somewhere where they can be reviewed, commented on, and possibly changed (by them or others) before they can be used. That sounds like a Git repository. We could set up our own git repo on a server somewhere, but remember: we want to use existing services whenever we can. And GitHub is just such a service (there are others, like GitLab, that would work fine, too).

We will use GitHub for this first step. The contributors are all at least somewhat technical so they should have little trouble forking the main repository, adding content, and creating a pull request (PR) from their fork. If they need help there’s a lot of documentation at the site, in books, and on StackOverflow. And the editors can use regular GitHub tools and processes to manage the contributions, eventually resulting in a merged PR.

So, we’ve started to fill in the picture:

Contributors → GitHub → ? → Readers

Next time we will continue to work from the outside in by jumping to the other side and deciding on how to deliver web content to the readers. Then we’ll jump back to see where we go from GitHub.

Automating Web Site Updates: a Case Study

I was presented with a problem recently: automate the process of updating a web site when new contributions come in. The contributions are articles, in markdown format, and they need to be translated to web pages, inserted into the site’s content, and parts of the site (such as index pages) need to be updated accordingly. The contributors don’t have full publishing authority on their own – each submitted article is reviewed and perhaps sent for editing before it is accepted. The translation from markdown to web format can be done by an automated tool, but is usually run by hand by an editor.

The goal: automate this process as fully as possible. My goal: build a solution with cloud-native, preferably serverless, technology.

I’m going to take the next several blog posts to go over my approach to the problem and the eventual solution. I hope to get a new post up every few days. Stay tuned.