Category: Google Cloud

  • Measuring container size and startup latency for serverless apps written in C#, Node.js, Go, and Java

    Do you like using function-as-a-service (FaaS) platforms to quickly build scalable systems? Me too. There are constraints around what you can do with FaaS, though, which is why I also like this new crop of container-based serverless compute services. These products—the terrific Google Cloud Run is the most complete example and has a generous free tier—let you deploy more full-fledged “apps” versus the glue code that works best in FaaS. That could be a little Go app, a full-blown Spring Boot REST API, or a Redis database. Sounds good, but what if you don’t want to mess with containers as you build and deploy software? Or what if you’re concerned about the “cold start” penalty of a denser workload?

    Google Cloud has embraced Cloud Buildpacks as a way to generate a container image from source code. Using our continuous integration service or any number of compute services directly, you never have to write a Dockerfile again, unless you want to. Hopefully, at least. Regarding the cold start topic, we just shipped a new cloud metric, “container startup latency,” which measures the time it takes for a serverless instance to fire up. That seems like a helpful tool for figuring out what needs to be optimized. Based on these two things, I got curious and decided to build the same REST API in four different programming languages to see how big the generated container image was, and how fast the containers started up in Cloud Run.

    Since Cloud Run accepts most any container, you have almost limitless choices in programming language. For this example, I chose to use C#, Go, Java (Spring Boot), and JavaScript (Node.js). I built an identical REST API with each. It’s entirely possible, frankly likely, that you could tune these apps much more than I did. But this should give us a decent sense of how each language performs.

    Let’s go language-by-language and review the app, generate the container image, deploy to Cloud Run, and measure the container startup latency.

    Go

    I’m almost exclusively coding in Go right now as I try to become more competent with it. Go has an elegant simplicity to it that I really enjoy. And it’s an ideal language for serverless environments given its small footprint, blazing speed, and easy concurrency.

    For the REST API, which basically just returns a pair of “employee” records, I used the Echo web framework and Go 1.18.

    My data model (struct) has four properties.

    package model
    
    type Employee struct {
    	Id       string `json:"id"`
    	FullName string `json:"fullname"`
    	Location string `json:"location"`
    	JobTitle string `json:"jobtitle"`
    }
    

    My web handler offers a single operation that returns two employee items.

    package web
    
    import (
    	"net/http"
    
    	"github.com/labstack/echo/v4"
    	"seroter.com/restapi/model"
    )
    
    func GetAllEmployees(c echo.Context) error {
    
    	emps := [2]model.Employee{{Id: "100", FullName: "Jack Donaghy", Location: "NYC", JobTitle: "Executive"}, {Id: "101", FullName: "Liz Lemon", Location: "NYC", JobTitle: "Writer"}}
    	return c.JSON(http.StatusOK, emps)
    }
    

    And finally, the main package spins up the web server.

    package main
    
    import (
    	"fmt"
    
    	"github.com/labstack/echo/v4"
    	"github.com/labstack/echo/v4/middleware"
    	"seroter.com/restapi/web"
    )
    
    func main() {
    	fmt.Println("server started ...")
    
    	e := echo.New()
    	e.Use(middleware.Logger())
    
    	e.GET("/employees", web.GetAllEmployees)
    
    	e.Logger.Fatal(e.Start(":8080"))
    }
    

    Next, I used Google Cloud Build along with Cloud Buildpacks to generate a container image from this Go app. The buildpack executes a build, brings in a known good base image, and creates an image that we add to Google Cloud Artifact Registry. It’s embarrassingly easy to do this. Here’s the single command with our gcloud CLI:

    gcloud builds submit --pack image=gcr.io/seroter-project-base/go-restapi 
    

    The result? A 51.7 MB image in my Docker repository in Artifact Registry.

    The last step was to deploy to Cloud Run. We could use the CLI of course, but let’s use the Console experience because it’s delightful.

    After pointing at my generated container image, I could just click “create” and accept all the default instance properties. As you can see below, I’ve got easy control over instance count (minimum of zero, but you can keep a warm instance running if you want).

    Let’s tweak a couple of things. First off, I don’t need the default amount of RAM. I can easily operate with just 256MiB, or even less. Also, you see here that we default to 80 concurrent requests per container. That’s pretty cool, as most FaaS platforms only handle a single concurrent request. I’ll stick with 80.
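
    If you’d rather script those settings than click through the Console, a roughly equivalent gcloud command might look like this (the region is my assumption based on the service URL later in this post):

    gcloud run deploy go-restapi \
      --image=gcr.io/seroter-project-base/go-restapi \
      --region=us-central1 \
      --memory=256Mi \
      --concurrency=80 \
      --allow-unauthenticated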

    It seriously took four seconds from the time I clicked “create” until the instance was up and running and able to take traffic. Bonkers. I didn’t send any initial requests in, as I want to hit it cold with a burst of data. I’m using the excellent hey tool to generate a bunch of load on my service. This single command sends 200 total requests, with 10 concurrent workers.

    hey -n 200 -c 10 https://go-restapi-ofanvtevaa-uc.a.run.app/employees
    

    Here’s the result. All the requests were done in 2.6 seconds, and you can see that the first ones (as the container warmed up) took around 1.2 seconds, while the vast majority finished in under 0.177 seconds. That’s fast.

    Summary:
      Total:        2.6123 secs
      Slowest:      1.2203 secs
      Fastest:      0.0609 secs
      Average:      0.1078 secs
      Requests/sec: 76.5608
      
      Total data:   30800 bytes
      Size/request: 154 bytes
    
    Response time histogram:
      0.061 [1]     |
      0.177 [189]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.293 [0]     |
      0.409 [0]     |
      0.525 [1]     |
      0.641 [6]     |■
      0.757 [0]     |
      0.873 [0]     |
      0.988 [0]     |
      1.104 [0]     |
      1.220 [3]     |■
    
    
    Latency distribution:
      10% in 0.0664 secs
      25% in 0.0692 secs
      50% in 0.0721 secs
      75% in 0.0777 secs
      90% in 0.0865 secs
      95% in 0.5074 secs
      99% in 1.2057 secs

    How about the service metrics? I saw that Cloud Run spun up 10 containers to handle the incoming load, and my containers topped out at 5% memory utilization. It also barely touched the CPU.

    How about that new startup latency metric? I jumped into Cloud Monitoring directly to see that. There are lots of ways to aggregate this data (mean, standard deviation, percentile) and I chose the 95th percentile. My container startup time is pretty darn fast (at 95th percentile, it’s 106.87 ms), and then stays up to handle the load, so I don’t incur a startup cost for the chain of requests.

    Finally, with some warm instances running, I ran the load test again. You can see how speedy things are, with virtually no “slow” responses. Go is an excellent choice for your FaaS or container-based workloads if speed matters.

    Summary:
      Total:        2.1548 secs
      Slowest:      0.5008 secs
      Fastest:      0.0631 secs
      Average:      0.0900 secs
      Requests/sec: 92.8148
      
      Total data:   30800 bytes
      Size/request: 154 bytes
    
    Response time histogram:
      0.063 [1]     |
      0.107 [185]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.151 [2]     |
      0.194 [10]    |■■
      0.238 [0]     |
      0.282 [0]     |
      0.326 [0]     |
      0.369 [0]     |
      0.413 [0]     |
      0.457 [1]     |
      0.501 [1]     |
    
    
    Latency distribution:
      10% in 0.0717 secs
      25% in 0.0758 secs
      50% in 0.0814 secs
      75% in 0.0889 secs
      90% in 0.1024 secs
      95% in 0.1593 secs
      99% in 0.4374 secs

    C# (.NET)

    Ah, .NET. I started using it with the early preview release in 2000, and considered myself a (poor) .NET dev for most of my career. Now, I dabble. .NET 6 looks good, so I built my REST API with that.

    Update: I got some good feedback from folks that I could have tried this .NET app using the new minimal API structure. I wasn’t sure it’d make a difference, but tried it anyway. It resulted in the same container size, roughly the same response time (4.2088 seconds for all 200 requests), and a similar startup latency (2.23s at the 95th percentile). Close, but actually a tad slower! On the second pass of 200 requests, the total response time (1.6915 seconds) was nearly identical to the way I originally wrote it.

    My Employee object definition is straightforward.

    namespace dotnet_restapi;
    
    public class Employee {
    
        public Employee(string id, string fullname, string location, string jobtitle) {
            this.Id = id;
            this.FullName = fullname;
            this.Location = location;
            this.JobTitle = jobtitle;
        }
    
        public string Id {get; set;}
        public string FullName {get; set;}
        public string Location {get; set;}
        public string JobTitle {get; set;}
    }
    

    The Controller has a single operation and returns a List of employee objects.

    using Microsoft.AspNetCore.Mvc;
    
    namespace dotnet_restapi.Controllers;
    
    [ApiController]
    [Route("[controller]")]
    public class EmployeesController : ControllerBase
    {
    
        private readonly ILogger<EmployeesController> _logger;
    
        public EmployeesController(ILogger<EmployeesController> logger)
        {
            _logger = logger;
        }
    
        [HttpGet(Name = "GetEmployees")]
        public IEnumerable<Employee> Get()
        {
            List<Employee> emps = new List<Employee>();
            emps.Add(new Employee("100", "Bob Belcher", "SAN", "Head Chef"));
            emps.Add(new Employee("101", "Philip Frond", "SAN", "Counselor"));
    
            return emps;
        }
    }
    

    The program itself simply looks for an environment variable related to the HTTP port, and starts up the server. Much like above, to build this app and produce a container image, it only takes this one command:

    gcloud builds submit --pack image=gcr.io/seroter-project-base/dotnet-restapi 
    

    The result is a fairly svelte 90.6 MB image in the Artifact Registry.

    When deploying this instance to Cloud Run, I kept the same values as with the Go service, as my .NET app doesn’t need more than 256MiB of memory.

    In just a few seconds, I had the app up and running.

    Let’s load test this bad boy and see what happens. I sent in the same type of request as before, with 200 total requests, 10 concurrent.

    hey -n 200 -c 10 https://dotnet-restapi-ofanvtevaa-uc.a.run.app/employees
    

    The results were solid. You can see a total execution time of about 3.6 seconds, with a few instances taking 2 seconds, and the rest coming back super fast.

    Summary:
      Total:        3.6139 secs
      Slowest:      2.1923 secs
      Fastest:      0.0649 secs
      Average:      0.1757 secs
      Requests/sec: 55.3421
      
    
    Response time histogram:
      0.065 [1]     |
      0.278 [189]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.490 [0]     |
      0.703 [0]     |
      0.916 [0]     |
      1.129 [0]     |
      1.341 [0]     |
      1.554 [0]     |
      1.767 [0]     |
      1.980 [0]     |
      2.192 [10]    |■■
    
    
    Latency distribution:
      10% in 0.0695 secs
      25% in 0.0718 secs
      50% in 0.0747 secs
      75% in 0.0800 secs
      90% in 0.0846 secs
      95% in 2.0365 secs
      99% in 2.1286 secs

    I checked the Cloud Run metrics and saw that request latency was high on a few requests, but the majority were fast. Memory hovered around 30% utilization, with very little CPU consumption.

    For container startup latency, the number was 1.492s at the 95th percentile. Still not bad.

    Oh, and sending in another 200 requests with my .NET containers warmed up resulted in some smokin’ fast responses.

    Summary:
      Total:        1.6851 secs
      Slowest:      0.1661 secs
      Fastest:      0.0644 secs
      Average:      0.0817 secs
      Requests/sec: 118.6905
      
    
    Response time histogram:
      0.064 [1]     |
      0.075 [64]    |■■■■■■■■■■■■■■■■■■■■■■■■■
      0.085 [104]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.095 [18]    |■■■■■■■
      0.105 [2]     |■
      0.115 [1]     |
      0.125 [0]     |
      0.136 [0]     |
      0.146 [0]     |
      0.156 [0]     |
      0.166 [10]    |■■■■
    
    
    Latency distribution:
      10% in 0.0711 secs
      25% in 0.0735 secs
      50% in 0.0768 secs
      75% in 0.0811 secs
      90% in 0.0878 secs
      95% in 0.1600 secs
      99% in 0.1660 secs

    Java (Spring Boot)

    Now let’s try it with a Spring Boot application. I learned Spring when I joined Pivotal, and taught a couple Pluralsight courses on the topic. Spring Boot is a powerful framework, and you can build some terrific apps with it. For my REST API, I began at start.spring.io to generate my reactive web app.

    The “employee” definition should look familiar at this point.

    package com.seroter.springrestapi;
    
    public class Employee {
    
        private String Id;
        private String FullName;
        private String Location;
        private String JobTitle;
        
        public Employee(String id, String fullName, String location, String jobTitle) {
            Id = id;
            FullName = fullName;
            Location = location;
            JobTitle = jobTitle;
        }
        public String getId() {
            return Id;
        }
        public String getJobTitle() {
            return JobTitle;
        }
        public void setJobTitle(String jobTitle) {
            this.JobTitle = jobTitle;
        }
        public String getLocation() {
            return Location;
        }
        public void setLocation(String location) {
            this.Location = location;
        }
        public String getFullName() {
            return FullName;
        }
        public void setFullName(String fullName) {
            this.FullName = fullName;
        }
        public void setId(String id) {
            this.Id = id;
        }
    }
    

    Then, my Controller + main class exposes a single REST endpoint and returns a Flux of employees.

    package com.seroter.springrestapi;
    
    import java.util.ArrayList;
    import java.util.List;
    
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;
    
    import reactor.core.publisher.Flux;
    
    @RestController
    @SpringBootApplication
    public class SpringRestapiApplication {
    
    	public static void main(String[] args) {
    		SpringApplication.run(SpringRestapiApplication.class, args);
    	}
    
    	List<Employee> employees;
    
    	public SpringRestapiApplication() {
    		employees = new ArrayList<Employee>();
    		employees.add(new Employee("300", "Walt Longmire", "WYG", "Sheriff"));
    		employees.add(new Employee("301", "Vic Moretti", "WYG", "Deputy"));
    
    	}
    
    	@GetMapping("/employees")
    	public Flux<Employee> getAllEmployees() {
    		return Flux.fromIterable(employees);
    	}
    }
    

    I could have done some more advanced configuration to create a slimmer JAR file, but I wanted to try this with the default experience. Once again, I used a single Cloud Build command to generate a container from this app. I do appreciate how convenient this is!

    gcloud builds submit --pack image=gcr.io/seroter-project-base/spring-restapi 
    

    Not surprisingly, a Java container image is a bit hefty. This one clocks in at 249.7 MB. The container image size doesn’t matter a TON to Cloud Run, as we stream images from Artifact Registry, which means only the files your app actually loads need to be pulled. But size still matters a bit here.

    When deploying this image to Cloud Run, I did keep the default 512 MiB of memory in place as a Java app can tend to consume more resources. The service still deployed in less than 10 seconds, which is awesome. Let’s flood it with traffic.

    hey -n 200 -c 10 https://spring-restapi-ofanvtevaa-uc.a.run.app/employees
    

    200 requests to my Spring Boot endpoint did OK. Clearly there’s a big startup cost on the first one(s), and as a developer, that’s where I’d dedicate extra optimization time.

    Summary:
      Total:        13.8860 secs
      Slowest:      12.3335 secs
      Fastest:      0.0640 secs
      Average:      0.6776 secs
      Requests/sec: 14.4030
      
    
    Response time histogram:
      0.064 [1]     |
      1.291 [189]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      2.518 [0]     |
      3.745 [0]     |
      4.972 [0]     |
      6.199 [0]     |
      7.426 [0]     |
      8.653 [0]     |
      9.880 [0]     |
      11.107 [0]    |
      12.333 [10]   |■■
    
    
    Latency distribution:
      10% in 0.0723 secs
      25% in 0.0748 secs
      50% in 0.0785 secs
      75% in 0.0816 secs
      90% in 0.0914 secs
      95% in 11.4977 secs
      99% in 12.3182 secs

    The initial Cloud Run metrics show fast request latency (routing to the service), 10 containers to handle the load, and a somewhat-high CPU and memory load.

    Back in Cloud Monitoring, I saw that the 95th percentile for container startup latency was 11.48s.

    If you’re doing Spring Boot with serverless runtimes, you’re going to want to pay special attention to the app startup latency, as that’s where you’ll get the most bang for the buck. And consider setting a minimum of at least one always-running instance.
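
    Keeping a warm instance around is a one-line change. A sketch of the command, assuming the service name and region from my deployment, would be:

    gcloud run services update spring-restapi \
      --region=us-central1 \
      --min-instances=1

    See that when I sent in another 200 requests with warm containers running, things look good.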

    Summary:
      Total:        1.8128 secs
      Slowest:      0.2451 secs
      Fastest:      0.0691 secs
      Average:      0.0890 secs
      Requests/sec: 110.3246
      
    
    Response time histogram:
      0.069 [1]     |
      0.087 [159]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.104 [27]    |■■■■■■■
      0.122 [3]     |■
      0.140 [0]     |
      0.157 [0]     |
      0.175 [0]     |
      0.192 [0]     |
      0.210 [0]     |
      0.227 [0]     |
      0.245 [10]    |■■■
    
    
    Latency distribution:
      10% in 0.0745 secs
      25% in 0.0767 secs
      50% in 0.0802 secs
      75% in 0.0852 secs
      90% in 0.0894 secs
      95% in 0.2365 secs
      99% in 0.2450 secs

    JavaScript (Node.js)

    Finally, let’s look at JavaScript. This is what I first learned to really program in back in 1998-ish and then in my first job out of college. It continues to be everywhere, and widely supported in public clouds. For this Node.js REST API, I chose to use the Express framework. I built a simple router that returns a couple of “employee” records as JSON.

    var express = require('express');
    var router = express.Router();
    
    /* GET employees */
    router.get('/', function(req, res, next) {
      res.json(
        [{
            id: "400",
            fullname: "Beverly Goldberg",
            location: "JKN",
            jobtitle: "Mom"
        },
        {
            id: "401",
            fullname: "Dave Kim",
            location: "JKN",
            jobtitle: "Student"
        }]
      );
    });
    
    module.exports = router;
    

    My app.js file pulls in the routers and maps the employees router to the /employees endpoint.

    var express = require('express');
    var path = require('path');
    var cookieParser = require('cookie-parser');
    var logger = require('morgan');
    
    var indexRouter = require('./routes/index');
    var employeesRouter = require('./routes/employees');
    
    var app = express();
    
    app.use(logger('dev'));
    app.use(express.json());
    app.use(express.urlencoded({ extended: false }));
    app.use(cookieParser());
    app.use(express.static(path.join(__dirname, 'public')));
    
    app.use('/', indexRouter);
    app.use('/employees', employeesRouter);
    
    module.exports = app;
    

    At this point, you know what it looks like to build a container image. But, don’t take it for granted. Enjoy how easy it is to do this even if you know nothing about Docker.

    gcloud builds submit --pack image=gcr.io/seroter-project-base/node-restapi 
    

    Our resulting image is a trim 82 MB in size. Nice!

    For my Node.js app, I chose the default options for Cloud Run, but shrunk the memory demands to only 256 MiB. Should be plenty. The service deployed in a few seconds. Let’s flood it with requests!

    hey -n 200 -c 10 https://node-restapi-ofanvtevaa-uc.a.run.app/employees
    

    How did our cold Node.js app do? Well! All requests were processed in about 6 seconds, and the vast majority returned a response in around 0.3 seconds.

    Summary:
      Total:        6.0293 secs
      Slowest:      2.8199 secs
      Fastest:      0.0650 secs
      Average:      0.2309 secs
      Requests/sec: 33.1711
      
      Total data:   30200 bytes
      Size/request: 151 bytes
    
    Response time histogram:
      0.065 [1]     |
      0.340 [186]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.616 [0]     |
      0.891 [0]     |
      1.167 [0]     |
      1.442 [1]     |
      1.718 [1]     |
      1.993 [1]     |
      2.269 [0]     |
      2.544 [4]     |■
      2.820 [6]     |■
    
    
    Latency distribution:
      10% in 0.0737 secs
      25% in 0.0765 secs
      50% in 0.0805 secs
      75% in 0.0855 secs
      90% in 0.0974 secs
      95% in 2.4700 secs
      99% in 2.8070 secs

    A peek at the default Cloud Run metrics shows that we ended up with 10 containers handling traffic, some CPU and memory spikes, and low request latency.

    The container startup latency metric shows a quick initial startup time of 2.02s.

    A final load against our Node.js app shows some screaming performance against the warm containers.

    Summary:
      Total:        1.8458 secs
      Slowest:      0.1794 secs
      Fastest:      0.0669 secs
      Average:      0.0901 secs
      Requests/sec: 108.3553
      
      Total data:   30200 bytes
      Size/request: 151 bytes
    
    Response time histogram:
      0.067 [1]     |
      0.078 [29]    |■■■■■■■■■■
      0.089 [114]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.101 [34]    |■■■■■■■■■■■■
      0.112 [6]     |■■
      0.123 [6]     |■■
      0.134 [0]     |
      0.146 [0]     |
      0.157 [0]     |
      0.168 [7]     |■■
      0.179 [3]     |■
    
    
    Latency distribution:
      10% in 0.0761 secs
      25% in 0.0807 secs
      50% in 0.0860 secs
      75% in 0.0906 secs
      90% in 0.1024 secs
      95% in 0.1608 secs
      99% in 0.1765 secs

    Wrap up

    I’m not a performance engineer by any stretch, but doing this sort of testing with out-of-the-box settings seemed educational. My final container startup latency numbers at the 95th percentile were:

    Go: 106.87 ms
    C# (.NET): 1.49 s
    JavaScript (Node.js): 2.02 s
    Java (Spring Boot): 11.48 s

    There are many ways to change these numbers. If you have a more complex app with more dependencies, it’ll likely be a bigger container image and possibly a slower startup. If you tune the app to do lazy loading or ruthlessly strip out unnecessary activation steps, your startup latency goes down. It still feels safe to say that if performance is a top concern, look at Go. C# and JavaScript apps are going to be terrific here as well. Be more cautious with Java if you’re truly scaling to zero, as you may not love the startup times.

    The point of this exercise was to explore how apps written in each language get packaged and started up in a serverless compute environment. Something I missed or got wrong? Let me know in the comments!

  • How easily can you process events in AWS Lambda, Azure Functions, and Google Cloud Functions? Let’s try it out.

    A simple use case came to mind yesterday. How would I quickly find out if someone put a too-big file into a repository? In ancient times (let’s say, 2008), here’s what I would have done to solve that. First I’d have to find a file share or FTP location to work with. Then I’d write some custom code with a file system listener that reacted to new documents hitting that file location. After that, I’d look at the size and somehow trigger an alert if the file exceeded some pre-defined threshold. Of course, I’d have to find a server to host this little app on, and figure out how to deploy it. So, solving this might take a month or more. Today? Serverless, baby! I can address this use case in minutes.

    I’m learning to program in Go, so ideally, I want a lightweight serverless function written in Go that reacts whenever a new file hits an object store. Is that easy to do in each major public cloud entirely with the console UIs? I just went on a journey to find out, without preparing ahead of time, and am sharing my findings in real time.

    Disclaimer: I work at Google Cloud but I am a fairly regular user of other clouds, and was a 12-time Microsoft MVP, mostly focused on Azure. Any mistakes below can be attributed to my well-documented ignorance, not to any attempt to create FUD!

    Google Cloud

    First up, the folks paying my salary. How easily could I add a Cloud Function that responds to things getting uploaded to Cloud Storage?

    First, I created a new bucket. This takes a few seconds to do.

    Hey, what’s this? From the bucket browser, I can actually choose to “process with Cloud Functions.” Let’s see what this does.

    Whoa. I get an inline “create function” experience with my bucket-name pre-populated, and the ability to actually author the function code RIGHT HERE.

    The Go code template was already populated with a “storage” object as input, and I extended it to include the “size” attribute. Then I added a quick type conversion and a check to see if the detected file was over 1MB.

    // Package p contains a Google Cloud Storage Cloud Function.
    package p
    
    import (
    	"context"
    	"log"
    	"strconv"
    )
    
    // GCSEvent is the payload of a GCS event. Please refer to the docs for
    // additional information regarding GCS events.
    type GCSEvent struct {
    	Bucket string `json:"bucket"`
    	Name   string `json:"name"`
    	Size   string `json:"size"`
    }
    
    // HelloGCS prints a message when a file is changed in a Cloud Storage bucket.
    func HelloGCS(ctx context.Context, e GCSEvent) error {
    	log.Printf("Processing file: %s", e.Name)
    	
    	intSize, _ := strconv.Atoi(e.Size)
    
    	if intSize > 1000000 {
    		log.Printf("Big file detected, do something!")
    	} else {
    		log.Printf("Normal size file detected")
    	}
    
    	return nil
    }
    

    After deploying it, I want to test it. To do so, I just dropped two files—one that was 54 bytes and another that was over 1MB.
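
    (If you’d rather not drag files into the console, a couple of gsutil copies do the same job. The file and bucket names here are placeholders for whatever you created above.)

    gsutil cp small-file.txt gs://YOUR_BUCKET/
    gsutil cp big-file.zip gs://YOUR_BUCKET/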

    Now I’m heading over to the Cloud Functions dashboard and looking at the inline “Logs” tab. This shows me the system logs, as well as anything my function itself emitted. After just a moment, I see the logs my function wrote out, including the “normal size file” and “big file detected” messages.

    Goodness that was easy. The same sort of in-experience trigger exists for Pub/Sub, making it easy to generate functions that respond to messaging events.

    There’s another UI-driven way to do this. From the Cloud Functions experience, I chose to add a new function. You see here that I have a choice of “trigger.”

    I chose “Cloud Storage” and then picked from a list of possible event types. Let’s also choose the right bucket to listen in on. Note that from this creation wizard, I can also do things like set the memory allocation and timeout period, define the minimum and maximum instance count, add environment variables, reference secrets, and define ingress and egress permissions.

    Next, I have to add some source code. I can upload a zip file, reference a zip file in Cloud Storage, point to a source code repository, or add code inline. Let’s do that. What I love is that the code template recognizes my trigger type, and takes in the object representing the storage event. For each language. That’s a big time-saver, and helps new folks understand what the input object should look like. See here:

    Here, I picked Go again, used the same code as before, and deployed my function. Once again, it cleanly processes any event related to new files getting added to Cloud Storage. Cloud Functions is underrated, and super easy to work with.

    End to end, this solution should take 2-5 minutes tops to complete and deploy. That’s awesome. Past Richard would be crying for joy right now.

    AWS

    The granddaddy of serverless should be pretty good at this scenario too! From humble beginnings, AWS Lambda has seemingly become the preferred app platform in that ecosystem. Let’s use the AWS console experience to build a Lambda function that responds to new files landing in an S3 bucket.

    First, I need an S3 bucket. Easy enough, and accepting all the default settings.

    My bucket is now there, and I’m looking around, but don’t see any option to create a Lambda function from within this S3 interface. Maybe I’m missing it, but doesn’t seem so.

    No problem. Off to the Lambda dashboard. I click the very obvious “create function” button and am presented with a screen that asks for my function name and runtime, and the source of code.

    Let’s see what “from scratch” means, as I’d probably want some help via a template if it’s too bare bones. I click “create function” to move forward.

    Ok, rats, I don’t get an inline code editor if I want to write code in Go. Would have been useful to know beforehand. I’ll delete this function and start over, this time, looking for a blueprint that might provide a Go template for reading from S3.

    Doesn’t look like there’s anything for Go. If I want a blueprint, I’m choosing between Python and Node. Ok, I’ll drop my Go requirement and crank out this Lambda function in JavaScript. I picked that s3-get-object template, and then provided a function name and a role that can access S3. I’m asked for details about my S3 trigger (bucket name, event type) and shown the (uneditable) blueprint code. I’d like to make changes, but I guess I’ll wait until later, so I create the function.

    Shoot, I did something wrong. Got an error that, on the plus side, is completely opaque and unreadable.

    Not to be stopped, I’ll try clicking “add trigger” here, which lets me choose among a variety of sources, including S3, and this configuration seems to work fine.

    I want to update the source code of my function, so that it logs alerts for big files. I updated the Lambda code (after looking up the structure of the inbound event object) and clicked “deploy” to apply this new code.

    Not too bad. Ok, let’s test this. In S3, I just dropped a handful of files into the bucket. Back in the Lambda console, I jump to the “Monitor” tab to see what’s up.

    I’ve got the invocations listed here. I can’t see the logs directly, but looks like I need to click the LogStream links to view the invocation logs. Doing that takes me to a new window where I’m now in CloudWatch. I now see the logs for this particular set of invocations.

    Solid experience. A few hiccups, but we’ll chalk some of that up to my incompetence, and the remainder to the fact that AWS UIs aren’t always the most intuitive.

    Microsoft Azure

    Azure, my old friend. Let’s see how I can use the Azure Portal to trigger an Azure Function whenever I add something to a storage bucket. Here we go.

    Like with the walkthroughs above, I also need to set up some storage. From the home page, I click “create resource” and navigate on the left-hand side to “Storage.” And … don’t see Azure Storage. *Expletive*.

    I can’t find what category it’s in, but just noticed it in the “Get started” section. It’s weird, but whatever. I pick an Azure subscription and resource group, try to set a name (and remember that it doesn’t accept anything but letters and numbers, no dashes), and proceed. It validates something (not sure I’ve ever seen this NOT pass) and then I can click “create.”

    After thirty seconds, I have my storage account. Azure loves “things contained within things” so this storage account itself doesn’t hold objects. I create a “container” to hold my actual documents.

    Like with Lambda, I don’t see a way from this service to create an event-driven function. [Updated 2-13-22: A reader pointed out that there is an “events” experience in Storage that lets you somewhat create a function (but not the Function App itself). While convenient, the wizard doesn’t recognize where you are, and asks what sort of Function (storage!) you want to build. But it’s definitely something.]

    So, let’s go to the Azure Functions experience. I’m asked to create a “Function App.” There’s no option to choose Go as a managed language, so I’ll once again pick Node. YOU WIN AGAIN JAVASCRIPT.

    I move on to the next pane of the wizard where I’m asked about hosting stack. Since this is 2022, I chose Linux, even though Windows is somehow the recommended stack for Node functions. After a few moments, I have my Function app.

    As with the storage scenario, this Function app isn’t actually the function. I need to add a function to the app. Ok, no problem. Wait, apparently you can’t use the inline editor for Linux-based functions because of reasons.

    Sigh. I’ll create a new Function App, this time choosing Windows as the host. Now when I choose to add a function to this Function App, I see the option for “develop in portal”, and can choose a trigger. That’s good. I’ll choose the Storage Blob trigger, but I’m not clear on the parameter values I’m supposed to provide. Hmm, the “learn more” goes to a broken page. Found it by Googling directly. Looks like the “path” is the name of the container in the account, and {name} is a standard token.

    The creation succeeded, and now I have a function. Sweet. Let’s throw some code in here. The “Code + Test” window looks like an inline editor. I updated the code to do a quick check of file size, and hope it works.

    After saving it (I don’t see a concept of versioning), I can test it out. Like I did for Google Cloud and AWS, I dragged a couple of files onto the browser window pointing at the Storage Blob. Looks like the Azure Portal doesn’t support drag-and-drop. I’ll use the “upload files” wizard like an animal. After uploading, I switch back to the Azure Functions view which offers a “Monitor” view.

    I don’t love that “results may be delayed for up to 5 minutes” as I’m really into instant gratification. The Function dashboard shows two executions right away, but the logs are still delayed for minutes after that. Eventually I see the invocations show up, and it shows execution history (not app logs).

    I can’t seem to find the application logs, as the “logs” tab here seems to show a stream, but nothing appears here for me. Application Insights doesn’t seem to show the logs either. They could be lost to the universe, or more likely, I’m too bad at this to find them.

    Regardless, it works! My Azure Function runs when objects land in my Storage account.

    Wrap Up

    As to the options considered here, it seemed obvious to me that Google Cloud has the best dev experience. The process of creating a function is simple (and even embedded in related services), the inline editor easily works for all languages, and the integrated log monitoring made my build-deploy-test loop faster. The AWS experience was fine overall, although inconsistent depending on your programming language. And the Azure experience, honestly, felt super clunky and the Windows-centricity feels dated. I’m sure they’ll catch up soon.

    Overall, this was pretty fun. Managed services and serverless computing make these quick solutions so simple to address. It’s such an improvement over how we had to do this before!

  • Want to externalize app configuration with Spring Cloud Config and Google Cloud Secret Manager? Now you can.

    You’re familiar with twelve-factor apps? This relates to a set of principles shared by Heroku over a decade ago. The thinking goes, if your app adheres to these principles, it’s more likely to be scalable, resilient, and portable. While twelve-factor apps were introduced before Docker, serverless, or mainstream cloud adoption were a thing, I think these principles remain relevant in 2022. One of those principles relates to externalizing your configuration so that environment-related settings aren’t in code. Spring Cloud Config is a fun project that externalizes configurations for your (Java) app. It operates as a web server that serves up configurations sourced from a variety of places including git repos, databases, Vault, and more. A month ago, I saw a single-line mention in the Spring Cloud release notes that said Spring Cloud Config now integrates with Google Cloud Secret Manager. No documentation or explanation of how to use this feature? CHALLENGE ACCEPTED.

    To be sure, a Spring Boot developer can easily talk to Google Cloud Secret Manager directly. We already have a nice integration here. Why add the Config Server as an intermediary? One key reason is to keep apps from caring where the configs come from. A (Spring Boot) app just needs to make an HTTP request or use the Config Client to pull configs, whether they came from GitHub, a PostgreSQL database, a Redis instance, or Google Cloud Secret Manager. Or any combination of those. Let’s see what you think once we’re through.

    Setting up our config sources

    Let’s pull configs from two different places. Maybe the general purpose configuration settings are stored in git, and the most sensitive values are stored in Secret Manager.

    My GitHub repo has a flat set of configuration files. The Spring Cloud Config Server reads all sorts of text formats. In this case, I used YAML. My “app1” has different configs for the “dev” and “qa” environments, as determined by their file names.

    Secret Manager configs work a bit differently than git-based ones. The Spring Cloud Config Server uses the file name in a git repo to determine the app name and profile (e.g. “app1-qa.yml”) and makes each key/value pair in that file available to Spring for binding to variables. So from the image above, those three properties are available to any instance of “app1” where the Spring profile is set to “qa.” Secret Manager itself is really a key/value store. So the secret name+value is what is available to Spring. The “app” and “profile” come from the labels attached to the secret. Since you can’t have two secrets with the same name, if you want one secret for “dev” and one for “qa”, you need to name them differently. So, using the Cloud Code extension for VS Code, I created three secrets.

    Two of the secrets (connstring-dev, connstring-qa) hold connection strings for their respective environments, and the other secret (serviceaccountcert) only applies to QA, and has the corresponding label values.
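
    If you’d rather use the gcloud CLI than the Cloud Code extension, a rough sketch of creating one of these labeled secrets looks like the following. The label keys assume the Config Server defaults (“application” and “profile”), and the connection string value is obviously a placeholder.

    gcloud secrets create connstring-dev \
      --replication-policy=automatic \
      --labels=application=app1,profile=dev

    echo -n "Server=dev-db;User=app1;" | \
      gcloud secrets versions add connstring-dev --data-file=-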

    Ok, so we have all our source configs. Now to create the server that swallows these up and flattens the results for clients.

    Creating and testing our Spring Cloud Config Server

    Creating a Spring Cloud Config Server is very easy. I started at the Spring Initializr site to bootstrap my application. In fact, you can click this link and get the same package I did. My dependencies are on the Actuator and Config Server.

    The Google Cloud Secret Manager integration was added to the core Config Server project, so there’s no separate Spring dependency to add. It does appear you need to add a reference to the Secret Manager package itself to enable connectivity and such. I added this to my POM file.

    <dependency>
    		<groupId>com.google.cloud</groupId>
    		<artifactId>google-cloud-secretmanager</artifactId>
    		<version>1.0.1</version>
    </dependency>
    

    There’s no new code required to get a Spring Cloud Config Server up and running. Seriously. You just add an annotation (@EnableConfigServer) to the primary class.

    @EnableConfigServer
    @SpringBootApplication
    public class BootConfigServerGcpApplication {
    
    	public static void main(String[] args) {
    		SpringApplication.run(BootConfigServerGcpApplication.class, args);
    	}
    }
    

    The final step is to add some settings. I created an application.yaml file that looks like this:

    server:
      port: ${PORT:8080}
    spring:
      application:
        name: config-server
      profiles:
        active:
          secret-manager, git
      cloud:
        config:
          server:
            gcp-secret-manager:
              #application-label: application
              #profile-label: profile
              token-mandatory: false
              order: 1
            git:
              uri: https://github.com/rseroter/spring-cloud-config-gcp
              order: 2
    

    Let’s unpack this. First I set the port to whatever the environment provides, or 8080. I’m setting two active profiles here, so that I activate the Secret Manager and git environments. For the “gcp-secret-manager” block, you see I have the option to set the label values to designate the application and profile. If I wanted to have my secret with a label “appname:app1” then I’d set the application-label property here to “appname.” Make sense? I fumbled around with this for a while until I understood it. And notice that I’m pointing at the GitHub repo as well.

    One big thing to be aware of on this Secret Manager integration with Config Server. Google Cloud has the concept of “projects.” It’s a key part of an account hierarchy. You need to provide the project ID when interacting with the Google Cloud API. Instead of accepting this as a setting, the creators of the Secret Manager integration look up the value using a metadata service that only works when the app is running in Google Cloud. It’s a curious design choice, and maybe I’ll submit an issue or pull request to make that optional. In the meantime, it means you can’t test locally; you need to deploy the app to Google Cloud.

    Fortunately, Google Cloud Run, Secret Manager, and Artifact Registry (for container storage) are all part of our free tier. If you’re logged into the gcloud CLI, all you have to do is type gcloud run deploy and we take your source code, containerize it using buildpacks, add it to Artifact Registry, and deploy a Cloud Run instance. Pretty awesome.

    After a few moments, I have a serverless container running Spring middleware. I can scale to zero, scale to 1, handle concurrent requests, and maybe pay zero dollars for it all.

    Let’s test this out. We can query a Config Server via HTTP and see what a Spring Boot client app would get back. The URL contains the address of the server and path entries for the app name and profile. Here’s the query for app1 and the dev profile.
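
    With curl, that query looks something like this, using the standard Config Server path format of /{application}/{profile}:

    curl https://boot-config-server-gcp-ofanvtevaa-uw.a.run.app/app1/dev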

    See that our config server found two property sources that matched a dev profile and app1. This gives a total of three properties for our app to use.

    Let’s swap “dev” for “qa” in the path and get the configurations for the QA environment.

    The config server used different sources, and returns a total of five properties that our app can use. Nice!

    Creating and testing our config client

    Consuming these configurations from a Spring Boot app is simple as well. I returned to the Spring Initializr site and created a new web application that depends on the Actuator, Web, and Config Client packages. You can download this starter project here.

    My demo-quality code is basic. I annotated the main class as a @RestController, exposed a single endpoint at the root, and returned a couple of configuration values. Since the “dev” and “qa” connection strings have different configuration names—remember, I can’t have two Secrets with the same name—I do some clunky work to choose the right one.

    @RestController
    @SpringBootApplication
    public class BootConfigClientGcpApplication {
    
    	public static void main(String[] args) {
    		SpringApplication.run(BootConfigClientGcpApplication.class, args);
    	}
    
    	@Value("${appversion}")
    	String appVersion;
    
    	@Value("${connstring-dev:#{null}}")
    	String devConnString;
    
    	@Value("${connstring-qa:#{null}}")
    	String qaConnString;
    
    	@GetMapping("/")
    	public String getData() {
    		String secret;
    		secret = (devConnString != null) ? devConnString : qaConnString;
    		return String.format("version is %s and secret is %s",appVersion, secret);
    	}
    }
    

    The application.yaml file for this application has a few key properties. First, I set the spring.application.name, which tells the Config Client which configuration properties to retrieve. It’ll query for those assigned to “app1”. Also note that I set the profile to “dev”, which also impacts the query. And, I’m exposing the “env” endpoint of the actuator, which lets me peek at all the environment variables available to my application.

    server:
      port: 8080
    management:
      endpoints:
        web:
          exposure:
            include: env
    spring:
      application:
        name: app1
      profiles:
        active: dev
      config:
        import: configserver:https://boot-config-server-gcp-ofanvtevaa-uw.a.run.app
    

    Ok, let’s run this. I can do it locally, since there’s nothing that requires this app to be running in any particular location.
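
    Assuming the Maven wrapper that the Initializr generates, starting it locally is a one-liner:

    ./mvnw spring-boot:run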

    Cool, so it returned the values associated with the “dev” profile. If I stop the app, switch the spring.profiles.active to “qa” and restart, I get different property values.

    So the Config Client in my application is retrieving configuration properties from the Config Server, and my app gets whatever values make sense for a given environment with zero code changes. Nice!

    If we want, we can also check out ALL the environment variables visible to the client app. Just send a request to the /actuator/env endpoint and observe.
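
    With the app running locally on its default port, that’s just:

    curl http://localhost:8080/actuator/env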

    Summary

    I like Spring Cloud Config. It’s a useful project that helps devs incorporate the good practice of externalizing configuration. If you want a bigger deep-dive into the project, check out my new Pluralsight course that covers it.

    Also, take a look at Google Cloud Run as a legit host for your Spring middleware and apps. Instead of over-provisioning VMs, container clusters, or specialized Spring runtimes, use a cloud service that scales automatically, offers concurrency, supports private traffic, and is pay-for-what-you-use.

  • I noticed these three themes in the (free) Google Cloud Next program starting October 12th

    Are you tired of online events yet? No? You might be the only one. There are a few events popping up in person, but it looks like we’re all stuck with “amazing digital experiences” for the next while. But at least the organizers are learning about what works and improving the events! Last year’s Google Cloud Next lasted for nine weeks, which was about eight weeks too long. Sorry about that. This year, our flagship cloud conference is a brisk three days, from October 12-14. And it’s free, which is cool.

    Cloud Next matters because a lot of what Google Cloud shares becomes widely adopted by others later on. Might as well get it here first!

    I flipped through the agenda to find the talks that interested me the most. Obviously my keynote/demo thing will be the most glorious session, but let’s put that aside. As I browsed the catalog, I identified a handful of themes. Here are fifteen talks I’m excited about, spread across my three made-up themes: familiar but better, migration ready, and optimized for scale.

    Familiar but better

    This is the story of Google Cloud. Things that resemble cloud services or products you’ve used before, but more full-featured, easier to use, and more reliable. Talks that stood out:

    Migration ready

    It’s fun to build and modernize, but many folks are looking for a clean path to migrate to the cloud faster, while working with what they already have. There are a few talks about this:

    Optimized for scale

    Many are past the first cloud wave of using stuff in small pockets. Now it’s about running things effectively at scale from a cost, security, manageability perspective. Talks I like:

    There are lots of other terrific talks covering security, analytics, infrastructure, and more, so do check out the whole catalog. I hope to see you at Cloud Next, and drop into my presentation to heckle me or provide moral support.

  • Using the new Google Cloud Config Controller to provision and manage cloud services via the Kubernetes Resource Model

    When it comes to building and managing cloud resources—VMs, clusters, user roles, databases—most people seem to use a combination of tools. The recent JetBrains developer ecosystem survey highlighted that Terraform is popular for infrastructure provisioning, and Ansible is popular for keeping infrastructure in a desired state. Both are great tools, full stop. Recently, I’ve seen folks look at the Kubernetes API as a single option for both activities. Kubernetes is purpose-built to take a declared state of a resource, implement that state, and continuously reconcile to ensure the resource stays in that state. While we apply this Kubernetes Resource Model to containers today, it’s conceptually valid for most anything.

    18 months ago, Google Cloud shipped a Config Connector that offered custom resource definitions (CRDs) for Google Cloud services, and controllers to provision and manage those services. Install this into a Kubernetes cluster, send resource definitions to that cluster, and watch your services hydrate. Stand up and manage 60-ish Google Cloud services as if they were Kubernetes resources. It’s super cool and useful. But maybe you don’t want 3rd party CRDs and controllers running in a shared cluster, and don’t want to manage a dedicated cluster just to host them. Reasonable. So we created a new managed service: Config Controller. In this post, I’ll look at manually configuring a GKE cluster, and then show you how to use the new Config Controller to provision and configure services via automation. And, if you’re a serverless fan or someone who doesn’t care at ALL about Kubernetes, you’ll see that you can still use this declarative model to build and manage cloud services you depend on.

    But first off, let’s look at configuring clusters and extending the Kubernetes API to provision services. To start with, it’s easy to stand up a GKE cluster in Google Cloud. It can be one-click or fifty, depending on what you want. You can use the CLI, API, Google Cloud Console, Terraform modules, and more.
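
    As a sketch, the simplest CLI path is a single command (the cluster name and zone here are just placeholders):

    gcloud container clusters create my-cluster --zone=us-central1-a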

    Building and managing one of something isn’t THAT hard. Dealing with fleets of things is harder. That’s why Google Anthos exists. It’s got a subsystem called Anthos Config Management (ACM). In addition to embedding the above-mentioned Config Connector, this system includes an ability to synchronize configurations across clusters (Config Sync), and apply policies to clusters based on Open Policy Agent Gatekeeper (Policy Controller). All these declarative configs and policies are stored in a git repo. We recently made it possible to use ACM as a standalone service for GKE clusters. So you might build up a cluster that looks like this:

    What this looks like in real life is that there’s a “Config Management” tab on the GKE view in the Console. When you choose that, you register a cluster with a fleet. A fleet shares a configuration source, so all the registered clusters are identically configured.

    Once I registered my GKE cluster, I chose a GitHub repo that held my default configurations and policies.

    Finally, I configured Policy Controller on this GKE cluster. This comes with a few dozen Google-provided constraint templates you can use to apply cluster constraints. Or bring your own. My repo above includes a constraint that limits how much CPU and memory a pod can have in a specific namespace.

    At this point, I have a single cluster with policy guardrails and applied configurations. I also have the option of adding the Config Connector to a cluster directly. In that scenario, a cluster might look like this:

    In that diagram, the GKE cluster not only has the GKE Config Management capabilities turned on (Config Sync and Policy Controller), but we’ve also added the Config Connector. You can add that feature during cluster provisioning, or after the fact, as I show below.

    Once you create an identity for the Config Connector to use, and annotate a Kubernetes namespace that holds the created resources, you’re good to go. I see all the cloud services we can create and manage by logging into my cluster and issuing this command:

    kubectl get crds --selector cnrm.cloud.google.com/managed-by-kcc=true

    Now, I can create instances of all sorts of Google Cloud managed services—BigQuery jobs, VMs, networks, Dataflow jobs, IAM policies, Memorystore Redis instances, Spanner databases, and more. Whether your app uses containers or functions, this capability is super useful. To create the resource I want, I write a bit of YAML. I could export an existing cloud service instance to get its representative YAML, write it from scratch, or generate it from the Cloud Code tooling. I chose the last option, and produced this YAML for a managed Redis instance via Memorystore:

    apiVersion: redis.cnrm.cloud.google.com/v1beta1
    kind: RedisInstance
    metadata:
      labels:
        label: "seroter-demo-instance"
      name: redisinstance-managed
    spec:
      displayName: Redis Instance Managed
      region: us-central1
      tier: BASIC
      memorySizeGb: 16
    

    With a single command, I apply this resource definition to my cluster.

    kubectl apply -f redis-test.yaml -n config-connector

    When I query Kubernetes for “redisinstances” it knows what that means, and when I look to see if I really have one, I see it show up in the Google Cloud Console.
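
    That query is just standard kubectl against the new resource type, something like:

    kubectl get redisinstances -n config-connector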

    You could stop here. We have a fully-loaded cluster that synchronizes configurations and policies, and can create/manage Google Cloud services. But the last thing is different from the first two. Configs and policies create a secure and consistent cluster. The Config Connector is a feature that uses the Kubernetes control plane for other purposes. In reality, what you want is something like this:

    Here, we have a dedicated KRM server thanks to the managed Config Controller. With this, I can spin up and manage cloud services, including GKE clusters themselves, without running a dedicated cluster or stashing extra bits inside an existing cluster. It takes just a single command to spin up this service (which creates a managed GKE instance):

    gcloud alpha anthos config controller create seroter-cc-instance \
    --location=us-central1

    A few minutes later, I see a cluster in the GKE console, and can query for any Config Controller instances using:

    gcloud alpha anthos config controller list --location=us-central1

    Now if I log into that service instance, and send in the following YAML, Config Controller provisions (and manages) a Pub/Sub topic for me.

    apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
    kind: PubSubTopic
    metadata:
      labels:
        label: "seroter-demo"
      name: cc-topic-1
    

    Super cool. But wait, there’s more. This declarative model shouldn’t FORCE you to know about Kubernetes. What if I want to GitOps-ify my services so that anyone could create cloud services by checking a configuration into a git repo versus kubectl apply commands? This is what makes this interesting to any developer, whether they use Kubernetes or not. Let’s try it.

    I have a GitHub repo with a flattened structure. The Config Sync component within the Config Controller service will read from this repo and work with the Config Connector to instantiate and manage any service instances I declare. To set this up, all I do is activate Config Sync and tell it about my repo. This is the file that I send to the Config Controller to do that:

    # config-management.yaml
    
    apiVersion: configmanagement.gke.io/v1
    kind: ConfigManagement
    metadata:
      name: config-management
    spec:
      #you can find your server name in the GKE console
      clusterName: krmapihost-seroter-cc-instance
      #not using an ACM structure, but just a flat one
      sourceFormat: unstructured
      git:
        policyDir: /
        syncBranch: main
        #no service account needed since there's no read permissions required
        secretType: none
        syncRepo: https://github.com/rseroter/anthos-seroter-config-repo-cc
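
    Applying it is one more kubectl command against the Config Controller cluster (a sketch, assuming the file name above):

    kubectl apply -f config-management.yaml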
    

    Note: this demo would have been easier if I had used Google Cloud’s Source Repositories instead of GitHub. But I figured most people would use GitHub, so I should too. The Config Controller runs as a private GKE cluster, which is safe and secure, but it doesn’t have outbound internet access by default. It can reach our Source Repos, but to reach GitHub I had to add an outbound VPC firewall rule for port 443 and then provision a Cloud NAT gateway so the traffic could flow.

    With all this in place, as soon as I check in a configuration, the Config Controller reads it and acts upon it. Devs just need to know YAML and git. They don’t have to know ANY Kubernetes to provision managed cloud services!

    Here’s the definition for a custom IAM role.

    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMCustomRole
    metadata:
      name: iamcustomstoragerole
      namespace: config-control
    spec:
      title: Storage Custom Role
      description: This role only contains two permissions - read and update
      permissions:
        - storage.buckets.list
        - storage.buckets.get
      stage: GA
    

    When I add that to my repo, I almost immediately see a new role show up in my account. And if I mess with that role directly by removing or adding permissions, I see Config Controller detect that configuration drift and return the IAM role back to the desired state.

    This concept gets even more powerful when you look at the blueprints we’re creating. Stamp out projects, landing zones, and GKE clusters with best practices applied. Imagine using the Config Controller to provision all your GKE clusters and prevent drift. If someone went into your cluster and removed Config Sync or turned off Workload Identity, you’d be confident knowing that Config Controller would reset those properties in short order. Useful!

    In this brave new world, you can keep Kubernetes clusters in sync and secured by storing configurations and policies in a git repo. And you can leverage that same git repo to store declarative definitions of cloud services, and ask the KRM-powered Config Controller to instantiate and manage those services. To me, this makes managing an at-scale cloud environment look much more straightforward.

  • Schema-on-write and schema-on-read don’t just apply to databases. They apply to message queues, too.

    Schema-on-write and schema-on-read don’t just apply to databases. They apply to message queues, too.

    When does your app enforce its data structure? If you’re using a relational database, you comply with a pre-defined data structure when you write data to its tables. The schema—made up of field names, data types, and foreign key constraints, for example—is enforced up front. Your app won’t successfully write data if it violates the schema. Many of us have been working with schema-on-write relational databases for a long time, and they make sense when you have relatively static data structures.

    If you’d prefer to be more flexible with what data you store, and want data consumers to be responsible for enforcing structure, you’ll prefer a NoSQL database. Whether you’ve got a document-style database like Firestore or MongoDB, or a key-value store like Redis, you’re mostly leaving it up to the client to retrieve the data and deserialize it into a structure it expects. These clients apply a schema when they read the data.

    Both of these approaches are fine. It’s all about what you need for a given scenario. While this has been a choice for database folks for a while, today’s message queue services often apply a schema-on-read approach. Publish whatever, and subscribers retrieve the data and deserialize it into the object they expect. To be sure, there are some queues with concepts of message structure—ActiveMQ has something, and traditional ESB products like TIBCO EMS and BizTalk Server offer schemas—but modern cloud-based queue services are typically data-structure-neutral.

    Amazon SQS is one of the oldest cloud services. It doesn’t look at any of the messages that pass through, and there’s no concept of a message schema. Same goes for Azure Service Bus, another robust queuing service that asks the consumer to apply a schema when a message is read. To be clear, there’s nothing wrong with that. It’s a good pattern. Heck, it’s one that Google Cloud applies too with Pub/Sub. However, we’ve recently added schema support, and I figured we should take a look at this unique feature.

    I wrote about Pub/Sub last year. It’s a fairly distinct cloud service. You can do traditional message queuing, of course. But it also supports things like message replay—which feels Kafka-esque—and push notifications. Instead of using 3+ cloud messaging services, maybe just use one?

    The schema functionality in Pub/Sub is fairly straightforward. A schema defines a message structure, you apply it to one or many new Topics, and only messages that comply with that schema may be published to those Topics. You can continue using Topics without schemas and accept any input, while attaching schemas to Topics that require upfront validation.

    Creating schemas

    Schemas work with messages encoded as JSON or in a binary format. And the schema itself is structured using either Apache Avro or the protocol buffer language. Both support basic primitive types, and complex structures (e.g. nested types, arrays, enumerations).

    With Google Cloud Pub/Sub, you can create schemas independently and then attach to Topics, or you can create them at the same time as creating a Topic. Let’s do the former.

    You can create schemas programmatically, as you’d expect, but let’s use the Google Cloud Console to do it here. I’M A VISUAL LEARNER.

    On the schemas view of the Console, I see options to view, create, and delete schemas.

    I chose to create a brand new schema. In this view, I’m asked to give the schema a name, and then choose if I’m using Avro or Protocol Buffers to define the structure.

    In that “schema definition” box, I get a nice little editor with type-ahead support. Here, I sketched out a basic schema for an “employee” message type.
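
    In case it helps to see one, here’s roughly what that Avro schema looks like (a sketch; the field names match the test message later in this post, but the enum values are placeholders I made up):

    {
      "type": "record",
      "name": "Employee",
      "fields": [
        { "name": "name", "type": "string" },
        { "name": "role", "type": "string" },
        { "name": "timeinroleyears", "type": "float" },
        { "name": "location", "type": { "type": "enum", "name": "Location", "symbols": ["SUNNYVALE", "SEATTLE", "AUSTIN"] } }
      ]
    }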

    No matter how basic, I’m still capable of typing things wrong. So, it’s handy that there’s a “validate schema” button at the bottom that shockingly confirmed that I got my structure correct.

    You’ll also notice a “test message” button. This is great. From here, I can validate input, and see what happens (below) if I skip a required field, or put the wrong value into the enumeration.

    Also note that the CLI lets you do this too. There are simple commands to test a message against a new schema, or one that already exists. For example:

    gcloud pubsub schemas validate-message \
            --message-encoding=JSON \
            --message="{\"name\":\"Jeff Reed\",\"role\":\"VP\",\"timeinroleyears\":0.5,\"location\":\"SUNNYVALE\"}" \
            --schema-name=employee-schema
    

    Once I’m content with the structure, I save the schema. Then it shows up in my list of available schemas. Note that I cannot change a published schema. If my structure changes over time, that’s a new schema. This is a fairly light UX, so I assume you should maintain versions in a source code repo elsewhere.
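
    If you’d rather skip the Console entirely, creating the schema from the CLI looks roughly like this (a sketch, assuming the Avro definition above is saved locally as employee-schema.json):

    gcloud pubsub schemas create employee-schema \
            --type=AVRO \
            --definition-file=employee-schema.json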

    [March 20, 2023 update: Schemas can now be updated.]

    Apply schemas to Topics

    In that screenshot above, you see a button that says “create topic.” I can create a Topic from here, or use the standard way of creating Topics and select a schema then. Let’s do that. When I go to the general “create Topic” view, I get a choice to use a schema and pick a message encoding. Be aware that you can ONLY attach schemas to new Topics, and once you attach a schema, you can’t remove it from that Topic. Make good choices.
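
    The CLI equivalent is roughly this sketch (the topic and schema names are the ones I used earlier):

    gcloud pubsub topics create new-employees \
            --schema=employee-schema \
            --message-encoding=JSON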

    [March 20, 2023 update: Schemas can now be added and removed from topics.]

    How do I know that a Topic has a schema attached? You have a few options.

    First, the Google Cloud Console shows you! When I view the details of a given Topic, I notice that the encoding and schema get called out.

    It’s not all about the portal UX, however. CLI fans need love too. Everything I did above, you can do in code or via the CLI. That includes getting details about a given schema. Notice below that I can list all the schemas for my project, and get the details for any given one.

    And also see that when I view my Topic, it shows that I have a schema applied.
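
    A rough sketch of those CLI checks:

    gcloud pubsub schemas list
    gcloud pubsub schemas describe employee-schema
    gcloud pubsub topics describe new-employees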

    Publishing messages

    After ensuring that my Topic has a subscription or two—messages going to a Topic without a subscription are lost—I tried publishing some messages.

    First, I did this from a C# application. It serializes a .NET object into a JSON object and sends it to my schema-enforcing Pub/Sub topic.

    using System;
    using Google.Cloud.PubSub.V1;
    using Google.Protobuf;
    using System.Text.Json;
    
    namespace core_pubsub_schema
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("Pub/Sub app started");
    
                PublishMessage();
    
                Console.WriteLine("App done!");
            }
    
            static void PublishMessage() {
                
                //define an employee object
                var employee = new Employee {
                    name = "Jeff Reed",
                    role = "VP",
                    timeinroleyears = 0.5f,
                    location = "SUNNYVALE"
                };
                //convert the .NET object to a JSON string
                string jsonString = JsonSerializer.Serialize(employee);
    
                //name of our topic
                string topicName ="projects/rseroter-misc/topics/new-employees";
                PublisherServiceApiClient publisher = PublisherServiceApiClient.Create();
    
                //create the message
                PubsubMessage message = new PubsubMessage
                {
                    Data = ByteString.CopyFromUtf8(jsonString)
                };
    
                try {
                    publisher.Publish(topicName, new[] { message });
                    Console.WriteLine("Message published!");
                }
                catch (Exception ex) {
                    Console.WriteLine(ex.ToString());
                }
            }
        }
    
        public class Employee {
            public string name {get; set; }
            public string role {get; set; }
            public float timeinroleyears {get; set;}
            public string location {get; set;}
        }
    }
    

    After running this app, I see that I successfully published a message to the Topic, and my lone subscription holds a copy for me to read.

    For fun, I can also publish messages directly from the Google Cloud Console. I like that we’ve offered the ability to publish up to a hundred messages on an interval, which is great for testing purposes.

    Below, I entered some JSON, and removed a required field (“role”) before publishing. You can see that I got an error before the message hit the Topic.

    Dealing with schema changes

    My first impression upon using this schema capability in Pub/Sub was that it’s cool, but I wish I could change schemas more easily, and detach schemas from Topics. But the more I thought about it, the more I understood the design decision.

    If I’m attaching a schema to a Topic, then I’m serious about the data structure. And downstream consumers are expecting that specific data structure. Changing the schema means creating a new Topic, and establishing new subscribers.

    What if your app can absorb schema changes, and you want to access new Subscriptions without redeploying your whole app? You might retrieve the subscription name from an external configuration (e.g. ConfigMap in Kubernetes) versus hard-coding it. Or use a proxy service/function/whatever in between publishers and Topics, or consumers and subscriptions. Changing that proxy might be simpler than changing your primary system. Regardless, once you sign up to use schemas, you’ll want to think through your strategy for handling changes.
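
    As a tiny illustration of that externalized-config idea, a Kubernetes ConfigMap could hold the subscription name so only the config changes when you cut over (all names here are hypothetical):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pubsub-settings
    data:
      subscription-name: projects/my-project/subscriptions/new-employees-sub-v2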

    [March 20, 2023 update: Schemas can now be updated.]

    Wrap up

    I like this (optional) functionality in Google Cloud Pub/Sub. You can do the familiar schema-on-read approach, or now do a schema-on-write when needed. If you want to try this yourself, take advantage of our free tier for Pub/Sub (10GB of messages per month) and let me know if you come up with any cool use cases, or schema upgrade strategies!

  • So, what the heck is *outbound* product management, and should you have this function too?

    So, what the heck is *outbound* product management, and should you have this function too?

    When the executive recruiter pinged me about joining Google Cloud to lead an outbound product management team, my first question was: “um, what’s an outbound product management team?” After a year in the job, I now know. Sort of. I’ve saved a list of questions people have asked me, and figured I’d answer them here.

    As an aside, smarter people than me have done a good job explaining the fundamentals of product management itself. Read anything by Melissa Perri—her book is great, and see this post on how a product owner is different from a product manager. Also dig into the tremendous archives of Marty Cagan to learn what product managers should be skilled in, and what a good job description looks like. And John Cutler is a must-read for regular, insightful perspectives on product thinking.

    Here’s what I get asked fairly often about outbound product management.

    Q: How are outbound product managers different than “regular” product managers?

    Both types of PMs have foundational product management skills, technical expertise, and a focus on product success. Outbound product managers are product managers who are primarily focused on go-to-market activities and customer interactions. We don’t maintain a product backlog, partner with engineers to plot out a given release, or directly own the product strategy.

    Instead of working with one product or subsystem, outbound PMs often work across a portfolio of products. We champion the portfolio and products to internal and external audiences, and spend a significant amount of time talking to customers, partners, industry analysts, and field personnel. Outbound PMs take what we learn and feed it back into our overall portfolio and product strategy.

    Q: What’s the purpose of the team?

    I describe the mission of our team as “increasing the adoption and fit of our products.” The “adoption” part is outbound focused and includes customer briefings, analyst updates, field training, partner enablement, content creation, and more. The “fit” part is how we take this broad set of things we learn about, and ensure we have a relevant, customer-focused strategy for every product in the portfolio. That means doing things like advocating for new products and features, owning and updating roadmaps, and helping construct cross-product strategies.

    Q: Where did outbound product management come from?

    I don’t know exactly. I think that our CEO Thomas Kurian had this at Oracle, and brought it to Google Cloud because it worked well. I’m told that other software companies had this practice, or a variation of it at different times in their history.

    I’ve seen more and more people put outbound PM on their resumes, so I’m not sure if that’s because they had that actual title, or if they did related work and want to align more closely to my job description 🙂

    Q: Isn’t this just a marketing team or an “office of the CTO” type team?

    We do a bit of a lot of things. From what I’ve observed, the biggest difference between outbound product management and its sister teams (product marketing, office of the CTO, developer relations) is the direct alignment with product management. I sit in engineering reviews, talk daily to product managers, contribute to our product roadmap artifacts, and help create our product strategy. The related teams do wonderful work broadcasting information and engaging outbound with customers. We do parts of that, but we also do the inward-facing work that takes what we learn from being outbound and makes the products better.

    Q: Where do you sit in the org chart?

    Outbound product management sits in the product management org, and outbound PMs are just “PMs” in our job ladder. At times, I feel like I answer to a dozen different people, given that OPM sits at the center of so many things.

    The person actually stuck managing me is a VP of Product Management. When performance review time arrives, we’re evaluated alongside the rest of the product managers. That said, we do different things than inbound PMs, and I’m working to ensure that’s accounted for during perf and promo cycles!

    Q: How is the team arranged?

    I think I was the first outside hire for outbound product management at Google. Now, there are 40+ folks, spread across 8+ product areas (e.g. Storage, Networking, Compute, Analytics, AI, App Modernization). Teams range in size from 2-10 individuals.

    Currently, my team is arranged by product area. We drive our developer tools (Cloud SDK, Cloud Code), serverless (Cloud Build, Cloud Run, Cloud Functions, Cloud Workflows, App Engine), and container services (GKE and Anthos). Individual outbound PMs focus on a given product (e.g. GKE) or product area (e.g. CI/CD).

    Everyone pitches in on cross-cutting efforts like analyst briefings, events, and joint messaging. Subsets of the team have expertise in different areas, and pair up to tackle those projects.

    Q: What skills do you hire for?

    It’s taken me a year to build out the team, so I’m not sure I’m a good person to ask. But basically, the best outbound PMs are continuous learners who are good communicators, customer-focused, and technically savvy. Ideally, candidates have meaningful product management experience, worked in the enterprise software space before, know the industry landscape, and have outstanding soft skills. That said, I’ve hired people who were new to cloud, people who hadn’t officially been a product manager before, and people who were in different roles within Google. We have a fun, diverse team of people with complementary skills who think strategically while learning constantly.

    Q: What does your team do, day to day?

    Let’s look at last week, shall we? Our team worked on the following things:

    • Delivered 20-ish talks to individual customers.
    • Helped a handful of customers onboard into private previews of new product features.
    • Updated product roadmap artifacts based on changes in priority for a couple of planned features.
    • Wrote announcement blog posts for multiple upcoming launches.
    • Hosted analyst inquiries with Gartner, Forrester, and Redmonk to learn about topics like software composition analysis, container management, and developer needs.
    • Filled out multiple analyst questionnaires and prepared for an hour-long presentation to Forrester for an upcoming “Wave” evaluation.
    • Created competitive overviews to summarize the container platform landscape. One is given to the field, the other is for an offsite with GKE product leadership.
    • Published video assets created for the Google social media team.
    • Did product roadmap reviews and Q&A sessions with our field teams and Certified Fellows community.
    • Drafted keynotes for upcoming public events.
    • Conducted multiple interviews of product management candidates.

    Q: How do you measure success? What are your OKRs?

    An inbound PM often has objectives and key results related to launching a given product and landing it in the market. For our outbound PM team, the four objectives we agreed on for 2021 include:

    1. [Enablement] Educate and enable the field, partners, and customers for success on product portfolio
    2. [Revenue] Engage with customers to meaningfully improve the adoption of product portfolio
    3. [Positioning and Strategy] Unify the positioning and strategy of the product portfolio across internal teams, partners, customers and developer communities
    4. [Team] Build an outbound PM organization that leads by example with expert people, repeatable systems, and inclusive culture.

    Each of these has some very specific key results to track progress towards those objectives. And it’s all a work in progress, so who knows how this will evolve.

    Q: Should every software company have an outbound product management team?

    No? When I ran product management at a previous company, our product managers did both inbound and outbound activities. There wasn’t a separate function. This specific discipline might make sense for your org if you have VERY technical products and want help landing those in the market and ensuring fit. Or, add outbound PMs if you have a very broad portfolio of products and need to unify the market positioning while better integrating the internal strategy. Outbound PM might also make sense if you have PMs that need to be very focused on day-to-day prioritization and delivery, and want help getting constant customer feedback and educating internal teams (e.g. field staff, marketing) about product capabilities.

    For some, outbound product management may be a temporary team; for others, it’ll be a durable part of how they plan and deliver products. It’ll be interesting to watch!

    Q: Are you having fun as an outbound product manager?

    I am. Honestly, I’ve enjoyed every job I’ve ever had. They each scratched a different itch. This one is unique, however. It requires the combination of every professional skill I’ve developed to this point, while still teaching me new ones. It’s a privilege to work here and lead this team, and I look forward to starting each day.

    We’re still hiring for a few OPM teams, so if this sounds interesting, throw your hat in the ring.

  • What’s the most configurable Kubernetes service in the cloud? Does it matter?

    What’s the most configurable Kubernetes service in the cloud? Does it matter?

    Configurability matters. Whether it’s in our code editors, database engine, or compute runtimes, we want the option—even if we don’t regularly use it—to shape software to our needs. When it comes to using that software as a service, we also look for configurations related to quality attributes—think availability, resilience, security, and manageability.

    For something like Kubernetes—a hyper-configurable platform on its own—you want a cloud service that makes this powerful software more resilient and cheaper to operate. This blog post focuses on configurability of each major Kubernetes service in the public cloud. I’ll make that judgement based on the provisioning options offered by each cloud.

    Disclaimer: I work for Google Cloud, so obviously I’ll have some biases. That said, I’ve used AWS for over a decade, was an Azure MVP for years, and can be mostly fair when comparing products and services. Please call out any mistakes I make!

    Google Kubernetes Engine (GKE)

    GKE was the first Kubernetes service available in the public cloud. It’s got a lot of features to explore. Let’s check it out.

    When creating a cluster, we’re immediately presented with two choices: standard cluster, or Autopilot cluster. The difference? A standard cluster gives the user full control of cluster configuration, and ownership of day-2 responsibilities like upgrades. An Autopilot cluster—which is still a GKE cluster—has a default configuration based on Google best practices, and all day-2 activities are managed by Google Cloud. This is ideal for developers who want the Kubernetes API but none of the management. For this evaluation, let’s consider the standard cluster type.
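
    As an aside, if you just want that Autopilot experience, provisioning is basically a one-liner from the CLI (a sketch; the cluster name and region are placeholders):

    gcloud container clusters create-auto my-autopilot-cluster --region=us-central1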

    If the thought of all these configurations feels intimidating, you’ll like that GKE offers a “my first cluster” button which spins up a small instance with a default configuration. Also, this first “create cluster” tab has a “create” button at the bottom that provisions a regular (3-node) cluster without requiring you to enter or change any configuration values. Basically, you can get started with GKE in three clicks.

    With that said, let’s look at the full set of provisioning configurations. On the left side of the “create a Kubernetes cluster” experience, you see the list of configuration categories.

    How about we look at the specific configurations? On the cluster basics tab, we have seven configuration decisions to make (or keep, if you just want to accept the default values). These configurations include:

    1. Name. Naming is hard. Names can be up to 40 characters long, and they’re permanent.

    2. Location type. Where do you want your control plane and nodes? Zonal clusters only live in a chosen zone, while Regional clusters spread the control plane and workers across zones in a region.

    3. Zone/Region. For zonal clusters, you pick a zone, for regional clusters, you pick a region.

    4. Specify default node locations. Choose which zone(s) to deploy to.

    5. Control plane version. GKE provisions and offers management of control plane AND worker nodes. Here, you choose whether you want to pick a static Kubernetes version and handle upgrades yourself, or a “release channel” where Google Cloud manages the upgrade cadence.

    6. Release channel. If you chose release channel vs static, you get a configuration choice of which channel. Options include “rapid” (get Kubernetes versions right away), “regular” (get Kubernetes versions after a period of qualification), and “stable” (longer validation period).

    7. Version. Whether choosing “static” or “release channel”, you configure which version you want to start with.

    You see in the picture that I can click “Create” here and be done. But I want to explore all the possible configurations at my disposal with GKE.
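
    For reference, those basics map to a fairly compact CLI command as well (a rough sketch; the name and values are placeholders):

    gcloud container clusters create my-standard-cluster \
            --region=us-central1 \
            --release-channel=regular \
            --num-nodes=3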

    My next (optional) set of configurations relates to node pools. A GKE cluster must have at least one node pool, which consists of an identical group of nodes. A cluster can have many node pools. You might want a separate pool for Windows nodes, or a bigger machine type, or faster storage.

    In this batch of configurations, we have:

    8. Add node pool. Here you have a choice on whether to stick with a single default node pool, or add others. You can add and remove node pools after cluster creation.

    9. Name. More naming.

    10. Number of nodes. By default there are three. Any fewer than three and you can have downtime during upgrades. Max of 1000 allowed here. Note that you get this number of nodes deployed PER location. 3 nodes x 3 locations = 9 nodes total.

    11. Enable autoscaling. Cluster autoscaling is cool. It works on a per-node-pool basis.

    12. Specify node locations. Where do you want the nodes? If you have a regional cluster, this is where you choose which AZs you want.

    13. Enable auto-upgrade. It’s grayed-out below because this is automatically selected for any “release channel” clusters. GKE upgrades worker nodes automatically in that case. If you chose a static version, then you have the option of selecting auto-upgrades.

    14. Enable auto-repair. If a worker node isn’t healthy, auto-repair kicks in to fix or replace the node. Like the previous configuration, this one is automatically applied for “release channel” clusters.

    15. Max surge. Surge updates let you control how many nodes GKE can upgrade at a given time, and how disruptive an upgrade may be. The “max surge” configuration determines how many additional nodes GKE adds to the node pool during upgrades.

    16. Max unavailable. This configuration refers to how many nodes can be simultaneously unavailable during an upgrade.
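
    If you prefer the CLI, adding a node pool with a few of those settings looks roughly like this (a sketch; the pool, cluster, and sizing values are placeholders):

    gcloud container node-pools create my-extra-pool \
            --cluster=my-standard-cluster \
            --region=us-central1 \
            --num-nodes=3 \
            --enable-autoscaling --min-nodes=1 --max-nodes=5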

    Once again, you could stop here, and build your cluster. I WANT MORE CONFIGURATION. Let’s keep going. What if I want to configure the nodes themselves? That’s the next available tab.

    For node configurations, you can configure:

    17. Image type. This refers to the node’s base OS; options include Google’s Container-Optimized OS, Ubuntu, and Windows Server.

    18. Machine family. GKE runs on virtual machines. Here is where you choose which type of underlying VM you want, including general purpose, compute-optimized, memory-optimized or GPU-based.

    19. Series. Some machine families have sub-options for specific VMs.

    20. Machine type. Here are the specific VM sizes you want, with combinations of CPU and memory.

    21. Boot disk type. This is where you choose a standard or SSD persistent disk.

    22. Boot disk size. Choose how big of a boot disk you want. Max size is 65,536 GB.

    23. Enable customer-managed encryption for boot disk. You can encrypt the boot disk with your own key.

    24. Local SSD disks. How many attached disks do you want? Enter here. Max of 24.

    25. Enable preemptible nodes. Choose to use cheaper compute instances that only live for up to 24 hours.

    26. Maximum pods per node. Limit how many pods you want on a given node, which has networking implications.

    27. Network tags. These tags are used to target firewall rules at the nodes.

    Security. Let’s talk about it. You have a handful of possible configurations to secure your GKE node pools.

    Node pool security configurations include:

    28. Service account. By default, containers running on this VM call Google Cloud APIs using this account. You may want a unique service account, and/or least-privilege one.

    29. Access scopes. Control the type and level of API access granted to the underlying VM.

    30. Enable sandbox with gVisor. This isn’t enabled for the default node pool, but for others, you can choose the extra level of isolation for pods on the node.

    31. Enable integrity monitoring. Part of the “Shielded node” functionality, this configuration lets you monitor and verify boot integrity.

    32. Enable secure boot. Use this configuration setting for additional protection from boot-level and kernel-level malware.

    Our last set of options for each node pool relates to metadata. Specifically:

    33. Kubernetes labels. These get applied to every node in the pool and can be used with selectors to place pods.

    34. Node taints. These also apply to every node in the pool and help control what gets scheduled.

    35. GCE instance metadata. This attaches info to the underlying GCE instances.

    That’s the end of the node pool configurations. Now we have the option of cluster-wide configurations. First up are settings based on automation.

    These cluster automation configurations include:

    36. Enable Maintenance Window. If you want maintenance activities to happen during certain times or days, you can set up a schedule.

    37. Maintenance exclusions. Define up to three windows where updates won’t happen.

    38. Enable Notifications. GKE can publish upgrade notifications to a Google Cloud Pub/Sub topic.

    39. Enable Vertical Pod Autoscaling. With this configured, your cluster will rightsize CPU and memory based on usage.

    40. Enable node auto-provisioning. GKE can create/manage entire node pools on your behalf versus just nodes within a pool.

    41. Autoscaling profile. Choose when to remove underutilized nodes.

    The next set of cluster-level options refer to Networking. Those configurations include:

    42. Network. Choose the network the GKE cluster is a member of.

    43. Node subnet. Apply a subnet.

    44. Public cluster / Private cluster. If you want only private IPs for your cluster, choose a private cluster.

    45. Enable VPC-native traffic routing. Applies alias IP for more secure integration with Google Cloud services.

    46. Automatically create secondary ranges. Disabled here because my chosen subnet doesn’t have available user-managed secondary ranges. If it did, I’d have a choice of letting GKE manage those ranges.

    47. Pod address range. Pods in the cluster are assigned IPs from this range.

    48. Maximum pods per node. Has network implications.

    49. Service address range. Any cluster services will be assigned an IP address from this range.

    50. Enable intranode visibility. Pod-to-pod traffic becomes visible to the GCP networking fabric so that you can do flow logging, and more.

    51. Enable NodeLocal DNSCache. Improve perf by running a DNS caching agent on nodes.

    52. Enable HTTP load balancing. This installs a controller that applies configs to the Google Cloud Load Balancer.

    53. Enable subsetting for L4 internal load balancers. Internal LBs use a subset of nodes as backends to improve perf.

    54. Enable control plane authorized networks. Block untrusted, non-GCP sources from accessing the Kubernetes master.

    55. Enable Kubernetes Network Policy. This API lets you define which pods can access each other.

    GKE also offers a lot of (optional) cluster-level security options.

    The cluster security configurations include:

    56. Enable Binary Authorization. If you want a secure software supply chain, you might want to apply this configuration and ensure that only trusted images get deployed to GKE.

    57. Enable Shielded GKE Nodes. This provides cryptographic identity for nodes joining a cluster.

    58. Enable Confidential GKE Nodes. Encrypt the memory of your running nodes.

    59. Enable Application-level Secrets Encryption. Protect secrets in etcd using a key stored in Cloud KMS.

    60. Enable Workload Identity. Map Kubernetes service accounts to IAM accounts so that your workload doesn’t need to store creds. I wrote about it recently.

    61. Enable Google Groups for RBAC. Grant roles to members of a Workspace group.

    62. Enable legacy authorization. Prevents full Kubernetes RBAC from being used in the cluster.

    63. Enable basic authentication. This is a deprecated way to authenticate to a cluster. Don’t use it.

    64. Issue a client certificate. Skip this too. This creates a specific cert for cluster access, and doesn’t automatically rotate.

    It’s useful to have cluster metadata so that you can tag clusters by environment, and more.

    The couple of metadata configurations are:

    65. Description. Free text box to describe your cluster.

    66. Labels. Add individual labels that can help you categorize.

    We made it to the end! The last set of GKE configurations relate to features that you want to add to the cluster.

    These feature-based configurations include:

    67. Enable Cloud Run for Anthos. Throw Knative into your GKE cluster.

    68. Enable Cloud Operations for GKE. A no-brainer. Send logs and metrics to the Cloud Ops service in Google Cloud.

    69. Select logging and monitoring type. If you select #68, you can choose the level of logging (e.g. workload logging, system logging).

    70. Enable Cloud TPU. Great for ML use cases within the cluster.

    71. Enable Kubernetes alpha features in this cluster. Enabled if you are NOT using release channels. These are short-lived clusters with everything new lit up.

    72. Enable GKE usage metering. See usage broken down by namespace and label. Good for chargebacks.

    73. Enable Istio. Throw Istio into your cluster. Lots of folks do it!

    74. Enable Application Manager. Helps you do some GitOps style deployments.

    75. Enable Compute Engine Persistent Disk CSI Driver. This is now the standard way to get volume claims for persistent storage.

    76. Enable Config Connector. If you have Workload Identity enabled, you can set this configuration. It adds custom resources and controllers to your cluster that let you create and manage 60+ Google Cloud services as if they were Kubernetes resources.

    FINAL TALLY. Getting started: 3 clicks. Total configurations available: 76.

    Azure Kubernetes Service (AKS)

    Let’s turn our attention to Microsoft Azure. They’ve had a Kubernetes service for quite a while.

    When creating an AKS cluster, I’m presented with an initial set of cluster properties. Two of them (resource group and cluster name) are required before I can “review and create” and then create the cluster. Still, it’s a simple way to get started with just five clicks.

    The first tab of the provisioning experience focuses on “basic” configurations.

    These configurations include:

    1. Subscription. Set which of your Azure subscriptions to use for this cluster.

    2. Resource group. Decide which existing (or create a new) resource group to associate with this cluster.

    3. Kubernetes cluster name. Give your cluster a name.

    4. Region. Choose where in the world you want your cluster.

    5. Availability zones. For regions with availability zones, you can choose how to stripe the cluster across those.

    6. Kubernetes version. Pick a specific version of Kubernetes for the AKS cluster.

    7. Node size. Here you choose the VM family and instance type for your cluster.

    8. Node count. Pick how many nodes make up the primary node pool.
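
    Those basics also map to a fairly small CLI command (a rough sketch; the resource group and cluster name are placeholders):

    az aks create \
            --resource-group my-resource-group \
            --name my-aks-cluster \
            --node-count 3 \
            --generate-ssh-keys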

    Now let’s explore the options for a given node pool. AKS offers a handful of settings, including ones that fly out into another tab. These include:

    9. Add node pool. You can stick with the default node pool, or add more.

    10. Node pool name. Give each node pool a unique name.

    11. Mode. A “system” node pool is meant for running system pods. This is what the default node pool will always be set to. User node pools make sense for your workloads.

    12. OS type. Choose Linux or Windows, although system node pools must be Linux.

    13. Availability zones. Select the AZs for this particular node pool. You can change from the default set on the “basic” tab.

    14. Node size. Keep or change the default VM type for the cluster.

    15. Node count. Choose how many nodes to have in this pool.

    16. Max pods per node. Impacts network setup (e.g. how many IP addresses are needed for each pool).

    17. Enable virtual nodes. For bursty scenarios, this AKS feature deploys containers to nodes backed by their “serverless” Azure Container Instances platform.

    18. Enable virtual machine scale sets. Chosen by default if you use multiple AZs for a cluster. Plays a part in how AKS autoscales.

    The next set of cluster-wide configurations for AKS relate to security.

    These configurations include:

    19. Authentication method. This determines how an AKS cluster interacts with other Azure sources like load balancers and container registries. The user has two choices here.

    20. Role-based access control. This enables RBAC in the cluster.

    21. AKS-managed Azure Active Directory. This configures Kubernetes RBAC using Azure AD group membership.

    22. Encryption type. Cluster disks are encrypted at rest by default with Microsoft-managed keys. You can keep that setting, or change to a customer-managed key.

    Now, we’ll take a gander at the network-related configurations offered by Azure. These configurations include:

    23. Network configuration. The default option here is a virtual network and subnet created for you. You can also use CNI to get a new or existing virtual network/subnet with user-defined address ranges.

    24. DNS name prefix. This is the prefix used with the hosted API server’s FQDN.

    25. Enable HTTP application routing. The previous “Load balancer” configuration is fixed for every cluster created in the Azure Portal. This setting is about creating publicly accessible DNS names for app endpoints.

    26. Enable private cluster. This ensures that network traffic between the API server and node pools remains on a private network.

    27. Set authorized IP ranges. Choose the IP ranges that can access the API server.

    28. Network policy. Define rules for ingress and egress traffic between pods in a cluster. You can choose none, Calico, or Azure’s network policies.

    The final major configuration category is “integrations.” This offers a few options to connect AKS clusters to other Azure services.

    These “integration” configurations include:

    29. Container registry. Point to, or create, an Azure Container Registry instance.

    30. Container monitoring. Decide whether you want workload metrics fed to Azure’s analytics suite.

    31. Log Analytics workspace. Create a new one, or point to an existing one, to store monitoring data.

    32. Azure Policy. Choose to apply an admission controller (via Gatekeeper) to enforce policies in the cluster.

    The last tab for AKS configuration relates to tagging. This can be useful for grouping and categorizing resources for chargebacks.

    FINAL TALLY. Getting started: 5 clicks. Total configurations available: 33.

    Amazon Elastic Kubernetes Service (EKS)

    AWS is a go-to for many folks running Kubernetes, and they shipped a managed Kubernetes service a few years back. EKS looks different from GKE or AKS. The provisioning experience is fairly bare-bones and doesn’t provision the worker nodes; that’s something you do yourself later, at which point you get a series of node group configurations. It also leaves things like autoscalers as post-provisioning add-ons, versus making them part of the provisioning flow.

    Getting started with EKS means entering some basic info about your Kubernetes cluster.

    These configurations include:

    1. Name. Provide a unique name for your cluster.

    2. Kubernetes version. Pick a specific version of Kubernetes for your cluster.

    3. Cluster Service Role. This is the AWS IAM role that lets the Kubernetes control plane manage related resources (e.g. load balancers).

    4. Secrets encryption. This gives you a way to encrypt the secrets in the cluster.

    5. Tags. Add up to 50 tags for the cluster.

    After these basic settings, we click through some networking settings for the cluster. Note that EKS doesn’t provision the node pools (workers) themselves, so all these settings are cluster related.

    The networking configurations include:

    6. Select VPC. Choose which VPC to use for the cluster. This is not optional.

    7. Select subnets. Choose the VPC subnet for your cluster. Also, not optional.

    8. Security groups. Choose one or more security groups that apply to worker node subnets.

    9. Configure Kubernetes Service IP address range. Set the range that cluster services use for IPv4 addresses.

    10. Cluster endpoint access. Decide if you want a public cluster endpoint accessible outside the VPC (including worker access), a mix of public and private, or private only.

    11. Advanced settings. Here’s where you set source IPs for the public access endpoint.

    12. Amazon VPC CNI version. Choose which version of the add-on you want for CNI.

    The last major configuration view for provisioning a cluster relates to logging.

    The logging configurations include:

    13. API server. Log info for API requests.

    14. Audit. Grab logs about cluster access.

    15. Authenticator. Get logs for authentication requests.

    16. Controller manager. Store logs for cluster controllers.

    17. Scheduler. Get logs for scheduling decisions.

    We have 17 configurations available in the provisioning experience. I really wanted to stop here (versus being forced to create and pay for a cluster to access the other configuration settings), but to be fair, let’s look at post-provisioning configurations of EKS, too.
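
    For reference, that control-plane-only provisioning step looks roughly like this from the CLI (a sketch; the name, version, role ARN, and subnet IDs are placeholders):

    aws eks create-cluster \
            --name my-eks-cluster \
            --kubernetes-version 1.21 \
            --role-arn arn:aws:iam::123456789012:role/my-eks-cluster-role \
            --resources-vpc-config subnetIds=subnet-aaaa1111,subnet-bbbb2222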

    After creating an EKS cluster, we see that new configurations become available. Specifically, configurations for a given node pool.

    The node group configurations include:

    18. Name. This is the name for the node group.

    19. Node IAM role. This is the role used by the nodes to access AWS services. If you don’t have a valid role, you need to create one here.

    20. Use launch template. If you want a specific launch template, you can choose that here.

    21. Kubernetes labels. Apply labels to the node group.

    22. Tags. Add AWS tags to the node group.

    Next we set up compute and scaling configs. These configs include:

    23. AMI type. Pick the machine image you want for your nodes.

    24. Capacity type. Choose on-demand or spot instances.

    25. Instance type. Choose among dozens of VM instance types to host the nodes.

    26. Disk size. Pick the size of attached EBS volumes.

    27. Minimum size. Set the smallest number of nodes the node group can scale down to.

    28. Maximum size. Set the largest number of nodes the node group can scale up to.

    29. Desired size. Set the desired number of nodes to start with.

    Our final set of node group settings relate to networking. The configurations you have access to here include:

    30. Subnets. Choose which subnets for your nodes.

    31. Allow remote access to nodes. This ensures you can access nodes after creation.

    32. SSH keypair. Choose (or create) a key pair for remote access to nodes.

    33. Allow remote access from. This lets you restrict access to source IP ranges.
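
    Pulling those node group settings together, the CLI version looks roughly like this (a sketch; the names, role ARN, and subnet IDs are placeholders):

    aws eks create-nodegroup \
            --cluster-name my-eks-cluster \
            --nodegroup-name my-nodes \
            --node-role arn:aws:iam::123456789012:role/my-eks-node-role \
            --subnets subnet-aaaa1111 subnet-bbbb2222 \
            --scaling-config minSize=1,maxSize=4,desiredSize=3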

    FINAL TALLY. Getting started: 7 clicks (just cluster control plane, not nodes). Total configurations available: 33.

    Wrap Up

    GKE does indeed stand out here. GKE has the fewest steps required to get a cluster up and running. If I want a full suite of configuration options, GKE has the most. If I want a fully managed cluster without any day-2 activities, GKE is the only one that has that, via GKE Autopilot.

    Does it matter that GKE is the most configurable Kubernetes service in the public cloud? I think it does. Both AKS and EKS have a fine set of configurations. But comparing AKS or EKS to GKE, it’s clear how much more control GKE offers for cluster sizing, scaling, security, and automation. While I might not set most of these configurations on a regular basis, I can shape the platform to a wide variety of workloads and use cases when I need to. That means a single Kubernetes service can host almost anything, and I’m not stuck using specialized platforms for each workload.

    As you look to bring your Kubernetes platform to the cloud, keep an eye on the quality attributes you need, and who can satisfy them the best!

  • Exploring a fast inner dev loop for Spring Boot apps targeting Google Cloud Run

    Exploring a fast inner dev loop for Spring Boot apps targeting Google Cloud Run

    It’s a gift to the world that no one pays me to write software any longer. You’re welcome. But I still enjoy coding and trying out a wide variety of things. Given that I rarely have hours upon hours to focus on writing software, I seek things that make me more productive with the time I have. My inner development loop matters. You know, the iterative steps we perform to write, build, test, and commit code.

    So let’s say I want to build a REST API in Java. This REST API stores and returns the names of television characters. What’s the bare minimum that I need to get going?

    • An IDE or code editor
    • A database to store records
    • A web server to host the app
    • A route to reach the app

    What are things I personally don’t want to deal with, especially if I’m experimenting and learning quickly?

    • Provisioning lots of infrastructure. Either locally to emulate the target platform, or elsewhere to actually run my app. It takes time, and I don’t know what I need.
    • Creating database stubs or mocks, or even configuring Docker containers to stand-in for my database. I want the real thing, if possible.
    • Finding a container registry to use. All this stuff just needs to be there.
    • Writing Dockerfiles to package an app. I usually get them wrong.
    • Configuring API gateways or network routing rules. Just give me an endpoint.

    Based on this, one of the quickest inner loops I know of involves Spring Boot, the Google Cloud SDK, Cloud Firestore, and Google Cloud Run. Spring Boot makes it easy to spin up API projects, and its ORM capabilities make it simple to interact with a database. Speaking of databases, Cloud Firestore is powerful and doesn’t force me into a schema. That’s great when I don’t know the final state of my data structure. And Cloud Run seems like the single best way to run custom-built apps in the cloud. How about we run through this together?

    On my local machine, I’ve installed Visual Studio Code—the FASTEST possible inner loop might have involved using the Google Cloud Shell and skipping any local work, but I still like doing local dev—along with the latest version of Java, and the Google Cloud SDK. The SDK comes with lots of CLI tools and emulators, including one for Firestore and Datastore (an alternate API).

    Time to get to work. I visited start.spring.io to generate a project. I could choose a few dependencies from the curated list, including a default one for Google Cloud services, and another for exposing my data repository as a series of REST endpoints.

    I generated the project, and opened it in Visual Studio Code. Then, I opened the pom.xml file and added one more dependency. While I’m using the Firestore database, I’m using it in “Datastore mode” which works better with Spring Data REST. Here’s my finished pom file.

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    	<modelVersion>4.0.0</modelVersion>
    	<parent>
    		<groupId>org.springframework.boot</groupId>
    		<artifactId>spring-boot-starter-parent</artifactId>
    		<version>2.4.4</version>
    		<relativePath/> <!-- lookup parent from repository -->
    	</parent>
    	<groupId>com.seroter</groupId>
    	<artifactId>boot-gcp-run-firestore</artifactId>
    	<version>0.0.1-SNAPSHOT</version>
    	<name>boot-gcp-run-firestore</name>
    	<description>Demo project for Google Cloud and Spring Boot</description>
    	<properties>
    		<java.version>11</java.version>
    		<spring-cloud-gcp.version>2.0.0</spring-cloud-gcp.version>
    		<spring-cloud.version>2020.0.2</spring-cloud.version>
    	</properties>
    	<dependencies>
    		<dependency>
    			<groupId>org.springframework.boot</groupId>
    			<artifactId>spring-boot-starter-data-rest</artifactId>
    		</dependency>
    		<dependency>
    			<groupId>com.google.cloud</groupId>
    			<artifactId>spring-cloud-gcp-starter</artifactId>
    		</dependency>
    		<dependency>
    			<groupId>com.google.cloud</groupId>
    			<artifactId>spring-cloud-gcp-starter-data-datastore</artifactId>
    			<version>2.0.2</version>
    		</dependency>
    		<dependency>
    			<groupId>org.springframework.boot</groupId>
    			<artifactId>spring-boot-starter-test</artifactId>
    			<scope>test</scope>
    		</dependency>
    	</dependencies>
    	<dependencyManagement>
    		<dependencies>
    			<dependency>
    				<groupId>org.springframework.cloud</groupId>
    				<artifactId>spring-cloud-dependencies</artifactId>
    				<version>${spring-cloud.version}</version>
    				<type>pom</type>
    				<scope>import</scope>
    			</dependency>
    			<dependency>
    				<groupId>com.google.cloud</groupId>
    				<artifactId>spring-cloud-gcp-dependencies</artifactId>
    				<version>${spring-cloud-gcp.version}</version>
    				<type>pom</type>
    				<scope>import</scope>
    			</dependency>
    		</dependencies>
    	</dependencyManagement>
    
    	<build>
    		<plugins>
    			<plugin>
    				<groupId>org.springframework.boot</groupId>
    				<artifactId>spring-boot-maven-plugin</artifactId>
    			</plugin>
    		</plugins>
    	</build>
    
    </project>
    

    Let’s sling a little code, shall we? Spring Boot almost makes this too easy. First, I created a class to describe a “character.” I started with just a couple of characteristics—full name, and role.

    package com.seroter.bootgcprunfirestore;
    
    import com.google.cloud.spring.data.datastore.core.mapping.Entity;
    import org.springframework.data.annotation.Id;
    
    @Entity
    class Character {
    
        @Id
        private Long id;
        private String FullName;
        private String Role;
        
        public String getFullName() {
            return FullName;
        }
        public String getRole() {
            return Role;
        }
        public void setRole(String role) {
            this.Role = role;
        }
        public void setFullName(String fullName) {
            this.FullName = fullName;
        }
    }
    

    All that’s left is to create a repository resource and Spring Data handles the rest. Literally!

    package com.seroter.bootgcprunfirestore;
    
    import com.google.cloud.spring.data.datastore.repository.DatastoreRepository;
    import org.springframework.data.rest.core.annotation.RepositoryRestResource;
    
    @RepositoryRestResource
    interface CharacterRepository extends DatastoreRepository<Character, Long> {
        
    }
    

    That’s kinda it. No other code is needed. Now I want to test it out and see if it works. The first option is to spin up an instance of the Datastore emulator—not Firestore since I’m using the Datastore API—when my app starts. That’s handy. It’s one line in my app.properties file.

    spring.cloud.gcp.datastore.emulator.enabled=true
    

    When I execute ./mvnw spring-boot:run I see the app compile, and get a notice that the Datastore emulator was started up. I went to Postman to call the API. First I added a record.

    Then I called the endpoint to retrieve the stored data. It worked. It’s great that Spring Data REST wires up all these endpoints automatically.
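
    If you’d rather use curl than Postman, the same round trip looks roughly like this (a sketch that assumes the default Spring Data REST path derived from the repository name, plus a made-up character):

    curl -X POST http://localhost:8080/characters \
            -H "Content-Type: application/json" \
            -d '{"fullName":"Kenneth Parcell","role":"Page"}'

    curl http://localhost:8080/characters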

    Now, I really like that I can start up the emulator as part of the build. But, that instance is ephemeral. When I stop running the app locally, my instance goes away. What if my inner loop involves constantly stopping the app to make changes, recompile, and start up again? Don’t worry. It’s also easy to stand up the emulator by itself, and attach my app to it. First, I ran gcloud beta emulators datastore start to get the local instance running in about 2 seconds.

    Then I updated my app.properties file by commenting out the statement that enables local emulation, and replacing with this statement that points to the emulator:

    spring.cloud.gcp.datastore.host=localhost:8081
    

    Now I can start and stop the app as much as I want, and the data persists. Both options are great, depending on how you’re doing local development.

    Let’s deploy. I wanted to see this really running, and iterate further after I’m confident in how it behaves in a production-like environment. The easiest option for any Spring Boot developer is Cloud Run. It’s quick, it’s serverless, and we support buildpacks, so you never need to see a container.

    I issued a single CLI command—gcloud beta run deploy boot-app --memory=1024 --source .— to package up my app and get it to Cloud Run.

    After a few moments, I had a container in the registry, and an instance of Cloud Run. I don’t have to do any other funny business to reach the endpoint. No gateways, proxies, or whatever. And everything is instantly wired up to Cloud Logging and Cloud Monitoring for any troubleshooting. And I can provision up to 8GB of RAM and 4 CPUs, while setting up to 250 concurrent connections per container, and 1000 maximum instances. There’s a lot you can run with that horsepower.

    I pinged the public endpoint, and sure enough, it was easy to publish and retrieve data from my REST API …

    … and see the data sitting in the database!

    When I saw the results, I realized I wanted more data fields in here. No problem. I went back to my Spring Boot app, and added a new field, isHuman. There are lots of animals on my favorite shows.

    This time when I deployed, I chose the “no traffic” flag—gcloud beta run deploy boot-app --memory=1024 --source . --no-traffic—so that I could control who saw the new field. Once it deployed, I saw two “revisions” and had the ability to choose the amount of traffic to send to each.

    I switched 50% of the traffic to the new revision, liked what I saw, and then flipped it to 100%.

    So there you go. It’s possible to fly through this inner loop in minutes. Because I’m leaning on managed serverless technologies for things like application runtime and database, I’m not wasting any time building or managing infrastructure. The local dev tooling from Google Cloud is terrific, so I have easy use of IDE integrations, emulators and build tools. This stack makes it simple for me to iterate quickly, cheaply, and with tech that feels like the future, versus wrestling with old stuff that’s been retrofitted for today’s needs.

  • Want secure access to (cloud) services from your Kubernetes-based app? GKE Workload Identity is the answer.

    Want secure access to (cloud) services from your Kubernetes-based app? GKE Workload Identity is the answer.

    My name is Richard, and I like to run as admin. There, I said it. You should rarely listen to me for good security advice since I’m now (always?) a pretend developer who does things that are easy, not necessarily right. But identity management is something I wanted to learn more about in 2021, so now I’m actually trying. Specifically, I’m exploring the best ways for my applications to securely access cloud services. In this post, I’ll introduce you to GKE Workload Identity, and why it seems like a terrific way to do the right thing.

    First, let’s review some of your options for providing access to distributed components—think databases, storage, message queues, and the like—from your application.

    • Store credentials in application variables. This is terrible. Which means I’ve done it before myself. Never do this, for roughly 500 different reasons.
    • Store credentials in property files. This is also kinda awful. First, you tend to leak your secrets often because of this. Second, it might as well be in the code itself, as you still have to change, check in, do a build, and do a deploy to make the config change.
    • Store credentials in environment variables. Not great. Yes, it’s out of your code and config, so that’s better. But I see at least three problems. First, it’s likely not encrypted. Second, you’re still exporting creds from somewhere and storing them here. Third, there’s no version history or easy management (although clouds offer some help here). Pass.
    • Store credentials in a secret store. Better. At least this is out of your code, and in a purpose-built structure for securely storing sensitive data. This might be something robust like Vault, or something more basic like Kubernetes Secrets. The downside is still that you are replicating credentials outside the Identity Management system.
    • Use identity federation. Here we go. How about my app runs under an account that has the access it needs to a given service? This way, we’re not extracting and stashing credentials. Seems like the ideal choice.

    So, if identity federation is a great option, what’s the hard part? Well, if my app is running in Kubernetes, how do I run my workload with the right identity? Maybe through … Workload Identity? Basically, Workload Identity lets you map a Kubernetes service account to a given Google Cloud service account (there are similar types of things for EKS in AWS, and AKS in Azure). At no point does my app need to store or even reference any credentials. To experiment, I created a basic Spring Boot web app that uses Spring Cloud GCP to talk to Cloud Storage and retrieve all the files in a given bucket.

    package com.seroter.gcpbucketreader;
    
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    
    import com.google.api.gax.paging.Page;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    
    @Controller
    @SpringBootApplication
    public class GcpBucketReaderApplication {
    
    	public static void main(String[] args) {
    		SpringApplication.run(GcpBucketReaderApplication.class, args);
    	}
    
    	//initiate auto-configuration magic that pulls in the right credentials at runtime
    	@Autowired(required=false)
    	private Storage storage;
    
    	@GetMapping("/")
    	public String bucketList(@RequestParam(name="bucketname", required=false, defaultValue="seroter-bucket-logs") String bucketname, Model model) {
    
    		List<String> blobNames = new ArrayList<String>();
    
    		try {
    
    			//get the objects in the bucket
    			Page<Blob> blobs = storage.list(bucketname);
    			Iterator<Blob> blobIterator = blobs.iterateAll().iterator();
    
    			//stash the object (blob) names in a list
    			while(blobIterator.hasNext()) {
    				Blob b = blobIterator.next();
    				blobNames.add(b.getName());
    			}
    		}
    		//if anything goes wrong, catch the generic error and add to view model
    		catch (Exception e) {
    			model.addAttribute("errorMessage", e.toString());
    		}
    
    		//throw other values into the view model
    		model.addAttribute("bucketname", bucketname);
    		model.addAttribute("bucketitems", blobNames);
    
    		return "bucketviewer";
    	}
    }
    

    I built and containerized this app using Cloud Build and Cloud Buildpacks. It only takes a few lines of YAML and one command (gcloud builds submit --config cloudbuild.yaml .) to initiate the magic.

    steps:
    # use Buildpacks to create a container image
    - name: 'gcr.io/k8s-skaffold/pack'
      entrypoint: 'pack'
      args: ['build', '--builder=gcr.io/buildpacks/builder', '--publish', 'us-west1-docker.pkg.dev/seroter-anthos/seroter-images/boot-bucketreader:$COMMIT_SHA']
    

    In a few moments, I had a container image in Artifact Registry to use for testing.

    Then I loaded up a Cloud Storage bucket with a couple of nonsense files.

    Let’s play through a few scenarios to get a better sense of what Workload Identity is all about.

    Scenario #1 – Cluster runs as the default service account

    Without Workload Identity, a pod in GKE assumes the identity of the service account associated with the cluster’s node pool.

    When creating a GKE cluster, you choose a service account for a given node pool. All the nodes run as this account.

    I built a cluster using the default service account, which can basically do everything in my Google Cloud project. That’s fun for me, but rarely something you should do.

    From within the GKE console, I went ahead and deployed an instance of our container to this cluster. Later, I’ll use Kubernetes YAML files to deploy pods and expose services, but the GUI is fun to use for basic scenarios.

    Then, I created a service to route traffic to my pods.

    Once I had a public endpoint to ping, I sent a request to the page and provided the bucket name as a querystring parameter.
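
    I clicked through the console for those steps, but here’s a rough CLI equivalent. The deployment name and bucket name below are placeholders for whatever the console generated and whichever bucket you want to read.

    # expose the deployment behind an external load balancer
    kubectl expose deployment boot-bucketreader --type=LoadBalancer --port=80 --target-port=8080

    # grab the EXTERNAL-IP once it's provisioned, then hit the app with a bucket name
    kubectl get service boot-bucketreader
    curl "http://EXTERNAL_IP/?bucketname=my-test-bucket"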

    That worked, as expected. Since the pod runs as a super-user, it had full permission to Cloud Storage, and every bucket inside. While that’s a fun party trick, there aren’t many cases where the workloads in a cluster should have access to EVERYTHING.

    Scenario #2 – Cluster runs as a least privilege service account

    Let’s do the opposite and see what happens. This time, I started by creating a new Google Cloud service account that only had “read” permissions to the Artifact Registry (so that it could pull container images) and Kubernetes cluster administration rights.

    Then, I built another GKE cluster, but this time, chose this limited account as the node pool’s service account.

    After building the cluster, I went ahead and deployed the same container image to the new cluster. Then I added a service to make these pods accessible, and called up the web page.

    As expected, the attempt to read my Storage bucket failed. This least-privilege account didn’t have rights to Cloud Storage.

    This is a more secure setup, but now I need a way for this app to securely call the Cloud Storage service. Enter Workload Identity.

    Scenario #3 – Cluster has Workload Identity configured with a mapped service account

    I created yet another cluster. This time, I chose the least-privilege account, and also chose to enable Workload Identity. How does this work? When my app ran before, it used (via the Spring Cloud libraries) the Compute Engine metadata server to get a token to authenticate with Cloud Storage. When I configure Workload Identity, those requests to the metadata server get routed to the GKE metadata server. This server runs on each cluster node, mimics the Compute Engine metadata server, and gives me a token for whatever service account the pod has access to.
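
    If you want to see that plumbing for yourself, you can ask the same question the client libraries ask. From a shell inside any pod on the cluster, this call to the well-known metadata endpoint shows which service account identity the workload gets. It’s purely illustrative; Spring Cloud GCP does this for me.

    # ask the (GKE) metadata server which service account this pod is running as
    curl -s -H "Metadata-Flavor: Google" \
      "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"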

    If I deploy my app now, it still won’t work. Why? I haven’t actually mapped a service account to the namespace my pod gets deployed into!

    I created the namespace, created a Kubernetes service account, created a Google Cloud service account, mapped the two together, and annotated the Kubernetes service account. Let’s go step by step.

    First, I created the namespace to hold my app.

    kubectl create namespace blog-demos

    Next, I created a Kubernetes service account (“sa-storageapp”) that’s local to the cluster and namespace.

    kubectl create serviceaccount --namespace blog-demos sa-storageapp

    After that, I created a new Google Cloud service account named gke-storagereader.

    gcloud iam service-accounts create gke-storagereader

    Now we’re ready for some account mapping. First, I gave the Kubernetes service account the Workload Identity User role on my Google Cloud service account.

    gcloud iam service-accounts add-iam-policy-binding \
      --role roles/iam.workloadIdentityUser \
      --member "serviceAccount:seroter-anthos.svc.id.goog[blog-demos/sa-storageapp]" \
      gke-storagereader@seroter-anthos.iam.gserviceaccount.com
    

    Now, to give the Google Cloud service account the permission it needs to talk to Cloud Storage.

    gcloud projects add-iam-policy-binding seroter-anthos \
        --member="serviceAccount:gke-storagereader@seroter-anthos.iam.gserviceaccount.com" \
        --role="roles/storage.objectViewer"
    

    The final step? I had to add an annotation to the Kubernetes service account that links to the Google Cloud service account.

    kubectl annotate serviceaccount \
      --namespace blog-demos \
      sa-storageapp \
      iam.gke.io/gcp-service-account=gke-storagereader@seroter-anthos.iam.gserviceaccount.com
    

    Done! All that’s left is to deploy my Spring Boot application.

    First I set my local Kubernetes context to the target namespace in the cluster.

    kubectl config set-context --current --namespace=blog-demos

    In my Kubernetes deployment YAML, I pointed to my container image, and provided a service account name to associate with the deployment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: boot-bucketreader
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: boot-bucketreader
      template:
        metadata:
          labels:
            app: boot-bucketreader
        spec:
          serviceAccountName: sa-storageapp
          containers:
          - name: server
            image: us-west1-docker.pkg.dev/seroter-anthos/seroter-images/boot-bucketreader:latest
            ports:
            - containerPort: 8080
    

    I then deployed a YAML file to create a routable service, and pinged my application. Sure enough, I now had access to Cloud Storage.

    Wrap

    Thanks to Workload Identity for GKE, I created a cluster that had restricted permissions, and selectively gave permission to specific workloads. I could get even more fine-grained by tightening up the permissions on the GCP service account to only access a specific bucket (or database, or whatever). Or have different workloads with different permissions, all in the same cluster.
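
    As a sketch of that tightening, I could swap the project-wide objectViewer grant from earlier for a binding on a single bucket, then drop the broader one. The bucket name below is a placeholder.

    # grant objectViewer on one specific bucket instead of the whole project
    gsutil iam ch \
      "serviceAccount:gke-storagereader@seroter-anthos.iam.gserviceaccount.com:objectViewer" \
      gs://my-specific-bucket

    # then remove the project-level binding
    gcloud projects remove-iam-policy-binding seroter-anthos \
      --member="serviceAccount:gke-storagereader@seroter-anthos.iam.gserviceaccount.com" \
      --role="roles/storage.objectViewer"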

    To me, this is the cleanest, most dev-friendly way to do access management in a Kubernetes cluster. And we’re bringing this functionality to GKE clusters that run anywhere, via Anthos.

    What about you? Any other ways you really like doing access management for Kubernetes-based applications?