A cloud strategy

I’m going to rehash some lessons I have learned over the last decade and a bit of doing cloud in one way or another. I have recently read thought leaders reporting similar things – only better written and backed by more experience, of course – which made me think “oh, I’m not totally crazy then” and set about writing this down.

Basically, this is my medium-tempered take on the whole cloud thing, in terms of getting onto it.

Why cloud is cool

In the bad old days, if you had an idea for some software, you had to start by buying a server. I mean it was never on the scale of “I have an idea for a consumer product, let’s build a factory”, but still it was definitely a barrier to getting started.

The impetus for building out cloud was that Amazon needed compute for their little bookshop website (remember!?) and thought it was prohibitively expensive to buy high-end servers, so they decided to buy a metric faecal ton of low-end computers instead and use software to provision this aggregated computing horsepower, basically letting people choose between a bunch of virtual machine sizes depending on the oomph a certain department needed for their application.

The software bit was of course extremely complicated, but once they were done they had accidentally created cloud compute and could make more money selling that than selling books. The ability to use relatively simple APIs to create and provision VMs allowed startups to acquire really pathetic servers for nearly nothing, which was amazing for trialling ideas and fed the software boom we have seen over the last decades. Other services were built on top, and competitors came around.

What makes cloud cool is the tight set of APIs that let you create and destroy infrastructure in an automated fashion. Yes, you can autoscale, but the rapid prototyping potential is really beneficial and arguably even more significant.

What to worry about

Security

Yes, of course. Public cloud – you can tell from the name. It is not the same as having your servers at home, at least psychologically. On the other hand, assuming you are secure because your servers are inside your own building is a fallacy as well; you are still on the internet. There is no getting around this: you already have network people and security people, and all cloud providers offer ways of securing your network that those people will know how to operate sensibly. In other words, your network and security people will know what to do.

Cost

Yes, it’s a big one. Not all cloud things are free or near free. Running a VM 24/7 and block storage (as in a scalable pretend hard drive mapped to a VM) are usually the most expensive things you can do in a public cloud. Sadly, a VM tucked away in a virtual private network with elastic storage mapped to it seems like such an easy migration path if you are currently running your apps in VMs. You will eventually need to migrate your apps over to cloudier solutions such as AWS Fargate/Lambda or Azure App Services to reduce cost. For your in-house LOB apps you can in most cases (but not all) trivially replace file system storage with cloud-native blob storage such as AWS S3 or Azure Blob Storage, but it does require code changes, even if they are small. As the cloud bills start to come in, that seems a good way to spend developer resources, as the returns in terms of cost savings can be quite significant. Be wary of giving developers the ability to create resources at will, as the odd developer accidentally leaving a VM running will quickly add up. There are ways of dealing with this kind of thing, but do consider it.
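
To make the blob storage point concrete, here is a minimal sketch of swapping a local file write for an S3 upload, assuming AWSSDK.S3 is referenced; the bucket name and key prefix are made up for illustration.

using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public class ReportStore
{
    private readonly IAmazonS3 _s3 = new AmazonS3Client();

    // Before: File.WriteAllBytes(Path.Combine(shareRoot, fileName), bytes);
    public async Task SaveReportAsync(string fileName, byte[] bytes)
    {
        await _s3.PutObjectAsync(new PutObjectRequest
        {
            BucketName = "my-lob-reports",        // hypothetical bucket
            Key = $"reports/{fileName}",
            InputStream = new MemoryStream(bytes)
        });
    }
}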

What not to worry about

Multi cloud

There are many tools that provide abstractions over cloud APIs, and many tools that promise that they can offer you independence and warn of vendor lock-in. That is for most people just a waste.

You will need to choose one provider for your app stack. You can still have Google Apps for email and use AWS for cloud, or use DNSimple for DNS, Office 365 for email and AWS for apps, those are mostly orthogonal concerns. You will suffer outages. You will not – without incurring unfathomable cost – be able to load balance across cloud providers. If you really are that uptime sensitive, it would be cheaper for you to have georedundant datacentres and give the cloud thing a miss.

The problem with attempting to stay cloud agnostic is that you can only use the lowest common denominator of the tools you have available, rather than throwing yourself feet first into all the opportunities that exist with a given cloud provider.
Worst case, if the CEO gets angry enough at something and wants to switch just to make a point, it still will not be completely impossible to rewrite code at the seams. For instance, if you have changed your code to use AWS S3, it would be relatively trivial to change the code that calls S3 to use Azure Blob Storage in a pinch; no need to go and choose an abstraction platform for it. Just as with ORMs and database providers (“with NHibernate it’s so easy, you can switch to Oracle much more easily”), people very rarely switch cloud providers. There would have to be a very compelling economic argument anyway.
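
If that seam worries you, a thin storage interface is usually all the insurance you need. A sketch, with made-up names, using the S3 client from AWSSDK.S3; an Azure Blob Storage implementation of the same interface would be the only code that needs to change.

using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public interface IBlobStore
{
    Task SaveAsync(string key, Stream content);
    Task<Stream> OpenReadAsync(string key);
}

public class S3BlobStore : IBlobStore
{
    private readonly IAmazonS3 _s3;
    private readonly string _bucket;

    public S3BlobStore(IAmazonS3 s3, string bucket)
    {
        _s3 = s3;
        _bucket = bucket;
    }

    public Task SaveAsync(string key, Stream content) =>
        _s3.PutObjectAsync(new PutObjectRequest { BucketName = _bucket, Key = key, InputStream = content });

    public async Task<Stream> OpenReadAsync(string key) =>
        (await _s3.GetObjectAsync(_bucket, key)).ResponseStream;
}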

Why go through with it?

Rapid prototyping

You should test in production anyway, but if you insist on creating test environments, being able to copy/paste your prod environment exactly and test your changes is only possible in an environment where you aren’t poking at real metal. It would be ludicrous to buy overbuilt on-prem hardware “because sometimes I like to spin up a few extra copies of prod”. The powers that be would be livid at the massive capital expense that would go underutilised most of the time. With cloud, however, you spawn, test and destroy in minutes – merely a blip on the radar in terms of cost. To mitigate the risk of developers leaving stray instances around you can use governance like you do anywhere else in a workplace, but ideally the concept of ephemeral instances should lend itself nicely to clean-up.
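
To illustrate how cheap “spawn, test, destroy” is at the API level, here is a rough sketch using the AWS SDK for .NET (AWSSDK.EC2); the AMI id is a placeholder and error handling is omitted.

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.EC2;
using Amazon.EC2.Model;

public static async Task SpawnTestDestroyAsync()
{
    var ec2 = new AmazonEC2Client();

    // Spawn a throwaway instance from a placeholder image...
    var run = await ec2.RunInstancesAsync(new RunInstancesRequest
    {
        ImageId = "ami-0123456789abcdef0",
        InstanceType = InstanceType.T3Micro,
        MinCount = 1,
        MaxCount = 1
    });
    var instanceId = run.Reservation.Instances[0].InstanceId;

    // ...point your tests at it here...

    // ...and destroy it again when you are done.
    await ec2.TerminateInstancesAsync(new TerminateInstancesRequest
    {
        InstanceIds = new List<string> { instanceId }
    });
}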

Modern software development

Bringing the organisation to a place where it has autonomous engineering teams that can take a feature from idea to production without hand-offs is the key driver of organisational performance. Moving to the cloud is going to make that happen. You could achieve this on-prem as well, but it would probably mean buying more hardware than you will really use outside of short bursts. If that trade-off is worthwhile to you, then who am I to deny you your wish, but for most people cloud is the way to go.

What to do?

You probably need to get some help with this. Everybody in your organisation is already busy doing things. Taking on a cloud migration is going to be a massive effort for everybody, and you are still probably going to need an external, experienced consultancy to help you. There are many out there that offer to architect a cloud migration for you. Not everyone is a charlatan, but, given that the selection process for these types of gigs is “who did the CxO meet at a conference/play golf with/…”, I think the most important take-away is that the individual firm probably doesn’t really matter. It also probably doesn’t really matter which cloud provider you go with. Let a bunch of ops and devs benchmark the tools and APIs of various providers and come back with their feedback. There are probably going to be some budgets that can be negotiated between your consultancy (who most likely also has a VAR agreement with some cloud providers), the provider and yourself that will determine some kind of benefits for one over another, but that’s still only speculation at this point.

Eventually one provider will be declared the winner and work will start. It doesn’t really matter which one was chosen. Even if the engineers say there really is a show-stopper, do investigate, but most problems can probably be avoided through some development. If you are running some weird VM somewhere that needs specific hypervisor features or some curious networking then of course you will have a challenge – not necessarily an impossible one, though. This is not going to be done in an afternoon anyway. There is time to make changes to code, and there will be a need to do so in order to fully leverage the public cloud, as mentioned above. Obviously non-blockers can be deferred, but unlike traditional tech debt there will of course be direct cost implications.

Enterprise IT, a tragedy?

In the beginning of time, Grace Hopper invented business software and Cobol at IBM. This meant that in addition to calculating the trajectory of an intercontinental ballistic missile, large computers could now be installed, literally built into the very lower basement of international enterprises and be used to process payroll and do forecasting, gradually obsoleting armies of calculators and typists.

Earliest business applications

At this point, nobody knew how to actually write or maintain large software systems. I mean, granted, current paradigms like object oriented programming and functional programming were formalised around that time, but still, people had not written as much software back then as you now have to write to make a car set fire to petrol at an opportune time given crank angle, engine speed and temperature (I exaggerate, but not really).

From terminals to personal computers

Eventually Microsoft began its journey to global domination in the office by piggybacking onto IBM’s good name and getting installed into offices by default, since nobody got fired for buying IBM. Unfortunately, IBM’s own operating system guys that wrote time-sharing operating systems for mainframes had no influence over the people that created the CP/M knockoff that was to be PC-DOS (and Microsoft’s MS-DOS), so it had no multi-user or networking security features built in – well, you were supposed to have it on your desk and buy another computer for somebody else’s desk. No need for passwords, amirite? I mean, UNIX existed and was fairly widespread in the corporate environment at the time, and it had decent security built in – at least a fundamental understanding that even as a power user, you don’t want to have that power all the time.

The beginnings of Enterprise IT were therefore about writing and maintaining software for mainframe computers, and that work was so alien to most organisations that it quickly became outsourced to suppliers that could lower maintenance costs but overcharged for development work, thus lowering OPEX but moving software to CAPEX. This created one of the first negative incentives driving the business to make fewer changes, making enterprise IT infamously slow to react and turning it into a department of No.

As technology changed, the IT department stayed in the cellars and basements where the business rarely ventured. As mainframes were phased out and replaced with banks of personal computers, the white coats disappeared. Gradually, typing stuff into computers at the IT department stopped being work exclusively for women, and men started taking over; salaries increased (not saying there is a causal relationship, just saying), but still no natural daylight.

Forces conspire to make organisations write and maintain code

There is a syndrome that happens to computer people called “Not Invented Here”, which is a form of exceptionalism that means that no outsider could possibly understand our Very Special Requirements, so we need to write our own X, where X can be anything. While certain pieces of software became standardised, like payroll and accounting for small businesses, there is still big money in helping people shoot themselves in the foot by developing and maintaining their own adaptations of commercial ERP systems like SAP and, on a lesser scale, Dynamics NAV or GP. Microsoft Excel is the most successful way of combining both, as in letting people buy a commercial off-the-shelf application but then making Excel sheets with bespoke maths that can both be the lifeblood of new business and its cause of death when the bespoke template turns out to have a bug and nobody knows how to fix it anymore.

An accounting quirk (see the bit about CAPEX above) means you count code written as value created rather than a pure expense, and book development costs as having added capital value to your code base – despite the fact that, objectively, few if any of your competitors would buy your bespoke mess of a back-office system if you offered it on the open market. The books also most often don’t account for the depreciation that comes naturally unless you make the code maintainable and easy to refactor when requirements change in future.

At this point we have an alienated IT department that has only two key metrics, to cut cost and to add features to internal software products, but no real way to directly discuss requirements with those that use the software.

Negative incentives

In the beginning you only have developers. As in, they develop, profess to test, and deploy their code without any supervision. Then you have an outage that embarrasses somebody in management, but no single scapegoat can be found, and all of a sudden you get ops guys who are there to protect the business from the devs. After another outage that embarrasses the management further, you may get a QA department. You can imagine how IT security comes to exist as a function within an IT department.

You now have one team that’s there to make changes that implement business requirements they have captured some way or another, one side that’s there to make sure those changes are valid and one side that’s there to make sure there are no outages. The incentive becomes to make few releases: the QA guys want to be able to run a full regression before they green-light a release, since they are the ones getting a bollocking when bugs get out into the wild. The ops guys have enough to contend with without f^!£$g developers making changes ON PURPOSE – how are you ever going to maintain a stable system if you keep poking it with new software all the time?! So yeah, at best one release per month; anything quicker would be irresponsible and you would be working your QA department to the bone.

From a business perspective this means you are never getting your change in. As more things go wrong, process/red tape is added, lead times grow longer and change freezes are introduced periodically. Longer release cycles and bigger releases cause bigger problems.

Technically speaking, after mainframes people bought servers. At first they were just computers, beige like all the rest of them, and a server room was just a cupboard where they were shoved. You bought a physical computer from a supplier where you got a decent price, and you set it up, installed your OS and then installed your software on it. Or you bought the computer and shipped it to your software vendor and let them install their software on it. 19″ racks became a thing, and servers became distinct from PCs in that they became loud and flat. You were going to stick them in a room away from humans anyway, so noise was no longer a concern. You did want many computers per rack, and you wanted simple but effective cooling, so you would fit powerful high-RPM fans.

As things progressed, people realised that it is hard to find office space that allows you to fit redundant power, automatic fire suppression and redundant network connections, so instead of trying to fit that into your basement, they would go to a third party that offered co-located datacentres. There you could mount your rack servers and they would give you ways to remotely manage them, so you wouldn’t have to physically interact with the servers to run them. All the patching and other maintenance could be done over the internet, and the data centre would make sure no villains could get at your hardware.

After a while people realised that you could just buy a couple of massively overpowered servers and then divvy up the computing horsepower into virtual machines – pretend servers that would behave like separate physical machines. Carving out a bit of virtual compute and creating a new “server” was a lot faster than buying a physical server and having your colo provider plug it into your rack. You would still have to install the OS and configure the networking, but there would be templates and automation. Heck, you could even write command line scripts to new up new servers.

The point is, in Enterprise IT there is no time to write scripts, and far be it from the mind of any ops person to collaborate with developers to explore things like version control and automated testing – I mean, the developers are the enemy, the cause of all our problems, why would we collaborate? VMs are therefore largely artisan creations with very bespoke installs; apart from possibly sharing a raw template with some antivirus or monitoring, not even instance 1 and 2 of a load-balanced pair have the same software on them.

As the blows keep coming with outages, bugs in production and near misses that cause leadership to go on long-term sick leave, decrees can go out to create test environments that are “the same as production”.

Servers are bought (VM hosts, of course) and VMs are configured. Obviously, it is prohibitively expensive to make it exactly the same as production given that the load requirements will be different, so some corners are cut, but it should be close enough. As the old meme would go – narrator: It was not close enough. Also, since the server operating systems are still hand crafted, there are multiple potential places where differences between production and test can creep in.

Great expectations

In various countries, leaps were made at various times that caused people to adopt electronics at a vastly greater rate than before. In Sweden there was a push in the late nineties and early noughties where you would get a quite substantial tax rebate if you bought a modern computer at a certain cost, causing a lot of people to all of a sudden possess a modern computer. With various Covid stimulus cheques it seems a lot of Americans put money straight into a gaming computer (thus worsening the current silicon supply chain constraint situation). There have been similar schemes all over the world at one time or another to encourage adoption of technology and promote familiarity with it; I just can’t be bothered to google more examples. Basically, people have used the internet and are able to see what commercial software development can do. Another cause of people becoming aware of the wider world is how various Apple products have had a marked impact on Enterprise IT, and BYOD basically becomes a thing in businesses because the CFO one day comes back to the office with a MacBook Air and stares the IT department down until they “make it work” with their existing corporate software that only runs on a specific type of Dell laptop they imaged two years ago, requiring IE6.

Between the exposure to what computers can really do and the daily torture of using enterprise software to do their jobs, dissatisfaction among the people on the business side is rife. Eventually, some middle manager just takes their corporate expense account, hires some consultant off the books and builds some quick-win software to solve a specific problem. If we are unlucky, this is an all-out win: it works, it generates business and it was executed in a timely fashion on, or just over, budget. A dangerous precedent is set, another wedge is driven in between the business and the IT department. Shadow IT is born.

Recap

To bring us to the final bit of the story, let’s assess where we are.
Enterprise IT is disgraced, underfunded and distinct from the core business, literally moved away from the rest of the organisation. A troll under a bridge or an ogre in a swamp. Sure, the CTO may report to the Board of Directors, but there is no real correlation between desired business outcomes and the metrics that the IT department measures its services against, and no work is done to ensure that IT services directly benefit the key outcomes that the business as a whole needs to achieve. Control of IT spend is instead largely project based: the business has an idea, it needs some IT support, a project is created with a budget and a final ship date before anybody with technical know-how has even assessed it, and then work commences. Contractors are probably brought in if there is a sizeable chunk of development, but normally the department is kept lean. Service desk and “maintenance development” are presumably outsourced to a country in a different timezone.

What Enterprise IT wants is to be a force for good within the organisation, but since the beginning of time IT has often been deemed a non-core support function, and that increased conceptual distance has made it more difficult to effectively be of use to the organisation. Cultural differences between IT and other parts of the organisation, and perhaps ineffective communication between stakeholders and the developer organisation, have caused upper management to silo the organisations further apart rather than agree on an effective way of working closer together. In cases where IT is not involved in other aspects of the business, there will obviously be no “osmotic” assimilation of domain knowledge, so the business may be shocked at how little the IT department actually knows about the bread-and-butter business that keeps the lights on.

Roadmap

How do we turn things around? What are the most important things moving forward? I will leave out the Practices bit, because I have rarely seen bad practitioners in organisations I have worked with. People will write automated tests and implement CI and CD if they are given circumstances not directly hostile to professional software development. The bigger Lego pieces are usually where the problems lie, which is why management buy-in is crucial.

Increase automation

The VMs are not family, or even pets; you should delete them all and start over. Not in one go, and not in a way that causes another outage, but replace at pace the VMs you are currently running in production with new ones created through automation. When you can automatically deploy your core business application from nothing to a running instance without any manual intervention you are done. Thinking that you could replace your running VMs with automation, without actually having proven it, has no value. Ideally make the deployment process some version of “stand up new VMs with an app on them, run tests to make sure it’s working, route traffic to the new instances, destroy the old instances” (sketched below), so that you can deploy without causing any loss of availability. This is crucial in building trust with the rest of the organisation. When I write VM, I’m not making a dig against containers or containerisation – just saying you can achieve this without moving to Kubernetes and a service mesh; you can do it with bog-standard VMs and a bit of scripting.
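
A sketch of what that deployment loop might look like; every helper name here is made up and stands in for your provisioning scripts, smoke tests and load balancer API of choice. The point is the shape of it, not the specific calls.

using System;
using System.Threading.Tasks;

public async Task DeployAsync(string version)
{
    // Stand up new VMs with the app on them (hypothetical helper)
    var newInstances = await Provisioner.CreateInstancesAsync(version, count: 2);

    // Run tests to make sure it's working before any traffic is routed
    if (!await SmokeTests.RunAsync(newInstances))
    {
        await Provisioner.DestroyAsync(newInstances);
        throw new InvalidOperationException($"Smoke tests failed for {version}");
    }

    // Route traffic to the new instances, then destroy the old ones
    var oldInstances = await LoadBalancer.GetActiveInstancesAsync();
    await LoadBalancer.RouteTrafficToAsync(newInstances);
    await Provisioner.DestroyAsync(oldInstances);
}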

Decouple systems

In order to be able to fearlessly deploy, you will have to decouple systems so that you can deploy small changes often, with small blast radius and short time to recover.

Organise your IT department around what you are supporting – yes, around business functions, but also around the systems that you maintain. If you have a monolith that supports all business functions, then you need to split it up. The output of different teams needs to be independently deployable. This is not easy, but it is the only way to enable teams to make predictable progress. Consider it an investment in predictable outcomes.

Maintain products, don’t run projects

When you have finally managed to put together teams that match your customers and your work, don’t disband them again after you have delivered a certain milestone and the “project is done”. Budget for products rather than projects, and see your own software as a driver to increase profits and reduce cost. Set targets for what you want to achieve and measure outcomes. Let developers prototype things and show the business what’s possible. If you have built the right platform for people, it will be possible to securely bring new ideas to life, test them with real clients and get true feedback. This way you can delight customers faster, and there is less risk that you get saddled with maintaining some Shadow IT piece of software that suddenly became core IT after it was successful and the consultant who wrote it moved on to greener pastures.

Goal

By introducing a wider interface between the business and IT, and creating teams that have the autonomy and platform support to iterate on features quickly and independently, you will delight the business, delight the customers, make more money and have happier teams. There will be times when you have to organise and coordinate, but the lion’s share of work can be done independently.

SQL Server and cloud

Back in the day

You would have one database connection string, and that would be it. “The Database” would live in the darkest corner of the office and you would try not to interact with the physical machine that hosted the relational database management system. The Windows Forms app would open a connection and keep it alive forever, with nothing to really threaten it. It would enable Multiple Active Result Sets because there was no grown-up way to query the database; queries were shoved down the shared connection seemingly at random depending on what the user was doing and what events forced data to be loaded. You would use the Windows credentials of the end user and specifically allow that user to access the server, which fit with the licensing model where the user paid for accessing SQL Server when buying Office products, whilst pure server deployments needed per-CPU-core licensing, which was hysterically expensive.

Cloud

The cloud revolution came and went whilst we in the Microsoft / Windows / .NET bubble never really heard about it, until Microsoft decided to try their hand at hosting cloud servers with the Windows Azure project. I genuinely think this was extremely useful for Windows, as the prospect of hosting cloud services was so far beyond the capabilities of the bare OS that it forced some stark introspection and engineering effort to overcome some of the worst designs in the kernel.

For .NET Framework, a lot was too late already, and many misfeatures weren’t fixed until the .NET Core reboot effort.

One of the things that did change for the average programmer was that database connections became ephemeral, and pooled. You could no longer hog one connection for all of your needs for all eternity – you created a new connection when you needed something and the framework would pool the connections for you, but you had to allow for connections taking longer to get established, and for the fact that they could be evicted at any point. Relatively quickly, database providers built in resilience so that you didn’t have to know or care, but in the early days even the happy-path Microsoft marketing slides – which usually never have error handling or security in them – had to feature retry mechanisms, or else people simply couldn’t successfully replicate the early samples.
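
The pattern that replaced the eternal connection is small. A sketch, assuming Microsoft.Data.SqlClient and a hypothetical Orders table: open late, close early, and let the pool reuse the underlying connection.

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static async Task<int> CountOrdersAsync(string connectionString)
{
    // "Creating" a connection is cheap because the pool does the heavy lifting
    await using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    await using var cmd = new SqlCommand("SELECT COUNT(*) FROM Orders", conn);
    return (int)await cmd.ExecuteScalarAsync();
}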

Security

As people started using the public cloud, they eventually figured out that security was a thing. Maybe we should not have overly powerful credentials lying around in config files on the webserver, ready to be exploited by any nefarious visitors? People eventually started using things like Azure KeyVault or AWS Secrets Manager. Sure, it’s bad if people break in and nick your stuff, but it’s worse if they also steal your car to carry away the loot.

Once people had all their secrets backed by hardened credentials services, features like autorotating credentials started becoming available. Since you are provided most of your authorisation by the general hosting environment anyway, and only really need credentials for systems that are incompatible with this approach, why don’t you just automatically periodically update the credentials?

Also – back in the day when the server lived back in the office, if somebody got a bit enthusiastic with the UPDATE statements and forgot to add a WHERE, you could send everybody out for coffee (read tea, dear UK readers) whilst the techiest person in the office would Restore to Point in Time and feel like a hero when the database stopped being in recovery mode and everything was back to the way it was before the Incident.

Today a “restore” means you get to create a new database that contains the world as you knew it, before the Incident. You then need to serially rename databases to achieve the effect of having restored the database. Not a big deal, just not what we used to call restore back in my day.
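
In practice the “serial rename” is a couple of statements against master. A sketch with made-up database names, assuming a point-in-time restore has already produced MyDb_restored and nothing is holding connections open:

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static async Task SwapInRestoredDatabaseAsync(string masterConnectionString)
{
    await using var conn = new SqlConnection(masterConnectionString);
    await conn.OpenAsync();
    // Park the damaged database under a new name, then give the restored copy the original name
    await using var cmd = new SqlCommand(@"
        ALTER DATABASE [MyDb] MODIFY NAME = [MyDb_before_incident];
        ALTER DATABASE [MyDb_restored] MODIFY NAME = [MyDb];", conn);
    await cmd.ExecuteNonQueryAsync();
}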

Back to retry mechanisms

For SQL Server running in Azure, you can now tell Microsoft.Data.SqlClient to connect to the database using the Managed Identity of the app, meaning you are completely free of faff – there are no credentials to store or rotate, as Azure AD manages your access automatically.
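
A sketch of what that looks like, assuming Microsoft.Data.SqlClient 2.1 or later and placeholder server and database names; the interesting bit is the Authentication keyword – no password appears anywhere, but the app’s managed identity must have been granted access to the database.

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static async Task<SqlConnection> OpenWithManagedIdentityAsync()
{
    var conn = new SqlConnection(
        "Server=tcp:my-server.database.windows.net,1433;" +
        "Database=my-db;" +
        "Authentication=Active Directory Managed Identity;");
    await conn.OpenAsync();
    return conn;
}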

With RDS on AWS you need to use legacy usernames and passwords unless you configure an Active Directory domain up there, but… no. These credentials can, however, be auto-rotated in AWS Secrets Manager. Because of the cost of talking to Secrets Manager, fetching them is not something you want to do on every request, as that piles up over a month.

One of those retry mechanisms from the early Azure days starts to make sense again, and it is easily implemented as a factory method you can call instead of the await using var cn = new SqlConnection(...) you have littered all throughout the code. I mean, you’ll still await using that factory method, but it can do the setting up of the connection and the validating of the credentials, and only spend the dosh fetching the latest from the vault if you get the error code for invalid credentials. This means your bespoke factory method replaces both new SqlConnection and OpenAsync().

Naïve retry mechanism, featuring goto:

// https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-ver15
// Error code that indicates invalid credentials, meaning we need to bite the bullet and fetch latest settings from Secrets Manager
private const int INVALID_CREDS_ERROR = 18456;
private const int INVALID_DB_NAME = 4060;

public async Task EnsureValidCredentialsAsync(SqlConnection conn, CancellationToken cancellationToken)
{
    var rdsCredential = GetRdsCredential(await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultDatabaseCredentialKey));
    var dbCatalog = await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultCatalogKey);
    int reconnectCount = 0;
reconnect:
    var connectionString = GetConnectionString(rdsCredential, dbCatalog);
    conn.ConnectionString = connectionString;
    try
    {
        await conn.OpenAsync(cancellationToken);
        conn.Close(); //  restore to known state
        return;
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_CREDS_ERROR)
    {
        // Credentials are incorrect, double check with secrets manager to see what's what - potentially creds have been rotated
        rdsCredential = await _secretsCache.GetSecretAsync<AwsSecretsManagerCredential>( _siteConfiguration.DefaultDatabaseCredentialKey );
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_DB_NAME)
    {
        // Database identifier is not valid, double check with secrets manager to see what's what (potentially restored db, deprecating old db name)
        dbCatalog =
            await _secretsCache.GetSecretAsync(_siteConfiguration.DefaultCatalogKey);
    }
    catch (SqlException sqlEx)
    {
        Log.Error(sqlEx, "Could not open default DB connection. Reattempting");
    }
    if (reconnectCount++ < 3) goto reconnect;
    // surrounding code expects an exception if the open fails 
    throw new InvalidOperationException("Could not open database connection");
}
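
The factory method that wraps it could look something like the sketch below, reusing the hypothetical pieces from the method above; call sites then do await using var conn = await _dbConnectionFactory.CreateOpenConnectionAsync(ct); instead of newing up connections themselves.

public async Task<SqlConnection> CreateOpenConnectionAsync(CancellationToken cancellationToken)
{
    var conn = new SqlConnection();
    // Validates (and if need be refreshes) the credentials, leaving the connection closed
    await EnsureValidCredentialsAsync(conn, cancellationToken);
    await conn.OpenAsync(cancellationToken);
    return conn;
}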

What about Entity Framework?

Looking at the way you officially configure Entity Framework, it seems you can’t get away from having a known connection string up front, which again isn’t a problem in Azure as discussed earlier, but for me, I only want to hand the app credentials that I know have been tested.

In my past life inviting speakers to Oredev I once invited Julie Lerman to speak about Entity Framework and DDD, so I pretend that I know her and that I can call upon her for help in matters of programming, and I sent out a tweet linking to a Hail Mary Stack Overflow question I had created where I asked how I would be able to dynamically handle connection retries in a similar way, or at least be able to call out and ask for credentials.

Surprisingly she had time to reply and pointed me to a video where she had addressed this subject, which taught me about something called DbConnectionInterceptors that had all the things I wanted. In addition, this also introduced me to the super elegant solution natively supported by Microsoft.Data.SqlClient for handling this situation in Azure, which I mentioned earlier.

Basically, I therefore created a class that inherits from DbConnectionInterceptor and overrides two methods, one sync and one async version, and calls something like the above function to test the connection EF Core is about to open. The async override looks like this, with the sync counterpart sketched after it:

public override async ValueTask<InterceptionResult> ConnectionOpeningAsync(DbConnection connection, ConnectionEventData eventData, InterceptionResult result,
    CancellationToken cancellationToken = new CancellationToken())
{
    var sqlConnection = (SqlConnection) connection;
    // try opening the connection, if it doesn't work - update its params - close it before returning to achieve same state
    await _dbConnectionFactory.EnsureValidCredentialsAsync(sqlConnection, cancellationToken);
    return await base.ConnectionOpeningAsync(connection, eventData, result, cancellationToken);
}     
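
The sync counterpart is essentially the same thing, except it has to block on the async credential check, which is exactly the sort of thing complained about a little further down; a sketch:

public override InterceptionResult ConnectionOpening(DbConnection connection, ConnectionEventData eventData, InterceptionResult result)
{
    var sqlConnection = (SqlConnection) connection;
    // No async path available here, so block on the same credential check
    _dbConnectionFactory.EnsureValidCredentialsAsync(sqlConnection, CancellationToken.None)
        .GetAwaiter().GetResult();
    return base.ConnectionOpening(connection, eventData, result);
}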

Of course – registering this interceptor is easy as well, after I had googled some more. There is a (synchronous) override that allows you access to an instantiated IServiceProvider as follows:

services.AddSingleton<ResilientRdsConnectionInterceptor>();
services.AddDbContext<ADbContext>((provider, options) =>
            {
                var dbConnectionFactory = provider.GetRequiredService<DbConnectionFactory>();
                var connectionString = dbConnectionFactory.GetDefaultConnectionString().GetAwaiter().GetResult(); // So sorry, but there is no async override
                options.AddInterceptors( provider.GetRequiredService<ResilientRdsConnectionInterceptor>());
                options.UseSqlServer(connectionString);
            });

An aside on async/sync. It seems Microsoft will have you rewrite every other line of your codebase to properly support async or your app will fall over. But when you then want to reuse your async code and not have to duplicate everything into similar-but-not-reusable sync and async versions, the tiny changes needed for MS to support async everywhere are all of a sudden “not needed”, like in configuration or DI. It’s laziness, I tell you.

Anyway I have integration tests in place that verify that the mechanism for calling out for updated credentials actually works. Provided the credentials in the secrets manager/key vault actually work, this works like a charm.


Why is everybody waterfalling “Agile”?

Just like that rebrand cycle years ago when RUP consultants transitioned over to scrum masters through staged re-titling on LinkedIn and liberal use of search/replace in their CVs, scaled agile frameworks and certified project managers attempt to apply the agile manifesto to large organisations by bringing in processes and tools to manage the individuals and interactions, comprehensive documentation of the working software, contract negotiation to manage the customer collaboration, and plans for how to respond to change. You start seeing concepts like the Agile Release Train, which are – well – absurd.

Why? Do they not see what they are doing? Are they Evil?

No – I think it’s simple – and really really hard, at the same time.

You cannot respond to change quickly if you have delays in the system. These delays could be things like manual regression testing due to lack of automated test coverage, insufficient or badly configured tooling around the release or having a test stack that is an inverted pyramid, where you rely on a big stack of UI tests to cover the entire feature set of the application because faster, lower level tests aren’t covering all the features and you have undeniable experience of users finding bugs for you.

Obviously, if these tests are all you have, you need to run them before releasing or you would alienate customers. If your software stack is highly coupled, it would be irresponsible not to coordinate carefully when making changes to shared components with less-than-stellar contract test coverage. You are stuck with this, and it is easy to just give up. The actual solution is to first shorten the time it takes from deciding you have all the features you want to release until the software is actually live. This means automating everything that isn’t automated (the release itself, the necessary tests, etc.), which could very well be a “let’s just stop developing features and focus all our attention on this until it is in place” type of investment – the gains are that great. After this initial push you need to make an investment in decoupling software components into bits that can be released independently. This can be done incrementally whilst normal business is progressing.

Once you have reached the minimum bar of being able to release whatever you want at any time you want and be confident that each change is small enough that you can roll them back in the unlikely event that the automated tests missed something, then you are in a position to talk about an agile process, because now teams are empowered and independent enough that you only need to coordinate in very special cases, where you can bring in ad hoc product and technical leadership, but in the day to day, product and engineering together will make very quick progress in independent teams without holding each other up.

When you can release small changes, you suddenly see the value in delivering features in smaller chunks behind feature flags: you can understand the value of making 20 small changes in trunk (main, for you zoomers) rather than one massive feature branch, as releases go live several times a day. There is also the benefit of your colleagues seeing your feature-flagged changes appear from beginning to end, so they can work with your massive refactor rather than be surprised when you open a 100-file PR at 16:45 on a Friday.

Auto-login after signup

If you have a website that uses Open ID Connect for login, you may want to allow the user to be logged in directly after having validated their e-mail address and having created their password.

If you are using IdentityServer 4 you may be confused by the hits you get on the interwebs. I was, so I shall – mostly for my own sake – write down what is what, should I stumble upon this again.

OIDC login flow primer

There are several Open ID Connect authentication flows depending on whether you are protecting an API, a mobile native app or a browser-based web app. Most flows basically work in such a way that you navigate to the site that you need to be logged in to access. It discovers that you aren’t logged in (most often, you don’t have the cookie set) and redirects you to its STS, IdentityServer4 in this case, and with this request it tells IdentityServer4 what site it is (client_id), the scopes it wants and how it wants to receive the tokens. IdentityServer4 will either just return the tokens (the user was already logged in elsewhere) or get the information it needs from the end user (username, password, biometrics, whatever you want to support), and eventually, if this authentication is successful, IdentityServer will return some tokens and the original website will happily set an authentication cookie and let you in.

The point is – you have to first go where you want; you can’t just navigate to the login screen, you need the context of having been redirected from the app you want to use for the login flow to work. As a sidenote, this means your end users can wreak havoc unto themselves with favourites/bookmarks capturing login context that has long expired.

Registration

You want to give users a simple on-boarding procedure, a few textboxes where they can type in email and password, or maybe invite people via e-mail and let them set up their password and then become logged in. How do we make that work with the above flows?

The canonical blog post on this topic seems to be this one: https://benfoster.io/blog/identity-server-post-registration-sign-in/. Although brilliant, it is only partially helpful as it covers IdentityServer3, and the newer one is a lot different. Based on ASP.NET Core, for instance.

  1. The core idea is sound – generate a cryptographically random one-time access code and map it to the user after the user has been created on the registration page (in IdentityServer4).
  2. Create an anonymous endpoint in a controller in one of the apps the user will be allowed to use; in it, ascertain that you have been sent one of those codes, then Challenge the OIDC authentication flow, adding this code as an AcrValue as the request goes back to IdentityServer4.
  3. Extend the authentication system to allow these temporary codes to log you in.

To address the IdentityServer3-ness, people have tried all over the internet; here is somebody who gets it sorted: https://stackoverflow.com/questions/51457213/identity-server-4-auto-login-after-registration-not-working

Concretely you need a few things – the function that creates OTACs, which you can lift from Ben Foster’s blog post. A sidenote: do remember that if you use a cooler password hashing algorithm you have to use special validators rather than rely on applying the hash to the same plaintext to validate. I.e., you need to fetch the hash from whatever storage you use and use the specific methods the library offers to validate that the hashes are equivalent.
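
If you don’t want to lift it, generating the code itself is a few lines on top of the cryptographic RNG; a sketch – how you persist it against the user and expire it is up to you:

using System;
using System.Security.Cryptography;

public static class Otac
{
    // 32 random bytes, Base64-encoded and made URL-safe so it survives a query string
    public static string Create()
    {
        var bytes = new byte[32];
        using var rng = RandomNumberGenerator.Create();
        rng.GetBytes(bytes);
        return Convert.ToBase64String(bytes)
            .Replace('+', '-')
            .Replace('/', '_')
            .TrimEnd('=');
    }
}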

After the OTAC is created, you need to redirect to a controller action in one of the protected websites, passing the OTAC along.

The next job is therefore to create the action.

        [AllowAnonymous]
        public async Task LogIn(string otac)
        {
            if (otac is null)
            {
                Response.Redirect("/Home/Index");
                return;
            }
            var properties = new AuthenticationProperties
            {
                Items = { new KeyValuePair<string, string>("otac", otac) },
                RedirectUri = Url.Action("Index", "Home", null, Request.Scheme)
            };

            await Request.HttpContext.ChallengeAsync(ClassLibrary.Middleware.AuthenticationScheme.Oidc, properties);
        } 

After stashing the OTAC in the AuthenticationProperties, it’s time to actually send the code over the wire, and to do that you need to intercept the calls when the authentication middleware is about to send the request over to IdentityServer. This is done where the call to AddOpenIdConnect happens (maybe yours is in Startup.cs?), where you get to configure options, among which are some event handlers.

OnRedirectToIdentityProvider = async n =>{
    n.ProtocolMessage.RedirectUri = redirectUri;
    if ((n.ProtocolMessage.RequestType == OpenIdConnectRequestType.Authentication) && n.Properties.Items.ContainsKey("otac"))
    {
        // Trying to autologin after registration
        n.ProtocolMessage.AcrValues = n.Properties.Items["otac"];
    }
    await Task.FromResult(0);
}

After this, you need to override the AuthorizeInteractionResponseGenerator, get the AcrValues from the request and – if successful – log the user in and respond accordingly. Register this class using services.AddAuthorizeInteractionResponseGenerator(); in Startup.cs.

Unfortunately, I was still mystified as to how to log things in in IdentityServer4, as I could not find a SignIn manager used widely in the source code, but then I found this Stack Overflow question:
https://stackoverflow.com/questions/56216001/login-after-signup-in-identity-server4, and it became clear that using an IHttpContextAccessor was “acceptable”.

    public override async Task<InteractionResponse> ProcessInteractionAsync(ValidatedAuthorizeRequest request, ConsentResponse consent = null)
    {
        var acrValues = request.GetAcrValues().ToList();
        var otac = acrValues.SingleOrDefault();

        if (otac != null && request.ClientId == "client")
        {
            var user = await _userStore.FindByOtac(otac, CancellationToken.None);

            if (user is object)
            {
                await _userStore.ClearOtac(user.Guid);
                var svr = new IdentityServerUser(user.SubjectId)
                {
                    AuthenticationTime = _clock.UtcNow.DateTime

                };
                var claimsPrincipal = svr.CreatePrincipal();
                request.Subject = claimsPrincipal;

                request.RemovePrompt();

                await _httpContextAccessor.HttpContext.SignInAsync(claimsPrincipal);

                return new InteractionResponse
                {
                    IsLogin = false,
                    IsConsent = false,
                };
            }
        }

        return await base.ProcessInteractionAsync(request, consent);
    }

Anyway, after ironing out the kinks the perceived inconvenience of the flow was greatly reduced. Happy coding!

WSL 2 in anger

I have previously written about the Windows Subsystem for Linux. As a recap, it comes in two flavours – one built on the concept of pico processes, marshalling the Linux ABI into Win32 API calls (WSL1), and an actual Linux kernel hosted in a lightweight Hyper-V installation (WSL2). Both types have file system integration and a fairly transparent command line interface to run Linux commands from Windows and Windows executables from the Linux command line. But, beyond the headline stuff, how does it work in real life?

Of course with WSL1 there are compatibility issues, but the biggest problem is horrifyingly slow Linux file system performance, because it is Windows NTFS pretending to be ext4. Since NTFS is slow on small files, you can imagine that an operating system whose main feature is being an immense collection of small files working together would run slowly on top of a filesystem with those characteristics.

With WSL2, obviously kernel compatibility is 100% as, well, it’s a Linux kernel, and the Linux file system stuff Just Works, as the file system is managed natively (although over a hypervisor). Ironically, though, the /mnt filesystems with the Windows drives mounted are prohibitively slow. It has been said to be a bug that has allegedly been fixed, but given that we are – at the end of the day – talking about accessing local PCIe gen 4 NVMe storage, managing to make file I/O this slow betrays plenty of room for improvement. To summarise – if you want to do Linuxy things in Linux under Windows, use WSL2; if you want to do Windowsy things in Linux under Windows, use WSL1. Do with that what you will. WSL2 being based on a proper VM means that, despite huge efforts, the networking story is not super smooth: no proper mechanism exists to make things easier for you, and no hits on Google will actually address the fundamental problem.

That is to say, I can run a website I have built in docker in WSL2, but I need to do a lot of digging to figure out what IP the site got, and do a lot of firewall stuff to be able to reach it. Also, running X Window with the excellent X410 server requires a lot of bespoke scripting because there is no way of setting up the networking to just work on start-up. You would seriously think that a sensible bridging default could have been brought in to make things a lot more palatable. After all, all I want to do is road test my .NET Core APIs and apps in docker before pushing them. Doesn’t seem too extreme of a use case.

To clarify – running or debugging a .NET Core Linux website from Visual Studio Code (with the WSL2 backend) works seamlessly, absolutely seamlessly. My only gripe is that, because of the networking issue, I cannot actually verify docker things in WSL2, which I surmised was the point of WSL2 vs WSL1.

Put your Swagger UI behind a login screen

I have tried to put a piece of API documentation behind interactive authentication. I have various methods that are available to users of different roles. In an attempt to be helpful I wanted to hide the API methods that you can’t access anyway. Of course when the user wants to call the methods for the purpose of trying them out, I use the well documented ways of hooking up Bearer token authentication in the Swashbuckle UI.

I thought this was a simple idea, but it seems to be a radical concept that was only used back in Framework days. After reading a bunch of almost relevant google hits, I finally went ahead and did a couple of things.

  1. Organise the pipeline so that Authentication happens before the UseSwaggerUI call in the pipeline.
  2. Hook up an operation filter to tag the operations that are valid for the current user by checking the Roles in the User ClaimsPrincipal.
  3. Hook up a document filter to filter out the non-tagged operations, and also clean up the tags or you’ll get duplicates – although further experimentation here too can yield results.
  4. Set up the API auth as if you are doing an interactive website, so you have the Open ID Connect middleware set up as the default challenge scheme, Cookie as the default scheme, and Bearer added as an additional scheme.
  5. Add the Bearer scheme to all API controllers (or use some other policy – the point is, you need to specify that the API controllers only accept Bearer auth). A rough sketch of steps 2 and 3 follows below.
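
A sketch of what steps 2 and 3 could look like with Swashbuckle.AspNetCore; the marker extension name is made up, the Authorize-attribute inspection is deliberately simplistic, and IHttpContextAccessor needs to be registered (services.AddHttpContextAccessor()). Register the filters inside AddSwaggerGen with options.OperationFilter<RoleOperationFilter>() and options.DocumentFilter<HideForbiddenOperationsFilter>().

using System.Linq;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Http;
using Microsoft.OpenApi.Any;
using Microsoft.OpenApi.Models;
using Swashbuckle.AspNetCore.SwaggerGen;

// Step 2: tag operations the current user is not allowed to call
public class RoleOperationFilter : IOperationFilter
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public RoleOperationFilter(IHttpContextAccessor httpContextAccessor) => _httpContextAccessor = httpContextAccessor;

    public void Apply(OpenApiOperation operation, OperationFilterContext context)
    {
        var user = _httpContextAccessor.HttpContext?.User;
        var requiredRoles = context.MethodInfo.GetCustomAttributes(true)
            .Concat(context.MethodInfo.DeclaringType.GetCustomAttributes(true))
            .OfType<AuthorizeAttribute>()
            .Where(a => !string.IsNullOrEmpty(a.Roles))
            .SelectMany(a => a.Roles.Split(','))
            .Select(r => r.Trim())
            .ToList();

        var allowed = requiredRoles.Count == 0 || (user != null && requiredRoles.Any(user.IsInRole));
        if (!allowed)
            operation.Extensions["x-hidden"] = new OpenApiBoolean(true); // marker for the document filter
    }
}

// Step 3: drop the tagged operations, and any paths left empty
public class HideForbiddenOperationsFilter : IDocumentFilter
{
    public void Apply(OpenApiDocument swaggerDoc, DocumentFilterContext context)
    {
        foreach (var (path, item) in swaggerDoc.Paths.ToList())
        {
            foreach (var op in item.Operations.Where(o => o.Value.Extensions.ContainsKey("x-hidden")).ToList())
                item.Operations.Remove(op.Key);

            if (!item.Operations.Any())
                swaggerDoc.Paths.Remove(path);
        }
    }
}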

AWS CLI and SDK local setup

Jeff Bezos may have tripled his fortunes in the last couple of months by price gouging hand sanitiser and bog roll during lockdown with 1000% markup, but some aspects of the empire are less well-functioning. Getting an install of the aws cli and AWSSDK.NET working on a new machine is one of those less stellar areas. As there is no acceptable documentation I shall write the process down now so that I can at least aid my memory for the next time.

Start at the beginning

  1. Install the V2 CLI on your local computer. Google it, the links will surely change.
  2. Go to https://console.aws.amazon.com/iam and create new credentials for a user that has the least privilege you can get away with.
  3. Add the credentials in a file called credentials, like so: ~/.aws/credentials.
  4. Add config in a file called ~/.aws/config and specify your favourite output format and region per profile
  5. In your friendly neighbourhood PowerShell window, type SETX AWS_PROFILE my-awesome-profile in order to assign a default profile.

For people without imagination I’ll show examples of what the files should look like. Here is an example of the credentials file – let’s hope I have remembered to recycle these credentials.

[default]
aws_access_key_id = AWHERESAFAK3K3YNAME
aws_secret_access_key = FKJqKj23kfj23kjl23l4j2f3l4jl2Kkl

[local-excellent-profile]
aws_access_key_id = AN0THERFAK3K3YNAME
aws_secret_access_key = FKJ/e34fegf4ER24Efj23kjl23l4j2f3l4jl2Kkl

Here is an example of the config file:

[default]
output = json
region = eu-west-2

[profile local-excellent-profile]
output = text
region = eu-west-2

Those are the steps. Feel free to validate your credentials by issuing aws cli commands, perhaps specifying --profile in order to select a different profile than your default one. Unless you failed to copy the access key correctly you should find that things are ticking along nicely and that you can access precisely the resources you’re supposed to be able to access.

Tangents, everybody loves a good tangent

So – at some point it came to pass that the way I used to fetch IdS signing certificates from AWS became a thought crime, so I had to change the way I do it by adding an additional parameter, essentially making the overload I’m calling to load a PFX into an X509Certificate2 class actually take a PFX and load it up without trying to shove it into a system store of any kind. It would give an exception, “keyset does not exist”, because that makes total sense. Anyway, the fix is to supply the parameter X509KeyStorageFlags.EphemeralKeySet along with the byte array and super secret password when constructing the X509Certificate2 object.
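
In code that is just the three-argument constructor; a sketch, with the byte array and password coming from wherever you keep your secrets:

using System.Security.Cryptography.X509Certificates;

// pfxBytes and pfxPassword come from your secrets store of choice.
// EphemeralKeySet keeps the private key in memory rather than persisting it
// to a key container, which is what triggered the "keyset does not exist" error.
var signingCertificate = new X509Certificate2(pfxBytes, pfxPassword, X509KeyStorageFlags.EphemeralKeySet);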

That’s it for today. Don’t forget to like and subscribe for more.

Windows Terminal

The old Windows command window, the “DOS prompt”, has been around since the beginnings of Windows NT, and is used when running batch scripts. It uses a model to describe its world that was probably fit for purpose in 1992 but has quickly become insufficient with the advent of Unicode and modern graphics. No modern graphics processing is used in the console host.

The process uses the Windows Console API, which is basically a Windows API that accepts text input and produces text output. It was seen as an improvement over the old-school pseudo terminal model used in the Linux world. The upshot has been that third-party applications (most famously ConEmu) have had to aggregate built-in command prompts off screen, send individual characters to them and then do their own rendering, actually providing the terminal UI.

After a few decades, Microsoft realised that this was unsustainable: they needed more than 16 colours in the terminal, they wanted Unicode and they needed to improve performance. Batch scripts run at different speeds depending on whether the window is minimised or not, because the actual rendering of characters is slow. It was not going to be possible to entice Linux users back to Windows with such an atrocious command line interface.

The modernisation has taken two forms. First they created the ConPTY interface, meaning Windows will provide a pseudo terminal interface to processes, so they just read from standard input and write to standard output like in DOS, Linux, UNIX – well, the rest of the world, basically.

The second improvement track has been creating a new terminal. They have forked the old console host software and added support for hardware acceleration, DirectX rendering, Unicode fonts, all kinds of colours and selectable opacity. The terminal itself is highly customisable and allows you to set up a multitude of named profiles; it allows you to split panes and configure what to launch in various panes when you open a new tab. A proper terminal, in other words.

Now there are loads of tweets and YouTube clips about this terminal, but I wanted to add my 2p here and emphasise that the important thing is not the transparency, the blurred background or the reaction GIF backgrounds; the cool thing is the performance and the fact that if you install this and use it you do not need another terminal. You may prefer another one because you don’t want to reconfigure what is already working, but I mean you don’t need another terminal. It works. It is fast and fluid. The very first preview was glitchy and artefacty, but now it looks good and is fast. It still needs to be configured via a JSON file, but I am glad they brought it to market this way, with the important bits working.

Database Integration Testing

Testing your SQL queries is as important as testing any other piece of logic. Unless you only do simple reads and writes, presumably some type of logic will be implemented at least in the form of a query, and you would like to validate that logic the same as any other.

Overview

For this you need database integration tests. There are multiple strategies for this (in-memory databases, additional abstractions and mocks, or creating a temporary but real database, just to name a few), but in this post I will discuss running a Linux SQL Server docker image, applying all migrations to it from scratch and then running tests on top of it.

Technology choice is beyond the scope of this text. I use .NET Core 3.1, XUnit and legacy C# because I know it already and because my F# is not idiomatic enough for me not to go on tangents and end up writing a monad tutorial instead. I have used MySQL / MariaDB before and I will never use it for anything I care about. I have tried Postgres and I like it; it is a proper database system, but again, I am not familiar enough with it for my purposes this time. To reiterate, this post is based on using C# on .NET Core 3.1 over MSSQL Server, and the tests will be run on push using GitHub Actions.

My development machine is really trying, OK, so let us cut it some slack. Anyway, I have Windows 10 insider something, with WSL2 and Docker Desktop for WSL2 on it. I run Ubuntu 20.04 in WSL2, dist-upgraded from 18.04. I develop the code in VS2019 Community on Windows, obviously.

Problem

This is simple: when a commit is made to the part of the repository that contains the DbUp SQL scripts, the related production code or these tests, I want to trigger tests that verify that my SQL migrations are valid, and when SQL queries change, I want those changes verified against a real database server.

I do not like Docker, especially docker-compose. It seems to me it has been designed by people who don't know what they are on about. Statistically that cannot be the case, since there are tens of thousands of docker-compose users who do magical things, but I have wasted enough time, so, like Seymour Skinner, I proclaim "no, it is the children that are wrong!", and I thus need to find another way of running an ad hoc SQL Server.

All the CI stuff and the production hosting of this system are Linux-based, but Visual Studio is stuck on Windows, so I need to be able to trigger these tests in a cross-platform way.

Clues

I found an article by Jeremy D Miller that describes how to use a .NET client for the Docker API to automatically run an MSSQL database server. I made some hacky mods:

internal class SqlServerContainer : IDockerServer
{
    public SqlServerContainer() : base("microsoft/mssql-server-linux:latest", "dev-mssql")
    {
        // My production code uses some custom types that Dapper needs
        // handlers for. Registering them here seems to work
        SqlMapper.AddTypeHandler(typeof(CustomType), CustomTypeHandler.Default);
    }

    public static readonly string ConnectionString = "Data Source=127.0.0.1,1436;User Id=sa;Password=AJ!JA!J1aj!JA!J;Timeout=5";

    // Gotta wait until the database is really available
    // or you'll get oddball test failures;)
    protected override async Task<bool> isReady()
    {
        try
        {
            using (var conn =
                new SqlConnection(ConnectionString))
            {
                await conn.OpenAsync();

                return true;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }

    // Watch the port mapping here to avoid port
    // contention w/ other Sql Server installations
    public override HostConfig ToHostConfig()
    {
        return new HostConfig
        {
            PortBindings = new Dictionary<string, IList<PortBinding>>
            {
                {
                    "1433/tcp",
                    new List<PortBinding>
                    {
                        new PortBinding
                        {
                            HostPort = "1436",
                            HostIP = "127.0.0.1"
                        }

                    }
                }
            },

        };
    }

    public override Config ToConfig()
    {
        return new Config
        {
            Env = new List<string> { "ACCEPT_EULA=Y", "SA_PASSWORD=AJ!JA!J1aj!JA!J", "MSSQL_PID=Developer" }
        };
    }

    public async static Task RebuildSchema(IDatabaseSchemaEnforcer enforcer, string databaseName)
    {
        using (var conn = new SqlConnection($"{ConnectionString};Initial Catalog=master"))
        {
            await conn.ExecuteAsync($@"
                IF DB_ID('{databaseName}') IS NOT NULL
                BEGIN
                    DROP DATABASE {databaseName}
                END
            ");
        }
        await enforcer.EnsureSchema($"{ConnectionString};Initial Catalog={databaseName}");
    }
}

I then cheated by reading the documentation for DbUp and combined the SQL Server schema creation with the Docker image starting code to produce the witch's brew below.

internal class APISchemaEnforcer : IDatabaseSchemaEnforcer
{
    private readonly IMessageSink _diagnosticMessageSink;

    public APISchemaEnforcer(IMessageSink diagnosticMessageSink)
    {
        _diagnosticMessageSink = diagnosticMessageSink;
    }

    public Task EnsureSchema(string connectionString)
    {
        EnsureDatabase.For.SqlDatabase(connectionString);
        var upgrader =
            DeployChanges.To
                .SqlDatabase(connectionString)
                .WithScriptsEmbeddedInAssembly(Assembly.GetAssembly(typeof(API.DbUp.Program)))
                .JournalTo(new NullJournal())
                .LogTo(new DiagnosticSinkLogger(_diagnosticMessageSink))
                .Build();
        var result = upgrader.PerformUpgrade();
        if (!result.Successful)
        {
            // Surface failed migrations as a test failure instead of silently ignoring the result
            throw result.Error;
        }
        return Task.CompletedTask;
    }
}

When DbUp runs, it will output every script it executes to the console, so we need to make sure this type of information actually ends up being logged, despite it being diagnostic. There are two parts to this: first, we need to use an IMessageSink to write diagnostic logs from DbUp so that XUnit becomes aware of the information, and secondly we must add a configuration file to the integration test project so that XUnit chooses to print the messages to the console.

Our message sink diagnostic logger is plumbed into DbUp as you can see in the previous example, and here is the implementation:

internal class DiagnosticSinkLogger : IUpgradeLog
{
    private readonly IMessageSink _diagnosticMessageSink;

    public DiagnosticSinkLogger(IMessageSink diagnosticMessageSink)
    {
        _diagnosticMessageSink = diagnosticMessageSink;
    }

    public void WriteError(string format, params object[] args)
    {
        var message = new DiagnosticMessage(format, args);
        _diagnosticMessageSink.OnMessage(message);
    }

    public void WriteInformation(string format, params object[] args)
    {
        var message = new DiagnosticMessage(format, args);
        _diagnosticMessageSink.OnMessage(message);
    }

    public void WriteWarning(string format, params object[] args)
    {
        var message = new DiagnosticMessage(format, args);
        _diagnosticMessageSink.OnMessage(message);
    }
}

Telling XUnit to print diagnostic information is done through a file in the root of the integration test project called xunit.runner.json, and it needs to look like this:

{
  "$schema": "https://xunit.net/schema/current/xunit.runner.schema.json",
  "diagnosticMessages": true
}

If you started out with Jeremy's example and have followed along, applying my tiny changes, you may or may not be up and running by now. I had an additional problem: developing on Windows while running CI on Linux. I solved this with another well-judged hack:

public abstract class IntegrationFixture : IAsyncLifetime
{
    private readonly IDockerClient _client;
    private readonly SqlServerContainer _container;

    public IntegrationFixture()
    {
        _client = new DockerClientConfiguration(GetEndpoint()).CreateClient();
        _container = new SqlServerContainer();
    }

    private Uri GetEndpoint()
    {
        return RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
            ? new Uri("tcp://localhost:2375")
            : new Uri("unix:///var/run/docker.sock");
    }

    public async Task DisposeAsync()
    {
        await _container.Stop(_client);
    }

    protected string GetConnectionString() => $"{SqlServerContainer.ConnectionString};Initial Catalog={DatabaseName}";
        
    protected abstract IDatabaseSchemaEnforcer SchemaEnforcer { get; }
    protected abstract string DatabaseName { get; }

    public async Task InitializeAsync()
    {
        await _container.Start(_client);
        await SqlServerContainer.RebuildSchema(SchemaEnforcer, DatabaseName);
    }

    public SqlConnection GetConnection() => new SqlConnection(GetConnectionString());
}

The point is basically: if you are executing on Linux, use the Unix socket, but if you are stuck on Windows, try TCP. Note that on Windows this relies on the Docker daemon being exposed on tcp://localhost:2375, which Docker Desktop can be configured to do in its settings.
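To show how the pieces might hang together, here is a sketch of a concrete fixture and a test class. The ApiIntegrationFixture and CustomerTests names and the Customer table are assumptions of mine for illustration; XUnit can inject the IMessageSink into the fixture constructor and the ITestOutputHelper into the test class constructor:

public class ApiIntegrationFixture : IntegrationFixture
{
    // XUnit supplies the IMessageSink, which is what lets the DbUp output
    // reach the diagnostic log described earlier
    public ApiIntegrationFixture(IMessageSink diagnosticMessageSink)
    {
        SchemaEnforcer = new APISchemaEnforcer(diagnosticMessageSink);
    }

    protected override IDatabaseSchemaEnforcer SchemaEnforcer { get; }
    protected override string DatabaseName => "ApiTests";
}

public class CustomerTests : IClassFixture<ApiIntegrationFixture>
{
    private readonly ApiIntegrationFixture _fixture;
    private readonly ITestOutputHelper _output;

    public CustomerTests(ApiIntegrationFixture fixture, ITestOutputHelper output)
    {
        _fixture = fixture;
        _output = output;
    }

    [Fact]
    public async Task Customer_table_exists_after_migrations()
    {
        // If the migrations did not produce the schema, this query throws and the test fails
        using var conn = _fixture.GetConnection();
        var count = await conn.ExecuteScalarAsync<int>("SELECT COUNT(*) FROM Customer");
        Assert.True(count >= 0);
    }
}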

Github Action

After having a single test (to my surprise) actually pass locally after creating the entire database from scratch, I thought it was time to think about the CI portion of this adventure. I had no idea whether GitHub Actions would allow me to just pull down Docker images, but I thought "probably not". Still, I created the YAML, because nobody likes a coward:

# This is a basic workflow to help you get started with Actions

name: API Database tests

# Controls when the action will run. Triggers the workflow on push or pull request
# events but only for the master branch
on:
  push:
    branches: [ master ]
    paths: 
      - '.github/workflows/thisaction.yml'
      - 'test/API.DbUp.Tests/*'
      - 'src/API.DbUp/*'
      - 'src/API/*'
  pull_request:
    branches: [ master ]
    paths: 
      - '.github/workflows/thisaction.yml'
      - 'test/API.DbUp.Tests/*'
      - 'src/API.DbUp/*'
      - 'src/API/*'

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "test"
  test:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
    # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
    - uses: actions/checkout@v2

    # Runs a single command using the runner's shell
    - name: Run .NET Core CLI tests
      run: |
        echo Run tests based on docker. Bet u twenty quid this will fail
        dotnet test test/API.DbUp.Tests/API.DbUp.Tests.csproj

You can tell, from the echo line in the workflow above, the level of surprise and elation I felt when, after I committed and pushed, GitHub chugged through it, downloaded the MSSQL Docker image, recreated my schema, ran the test and returned a success message. I am still in shock.

So what now?

Like Jeremy discusses in his post, the problem with database integration tests is that you want to get a lot of assertions out of each database you create, because creating it is so expensive. In order to do so, and to procrastinate a little, I created a nifty little piece of code to keep track of the test data I create in each test, so that I can run tests independently of each other and clean up almost automatically, using a Stack<T>.

I created little helper functions that would create domain objects when setting up tests. Each test would, at the beginning, create a Stack<RevertAction> and pass it into each helper function while setting up the test, and each helper function would push a new RevertAction($"DELETE FROM ThingA WHERE id = {IDofThingAIJustCreated}") onto that stack. At the end of each test, I would invoke the Revert extension method on the stack and pass it some context so that it can access the test database and output test logging if necessary. A sketch of what this can look like follows after the Revert method below.

public class RevertAction
{
    private readonly string _sqlCommandText;

    public RevertAction(string sqlCommandText)
    {
        _sqlCommandText = sqlCommandText;
    }

    public async Task Execute(IntegrationFixture fixture, ITestOutputHelper output)
    {
        using var conn = fixture.GetConnection();
        try
        {
            await conn.ExecuteAsync(_sqlCommandText);
        }
        catch(Exception ex)
        {
            output.WriteLine($"Revert action failed: {_sqlCommandText}");
            output.WriteLine($"Exception: {ex.Message}");
            output.WriteLine($"{ex.ToString()}");
            throw;
        }

    }
}

The revert method is extremely simple:

public static class StackExtensions
{
    public static async Task Revert(this Stack<RevertAction> actions, IntegrationFixture fixture, ITestOutputHelper output)
    {
        while (actions.Any())
        {
            var action = actions.Pop();
            await action.Execute(fixture, output);
        }
    }
}
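To round this off, here is a sketch of what a test using these helpers could look like, continuing the hypothetical CustomerTests class from the fixture sketch earlier. The CreateCustomer helper and the Customer table are, again, made up for illustration:

[Fact]
public async Task Newly_created_customer_can_be_read_back()
{
    var revertActions = new Stack<RevertAction>();
    try
    {
        var customerId = await CreateCustomer(revertActions, "Acme Ltd");

        using var conn = _fixture.GetConnection();
        var name = await conn.ExecuteScalarAsync<string>(
            "SELECT Name FROM Customer WHERE Id = @Id", new { Id = customerId });

        Assert.Equal("Acme Ltd", name);
    }
    finally
    {
        // Undo whatever the helpers created, most recent first
        await revertActions.Revert(_fixture, _output);
    }
}

private async Task<int> CreateCustomer(Stack<RevertAction> revertActions, string name)
{
    using var conn = _fixture.GetConnection();
    var id = await conn.ExecuteScalarAsync<int>(
        "INSERT INTO Customer (Name) OUTPUT INSERTED.Id VALUES (@Name)",
        new { Name = name });
    revertActions.Push(new RevertAction($"DELETE FROM Customer WHERE Id = {id}"));
    return id;
}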

So, that was it. The things I put in this blog post were the hardest bits for me to figure out; the rest is just a question of maintaining database integration tests, and that is very implementation-specific, so I leave that up to you.