<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Rion.IO]]></title><description><![CDATA[Rion Williams' blog on software engineering, technology, and more.]]></description><link>https://rionghost.azurewebsites.net/</link><generator>Ghost 0.8</generator><lastBuildDate>Tue, 07 Apr 2026 07:07:36 GMT</lastBuildDate><atom:link href="https://rionghost.azurewebsites.net/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[A Distributed Tracing Adventure in Apache Beam]]></title><description><![CDATA[<p>Distributed systems are hard, and things can often get much more difficult when problems arise. This is only exacerbated by the fact that many of these systems can be notoriously difficult to dig into when they are actually out in the wild and not just running "on your machine".</p>

<p><strong>They say that a picture is worth a thousand words, but in the world of distributed systems, a picture can easily be worth a thousand hours.</strong></p>]]></description><link>https://rionghost.azurewebsites.net/2020/07/04/a-distributed-tracing-adventure-in-apache-beam/</link><guid isPermaLink="false">b8dd6183-bb8e-4c7c-ba2b-c2f21a48c7fc</guid><category><![CDATA[kafka]]></category><category><![CDATA[apache beam]]></category><category><![CDATA[kotlin]]></category><category><![CDATA[jaeger]]></category><category><![CDATA[logging]]></category><category><![CDATA[tracing]]></category><category><![CDATA[opentracing]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sat, 04 Jul 2020 17:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/07/battle-of-tracing.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/07/battle-of-tracing.jpg" alt="A Distributed Tracing Adventure in Apache Beam"><p>Distributed systems are hard, and things can often get much more difficult when problems arise. This is only exacerbated by the fact that many of these systems can be notoriously difficult to dig into when they are actually out in the wild and not just running "on your machine".</p>

<p><strong>They say that a picture is worth a thousand words, but in the world of distributed systems, a picture can easily be worth a thousand hours.</strong> While I can't promise you that this post will in any way save you a thousand hours, I hope that you find value in the thought process that I explored when introducing tracing and visibility into an Apache Beam pipeline.</p>

<h2 id="whatisyourquestwhatdoyouseek">What is Your Quest? What Do You Seek?  </h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/07/quest.jpg" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>Before embarking on this journey, it was important to establish a few reasons why tracing is important, both in a development and production sense, as well as what the overarching goal of introducing it would be.</p>

<p>These are the few that I came up with:</p>

<ul>
<li><p><strong>Improve Development Story / Observability</strong> - Developing and local testing can be challenging when working with <a href="https://beam.apache.org/">distributed systems like Apache Beam</a> and in particular streaming. Because of this, it can be difficult to debug and examine the data that is coming through your streaming pipeline and determine if it matches the expected outputs at various points. Additionally, the ability to trace an error to a given section of code can be invaluable.</p></li>
<li><p><strong>Provide Production Value</strong> - While you wouldn't want to actually trace every single request through your pipeline in a production environment, you could enable sampling to ensure that your production workflows are working as intended and in states where you find inconsistent results, a trace can be a valuable tool to help investigate.</p></li>
<li><p><strong>Ubiquitous Tracing</strong> - While the story itself may focus on tracing within a distributed streaming infrastructure, when done properly, it can extend outside of a streaming pipeline and provide an end-to-end tracing story from when a given element was introduced to the system and all of the actions that were performed relative to the element via the <a href="https://opentelemetry.io/">OpenTelemetry</a> standards.</p></li>
</ul>

<h2 id="chooseyourownadventure">Choose Your Own Adventure  </h2>

<p>Approaching the problem given my previous experience with Kafka and still (at the time) being relatively new to working with Beam, the following three approaches came to mind:</p>

<ul>
<li><p><strong>The "Kafka" Approach</strong> - In Kafka, all messages that flow through the system contain a series of headers similar to those in HTTP Requests. In a tracing scenario, you would have an opportunity to inject a correlation id within the headers to persist the trace throughout the course of the pipeline. Even after a message lands within another topic, the trace would still be persisted and could be picked up further down the pipeline by simply extracting the trace from the header at any point.</p></li>
<li><p><strong>The "Wrapper" Approach</strong> - Apache Beam has no notion of headers similar to how Kafka handles storing the tracing identifier, which can make persisting the trace challenging. As a result, one approach can be to create a "wrapper" for each of the elements within your pipeline such as a <code>TracingElement</code> which will just wrap an existing element and contain the key-value pairs for the record as well as the tracing id. The downside of this approach is that it requires an adjustment to all of the entities and transforms throughout your system to look within the wrapper each time.</p></li>
<li><p><strong>The "Data" Approach</strong> - As mentioned in the previous point, since Apache Beam has no semblance of a ubiquitous external storage at the record level, another option is simply to add an additional property to all entities / elements within the pipeline that denotes the tracing identifier. Storing this data on the record itself will also easily allow the trace to be persisted into other technologies and will require no changes to the overall pipeline itself (as the records will be unchanged save for the property related to tracing).</p></li>
</ul>

<p><strong>After some exploration, we found that the approach with the least overhead was simply adjusting the records themselves such that each record could be responsible for persisting its own trace (aka the "Data" Approach).</strong> </p>

<p>The Wrapper approach had significant overhead with regard to coder issues after transforms and added another layer of complexity when trying to retrieve the elements to operate on. The Kafka approach lent itself too heavily to Kafka and made transformation difficult; it was also inefficient, since it persisted Kafka-specific information throughout the process (e.g. topic names, partitions, etc.)</p>

<h2 id="taketheseyoullneedthem">Take These, You’ll Need Them!  </h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/07/dangerous.jpg" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>With the hope of following open standards like those defined by OpenTracing, I figured it best to explain a bit about what goes into a trace. These terms come up frequently when discussing tracing and the frameworks that handle it, so it couldn't hurt to cover them before we embark into the code.</p>

<ul>
<li><p><strong>Span</strong> - A single building block representing an operation or some unit of work that you want to capture. They are capable of standing on their own, referencing (or following) from other spans, storing metadata, tags, etc.</p></li>
<li><p><strong>Trace</strong> - A trace is a visualization of the life of a request (or series of other operations) that is made up of one or more spans. This collection of spans work together to paint a picture of the overall request, allowing you to easily reconstruct what happened within each span, etc.</p></li>
<li><p><strong>SpanContext</strong> - This is a wrapper of key-value pairs that associates a trace to one or more spans and is the key ingredient when carrying traces across data boundaries (different transforms, systems, etc.). This is the primary component that we store and work with in the context of a distributed system.</p></li>
</ul>

<h2 id="followthemap">Follow the Map  </h2>

<p>As mentioned earlier, a picture can say a thousand words, so it’s probably worth providing a very rudimentary example of how these pieces compose, pairing a timeline of actual actions and operations in the system with the series of traces they produce:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/06/tracing-and-operations.png" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>If we look at the diagram above, we can see how a given series of operations within an Apache Beam pipeline can parallel with building a trace to allow visibility into the pipeline. From the first encounter, a context will be created for the trace and spans will be associated with the context as it travels through the pipeline. It'll provide opportunities to tag searchable properties, output exceptions and logs, and much, much more. </p>

<h2 id="theadventurebeginsusingandbuildingatrace">The Adventure Begins (Using and Building a Trace)  </h2>

<p>There are four components necessary to initialize or create a trace/span within Beam, which this section will cover:</p>

<ul>
<li><p><strong>Context</strong> - You need some type of context, which is typically just a HashMap of Key-Value pairs that is used to store the tracing information and information about the span context. This can be done in a variety of ways, but the simplest can be just to add a property for it on one of your objects.</p></li>
<li><p><strong>Tracing Configuration</strong> - If you are planning on pushing the traces to be consumed through a service such as <a href="https://www.jaegertracing.io/">Jaeger</a>, you'll need to have the appropriate configuration added to your pipeline to resolve the tracer and send the traces off.</p></li>
<li><p><strong>Resolving the Tracer</strong> - Once you have the tracer configured, the next step is to resolve it within the individual element-wise transformations that are part of your pipeline. You'll need a reference (a static one) to the tracer in order to properly send off traces.</p></li>
<li><p><strong>Building a Trace</strong> - After resolving the tracer, you can easily initialize and build a trace to send off to Jaeger within your function and add the appropriate tags, logs, etc.</p></li>
</ul>

<h3 id="definingacontext">Defining a Context</h3>

<p>As mentioned in the previous section, a span context can come in a variety of forms (such as a <code>byte[]</code>, map, HTTP Headers, etc.). If you want to perform tracing at the element level, you'll want to ensure that your specific class or element has something defined to store it:</p>

<pre><code class="language-kotlin">public class TraceablePerson {  
    // Other properties omitted for brevity

    // Define a publicly accessible tracing context

    public val tracingContext = mutableMapOf&lt;String, String&gt;()
}
</code></pre>

<p>Likewise, if you were defining an Avro schema, this context might be defined as follows:</p>

<pre><code class="language-json">{
  "name": "tracing_context",
  "type": {
      "type": "map",
      "values": "string"
  },
  "default": {}
}
</code></pre>

<h3 id="configuringatracer">Configuring a Tracer</h3>

<p>Configuring the tracer is a requirement if you want to start sending your traces to Jaeger or another service that handles distributed tracing via <a href="https://opentracing.io/">the OpenTracing standard</a>. Thankfully, it's quite easy to configure via a custom <code>TracingOptions</code> class that your overall Apache Beam pipeline can inherit from:</p>

<pre><code class="language-kotlin">interface TracingOptions: PipelineOptions {  
    @get:Description("The tracing application name")
    @get:Default.String("your_application_name")
    var tracingApplicationName: String

    @get:Description("The tracing host name")
    @get:Default.String("localhost")
    var tracingHost: String

    @get:Description("The tracing port")
    @get:Default.Integer(6831)
    var tracingPort: Int
} 
</code></pre>

<p>This allows the configuration to be driven from command-line arguments, files, or environment variables. Next, you'll want to make sure that your overall pipeline options interface inherits from these so they are accessible via the <code>pipelineOptions</code> property within your transforms:</p>

<pre><code class="language-kotlin">// Define a pipeline configuration that is traceable and interacts with Kafka
// (this is just an example, your mileage may vary)
public interface YourPipelineOptions : TracingOptions, KafkaOptions {  
    // Other pipeline specific configurations here
}
</code></pre>
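<p>These options can then be supplied at launch time. Below is a minimal sketch of wiring them up from the command line, abbreviated to a single option and assuming Beam's standard <code>PipelineOptionsFactory</code> (the <code>LaunchOptions</code> name is illustrative):</p>

<pre><code class="language-kotlin">import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.Default
import org.apache.beam.sdk.options.Description
import org.apache.beam.sdk.options.PipelineOptions
import org.apache.beam.sdk.options.PipelineOptionsFactory

// Abbreviated options interface (see TracingOptions above for the full set)
interface LaunchOptions : PipelineOptions {
    @get:Description("The tracing host name")
    @get:Default.String("localhost")
    var tracingHost: String
}

fun main(args: Array&lt;String&gt;) {
    // e.g. invoked with: --tracingHost=jaeger.internal
    val options = PipelineOptionsFactory
        .fromArgs(*args)
        .withValidation()
        .`as`(LaunchOptions::class.java)

    val pipeline = Pipeline.create(options)
    // ... apply your transforms here ...
    pipeline.run().waitUntilFinish()
}
</code></pre>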

<h3 id="usingatracer">Using a Tracer</h3>

<p>After your individual elements and tracing have been configured, you are ready to build your first trace. Since tracing is done at the element level, you'll only be able to interact with the tracer at the <code>DoFn</code> level within Apache Beam. As such, there are two ways to handle this: you can either explicitly initialize the tracer during the <code>@StartBundle</code> operation of a given transform, as seen below, or construct a reusable base class as we'll see shortly:</p>

<pre><code class="language-kotlin">class SomeTraceableFunction() : DoFn&lt;KV&lt;...&gt;, KV&lt;...&gt;&gt;() {  
    private lateinit var tracer: Tracer

    @StartBundle
    fun initializeTracing(context: StartBundleContext) {
        // Resolve the tracer if configured from the pipeline options
        val tracingOptions = context.pipelineOptions.`as`(TracingOptions::class.java)

        if (tracingOptions != null) {
            tracer = TracingConfiguration.getTracer(
                tracingOptions.tracingApplicationName,
                tracingOptions.tracingHost,
                tracingOptions.tracingPort
            )
        } else {
            // If no tracing configuration was found, use a no-op tracer
            tracer = NoopTracerFactory.create()
        }
    }

    @ProcessElement
    fun processElement(@Element element: KV&lt;...&gt;) {
        // Omitted for brevity
    }
}
</code></pre>
</code></pre>

<p>Let's take a deeper look at <code>TracingConfiguration</code>, a simple wrapper class that creates our tracer using a specified configuration, which you can tailor to suit your needs:</p>

<pre><code class="language-kotlin">open class TracingConfiguration {  
    companion object {
        fun getTracer(application: String, host: String, port: Int): Tracer {
            return io.jaegertracing.Configuration
                .fromEnv(application)
                .withSampler(
                    io.jaegertracing.Configuration.SamplerConfiguration
                        .fromEnv()
                        .withType(ConstSampler.TYPE)
                        .withParam(1)
                )
                .withReporter(
                    io.jaegertracing.Configuration.ReporterConfiguration
                        .fromEnv()
                        .withLogSpans(true)
                        .withFlushInterval(1000)
                        .withMaxQueueSize(10000)
                        .withSender(
                            io.jaegertracing.Configuration.SenderConfiguration
                                .fromEnv()
                                .withAgentHost(host)
                                .withAgentPort(port)
                        )
                )
                .tracer
        }
    }
}
</code></pre>

<p>This static reference will give you access to the tracer that will be used to build your traces and send them to Jaeger (or do nothing if you haven't configured it). <strong>If you plan on doing any decent amount of tracing, you'll likely find it beneficial to construct your own <code>TraceableDoFn</code> to handle this:</strong></p>

<pre><code class="language-kotlin">open class TraceableDoFn&lt;InputT, OutputT&gt; : DoFn&lt;InputT, OutputT&gt;() {  
    public lateinit var tracer: Tracer

    @StartBundle
    fun initializeTracing(context: StartBundleContext) {
        // Resolve the appropriate tracer if configured
        val tracingOptions = context.pipelineOptions.`as`(TracingOptions::class.java)
        if (tracingOptions != null) {
            tracer = TracingConfiguration.getTracer(
                tracingOptions.tracingApplicationName,
                tracingOptions.tracingHost,
                tracingOptions.tracingPort
            )
        } else {
            tracer = NoopTracerFactory.create()
        }
    }
}
</code></pre>

<p>This will give you a publicly accessible tracer instance within any subclass of the <code>TraceableDoFn</code>, which we'll put to use in the next section.</p>

<h3 id="constructingatrace">Constructing a Trace</h3>

<p>As we discussed earlier in this post, we would be adopting an element-wise tracing context that could follow each individual message as it flowed through the pipeline:</p>

<pre><code class="language-kotlin">public val tracingContext = mutableMapOf&lt;String, String&gt;()  
</code></pre>

<p>Creating a trace can be somewhat involved; it might typically look like this:</p>

<pre><code class="language-kotlin">fun trace(context: MutableMap&lt;String, String&gt;, name: String, tracer: Tracer){  
    // Create a builder for this span
    val spanBuilder = tracer.buildSpan(name)

    // If we have some type of previous context, we need this to associate them
    if (context.isNotEmpty()) {
        // If so, indicate this is a continuation from the previous context
        val existingSpan = tracer.extract(TEXT_MAP, TracingContextExtractor(context))
        spanBuilder.addReference(References.FOLLOWS_FROM, existingSpan)
    }

    // Start the context
    val span = spanBuilder.start()
    try {
        // Activate this span and update the context
        tracer.scopeManager().activate(span)
        tracer.inject(span.context(), TEXT_MAP, TracingContextInjector(context))

        // Add tracing information here
        span
            .setTag("some-tag", "some-value")
            .log("log some message")

    } catch (ex: Exception) {
        Tags.ERROR.set(span, true)
        span.log("$ex");
    } finally {
        span.finish()
    }
}
</code></pre>

<p>As you might imagine, that can be <em>a lot</em>, so we can create some extension methods to simplify this into two functions: one to initialize a trace and another to diverge an existing trace:</p>

<pre><code class="language-kotlin">// Initializes a new trace/span
fun Tracer.trace(context: MutableMap&lt;String, String&gt;, name: String, traceFunction: (span: Span) -&gt; Unit) {  
    // Create a builder for this span
    val spanBuilder = this.buildSpan(name)

    // If we have some type of previous context, we need this to associate them
    if (context.isNotEmpty()) {
        // If so, indicate this is a continuation from the previous context
        val existingSpan = this.extract(TEXT_MAP, TracingContextExtractor(context))
        spanBuilder.addReference(References.FOLLOWS_FROM, existingSpan)
    }

    // Start the context
    val span = spanBuilder.start()
    try {
        // Activate this span and update the context
        this@trace.scopeManager().activate(span)
        this@trace.inject(span.context(), TEXT_MAP, TracingContextInjector(context))

        // Apply any internal tracing
        traceFunction(span)
    } catch (ex: Exception) {
        Tags.ERROR.set(span, true)
        span.log("$ex");
    } finally {
        span.finish()
    }
}

// Creates a new span that follows from an existing one
fun Tracer.follows(  
    context: MutableMap&lt;String, String&gt;,
    name: String,
    traceFunction: (span: Span) -&gt; Unit
): MutableMap&lt;String, String&gt; {
    // Create a copy of the context if one exists
    val contextualCopy = HashMap(context)

    // Create a builder for this span
    val spanBuilder = this.buildSpan(name)

    // If we have some type of previous context, we need this to associate them
    if (context.isNotEmpty()) {
        // If so, indicate this is a continuation from the previous context
        val existingSpan = this.extract(TEXT_MAP, TracingContextExtractor(context))
        spanBuilder.addReference(References.FOLLOWS_FROM, existingSpan)
    }

    // Start the context
    val span = spanBuilder.start()
    try {
        // Activate this span and update the context
        this@follows.scopeManager().activate(span)
        this@follows.inject(span.context(), TEXT_MAP, TracingContextInjector(contextualCopy))

        // Apply any internal tracing
        traceFunction(span)
    } catch (ex: Exception) {
        Tags.ERROR.set(span, true)
        span.log("$ex");
    } finally {
        span.finish()
    }

    // If we are not explicitly overwriting, we want to be 
    // able to capture the underlying context
    return contextualCopy
}
</code></pre>

<p>After you've established your tracer within your appropriate element-wise transform, you can use the <code>trace()</code> method to build and start your trace as seen below leveraging those extension methods:</p>

<pre><code class="language-kotlin">fun processElement(@Element element: KV&lt;...&gt;) {  
     // Omitted for brevity

     // Create a span (which will create a new trace behind the scenes)
     // that applies contextually to this specific element (via tracingContext)
     tracer.trace(element.tracingContext, "name_of_span") { span -&gt;
         // In here you can perform any operations that you might care about
         // and use the span reference to add tagging, logging, etc. as seen
         // below
         span
             .setTag("some_property", element.someProperty)
             .log("Log some message about the element here")
     }
}
</code></pre>

<p>Behind the scenes here, the following is happening:</p>

<ul>
<li>The element context is examined to determine if any previous spans exist.</li>
<li>If a span did exist, a <code>FOLLOWS_FROM</code> attribute is added in order to relate this operation to the chain of other potential spans for this element.</li>
<li>If a span did not exist, a new span is generated and injected into the context.</li>
<li>The <code>trace()</code> call itself completes at the closing bracket, which finishes the span and commits it to the appropriate tracer.</li>
<li>Any errors within the body of the <code>trace()</code> function will be properly decorated as errors and the log within the span/trace will contain the complete stack trace for the error.</li>
<li>Upon the finalization of a trace, it is committed to Jaeger (or your preferred/configured tracing system) and it should appear within the UI for those tools. This is performed within the <code>trace()</code> call automatically, so you don't need to worry about it yourself.</li>
</ul>

<h3 id="supportfordivergenttraces">Support for Divergent Traces</h3>

<p>Pipelines are seldom linear. Complex ones frequently branch, diverge, and split off onto multiple paths, so tracing needs to support such operations, and thankfully this approach does.</p>

<p>Let's say you have a single event coming into your system that you want to trace. As we saw in the earlier example, we can easily do this via the <code>trace()</code> extension we showed off in the previous step:</p>

<pre><code>// Start a trace for your event
tracer.trace(event.tracingContext, "name_of_span") { ... }  
</code></pre>

<p>This will establish your trace and expose it up to <a href="https://www.jaegertracing.io/">Jaeger</a>, <a href="https://cloud.google.com/products/operations">Google Operations (formerly StackDriver)</a>, or your preferred OpenTracing consumer. During your pipeline, however, you may want to trace other entities that branch off from your event (e.g. an event contains multiple user instances that we care about, so we want to initialize traces for those users that follow from our event).</p>

<p>To accomplish this, you can use the <code>follows()</code> API to create a new trace that follows from an existing one. What this means is that you can have an element traced independently downstream, but ultimately it can still be linked back to the originating record that introduced it into the system:</p>

<pre><code>// Initialize the trace for a new traceable instance from an existing context
user.tracingContext = tracer.follows(event.tracingContext, "found_user") { ... }  
</code></pre>
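<p>One subtlety worth calling out: <code>follows()</code> returns a copy of the context rather than mutating the parent's, which is what keeps sibling branches (e.g. two users from one event) from clobbering each other's span references. Here's a toy, Jaeger-free model of just that bookkeeping (the map keys are illustrative and not real tracer header names):</p>

<pre><code class="language-kotlin">// Toy model of the context handling in follows(): copy the parent's
// context, then record the new span against the copy only.
fun follows(parentContext: Map&lt;String, String&gt;, spanName: String): MutableMap&lt;String, String&gt; {
    val copy = HashMap(parentContext)
    copy["follows-from"] = parentContext["span"] ?: "(root)"
    copy["span"] = spanName
    return copy
}

fun main() {
    val event = mutableMapOf("span" to "event_received")

    // Two users branch off from the same event...
    val userA = follows(event, "found_user_a")
    val userB = follows(event, "found_user_b")

    // ...each carries its own span, but both still point back to the event
    println(userA["span"])          // found_user_a
    println(userB["span"])          // found_user_b
    println(userA["follows-from"])  // event_received
    println(event["span"])          // event_received (parent untouched)
}
</code></pre>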

<p>After introducing this into your pipeline and running some data through, you can view the trace (covered in the next section) to visualize this branching within the trace:</p>

<p><img src="https://lh3.googleusercontent.com/M8z7r7muzdTLZ3Is4zs30iaKEZh82fPdckmWu0bnG6iYYvqzBJ5RGMpqdlLe4qxtJXBY_LuChO4Mdi2urWjKQICpcwd2RW5ZYEPqw4poGZf20uXRtRbpMAz2M7GGTNSgVRv4fR_U" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>What you can see in this chart is the following steps:</p>

<ol>
<li>An event was introduced into the system and its trace was initialized.  </li>
<li>Within the identification Apache Beam pipeline, two separate users were identified from this event with their own independent tracing contexts.  </li>
<li>Downstream, each of these users was sent to a Kafka topic with appropriate tracing during that process (this step having no notion of the existence of the event itself).</li>
</ol>

<p>As you might imagine, an entirely separate Apache Beam pipeline could pick up one of these users, and apply an additional trace, which will ultimately appear on the overall graph for the originating event.</p>

<h2 id="accessingyourtrace">Accessing Your Trace  </h2>

<p><strong>NOTE</strong>: This assumes that you either have a production instance of Jaeger running or a local instance (perhaps inside of a Docker container) where you can send your traces.</p>
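<p>If you don't already have a Jaeger instance handy, the project publishes an all-in-one Docker image (collector, agent, and UI in a single container) that works well for local development. Something along these lines will do, where 6831/udp receives spans from the agent-based sender configured earlier and 16686 serves the UI:</p>

<pre><code class="language-bash"># Run a local, all-in-one Jaeger instance
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# The Jaeger UI will then be available at http://localhost:16686
</code></pre>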

<p>Once you've updated your applications to configure a tracer, updated your elements to carry a tracing context, and actually run them, you are ready to leave your application behind and take a look at the traces themselves in Jaeger.</p>

<p>After running your application, you should be able to visit your Jaeger instance, which provides a UI to visualize the traces themselves:</p>

<p><img src="https://lh4.googleusercontent.com/icN9RC5Zf7HR35A8uyLrAro16oIBXzFgMz668OhiidFl5C0--HQRvgUzwM1P8_ORkIX2lnPRXxRy1vFaVu8xnK1yjkxxRK8NWd_leJkYk9MmLfGLbPZaN0shgOQV7J3yy7Z-Ze59" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>From the Jaeger UI, you can do quite a bit in terms of exploration. You can get an overview of all of the applications that are currently reporting traces and filter down to a specific application. You can also search all of the known spans for a given tag that was defined within your application (e.g. a search for <code>error=true</code> would display every span that contained an error, so you could easily find errors within your pipeline).</p>

<p>Additionally, you can drill into any given trace to see more information about it such as timings, individual tags, logging information and more:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/07/censored.png" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>While this example is a very simple use case, you can imagine the value in more complex systems, especially during the development process, for logging aggregations in real-time, ensuring that transformations are being performed correctly, etc. The UI also provides a graphical representation of a trace as well:</p>

<p><img src="https://lh6.googleusercontent.com/ayMnESjSDU2td2nH0tL-DO4N0PgJT1ocQcKQIcaLpbKOL-jsogOkO0u-AFQl66tklgbR5HPGZeEfOB778rATqfpzCuvjZTHyw2iMzSmsmWcaPPfuWwR6G7I5B92DQIYhVX-FU8YA" alt="A Distributed Tracing Adventure in Apache Beam"></p>

<p>You can also take two traces and compare them against each other to see where one might diverge (e.g. if one contained erroneous data it might be offloaded to a Kafka topic for manual review while another continues on to its expected destination). As you add more traces, spans, and applications, this image begins to expand into a complete graph of your entire system.</p>

<h2 id="theend">The End?  </h2>

<p>Tracing and logging are massive topics in their own right, with entire books dedicated to them, so this is really just the tip of the iceberg. This post covered only the simplest of use cases for getting started with a tracing framework in a distributed system; there is a wide range of frameworks, implementations, and strategies available, and this was just one option among them.</p>]]></content:encoded></item><item><title><![CDATA[Avoiding Kotlin Minefields in Apache Beam]]></title><description><![CDATA[<p>Without a doubt, the Java SDK is the most popular and full featured of the languages supported by <a href="https://beam.apache.org/">Apache Beam</a> and if you bring the power of Java's modern, open-source cousin <a href="https://kotlinlang.org/">Kotlin</a> into the fold, you'll find yourself with a wonderful developer experience. As with most great relationships, not everything</p>]]></description><link>https://rionghost.azurewebsites.net/2020/06/17/avoiding-kotlin-minefields-in-apache-beam/</link><guid isPermaLink="false">c8351750-1e71-4ee5-93ad-67fbbc0167b6</guid><category><![CDATA[kotlin]]></category><category><![CDATA[apache beam]]></category><category><![CDATA[learning]]></category><category><![CDATA[productivity]]></category><category><![CDATA[streaming]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Wed, 17 Jun 2020 17:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/06/kotlin-beam-minefield.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/06/kotlin-beam-minefield.jpg" alt="Avoiding Kotlin Minefields in Apache Beam"><p>Without a doubt, the Java SDK is the most popular and full featured of the languages supported by <a href="https://beam.apache.org/">Apache Beam</a> and if you bring the power of Java's modern, open-source cousin <a href="https://kotlinlang.org/">Kotlin</a> into the fold, you'll find yourself with a wonderful developer 
experience. As with most great relationships, not everything is perfect, and the Beam-Kotlin one isn't totally exempt.</p>

<p>This post will cover some of the unique interactions between the two technologies and help you avoid some of the potential landmine gotchas that could arise when you are getting started, so you can focus on the great experience between Kotlin and Beam.</p>

<h2 id="declaringanonymouspardosdofns">Declaring Anonymous ParDos / DoFns  </h2>

<p>When scouring the web looking for examples, it’s fairly common to see something like the following that creates an anonymous DoFn to be used within a ParDo:  </p>

<pre><code class="language-java">lines.apply("Extract Words", ParDo.of(new DoFn&lt;String, String&gt;() { ... }));  
</code></pre>

<p>A simple conversion to Kotlin would yield the following:  </p>

<pre><code class="language-kotlin">lines.apply("Extract Words", ParDo.of(DoFn&lt;String, String&gt;() { ... }))  
</code></pre>

<p>However, you’ll find that this causes the generic type information to be erased and Beam will complain about it. <strong>Instead, in order to implement an anonymous function you must indicate that an object is inheriting from the DoFn explicitly:</strong></p>

<pre><code class="language-kotlin">lines.apply("Extract Words", ParDo.of(object : DoFn&lt;String, String&gt;() { ... }))  
</code></pre>
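
<p>As a quick illustration of why the <code>object</code> keyword matters, here is a small, self-contained sketch. It uses a hypothetical <code>Fn</code> class as a stand-in for Beam's <code>DoFn</code>, since Kotlin's object-expression syntax is the direct analogue of Java's anonymous inner classes:</p>

```kotlin
// Hypothetical stand-in for Beam's DoFn, purely to illustrate the syntax.
abstract class Fn<InputT, OutputT> {
    abstract fun process(input: InputT): OutputT
}

fun main() {
    // `object : Fn<...>() { ... }` is Kotlin's equivalent of Java's
    // `new Fn<String, Integer>() { ... }` anonymous inner class.
    val wordLength = object : Fn<String, Int>() {
        override fun process(input: String): Int = input.length
    }

    println(wordLength.process("beam")) // 4
}
```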

<h2 id="definingtupletags">Defining TupleTags  </h2>

<p>TupleTags can be invaluable, and often necessary, if you are dealing with transforms or operations that handle multiple types. However, you may find that issues bubble up from how these tags are declared, forcing you to explicitly define a Coder via the <code>setCoder()</code> function after retrieving a specific tag. </p>

<p>A dead giveaway would be the following error:</p>

<blockquote>
  <p>Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Transform.out1 [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder(). Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for V. Building a Coder using a registered CoderProvider failed. See suppressed exceptions for detailed failures. Using the default output Coder from the producing PTransform failed: Unable to provide a Coder for V. Building a Coder using a registered CoderProvider failed.</p>
</blockquote>

<p>If you encounter this, it’s very likely that you defined your TupleTags as follows:  </p>

<pre><code class="language-kotlin">val userTag = TupleTag&lt;KV&lt;String, User&gt;&gt;()  
</code></pre>

<p>Unfortunately, as with many Kotlin-Java interop issues, type erasure is the culprit here. <strong>To avoid this issue, you need to be as explicit as possible when defining a TupleTag and use the object-expression pattern as seen below:</strong>  </p>

<pre><code class="language-kotlin">val usersTag = object: TupleTag&lt;KV&lt;String, User&gt;&gt;() {}  
</code></pre>

<p>The use of the object keyword and the trailing open-close curly braces allows the specific types to be retained when attempting to read from the tag.  </p>
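
<p>You can see this behavior with plain reflection, no Beam required. The sketch below uses a hypothetical <code>Tag</code> class in place of Beam's <code>TupleTag</code> to show that only the anonymous-subclass form retains its generic type argument at runtime, which is exactly the kind of information coder inference relies on:</p>

```kotlin
import java.lang.reflect.ParameterizedType

// Hypothetical stand-in for Beam's TupleTag; reflection over the generic
// supertype of anonymous subclasses is how the captured type is recovered.
open class Tag<T>

// Returns the captured type argument's name, or null if it was erased.
fun capturedType(tag: Tag<*>): String? =
    (tag.javaClass.genericSuperclass as? ParameterizedType)
        ?.actualTypeArguments?.first()?.typeName

fun main() {
    val plain = Tag<String>()                // direct instantiation: type erased
    val explicit = object : Tag<String>() {} // anonymous subclass: type retained

    println(capturedType(plain))    // null
    println(capturedType(explicit)) // java.lang.String
}
```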

<h2 id="intellijgeneratedoverrides">IntelliJ Generated Overrides  </h2>

<p>One of the most appealing features of IntelliJ is the ability to allow the IDE to generate any missing overrides for you when implementing or inheriting from another class / interface. <strong>Due to Kotlin’s typechecking system, this can be a challenge since Kotlin explicitly uses a <code>?</code> character to denote nullability, but Beam will want you to ensure that the types match exactly.</strong></p>

<p>Consider the following function:</p>

<pre><code class="language-kotlin">class ExampleTransform: PTransform&lt;PCollection&lt;KV&lt;String, Test&gt;&gt;, PCollectionTuple&gt;() {  
    // Omitted for brevity
}
</code></pre>

<p>You know that you need to perform some type of operation here, so you take advantage of your IDE and allow it to generate the appropriate overrides:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/06/generate-overrides.png" alt="Avoiding Kotlin Minefields in Apache Beam"></p>

<p>When you do this, you’ll see that nullable instances of all of your types will be added, specifically on the parameters:  </p>

<pre><code class="language-kotlin">// Notice the trailing ? after the type definition of the input
override fun expand(input: PCollection&lt;KV&lt;String, Test&gt;&gt;?): PCollectionTuple {  
    TODO("Not yet implemented")
}
</code></pre>

<p>Beam is extremely explicit with regards to typing and nullability, so you’ll want to <strong>ensure that the <code>PCollection</code> in this case is not decorated with the nullable operator:</strong>  </p>

<pre><code class="language-kotlin">override fun expand(input: PCollection&lt;KV&lt;String, Test&gt;&gt;): PCollectionTuple {  
    TODO("Not yet implemented")
}
</code></pre>
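
<p>As a simplified, self-contained illustration (using a hypothetical <code>Expander</code> class in place of Beam's <code>PTransform</code>), declaring the parameter non-null keeps Kotlin's compile-time null checks aligned with the non-null contract the base class actually expects:</p>

```kotlin
// Hypothetical stand-in for PTransform; Beam treats expand()'s input as non-null.
abstract class Expander<InputT, OutputT> {
    abstract fun expand(input: InputT): OutputT
}

class CountExpander : Expander<List<String>, Int>() {
    // An IDE-generated stub may suggest `input: List<String>?`; dropping the
    // trailing `?` means no needless null handling inside the function body.
    override fun expand(input: List<String>): Int = input.size
}

fun main() {
    println(CountExpander().expand(listOf("a", "b", "c"))) // 3
}
```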

<h2 id="iterablesbutwhichones">Iterables, But Which Ones?  </h2>

<p>Both Java and Kotlin have notions of an <code>Iterable</code> interface for working with collections of items; however, when leveraging them via a grouping/batching operation such as the <code>GroupIntoBatches</code> transform, a Kotlin-Java disconnect can occur between the JVM types.</p>

<pre><code class="language-kotlin">pipeline  
    .apply("Batch Items", GroupIntoBatches.ofSize&lt;Key, Value&gt;(100))
    .apply("Apply Batching Transform", ParDo.of(SomeTransform.transform()))
</code></pre>

<p>You may encounter an error that looks like the following:</p>

<blockquote>
  <p>ProcessContext argument must have type DoFn&lt;Iterable&lt;? extends Value&gt;, Result&lt;? extends Value&gt;&gt;.ProcessContext</p>
</blockquote>

<p><strong>You can resolve this by adding a <code>@JvmWildcard</code> annotation to the type of the iterable (and not the iterable itself on the DoFn) by changing this:</strong></p>

<pre><code class="language-kotlin">class SomeTransform: DoFn&lt;KV&lt;Key, Iterable&lt;Value&gt;&gt;, KV&lt;Key, Value&gt;&gt;(){  
  // Omitted for brevity 
}
</code></pre>

<p>to this:</p>

<pre><code class="language-kotlin">class SomeTransform: DoFn&lt;KV&lt;Key, Iterable&lt;@JvmWildcard Value&gt;&gt;, KV&lt;Key, Value&gt;&gt;(){  
  // Omitted for brevity
}
</code></pre>

<p>This hint should allow the JVM to determine the correct version of the interface to use so that it can be serialized/deserialized by the Beam programming model.</p>
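
<p>You can observe what the annotation does with a small reflection check. In this sketch (using a hypothetical <code>Holder</code> base class), Kotlin omits the wildcard from the supertype's JVM signature by default, and <code>@JvmWildcard</code> forces it back in:</p>

```kotlin
// Hypothetical base class; imagine Beam's DoFn in its place.
open class Holder<T>

// Kotlin omits wildcards from supertype arguments by default, so Plain's JVM
// signature is Holder<Iterable<String>>. @JvmWildcard forces
// Holder<Iterable<? extends String>>, the shape Java-side reflection expects.
class Plain : Holder<Iterable<String>>()
class Wild : Holder<Iterable<@JvmWildcard String>>()

fun main() {
    println(Plain::class.java.genericSuperclass.typeName) // no "? extends"
    println(Wild::class.java.genericSuperclass.typeName)  // contains "? extends"
}
```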

<h2 id="writingpipelinetests">Writing Pipeline Tests  </h2>

<p>Testing, particularly unit testing, is extremely important when writing Beam applications (as it always is), however there are three major gotchas in the testing department that you should be aware of when working with Kotlin, namely:</p>

<ul>
<li><strong>Defining Your Pipeline</strong></li>
<li><strong>Applying PAsserts</strong></li>
<li><strong>Running Pipeline Tests</strong></li>
</ul>

<h4 id="definingyourpipeline">Defining Your Pipeline</h4>

<p>Since the PAsserts applied when writing unit tests against Beam pipelines rely on native Java code, your test pipeline will require a few annotations to work correctly. You can use the following as an example of how to construct one:  </p>

<pre><code class="language-kotlin">@get:Rule
@Transient
val testPipeline: TestPipeline = TestPipeline.create()  
</code></pre>

<p><strong>All of your individual unit tests can share this pipeline, but you should consider writing it exactly as above since both the <code>@get:Rule</code> and <code>@Transient</code> annotations are required, as is the explicit type declaration (e.g. <code>: TestPipeline</code>).</strong></p>

<h4 id="applyingpasserts">Applying PAsserts</h4>

<p>The PAssert library comes bundled with Beam and allows you to write tests explicitly targeting PCollection objects (e.g. you can write assertions against and verify their contents). These generally will just “work” as expected; however, one particular caveat arises when using the <code>.satisfies()</code> function:  </p>

<pre><code class="language-kotlin">PAssert.that(numbers).satisfies { elements -&gt;  
    assertTrue(elements.contains(42))
}
</code></pre>

<p><strong>You’ll find that this will not work since the <code>satisfies()</code> function explicitly expects a Java <code>Void</code> to be returned. Since this doesn’t exist within Kotlin, you’ll be required to explicitly place a null at the end of the body of the function:</strong></p>

<pre><code class="language-kotlin">PAssert.that(numbers).satisfies { elements -&gt;  
    assertTrue(elements.contains(42))
    null // Required
}
</code></pre>
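
<p>The <code>null</code> requirement comes from <code>satisfies()</code> accepting a function that returns Java's <code>Void</code>, a class with no instances. A minimal, self-contained sketch (with a hypothetical <code>satisfies</code> helper standing in for Beam's) shows the pattern:</p>

```kotlin
// Hypothetical analogue of Beam's SerializableFunction<T, Void>,
// which is what satisfies() actually accepts.
fun interface AssertFn<T> {
    fun apply(input: T): Void?
}

// Hypothetical stand-in for PAssert.that(...).satisfies(...).
fun <T> satisfies(value: T, fn: AssertFn<T>): Boolean {
    fn.apply(value) // any failed check() inside will throw
    return true
}

fun main() {
    val passed = satisfies(listOf(1, 42, 7)) { elements ->
        check(42 in elements) { "expected 42 in $elements" }
        null // java.lang.Void has no instances, so return null explicitly
    }
    println(passed) // true
}
```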

<h4 id="runningpipelinetests">Running Pipeline Tests</h4>

<p>Another potential gotcha can come when attempting to actually execute or run the tests themselves. <strong>You have to ensure that the pipeline itself is explicitly run, to completion, after the PAssert is defined:</strong>  </p>

<pre><code class="language-kotlin">PAssert.that(numbers).containsInAnyOrder(42)  
testPipeline.run().waitUntilFinish()  
</code></pre>

<p>Since the PAssert is constructed as part of the directed acyclic graph that executes the pipeline, it must be declared prior to running the tests. You’ll also find that you won’t be able to debug any of the ParDo level operations if you are missing the <code>run()</code> declaration.</p>

<h2 id="missingagotcha">Missing a Gotcha?  </h2>

<p>If you've been working with Apache Beam and Kotlin, I'd love to hear of any specific gotchas or use-cases that you ran into and how you overcame them!</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Introducing Apache Beam Katas for Kotlin]]></title><description><![CDATA[<p>Everyone learns differently, especially when it comes to technology. </p>

<p>Some folks enjoy reading a book, others like immediately diving in and getting their hands dirty. Being inspired by the folks in the latter half of that statement, I decided to follow the lead of a few other folks that had</p>]]></description><link>https://rionghost.azurewebsites.net/2020/06/01/introducing-apache-beam-katas-for-kotlin/</link><guid isPermaLink="false">9024bbb1-8f09-4b87-a876-382048c4ff9a</guid><category><![CDATA[kotlin]]></category><category><![CDATA[apache beam]]></category><category><![CDATA[learning]]></category><category><![CDATA[tools]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Mon, 01 Jun 2020 17:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/06/beam-and-kotlin-getting-along.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/06/beam-and-kotlin-getting-along.jpg" alt="Introducing Apache Beam Katas for Kotlin"><p>Everyone learns differently, especially when it comes to technology. </p>

<p>Some folks enjoy reading a book, others like immediately diving in and getting their hands dirty. Being inspired by the folks in the latter half of that statement, I decided to follow the lead of a few other folks that had been developing courses to learn Apache Beam and combine that with my new language du-jour: Kotlin.</p>

<h2 id="introducingbeamkatasforkotlin">Introducing Beam Katas for Kotlin</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/06/beam-katas-in-edutools.png" alt="Introducing Apache Beam Katas for Kotlin"></p>

<p>The folks working on <a href="https://beam.apache.org/">Apache Beam</a> have done an excellent job at providing examples, documentation, and tutorials on all of the major languages that are covered under the Beam umbrella: Java, Python, and Go. One of the shining stars among all of these resources is the series of available Beam Katas.</p>

<p>Beam Katas are interactive coding exercises (i.e. <a href="https://en.wikipedia.org/wiki/Kata_(programming)">code katas</a>) for learning more about writing Apache Beam applications, working with its various APIs and programming model hands-on, all from the comfort of your favorite IDEs. As of today, you can now work through all of the progressive exercises to learn about the fundamentals of Beam in Kotlin.</p>

<p>Each series of Katas provides a structured experience for learners to understand about Apache Beam and its various SDKs through exercises of gradually increasing complexity. You'll start from the most basic of "Hello Beam" projects to eventually building an entire data pipeline on your own. The Katas themselves are built upon <a href="https://www.jetbrains.com/education/">JetBrains Educational Tools</a>, which allow you to work through this and the hundreds of other available courses all from the comfort of your IDE (assuming you are using <a href="https://www.jetbrains.com/education/">IntelliJ</a> or <a href="https://www.jetbrains.com/pycharm-edu/">PyCharm</a>) through their <a href="https://plugins.jetbrains.com/plugin/10081-edutools">EduTools plugin</a>.</p>

<h2 id="beamkotlin">Beam + Kotlin = ❤️</h2>

<p><a href="http://www.kotlinlang.org">Kotlin</a> is a modern, open-source, statically typed language that targets the JVM. It is most commonly used by Android developers, however it has recently risen in popularity due to its extensive feature set that enables more concise and cleaner code than Java, without sacrificing performance or type safety. It was recently <a href="https://insights.stackoverflow.com/survey/2020#technology-most-loved-dreaded-and-wanted-languages-loved">ranked as one of the "Most Loved" programming languages in the annual Stack Overflow Developer Survey</a>, so don't take just my word for it:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/06/kotlin-love.png" alt="Introducing Apache Beam Katas for Kotlin"></p>

<p>The relationship between Apache Beam and Kotlin isn't a new one. You can find examples scattered across the web of engineering teams embracing the two technologies including <a href="https://beam.apache.org/blog/beam-kotlin/">a series of samples announced on the Apache Beam blog</a>. If you are new to Beam or are an experienced veteran looking for a change of pace, we'd encourage you to give Kotlin a try.</p>

<p>You can find the Kotlin and the other excellent Beam Katas below (or by just searching for "Beam Katas" within <a href="https://www.jetbrains.com/education/download/#section=idea">IntelliJ</a> or <a href="https://www.jetbrains.com/education/download/#section=pycharm-edu">PyCharm</a> through <a href="https://plugins.jetbrains.com/plugin/10081-edutools">the EduTools plugin</a>):</p>

<ul>
<li><a href="https://stepik.org/course/72488"><strong>Kotlin</strong></a></li>
<li><a href="https://stepik.org/course/54530"><strong>Java</strong></a></li>
<li><a href="https://stepik.org/course/54532"><strong>Python</strong></a></li>
<li><a href="https://stepik.org/course/70387"><strong>Go (in development)</strong></a></li>
</ul>

<p>I'd like to extend a very special thanks to <a href="https://twitter.com/henry_ken">Henry Suryawirawan</a> for his creation of the original series of Katas and his support during the review process and making this effort a reality.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[An Education in Streaming]]></title><description><![CDATA[<p>I’m probably one of the few folks out there that love a good technical book. Sure, well written blog posts, tutorials, and exploratory projects are amazing, but if you really want to dig deep, books can be a great resource.</p>

<p>Since I’ve fallen in love with streaming data</p>]]></description><link>https://rionghost.azurewebsites.net/2020/05/09/an-education-in-streaming/</link><guid isPermaLink="false">bdd11d7a-f172-4bcd-9395-0980face6f7b</guid><category><![CDATA[kafka]]></category><category><![CDATA[learning]]></category><category><![CDATA[streaming]]></category><category><![CDATA[apache beam]]></category><category><![CDATA[resources]]></category><category><![CDATA[dataflow]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sat, 09 May 2020 17:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/05/education-on-streaming.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/05/education-on-streaming.jpg" alt="An Education in Streaming"><p>I’m probably one of the few folks out there that love a good technical book. Sure, well written blog posts, tutorials, and exploratory projects are amazing, but if you really want to dig deep, books can be a great resource.</p>

<p>Since I’ve fallen in love with streaming data systems, I thought it would be fitting to focus on books in that arena. <strong>Streaming itself seems to be very hit or miss in terms of understanding. It’s not always intuitive and can often go against what your previous experience has taught you to believe.</strong></p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/05/air-youre-breathing.png" alt="An Education in Streaming"></p>

<p>I’ve seen countless developers, incredibly smart folks, struggle with some of the most basic concepts behind streaming systems, and with this post I’m hoping to provide some “recommended readings” to help prevent that from happening to you.</p>

<p>So without further fanfare, let’s take a look at a few books (in no particular order) that might help you on your journey.</p>

<h2 id="kafkathedefinitiveguide">Kafka: The Definitive Guide</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/05/kafka-the-definitive-guide-book.jpg" alt="An Education in Streaming"></p>

<p><strong>A perfect primer for Kafka that covers its origins, internals, uses, and very much a recommended read before diving into it.</strong></p>

<p>Kafka is probably the most ubiquitous technology when you think of streaming, so it’s a good one to understand. And who better to cover this topic than the folks at Confluent, a team composed of tons of Kafka aficionados and many members of the original team that built it?</p>

<blockquote>
  <p>It’s important to note that we are talking about Apache Kafka proper (not to be confused with Kafka Streams, which is an abstraction and framework built atop Kafka).</p>
</blockquote>

<p><a href="https://www.confluent.io/resources/kafka-the-definitive-guide/">Kafka: The Definitive Guide</a> is an excellent (and free) entry point into the world of Kafka. It’s an incredibly easy read, and provides a high-level overview of Kafka itself, the value it provides, and how it works in a nutshell. You'll learn about producers, consumers, brokers, clusters, partitions and all of the other vast building blocks that make Kafka great - and all within the first chapter.</p>

<p>After the overview, you'll get a walkthrough of the installation process and a primer on all of the companion applications and processes that help Kafka do its job effectively (things like Zookeeper), configuration, hardware needs, and more. Once you get your feet wet there, you'll journey into the world of writing producers to generate messages as well as consumers to read them, and the pieces will start to come together. </p>

<p>The book then dives into the nitty-gritty details: the internals. You'll revisit many of the concepts introduced earlier in the book and go into deep-dives on each of them (e.g. instead of knowing that partitions are just the building blocks of Kafka, you'll learn about how they are structured, segments, indexes, etc.).  These details segue into actually constructing a pipeline and all of the considerations that go into making it perform well, scale, and handle failures (which happen all the time). You'll also cover a variety of other production-oriented topics surrounding administration and monitoring, and you should be able to leave the book with some confidence in working with Kafka. </p>

<p>One of the most common reviews of the book itself states that “I wish that I had read this book before working with Kafka” and that couldn’t be more true (and it's a great resource if you are interested in <a href="https://www.confluent.io/certification/">becoming certified with Kafka</a>).</p>

<h2 id="kafkastreamsinaction">Kafka Streams in Action</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/05/kafka-streams-in-action.jpg" alt="An Education in Streaming"></p>

<p><strong>An excellent introduction to Kafka and its internals, when to use it, all framed around the Kafka Streams framework for building streaming applications.</strong></p>

<p>Keeping with the topic of Kafka, the next book <a href="https://www.manning.com/books/kafka-streams-in-action">“Kafka Streams in Action”</a> by Bill Bejeck provides a much more code-heavy approach. It almost reads like a revision of the Definitive Guide mentioned earlier with a much more practical focus specifically for Kafka Streams.</p>

<p>The book goes over many of the same concepts as the earlier book, and the first few chapters closely mirror each other, discussing high-level use-cases, when and when not to consider Kafka for solving problems, etc.</p>

<p>This overview takes a turn around Chapter 3, shifting into very informative, targeted chapters which focus on things like join semantics, windowing, stateful processing, etc. Each chapter works through and builds upon earlier problems and provides code examples for how to go about leveraging Kafka Streams to solve them.</p>

<p>You'll learn about the various supported APIs that Kafka Streams exposes, from the high-level DSL that provides a very SQL-esque syntax for working with streams to the low-level Processor API that'll allow you to work at the most granular level. Its chapters regarding state really shine and demonstrate some of the true potential behind the Kafka Streams framework. </p>

<p>As with the previously mentioned Kafka: The Definitive Guide, you'll explore many common design patterns and use cases for Kafka along with accompanying Kafka Streams code to implement them. You'll cover concepts like testing, monitoring, and all of the other goodness you'd expect before creating a production-ready Kafka Streams pipeline.</p>

<p>Ultimately, you can’t go wrong with either of these if you are working with Kafka, however if you are using Kafka Streams, I’d probably err towards this one, otherwise, stick with the Definitive Guide.</p>

<h2 id="designingdataintensiveapplications">Designing Data-Intensive Applications</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/05/designing-data-intensive-applications.jpg" alt="An Education in Streaming"></p>

<p><strong>Probably the most well known book on this list, and with good reason. It covers all facets of building data-oriented applications and while it doesn’t focus solely on streaming, it’s covered quite a bit.</strong></p>

<p>In this fantastic work on all things data, Martin Kleppmann guides you through the foundations of data in an easily digestible and comprehensive journey. Regardless of your experience, there’s something for you in <a href="https://dataintensive.net/">Designing Data-Intensive Applications</a>.</p>

<p>Kleppmann begins the journey at the foundational level, but don’t assume that it consists of just a one-two sentence high level overview. All of the popular flavors of data stores are covered (relational, document, graph, etc.) in depth with regards to scalability, maintainability, and reliability. You’ll also learn about the underlying implementations (e.g. B-trees, LSM trees, etc.), formatting techniques, and protocols that make each excel as databases or warehouses respectively.</p>

<p>After this foundational level, the book dives into the distributed world, which is primarily why it’s on this list. Streaming technologies such as Kafka are covered along with all of the concepts that make them great (partitioning, concurrency, replication, etc.). You’ll learn in-depth about concepts (and potential assumptions) like isolation, locking, serialization, and clocks (they can be unreliable). There’s lots of information to grok here, but it’ll affect how you think about streaming and distributed applications.</p>

<p>The final part of the book focuses on both batch processing and streaming, and many of the patterns that align the two. You’ll revisit Kafka in depth here (hopefully with the knowledge gleaned earlier in the book) as well as other pub-sub technologies and message queues. Lots of good nuggets about failures (spoiler alert: bad things happen <em>a lot</em>), idempotence, microbatching, windowing, and much, much more.</p>

<p>If you had to read one book from this list - Designing Data Intensive Applications wouldn’t be a bad one to pick.</p>

<h2 id="streamingsystems">Streaming Systems</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/05/streaming-systems-cover.jpg" alt="An Education in Streaming"></p>

<p><strong>A fantastic book on streaming in a holistic sense that focuses on building systems around it, the trade offs between batching and streaming, the Apache Beam model, and all sorts of other goodness.</strong></p>

<p>Arguably one of the best books I’ve come across regarding streaming in a holistic view. <a href="http://streamingsystems.net/">Streaming Systems</a> was written by a team of three engineers at Google, who worked on their DataFlow team and were members of the steering committee for Apache Beam.</p>

<p>The book itself talks holistically about streaming with a focus on the Beam Model (i.e. Apache Beam), and does so because Beam itself is designed to be an interface or abstraction over various streaming technologies (e.g. Spark, Flink, Dataflow, Samza, and countless others). As such, all of the code examples are written in Beam so that they map easily onto these other technologies, but more importantly they allow you to understand <em>what</em> is happening.</p>

<p>The book is extremely well written and is an easy read, despite the content occasionally teetering on academic (i.e. very detailed, low-level). It’s chock full of knowledge and tidbits, with a dash of humor in every chapter. It’s what most development books should aspire to be.</p>

<blockquote>
  <p>Streaming Systems also introduces a common “What”, “Where”, “When”, “How” motif that will stick with you when approaching streaming problems.</p>
</blockquote>

<p>Another plus for Streaming Systems is that despite the name, it doesn’t shy away from saying the B-word. Batching comes up frequently, primarily since the Beam Model supports it, but also because it’s the manner in which most companies perform data processing. The book details how the two interact and parallel one another, and how to strike a balance between them.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[No, You Aren't Alone.]]></title><description><![CDATA[<p>This situation sucks. </p>

<p>My heart goes out to all of those affected by this terrible pandemic across the globe. With our lives flipped upside-down for the foreseeable future, I thought I'd shed some insight into what's going on in my neck of the woods, how I'm managing, and hopefully provide</p>]]></description><link>https://rionghost.azurewebsites.net/2020/04/09/no-you-arent-alone/</link><guid isPermaLink="false">cb31ad93-16b4-47ef-a5d6-c965489933fd</guid><category><![CDATA[productivity]]></category><category><![CDATA[random]]></category><category><![CDATA[community]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Thu, 09 Apr 2020 19:29:19 GMT</pubDate><media:content url="http://rion.io/content/images/2020/04/Alone.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/04/Alone.jpg" alt="No, You Aren't Alone."><p>This situation sucks. </p>

<p>My heart goes out to all of those affected by this terrible pandemic across the globe. With our lives flipped upside-down for the foreseeable future, I thought I'd shed some insight into what's going on in my neck of the woods, how I'm managing, and hopefully provide some solace to others out there in saying:</p>

<p><strong>You absolutely aren't alone in what you are dealing with.</strong></p>

<h2 id="normalremoteworkthisisnot">Normal Remote Work This is Not</h2>

<p>Many folks reading this may be experiencing working remotely for the first time in their careers. It can be a huge adjustment. It can be scary. You might feel isolated, especially if you are extroverted or a social butterfly. You may struggle to feel like you are getting anything done - and that's okay.</p>

<blockquote>
  <p>As someone that has been working fully remote for several years I can easily say <strong>this is not what normal, fully remote work is like</strong>.</p>
</blockquote>

<p>It's truly a struggle to balance everything at a time like this. The constant stream of bad news flowing in, numbers that are out of a horror movie that continue to skyrocket, and no real end in sight. It can be hard to focus on solving work-related problems when personal life problems and <em>legitimate</em> emergencies are all around you.</p>

<p>I want you to know that you aren't alone - this is hard, unknown territory that we are all wandering through together.</p>

<h2 id="whatslifelike">What's Life Like?</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/04/fine.gif" alt="No, You Aren't Alone."></p>

<p>My spouse and I are both still working full-time, with me as a software engineer and her being a university professor. Both typical jobs that offer a ton of flexibility, and you'd think that this whole pandemic wouldn't affect us too severely... but I also have two children.</p>

<p>With both my kids being home, the difficulty level gets cranked up to near unbearable levels. Neither of the kids is old enough to be totally independent, with one being two and the other being four. The four-year-old still has school-work to do and turn in, and between the two of them, they need things to do throughout the day.</p>

<blockquote>
  <p>Each day is an insane grind of work, family time, being a teacher, being a parent, and just trying to feel some semblance of productivity in any of those areas during the day. </p>
</blockquote>

<p>A few things that I've found that have helped me get by include:</p>

<ul>
<li><strong>Strive for Normalcy</strong> - Structure is so important, especially if you have kids. Try to get them on a schedule that mimics life before the pandemic. Wake up, eat breakfast, do some activities with them, play outside, etc. It can be hard, they'll ask why they can't "play with their friends" or why they aren't going to school. It's tough.</li>
<li><strong>Tag Team</strong> - If you have a partner (again most of my challenges revolve around having kids), you two have to lean on each other. You may have important work obligations as may your partner. Communicate those to one another and prioritize as much as you can. </li>
<li><strong>Find Time Where You Can</strong> - Having a set schedule can be important, but it's not always realistic. Kids fall, chaos ensues, Wi-Fi goes out, you need food, etc. So find time where and when you can. I've found myself working odd hours, nights, during naps to try to feel as productive as I can with the situation around me.</li>
<li><strong>Set Realistic Goals</strong> - It's hard to try and expect that you'll be just as productive as your usual non-global crisis self - you likely won't. Try and set smaller, realistic goals for yourself that you can reach within a day or week to help check that "I Feel Productive" box mentally.</li>
</ul>

<p>This isn't easy, but you'll get through it.</p>

<h2 id="notholdingittogether">Not Holding It Together</h2>

<p>You might feel like you are going to break or might not be able to handle it. As I sit here typing this, know that I've felt the same way. My wife has felt the same way. My friends, co-workers, and countless others feel that way - you aren't alone.</p>

<blockquote>
  <p>If you are struggling to keep your head above water - <strong>talk to someone</strong>. If you aren't feeling productive at work, reach out to your manager, director, HR, or someone. </p>
  
  <p>Work together to figure out a different schedule during these crazy times and don't feel guilty about it - <strong>it's very likely the person you are talking to is going through the <em>same</em> situation</strong>. </p>
</blockquote>

<p>If you are working remotely, as most folks are now, take some time each day to hit those social goals with your co-workers. Talk about your crazy days, exchange the horror stories, and hopefully find comfort to know that you aren't alone in all this - because you aren't.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[When Containers Become Trashcans]]></title><description><![CDATA[<p>Containers are <em>so</em> awesome.</p>

<p>Prior to containers, if you wanted to experiment with some new technology, you had to go through the ringer to configure and install all of the appropriate dependencies, set up the proper infrastructure, and clutter your machine with tons of trash that you might not ever</p>]]></description><link>https://rionghost.azurewebsites.net/2020/03/01/when-containers-become-trashcans/</link><guid isPermaLink="false">49660612-b83b-4d13-831b-e8bafda432af</guid><category><![CDATA[best-practices]]></category><category><![CDATA[performance]]></category><category><![CDATA[tools]]></category><category><![CDATA[productivity]]></category><category><![CDATA[docker]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sun, 01 Mar 2020 18:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/03/trashcan.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/03/trashcan.jpg" alt="When Containers Become Trashcans"><p>Containers are <em>so</em> awesome.</p>

<p>Prior to containers, if you wanted to experiment with some new technology, you had to go through the wringer to configure and install all of the appropriate dependencies, set up the proper infrastructure, and clutter your machine with tons of trash that you might not ever care to use in the future. Thankfully, Docker and other containerized technologies came along and empowered developers to just throw a one-liner into the command line and, like magic, the entire world appeared in a magical little box.</p>

<p>Several days ago, I had an idea in my head so I scrounged around through <a href="https://hub.docker.com/">Docker Hub</a> looking for the perfect container that had everything that I wanted in it. I'd pull down one, tinker a bit, and throw it away. Pull another down, try something else, and then eventually throw it away. Pull yet another one down and ... crap!</p>

<pre><code class="language-bash">docker: write /var/lib/docker/tmp/GetImageBlob785934359: no space left on device.  
</code></pre>

<p>No space left on my device? That's odd, maybe it's the hours upon hours of videos that I've downloaded for my kids to watch on road trips, or all those computer games that I've installed but never can find the time to play (some day)? Let me check out the hard-drive and see how that looks:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/no-that-bad.PNG" alt="When Containers Become Trashcans"></p>

<p>Okay, so either this image that I'm pulling down is <em>really</em> big, or something else is going on. After a bit of mulling it over, I realized that, like all good humans who don't want <a href="https://en.wikipedia.org/wiki/Technological_singularity">the machines taking over</a>, I had allotted a specific amount of space and resources that Docker could take advantage of - and sure enough, I was right at the threshold.</p>

<p>It was time to take out the trash; however, since <code class="language-bash">docker dump trash</code> isn't a legitimate command (at least out of the box), I just had to do a quick prune:</p>

<pre><code class="language-bash">docker system prune  
</code></pre>

<p>The prune command in a nutshell does the following:</p>

<ul>
<li>Removes all stopped containers</li>
<li>Removes all networks that aren't being used by at least one container</li>
<li>Removes all dangling images</li>
<li>Removes all dangling build cache</li>
</ul>

<p>Note that on recent versions of Docker, unused volumes are <em>not</em> removed by default; you have to opt in with the <code class="language-bash">--volumes</code> flag.</p>
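
<p>If you'd like to see what's actually reclaimable <em>before</em> deleting anything, <code class="language-bash">docker system df</code> breaks the usage down for you, and the prune command takes a few flags for more aggressive cleanup (a quick reference; prune will ask for confirmation unless you pass <code class="language-bash">-f</code>):</p>

<pre><code class="language-bash"># Summarize disk usage by images, containers, local volumes, and build cache
docker system df

# More aggressive variants of the clean-up:
#   docker system prune -a          # also remove unused (not just dangling) images
#   docker system prune --volumes   # include unused volumes (skipped by default)
#   docker system prune -f          # skip the confirmation prompt
</code></pre>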

<p>Since I mentioned above that I <em>really</em> enjoy playing with containers, I figured that we'd just do it live, so here's what that looked like on my local machine:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/03/take-out-the-trash.gif" alt="When Containers Become Trashcans"></p>

<p><strong>With three little words, I managed to free up nearly 126GB of unused space across 22 containers, 5 networks, and 45 images.</strong> That's quite impressive, but more importantly, I was then able to immediately pull the image that I was trying to in the first place. </p>

<p>So the moral of the story is this - containers are really awesome, powerful, and useful to the modern developer, but if you aren't mindful, those unused ones can quickly turn from containers to trashcans that can make your development environment a stinky place to work in.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Putting the Fun in C# Local Functions]]></title><description><![CDATA[<p>Many popular languages support the use of <a href="https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/local-functions">local functions</a> and in C# 7, support for them was announced with relatively little fanfare. As someone that would consider themselves a C# power-user, I seldom took advantage of the feature until I realized just how much it could help with making code</p>]]></description><link>https://rionghost.azurewebsites.net/2020/02/28/putting-the-fun-in-c-local-functions/</link><guid isPermaLink="false">15cf1b8e-e65c-4640-b942-7ca742fd946b</guid><category><![CDATA[c#]]></category><category><![CDATA[asp.net]]></category><category><![CDATA[visual-studio]]></category><category><![CDATA[productivity]]></category><category><![CDATA[dotnet]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Fri, 28 Feb 2020 04:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/02/fun.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/02/fun.jpg" alt="Putting the Fun in C# Local Functions"><p>Many popular languages support the use of <a href="https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/local-functions">local functions</a> and in C# 7, support for them was announced with relatively little fanfare. As someone that would consider themselves a C# power-user, I seldom took advantage of the feature until I realized just how much it could help with making code more readable, specifically in the context as a replacement for comments/hacks, unit tests, and in general just to clean things up.</p>

<h2 id="whatarelocalfunctionsexactly">What are local functions exactly?</h2>

<p>Local functions are private methods of a type that are nested in another member. They can only be called from their containing member. Local functions can be declared in and called from:</p>

<ul>
<li>Methods, especially iterator methods and async methods</li>
<li>Constructors</li>
<li>Property accessors</li>
<li>Event accessors</li>
<li>Anonymous methods</li>
<li>Lambda expressions</li>
<li>Finalizers</li>
</ul>

<p>As with most things, sometimes it's easier to just <em>show</em> you what a local function looks like:</p>

<pre><code class="language-csharp">public static IEnumerable&lt;Address&gt; SanitizeAddresses(List&lt;Address&gt; addresses)  
{
      foreach(Address address in addresses) 
      {
           yield return Anonymize(address);
      }

      // This is a local function
      Address Anonymize(Address address) { ... }
}
</code></pre>
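
<p>Iterator methods deserve a special mention here. Because iterators execute lazily, any argument validation inside of them won't run until the caller actually enumerates the result; a local function lets the validation fire eagerly while keeping the lazy body right next to it. Here's a quick sketch (the <code>TakeEvery</code> helper is just made up for illustration):</p>

<pre><code class="language-csharp">public static IEnumerable&lt;T&gt; TakeEvery&lt;T&gt;(IEnumerable&lt;T&gt; source, int step)  
{
      // Validation happens immediately, at the call site...
      if (source == null) throw new ArgumentNullException(nameof(source));
      if (step &lt; 1) throw new ArgumentOutOfRangeException(nameof(step));

      return Iterate();

      // ...while the lazy iteration lives in a local function
      IEnumerable&lt;T&gt; Iterate()
      {
           int i = 0;
           foreach (var item in source)
           {
                if (i++ % step == 0) yield return item;
           }
      }
}
</code></pre>

<p>Without the local function, calling the method with a null source wouldn't throw until the first <code>foreach</code>, which makes the bug much harder to track down.</p>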

<h2 id="cleaningupcommentswithlocalfunctions">Cleaning Up Comments with Local Functions</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/cleaning.gif" alt="Putting the Fun in C# Local Functions"></p>

<p>One of the first use cases that comes to mind is using local functions to alleviate pesky sanitation or business logic rules, particularly those around string manipulation. If you've worked in enough business applications, you've undoubtedly seen something terrible with some massive comment as to <em>why</em> it's being done:</p>

<pre><code class="language-csharp">public static User ProcessUser(User user)  
{
      // All names must conform to some crazy Dr. Seuss-eqsue rhyming scheme
      // along with every other character being placed by its closest numerically
      // shaped equivalent
      var seussifyExpression = new Regex("...");
      user.Name = seussifyExpression.Replace(user.Name, "...");
      user.Name = user.Name
                      .Replace(..., ...)
                      .Replace(..., ...)
                      .Replace(..., ...);

      // Other processes omitted for brevity

      return user;
}
</code></pre>

<p>As you can see here, we have a series of chained replacements, some relying on strings, and others relying on regular expressions, which can make a method pretty clunky, especially if there are multiple operations to perform. Now, this is where you can define a local function to encapsulate all this business logic and replace your crazy comment:</p>

<pre><code class="language-csharp">public static User ProcessUser(User user)  
{
      SanitizeName();

      // Other processes omitted for brevity

      return user;

      void SanitizeName()
      {
          var seussifyExpression = new Regex("...");
          user.Name = seussifyExpression.Replace(user.Name, "...");
          user.Name = user.Name
                          .Replace(..., ...)
                          .Replace(..., ...)
                          .Replace(..., ...);
      }
}
</code></pre>

<p>You could easily name your local function whatever you like, even <code>ApplyBusinessLogicNamingRules()</code> and include any necessary comments for reasoning that you'd like within there (if you absolutely need to answer <em>why</em> you are doing something), but this should help the rest of your code tell you what it's doing without a comment explicitly writing it all out.</p>

<h2 id="goingallreadingrainbowwithlocalfunctions">Going All Reading Rainbow with Local Functions</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/reading-rainbow.jpg" alt="Putting the Fun in C# Local Functions"></p>

<p>If readability isn't the single most important thing about code, then it's damn close to the top.</p>

<p>LINQ is another popular area that local functions can assist with, especially if you have to do any type of crazy filtering logic over a series of records. You can define a series of local functions that can cover each step of your filtering process (or any process really), and more easily reason about your code from a readability perspective:</p>

<pre><code class="language-csharp">public List&lt;int&gt; FindPrimesStartingWithASpecificLetter(List&lt;int&gt; numbers, int startingDigit)  
{
    return numbers.Where(n =&gt; n &gt; 1 &amp;&amp; Enumerable.Range(1, n).Where(x =&gt; n % x == 0).SequenceEqual(new [] {1, n }))
                  .Where(n =&gt; $"{n}".StartsWith($"{startingDigit}"))
                  .ToList();
}  
</code></pre>

<p>While succinct, it doesn't exactly <em>read</em> well. Let's take a gander at what it looks like after rubbing some local functions on it:</p>

<pre><code class="language-csharp">public List&lt;int&gt; FindPrimesStartingWithASpecificLetter(List&lt;int&gt; numbers, int startingDigit)  
{
    return numbers.Where(n =&gt; IsPrime(n) &amp;&amp; StartsWithDigit(n)).ToList();

    bool IsPrime(int n) =&gt; n &gt; 1 &amp;&amp; Enumerable.Range(1, n).Where(x =&gt; n % x == 0).SequenceEqual(new [] { 1, n });
    bool StartsWithDigit(int n) =&gt; $"{n}".StartsWith($"{startingDigit}");
}  
</code></pre>

<p>As you can see, local functions are assisting with wrapping up all the ugly/nasty logic within their own tiny little functions. This is a really trivial case, but as you might imagine if you have lines-upon-lines of code that isn't touching anything outside of one method, it's likely a solid candidate for a local function.</p>

<h2 id="testingtestinglocally">Testing, Testing, Lo-ca-lly!</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/testing.jpg" alt="Putting the Fun in C# Local Functions"></p>

<p>If you've spent any amount of time writing tests, either unit or integration, you are probably familiar with the fabled <a href="http://wiki.c2.com/?ArrangeActAssert">'Arrange-Act-Assert'</a> pattern, which is used to separate each piece of functionality when testing a given piece of code as follows:</p>

<ul>
<li><strong>Arrange</strong> all necessary preconditions and inputs.</li>
<li><strong>Act</strong> on the object or method under test.</li>
<li><strong>Assert</strong> that the expected results have occurred.</li>
</ul>

<p>As you might imagine, local functions could lend themselves to this pattern quite well for complex test cases:</p>

<pre><code class="language-csharp">public void IsThisAnArrangeActAssertLocalFunction()  
{
     Arrange();
     Act();
     Assert();

     void Arrange() { ... }
     void Act() { ... }
     void Assert() { ... }
}
</code></pre>

<p>Is it practical? Does it fit all use cases? Is it something that you'd ever find yourself using? The answers to all of these might be an overwhelming no, but it does seem like a scenario where local functions <em>could</em> play a role.</p>

<h2 id="chooseyourownadventure">Choose Your Own Adventure</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/choose-your-own.jpg" alt="Putting the Fun in C# Local Functions"></p>

<p>Local functions present a few interesting options that fit some scenarios better than others. As a replacement for large comments or very messy business logic - absolutely. In unit tests or little one liners - probably not. With most new features, especially those that are sugary, it's really up to you and your team to see if they work for you. While they may seem appealing in some situations, they also seem ripe for abuse, potentially cluttered methods, and other issues that would completely defeat the purpose of using them in the first place.</p>

<p>So, if you choose to go down this road of local functions, proceed with care.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[The Other Kafka's Metamorphosis]]></title><description><![CDATA[<p>As someone that has been crafting software for the better part of half of my life, it's be a long time since I've been as excited as I have been over the last few months. Since making the jump late last year into the world of stream processing, I just</p>]]></description><link>https://rionghost.azurewebsites.net/2020/02/17/the-other-kafkas-metamorphosis/</link><guid isPermaLink="false">3ca1bdc4-8d5b-4c3e-8e72-bdf70babe5f1</guid><category><![CDATA[kafka]]></category><category><![CDATA[random]]></category><category><![CDATA[career]]></category><category><![CDATA[learning]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Mon, 17 Feb 2020 14:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/02/butterflies.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/02/butterflies.jpg" alt="The Other Kafka's Metamorphosis"><p>As someone that has been crafting software for the better part of half of my life, it's be a long time since I've been as excited as I have been over the last few months. Since making the jump late last year into the world of stream processing, I just can't get enough. I've read every book on it that I can get my hands on and continuously scour the web for interesting content on it, engineering blogs with novel approaches for interesting use-cases, and at this point, I'm even <em>dreaming</em> about it.</p>

<blockquote>
  <p>"When Gregor Samsa woke up one morning from unsettling dreams, he found himself changed in his bed into a monstrous vermin."</p>
  
  <p>Franz Kafka, <em>The Metamorphosis</em></p>
</blockquote>

<p>Alright, so I'm not turning into some monstrosity akin to our friend Gregor in Kafka's seminal work, or even close to that. But even with just a few months of exposure to Kafka (the technology), there's been a transformation on numerous fronts. It's affected the way that I <em>want</em> to work, the approach and solutions to the problems I encounter, and it's opened a wide range of doors that may have previously been closed without a paradigm like streaming.</p>

<h2 id="transformingnoveltyintoenthusiasm">Transforming Novelty into Enthusiasm</h2>

<p>The first part of this transformation was just how different working in this new streaming world was from what I had been accustomed to. As <a href="http://rion.io/2019/12/29/just-jump-into-the-stream/">I wrote in an earlier post describing the change</a>:</p>

<blockquote>
  <p>In a nutshell, I was taking everything I had done previously in my career and voluntarily throwing it out the window to shift to an entirely different stack, programming paradigm, language, etc. I suppose it could have been terrifying, but of all things it was... exciting.</p>
</blockquote>

<p>It <em>was</em> an entirely different paradigm for solving problems, with its own unique challenges and nuances that were interesting puzzles to solve. Architectural changes, dealing with replication, partitioning, figuring out which freaking JDK to use, streams, tables, and countless other already forgotten issues were weekly, if not daily, discussions.</p>

<p>At first, the transition was fun, most of which I attributed to the sheer novelty of it. Since everything was new, it felt more like a vacation from what I had been accustomed to working with. Sure, there were challenges, misunderstandings, and quite a few things that I just totally got wrong, but that's expected. My colleagues and I found ourselves vacillating between optimism and nearly wanting to scrap the experiment entirely until...it worked.</p>

<p>It was the most basic of scenarios, just simple enrichment, but it was mind-blowing. Messages were flowing into the stream, being enriched from known sources, and sent off to their final resting places. It really was just like magic except that I'd reckon even the best of magicians would struggle with the type of throughput we were handling. </p>

<p>I was hooked. </p>

<p>In even just a matter of months, I've become professionally obsessed with it. I've read a handful of fantastic books dedicated to the topic, scoured the web for interesting uses of it (shout-out to <a href="https://engineering.linkedin.com/blog/topic/kafka">LinkedIn</a>, <a href="https://eng.uber.com/?s=kafka">Uber</a>, <a href="https://netflixtechblog.com/evolution-of-the-netflix-data-pipeline-da246ca36905">Netflix</a>, and all the other transparent tech-giants leveraging it), and since they weren't selling t-shirts anywhere, I even went out and got a piece of paper:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/cert.PNG" alt="The Other Kafka's Metamorphosis"></p>

<p>(If anyone at Confluent or anyone else with some Kafka swag is reading this, hit me up, I'm still looking for a t-shirt or two.)</p>

<h2 id="transformingbatchesintostreams">Transforming Batches into Streams</h2>

<p>I'd say the first glaring thing about Kafka is that it's provided another approach for solving problems, specifically, the ability to do so in real-time. This alone is an incredibly compelling story if you've been living your developmental life in a batch-processing world. You'd now have the ability to act on items or handle events as they occurred instead of waiting until some given interval before even being aware they existed at all.</p>

<p>Streams don’t solve every problem, however, and I’ll let this quote from <a href="https://twitter.com/bbejeck">Bill Bejeck</a> help you decide when it might be appropriate:</p>

<blockquote>
  <p>Here are the key points to remember: If you need to report on or take action immediately as data arrives, stream processing is a good approach. If you need to perform in-depth analysis or are compiling a large repository of data for later analysis, a stream-processing approach may not be a good fit.</p>
  
  <p>Bill Bejeck, <em>Kafka Streams in Action</em></p>
</blockquote>

<p>I’ve always considered myself a pragmatist, and I strongly believe that you should use the best tool for the job, but always be cognizant of your biases. Your most comfortable hammer is likely not going to help you tighten a loose bolt, and if it does, well some irreparable damage might be done. </p>

<p>Kafka is easy enough to integrate with using tools like Connect to sync the data in Kafka down to various data sources (e.g. nearly every flavor of relational and non-relational database, Elasticsearch, BigQuery, etc.) in real-time. Likewise, data can be sourced into Kafka the same way as well. </p>

<p>You don’t have to go <em>all in</em> on one approach or the other; Kafka can supplement your existing ecosystem. If you have long-running data analysis to do, then use a tool that’s best for that. If you need that data in Kafka to do enrichment or make decisions with it, then sync it to Kafka when it makes sense and use it there as soon as it’s available. </p>

<p>You have options, use them.</p>

<h2 id="transformingyourapplicationsandbusiness">Transforming Your Applications and Business</h2>

<p>While the learning curve was steep, it paled in comparison to the analysis paralysis that followed. Despite the earlier quote from Bill on when streams were appropriate, it’s very easy for people to start feeling the instant gratification of what real-time processing looks like and get carried away. </p>

<p>Watching a process that, prior to Kafka, may have run only once in the middle of the night now complete within seconds is awesome, especially to those outside of engineering. This is where you must be careful and resist the urge to stream everything (and push back when it doesn’t make sense to do so).</p>

<p>Today we are bursting at the seams with data. Companies are gathering more data than ever and are seeking to leverage all this information to make decisions on just about everything. It goes without saying that since companies <em>are</em> overflowing with data, they are receiving it <em>constantly</em>. </p>

<blockquote>
  <p>Data is constantly flowing into systems and time is money. Kafka enables these systems to make decisions, change courses, take action, or at least be notified <em>immediately</em> when something of value occurs instead of waiting until the report trickles into their inbox.</p>
</blockquote>

<p>Kafka is <em><a href="https://medium.com/swlh/why-kafka-is-so-fast-bde0d987cd03">very fast</a></em>, <em><a href="https://eng.uber.com/reliable-reprocessing/">resilient</a></em>, and <em><a href="https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages">battle-tested</a></em>. It can easily be tailored to fit your specific scenarios from debugging, anomaly/fraud detection, training of machine learning models, and much more. It's become a centerpiece among data frameworks, architectures, and processing systems.</p>

<p>It's a technology that can absolutely transform the way you build your applications, and even the capabilities of your business. Just make sure that you transform responsibly.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[How I Haven't Become an Amorphous Blob While Working Remotely]]></title><description><![CDATA[<p>With it being February, I’m sure quite a few folks out there are still trying to keep their New Year's resolutions intact. Since I’d wager quite a few of those are in the area of fitness or personal wellness, this post falls in line with a resolution that</p>]]></description><link>https://rionghost.azurewebsites.net/2020/02/11/how-i-havent-become-an-amorphous-blob-while-working-remotely/</link><guid isPermaLink="false">be4cb030-a946-4471-bec1-82f56ccc5245</guid><category><![CDATA[random]]></category><category><![CDATA[best-practices]]></category><category><![CDATA[productivity]]></category><category><![CDATA[career]]></category><category><![CDATA[remote]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Tue, 11 Feb 2020 04:47:08 GMT</pubDate><media:content url="http://rion.io/content/images/2020/02/amorphous-blob.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/02/amorphous-blob.jpg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"><p>With it being February, I’m sure quite a few folks out there are still trying to keep their New Year's resolutions intact. Since I’d wager quite a few of those are in the area of fitness or personal wellness, this post falls in line with a resolution that I made years ago.</p>

<p>I’ve been working remotely full-time for the last three years and did quite a bit of it part-time for my previous employer. Remote work can be challenging for several reasons. It requires a great deal of self-discipline, both personally and professionally, and it can be tough to find that balance between focusing on your work, your family, and yourself.</p>

<p>This post is going to revolve around the last of those items: focusing on yourself, specifically with regards to fitness and what has worked for me in my career as a developer, particularly a remote one (where a kitchen filled with all sorts of terrible things to eat is never more than a few steps away).</p>

<h2 id="theball">The Ball</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/ball.jpg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"></p>

<p>It was circa 2008 and Jeff Atwood's well-written article <a href="https://blog.codinghorror.com/investing-in-a-quality-programming-chair/">'Investing in a Quality Programming Chair'</a> was making the rounds on all the good developer-related circuits. After reading this article, I did as most might, and began searching for a chair that would qualify as <em>quality</em> and surely improve my life... unfortunately my wallet didn't seem to agree with me.</p>

<blockquote>
  <p>Enter the Target clearance aisle and a Swiss Ball. This ball has been my daily chair for the last seven years. It’s never popped, I’ve never had to replace it, and it cost all of $6.99.</p>
</blockquote>

<p>It greatly helped my posture, balance, core strength, and has been worth its weight in gold. While you might not fully replace your chair with it, it’s a great supplement throughout the day, especially if you don't have the cheddar lying around to go pick up a nice Herman Miller.</p>

<h2 id="bewarethosewhoenter">Beware Those Who Enter</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/doorway.jpeg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"></p>

<p>Another great purchase that can be worthwhile is a simple pull up bar. Most are inexpensive and can mount in nearly any doorframe. Since there aren’t always a ton of opportunities to work out your upper body (typing doesn’t exactly get you looking ripped), you can make it an easy option with one of these.</p>

<p>I think that most would agree that pull-ups and chin-ups aren't the <em>easiest</em> of exercises, especially if you aren't someone that works out regularly. Therefore, it's so important to adjust and focus on what you <em>can</em> do. Every exercise has variants and approaches to accommodate different skill levels, and pull-ups/chin-ups are no exception. If you consider yourself in that category, explore doing "negatives" or employ the aid of a nearby chair to help you bridge the gap.</p>

<p>It may seem daunting at first, but build a habit of knocking out a few reps each time you enter/leave the room and it’ll pay dividends over time.</p>

<h2 id="thedeskcycle">The Deskcycle</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/deskcycle.jpg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"></p>

<p>While all of the previous items on this list could be considered general purpose, this next item would probably fall into the more specialized for office-use category. <a href="https://deskcycle.com/">The Deskcycle</a> is from the folks at 3D Innovations and I can honestly say that I rode one of these until the imaginary wheels fell off. At its peak, I probably averaged several hours a day in a serious sweat while working and I'd estimate that I clocked several thousand miles on it.</p>

<blockquote>
  <p>It’s worth noting: if you find yourself churning along at 15-20+ miles an hour, you are going to sweat, so dress, deodorize, and plan accordingly (especially if you are in a shared office).</p>
</blockquote>

<p>I never ran into any issues, besides the complaints from my co-workers at the time because I'd be drenched in sweat going into a meeting (small, poorly insulated office, so the devastating Louisiana summer heat is partially to blame). As with most workouts, there was a bit of an adjustment period (really, really sore legs), but after some time, I left every workday feeling like I had just left the gym.</p>

<p>It has all sorts of neat tracking features regarding calories burned, distance, and all the usual jazz that you’d expect from something like this. It’ll easily slide under just about any desk, although if you are tall, you may want to do some measuring before getting one.</p>

<h2 id="iron">Iron</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/kettlebells.jpg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"></p>

<p>Probably the most classical and ubiquitous item on the list: iron. Free-weights, dumbbells, barbells, kettlebells, whatever suits your fancy. The options are really timeless and can adjust to your needs (or whatever specifically you want to work on). They come in all shapes, sizes, and varieties to work just about every muscle group in your body and most are small enough to hide away in any office.</p>

<blockquote>
  <p>On a long conference call? Maybe knock out a few reps/sets. Waiting for that long running build process that you’ve been needing to fix? Fix it later and do a few Turkish get-ups.</p>
</blockquote>

<p>While it can be handy to have a complete barbell set with plates at your home, which I recommend if you are interested in doing the major Olympic lifts, it's not required by any stretch. Some of the hardest workouts that I've done have taken place in my office over a matter of minutes with just a single kettlebell. In addition, doing HIIT (high-intensity interval training) periodically throughout the day has been found to yield benefits for hours <em>after</em> you've stopped working out, so a few sets every few hours might be all you need.</p>

<p>Weights can be pretty easy to ignore, but if you can find a quick program or series of exercises that you make a few minutes for throughout the day, you can sometimes walk away feeling mentally and physically productive.</p>

<h2 id="treadmilldesk">Treadmill Desk</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/02/treadmill-desk.jpeg" alt="How I Haven't Become an Amorphous Blob While Working Remotely"></p>

<p>The real reason behind this entire post is this last section, which might not be the most practical, but it’s done wonders for me: the treadmill desk. Now it’s worth mentioning, these things are <em>not</em> cheap, and they are not by any means just ordinary gym treadmills. These are made specifically for lower speeds and extended use (several hours). Don’t go expecting to run sprints and train for the 2024 Olympics on one of these.</p>

<p>As soon as I took on a fully remote position, I began scouring sites like Facebook Marketplace, eBay, and Craigslist to try and find a deal on one. Given that they can run in excess of $1000 for a quality one, that I'm horribly frugal, and the nearest major city is about two hours from where I live, it took me months to find a deal, but when I did, I jumped on it.</p>

<p>Four hours later, it was all cleaned off and sitting in my office, ready to change my remote life forever.</p>

<blockquote>
  <p>At this point you might be thinking "A four-hour drive for a deal on a treadmill desk? That's crazy" and you might be right. But all things considered, if this thing burst into flames tomorrow, I'd be ordering one <em>for full price</em> that same night, because that's how much of a difference it's made.</p>
</blockquote>

<p>I spent the majority of my days on it while actively developing. There was quite a learning curve to walking, solving problems, and writing code all in conjunction, but after a while the walking isn't even given a thought. Usually for day-to-day writing code, I'll probably average between 2.0-2.5mph, with that going a bit higher if I'm just in a meeting, conference call, etc. I’d say I would easily average between 10-15 miles a day, which is something I’ve adapted to (don’t kill yourself trying this if you get one). </p>

<p>To be honest, it’s been a game-changer for me working remotely, especially living in a place with really, really good (read: fattening) food. It’s yielded numerous health benefits and improvements, but above all else, it’s allowed me to finish every work day feeling satisfied both physically and mentally, and given me additional time to spend with my family instead of carving out time for <em>me</em> to work out.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Glyphfriend 2019 Released!]]></title><description><![CDATA[<p>With a new year just beginning, it's always nice to start things out on the right foot and open-source is no exception. </p>

<p>Several years ago, I developed <a href="https://github.com/rionmonster/Glyphfriend">a popular extension for Visual Studio called Glyphfriend</a>. It's a handy tool for developers and designers alike that enjoy using the wide range</p>]]></description><link>https://rionghost.azurewebsites.net/2020/01/23/glyphfriend-2019-released/</link><guid isPermaLink="false">cd26b903-b683-4b81-8b58-5cb6415ee3ce</guid><category><![CDATA[glyphfriend]]></category><category><![CDATA[open-source]]></category><category><![CDATA[community]]></category><category><![CDATA[visual studio]]></category><category><![CDATA[tools]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Thu, 23 Jan 2020 15:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/01/glyphfriend.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/01/glyphfriend.jpg" alt="Glyphfriend 2019 Released!"><p>With a new year just beginning, it's always nice to start things out on the right foot and open-source is no exception. </p>

<p>Several years ago, I developed <a href="https://github.com/rionmonster/Glyphfriend">a popular extension for Visual Studio called Glyphfriend</a>. It's a handy tool for developers and designers alike that enjoy using the wide range of icon families out there within their applications:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/glyphfriend-in-action.gif" alt="Glyphfriend 2019 Released!"></p>

<p>With two little ones running around and growing up so fast, free time can be harder and harder to come by. But after a year of folks asking for it, a bit of lost sleep, and some bruises from banging my head against some new APIs, <strong>Glyphfriend is finally here for Visual Studio 2019!</strong></p>

<p>(You can download it <a href="https://marketplace.visualstudio.com/items?itemName=RionWilliams.Glyphfriend2019">here</a>)</p>

<h2 id="whatisglyphfriend">What is Glyphfriend?</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/glyphfriend-2.gif" alt="Glyphfriend 2019 Released!"></p>

<p>Glyphfriend is an open-source Visual Studio extension to enhance the existing IntelliSense to display preview glyphs for many of the common glyph-based font libraries like <a href="https://fontawesome.com/">Font Awesome</a>, <a href="https://material.io/resources/icons/?style=baseline">Material Design</a>, <a href="https://ionicons.com/">Ionic</a>, and more. If you read my blog enough, you know I love bulleted lists, so why not one for this:</p>

<ul>
<li><strong>Multiple Supported Glyph Libraries</strong> - Access nearly 8,000 glyphs from your favorite libraries, including Font Awesome, Material Design, Ionic, Foundation, and many, many more</li>
<li><strong>Library Toggling</strong> - Avoid being bogged down by the thousands of glyphs when you start typing by selecting only the libraries that you commonly use. Glyphfriend will remember your favorites every time that you open Visual Studio, and accessing another library is just a click away.</li>
<li><strong>Only When You Need It</strong> - The extension will only chime in when you open a valid HTML-flavored document and start typing within a class attribute. So you don't have to worry about writing some C#, F#, or (god forbid) Visual Basic and having glyphs invade your screen.</li>
<li><strong>Open Source</strong> - The extension is completely open source and accepts contributors. The community has created really awesome things in the past, such as a separate plug-in for ReSharper. </li>
</ul>

<h2 id="whatsnewinit">What's New in It?</h2>

<p>While there's still quite a lot to do in the backlog, the primary feature for this new release is simply the support for Visual Studio 2019 and a bit of housecleaning that I'll detail below:</p>

<ul>
<li><strong>Visual Studio 2019 Support</strong> - Glyphfriend now has an extension that supports Visual Studio 2019. There were a <em>ton</em> of major API changes between the 2017 and 2019 releases that basically required writing an entirely new extension, so I did.</li>
<li><strong>Performance Improvements</strong> - Obviously with an entirely rewritten extension, there were tons of changes, most of which were for the better. The newer APIs relied more heavily on asynchronous calls and more deliberate patterns surrounding threading (e.g. switching from the UI thread and vice-versa). </li>
<li><strong>Putting 2015 and 2017 Out To Pasture (Not Really)</strong> - Due to the major API changes, the project structure shifted quite a bit. All of the code that the earlier version of the extension relied upon has now been moved from the <code>Glyphfriend.Core</code> shared project to one titled <code>Glyphfriend.Core.Legacy</code>, which should serve to distinguish between the major API changes.</li>
</ul>

<p>The migration between the older APIs and the newer ones was rife with challenges and "gotchas". I'd like to offer a huge thank-you and shout-out to the several members of the Visual Studio team who made this possible: <a href="https://twitter.com/mkristensen">Mads Kristensen</a>, <a href="https://twitter.com/ntaylormullen">Taylor Mullen</a>, and <a href="https://twitter.com/chgunderman">Christian Gunderman</a>.</p>

<h2 id="checkitout">Check It Out!</h2>

<p>Glyphfriend has a separate extension for each of Visual Studio 2015, 2017, and 2019, so to get started, you'll simply need to download the version(s) that apply to you from the Visual Studio Marketplace:</p>

<ul>
<li><a href="https://marketplace.visualstudio.com/items?itemName=RionWilliams.Glyphfriend">Download Glyphfriend 2015</a></li>
<li><a href="https://marketplace.visualstudio.com/items?itemName=RionWilliams.Glyphfriend2017">Download Glyphfriend 2017</a></li>
<li><a href="https://marketplace.visualstudio.com/items?itemName=RionWilliams.Glyphfriend2019">Download Glyphfriend 2019</a></li>
</ul>

<p>You can also just search for it within the <strong>Tools > Extensions and Updates</strong> or <strong>Extensions > Manage Extensions</strong> area of Visual Studio to get it installed.</p>

<h2 id="contributionswelcome">Contributions Welcome!</h2>

<p>Pull Requests are openly accepted and encouraged.</p>

<p>The libraries that were chosen were just some of the more common ones that I had come across, but I am sure that I left quite a few out. If you find that one of your favorites is missing, you can either report it <a href="https://github.com/rionmonster/Glyphfriend/issues">within the Issues area</a> or just fork the repository, add your changes, and make a pull request when you are all done.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[When Random Isn't the Right Random]]></title><description><![CDATA[<p>Nearly any engineer worth his or her salt will likely agree that consistency is important. </p>

<p>If they don’t, they probably haven’t ever worked on a large legacy application or with a team of any decent size. Everyone being roughly (sans tabs vs. spaces religious views) on the same</p>]]></description><link>https://rionghost.azurewebsites.net/2020/01/11/when-random-isnt-the-right-random/</link><guid isPermaLink="false">d11efc4b-419c-47bf-a9c3-513b8f9b22e2</guid><category><![CDATA[random]]></category><category><![CDATA[debugging]]></category><category><![CDATA[kafka]]></category><category><![CDATA[.net]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sat, 11 Jan 2020 18:00:00 GMT</pubDate><media:content url="http://rion.io/content/images/2019/12/dice.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2019/12/dice.jpg" alt="When Random Isn't the Right Random"><p>Nearly any engineer worth his or her salt will likely agree that consistency is important. </p>

<p>If they don’t, they probably haven’t ever worked on a large legacy application or with a team of any decent size. Everyone being roughly (sans tabs vs. spaces religious views) on the same page can go a long way in terms of productivity. Things look the same. Things feel the same. It’s great. Another benefit of consistency is that it makes inconsistencies stick out like sore thumbs. Something out of line or that <em>just looks wrong</em> and can swiftly be identified, corrected, etc. </p>

<p>Inconsistencies related to style and naming typically don't matter in the grand scheme of things. Compilers will usually just eat them up and they'll vanish into the abyss. But when those inconsistencies extend into the actual code itself and implementation details, that's when things can get dangerous. This post is a tale of one such inconsistency, which seemed innocuous at first glance, but eventually festered into something nasty to track down.</p>

<h2 id="settingthestage">Setting the Stage</h2>

<p>To really appreciate just how annoying this issue was, it's worth setting the stage a little. It involves two major components:</p>

<ul>
<li><strong>.NET Producer</strong> - This is a basic .NET console application that reads data from a source and produces messages to send up to Kafka, which does all sorts of magic downstream. </li>
<li><strong>Kafka Streams Consumer</strong> - This is just an application that handles receiving messages from the producer to perform some enrichment processes (i.e. join the messages with another data source) downstream.</li>
</ul>

<p>Without getting too much into the weeds, you just need to know that when messages are produced, they have a key associated with them. These keys uniquely identify each message, and Kafka uses them when it's determining which partition a given message should live on in a distributed environment. <strong>Partitions are important to the story as well: since Kafka is distributed by nature, a given key should only exist on a single partition in the entire environment.</strong></p>

<p>This use case in the Kafka world is a pretty common one. There was no magic going on. Everything was a very vanilla set-up using out of the box / recommended settings. And shortly after running it, it <em>seemed</em> to be working as expected. Thousands upon thousands of messages flowing through per second, data flowing into the final, enriched landing ground.</p>

<p>The process ran overnight, but when I awoke to check the data, it was clear something was very wrong. All of the data was making its way from the producer to the consumer, logs indicated that the appropriate keys were present where they needed to be, but it appeared that the joins were failing.</p>

<p>That's no good.</p>

<h2 id="investigatingthedata">Investigating the Data</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/investigate.gif" alt="When Random Isn't the Right Random"></p>

<p>Let's consider an analogy that might make this more familiar to folks with database (and not streaming) backgrounds:</p>

<blockquote>
  <p>You have an imaginary database with two identical tables.</p>
  
  <p>You attempt to join these two tables on their keys, which are the exact same in each.</p>
  
  <p>The join succeeds and returns ... nothing ... well ... sometimes.</p>
</blockquote>

<p>Knowing that the joins were failing, I was a bit baffled. Some records were flowing through the pipeline past the join operations, but it didn't make any sense. The keys were there, I was sure of it. So, I decided to take a subset of the data and look at it a bit more carefully to make sure I wasn't going crazy:</p>

<table>  
    <thead>
        <th>Source A (Producer)</th>
        <th>Source B (Consumer)</th>
    </thead>
    <tbody>
        <tr>
            <td>mawjuG0B9k3AiALz0_2S</td>
            <td>0q0juG0B9k3AiALz8ApP</td>
        </tr>
        <tr>
            <td>xEEcv20B9k3AiALzEN0m</td>
            <td>m60juG0B9k3AiALz5gU5</td>
        </tr>
        <tr>
            <td>ua0juG0B9k3AiALz7wqa</td>
            <td>ua0juG0B9k3AiALz7wqa</td>
        </tr>
        <tr>
            <td>m60juG0B9k3AiALz5gU5</td>
            <td>xEEcv20B9k3AiALzEN0m</td>
        </tr>
        <tr>
            <td>0q0juG0B9k3AiALz8ApP</td>
            <td>mawjuG0B9k3AiALz0_2S</td>
        </tr>
        <tr>
            <td>...</td>
            <td>...</td>
        </tr>
    </tbody> 
</table>

<p>With this very small subset, which was reflective of the overall data, I verified that out of over a million pairs of records, each key was present in both of the sources being joined. Next, I tried an experiment with a very, very small subset of 25 records to see just how many made it through the pipeline and successfully joined: 5.</p>

<p>Now why would such a small fraction of the records make it through the entire processing pipeline and others not? It didn't make sense. It's almost as if it was <em>random</em>.</p>

<p>It was.</p>

<h2 id="distributedstuffishard">Distributed Stuff is Hard</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/thinker.jpg" alt="When Random Isn't the Right Random"></p>

<p>After banging my head for hours upon hours and burning the late-night oil wondering just what might be wrong, a colleague mentioned just how <em>random</em> the issue seemed and it hit me:</p>

<blockquote>
  <p>It was random, but just not the kind of random I was looking for.</p>
</blockquote>

<p>One of the challenges of working with Kafka is that it's intended to be used in distributed environments. The ability to divvy up messages across multiple nodes allows incredible performance, resiliency, and the ability to easily scale to suit your needs without missing a beat. But just how does Kafka manage to scale so well? The answer: partitioning.</p>

<p>Kafka by default handles divvying up work across multiple partitions and/or nodes by using an algorithm that peeks at the key for a given record and delegates it to a partition:</p>

<pre><code class="language-java">// How Kafka handles delegating messages across partitions
return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;  
</code></pre>

<p>As you can see, it takes your message key, hashes it, and takes that hash modulo the number of partitions you have, and magically you have a partition for your record. <strong>Since this process is deterministic and dependent on the key, it ensures that a given key is <em>always</em> assigned to the same partition.</strong> So, we had to investigate a bit further: instead of looking at the joins that were failing, we'd focus on those that were succeeding.</p>
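<p>That determinism claim is easy to sanity-check in miniature. The Java sketch below mimics the shape of Kafka's assignment (positive hash modulo partition count), but note that it uses <code>String.hashCode()</code> as a stand-in for murmur2, so the actual partition numbers won't match Kafka's:</p>

```java
public class PartitionDemo {
    // Simplified stand-in for Kafka's partition assignment: hash the key,
    // force the result positive, then take it modulo the partition count.
    // (Kafka really uses murmur2; String.hashCode() here is only a stand-in.)
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "mawjuG0B9k3AiALz0_2S";
        int first = partitionFor(key, 12);
        // Deterministic: every message carrying this key lands on the same
        // partition, no matter how many times we compute the assignment.
        for (int i = 0; i < 1_000; i++) {
            if (first != partitionFor(key, 12)) {
                throw new AssertionError("partitioning was not deterministic");
            }
        }
        System.out.println(key + " -> partition " + first);
    }
}
```

<p>The important property is that the partition depends only on the key's bytes and the partition count, never on which process computes it.</p>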

<p>Bingo! After analyzing all the data in the previous subset, I found that all five of the successful joins had the same key present <em>on the same partition</em>:</p>

<table>  
    <thead>
        <th>Key</th>
        <th>Partition A</th>
        <th>Partition B</th>
    </thead>
    <tbody>
        <tr>
            <td>mawjuG0B9k3AiALz0_2S</td>
            <td>8</td>
            <td>8</td>
        </tr>
        <tr>
            <td>xEEcv20B9k3AiALzEN0m</td>
            <td>8</td>
            <td>8</td>
        </tr>
        <tr>
            <td>ua0juG0B9k3AiALz7wqa</td>
            <td>6</td>
            <td>6</td>
        </tr>
        <tr>
            <td>m60juG0B9k3AiALz5gU5</td>
            <td>1</td>
            <td>1</td>
        </tr>
        <tr>
            <td>0q0juG0B9k3AiALz8ApP</td>
            <td>3</td>
            <td>3</td>
        </tr>
    </tbody> 
</table>

<p>So why were some of the keys present on the same partitions while others weren't? There didn't appear to be any rhyme or reason behind which partition a given record landed on. </p>

<p>It was random and that was the problem.</p>

<h2 id="inconsistency">Inconsistency</h2>

<p>After rounds and rounds of analyzing the data, we had the following:</p>

<ul>
<li>All the data was emitted as expected from the producer application (with the appropriate keys)</li>
<li>All the data was making it into the streams / Kafka ecosystem.</li>
<li>Some of the join operations were failing, seemingly at random, despite the keys being present on both sides of the join.</li>
</ul>

<p>Random keeps coming up throughout this post, and that's important because it's the crux of this entire issue. After stepping away from the data itself and focusing on the partitioning, a breakthrough emerged. <strong>Digging into the source code revealed that the default partitioning strategy used by Kafka is the <code>murmur2_random</code> hashing algorithm. The .NET producer, however, defaults to the <code>consistent_random</code> algorithm!</strong></p>

<p>Both technologies, designed to interact with one another, had an inconsistency with how each of them partitioned specific keys. Since Kafka <em>depends</em> on a given key being on one and only one specific partition, the previously failing joins would never succeed since the keys, while the same, were not present on the same partitions.</p>
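<p>The failure mode is easy to reproduce in miniature. The Java sketch below uses two deliberately different hash strategies; both are hypothetical stand-ins (neither is the real <code>consistent_random</code> or <code>murmur2_random</code> implementation), but together they show how two individually deterministic partitioners can still disagree about where a key belongs:</p>

```java
import java.util.List;

public class PartitionerMismatch {
    static final int NUM_PARTITIONS = 7;

    // Stand-in for the producer's default strategy: a simple character-sum
    // hash. Hypothetical, for illustration only.
    static int producerPartition(String key) {
        return (key.chars().sum() & 0x7fffffff) % NUM_PARTITIONS;
    }

    // Stand-in for the downstream/consumer default strategy: Java's
    // String.hashCode(). Also hypothetical.
    static int streamsPartition(String key) {
        return (key.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        for (String key : List.of("ab", "ua0juG0B9k3AiALz7wqa")) {
            int p = producerPartition(key);
            int s = streamsPartition(key);
            // Each side is individually deterministic, yet whenever the two
            // disagree, the "same" key lives on two different partitions and
            // a co-partitioned join for that key can never match.
            System.out.println(key + ": producer=" + p + " streams=" + s
                    + (p == s ? " (join can succeed)" : " (join will fail)"));
        }
    }
}
```

<p>Only the keys that happen to hash to the same partition under both strategies will ever join, which is exactly the "5 out of 25" behavior observed above.</p>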

<p>A quick adjustment to the .NET producer application resolved the issue:</p>

<pre><code class="language-csharp">// Sets the .NET Producer to use the same partitioning strategy to be consistent with downstream Kafka partitioning
producerConfiguration.Partitioner = Partitioner.Murmur2Random;  
</code></pre>

<p>After setting that single property and reprocessing all my data: an immediate world of difference. Every join was succeeding, and the entire pipeline was up and running just as intended. Life was good again. It's easy to look back and smile now that the solution turned out to be so simple; <a href="https://xkcd.com">even the folks at XKCD</a> had figured out a partitioning strategy that would have worked better:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/random_number.png" alt="When Random Isn't the Right Random"></p>

<p>At least that would have ensured all the keys ended up in their same respective partitions.</p>

<p>But in the real world, at some point there was a disconnect. Some silly miscommunication or issue resulted in this inconsistency and led me down a rabbit-hole of heartache, confusion, and doubt. These weren't explicit configuration settings - these were defaults.</p>

<p>This is why consistency is important. </p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Why is My SQL Server Query Slow?]]></title><description><![CDATA[<p>As applications and their associated databases grow, things change. Rows get modified, schemas get updated, and often, things can slow down. These performance hits may come up suddenly, or may be intermittent, but it's important to know how to distinguish what is going on so that you can go about fixing</p>]]></description><link>https://rionghost.azurewebsites.net/2020/01/02/why-is-my-sql-server-query-slow/</link><guid isPermaLink="false">c02cd56a-06d3-4018-ad67-9430c4dff999</guid><category><![CDATA[sql-server]]></category><category><![CDATA[sql]]></category><category><![CDATA[performance]]></category><category><![CDATA[debugging]]></category><category><![CDATA[troubleshooting]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Thu, 02 Jan 2020 19:15:00 GMT</pubDate><media:content url="http://rion.io/content/images/2020/01/sloth.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/01/sloth.jpg" alt="Why is My SQL Server Query Slow?"><p>As applications and their associated databases grow, things change. Rows get modified, schemas get updated, and often, things can slow down. These performance hits may come up suddenly, or may be intermittent, but it's important to know how to distinguish what is going on so that you can go about fixing it.</p>

<p>There are so many factors that can contribute to performance issues in the SQL world that it can be downright baffling where to even begin. This post delves into some of the steps that you can take when your database is acting silly, so that you can diagnose exactly what is going wrong and focus on speeding things up.</p>

<h2 id="isitthedatabaseortheserver">Is it the Database? Or the Server?</h2>

<p>One of the first things to check when a query is running sluggishly isn't actually the database itself, but rather possible issues with the server that it's running on.</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/high-cpu.jpg" alt="Why is My SQL Server Query Slow?"></p>

<p>If you have access to the server, you might consider using a tool like <strong>System Monitor</strong> or <strong>Perfmon</strong> to look at the processes that are running on the server itself. High memory usage, high CPU usage, and slow network traffic are all red flags that the issue may not be the database itself.</p>

<p>There are plenty of other system diagnostics tools out there that can help with this process, but one of these should at least point you in the right direction before you spend countless hours staring at tables, indexes, and more.</p>

<h2 id="profilingandreportingtotherescue">Profiling and Reporting to the Rescue!</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/sql-profiler.png" alt="Why is My SQL Server Query Slow?"></p>

<p>SQL Server has a profiling tool that is great for both troubleshooting existing issues and isolating slow-running queries. It's also a fantastic way to get some additional data about your database for performance-tuning purposes (i.e. see which queries are being executed most frequently, look at things like memory and CPU usage, identify potential blocks, etc.).</p>

<p>A few things that you should look for here:</p>

<ul>
<li><strong>Long-running queries</strong> - Look for any queries that jump out as outliers from the rest of your calls; they could be indicators that something may not be working as intended. </li>
<li><strong>CPU intensive queries</strong> - Look for any queries that may be pegging the CPU, which could be issues on the machine itself, poorly written queries, or a combination of things.</li>
<li><strong>Check for Deadlocks</strong> - Look for any mentions of the term "Lock", which might indicate that two or more transactions are causing a deadlock. When this occurs, you can dig further into it by finding those that mention "Deadlock Chain", which indicates they are events that lead to the deadlock.</li>
</ul>

<p>The profiler is infinitely useful and a great first step into diagnosing poor performance (or bugs in general), and it's useful in both development and production environments (since it can remotely target an existing database).</p>

<p>Additionally, if you know something like CPU is your bottleneck, you can take advantage of pre-built performance queries to attempt to isolate a specific query or queries that might be causing your issues:</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/reporting.PNG" alt="Why is My SQL Server Query Slow?"></p>

<p>These various performance related queries can be worth reviewing periodically to find potential issues that may creep up over time.</p>

<h2 id="foundtheculpritnowfigureout_why_itsslow">Found the culprit? Now figure out <em>why</em> it's slow.</h2>

<p>Once you've identified a particular query that is running slowly within the profiler, you can view the execution plan for it to see exactly what SQL Server is doing behind the scenes. <strong>You can easily access this from a given query via the <code>Ctrl+M</code> shortcut, the "Include Actual Execution Plan" button on the toolbar, or from the Query menu:</strong></p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/execution-plan.png" alt="Why is My SQL Server Query Slow?"></p>

<p>Execution plans can easily reveal potential issues related to indexing and <a href="https://dba.stackexchange.com/a/217983">any non-SARGable queries</a> that might be taking place, which are causing your entire table to be scanned as opposed to seeking exactly what it needs. At first, they may seem incredibly complex (and they can be), but once you have worked with them, you'll learn to identify patterns and what they are associated with (e.g. X operation indicates a missing index on a table, etc.)</p>

<p>A few things to look for here:</p>

<ul>
<li><strong>Look for Warnings</strong> - Warnings like "No Join Predicate" should be very apparent red-flags that you'll likely need to address. Most warnings should warrant further investigation at the very least.</li>
<li><strong>Order of Operations (Costs)</strong> - Consider ordering the most costly operations and determine if those <em>make</em> sense. Is a simple join using up 90% of the compute for the entire call? If so, something might be wrong.</li>
<li><strong>Scans vs. Seeks</strong> - Neither of these are necessarily bad, but if one is taking much longer than expected (either one), it's probably worth determining if you are missing an index (i.e. SQL Server is scanning the entire table instead of just grabbing a well-defined lookup value).</li>
</ul>

<p>Understanding execution plans comes with experience, and hopefully, you don't have to delve into them too frequently. But if you do, know they can be a valuable ally in the fight against bad performance.</p>

<h2 id="investigatepotentialbadmissingindices">Investigate Potential Bad/Missing Indices</h2>

<p>SQL indexes are the bread and butter of performance tuning within your database. You can think of them much like the indexes you might find within a book, as they'll allow SQL to "know" where to go looking for a particular piece of data instead of arbitrarily thumbing through page after page.</p>
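<p>The book analogy can be made concrete with a small sketch (hypothetical data, not SQL Server internals): without an index, the engine must touch every row like a linear walk, while an index seek behaves like a binary search over sorted keys:</p>

```java
public class SeekVsScan {
    // Table scan analogue: walk every row until the target is found,
    // counting how many rows we had to look at.
    static int scan(int[] rows, int target) {
        int comparisons = 0;
        for (int row : rows) {
            comparisons++;
            if (row == target) break;
        }
        return comparisons;
    }

    // Index seek analogue: binary search over sorted keys, again counting
    // comparisons. An index lets the engine jump straight toward the value.
    static int seek(int[] sortedRows, int target) {
        int lo = 0, hi = sortedRows.length - 1, comparisons = 0;
        while (lo <= hi) {
            comparisons++;
            int mid = (lo + hi) >>> 1;
            if (sortedRows[mid] == target) break;
            if (sortedRows[mid] < target) lo = mid + 1; else hi = mid - 1;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // A "table" of one million sorted keys.
        int[] rows = new int[1_000_000];
        for (int i = 0; i < rows.length; i++) rows[i] = i;
        System.out.println("scan comparisons: " + scan(rows, 999_999));
        System.out.println("seek comparisons: " + seek(rows, 999_999));
    }
}
```

<p>On a million rows, the scan touches every row in the worst case while the seek needs only around twenty comparisons, which is the intuition behind why a missing index can turn a millisecond lookup into a multi-second crawl.</p>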

<blockquote>
  <p>It's worth noting that indexes aren't "free" and, as with most things, if you use them incorrectly, they can do more harm than good (i.e. imagine going on a scavenger hunt with extraordinarily vague or crappy hints). </p>
</blockquote>

<p>One great thing is that you don't always have to do all of this grunt work yourself. Folks like Brent Ozar have built scripts like sp_BlitzIndex that you can run against your database to recommend indexes based on table usage, etc. A few other tools like the following can be useful as well:</p>

<ul>
<li><a href="https://docs.microsoft.com/en-us/sql/relational-databases/performance/start-and-use-the-database-engine-tuning-advisor?view=sql-server-ver15">Database Engine Tuning Advisor</a></li>
<li><a href="https://www.brentozar.com/blitzindex/">Brent Ozar's sp_BlitzIndex</a></li>
</ul>

<p>In general, <a href="https://www.brentozar.com/blog/">Brent's blog is a treasure-trove of knowledge on SQL Server performance, troubleshooting, and much, much more</a>. I highly recommend bookmarking it if you live in the SQL world (or even dabble in it).</p>

<h2 id="liesdamnliesandstatistics">Lies, Damn Lies, and Statistics.</h2>

<p>As much as it can be valuable to know how to optimize SQL queries, it's worth noting that SQL Server does some of this on its own for better or worse. It accomplishes this through the use of statistics by monitoring calls that are being made, caching execution plans, and making judgements on how a given call could/should be best executed.</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/histogram.gif" alt="Why is My SQL Server Query Slow?"></p>

<p>Notice I mentioned "for better or worse" and that's intentional. While statistics are incredibly valuable, they can also create problems if they aren't being gathered correctly or they are stale. In most scenarios, you probably won't have to dig into these with any kind of regularity, but it's important to know that they exist.</p>

<p>I highly recommend visiting <a href="https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver15">the Microsoft documentation on SQL Server statistics</a> to learn a bit more about them, how to update them, and more.</p>

<h2 id="useyoureyesorsomeoneelses">Use Your Eyes (or Someone Else's)</h2>

<p>If the ill-performing query in question is something that has been untouched for a long period of time, you might want to get a "fresh" set of eyes on it. Look for any noticeably slow operations (e.g. wildcard searches, large aggregation calls, etc.) that could turn those seeks into scans.</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2020/01/second-set-of-eyes.gif" alt="Why is My SQL Server Query Slow?"></p>

<p>Additionally, you might consider if you are using a more recent version of SQL Server than the query was originally written to target to see if any new features might be more efficient to perform your operations.</p>

<h2 id="othercommonscenarios">Other Common Scenarios</h2>

<p>A few other things you might look into, which could be considered "edge cases", include:</p>

<ul>
<li><strong>Sharding/Partitioning</strong> - For extremely large databases, you might consider an approach like <a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding">sharding</a> or partitioning your databases.</li>
<li><strong>Leveraging Views</strong> - In some cases, the use of a SQL View can help make your queries much more readable and allow you to identify potential issues in a smaller wall of text.</li>
<li><strong>Distributed Transactions</strong> - Distributed programming is hard, and if you are doing anything crazy such as making connections to remote databases via Linked Servers, this could present challenges on both sides of the equation. </li>
</ul>

<p>As with just about every other branch of technologies, it’s very unlikely that just one of these items will cure all of your performance ailments. Consider all of these to be valuable tools within your troubleshooting arsenal and use them in conjunction when trouble arises.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Happy New ... C# 9 Features!]]></title><description><![CDATA[<p>As one of my languages du jour, I've always had a fondness for C#. It was one of the first high-level languages that I learned in college and it's been part of my daily professional life for the better part of ten years. One of the things that's always been</p>]]></description><link>https://rionghost.azurewebsites.net/2020/01/01/happy-new-c-sharp-9-features/</link><guid isPermaLink="false">b703b91b-fd63-41ea-a5e0-279d85ff5b28</guid><category><![CDATA[c#]]></category><category><![CDATA[asp.net]]></category><category><![CDATA[productivity]]></category><category><![CDATA[learning]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Wed, 01 Jan 2020 20:33:06 GMT</pubDate><media:content url="http://rion.io/content/images/2020/01/happy-new-csharp.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2020/01/happy-new-csharp.jpg" alt="Happy New ... C# 9 Features!"><p>As one of my languages du jour, I've always had a fondness for C#. It was one of the first high-level languages that I learned in college and it's been part of my daily professional life for the better part of ten years. One of the things that's always been enjoyable about the language is the concept of constant innovation.</p>

<p>Features are constantly being adapted from other languages (looking at you F#, Kotlin, Typescript, and more) to help add features that would be greatly desired, extend existing functionality, or just apply an ample sprinkling of syntax sugar. I thought with the new year just beginning, let's take a look at some of the proposed features that are slated to find their way into the flagship Microsoft language in the near future!</p>

<blockquote>
  <p>It's worth noting that this is by no means a complete list, and everything in it, as with all things in active development, is subject to change or may never happen at all.</p>
</blockquote>

<p>So, let's get into it!</p>

<h2 id="yourresultsmaycovary">Your Results May Covary</h2>

<p>Have you ever found yourself just <em>inventing</em> new methods in a derived class so you could return its own type, since C# has always restricted an override to the exact return type declared on the base method? <strong><a href="https://github.com/dotnet/csharplang/blob/master/proposals/covariant-returns.md">This proposal outlines support for covariant return types</a>, which would allow you to override an existing method with a <em>more specific return type</em> than the original method.</strong></p>

<pre><code class="language-csharp">class Animal  
{
    public virtual Animal GenerateFromEnvironment(EnvironmentConfiguration config)
        =&gt; new Animal();
}
</code></pre>

<p>Now previously, if you wanted to extend your <code>Animal</code> class to support something like a <code>Dinosaur</code>, you'd have to either write some hacky bridge method or throw some logic in there to specifically check for the existence of a Dinosaur. Covariant returns will allow you to override the same base methods but return the more specific implementation you are looking for from your class:</p>

<pre><code class="language-csharp">class Dinosaur : Animal  
{
    // Notice the more specific return type overriding the virtual method
    public override Dinosaur GenerateFromEnvironment(EnvironmentConfiguration config)
        =&gt; new Dinosaur();
}
</code></pre>
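<p>With that override in place, callers would get the derived type back directly with no cast (a hypothetical usage sketch; remember, the proposal hadn't shipped at the time of writing):</p>

<pre><code class="language-csharp">var config = new EnvironmentConfiguration();

// The override's declared return type is Dinosaur, so no cast from Animal is needed
Dinosaur rex = new Dinosaur().GenerateFromEnvironment(config);
</code></pre>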

<h2 id="anewnew">A "New" New</h2>

<p>Often the <code>var</code> keyword can be great if you don't want to go to the effort of explicitly defining a type for a variable. You already know what it's going to be, so it's convenient to just keep things short. <strong><a href="https://github.com/dotnet/csharplang/blob/master/proposals/target-typed-new.md">This proposal follows that same rough premise by introducing a new usage for the word 'new'</a>, specifically target-typed new expressions, which allow you to completely forego the type specification for constructors when the type is already known.</strong></p>

<p>Let's say you had some complex collection like the following that you wanted to initialize upon declaration:</p>

<pre><code class="language-csharp">Dictionary&lt;string, List&lt;string&gt;&gt; words = new Dictionary&lt;string, List&lt;string&gt;&gt;() {  
    { "foo", new List&lt;string&gt;(){ "foo", "bar", "buzz" } }
};
</code></pre>

<p>This proposal simplifies the use of new expressions since <em>it already knows the types</em>, so instead of using <code>new Dictionary&lt;string, List&lt;string&gt;&gt;()</code>, you can just use <code>new()</code>:</p>

<pre><code class="language-csharp">Dictionary&lt;string, List&lt;string&gt;&gt; words = new() {  
    { "foo", new() { "foo", "bar", "buzz" } }
};
</code></pre>
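<p>The same target-typed form would apply anywhere the type is already known from context, such as field initializers and return statements (a sketch based on the proposal; <code>Widget</code> is a hypothetical type):</p>

<pre><code class="language-csharp">private List&lt;Widget&gt; _widgets = new();

Widget MakeDefault() =&gt; new();
</code></pre>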

<h2 id="nullchecksdelkotlinifieddelnomore">Null Checks <del>Kotlinified</del> No More</h2>

<p>No one really likes writing countless null checks. Lines upon lines of if statements with exceptions being thrown; it's gross. In a vein similar to the previous feature, <strong><a href="https://github.com/dotnet/csharplang/issues/2145">this proposal</a> aims to eliminate the need for explicit null checking by allowing a <code>!</code> operator to be added to a parameter to indicate to the compiler that a given value <em>will not</em> be null</strong>.</p>

<p>It would transform a snippet like the following:</p>

<pre><code class="language-csharp">int CountWords(string sentence)  
{
    if (sentence is null) 
    {
        throw new ArgumentNullException(nameof(sentence));
    }

    // Omitted for brevity
}
</code></pre>

<p>Into a much terser, uncluttered form:</p>

<pre><code class="language-csharp">// Notice the trailing '!' after the parameter name, which indicates
// it will not be null
int CountWords(string sentence!)  
{
    // Omitted for brevity
}
</code></pre>
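<p>In the meantime, the throw expression introduced back in C# 7 can already collapse the guard down to a single line (a sketch, using a discard assignment):</p>

<pre><code class="language-csharp">int CountWords(string sentence)  
{
    // Collapse the null guard into a single discard assignment
    _ = sentence ?? throw new ArgumentNullException(nameof(sentence));

    return sentence.Split(' ').Length;
}
</code></pre>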

<h2 id="lettheconstructordothework">Let the Constructor Do the Work</h2>

<p><a href="https://github.com/dotnet/csharplang/blob/master/proposals/primary-constructors.md">Another proposal, one that has been in some phase of discussion since C# 6, is the idea of primary constructors</a>, a feature found in languages like TypeScript and Kotlin. <strong>The basic idea behind primary constructors is that they would simplify writing class constructors and boilerplate code in general by implicitly creating private fields from the arguments passed to the constructor itself.</strong></p>

<p>Let's look at an example of a class with a few private, readonly properties:</p>

<pre><code class="language-csharp">public class Widget  
{
    private readonly int _foo;
    private readonly WidgetConfiguration _config;

    public Widget(int foo, WidgetConfiguration config)
    {
         _foo = foo;
         _config = config;
    }
}
</code></pre>

<p>The proposal would remove the need for the boilerplate field declarations, as the constructor itself would handle capturing the arguments passed in as fields:</p>

<pre><code class="language-csharp">public class Widget  
{
     public Widget(int _foo, WidgetConfiguration _config)
     {
          // If you wanted one of these properties to be publicly accessible, you could define
          // and set one of those here, otherwise the arguments will be privately accessible
          // as fields.
     }
}
</code></pre>
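<p>The proposal also sketches an even terser form, with the parameter list declared directly on the class itself and the parameters captured for use throughout its body (a sketch of the proposed syntax, not something a current compiler accepts; <code>Scale</code> is a hypothetical property):</p>

<pre><code class="language-csharp">public class Widget(int foo, WidgetConfiguration config)  
{
    // 'foo' and 'config' are captured and usable anywhere in the class body
    public int Scaled =&gt; foo * config.Scale;
}
</code></pre>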

<h2 id="ifthatwasntsimpleenoughhowaboutrecords">If That Wasn't Simple Enough... How About Records?</h2>

<p>If you enjoyed the last proposed feature, then you may find yourself doing a <a href="https://github.com/dotnet/csharplang/blob/master/proposals/records.md">double-take with this one</a>. Another proposal that has been toyed with for a while is the concept of records. <strong>Records are a simplified form for declaring classes and structs that supports some new features for working with them (such as caller-receiver parameters and with expressions).</strong></p>

<pre><code class="language-csharp">public class Widget  
{
    // Properties (business as usual)
    public readonly string Foo;
    public readonly string Bar;

    public Widget(string foo, string bar) =&gt; (Foo, Bar) = (foo, bar);

    // Definition of a With expression, which will allow you to 
    // easily create instances of a widget from an existing instance
    public Widget With(string foo = this.Foo, string bar = this.Bar) =&gt; new Widget(foo, bar);
}
</code></pre>

<p>We can see this demonstrated below:</p>

<pre><code class="language-csharp">var existingWidget = new Widget("foo", "bar");

// Now create a brand new instance with one of the properties changed
var clonedWidget = existingWidget.With(bar: "buzz");

// At this point clonedWidget looks like this: { Foo = "foo", Bar = "buzz" }
</code></pre>

<p><strong>Records will also support the use of positional pattern matching along with deconstruction to do some things like this</strong>:</p>

<pre><code class="language-csharp">var widget = new Widget("foo", "bar");

// If the widget has its Foo property set to "foo", then this condition 
// will be met
if (widget is Widget("foo", var bar)) {  
    // Perform some operation here on the widget; the use of var above
    // functions as deconstruction, so you can access bar within this scope
    Console.WriteLine(bar); 
}
</code></pre>

<p>Records are one of the more involved features being proposed in C# 9, so if you are curious or want to learn more about them, <a href="https://github.com/dotnet/csharplang/blob/master/proposals/records.md">I'd highly recommend checking out the full proposal on them here</a>, which contains a wide range of examples, use cases, and more.</p>

<h2 id="switchingitupvisualbasicstyle">Switching It Up ... Visual Basic Style?</h2>

<p>One of the <em>very</em> few things that you'll hear come up from a developer who has recently switched from Visual Basic to C# is the switch statement. <strong>Despite C# undergoing nearly nine entire revisions, comparison operations within switch statements have never been supported (although pattern matching comes close); <a href="https://github.com/dotnet/csharplang/issues/812">this proposal</a> aims to remedy that.</strong></p>

<pre><code class="language-csharp">var internalTemperature = GetInternalTemp(yourDinner);  
switch (internalTemperature)  
{
    case &lt;= 110:
        // Very rare
        break;
    case &lt;= 125:
        // Rare
        break;
    case &lt;= 135:
        // Perfect
        break;
    case &lt;= 145:
        // Medium
        break;
    case &lt;= 155:
        // What are you doing?
        break;
    case &gt; 155:
        // What have you done...
        break;
}
</code></pre>
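<p>Until relational cases like these land in the language, a C# 8 switch expression with <code>when</code> guards can already get close (a sketch; <code>DescribeDoneness</code> is a hypothetical helper):</p>

<pre><code class="language-csharp">static string DescribeDoneness(int temp) =&gt; temp switch  
{
    var t when t &lt;= 110 =&gt; "Very rare",
    var t when t &lt;= 125 =&gt; "Rare",
    var t when t &lt;= 135 =&gt; "Perfect",
    var t when t &lt;= 145 =&gt; "Medium",
    var t when t &lt;= 155 =&gt; "What are you doing?",
    _ =&gt; "What have you done..."
};
</code></pre>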

<h2 id="butwaittheresmore">But Wait - There's More!</h2>

<p>The features covered in this post are some of the more practical ones that might see everyday use, but they are by no means all of the current proposals. <a href="https://github.com/dotnet/csharplang/milestone/15">If you are interested in what else might be on the menu, I'd encourage you to check out the related milestone for C# 9 on GitHub</a>. </p>

<p>There you'll find tons of additional features that weren't mentioned here such as:</p>

<ul>
<li><a href="https://github.com/dotnet/csharplang/issues/113">Discriminated Unions</a></li>
<li><a href="https://github.com/dotnet/csharplang/issues/33">Nullable-Enhanced Common Type</a></li>
<li><a href="https://github.com/dotnet/csharplang/issues/1398">Defer Statements</a></li>
<li><a href="https://github.com/dotnet/csharplang/issues/111">Lambda Discard Parameters</a></li>
<li><a href="https://github.com/dotnet/csharplang/issues/435">Native-Sized Number Types</a></li>
</ul>

<p>Finally, the C# language is open source and is always looking for contributors, opinions, and folks who love the language to share their thoughts on what could make it better.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[Just Jump into the Stream]]></title><description><![CDATA[<p>I've spent the better part of my ten-year career as a developer in a relatively safe bubble, technology-wise. </p>

<p>Falling in love with a programming language or technology is a bit like the software version of having your first <em>real</em> girlfriend/boyfriend. They might change the ways that you think, the</p>]]></description><link>https://rionghost.azurewebsites.net/2019/12/29/just-jump-into-the-stream/</link><guid isPermaLink="false">83299893-d427-4af6-972e-09471e2efb1a</guid><category><![CDATA[best-practices]]></category><category><![CDATA[random]]></category><category><![CDATA[learning]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sun, 29 Dec 2019 19:17:30 GMT</pubDate><media:content url="http://rion.io/content/images/2019/12/dive.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2019/12/dive.jpg" alt="Just Jump into the Stream"><p>I've spent the better part of my ten-year career as a developer in a relatively safe bubble, technology-wise. </p>

<p>Falling in love with a programming language or technology is a bit like the software version of having your first <em>real</em> girlfriend/boyfriend. They might change the ways that you think and the related technology choices that you make, and you might not realize what they are (and are not) good at until you try something else.</p>

<p>I was incredibly fortunate to be introduced to C# in my early computer science courses in college and fell in love with it (at least compared to some of the other languages being taught at the time). The language stuck with me throughout college and I used it whenever the opportunity presented itself. Plenty of languages were sprinkled in during my time there: C, C++, Python, Java, Lisp, Visual Basic, x86 Assembly, and countless others, but C# always just <em>felt</em> the best. </p>

<p>At any rate, eventually it landed me my first intern position at a petrochemical company where it was the language du jour and I fit right in. It was here that I was introduced to the rest of the Microsoft Stack of:</p>

<ul>
<li>C#</li>
<li>ASP.NET</li>
<li>SQL Server</li>
<li>IIS</li>
</ul>

<p>These technologies have been a huge part of my career at every position I've held as a professional and in any major side projects or consulting work I've been a part of. There have been plenty of other technologies and languages along the way, but these few have been at the forefront ... at least until recently.</p>

<p>A project came along that required the need for something <em>different</em>. <strong>It was something that I had never worked with previously, using technologies that I hadn't touched since exercises in college (or ever), and they would be sure to burst any safe bubble that I had been living in, in a good way.</strong></p>

<h2 id="bubbleburst">Bubble Burst</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/bubbles.gif" alt="Just Jump into the Stream"></p>

<p>In a nutshell, the project required building a real-time streaming infrastructure to replace long-running batch processing jobs for a legacy application. Languages would change, technologies would change, databases and storage would change, and finally the entire programming paradigm itself would change.</p>

<p>The new stack would look something like this:</p>

<ul>
<li>Kotlin</li>
<li>Apache Kafka</li>
<li>Postgres</li>
<li>IntelliJ</li>
</ul>

<p>As you can see, not a ton of overlap with the previous stack, and by not a ton, I mean <em>none</em>. I had never written a line of Kotlin, downloaded IntelliJ, touched Postgres, or been in the same building as Kafka. Being an experienced engineer, I didn't have much concern for a few of them. Languages come and go and once you learn one, it's pretty darn easy to pick up another. IDEs, meh, basically the same way. A new database technology? Just a bit of new syntax. All these changes are totally manageable and basically negligible.</p>

<p>Except streaming. Streaming was a whole new paradigm to wrap my head around. Calls and operations weren't as procedural as in most applications, you had to think about <em>when</em> code was being executed, and any "race conditions" that you had encountered earlier in your career immediately seemed tame, as everything is much, much crazier in the streaming world.</p>

<blockquote>
  <p>In a nutshell, I was taking everything I had done previously in my career and <em>voluntarily</em> throwing it out the window to shift to an entirely different stack, programming paradigm, language, etc. I suppose it could have been terrifying, but of all things it was... exciting.</p>
</blockquote>

<p>Pivoting over to a Java-oriented ecosystem was different. A different build system (Gradle/Maven) had a bit of a learning curve, but after dealing with countless JavaScript frameworks, it was a walk in the park. <strong>The key takeaway is that you'd think, with all of these changes, this would be a scary proposition, but if you enjoy learning and enjoy becoming a more well-rounded developer, you'd probably be just as excited as I was to get started.</strong></p>

<h2 id="dontdipyourtoeindive">Don't Dip Your Toe In, Dive!</h2>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/not-like-this.gif" alt="Just Jump into the Stream"></p>

<p>As I mentioned earlier, my entire career has been spent in various versions of the same silos. I knew C# and was <em>comfortable</em> with it, I could write it all day, in my sleep, in a box, with a fox, etc. The Microsoft stack in general was what I had lived professionally (and in my free time) to write everything in, and while I had tinkered with some languages and technologies on the side to play with, it wasn't anything like <em>actually</em> writing production applications and code for projects.</p>

<blockquote>
  <p>I'd encourage any developer, if presented with the opportunity, to step out of your comfort zone with open arms. You know what you know, but you absolutely don't know what you don't. Working with different technologies, languages, tooling, will all make you better. Software engineering is cumulative.</p>
</blockquote>

<p>The project that I reference in this post is still very much ongoing. It's still exciting (and at times frustrating), but I'm so glad that I made the decision to take the reins on it. Even after just a few months it still feels novel, probably because I spent 10+ years basically doing everything in a similar way, but most of all <em>it's fun</em>. Fun can be rare in engineering, especially if you've been doing it for a long time but having fun while learning and solving complex business problems, can be extremely rewarding, both personally and professionally.</p>

<p><strong>The opportunity to jump into real-time streaming applications might not present itself, but that's beside the point. It's about taking advantage of <em>any</em> opportunity to learn something new, or more specifically, something different.</strong> It's not just a matter of doing it, but taking the time to <em>learn</em> it. Learn why it works the way it does and what makes it so different than what you've done in the past, and carry that with you into every project that you work on, regardless of the technology or stack, in the future.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item><item><title><![CDATA[The Bug That Got Away]]></title><description><![CDATA[<p>One thing that I've always loved hearing about from fellow engineers or reading about on technical blogs are bugs. Nasty ones. Ones that keep you up at night and those that will wake you from a dead sleep. These are the ones that great stories are built upon, because like</p>]]></description><link>https://rionghost.azurewebsites.net/2019/12/21/the-bug-that-got-away/</link><guid isPermaLink="false">53078738-0033-4214-8109-b2fef513cc45</guid><category><![CDATA[debugging]]></category><category><![CDATA[random]]></category><category><![CDATA[learning]]></category><category><![CDATA[productivity]]></category><dc:creator><![CDATA[Rion Williams]]></dc:creator><pubDate>Sat, 21 Dec 2019 20:30:55 GMT</pubDate><media:content url="http://rion.io/content/images/2019/12/bug.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://rion.io/content/images/2019/12/bug.jpg" alt="The Bug That Got Away"><p>One thing that I've always loved hearing about from fellow engineers or reading about on technical blogs are bugs. Nasty ones. Ones that keep you up at night and those that will wake you from a dead sleep. These are the ones that great stories are built upon, because like many great stories, they have all of the pieces: </p>

<ul>
<li><strong>Exposition</strong> - Ah crap! There's a bug in here somewhere.</li>
<li><strong>Rising Action</strong> - Let's dig into this and see how widespread it is and how we'll mitigate it.</li>
<li><strong>Climax</strong> - The "Eureka!" moment when you've narrowed down the exact cause of the bug.</li>
<li><strong>Falling Action</strong> - Implementing a fix, verifying it fixes the issue.</li>
<li><strong>Resolution</strong> - Merging the fix into source control, knowing the bug will be gone (forever)!</li>
</ul>

<p>There's an extreme satisfaction to be found in a good bug. The exploration, the thrill of the chase, and finally catching that bug red-handed and putting an end to it with extreme prejudice. </p>

<p>Unfortunately, not all tales have happy endings; sometimes the bug gets away.</p>

<h2 id="theexposition">The Exposition</h2>

<p>This particular tale begins as most bug stories do: with a legacy software system. There isn't really anything special here; an older, cobbled-together front-end, an enterprise-grade database, etc. If you've seen one, you've seen them all. </p>

<p>At any rate, just prior to an upcoming major release - I get a ping from a colleague to look at something. One of the records in the database is corrupted with some really bizarre encoding patterns. There doesn't appear to be any rhyme or reason behind them, it's just screwy and inconsistent with just about every other area of this application:</p>

<pre><code>Record A: Look everything is nice &amp; shiny!.  
Record B: Look everything is nice &amp;amp;amp; shiny!  
Record C: Look everything is nice &amp; shiny!  
</code></pre>

<p>So, upon seeing this, I said what any good developer would: "Oh, this should be a pretty simple fix."</p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/huge-mistake.gif" alt="The Bug That Got Away"></p>

<h2 id="therisingaction">The Rising Action</h2>

<p>Software engineering is full of bugs. </p>

<p>There are countless systems, big and small, that are just riddled with the things. As an engineer I know this very well, as I've contributed to my fair share of them. I've been a software engineer for over ten years or so, and I've always considered myself to be thorough, especially when it comes to tracking down a bug: the research, the deep diving, and finally: the fix. </p>

<p>As with any bug, one of the first steps to fixing it is being able to reproduce it. I spoke with our QA team and they weren't <em>immediately</em> able to reproduce it, but mentioned they would look into it further. Hours passed and I received another message, something to the effect of:</p>

<blockquote>
  <p>QA Person: Rion, I just spun up a fresh new environment and I can reproduce the issue!</p>
</blockquote>

<p>At this point, I'm excited. I had been fighting with this for over a day and I'm about to dive down the bug-fixing rabbit hole to take care of this guy. I log into the new environment, and sure enough, QA was right! I can reproduce it! I should have this thing knocked out in a matter of minutes and my day is saved!</p>

<p>Or so I thought. <strong>Roughly two hours to the minute of being able to reproduce the issue, it stops occurring.</strong> I was literally in the middle of demonstrating the issue to a colleague and minutes later, it's completely vanished. How could this be? Nothing in the environment changed, no machine or web server restarts, no configuration changes, nothing. The bug, just after a matter of hours, seems to have resolved itself.</p>

<h2 id="skippingtothelastpage">Skipping to the Last Page</h2>

<p>Normally, as part of the rising action in a story, things build and build until they reach a peak. At this point in my story, I should have figured out the root cause. The bug apparently was reproducible for a short while, but not long enough to determine the exact cause (there are lots of moving parts in this machine). So, I started adventuring, trying to find a path to climb that much higher up debugging mountain. I pulled everything out of my bag of tricks, including:</p>

<ul>
<li><strong>Examining IIS Logs</strong> - I checked through IIS logs in the production environments where the issue had occurred, in the short-term reproducible QA environment, and in my local environment. </li>
<li><strong>Examining Event Viewer Logs</strong> - Maybe there was some type of exception that was causing the web server to restart and that <em>magically</em> fixed the issue. Surely, there would be something there.</li>
<li><strong>Profiling Environments</strong> - In times when the issue was reproducible, I took advantage of the SQL Server Profiler and had logs of the <em>exact</em> calls that were being executed against the database.</li>
<li><strong>Decompiling Production Code</strong> - With a Hail Mary I attempted to decompile code from the production environment to ensure that no code changes were different and that no calls outside expectation were being made.</li>
</ul>

<p>Nothing helped. Every single new avenue I'd venture down would only further my confusion and leave me wondering what the heck could be causing the issue. After putting all of the pieces together, you could basically describe the issue as follows:</p>

<blockquote>
  <p>How could making two sets of calls, all traveling through the same endpoints, passing along the same data, executing the same queries against the same <em>exact</em> stored procedures, result in different data (one being corrupted and the other not)?</p>
</blockquote>

<p>For the first time in years, I felt defeated by a bug. I started grasping at straws, looking for race conditions, outside forces that might be affecting the code, network throttling issues; nothing. </p>

<h2 id="thebugwon">The Bug Won</h2>

<p>Many days and nights had passed. This bug was waking me up at night, I was dreaming about potential causes only to run to my computer and try them out <em>and eventually realize they didn't work</em>. Like every good engineer, I had a workaround in mind for this issue just minutes after encountering it, but I was determined to not have to end up there.</p>

<p>I had seen the issue locally, even for a fleeting moment, in several QA environments (again fleeting), and within several production environments. I had tried everything that I could think of, consulting countless peers to brainstorm the cause, but all that resulted in was spreading the bewilderment throughout the team. </p>

<blockquote>
  <p>This seemingly trivial bug had eluded every form of capture/resolution that I could think of. It left in its wake nothing but bewilderment, not only to myself, but seemingly everyone that I tried demonstrating the issue to. Eventually, much like a doctor, I had to call it. </p>
</blockquote>

<p>After over a week of my life, days and nights, being spent pursuing this bug: it won. <strong>There wouldn't be a climax, there wouldn't be a happy ending, there wouldn't be a nice warm, fuzzy feeling of accomplishment; there'd be a few lines of hacky code to fix it.</strong> </p>

<p><img src="https://rionghost.azurewebsites.net/content/images/2019/12/charlie-brown.jpg" alt="The Bug That Got Away"></p>

<p>I felt just like our friend Charlie Brown, and this bug had ripped the football away just before I'd ever get a chance to kick it.</p>

<h2 id="ithappens">It Happens</h2>

<p>The reason that I wrote this, or that it's worth writing about really has nothing to do with the bug itself. It has to do with me, and maybe even <em>you</em>. I've always considered myself great at solving problems, and thorough. I'll dig deep, keep digging, exploring, and won't stop until I can crack the problem, until in this case: I couldn't.</p>

<p><strong>Being an engineer is typically about solving problems, but more importantly, it's about being practical.</strong> I could have easily spent several more days (and nights) trying to solve this problem and figure out exactly why it was happening, but honestly, the fix for it took no longer than 5 minutes to implement. This was about being able to admit defeat. Much like <a href="http://rion.io/2018/04/07/i-dont-know/">there's nothing wrong with admitting "I don't know"</a>, there's nothing wrong with knowing when to suck up your pride and move on.</p>

<p>If you ask me today, I still don't know what caused this issue. I'll probably never know, and that's alright. I'll let this one get away and tell its friends about me. I know I'll certainly make sure to tell mine about it.</p>

<p><a href="http://www.codeproject.com" rel="tag" style="display:none;">CodeProject</a></p>]]></content:encoded></item></channel></rss>