Continuously improve your code via data
There are ample phrases coining the "Data Driven" mantra, but I believe development truly deserves the connection. Since software leads the way in data collection, it only makes sense to start there. Data-driven development focuses on using data to improve development outcomes, improve algorithm performance, and design software that captures the data needed to make decisions and analyze performance. This form of development uses data in every aspect of the software development life cycle. We collect data on coding standards, like unit test code coverage, incurred technical debt, and adherence to coding best practices. We also monitor the DevOps pipeline for speed of delivery, bottlenecks, and pain points. Considerable focus is also given to a data-first approach, where the key performance metrics for a given project are a focal point of the initial software architecture and development. We then move into algorithm performance modeling: data is collected in production and used in advanced self-learning or directed-learning models to improve an algorithm's performance in a positive feedback loop.
Each of these components is well documented and can be studied extensively on its own, but in this book we will focus on the conjoined benefits of operating the entire development life cycle through data analytics. We will go into considerable depth on positive-feedback-loop algorithm modeling, as it is relatively new and a source of considerable competitive advantage when done correctly.
Data-driven development is similar in many respects to a machine learning algorithm. The algorithm is created, trained, and let loose to see how it handles a control set. When it has completed the task, the results are analyzed and the algorithm is modified to increase accuracy. Once modified, it runs the control set again, and the results are compared to those of the previous iteration.
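The iterate-evaluate-adjust loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration; the control set, scoring rule, and candidate adjustments are all made up for the example.

```python
# Minimal sketch of the iterate-evaluate-adjust loop.
# The control set and candidate parameters below are hypothetical.

def accuracy(algorithm, control_set):
    """Fraction of control cases the algorithm gets right."""
    correct = sum(1 for inputs, expected in control_set
                  if algorithm(inputs) == expected)
    return correct / len(control_set)

def improve(make_algorithm, control_set, candidate_params):
    """Try each parameter set against the control set, keep the best."""
    best_params, best_score = None, -1.0
    for params in candidate_params:
        score = accuracy(make_algorithm(params), control_set)
        if score > best_score:          # keep only improvements
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: tune a threshold for a yes/no classifier.
control = [(0.2, "no"), (0.7, "yes"), (0.9, "yes"), (0.4, "no")]
make = lambda t: (lambda x: "yes" if x >= t else "no")
params, score = improve(make, control, [0.3, 0.5, 0.8])
print(params, score)  # -> 0.5 1.0
```

In practice the "adjustment" step is where the approaches diverge: a machine learning algorithm derives it mathematically, while in data-driven development the developer chooses the candidates.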
So what makes data-driven development different from an advanced machine learning algorithm? The developer guides the algorithm's changes. Machine learning algorithms have many shortcomings tied to the advanced mathematics that drive the iterative changes. Machine learning is very good at defining a function that fits a curve, but not every problem can be solved by a linear (or multidimensional) function. Machine learning algorithms may behave unexpectedly in uncertain environments where explicit functions may serve as a better solution.
One example of this is changes in incoming data. Machine learning is limited by the quality of the data provided; it cannot tell when new data is changing in fundamental ways. It is up to the developer to analyze, scrub, and re-scrub the data so that it makes sense for the algorithm at hand. For example, suppose an algorithm is written to understand and analyze four pieces of information, and a fifth is added. A developer can quickly understand the impact of this fifth variable and adjust the algorithm accordingly. The machine learning algorithm may continue to function the same, function less effectively, or even stop working in the face of this new information.
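A cheap guard for the four-fields-become-five scenario is to check each incoming record against the fields the algorithm was written to handle and surface anything new for review. The field names here are hypothetical:

```python
# Hedged sketch: flag fields the algorithm was never written to handle,
# so a developer can review them before they silently skew results.
# The field names are hypothetical.

EXPECTED_FIELDS = {"price", "quantity", "region", "channel"}

def unexpected_fields(record: dict) -> set:
    """Return any fields outside the algorithm's expected inputs."""
    return set(record) - EXPECTED_FIELDS

record = {"price": 9.99, "quantity": 3, "region": "NE",
          "channel": "web", "loyalty_tier": "gold"}  # a fifth field appears
extra = unexpected_fields(record)
if extra:
    print("New fields need developer review:", extra)
```

The point is not the check itself but who acts on it: the developer sees the flag, reasons about what the new variable means, and adjusts the algorithm deliberately.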
Another example is changes to the environment. A developer can analyze data and ultimately conclude there is something wrong upstream in the current process. I had an address-matching algorithm that was incredibly accurate at scrubbing and comparing user input to data in a database. After a few iterations of development, however, I noticed the incoming user input consistently had the same flaw: users were including their state and ZIP code in the street address line. This led me to conclude that a change to the user experience was needed. I worked with our front-end team and quickly rolled out a very small change to the interface: a dedicated state and ZIP field separate from the street address field. This change resulted in a twenty percent increase in accuracy because of more consistent user input.
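To make the upstream problem concrete, here is a hedged sketch of the kind of scrubbing a combined address line forces on you. This is not the algorithm from the story, just an illustration of stripping a trailing "STATE ZIP" pattern that users should never have typed into the street field:

```python
import re

# Hypothetical scrubber: users typed state and ZIP into the street line,
# so strip a trailing "XX 12345" (or ZIP+4) pattern. A dedicated form
# field removes the need for this guesswork entirely.
STATE_ZIP = re.compile(r"\s*,?\s*[A-Z]{2}\s+\d{5}(-\d{4})?\s*$")

def clean_street(street_line: str) -> str:
    """Remove a trailing state/ZIP the user should not have entered."""
    return STATE_ZIP.sub("", street_line)

print(clean_street("123 Main St, TX 75001"))  # -> 123 Main St
```

Every pattern like this is a guess about user intent; splitting the form fields replaced the guess with clean input, which is why the accuracy jump was so large.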
Lastly, machine learning algorithms still lack basic capabilities beyond linear algebra and and/or gates. When analyzing an algorithm, a developer has a much larger toolbox at their disposal. They can call upon many advanced coding features that machine learning has yet to crack; simple examples would be if/else statements, switches, or exception handling that contains business logic. That is not to say future learning algorithms will never surpass human capabilities, but as of now the developer is still king, because the developer understands the end goal. Many of these algorithms may also employ some type of machine learning, coupling human knowledge and insight with the speed and accuracy of a computer. This can lead to very sophisticated and powerful algorithms that far exceed the capabilities of humans or machines alone.
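That coupling of explicit business logic with a model can be sketched simply. Everything here is hypothetical: the fraud-scoring model is a stand-in, and the rules and thresholds are invented for illustration.

```python
# Hedged sketch: explicit business rules and exception handling wrapped
# around a model score. The model, rules, and thresholds are hypothetical.

def model_score(order: dict) -> float:
    """Stand-in for a trained model's fraud-risk score in [0, 1]."""
    return 0.9 if order["amount"] > 1000 else 0.1

def decide(order: dict) -> str:
    # Hard business rule the model cannot express on its own.
    if order["customer_blocked"]:
        return "reject"
    try:
        score = model_score(order)
    except KeyError:
        return "manual_review"      # exception handling as a safe fallback
    if score > 0.8:
        return "manual_review"      # model flags elevated risk
    return "approve"

print(decide({"amount": 50, "customer_blocked": False}))  # -> approve
```

The if/else rules encode human knowledge of the end goal; the score supplies the speed and pattern-matching of the machine.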
The advanced state of these algorithms is why we must be so diligent about coding practices. Without a firm grasp of the basics, we will find ourselves in maintenance hell for the remainder of the program's functional life. Adhering to the basic tenets discussed in the next chapter will allow a smooth transition of knowledge in the future and allow for simpler changes during each iteration.
To pull on a recent example of what not to do: I had a team member who worked solely on a key component of our infrastructure (bad practice: don't let one person be the sole owner of a project). Nothing was documented or human-readable. He was running iterative development cycles with a data-capture suite, monitoring memory consumption and response times. The end goal was to decrease the memory footprint of the internal cache while improving response times. After multiple rounds of changes spanning months, the end result was a spaghetti-code nightmare with multiple layers of caching, randomly named variables, quick notes to himself, and memory leaks galore. The project was never worked on after he left because no one could figure out what it was doing (in a reasonable amount of time). Instead, we had to put the project on its own server with a weekly restart to prevent out-of-memory errors from the leaks, and slowly sunset it. So please master the basics; then move on to data-driven development.
As a recap, data-driven development is an iterative process in which the developer uses training sets (and, in most cases, real-world data) to improve upon an algorithm. As you may have guessed, this is the point of data science: to add value to business decisions using data. The address algorithm above directly increased sales with every improvement by allowing more potential customers inside our footprint to order services. Improvements to a market segmentation algorithm can increase the accuracy of well-defined market segments, or discover a whole new segment. One of the key aspects of this approach, and the one where many businesses fall short, is the continuous, periodic review and analysis of these algorithms and the underlying data.
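The periodic review step can be as simple as comparing each iteration's accuracy to the last and flagging regressions for investigation. The tolerance value here is an arbitrary example, not a recommendation:

```python
# Hedged sketch of a periodic algorithm review: compare accuracy across
# review periods and flag regressions. The tolerance is an example value.

def review(previous_accuracy: float, current_accuracy: float,
           tolerance: float = 0.02) -> str:
    """Classify the change in accuracy between two review periods."""
    delta = current_accuracy - previous_accuracy
    if delta < -tolerance:
        return "regressed"   # investigate data or upstream changes
    if delta > tolerance:
        return "improved"
    return "stable"

print(review(0.91, 0.85))  # -> regressed
```

A "regressed" result is exactly the signal that sent me digging into the address data above; the review is cheap, but skipping it is how algorithms quietly rot.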
From a managerial standpoint, this is also a paradigm shift. Traditionally, the project owner sets the goals for the software's success as well as the timelines and developer engagement. For an e-commerce site, this might be the sales department requesting changes to button layout or font style. For customer retention, it might be working out which features would be best to provide online to reduce call volume. Data-driven development more tightly couples the development teams with the business teams to achieve faster and better results for the business goals. In many cases, it even makes sense to measure a developer's performance by the same KPIs as the teams they work with, like sales, marketing, or HR. Aligning the development team's objectives with those of the business not only gives them a better understanding of how they fit into the organization and contribute, but also increases productivity and performance based on those shared business interests. It almost seems like common sense that IT teams that better understand the business's needs and work closely with it have better results, but many companies like to place a hard and fast line between these groups.
The end result of data-driven development, then, is to provide data on performance, use that data to make fast, iterative changes, and push decision-making down as far as possible while increasing collaboration between stakeholders and developers. Easy enough, right?