"Each one of us can make a difference. Together we make change."
— Barabara Mikulski
Imagine an entire day dedicated towards getting to be a part of a global community, using your skills to contribute to projects designed to solve real world problems, interacting with and learning from remarkable women around the globe, and making a difference in the world!
Cue the drum roll 🥁: That's Open Source Day for ya!
Open Source Day (OSD) 2022 was an all-day hackathon that took place as a pre-event to the Grace Hopper Celebration (GHC) 2022. The event took place virtually on September 16, 2022 from 8am to 3pm US Pacific Time. We, the Moja Global Team, had the immense pleasure of taking part as a Project Partner in the event with our Full Lands INtegration Tool (FLINT) Project: FLINT UI and FLINT Cloud in order to help participants:
understand the importance of open source technology in revolutionizing the fight against climate change.
kick-start their own open-source journeys as contributors and mentors.
It's always a delight to participate in these events and create an amazing day for contributors to experience Open Source in a safe, supportive and fun environment!
In August 2022, Mentor-Project assignments were finalized based on the reviews by the Grace Hopper Celebration Open Source Day Committee, and three mentors were assigned to FLINT: Meenakshi Dhanani, Lahari Boni, and Abhilipsa Sahoo, who, under the guidance of the project representatives: Harsh Mishra, Padmaja Bhol, and Shloka Gupta, familiarized themselves with the repositories to prepare for the event.
Prior to the day of the event, representatives, mentors and OSD Committee Member/Moderator actively discussed their roles and responsibilities through the main communication channel and welcomed new contributors. The mentors and representatives also created new issues and triaged existing ones that would be suitable for the new contributors to the project. Care was taken that the descriptions of these issues had lots of details and helpful pointers for getting started. These issues were the labelled as beginner friendly and OSD to make them distinguishable from other issues for the beginners. Detailed guidelines were already put up in order for participants to refer to and set up their projects in their local machines.
The day began with a powerful opening ceremony that featured speakers Anne Neuberger (Deputy Assistant to the President & Deputy National Security Advisor for Cyber & Emerging Technologies at the White House), Nithya Ruff (Head of the Open Source Program Office at Amazon), and Mishi Choudhary (SVP and General Counsel at Virtru). It was amazing to hear from these exceptional women and watch them inspire a new generation to join open source software projects!
After the opening ceremony, it was time for mentors and team members to introduce themselves and give an overview of the Moja Global and FLINT. Then they covered some general information on how to make contributions to the project, and curated a list of important websites such as Moja Global's community website, Moja Global Handbook, FLINT website, in order for participants to understand how and why to use an open-source software tech such as FLINT to assess land-use emissions, FLINT's role in climate action, its functionalities as well as commercial applications, and how one can contribute to the mission of Moja Global. After that, it was time to officially start the hackathon!
During the hackathon, the Moja Global team members and mentors provided one-on-one uninterrupted guidance to the participants in setting up their development environment (in the local machine as well as on GitPod), answered questions about Git and the process of submitting and updating pull requests, helped participants pick and work on their issues, and provided feedback on their pull requests. We ended the day with merging successful pull requests from several first-timers!
Full Lands Integration Tool (FLINT) is an open-source software technology designed for measurement-reporting-verifying greenhouse gas emissions and removals from forestry, agriculture and other land uses. FLINT is not an MRV system but provides a framework to progressively build MRV systems for specific cases.
The Generic Carbon Budget Model (GCBM) is a tool developed to assess and report the cumulative effects of anthropogenic and natural disturbances on forests. The GCBM is a set of modules developed by the Canadian Forest Service (CFS) to run on top of FLINT. This set of modules describes several forest carbon pools. GCBM simulations work at an annual time step interval. GCBM handles as inputs a combination of spatially explicit datasets concerning the forest (with information like tree species or the location of forest types), age, climate, and disturbances, along with non-spatial parameters such as volume to biomass conversion coefficients or yield curves. A python tool is used to prepare the spatial inputs from raster to vector format and an SQLite Database handles the non-spatial data [1].
*carbon pool: represents a reservoir of carbon that can be stored or released , **disturbances: unplanned events (e.g. wildfire) that affect carbon pools
We applied MLOps techniques and tools to a complex scientific workflow s part of a community led, distributed carbon modelling platform. These techniques ensure reproducibility, which is the cornerstone of good science.
In my proposal, the main concept is to automate complicated and technically demanding tasks, as well as the corresponding reports, that a moja-global contributor might face by using CI/CD tools for Machine Learning/Data Science projects. I propose a DVC repository (i.e.remote storage) to cache significant logs and outputs of simulations which is achieved with the DVC pipeline for GCBM.Belize and Colombia repositories. Furthermore, I build a cloud storage repository on which we can use DVC's and CML's features to track, cache, compare, store and make flint-ready datasets there, thus making the datasets reproducible and interactive with the git repository that processes them because spatial datasets have a lot of variability and a standardisation process is needed to achieve reproducibility. I believe I achieved that with the work I've done on the Land Sector repository. Last but not least, I created a CML Action that generates a small summary/report from the execution of a simulation for the FLINTcloud platform as a numerical integration test, preventing breaking changes while under active development, and providing a useful template for new users to deploy their own FLINT services.
** flint-ready: convert spatial datasets in into to a common format and coordinate system
I owe huge thanks to my mentors Andrew O'Reilly-Nugent and Simple Shell for supporting me throughout the whole program and providing the appropriate feedback to me so I can improve my contributions.
Also huge thanks to community members Namya, Padmaja, Harsh, and all the other contributors that I worked with for not only providing support on technical issues but welcoming me to the community.
What do I believe I achieved in a general aspect?
In my proposal, my ideas and deliverable items were described generically and indefinitely because I hadn't realized 100% how several aspects of the code base worked. After I conducted enough research and learned how the code works I believed I reformed my deliverables more realistically.
GCBM.Belize was developed as a case study for the GCBM in Belize. The repository can be seen as a paradigm on how the GCBM works in nation-scale, but also provides a technical paradigm for new-coming contributors on how the GCBM works, particularly:
How the datasets are prepared
How the simulation runs
How to interact with the results (analysis and reports)
Some existing issues:
The whole workflow was configured to run in Windows Batch scripts and the steps were not connected between them, nor do they follow any structure, making their execution and the analysis of the results more complicated.
The Belize repository is a good case for experimenting with the GCBM by modifying specific parameters of the simulation and comparing the different outputs they provide. There was no such procedure to do that.
Data Version Control (DVC) is a dataset versioning tool that takes advantage of already existing engineering tools like Git. So we can't say that DVC is git for handling large-sized datasets that can be managed by git due to size limits.
Integrate a system-agnostic pipeline to execute the complete workflow (preprocessing, simulation and postprocessing) that tracks and stores the outputs in remote storage.
The cases of GCBM.Belize and GCBM.Colombia are developed to only be executed in Windows systems but using DVC's functionalities I set up the pipeline to be system-agnostic.
Before the DVC pipeline was established, the phases of the workflow had to be executed manually but using DVC anyone can execute the whole workflow with only one command (dvc repro)
In the cases of the GCBM module, the outputs it generates are for the most part .tif files and in the postprocessing step, it generates some metrics and plots. All these outputs are listed and tracked by DVC using md5 hashes. I used DVC's functionality to store files in remote storage and set up a Google Drive repository that stores them.
Furthermore, I used DVC's features for metrics and plot files to track these kinds of outputs from the workflow. This way when someone created another version of the GCBM, the dvc diff command could be used to compare the metrics from the standard GCBM version to the new one (and use the output in a potential report as well).
In the scenario that someone has created and executed different versions of the established pipeline the dvc exp tool can be used to list, compare and display metrics, plots, and the output tifs (and use these outputs in a potential report as well).
Pipelines in DVC are developed in YAML syntax and are divided into stages. For each stage to be appropriately executed we need to define:
the working directory
the command to be executed
the stage's dependencies (i.e. the files that are used/affected by the stage)
the stage's outputs(optional)
The GCBM pipeline consists of 12 stages :
Tiler → Processes and defines the spatial layers to be used in the sim
Recliner2GCBM → generates the gcbm_input database
modify_<type>_parameters scripts that apply preprocessing to the input_database to fit better into specific GCBM cases
UpdateGCBMConfiguration → updates the simulation configuration based on the contents of the Standalone_GCBM/template directory
run_gcbm
create_tiffs → generates the compiled spatial output
compile_results → generates the output database, a more user-friendly format to show the output results
post_processing → which creates the 3 figures that showcase the distribution of the four unique indicators throughout the simulation and metrics files that calculate the mean values of 3 different metrics of carbon stock divided between 3 periods (1 period = 50 years) for every type of indicator and every type of LifeZone.
Another important fact is that DVC by default does not assign any specific order to the execution of the pipeline. Although someone can define a specific order by denoting the outputs of each i-th stage as dependencies in i+1-th stage.
The pipeline offers the ability to professionals outside the programming world to effortlessly interact with simulations and modules without having to dig down on coding or other difficult technical tasks so they can focus more on analyzing the results.
DVC not only provides a blueprint to organize the workflow but also offers the ability to determine the dependencies and outputs of each step of the workflow. Because in most of the cases that are studied and use either FLINT.core or other modules the stages of the workflow are somewhat familiar(create a simulation and its configs, run it, compile the results and run some postprocessing scripts) this pipeline can be used as a blueprint/guide (with some modifications) to integrate the same functionalities in other FLINT or GCBM cases.
The Land Sector Datasets repository consists of datasets and their metadata for Land Sector management use in the FLINT. In more detail, this repository includes Jupyter Notebooks that contain the license, metadata along with other information, and the processing code to get the datasets (which are in raster format) into vector format (.tif or geoJSON) to be flint-ready.
** flint-ready: prepare the dataset in vector format
There is also a Google Drive remote storage maintained by moja-global's contributors, where the processed (vector format) datasets are stored. Although the Land Sector git repository and the Google Drive storage are not directly connected, meaning there's no specific procedure that ensures the reproducibility of those datasets, contributors can only use one version of each one.
The goal here is to establish a framework that guarantees reproducibility while curating the datasets and also decrease the required dependencies to process the datasets, so in general, make the Google Drive storage directly connected and interactive with the GitHub repository.
My first thoughts were that this case is very well suited for utilizing DVC's features to track datasets through Google Drive but that didn't seem enough. After discussion with the community and analyzing the situation I pinned down the issue into these finite tasks:
Implementing a general processor capable of processing the datasets into flint-ready formats
Track the processing steps (using DVC) and store the flint-ready datasets in the Google Drive storage
Provide a health check on the processed datasets when someone makes changes to the processing steps
Since the repository consists of many different datasets my mentors suggested I use the Harmonized World Soil Database dataset as a use case. My first task was to implement processing code for the dataset from raster into vector (.tif) format. I achieved that by writing a script that utilized the gdal python library to convert the raster dataset into vector format and restructure it to be flint-ready. Then, I created a DVC pipeline that runs the processing script and lists as output the vector dataset. At this point, I have to note that the pipeline is essential for future reference where we potentially need to implement processing for multiple datasets and push them in remote storage only with one command. The third and final step was to create a GitHub action that executes the pipeline and pushes the output datasets to Google Drive storage. This action is going to be triggered only when someone’s commit changed the dataset processing code thus providing a health-check (i.e. that someone's changes in the processing code generated a reproducible and flint-ready dataset). For a better understanding of how these deliverables will work here is a workflow example of a potential contributor:
Let's say I want to use the Harmonized World Soil Database dataset, which is stored in GeoTiff format in moja-global's Google Drive storage, but I want to use it in GeoJSON format, so I make some changes to the processing script and commit them.
The health-check action is triggered and executes the pipeline
After the action is completed the processed dataset is pushed to the Google Drive storage and an auto-generated commend (using CML) is published (example here) that informs me whether my script generated a flint-ready dataset or not.
Most of the datasets in the repository are processed using arcpy which is a python package for geographical analysis but can only be used in Windows systems. As suggested by my mentors I also worked on refactoring the processing code of some Datasets that use arcpy with system-agnostic libraries/modules such as geopandas. I applied this idea in the Global Ecological Zones dataset where I processed the dataset with geopandas instead of arcpy, so now the datasets can be processed in non-Windows systems too.
I believe that if we follow an OOP approach we can implement a general processor that would be able to handle multiple Datasets from the Land Sector repository so it can be used in other moja-global projects in the future.
As this deliverable is not merged yet I used a personal Google Drive storage for testing my code but the same principles can be applied to moja-global's Google Drive storage.
The core goal in FLINT.Cloud is to build a continuous deployment pipeline to offer FLINT on cloud resources. It consists of 2 unique APIs that run a different kinds of simulations. The APIs in FLINT.Cloud are configured inside docker containers where the required dependencies such as the FLINT source code, other required software packages, etc. are included. The repository is going to co-operate with other repositories because the community is also working on FLINT.UI which is going to be a FLINT frontend client for configuring simulations using the FLINT.Cloud APIs. Regarding the FLINT.Cloud repository, every step of the workflow (creating simulation configs, running and analyzing results) is executed through manual requests to the APIs. So, after discussion with the community and mentors, the creation of a CI script that would run a simulation and auto-generate a report was suggested. This CI script would provide a blueprint on how the APIs work to new-coming contributors and also stand as a procedure that automates a part of the workflow, thus releasing the developer/researcher of some technically demanding tasks.
CML is an open-source CI/CD tool for Machine Learning or general Data Science projects. It can be used to track and provide auto-generated reports on development workflows and also can be configured to provide CI pipelines on cloud-hosted runners.
The original goal was to establish a GitHub Action that executes a benchmark simulation in FLINT.Cloud and provides an automated report on the sim results using CML
First of all, the action is triggered on pull requests with the simulation label. The action uses a benchmark sim configuration and runs it using the rest_gcbm_api which is wrapped inside a docker container that includes all the required dependencies for the simulation along with the GCBM rest-API. Using the provided benchmark configuration we execute it. After the simulation ends we use the CompileResults repo to prepare the compiled_gcbm_output database where SQL queries are run to provide information on the simulation's output. These inferences and plots can be utilized by CML to be published in the form of a comment on the PR.
The FLINT.Cloud repository is still in the development phase. Since the GCBM rest-API is still in the development phase and daily continuous changes in the input configuration code are made, we stumbled upon some errors that couldn't be resolved so it was decided to temporarily modify the CML Action to run the simulation straight from the FLINT CLI interface (which is included in the container) until the APIs get their final form.
After the CML Action PR was merged mentors suggested we could enrich the auto-generated report by also displaying (in Jupyter Notebook form) the code that generated any potential plots. They suggested the use of jupytext, which is a python package for versioning and managing Jupyter Notebooks. It offers a variety of commands that map python scripts, markdown texts, and notebooks, so you can instantly access code in any of these formats. To sum up, I used jupytext to convert the script that generates the report from py:percent format into markdown so it can be attached to the report. I also raised a new PR for this addition.
My experience throughout the mentorship has been wonderful. I believe I learned a variety of things, from how to implement CI principles in non-traditional web projects to how to write cooperative code. I feel proud for being a part of a community with such team spirit, and I want to continue collaborating and making real contributions to moja-global.
[1] Shaw, C. H., et al. "Cumulative effects of natural and anthropogenic disturbances on the forest carbon balance in the oil sands region of Alberta, Canada; a pilot study (1985–2012)." Carbon Balance and Management 16.1 (2021): 1-18.