Advanced Computing in the Age of AI | Wednesday, December 6, 2023

Communities of Communities: The Next Era of Open Source Software

We are now about 20 years into the open source software era. You might think that open source simply means publishing the source code for something useful. While this is correct by definition, the most important component of any open source project is its community and how it works together.

Open source projects are not isolated islands. In fact, it’s common for them to depend on each other. When new projects are created, their members often come from related projects. Apache Arrow is one example: it drew contributors from many related projects, and from the beginning its community knew it needed to be a community of communities.

An open source project is only successful when it builds trust in a large enough community.

If nobody uses an open source project, then it does not really matter that its source is public.

There’s a spectrum of reasons why people are part of the community, and any given community has many types of members. For example, one group could be users who trust the project enough to build their own project on top of it. They can report bugs, provide feedback, give context to feature requests, and advocate for the project.

Another group could be contributors, who provide fixes along with their bug reports or volunteer to implement a feature that is important to them. Some become committers who, recognized for their sustained contributions, now help others get their own contributions merged into the main source tree. Committers have the privilege of curating what eventually ends up in the source and the responsibility of helping new contributors reach the standards that will get them there.

In the Apache world, the PMC (Project Management Committee) is in charge of recognizing new committers and PMC members who will help the project and the community grow. In an open source project, everything has to be thoroughly vetted to ensure the quality of the project. Contributions have to be focused on the project goals and meet the quality standards that inspired trust in the project in the first place. The PMC itself evolves over time. One of its responsibilities is to recognize people who help define those goals and include them as peers in the decision-making process by inviting them to join the PMC.

Every one of these community members is important. They usually don’t conform to a single role. Over time they can move from being a user, to reporting a bug, to fixing it, to contributing and sometimes making a career of it.

Hadoop is an open source project that has been under development for 10 years now, and it’s actually a collection of communities working together. In this ecosystem we have reached a point where ensuring interoperability introduces many dependencies. Open source projects rely on each other and form a larger community of communities. Fortunately, those communities very often overlap, which helps to align goals and priorities.

Naturally, as people work on integrating two projects, they tend to contribute to both and become part of both communities. This cross-pollination smooths coordination, since the very people who need a release or a bug fix can unblock themselves by fixing the bug and stewarding the release.

Today, Apache Parquet is the standard for columnar data representation. It started as a grassroots effort when I was at Twitter, in collaboration with the Impala team at Cloudera. There was a need for a standard, language-agnostic columnar format. Starting from that early group, we kept building the community by being inclusive and by working to build consensus with a wider group.

While I contributed to the Apache Thrift and Apache Pig integrations, which were a focus for Twitter at the time, Tom White from Cloudera implemented the Apache Avro integration, and engineers from Criteo made it work with Apache Hive. Netflix started using it and worked on Presto support. The Apache Drill project reached out to include some of its requirements in the format and made Parquet its default representation. Apache Spark also picked Parquet as the default representation for Spark SQL. At that point critical mass was reached, and Parquet became an integral part of the Hadoop stack.

Apache Arrow is building on the success of Parquet. As various projects worked together creating and improving Parquet and making it their common on-disk format, they also realized they needed a standard in-memory representation with different optimizations. Arrow benefited from an already developed community of open source projects with a common goal. The initial PMC was formed from members of open source projects that shared this need. Among query engines and data manipulation tools, those projects are Calcite, Drill, Impala, Pandas, Phoenix, Storm, and Spark. In the storage layer, they are Cassandra, HBase, Kudu, and Parquet.

We started by building a consensus around the purpose and the scope of the project. A key to success is to stay focused on the project’s main goal, avoid distractions, and reduce the number of topics on which we might disagree. The interface must be clear enough that each dependent project can move independently.

Discussions are held publicly using GitHub pull requests, mailing lists, and JIRA, allowing contributions from anyone interested. We hold regular online syncs using Google Hangouts, and everyone is welcome. This helps us reach a consensus faster as a complement to asynchronous communication, which can be slow to converge. Notes are always posted on the mailing list and items followed up in pull requests and JIRA.

Responsiveness is key. Reaching consensus requires someone to push the discussion along to its conclusion. Finding agreement on what should be done can take longer than getting it done, which can frustrate those used to working on private codebases. But consensus is what drives adoption of a project like Arrow. Making open source successful is about enabling people who really need something to do it themselves.

The Arrow project now has implementations in Java, C++, and Python, with good compatibility testing coverage. Several Arrow integrations are ongoing, for example one that speeds up Spark/Python integration. This phase is inherently parallel, as each project can work independently without requiring coordination. The next step for Arrow is to keep integrating with the ecosystem and to provide higher-level interfaces, such as for data storage and user-defined functions. The project has made good progress since its inception.

Julien Le Dem is principal architect at Dremio, co-author of Apache Parquet, and PMC chair of the project.