Search
15 MAY 2015

Q&A with Nikhil, Leading Contributor to GS Collections 6.0 Open Source Project

A third-year engineer in Private Wealth Management (PWM) Technology, Nikhil was the leading contributor to GS Collections 6.0, the latest release of our open source Java library.

Q: When did you start writing open source code?
Actually, I’d never written a line of Java code until July 2012, when I joined Goldman Sachs as a NAPA (in-house term for first-year engineers). In fact, my academic background had no connection with computer science at all—my bachelor’s and master’s degrees are in mechanical engineering. My training included the GS Collections Kata, after which I joined a PWM team in Salt Lake City that uses GS Collections extensively. If anybody had told me two years later I’d be writing code that would be open sourced as part of GS Collections 6.0 I would have said, “In your dreams!”

Q: You say your team uses GS Collections extensively. How so?
We use specialized calculators that take reference data (account, position and product) and apply different algorithms to compute investment management fees for the firm’s clients. GS Collections methods like forEach, collect, select, reject and partition have become second nature to us because they are much simpler, more efficient and readable. The iteration methods are exposed on the GS Collections data structures (MutableList, MutableSet, MutableMap and MutableBag), making them much easier to use than standard JDK Collections (List, Set and Map). When you can replace the standard Java approach with a single line of code, your code is clearly going to be shorter and more readable.

For example, a JDK Set contains all unique values, regardless of how many times they occur, but there is no easy way of finding a set of values that occur a specific number of times. Without GS Collections, here is how I would find the set of elements that occur once in a List:

public static <K> Set<K> getNonRepeatingObjects(List<K> inputList)
{
    Map<K, Integer> objectToOccurrencesMap = new HashMap<>();

    for(K each : inputList)
    {
        Integer count = objectToOccurrencesMap.get(each);
        if(count == null)
        {
            count = 0;
        }
        objectToOccurrencesMap.put(each, count + 1);
    }
    Set<K> nonRepeatingObjectSet = new HashSet<K>();

    for(Map.Entry<K, Integer> entry : objectToOccurrencesMap.entrySet())
    {
        if(entry.getValue() == 1)
        {
            nonRepeatingObjectSet.add(entry.getKey());
        }
    }

    return nonRepeatingObjectSet;
}

With GS Collections, the method changes to just one line:
public static <K> Set<K> getNonRepeatingObjects(List<K> inputList)
{
    return ListAdapter.adapt(inputList).toBag().selectByOccurrences(IntPredicates.equal(1)).toSet();
}

Q: When did you start making contributions to GS Collections—and why?
In late 2013, when I started using Multimaps on a project, I realized that we needed a constructor that accepted an iterable of pairs. I wrote a factory method in our codebase, and started using SetMultimaps. Over the weekend, I sent out code samples to a colleague, asking if my method could also be added to GS Collections, and he sent me all the details I needed to start contributing. During our code reviews, they explained the concepts and reasons behind their suggested changes and listened to my ideas for improvements to the codebase. It was an amazing learning experience, and that became one of my main motivations to keep contributing.

Q: What was the specific problem you were trying to solve?
Multimaps are maps where the values are a collection. GS Collections provides ListMultimap, SetMultimap and BagMultimap for different use cases. My use case required a Set, so that the same account would never be processed more than once. I could have used a JDK Map with the values as a Set, but Multimaps are more intelligent. They do not store a key where the collection is empty or null, so they’re safer, and more efficient and readable than just Map<Key, Set<Value>>.

When I started working with Multimaps, they already had methods to iterate over each key (forEachKey), each value (forEachValue) and over each key value pair (forEachKeyValue). In my use case, I realized I always had to call the get() method and then operate over the collection. My code would be cleaner if I had a way to iterate over keys and the value collection. So, I contributed Multimap.forEachKeyMultiValues(). After that I had momentum and wanted to add more iteration patterns like select, reject and collect. After brainstorming with Craig, we decided to add two forms of each pattern, selectKeysValues(), rejectKeysValues(), selectKeysMultiValues(), rejectKeysMultiValues(), collectKeysValues() and collectValues().

The two collect patterns presented a new challenge. What should their return types be? Craig explained that the return types should preserve as much information as possible and not imply anything new. Take SortedSetMultimap.collectValues(), for example. After transforming the values, the comparator no longer applies, so they cannot be sorted. But they still have an order based on their positions in the sorted set. So SortedSetMultimap.collectValues() returns a ListMultimap. What about SortedSetMultimap.collectKeysValues()? In this case, the keys are also transformed, and the transformed keys can have duplicates and collide with each other. Since keys are stored in an unpredictable order, the whole result has an unpredictable order, so SortedSetMultimap.collectKeysValues() returns a BagMultimap.

Let’s say there was a SortedSetMultimap with keys and values like this:
{Mammal=[Bat, Cat, Dog, Mouse], Bird=[Duck, Goose, Turkey], Insect=[Cricket, Fly]}
When collecting the values of the sound each animal makes, we want to make sure the same order of values is maintained and that duplicates are allowed. The order of keys is not important but the order of values definitely is, so the backing value collection needs to be a List. Using collectValues() will return this list:

{Mammal=[Chirp, Mew, Bark, Squeak], Bird=[Quack, Quack, Gobble], Insect=[Chirp, Buzz]}

But when you also collect the classification of each animal using collectKeyValues(), the keys collapse as well. For vertebrates, there’s no way to decide whether mammal sounds or bird sounds should come first. You might wind up with a BagMultimap like this:

{Vertebrates=[Mew, Gobble, Bark, Squeak, Chirp, Quack, Quack], Invertebrates=[Chirp, Buzz]}

Q: What other benefits have you seen from GS Collections?
The optimizations and the concurrency utilities. In 2014, we were working on enhancing our system performance. The speed and memory optimizations in GS Collections caught my interest, but our application is IO-bound, which meant that any significant performance gains would have to come from improved concurrency.

We wound up using ParallelIterate, a GS Collections utility class that provides familiar iteration patterns like forEach(), collect(), and select() that execute in parallel. Each pattern has a number of configurable overloads. The simple forms completely hide the thread-pools and batching logic.

We wound up deleting a lot of hand-written concurrency code and replacing it with calls to ParallelIterate. We also found and eliminated code that spun up thread pools but never shut them down properly. By simply relying on the configurable overloads in ParallelIterate and carefully choosing the right inputs, we achieved an 8% speed improvement without spending a lot of time.

Q: Now that GS Collections 6.0 has been released, what’s next?
My team has a lot of interesting work coming up, and I am also working on GS Collections 6.1. The changes are mainly related to performance tests of GSC Maps, JDK Maps and Scala Maps, which will be my first exposure to JMH testing and memory testing. The learning experience continues!