
Pain & Label:

Why and How We Built our Own ML Data Labeling Tool and Released it Free for Everyone

The Machine Learning (ML) revolution is here. It seems like every company and technical team wants to join this new wave of innovation. But what’s the first step?

At Sixgill, after setting out to infuse ML capabilities throughout our own product suite, we hit an obstacle that surprised us. It wasn’t figuring out ML itself. Nor was it defining best practices for deep neural net architectures, activation functions, or data augmentation techniques. That type of information is readily available. It wasn’t even putting developed models into production; we were able to quickly deploy new models on multiple clouds, trained and served from Python, JavaScript, and Go.

The bottleneck in our process was creating high-quality datasets that we could use to train our models for novel use cases. Sound familiar?

We quickly realized the best way to overcome this obstacle was to build our own data-labeling tool. While we built it to solve our own pain, we deeply understood the pain our peer developers, data scientists and engineers were also feeling. So we launched our data labeling application publicly to share our faster, easier, better way to create visual ML training datasets.

It’s called HyperLabel. It’s fast, extremely easy to use, free to download for local use, flexible for diverse use cases, and powerful enough to scale to high-volume workflows. HyperLabel is a desktop application that uses ML itself to accelerate data labeling, and end-to-end encryption to protect your privacy and preserve your IP.

The decision to build HyperLabel wasn’t easy. Here’s some insight into our journey from conception to launch of our data labeling tool, and the key questions we asked ourselves along the way:

Should We Use Open Source or Not?

Like most startups, we first tried to solve our problem using open source solutions. A few seemed fine at first, but quickly proved insufficient once our needs became more complex and demanding.

For example, let’s say you’re trying to detect people in an image. To gain higher detection accuracy, you’ll probably retrain COCO SSD MobileNet, YOLO, or another object detection model. This type of training requires manually drawing bounding boxes around people in countless images. Unless you want to devote expensive engineering time to this mundane task, you’re likely going to outsource it. That’s when the problem rears its head, as these open source solutions are simply hard to use.

Sure, engineers can figure them out. But people with less of a technical background need training. Most stop at “first install python”. We had to write training materials for these users, and even then, the results were not ideal.

The second issue we ran into with open source solutions proved even more dire. Labelers often ask questions such as, “This image is mostly black and hard to see. How should I label it?” or “The image is rotated. How should I draw the boxes?”

After continuing to hear such questions, the lightbulb clicked on: we needed to label these issues themselves and treat them as a multi-class classification problem. But when we went back to our labeling software, we found that it didn’t let users attach both kinds of labels (bounding boxes and image-level issue classes) to the same image in a single workflow; instead, it required multiple tools to accomplish the task.
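To make the idea concrete, here is a minimal Python sketch of a single annotation record that carries both the object boxes and the image-level issues raised by the labeler questions above. The class names, field names, and issue classes (`too_dark`, `rotated`) are hypothetical illustrations, not the schema of HyperLabel or any other tool, and the sketch allows more than one issue tag per image as an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical image-level issue classes, drawn from the labeler questions above.
ISSUE_CLASSES = ("too_dark", "rotated")


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass
class ImageAnnotation:
    """One record that keeps object boxes and image-level issue tags together."""
    image_path: str
    boxes: List[Box] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)

    def add_issue(self, issue: str) -> None:
        # Reject tags outside the agreed taxonomy so labels stay consistent.
        if issue not in ISSUE_CLASSES:
            raise ValueError(f"unknown issue class: {issue}")
        # Avoid recording the same issue twice for one image.
        if issue not in self.issues:
            self.issues.append(issue)


# Usage: a dark image that still contains one visible person.
record = ImageAnnotation("images/0001.jpg")
record.boxes.append(Box("person", 40, 30, 120, 220))
record.add_issue("too_dark")
```

Keeping both kinds of label in one record means a downstream training script can, for instance, filter out images tagged `too_dark` before training a detector, without joining the output of two separate tools.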

Should We Go Commercial?

After striking out with open source solutions, we tried a few commercial ones, including Figure Eight and Labelbox. Both offered a highly configurable schema and customizable interface, which helped with the multi-label workflow problem described above. But there were significant downsides.

With both products, it was shocking how quickly we were forced to talk to sales. Labelbox cuts off its free tier at 2,500 “labeled assets per year”, which we quickly hit. With Figure Eight, it took us a couple of calls to even get pricing.

Neither product was cheap: Labelbox charged $1,000 a month just to use their labeling software, and Figure Eight was even more expensive. That price seemed astronomical for a SaaS offering.

Another huge downside was that we had to trust them with our datasets and labels, both of which are valuable IP for us. In our experience, some companies are categorically forbidden to trust third parties with their data unless they perform due diligence. For us, this extra requirement would have hindered client negotiations. While Labelbox and Figure Eight do offer on-premises installations, we expect the cost would be even more prohibitive.

Why We Decided To Build Our Own Application

After facing these challenges and finding inadequate solutions, we decided that the best way to get what we really needed was to build it ourselves. With our own ML data-labeling application toolset (HyperLabel), we made certain to include these six important qualities:

1) Make it easy and intuitive for non-engineers to use, and get from project setup to label export in just a few steps (5 in our case).

2) Provide flexible schema selection to enable quick iteration across various data labeling workflows. Included schema types are: rectangles, polygons, points, feature points, free text, select, and multi-select.

3) Let users control their own data. Users can import files from local drives or cloud storage, with no need to use any external service.

4) Give it scalability that allows users to manage labeling projects of almost any size or complexity.

5) Provide easy export in formats such as JSON, COCO, Pascal VOC and YOLO.

6) Make it free. By removing the cost barrier for the Developer version, we’re making quality ML training datasets an easy reality for anyone.
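The export formats named in the list differ mainly in how they encode a bounding box. As a rough illustration, the Python sketch below converts one box between the commonly used conventions: Pascal VOC stores pixel corners (xmin, ymin, xmax, ymax), COCO stores [x, y, width, height] in pixels from the top-left corner, and YOLO stores a class id followed by the box center and size normalized by the image dimensions. The function names are hypothetical, and this is not HyperLabel’s exporter, only a sketch of the conventions.

```python
from typing import List, Tuple


def voc_to_coco(xmin: float, ymin: float, xmax: float, ymax: float) -> List[float]:
    """Pascal VOC corners -> COCO [x, y, width, height] in pixels."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def voc_to_yolo(
    xmin: float, ymin: float, xmax: float, ymax: float, img_w: float, img_h: float
) -> Tuple[float, float, float, float]:
    """Pascal VOC corners -> YOLO (x_center, y_center, width, height), each in 0..1."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return (x_center, y_center, width, height)


def yolo_line(class_id: int, box: Tuple[float, float, float, float]) -> str:
    """Format one object as a line of a YOLO label text file."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in box)


# A box covering the central quarter of a 640x480 image.
print(voc_to_coco(160, 120, 480, 360))
# prints [160, 120, 320, 240]
print(yolo_line(0, voc_to_yolo(160, 120, 480, 360, 640, 480)))
# prints 0 0.500000 0.500000 0.500000 0.500000
```

Because YOLO coordinates are normalized, the image width and height must be known at conversion time; the pixel-based formats do not need them for the box itself.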

As we iterated on our own product, we also discovered a need to synchronize datasets and labels across machines, which we plan to provide at a reasonable price, with end-to-end encryption. Other features that we found desirable and are currently working on include deep learning-based object tracking for speeding up video labeling, pretrained object detectors for identifying objects with additional taxonomy, and the ability to use GANs to guess object or scene outlines for semantic segmentation.

We’re passionate about realizing the great potential of deep learning ourselves, as well as removing the pain from labeling for data experts and developers around the world. We can’t wait to experience and share the innovation yet to come as HyperLabel helps others take the fastest path to Machine Learning.