Interview with Keynote Speaker Joe Hellerstein

Joe Hellerstein is a Professor at the University of California, Berkeley, and founder of the software company Trifacta. He works broadly across computer science, with a focus on data management, on everything from data systems to human-computer interaction, machine learning, and networking.

Can you tell us a little about yourself and your background?

I grew up in Wisconsin, in the United States. My mom was a computing pioneer; she was already working in computer science in the late 50s and early 60s, so she has had a big influence on me. And my dad as well – he was a mathematics professor. And I have big sisters, one of whom is a computer scientist. So, there are many family influences on my choice of career.

I have been a Professor at Berkeley now for about 25 years. About nine years ago I co-founded the company Trifacta. The goal of Trifacta is to transform data into a shape for use without having to write code. This allows people who aren’t coders to do their own data preparation, and it allows people who are coders to do things much more effectively. Trifacta came about as an extension of research that I was doing with some colleagues at Stanford, and we then founded the company together.

For fun I play the trumpet.

What will you be talking about at the conference?

At the conference I’m going to talk about data engineering. I will talk about how important it is to the data science lifecycle and how the tasks for data engineering are shifting from being a burden on a small select few in IT departments to something that everybody in data science can and must take on.

What do you tell people when they complain about data cleaning before they can do the fun machine learning stuff?

First of all, I tell people that they never know their data as well as they do when they are in the middle of preparing it for use. That’s when you get the complete context of what is in the data and what you have to do to get it into the form you need for it to work. You’re in a very intimate relationship with the data. It’s like when you’re deeply in practice with a piece of music—you really are immersed. If you’re not engaged in this process, then you probably don’t actually know what’s going on. It is only at the point of preparation that you’re really intimate with the material.

“In the machine learning lifecycle, the point of maximum agency takes place when you’re doing the data preparation and featurization. That is when you as a person have the most influence on things”.

And I would actually take a little issue with the framing of the question because mostly with machine learning, all you’re doing is turning on a model and seeing what pops up, and there’s not a lot of agency in that. In the machine learning lifecycle, the point of maximum agency takes place when you’re doing the data preparation and featurization. That is when you as a person have the most influence on things.

But we don’t do enough experiential teaching on this in universities. We tend to give students pre-prepared data sets and then they don’t get the experience of preparation until they’re in the field.

I will also say that over the years many of the tools for data preparation have been very poor, which has made the task unpleasant. It often looks like programming, so in practice you’re not immersed in the data, you’re immersed in some code. I think that has to change. That’s actually a big part of what we do at Trifacta.

One of our conference topics this year is Learning from Little. How different are the big and little data problems and their solutions?

It’s such a lovely question. Partly it’s a nice question because, of course, when you start by thinking about how big the data is, you focus only on scale and performance, and you don’t really focus on the quality of the data: what’s in there and how to get it into shape to use it.

Scale can be a problem for the user even with small data, because we as humans cannot really work with large data sets—our heads don’t do that. We need computational aids to look at more than a screenful of data. So, when you look at a table that is spread over 20 screens—which is only a few kilobytes of data!—you will not be able to keep it in context in your head. So, all of the human-scale problems arise already at very small scales of data. And those small data sets, as much as big ones, challenge our ability to do what we want, raising the questions: how do I know what’s in here, and how will I make sure that what’s in here is appropriate to my task? This happens even on a very modest scale. So those questions should be present in everyone’s mind.

“You always need to be asking what’s missing. And small data kind of drives us to that question right away, which I think is great.”

And when you’re working with small data you almost always ask the question: what am I missing? Which is the question that you may forget to ask if you have a giant dataset. Which is a problem, because no dataset covers all the data that could have been generated by a specific phenomenon in the world. Even with banking transactions you probably don’t have all the transaction scenarios from the beginning of time. So even for these very humble computerized tasks, you still can’t get the complete data. You always need to be asking what’s missing. And small data kind of drives us to that question right away, which I think is great.

And in Trifacta, we start by giving people a small sample of their data set, even if the full set is large, because they can then interact with the data quickly and hypothesize about what they will find. They can try out different transformations to see what they get. And all of that happens at the speed of thought rather than at the speed of some gigantic computing task. We have an architecture that is called “sample to scale”, where we give you a sample to work on and then, when you believe that the work you’ve done is the right task for the whole data set, we compile it down to a job you can run on a big data platform. That’s a computer science compiler problem that we’re handling for you, and then running the job is a back-end systems problem handled by infrastructure. But the hard part of your job is the exploration and transformation work you do on the sample, in order to get it into shape.
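To make the “sample to scale” idea concrete, here is a minimal sketch of the workflow described above: prototype a transformation recipe interactively on a small sample, then replay the same recipe over the full dataset. This is not Trifacta’s implementation; the file names, column names, and chunk sizes are hypothetical, and pandas stands in for whatever engine would actually run the job at scale.

```python
# A minimal "sample to scale" sketch: refine a transformation recipe on a
# small sample, then apply the identical recipe to the full dataset.
# File and column names here are hypothetical.
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """The transformation 'recipe': each step is an ordinary DataFrame operation."""
    df = df.dropna(subset=["amount"])            # drop incomplete rows
    df["amount"] = df["amount"].astype(float)    # normalize types
    df["category"] = df["category"].str.lower()  # standardize labels
    return df


# 1. Explore and refine the recipe on a small sample, at the speed of thought.
sample = pd.read_csv("transactions.csv", nrows=10_000)
print(clean(sample).describe())

# 2. Once the recipe looks right, run it over the full dataset in chunks
#    (or hand the same logic to a distributed engine such as Spark).
chunks = pd.read_csv("transactions.csv", chunksize=500_000)
result = pd.concat(clean(chunk) for chunk in chunks)
result.to_csv("transactions_clean.csv", index=False)
```

The point of the sketch is the separation of concerns: the exploratory work happens on the sample, and scaling the finished recipe out to the whole dataset is a mechanical step handled by infrastructure.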

So, we’re very much on board with the idea that even with large data sets you want to “go little” in order to get that fluidity: experimentation and exploration. So, I think it’s a wonderful theme.

And lastly: why is SDS important and what do these conferences bring to the community?

I think it’s really important for practitioners, technologists and researchers to be together in a dialogue about what matters and what innovations can do to help. When research is done in a vacuum, you sometimes get innovations that aren’t really great for people to use: they can’t adopt the technology because it is too hard to use or too generically focused. Feedback from practice lets researchers understand what holds people back from getting value out of data, and that is critical to the research effort.

“And in my own work in particular I’m very informed by practitioners, with the idea that innovation in computer science may be about helping practitioners do their jobs better as opposed to creating things from scratch that nobody asked for.”

At the same time, there’s a lot of creative work that goes on in R&D in both universities and companies that practitioners can learn from. I see it very much as a two-way street. And in my own work in particular I’m very informed by practitioners, with the idea that innovation in computer science may be about helping practitioners do their jobs better as opposed to creating things from scratch that nobody asked for. That dialogue can be quite healthy.