
Interview with Speaker Lukas Widmer

Lukas Widmer is Senior Principal Statistical Consultant at Novartis Pharma AG.

Could you tell us a little bit about your professional background?

From a young age I had an interest in computer technology. Initially that led me to work in software engineering while studying Computer Science at ETH Zürich, after which it became clear to me that some of the most challenging and impactful applications where I could contribute were in life science and healthcare – here some exposure to the field of medicine through my family definitely had an influence. This led me to pursue an MSc degree in Computational Bioinformatics at ETH Zurich and UC Santa Barbara and a PhD degree at ETH Zürich in Basel. In my PhD I focused on more interdisciplinary work, interfacing statistical and computational modelling and simulation to further the understanding of basic biology, with the long-term goal of improving treatments. The desire to have a more immediate impact in life science brought me to the Advanced Exploratory Analytics group – part of Advanced Methodology and Data Science at Novartis – which I joined in 2019. I joined Novartis with the double mission to drive the use of innovative methodology – such as data science, modelling and machine learning – across drug development, and to deliver science-based progress to our patients and re-imagine medicine.

What will you be speaking about at the SDS2021?

I will be discussing the importance of developing, propagating and applying Good Data Science Practice in the Data Science community in general, and in healthcare and pharma in particular. There have been several recent examples that highlight the need for this, such as the introduction of unwanted (and potentially unnoticed) bias into COVID-19 risk prediction, which could impact patients in unintended ways, or the bias introduced into deep-learning-based melanoma recognition when surgical skin markings are not accounted for. That any holistic approach to human research must be built on a solid ethical foundation has also been a recent point of discussion in Computer Science, through the banning of the University of Minnesota from making further Linux kernel contributions and the subsequent apology, highlighting our duty to protect human subjects in research. We will discuss the need for Good Data Science Practice from multiple perspectives in the pharmaceutical industry and beyond, and we look forward to your thoughts and questions.

Why is the SDS conference important?

The Swiss Conference on Data Science is a great platform for exchange on current developments and issues in the data science space in industry and academia across Switzerland. There is a lot of excellent research and development going on both at universities and companies, and I find that having a good and critical discussion and dialogue (for example at SDS) is a good seed for collaboration and innovation. Having a direct line to people on the ground – subject matter experts – seems to be one of the key success factors, so I am looking forward to a diverse conference program.

Interview with Speaker Aleksandra Chirkina

Could you tell us a little bit about your professional background?

My professional background is in the application of data science to finance and financial data. Recently I’ve been working on an NLP project for the analysis of financial documents, a recommender system for financial advice, and the adaptation of different data science techniques for KYC compliance.

What will you be speaking about at the SDS2021?

My presentation “Data Science for Uninterrupted KYC Compliance” is going to demonstrate how data science can boost the efficiency and quality of KYC (Know Your Client) and AML (Anti-Money Laundering) processes for financial institutions. Proper implementation of KYC and AML measures is crucial not just for the smooth operations of a financial institution, e.g., a private bank, but for the economy as a whole, preventing the inflow of ‘dirty’ untaxed money and combating criminal activities.

In the talk I will present our experience with two data science models applied to transactions and KYC profiles. As a result, some non-trivial KYC violations were detected that had been missed by the traditional rule-based approach. The talk is aimed at inspiring financial institutions to explore intelligent data-driven solutions for detecting KYC violations, money laundering and fraud.
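To make the contrast with fixed rules concrete, here is a minimal, purely illustrative Python sketch: a hand-written threshold rule next to an unsupervised anomaly detector (scikit-learn’s IsolationForest) run on synthetic transaction features. The features, threshold and model choice are assumptions for illustration only; the two models mentioned in the talk are not described in this interview.

```python
# Illustrative contrast between a fixed rule and a data-driven anomaly detector
# for transaction monitoring (hypothetical features, not the speaker's models).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic transaction features per client: [amount in CHF, transactions per week]
normal = rng.normal(loc=[2_000, 5], scale=[800, 2], size=(5_000, 2))
odd = np.array([[9_500.0, 4.0],    # amounts structured just under a 10k rule
                [9_800.0, 6.0],
                [2_100.0, 40.0]])  # unusual frequency, unremarkable amount
X = np.vstack([normal, odd])

# Rule-based check: flag only single transactions above a fixed threshold.
rule_flags = X[:, 0] > 10_000
print("Flagged by the 10k rule:", int(rule_flags.sum()))  # 0 -- misses all three

# Data-driven check: learn what "usual" behaviour looks like and flag outliers.
model = IsolationForest(contamination=0.001, random_state=0)
labels = model.fit_predict(X)  # -1 marks anomalies
print("Flagged by the anomaly model:", int((labels == -1).sum()))
```

The point of the sketch is only that a learned notion of “usual” behaviour can surface patterns a static threshold never triggers on.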

Why is the SDS conference important?

For me personally, the SDS conference is a yearly milestone for which our team always prepares thoroughly. We reflect on our achievements and discoveries over the past year and select the most interesting client projects and internal research to share with the Swiss data science community.

Another important role of SDS, particularly in the current times of self-isolation, is serving as a gathering point for data scientists from different companies and industries, where they can share their professional and personal experiences, exchange ideas and inspire each other.

How Confidential Computing and Decentriq can facilitate greater industry collaboration

Born in Zurich, David Sturzenegger is a mechanical engineer by training and obtained a PhD degree in electrical engineering from ETH Zurich in 2015. From his time at big-data company Teralytics, he has several years of experience working with highly-sensitive data and leading teams of senior data scientists and software engineers.

Now David is Head of Product at Decentriq, where he is leveraging privacy-preserving technologies to help organizations collaborate on sensitive data. At SDS2021, David will be talking about Confidential Insights, a confidential survey platform jointly developed by Decentriq and Swisscom’s Fintech unit.

Confidential Insights was announced in November 2020. It is the world’s first platform for provably confidential surveys and peer-group analyses. Built to make collaborations around sensitive data easy and secure, Confidential Insights allows survey answers from multiple participants to be combined and insights to be extracted while keeping the answers provably confidential from everybody – including all admins. With Confidential Insights there is no longer a trade-off between data utility and data privacy.

An application of Confidential Insights

Leveraging the additional confidentiality guaranteed by Confidential Insights, Swisscom’s market research department – e.foresight – recorded an increase in participation in its annual survey on online mortgages, answered by 30 banks in Switzerland. Banks that previously would not have participated due to confidentiality concerns now responded to the survey through Confidential Insights, providing e.foresight with more data and deeper insights into the online mortgage space in Switzerland.

The underlying technology platform

The confidentiality guarantee is achieved by leveraging a technology called confidential computing, which is also the technology underlying the Decentriq platform. This enterprise SaaS platform allows anyone to easily collaborate on the most sensitive data without risk of exposure. Confidential computing ensures that all the data passing through Decentriq is completely secure and encrypted, end-to-end. Even Decentriq itself cannot see the raw data that organizations put into the platform. Confidential Insights uses the Decentriq platform as a backend.
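As a rough intuition for the data flow this enables, here is a toy Python sketch (not Decentriq’s API): each participant encrypts its survey answer to a key held only by the trusted processing environment, and only aggregate results ever leave it. The SurveyEnclave class and the use of PyNaCl sealed boxes are illustrative assumptions; in a real confidential-computing setup a hardware enclave, verified by participants via remote attestation, would play the role of the key holder.

```python
# Toy illustration of the confidential-survey data flow (not Decentriq's API).
# Requires PyNaCl: pip install pynacl
from statistics import mean
from nacl.public import PrivateKey, SealedBox


class SurveyEnclave:
    """Stand-in for a trusted execution environment holding the decryption key."""

    def __init__(self) -> None:
        self._private_key = PrivateKey.generate()
        self._answers: list[float] = []

    @property
    def public_key(self):
        # Participants encrypt to this key; only the enclave can decrypt.
        return self._private_key.public_key

    def submit(self, ciphertext: bytes) -> None:
        # Decryption happens only "inside" the enclave.
        plaintext = SealedBox(self._private_key).decrypt(ciphertext)
        self._answers.append(float(plaintext.decode()))

    def aggregate(self) -> float:
        # Only the aggregate ever leaves; individual answers stay inside.
        return mean(self._answers)


enclave = SurveyEnclave()

# Each participant encrypts its survey answer to the enclave's public key.
for answer in ("120.0", "95.5", "210.0"):
    ct = SealedBox(enclave.public_key).encrypt(answer.encode())
    enclave.submit(ct)

print("Peer-group average:", enclave.aggregate())
```

The design point the sketch tries to convey is that no admin or operator outside the trusted environment ever holds both the ciphertexts and the key.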

Applications in different industries

From customers’ financial information to patients’ health data, collaborating with industry partners on sensitive data can, when done securely, bring significant benefits and value to organizations and their customers. Below are some examples of industries that can benefit from secure data collaboration:

  • Insurance: To provide better protection and service for customers by leveraging customer insights, or to enhance collaborative fraud detection by analyzing data with fellow insurers, without exposing sensitive customer and claims data.
  • Financial Services: To participate in collaborative credit risk scoring with other firms and improve credit risk modelling and scoring, without ever exposing customers’ confidential data.
  • Healthcare: To bring together patients’ highly-sensitive health data, often distributed across different hospitals and clinics, in an anonymized manner so as to allocate resources more efficiently and provide patients with more effective treatments.

With confidential computing powering Confidential Insights and Decentriq, more organizations and industries can now collaborate with each other on their most sensitive data with minimal risk, unlock new business value and deliver products that best match their customers’ needs.

Interview with Speaker Jürg Meierhofer

Jürg Meierhofer is senior lecturer, researcher and project manager in smart service engineering at the Zurich University of Applied Sciences (ZHAW). He is the coordinator of the “ZHAW Platform Industry 4.0” and expert group leader for the Expert Group “Smart Services” in the data innovation alliance.

Could you tell us a little bit about your professional background?

The design and engineering of services is the common thread throughout my professional activities. I got my PhD from the Swiss Federal Institute of Technology in Zurich (ETHZ) as well as an executive MBA degree from the International Institute of Management in Technology (iimt). For more than ten years I worked as a manager for service innovation and optimization in the telecommunications and insurance industries.

What will you be speaking about at the SDS2021?

Service customization is a key factor for value creation in socio-technical service ecosystems, enabled and fuelled by new data-driven approaches. I will address the question of how to design service customization within the provider-customer interaction and show a novel quantitative approach to modelling this. We will see that the optimal design of the customer journey is hard to find with heuristic approaches, which are often sub-optimal.

Why is the SDS conference important?

The SDS is important because it brings together the leading actors in data-driven innovation from all around the world as well as from the Swiss community. It gives visibility to the various innovation streams and provides the chance to connect and create new potential.

Interview with Keynote Speaker Lothar Baum

Lothar Baum is Head of Engineering Cognitive Systems at Bosch. He holds a PhD in computer science. Before joining Bosch, he worked for Hewlett Packard in Germany and at a smaller company in the USA.

Could you tell us what you are doing at Bosch?

I joined Bosch in 2006, working in corporate research, where I built up a research group on connective systems. We were looking at robotics and machine learning applications. Then I moved on to data topics and started a project in data mining. I then got involved in the foundation of the Bosch Center for Artificial Intelligence. Since 2017 – four years now – I have been with the business unit on automated driving. I am responsible for the department that develops smart algorithms: basically all the algorithms from perception to situation analysis, prediction and behaviour planning.

What will you be talking about at the conference?

I will be happy to give you an overview of what it takes to build autonomous driving cars; what the technical challenges and the approaches to tackle these challenges are. Ultimately, I want to give you an impression of where we stand and where we need to extend our technology.

What are the biggest challenges for Automated Driving? And what are the solutions?

There are of course a lot of challenges. Trying to summarize them in a short time is in itself a challenge.

One of the first challenges is performance, especially perception performance: how well does a car perceive its environment? This comes down to sensor variety and performance; the more sensors you have, and the more different modalities of sensors you have, the better it is. Secondly, it comes down to computational performance. We need fast computers that require energy and space, which – in turn – means higher costs.

The second challenge is what we call the “open world problem”. The boundaries of the driving task and the rules for behaviour are not, and cannot be, clearly defined. The problem is that there will always be situations out there that nobody has ever thought about. And how do we handle situations that no one has implemented a solution for? This calls for approaches that are data driven. This means that we train systems with data examples and hope that they are sufficiently able to abstract and generalize. This, in turn, means that we need a lot of training data, which is another big challenge.

“The basic idea is to have an approach that is similar to how we humans learn to drive. We know a set of rules, and we have collected experience via driving lessons. We don’t have a clear plan for every possible situation, but we have, let’s say, some kind of abstract data set in our heads where we are able to transfer situations or solutions to other scenarios. That’s the data driven approach.”

Then there comes the long tail of unknown cases: for which scenarios do we capture data, and what happens with cases where we haven’t captured enough data? How do we create a sound safety argument around this? Ultimately this leads to ethical questions: what are the guardrails for allowing or not allowing systems on the street? And, let’s be clear about this: there will always be accidents. You will never get a 100% safe system, and the question is whether we accept this, and at which stage we accept that there are residual risks. And this is a question that is both for technologists and for society at large to decide.

The amount of responsibility when driving a car is big, and it can already be difficult for a human to anticipate all the possible things that can happen on the street. How does this translate to a self-driving car?

The basic idea is to have an approach that is similar to how we humans learn to drive. We know a set of rules, and we have collected experience via driving lessons. We don’t have a clear plan for every possible situation, but we have, let’s say, some kind of abstract data set in our heads where we are able to transfer situations or solutions to other scenarios. That’s the data driven approach.

Given all of these challenges, how do you see the future of Automated Driving? When can we expect to see automated cars?

This depends on what we expect. How much are we willing to pay for it? Probably it’s technically possible already. It’s a question of cost, obviously, and it’s probably not achievable for the broad public right now. But it’s something that is interesting for everybody. And on the other hand, it’s a question of performance. Today we still see a lot of new “corner cases” where many of these fully automated cars fail. And the question is to what extent we accept this.

In general, these challenges can be approached from two different angles. There’s one approach that is building on the (rather rules-based) assistance systems we see already in many of today’s cars. We’re trying to develop these assistance systems that are not fully automated but are the first steps towards helping the driver. These systems help us collect data and experience and iteratively expand their functionality.

And then there’s the top-down approach that basically strives to directly build a fully autonomous car, neglecting economic constraints such as cost or compute power, in the hope that the finished solution can later be scaled down to reasonable setups. At some point, hopefully, these two approaches will meet.

What, in your opinion, are the organizational and legal effects in bringing an automated car to the streets?

Again, it depends on the level of automation. There is a classification by the Society of Automotive Engineers (SAE) which defines six levels of automation, from zero to five. Level five corresponds to the highest degree of automation, where no human driver is involved. Level zero means no automation or assistance at all. From level three upwards, the autonomous system actually takes over the responsibility for driving, at least for a limited amount of time. For the fully autonomous level-five car, there is currently no consensus on how to handle this legally. It’s not allowed in most countries.

There is, for example, the Vienna Convention on Road Traffic from 1968, which many countries around the world have adopted. It basically says there has to be a driver in the vehicle in charge of driving at all times. Only recently have some countries taken measures to soften this regulation and taken steps towards making self-driving cars legally possible. In Germany, about five years ago, there were some changes to the laws that made it possible for the driver not to have direct control of the car at every point in time.

“I’m pretty sure that changes in laws will happen. The question is when it will be fully accepted, that is, when it will have become normalized in the society at large.”

Many of the companies involved are lobbying for changes to the laws. We can already see that certain areas of the world, primarily the U.S. and China, are pushing for this kind of legislation. And there have already been some regional adjustments. For example, in the states of Nevada and California self-driving cars are allowed under certain conditions. I’m pretty sure that changes in laws will happen. The question is when it will be fully accepted, that is, when it will have become normalized in the society at large.

Fully autonomous cars will have other impacts as well. They may have consequences for different sectors of the economy, for example the car industry, because fewer cars would be needed. Most of our cars are just standing idle 90% of the time, and that’s because they are waiting for us. If they could drive to different locations themselves, we could probably get away with fewer cars. That’s one thing. And obviously the whole business of taxis, business shuttles and so on would be in trouble.

And lastly: why is SDS important and what do these conferences bring to the community?

From my perspective the main benefit is that this kind of conference brings together researchers, practitioners and decision makers to exchange ideas. And I would like to say that, especially with respect to data science, the exchange of ideas and also the exchange of data – knowing what data is available where, and how and what can be done with it – is especially important for the data science community.

“if we look at things like autonomous driving, it is not just a technical question, it’s also a question for the society. What do we expect and what kind of risks do we accept? And this means that there has to be a discourse in the society about this.”

And last but not least, if we look at things like autonomous driving, it is not just a technical question, it’s also a question for the society. What do we expect and what kind of risks do we accept? And this means that there has to be a discourse in the society about this. And that’s why it’s important to talk openly and come to an overall decision on how to cope with the challenges.

Interview with Keynote Speaker Joe Hellerstein

Joe Hellerstein is a Professor at the University of California, Berkeley, and founder of the software company Trifacta. He works broadly in computer science and data management, on everything from data systems to human-computer interaction, machine learning and networking.

Can you tell us a little about yourself and your background?

I grew up in Wisconsin, in the United States. My mom was a computing pioneer; she worked in computer science already in the late 50s and early 60s, so she’s had a big influence on me. And my dad as well – he was a mathematics professor. And I have big sisters, one of whom is a computer scientist. So, there are many family influences on my choice of career.

I have been a Professor at Berkeley now for about 25 years. About nine years ago I co-founded the company Trifacta. The goal of Trifacta is to transform data into a shape for use without having to write code. This allows people who aren’t coders to do their own data preparation, and it allows people who are coders to do things much more effectively. Trifacta came about as an extension of research that I was doing with some colleagues at Stanford, and we then founded the company together.

For fun I play the trumpet.

What will you be talking about at the conference?

At the conference I’m going to talk about data engineering. I will talk about how important it is to the data science lifecycle and how the tasks for data engineering are shifting from being a burden on a small select few in IT departments to something that everybody in data science can and must take on.

What do you tell people when they complain about data cleaning before they can do the fun machine learning stuff?

First of all, I tell people that they never know their data as well as they do when they are in the middle of preparing it for use. That’s when you get the complete context of what is in the data and what to do to get it in the form you need in order for it to work. You’re in a very intimate relationship with the data. It’s like when you’re deeply in practice with a piece of music—you really are immersed. If you’re not engaged in this process, then you probably don’t actually know what’s going on. It is only at the point of preparation when you’re really intimate with the material.

“In the machine learning lifecycle, the point of maximum agency takes place when you’re doing the data preparation and featurization. That is when you as a person have the most influence on things.”

And I would actually take a little issue with the framing of the question because mostly with machine learning, all you’re doing is turning on a model and seeing what pops up, and there’s not a lot of agency in that. In the machine learning lifecycle, the point of maximum agency takes place when you’re doing the data preparation and featurization. That is when you as a person have the most influence on things.

But we don’t do enough experiential teaching on this in universities. We tend to give students pre-prepared data sets and then they don’t get the experience of preparation until they’re in the field.

I will also say that over the years many of the tools for data preparation have been very poor, which has made the task unpleasant. It often looks like programming and you’re in practice not immersed in the data, you’re immersed in some code. I think that has to change. That’s actually a big part of what we do at Trifacta.

One of our conference topics this year is Learning from Little. How different are the big and little data problems and their solutions?

It’s such a lovely question. Partly it’s a nice question because, of course, when you start by thinking about why data is so big, you really only focus on the aspect of scale and performance and you don’t really focus on the quality of the data: what’s in there and how to get it into shape to use it.

Scale can be a problem for the user even with small data, because we as humans cannot really work with large data sets—our heads don’t do that. We need computational aids to look at more than a screenful of data. So, when you look at a table that is spread over 20 screens—which is only a few kilobytes of data!—you will not be able to keep it in context in your head. So, all the problems on a human scale happen already with very small scales of data. And they, as much as big data sets, challenge us in a bunch of ways to be able to do what we want, raising the questions: how do I know what’s in here and how will I make sure that what’s in here is appropriate to my task? This happens even on a very modest scale. So those questions should be present in everyone’s mind.

“You always need to be asking what’s missing. And small data kind of drives us to that question right away, which I think is great.”

And when you’re working with small data you almost always ask the question: what am I missing? Which is the question that you may forget to ask if you have this giant dataset. Which is a problem, because no dataset covers all the data that could have been generated by a specific phenomenon in the world. Even with banking transactions you probably don’t have all the transaction scenarios from the beginning of time. So, you can take these very humble computerized tasks, and you can still not get the complete data. You always need to be asking what’s missing. And small data kind of drives us to that question right away, which I think is great.

And in Trifacta, we start by giving people a small sample of their data set, even if the full set is large, because they can then interact with the data quickly and hypothesize about what they will find. They can try out different transformations to see what they get. And all of that happens at the speed of thought rather than at the speed of some gigantic computing task. We have an architecture that is called “sample to scale”, where we give you a sample to work on and then, when you believe that the work you’ve done is the right task for the whole data set, we compile it down to a job you can run on a big data platform. That’s a computer science compiler problem that we’re handling for you, and then running the job is a back-end systems problem handled by infrastructure. But the hard part of your job is the exploration and transformation work you do on the sample, in order to get it into shape.

So, we’re very much on board with the idea that even with large data sets you want to “go little” in order to get that fluidity: experimentation and exploration. So, I think it’s a wonderful theme.
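As a rough illustration of the “sample to scale” idea described above, here is a minimal Python sketch: a preparation recipe is refined interactively on a small pandas sample, then the very same function is applied to the full dataset. The synthetic data and the prepare function are assumptions for illustration only; in Trifacta the recipe would be compiled into a job for a big data platform rather than run in the same process.

```python
# Minimal "sample to scale" sketch (illustrative, not Trifacta's engine).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for a large raw dataset.
full = pd.DataFrame({
    "amount": rng.normal(100, 30, 1_000_000).round(2),
    "currency": rng.choice(["chf", "CHF", "eur ", "EUR"], 1_000_000),
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """The transformation recipe developed interactively on the sample."""
    out = df.copy()
    out["currency"] = out["currency"].str.strip().str.upper()  # normalize labels
    out = out[out["amount"] > 0]                                # drop bad rows
    return out


# 1. Explore and refine the recipe on a small sample, at the speed of thought.
sample = full.sample(1_000, random_state=0)
print(prepare(sample)["currency"].value_counts())

# 2. Once the recipe looks right, run the same logic over the full data
#    (in production this step would be compiled to a distributed job).
clean = prepare(full)
print(len(clean), "rows after preparation")
```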

And lastly: why is SDS important and what do these conferences bring to the community?

I think it’s really important for practitioners, technologists and researchers to be together in a dialogue about what matters and what innovations can do to help. I think when research is done in a vacuum, you sometimes get innovations that aren’t really great for people to use—people can’t adopt the technology because it is too hard to use or too generally focused. The feedback from practice to research lets researchers understand what holds people back from getting value out of data, and that’s critical to the research effort.

“And in my own work in particular I’m very informed by practitioners, with the idea that innovation in computer science may be about helping practitioners do their jobs better as opposed to creating things from scratch that nobody asked for.”

At the same time, there’s a lot of creative work that goes on in R&D in both universities and companies that practitioners can learn from. I see it very much as a two-way street. And in my own work in particular I’m very informed by practitioners, with the idea that innovation in computer science may be about helping practitioners do their jobs better as opposed to creating things from scratch that nobody asked for. That dialogue can be quite healthy.