
June 27, 2024

Innovation in the Age of Privacy, with Surfshark's Data Science Team Lead

Find out how Surfshark use synthetic data to innovate whilst protecting privacy, with Dovilė Komolovaitė

Ellen Merryweather

Senior Content Manager

Data Science & Machine Learning

Data Science is taking great strides, as Artificial Intelligence sweeps over every corner of the tech industry. But how do you balance innovation and disruption whilst prioritising data privacy? We invited Surfshark's Data Science Team Lead, Dovilė Komolovaitė, to a special interview to ask just that.

Catch up on our conversation and find out about the potential uses and unique challenges of using synthetic data, as well as Dovilė's predictions for the future of Data Science.

About Dovilė Komolovaitė

Summary

Value Alignment and Motivation:

  • It's important to align personal values with the company's mission. Dovilė joined Incogni because she was motivated by their commitment to data privacy and protecting people's rights.

Innovation in Data Science:

  • Advancements in technologies such as transformer models, diffusion models, and generative adversarial networks are driving significant innovations, particularly in the medical field.

Evolving Data Science Practices:

  • The use of AI developer tools and synthetically generated data is becoming more common, helping to automate tasks, ensure data privacy, and reduce bias in data models.

Challenges with Synthetic Data:

  • While synthetic data can maintain the statistical properties of real data, its reliability depends on the quality of the real data used. Addressing outliers and ensuring representativeness are critical challenges.

Balancing Innovation and Privacy:

  • Responsible data collection and usage are key to balancing advancements in data science with data privacy. Adhering to regulations and ensuring diversity and fairness in datasets and algorithms are essential for ethical AI development.

Balancing Innovation and Data Privacy

Ellen (0:03 - 0:33): Hi everyone, welcome. Thank you so much for joining us. We've got a very exciting guest today. I'm here with Dovilė from Incogni, and we're going to talk about innovation in the era of privacy and some challenges for data scientists. So Dovilė, welcome. Thank you so much for joining.

Dovilė (0:33 - 3:20): Thank you for having me. So, I am Dovilė, and I am based in Kaunas, Lithuania. I'm currently a data science team lead working with the Incogni product as you mentioned, and I've been in this field for about five years now. I jumped into data science right after finishing my bachelor's degree in applied mathematics, and honestly, back then I had no idea what data science even was. I was just figuring out what I wanted to do, and I imagined that if I followed what I was interested in, I would eventually find a job I enjoy. As I was really into numerical methods and optimization algorithms back in university, that kind of set the stage for my career path.

I remember I was in my first job at a cloud engineering company. I was really fortunate to get the chance to work on some amazing projects involving computer vision, signal processing, and even geospatial data processing. We were working on optical character recognition problems for handwritten documents, detecting items being scanned at self-checkouts, and predicting early rail defects. All of these problems were very interesting and challenging at the same time.

Naturally, after some time, I realized I needed to learn more and dive deeper into the things we were doing. So I enrolled in a master's program in artificial intelligence informatics. My main project was about predicting visual stimuli from brain signals. I've always been fascinated by and wanted to work with neurology data, and I felt like it was a very good step to take. That's the great thing about a master's: if there's a particular domain and problem you want to tackle, the entire two years are spent researching and working on it.

During that time, I also joined a startup where we were working on detecting cyberbullying in the Lithuanian language. After my master's, I worked at a data science consultancy company for a while, and eventually, I ended up at Incogni. Very recently, I started leading a team of five, and I feel like with every little experience, I am constantly growing, and there's still so much more to learn ahead.

Ellen (3:20 - 3:39): And what was it about Incogni in particular? Because whenever we join a new company or start a new project, there's always certain things that really attract us. What was it about Incogni that made you think, "Yeah, that's the one for me"?

Dovilė (3:39 - 4:52): Yeah, that's a great question. I think it's really important that your values align with the company's values in some way. For me, I've always been driven by the idea of doing something meaningful in my work. If I can use my math and coding skills to solve important problems, that's where I find my sweet spot.

During the interview with Incogni, I saw their motivation and drive to protect people's rights to their own data, and that intrigued me the most. Of course, I can admit that I am not a data privacy expert. On the contrary, I consider myself a generalist who can adapt. This flexibility is what I think makes my job exciting.

Here at Incogni, we help clients remove their personal information, ensuring companies comply with data privacy regulations. As data scientists, we support all the automation processes by applying machine learning techniques.

Ellen (4:53 - 5:43): It's really interesting what you said about how you don't consider yourself an expert, but you think it was your passion for the field that helped you get the job. That's something we see a lot with our students as well. Even if they have maybe 80% of the requirements for the job they're going for, if they're really driven by the mission, hiring managers are much more attracted to candidates who have the same drive and want to reach the same end goal. So, just for our audience of probably aspiring data scientists, I think that's a really important message to highlight.

Let's talk about innovations in data science. One of the main topics we want to discuss today is some of the most exciting innovations you're seeing in data science right now.

Dovilė (5:45 - 7:27): Yeah, there are so many exciting things happening in data science right now. Technologies that everyone uses for text, image, or video generation are advancing rapidly thanks to progress in transformer models, diffusion models, and generative adversarial networks. These technologies have such a broad value proposition that they draw attention and investments from various industries.

I feel that they are already widely discussed, so I'd like to draw attention to how these innovations are impacting the medical field. Innovations in one area are often quickly adopted in others. One particular example that I'm very intrigued by is transforming electroencephalogram (EEG) brain signals into images. Approaches like brain-to-image and dream fusion are laying the groundwork for amazing future possibilities. Extracting meaningful EEG signal patterns might allow us to understand the human brain better, develop more advanced brain-computer interfaces, convert dreams into visual representations, and revisit memories. It sounds futuristic, but based on research, the future is closer than it looks. As fascinating as it is, it also raises many questions about ethical use.

Ellen (7:28 - 8:20): Yeah, so I think that's definitely the most intriguing part for me at least. It's such a cliché saying, but I'm seeing it everywhere when looking at advancements in AI, machine learning, and data science. Everyone keeps saying, "Science fiction is now science fact," and it's such a cliché, but it is also so true. All these things that we thought weren't possible or were going to be sort of in the year 3000 are all right at our doorstep.

Things are moving so quickly, and I think maybe a few decades ago, you would say, "I've only been in the field for five years," but in your field, five years is actually a very long time. Have you noticed any difference in the way data scientists work now compared to maybe a handful of years ago with all of the rapid changes?

Dovilė (8:21 - 10:28): Yeah, I think the impact is very significant. With advancements in generative AI, data scientists are coding with AI developer tools that can automate various tasks and speed up the process. It's like always having a colleague next to you and doing pair programming with an AI assistant. This has become the norm in today's world.

Another thing I observe is that we as data scientists have increasingly begun to utilize synthetically generated data by employing generative AI. The idea is not new by itself, but now it has become easier to generate artificial data that still maintains all the statistical properties of the original. This was expected as the models are data-driven, and the more data you have, the more complicated models you can use. Generating synthetic data opens the door for many other data-extracting opportunities. For example, it doesn't have to involve human labeling, which we all know is expensive and time-consuming. Or it might only partially involve humans as we still need to supervise the model and gather human feedback.

Another very important component is that by using synthetic data, we can ensure it does not contain personally identifiable information, which is beneficial for data privacy reasons and can help eliminate potential bias. We can now represent a diverse range of cohorts. I believe data scientists will increasingly rely on synthetic data, and with today's advancements, we can achieve larger, higher-quality artificial datasets that ensure data privacy and are still effective in model training.
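
To make the synthetic data idea concrete, here is a minimal sketch in Python of what "fit a generative model to real data, then sample artificial rows" can look like. The dataset, column names, and the simple Gaussian model are purely illustrative assumptions, not Incogni's actual pipeline; in practice, teams reach for more expressive generators such as GANs, diffusion models, or copula-based tools.

```python
# Minimal sketch: fit a simple generative model to real (numeric) data, then
# sample artificial rows that preserve the broad statistical structure without
# copying any individual record. Columns and the Gaussian assumption are
# illustrative only, not Incogni's setup.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: numeric features only, no direct identifiers.
real = pd.DataFrame({
    "requests_per_month": rng.poisson(12, size=1_000),
    "days_since_signup": rng.exponential(200, size=1_000),
    "removal_success_rate": rng.beta(8, 2, size=1_000),
})

# Fit a multivariate Gaussian to the real data's means and covariances.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample as many synthetic rows as we like; none corresponds to a real person.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=5_000),
    columns=real.columns,
)

# Compare overall statistics of real vs. synthetic data.
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```

The point of the sketch is the property Dovilė describes: the synthetic rows reproduce the real data's overall statistics while no row corresponds to a real person. A naive Gaussian will, for instance, happily produce negative counts, which is exactly why the quality checks discussed next matter.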

Challenges of Synthetic Data

Ellen (10:28 - 10:46): And what are some of the challenges that come with that? Because there's always sort of two sides of the coin with any new innovation or any tech. You always have a heap of benefits, but then there's always the challenges to figure out. What would you say are some of the main challenges in using synthetic data?

Dovilė (10:47 - 12:18): Yeah, it's not a one-size-fits-all solution; it has certain limitations. Synthetic data is only as reliable as the real data used to generate it. Ensuring the quality and representativeness of the real data is crucial in this process.

We also have to be mindful of outliers that might no longer exist in the artificially generated data, so we need to find a way to address them as well. Additionally, generating artificial data can be computationally expensive and time-consuming, depending on the problem. These are the main limitations, but the advantages outweigh them. Using synthetic data for training helps prevent model inversion attacks, where someone tries to reconstruct the input data, and it ensures that even if the deep learning model memorizes training data, it won't leak sensitive information. This way, we can ensure our models are as privacy-preserving as possible.
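
As a hedged illustration of the validation challenge Dovilė describes, here is a small sketch that compares a synthetic column against its real counterpart: a two-sample Kolmogorov-Smirnov test for distributional drift, plus a crude check of how much of the real data's tails the generator reproduces. The example data and the thresholds you would act on are assumptions for the sake of the demo.

```python
# Sanity-checking synthetic data: does each synthetic column match the real
# column's distribution, and does it still reach the real data's outliers?
import numpy as np
from scipy.stats import ks_2samp

def compare_marginals(real_col, synth_col):
    """Compare one real column against its synthetic counterpart."""
    result = ks_2samp(real_col, synth_col)           # distributional drift
    lo, hi = np.percentile(real_col, [1, 99])        # real data's tails
    tail_fraction = np.mean((synth_col < lo) | (synth_col > hi))
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "synthetic_tail_fraction": tail_fraction,    # roughly 0.02 if tails are covered
    }

# Hypothetical example: a generator that smooths away a heavy tail.
rng = np.random.default_rng(0)
real_col = rng.exponential(scale=200, size=2_000)        # heavy-tailed real feature
synth_col = rng.normal(loc=200, scale=200, size=2_000)   # over-smoothed synthetic version

print(compare_marginals(real_col, synth_col))
```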

Ellen (12:18 - 12:36): I'd love to dive a little bit more into the topic of data privacy since that's what Incogni is all about. Could you provide an example of where data science specifically has significantly improved a service through the implementation of data privacy practices?

Dovilė (12:37 - 13:46): Yes, in one of our projects, we are tackling the task of automating data removal processes within the domain of NLP. When we incorporate synthetic data into the training process, we witness a noticeable enhancement in the model's ability to generalize. Although there is a slight decrease in accuracy with real data, we decided to move forward with this approach as it ensures the model is more robust and adaptive to data drifts. It also strengthens data privacy measures while maintaining a high level of performance. In our context, this approach represents a trade-off we are willing to make. There might be cases where losing performance in real-world scenarios is not a trade-off worth taking, especially in critical applications. However, there are other techniques available to preserve privacy.
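
Incogni's NLP pipeline isn't described in detail here, so the following is only a generic, hypothetical sketch of the trade-off Dovilė mentions: train one model on real data alone and one on real plus synthetic data, then compare both on a held-out split of real data. The "synthetic" rows below are just noise-perturbed copies standing in for a proper generator.

```python
# Hypothetical illustration of the real-vs-augmented trade-off: compare a
# model trained on real data only against one trained on real + synthetic
# data, evaluating both on held-out real data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for real labelled data (e.g. feature vectors from an NLP model).
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Stand-in for synthetic data: noise-perturbed copies of the training set.
X_synth = X_train + rng.normal(scale=0.5, size=X_train.shape)
y_synth = y_train

baseline = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
augmented = LogisticRegression(max_iter=1_000).fit(
    np.vstack([X_train, X_synth]), np.concatenate([y_train, y_synth])
)

# A small drop on real test data may be an acceptable price for robustness
# and privacy, as described in the interview.
print("real-only accuracy:", baseline.score(X_test, y_test))
print("real+synthetic accuracy:", augmented.score(X_test, y_test))
```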

Ellen (13:47 - 14:14): I know you've touched on this next question a little bit already, but as a data scientist, you want to make advancements in the field, innovate, and create new practices. But there's also data privacy to consider. How do you balance those things, and how important is data privacy in data science? I imagine you're going to say very, but you'll have a much more sophisticated answer than just very.

Dovilė (14:15 - 16:32): Yes, very. It's something we think about constantly. In this field, data is the fuel that powers our models, and the more high-quality data we have, the better our models perform. But there's always a balance to strike here. Collecting and using data responsibly is key. Regulations like GDPR and CCPA play a significant role in our work. We must adhere to these laws to ensure we minimize data collection. Additionally, the EU AI Act focuses on the ethical use of AI, setting different rules depending on the level of risk associated with the application.

We need to be aware of the risks our applications might cause and the agreed-upon rules governing them. As data scientists, ethics and data privacy go hand in hand. We must ensure our datasets are diverse and our algorithms are fair. Simply removing sensitive variables does not guarantee fairness. It does not prevent models from unintentionally picking up biases from other data points. For example, behavior data can reveal a lot about a person, like demographics, habits, culture, and more. Take emoji usage, for example. Certain emojis might be more common in specific regions and cultures. By analyzing someone's emoji choices, you could potentially guess their region. That's why it's crucial for us as data scientists to stay on our toes and be aware of both the obvious and hidden features that might influence our models' behavior. We need to build models that are both accurate and responsible.
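
To illustrate the point about hidden proxies (the emoji example above), here is a small hypothetical check: drop the sensitive attribute, then see whether a simple model can still recover it from the remaining features. The data and column names are invented for the example; a score well above chance means the "removed" attribute is still encoded in the dataset.

```python
# Proxy-leakage check: even after dropping a sensitive column, can the
# remaining features predict it? All data below is synthetic and hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 5_000

region = rng.integers(0, 2, size=n)                             # sensitive attribute we "remove"
emoji_rate = rng.normal(loc=region * 1.5, scale=1.0, size=n)    # proxy feature correlated with region
msg_length = rng.normal(loc=80, scale=20, size=n)               # unrelated feature

features = pd.DataFrame({"emoji_rate": emoji_rate, "msg_length": msg_length})

# If this score is well above 0.5, the sensitive attribute is still encoded
# in the remaining features, so dropping the column did not remove the bias.
leakage = cross_val_score(LogisticRegression(), features, region, cv=5).mean()
print(f"sensitive attribute recoverable with accuracy ~ {leakage:.2f}")
```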

Attitudes Towards Data Privacy

Ellen (16:34 - 17:16): With that in mind, we all know we should be careful who we give our data to online. But it's very easy to get lazy and just hit accept on the terms and conditions, sign up for a service, get a subscription, download an app. Why should we be more mindful about our data privacy online? Why is it really important for individuals to be more mindful of that?

Dovilė (17:17 - 18:40): The first thing we should consider is changing our attitudes towards data privacy. We have to think of it the same way we approach insurance—something we invest in to protect ourselves against unforeseen events. Especially now with the rapid advancements in AI technologies, while they bring incredible benefits, they also open the door for more sophisticated fraudulent activities mimicking human behavior. Safeguarding your private data today can help prevent potential misuse both now and in the future. We all need to be aware that even seemingly insignificant pieces of personal data, as you mentioned, can be aggregated, sold, and made public. This might cause serious consequences if they fall into the wrong hands. Therefore, it's essential to be mindful of where our data is and how it is being used.

Ellen (18:41 - 19:17): Why do you think that, as individuals, we can get so relaxed about our data privacy? Like I just mentioned, signing up for services without thinking about it too much. When we think of all our data being public, it's horrifying, yet so many people don't take any steps to prevent that. Do you think it's a safety-in-numbers thing, like "If everyone's data is online, what are the chances of my data being used against me"?

Dovilė (19:18 - 19:32): I actually have no idea. I guess we just fall into that habit and routine and tend not to think about it as much as we should.

Career Advice for Data Scientists

Ellen (19:32 - 20:19): We're slightly running out of time, so I want to dive into the next section of our talk, which is career advice for the aspiring data scientists in the room with us now. In your opinion, or from what you've seen at Incogni or in the field, if there's someone who is either a data scientist or an aspiring data scientist who really wants to focus on privacy, what key skills and knowledge sets would you recommend they focus on?

Dovilė (20:21 - 21:52): Good question. During interviews, we always focus on critical thinking skills and how mathematical concepts can be translated to solve business problems. That's the foundation. It's also important to have good collaboration skills as data science involves a lot of communication with stakeholders and teamwork. The last key skill is having an agile mindset with a focus on continuous improvement. If you don't update your knowledge constantly, you'll fall behind, especially considering how rapidly the field evolves right now. These are the main skills that are general for all data science positions. Domain knowledge is also very important, but it can be learned within the company from domain experts, books, and other resources. Motivation and the ability to learn quickly are more important. I always suggest focusing on being a generalist first and then transitioning to become a specialist in a particular domain.

Evolution of the Field

Ellen (21:53 - 22:05): How do you see the field of data science evolving in the next five to ten years? Where do you see things going, or what trends are you noticing that you think are really going to pick up speed?

Dovilė (22:07 - 23:22): That's a tough question. I'm more inclined to trust tree models to predict the future. But considering how fast technology is moving, a lot can change over the next five to ten years in ways we can't even imagine. Taking a conservative view, since we were talking about using synthetic data, I anticipate significant enhancements in simulating physical and chemical environments. It's fascinating to see how these simulations are getting more accurate and sophisticated, mimicking the real world. We should also see a bigger focus on responsible AI, driven by concerns about data privacy and ethics. Explainable AI will play a crucial role in keeping things transparent, especially for high-risk applications in healthcare and finance. In terms of day-to-day work, I think we will see stronger collaboration between legal experts and data scientists.

Ellen (23:23 - 23:48): One last question before we say goodbye to everyone. If you had to very quickly summarize what it's like being a data scientist or what your experience being a data scientist is like, what would you say to convince someone that data science is fantastic?

Dovilė (23:51 - 24:53): I would say that data science requires a lot of creativity. You have to think of out-of-the-box solutions quite a lot. It's not only about learning and knowing hard skills; it involves a lot of collaboration, brainstorming activities with your team, and thinking creatively. So, I would say it's a very intriguing and exciting job, and I'm very fascinated and excited to be working here.

Launch Your Career, with Ironhack's Data Science and Machine Learning Bootcamp

If you're ready to follow in Dovilė's footsteps and launch your career as a Data Scientist, check out our Data Science and Machine Learning Bootcamp!

In just a few months, pick up the in-demand skills and tools to:

- Dive into the data deep-end as a Data Scientist,

- Gear up for AI adventures as an AI Engineer,

- Build the backbone of big data as a Data Engineer, and much more! 
