WHAT the Data?!

Let's talk about how we leverage our data to improve user targeting and experience. Join our podcast while we explore how our guests use data, how they develop the user experience, and how they improve their marketing expenses.
Lior Barak, data strategist, author of "Data is Like a Plate of Hummus" and founder of "Tale About Data", and Michael Stiller, analyst and founder of STILLGROVE, explore how companies collect, process, and use their data, and how this improves the user experience of their products.

Jonathan Rioux - What, How and When to use PySpark (#18)

April 20, 2021 • Lior Barak, Michael Stiller, and Jonathan Rioux • Season 1 • Episode 18

Have you heard about the power of PySpark?
Jonathan Rioux joins Michael and Lior for an amazing episode about the power of Spark and Python, as well as why he decided to write an entire book about PySpark.

In today’s episode, we have Jonathan Rioux, who leads the data science operations for EPAM Canada and is writing a book about large-scale data analysis.

Michael and I talked with Jonathan about setting up your Spark cluster using Python, what you should know about it, and why he started writing his book.

Michael went into the technical details, as he loves to, and I stayed focused on the product. Enjoy this episode!

Get an extra discount on all books with the code podwtd20 at https://www.manning.com/
Wish to talk with Jonathan?
LinkedIn: https://www.linkedin.com/in/jonathanrx/
Twitter: https://twitter.com/lejonesberg
Link to Jonathan's book: https://www.manning.com/books/data-analysis-with-python-and-pyspark

-------

Give us feedback - https://forms.gle/sTbkzHPUo86nqM6y8
Interested in joining our podcast as a guest? Fill out the following form: https://forms.gle/gV6hobxNmBgfdDoa6
Join our mailing list - https://bit.ly/2MXfK8Y
You can follow us on social media
- Facebook: https://www.facebook.com/whathedata/
- LinkedIn: https://www.linkedin.com/company/what-the-data-podcast
Have a data-related question? Submit it here: http://bit.ly/3rj5VBY

As always, if you enjoyed this episode, please like, subscribe, and share it with your network so we can reach new audiences.

Jonathan Rioux 0:06
Welcome to the WHAT the Data?! podcast with your hosts, Mitch and Lior.

Lior Barak 0:22
Hey, Jonathan, I'm so happy to have you on the show today. How's it going?

Jonathan Rioux 0:26
It's going very well. How are you?

Lior Barak 0:28
Good, good. So you are located in Canada, where it's quite cold for you. Maybe you can tell us a little bit about yourself and what you're doing day to day?

Jonathan Rioux 0:40
Sure. So I work for a software consultancy called EPAM Systems. We're a global consultancy that basically specializes in custom solutions. I'm leading the data science practice from Canada. We have a couple of local heads scattered through the world, but, you know, especially with COVID, the boundaries between countries kind of became non-existent: you're behind your computer, and you work with a bunch of other people that just happen to be in different time zones. So a lot of my job relates to creating machine learning models and solving optimization problems for a variety of clients. And I've been doing this with EPAM for, it's going to be, almost three years.

Lior Barak 1:36
And you also wrote the book, right?

Jonathan Rioux 1:40
Yeah. Yeah. I mean, in my spare time, I'm also writing a book on basically what I like the most about my work, which is performing analytics on large datasets. So my book is called Data Analysis with Python and PySpark.

It's with Manning Publications. And hopefully, I'm aiming to be done writing it before the end of the year.

Lior Barak 2:15
That's cool. Would you give us a little bit more detail about the book?

Jonathan Rioux 2:18
Oh, for sure. I mean, one of the things that I found, through my experience working with a variety of clients and also my previous experience in the workplace, is that datasets are growing, growing, growing, growing, but computers are not following. You know, you still have cloud, which is fantastic. But when you want to be able to do something quick, or you want to be able to scale because your data is getting larger than what's reasonable for your computer... You know, 16 gigs of RAM has been pretty standard, and you're able to get laptops with 32 to 64. But we're still far from getting those huge machines with terabytes of data that can crunch everything locally. So I accidentally started using Spark and really liked the data model; the programming model felt very natural to me. And then, as I was trying to look through documentation and perfect my knowledge, I realized that the documentation, the blog posts, some of the material that was available online didn't really answer all the questions that I had. So I started drafting my own. I mean, in the workplace, I took the source code, reverse engineered it, and learned Scala on the side. And fast forward maybe a year or so, I realized that I had a lot of material that was dated, but that I could kind of consolidate into a book. And this is what I've been working on. So it's really a big labor of love. PySpark is probably the tool that I'm going to default to; it's very intuitive to me, and I find it to be extremely powerful. But yeah, given the documentation that's available online, I think that my book supplements what's available quite well.

Lior Barak 4:36
Cool.

Michael Stiller 4:37
Hey, everyone. I just snuck in while the two of you were talking now. I'm probably late. Am I right?

Lior Barak 4:45
It's already started.

Michael Stiller 4:47
Okay, really? Sorry. Yeah, I was just running a little bit long with a bedtime story tonight. So yeah, sorry, guys.

Jonathan Rioux 4:54
No problem.

Michael Stiller 4:56
So my first question, after eavesdropping on your call for a little bit: could you maybe give us a quick working definition of what Spark actually is?

Jonathan Rioux 5:06
For sure. So, to me, the best way to describe Spark is as a way to perform data analysis, data science, and data transformation on a cluster of multiple machines as if they were a single entity. Distributed computing is something that's quite tough to nail properly; there are a lot of considerations that need to happen there. And I find that Spark has the right level of abstraction, and the right level of magic, to make it look like you're programming a single computer.

Yeah, as a TL;DR, I think it's the best way to describe it.

Michael Stiller 5:54
All right. Makes sense. So I think one other point we should probably clear up, too, is that not every use case necessarily needs a cluster for processing, of course. At the same time, it helps you to future-proof your project if you already build something that's scalable, right? So what are your main considerations for choosing a Spark approach, as opposed to running a Jupyter Notebook on someone's computer, that is, local processing of data on a non-cluster?

Jonathan Rioux 6:30
Oh, for sure. So, as you said, there are two camps of people. There are people that are going to default to Spark for pretty much everything because of the expressiveness of the framework, and because it's kind of a way to future-proof your analytics: you know your program is going to scale. My usual decision boundary, and this is something I was discussing a little bit earlier, is this: for the past, I would say, five to six years, the amount of memory that a stock laptop is getting hasn't moved that much. It's 16, or 32 gigs of RAM if you're lucky. We're able to squeeze more data within that RAM, but in terms of comfort level, the moment that you're spending more than what's reasonably available, both for storing your data and for processing your data in terms of memory, I think Spark starts to make sense. So I would say, like, 100 gigabytes and above. You can get a virtual machine that's going to do the same thing for you, but the more you push, the more it starts making sense to use Spark. And there's a second component, which I think is important not to discount: PySpark in itself uses Python, but as a data scientist you might prefer Python, your data engineer might prefer SQL, and you might have some ops people that are going to be Java or Scala. The fact that people can talk to the same library, using their language of choice, also makes it very powerful when you're working in the real world. It's easier to transition; the abstractions are the same. So to me, those two dimensions become kind of important.
But in terms of sheer size, usually, if I'm unable to open the data set on my local machine, to me it's a good sign that it's better to use Spark, even on a small cluster. Usually, my comfort level is better.
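
As a rough illustration of the rule of thumb Jonathan describes (compare the memory needed to store and process the data with what the machine has), here is a minimal Python sketch. The function name `suggest_engine` and the `working_factor` multiplier are hypothetical, chosen for illustration only; they are not part of Spark or PySpark, and the factor of 3 is an assumption, not a figure from the conversation.

```python
def suggest_engine(dataset_gb: float, ram_gb: float, working_factor: float = 3.0) -> str:
    """Suggest 'single machine' or 'spark' from a rough memory rule of thumb.

    working_factor is an assumed multiplier for the extra memory needed to
    process the data on top of storing it. It is not a Spark setting.
    """
    if dataset_gb <= 0 or ram_gb <= 0 or working_factor <= 0:
        raise ValueError("sizes and factor must be positive")
    needed_gb = dataset_gb * working_factor
    # If storing plus processing fits in RAM, a single machine is enough.
    return "single machine" if needed_gb <= ram_gb else "spark"


small_case = suggest_engine(dataset_gb=2, ram_gb=16)     # about 6 GB needed
large_case = suggest_engine(dataset_gb=100, ram_gb=32)   # about 300 GB needed
print(small_case, "/", large_case)
```

With these assumed numbers, a 2 GB file on a 16 GB laptop stays local, while a 100 GB data set on a 32 GB laptop points towards Spark.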

Lior Barak 8:58
If I may ask a question: I'm less technical than the median, actually, when we're talking about Spark, right? As far as I know, it's something that runs on the cloud. Can I run it locally on my computer?

Jonathan Rioux 9:15
Oh, yes. Yeah, yeah. So if you distill it: before cloud became vastly popular, there were some Hadoop vendors, you know, for the data lake. I know Cloudera was a major player, and you had Hortonworks. And as a matter of fact, Databricks, the company that sponsors the development of Spark and was founded by the creators of Spark, actually built a whole data ecosystem around Spark, which now is cloud native. But I have it on my machine. As a matter of fact, when I develop for my book, when I'm creating some code examples, I keep it on my machine. I don't want to spend money to keep a virtual machine, or a couple of virtual machines, on the cloud just to be able to do some examples. And it's quite easy to install on your computer; there are a couple of tricks that I explain in my book, and you can be up and running and work locally. And then, when you're ready to scale, you just get your cloud with the number of machines that you want and go all guns blazing.

Michael Stiller 10:35
I think it also makes sense to maybe just quickly talk a bit more about the notion of the cloud or the concept of how it is different from your machine. Because I think sometimes people underestimate the fact that it is just a machine. In the end,

Jonathan Rioux 10:50
it's somebody else's machine.

Michael Stiller 10:52
And so essentially, the main point of developing for the cloud is actually to just learn to standardize your code and use standard components. And then the cloud actually allows you to save a lot of time and effort on deploying your work.

Jonathan Rioux 11:06
One of the things that's quite interesting, and I think it's a big argument to learn Spark, is that Spark is standard across all major cloud providers. There are managed Spark instances, so you don't have to worry about configuring your environment: you select the number of machines that you want and how much memory they have, and then you can get started. They're going to take care of networking, they're going to take care of storage, they're going to take care of all of those little details that you would handle if you were to, let's say, configure Spark on your own cluster. Which I've actually tried: I bought some small Intel NUCs a couple of years ago and tried to install Spark on my own. I don't do that anymore; I just default to cloud.

Michael Stiller 11:55
Yes, that really sounds like a labor of love, as opposed to running a productive environment with the modern tools that you could have nowadays, right? For me, it's also been a big change in mindset when I started to work with the cloud a bit more and tried to understand what the advantages are. But I think one more thing that's important to keep in mind is always the use cases, right? So, for example, you have this Spark cluster that's now processing data: what would be specific use cases, and how would you process the data and get from A to B?

Jonathan Rioux 12:33
Well, I would say a lot of the bread and butter of Spark is really data transformations. Let's say your source of truth, the data that you have access to, is quite large. What's really cool about Spark is that it plays quite well with the standard tools. I'm going to use Python as an example, but this is also applicable to, like, sparklyr with R. One of the use cases that we see the most is people start with Spark: they take the data source that's from their data lake, their enterprise data warehouse, or their SQL database, and they process it to get it manageable. Because if you start with, let's say, a terabyte of data, you're not going to do your analysis on this. You're going to look into summarizing it, taking a couple of dimensions that are going to be useful. And then comes the moment when it makes sense to get the data onto a single node.

Let's say you have a couple hundred thousand records, which is totally manageable by a single machine. You can convert your PySpark data structure into something that's going to be Python native, or even pandas, and then continue doing your analysis the same way that you've always been doing. So Spark is not really playing against those players; it's really hand in hand. I see it as kind of a way to scale what you have. It doesn't mean that what you've learned before, or your statistical libraries, whether it is statsmodels or scikit-learn, are going down the drain. You're still able to use them. Spark has some pretty nifty features for parallelizing some of those workflows, which can be pretty fun.

But I would say this is probably the best way to start with Spark: you use it when you have a lot of data and the moment that it fits your use case. To me, the best example is charting. There's no point in having charts that are going to be distributed; the data needs to be on a single node to generate the image. So there's really kind of a good blueprint on when you are distributed and when you are on a single node, and you can move between both of them in a way that's almost seamless.

Lior Barak 15:18
Can I ask you a question? I've been aware of Spark for quite a while. But I also know that a lot of people say that the downside of it is how expensive it actually is to run a query, that many times it's not really efficient in the way that it's processing data, and that this is why it's costing you quite a lot of money. What is your opinion about it? Where do you think it's standing today compared to, let's say, five years ago? Has it improved or not?

Jonathan Rioux 15:45
Oh, this is really an excellent question. So if you look five years ago, Spark had a single abstraction. The way that Spark was working five years ago, think of it as a distributed collection. PySpark was slower because it wasn't optimized. Spark is a Scala application, so up until, I would say, three years ago, you had conversion between, let's say, Scala and Python that would happen at runtime, which made it very, very slow by today's standards. With Spark 2.0, what happened is Spark created a new data structure called the data frame, which is really similar to a pandas data frame: rows and columns, a little bit more powerful than the basic pandas data frame.

And then they bridged the gap between the main standard implementation, which was Scala, and the secondary implementations: Java, Python, R. I've heard that Microsoft is working on a .NET implementation, but it's not something I'm extremely familiar with. And that created a lot of momentum, because the data frame was a lot more familiar to data scientists and a lot more familiar to data engineers. And then again, with Spark 3.0, which was released, I think, last year, there was a big focus on Python interop, sorry, on making pandas and Spark work very seamlessly, in a way that leverages the best of both worlds. So there's a lot of work that went into performance. The second part is one of my pet peeves when I'm teaching Spark, and one of the reasons why I wrote the book: you have to think differently when you're programming in Spark. The data model is quite simple, but you cannot think about your data set as being something that's always available. One of the best examples is that Spark will actually not store intermediate data. Because if you think of it, if you're working with terabytes of data, every single time you did a transformation, it would generate terabytes of data.

Spark is actually more intelligent than this, to be able to optimize the memory, which gives the impression that it's slow because it's repeating the operation again and again. So knowing this and doing a little bit of planning ahead when you're working with large data sets is going to pay dividends. But a lot of people are jumping in and saying, well, it's just like pandas, but distributed, and they forget that actually working with larger data sets comes with its own set of problems that you have to address. So I think that those two dimensions kind of gave Spark a bad reputation for quite some time. But the speed of innovation that happened, and people getting more aware of the complications of doing distributed data processing, I think are giving Spark kind of a renaissance. Since Spark 3.0, I'm seeing a lot of people having momentum; my book sales are up. So I'm quite excited about what they've done, and I'm quite excited about what's coming as well.
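
The point about Spark not storing intermediate data can be pictured with a toy model in plain Python. This is a sketch for intuition only, not how Spark is implemented: the class `LazyPipeline` and its methods are hypothetical stand-ins that record steps and recompute them on every `collect()` unless `cache()` has been called. (PySpark data frames do offer a `cache()` method, but its internals differ from this toy.)

```python
class LazyPipeline:
    """Toy lazy pipeline: steps are recorded and only run when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)
        self._use_cache = False
        self._result = None

    def map(self, fn):
        # Record the step; nothing is computed here.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def cache(self):
        # Mark the pipeline so the next collect() keeps its result.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._result is not None:
            return list(self._result)
        rows = self._source
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        if self._use_cache:
            self._result = list(rows)
        return list(rows)


counter = {"calls": 0}

def expensive_double(x):
    counter["calls"] += 1  # count how often the "expensive" step really runs
    return x * 2

pipeline = LazyPipeline(range(5)).map(expensive_double).filter(lambda x: x > 2)
calls_before_collect = counter["calls"]      # nothing has run yet

first = pipeline.collect()
second = pipeline.collect()
calls_after_two_collects = counter["calls"]  # the whole chain ran twice

pipeline.cache()
pipeline.collect()                           # runs once more and keeps the result
pipeline.collect()                           # served from the kept result
calls_after_cache = counter["calls"]
```

In this toy, calling `collect()` twice repeats the expensive step for every row, which mirrors the "repeating the operation again and again" impression; marking the pipeline with `cache()` is the kind of planning ahead that avoids it.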

Michael Stiller 19:36
I think there are also sometimes unfair comparisons being made, because essentially the upside of Spark is that it scales. So if you just do a toy example with a small data frame with, like, 100,000 rows, it probably would look faster to do the processing in pandas.

Jonathan Rioux 19:55
Oh, yeah.

Michael Stiller 19:56
And at the same time, for a test you would probably not use, like, a 100-gigabyte data set. On the other hand, of course, you have SQL processing larger amounts of data. But those systems are also fairly standardized, and they're usually already running on a cluster, on something that's optimized and set up. So I think it's important to keep in mind what the proper comparisons are for Spark performance.

Jonathan Rioux 20:24
Yeah, you're totally right for small data frames and a lot of the toy examples that we're doing. And I mean, I'm doing this in my book as well. I can't realistically take a data set that's going to be terabytes of data and say, well, this is what you're going to learn with, because a larger data set is necessarily going to take more time.

So we're doing examples with small data sets. And then when you're comparing Spark on a small data set versus pandas, well, Spark is actually doing a lot of optimization that takes a few seconds in order to be able to send a job. So I really believe that the scale where Spark starts to make sense is the scale where a solution like pandas would be inappropriate, because it would choke. And it's nothing against pandas. It's really that pandas is using the memory that's available on a single node; it's not distributed. So there is a cap on what can be processed, and this is where Spark starts to make sense. But you're kind of in a chicken-and-egg problem: when you're learning, you want to do something on a smaller data frame, because you want to explore what's going to happen and look at the underpinnings. And a lot of people are disappointed because it seems slow.

But when you start having, let's say, five or six machines with 100 gigs of RAM each, and you're splitting a data frame that's a couple of terabytes, then it starts making sense, because you wouldn't be able to do this on a single machine. So your comparison is what's possible versus what's impossible, rather than what's fast versus what's slow.

Michael Stiller 22:10
I think those are two very interesting distinctions to draw. One is about what is possible and what is not possible. For example, whenever someone tells me, "Oh, why can't you do your analysis in Excel?", I just reply: well, at a million rows or so, it just stops opening files, right? So this is kind of an impossible task to do. There are other reasons why I would use Python to process my data, but I think that's a good argument to make in a lot of cases. If we go back to the question of what is slow and what is fast, another thing that's very important is to know a little bit more about the technical background of the tools you're using. I have to look at a lot of code from more junior analysts and data scientists. And they, for example, would do things like taking data out of a pandas data frame, running a Python loop over it, and putting it back into a data frame, all these back-and-forth things. If you understand a little bit more about the C bindings behind the scenes that are making pandas as fast as it is, you see that this is a situation where sometimes you'll quadruple your memory needs just by not being very good at handling your tool, right? So I think the other component is that we should keep in mind that in some cases you have an illusion of simplicity that causes issues. And that's something that I think is not as true for Spark as it is for Python and pandas.

Jonathan Rioux 23:40
Yeah. As data scientists, data analysts, statisticians, we often tend to forget how much engineering is happening behind the scenes for data processing. If you look at pandas, as an example, the amount of optimization and the amount of work that went into creating the library is absolutely staggering. Python is not a remarkably fast language in itself, but it's been optimized down to C for certain operations. And this is also the thing with Spark: recently, in order to communicate faster with pandas, when you're doing distributed pandas using Spark as an orchestrator of sorts, they're using Arrow as a serialization format. Arrow is a marvel of engineering; the speed-up that you're able to see, and the fact that they talk seamlessly, is nothing short of amazing. So people have a tendency to take this for granted and assume that everything is going to run fast. And then the moment that you jump off of the ecosystem, as you were saying, doing a for loop or doing something that's not vectorized, the performance tanks. And it's not surprising, because you're going down to a language that is not optimized for what you're trying to do. It doesn't mean that it's wrong. I mean, I'm a big fan of, you know, make it work, make sure that you understand your code, and then look at improving the performance. And this is how I approach most of my PySpark development.

But the reality is that, in order to have powerful abstractions, there's always going to be a little bit of leakage. And this is what I'm trying to address in book form, rather than having a couple of blog posts that are going to say, well, Spark is the best thing since sliced bread and you can do whatever you want, which is not true. It's a fantastic tool, but it has its purpose, and it's important to understand where it fits and how to make it fit.
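
The performance gap between staying inside optimized code and dropping down to an interpreted loop can be sketched with the standard library alone. This is a stand-in for the pandas case discussed above, not a pandas or Spark benchmark: it assumes CPython, where the built-in `sum` runs in C, and the function names are made up for illustration. Timings vary by machine, so none are shown.

```python
import timeit

values = list(range(1_000_000))


def total_with_loop(numbers):
    """Add the numbers one by one in interpreted Python."""
    total = 0
    for n in numbers:
        total += n
    return total


def total_with_builtin(numbers):
    """Hand the whole loop to the built-in sum."""
    return sum(numbers)


# Same answer either way; only the time spent differs.
loop_seconds = timeit.timeit(lambda: total_with_loop(values), number=5)
builtin_seconds = timeit.timeit(lambda: total_with_builtin(values), number=5)
print(f"loop: {loop_seconds:.3f}s, built-in: {builtin_seconds:.3f}s")
```

Both functions return the same total, which fits the "make it work, then improve the performance" order Jonathan describes: the loop is not wrong, it is just the slower route.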

Michael Stiller 25:48
We've talked a lot about how Spark is constantly evolving, and how all kinds of things change with time. So the thing that I had to think about was: what was your rationale for writing on the topic specifically in the format of a book? For me personally, books are a part of my life that's still very important. But I also realize that this is not the norm, and not the thing that it used to be some time ago. I think my first experience with programming actually came from a book that I got from the library, just because there was no internet connection and those kinds of resources weren't really available. At the same time, I understand that a lot of programmers nowadays mostly live on Stack Overflow. So yeah, what was your rationale for going with an actual book on the topic?

Jonathan Rioux 26:40
Well, I decided to write a book because this is the way I learn. I'm publishing with Manning, which is the publishing house that taught me how to code. I got their Python book, and then I explored many programming languages, because I was fascinated by how you can express ideas in code. So to me, this is the way that I learn best. Also, working directly with a publisher, there is a much, much better focus on the quality of content. One of the things that I've learned is that when you're writing to teach something, your language is going to be a lot different than when you're writing, let's say, to showcase that you know something. So it was a huge learning experience: finding the right material, the right way to teach something, using repetition, finding good code that's going to support what you're doing, drawing the figures. All of that, I think, makes it a more compelling package overall. I still have plans to do videos, and I'm still doing Twitch livestreams where I live code in PySpark, but using the book as kind of a central framework really helps me clarify my thoughts. And it's also something that's going to last longer. There are precautions that we're taking so that I'm teaching first principles that are not going to get stale.

So to me, this is really something that's deeply personal. I know that a lot of people are thinking that books are going away. But, I mean, I have a tablet, I consume ebooks. I'm not a big fan of paper books, but I would say I read pretty much one or two books per month. So to me, it's the best way. And I make the bet that there's actually a fringe of the population that thinks like me.

Michael Stiller 28:58
Yes, I am certainly right there on that fringe with you. And I still believe that some things can also be modernized just by creating combinations, right? I'm a bit more into paper books, maybe, than you are, but the way I would usually use it is that I have the book on my lap, and then I may have my tablet or my phone. And sometimes, when there's some technical term I'm not quite sure of, or something that I didn't think was explained in enough depth, I would just Google on the side and then go back to the book with a firmer grasp of the concept that I'm reading about.

And so I think there's definitely an evolving situation where books can be helpful. For me, the question is also: now that you've decided that you want to write a book, what was your recipe for making it entertaining and pertinent to people, in a way that only a book can?

Jonathan Rioux 29:54
So, if I look back at my experience of learning Spark, I found that it was hard to find appropriate material that combined Spark the engine and Python the language. I'm a Python developer; I really find the language to be beautiful and the ecosystem to be really enjoyable. But taking this as a Python-first, PySpark-second approach, and putting myself in the shoes of somebody who comes from pandas, comes from regular Python, and wants to be able to scale, I didn't find any material that really did everything at once.

And then I also wanted to have something that would be entertaining, and not heavy and academic. So: bundling multiple use cases, giving people the opportunity to experiment, and hoping that this would help a lot of people bridge the gap, including people who believe that, yeah, you know, Spark is a little bit hard to understand and I don't really get it. That was kind of the reflection that I had when I made the proposal to write the book.

Lior Barak 31:24
So we're arriving at the end of the interview, and before you go, I have a very, very interesting question for you. I'm not a technical guy; I don't know how to develop anything. How can I actually start tomorrow using Spark and Python in combination?

Jonathan Rioux 31:46
The first thing is, I believe, getting a good appreciation for Python the language, understanding the basic control flow. With Spark, one of the things that I like the most is that if you look at a well-structured Spark program, it reads quite a lot like English. Spark borrowed the vocabulary for data manipulation from SQL. So you have select, you have where, and the way it's structured, some of the time I almost look at my code block and I'm like: I'm taking the data frame, selecting those three columns, creating a new one, filtering to remove all the records that don't contain any value, and writing this back to a data lake. You look at this, and it's very logical. So I think the moment that you're able to understand the basic syntax of Python, because it is still a Python library and there's no way around this, Spark is very accessible. And you're going to find that there's a lot less magic in terms of syntax. It requires more typing than pandas, but I find it to be more regular and self-explanatory compared to pandas, which has a few gotchas. I know the pains of pandas, and I know how tough it actually was to learn it and get a grip on it. I'm still having a hard time with, like, indexes, and sometimes I get confused. I find that I need to Google more stuff when I'm working with pandas than with PySpark. Also, I'm a lot more familiar with PySpark, so it might be an unfair comparison, but still, with a sample size of one.

Michael Stiller 33:42
I think for the next episode, we should bring in somebody with pandas experience and do a head-to-head, see who

I must say I'm struggling to imagine what that kind of person would look like. I spend a lot of my day with pandas, but I wouldn't sit here saying, "Oh, I love dealing with multi-indexes. I love having to guess the data type of the columns in a data frame at any given point in time. I love not knowing if I'm dealing with a deep copy, a shallow copy, or a reference." No, I wouldn't sit here and go, "This is my favorite way of working."

Jonathan Rioux 34:20
I think the answer is both. I mean, to me, it's not an either-or. And one of the best proofs is, let's say you have a cluster and you want to be able to run multiple pandas models. Spark is going to give you the option to parallelize your pandas code across multiple nodes, and it takes literally just a decorator on top of your method, and then applyInPandas as a method name. There are many things that were kind of cumbersome in the past, and I know third-party libraries were invented for this, but it's just so seamless now. So it's really not a competition of, is it pandas or is it PySpark? To me, it's just: yes, give me
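
A minimal sketch of what running pandas code through Spark can look like, assuming Spark 3.0 or later with PyArrow installed on the cluster. The store and amount columns and both function names are invented for the example. The first function is ordinary pandas; the second hands it to Spark, which calls it once per group across the nodes.

```python
def add_share_of_group(pdf):
    """Plain pandas: add each row's share of the group's total amount.

    `pdf` is a pandas DataFrame holding the rows of one group.
    """
    return pdf.assign(share=pdf["amount"] / pdf["amount"].sum())


def share_per_store(sales):
    """Apply the pandas function to every store's rows, in parallel.

    `sales` is a PySpark DataFrame with the hypothetical columns
    `store` (string) and `amount` (double).
    """
    return (
        sales
        .select("store", "amount")
        .groupBy("store")
        .applyInPandas(
            add_share_of_group,
            schema="store string, amount double, share double",  # shape of the output
        )
    )
```

For column-wise functions, the decorator mentioned here is `pandas_udf` from `pyspark.sql.functions`, which wraps a function that takes and returns a pandas Series.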

Michael Stiller 35:13
I really think the key word for all of this is convergence, right? The idea of saying: I have my Python binding for Spark code, and now I have a way to write SQL-like code for my data processing without actually running a SQL database, reusing some vocabulary from one concept or one paradigm in the other. And then having Python as some kind of global language that's sitting in the middle and binding out to all these other approaches. I think this is really the thing that makes this time so interesting: you don't have to hyper-specialize in one area, and you're actually able to bring in a lot of the understanding and context that you've learned in other places. Someone who spent half of their career writing SQL as a data engineer could nowadays also branch out to Airflow, write a lot of Python, and look at situations a bit differently. And then one day they would use PySpark to also branch out towards Spark. I mean, you know,

Jonathan Rioux 36:21
SQL is coming back with a vengeance. You know, people are now rediscovering the vocabulary of SQL, and PySpark has native bindings where you can blend SQL and Python together, which makes it super cool. The first programming language that I learned is SQL. I started with it when I was working with databases. So I'm happy to see that what's old is new again, SQL's cool again, and I'm quite excited to see where the future is going to lead us. But I really, really think that Spark got a lot of ideas right, and their way of approaching data is very compatible with the way I think. So I'm trying to share the good news, get more people excited about it, and break some of the misconceptions that could arise.
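
To make the blending of SQL and Python concrete, here is a small sketch, assuming an existing SparkSession and an orders DataFrame with hypothetical customer_id, quantity, and unit_price columns. The aggregation is written as a SQL query, and the result is then refined with ordinary DataFrame methods in Python.

```python
def top_customers(spark, orders, min_total):
    """Total spend per customer via SQL, then filter and sort from Python.

    `spark` is a SparkSession, `orders` a PySpark DataFrame; the view name
    and column names are assumptions for this example.
    """
    # Expose the DataFrame to SQL under a temporary view name.
    orders.createOrReplaceTempView("orders")

    totals = spark.sql(
        """
        SELECT customer_id, SUM(quantity * unit_price) AS total
        FROM orders
        GROUP BY customer_id
        """
    )

    # The query result is a regular DataFrame, so Python takes over from here.
    return (
        totals
        .where(totals["total"] >= min_total)
        .orderBy("total", ascending=False)
    )
```

Methods such as `where` and `selectExpr` also accept SQL expression strings directly, so the two styles can be mixed inside a single chain.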

Lior Barak 37:22
That's cool. So as we're closing the show, would you like to share with us why you like data? And what drives you, or what drove you, to start exploring this field?

Jonathan Rioux 37:39
Well, I mean, my major is in applied math, and I worked in actuarial science for many years. Basically, my job was to build models for life expectancy, retirement, you know, drug plans. And one of the things that I found is that there are so many opportunities with data, and so many ways to approach it, which is a very interesting thing. And I've always been very passionate about computers and how they work. So to me, it was just combining what I do for a living with what I love. So now I wake up every day, and there's a fascinating data problem. And you get to do crazy stuff. I mean, in my spare time, I build some models that do pretty much nothing.

And I'm enjoying the process, just because it trains your brain into thinking differently, and it gives you new insight into what's happening. One of the recent things that I've been doing is downloading all of my transactions from my banking and credit card statements, and then trying to build a model of when I'm going to go bankrupt. I might have to make some changes. But you know, this is the thing: you can do stuff that is really silly. And once you remove yourself from the outcome and you're just enjoying the process of doing data discovery, I think this is where the fun starts.

Lior Barak 39:13
Yeah, I completely agree. Or we lost me here. So

If I may ask one more question, then, which will come at the end of the show: what was your biggest failure when you started to explore the world of PySpark? What was the most frustrating moment that you had?

Jonathan Rioux 39:38
Oh my god. Um, so my first interaction with PySpark was at my previous employer. They had an internal data lake, and Spark was working on it. And it was so annoying at the time. I would say it was kind of the dark age, when, you know, Spark 1.6, 2.0, 2.1 was kind of meshing with

Transcribed by https://otter.ai