DevConf.cz 2022 has ended


HPC & Big Data & Data Science
Friday, January 28
 

4:30pm CET

Data Engineering for Java Developers
Data Science is for everyone who wants to find patterns in large amounts of data, and anyone should be able to work with data, whether they are an IT person or not. For IT people, the examples we usually see for Data Science or Data Engineering are only in Python or R. On the Data Engineering side, I have good news for Java developers: "Yes, you can! With Java!" This session will show some of the tools Java developers can use for Data Engineering tasks such as ETL, Data Visualization, and BI.

Session chairs: Andrei Veselov and Pavel Yadlouski

Speakers
avatar for Ricardo Oliveira

Ricardo Oliveira

JBUG:Brazil, Ansible Meetup, Red Hat Developers, Red Hat, Inc.
Ricardo has 10+ years of IT experience with both development and sysadmin skills. He works at Red Hat on the OpenShift xPaaS team, enabling all JBoss solutions to run in Dockerized environments and providing advice on how to use OpenShift at its best.
avatar for Maulik Shah

Maulik Shah

Software Engineer at the AI Center of Excellence at Red Hat, Red Hat, Inc.
Hi, I am a Software Engineer with the AICoE at Red Hat. Before this I did my Master's in Computer Science at Boston University. At Red Hat I work as a Data Engineer, which involves ferrying massive amounts of data across systems, and I also work on monitoring…



Friday January 28, 2022 4:30pm - 4:55pm CET
Session Room 1

5:00pm CET

Uncovering Project Insights from GitHub PR Data
What does your GitHub repo say about your software development process? What’s the average “idea-to-production” time for new features? How long does it typically take before a Pull Request (PR) is merged? How much content does each PR add, remove, or modify? Understanding such bits of information about your project can help you better guide its development. Furthermore, it can help you promote a healthy and thriving open source community around your project.

In this talk we will show you how to use a number of open source tools to collect data about your repo’s PRs, analyze it, and visualize key metrics on a dashboard to gain greater insights into your software development process. Then, we will show you how to build reproducible workflows which use historical PR data to train machine learning models to predict the time taken to merge a PR. Finally, we will walk you through how we packaged our prediction pipeline and deployed it as a service using Seldon Core on OpenShift. This service can then be integrated into GitHub apps to give live predictions of time to merge for new incoming PRs.

By the end of this talk, participants will be able to use this open source tool to predict the time to merge PRs on their own projects, learn how to use OpenShift to build and deploy their own ML models, and learn how to calculate and visualize metrics from their GitHub repos on a dashboard.
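As a taste of the data-collection step (a minimal sketch, not the speakers' exact tooling), the snippet below pulls recently closed PRs from the GitHub REST API and computes a time-to-merge metric with pandas. The repo name "your-org/your-repo", the unauthenticated request, and the single-page fetch are simplifying assumptions.

```python
import pandas as pd
import requests

# Fetch the most recent closed PRs; real use needs a token and pagination.
resp = requests.get(
    "https://api.github.com/repos/your-org/your-repo/pulls",  # hypothetical repo
    params={"state": "closed", "per_page": 100},
)
resp.raise_for_status()

rows = [
    {"number": pr["number"], "created": pr["created_at"], "merged": pr["merged_at"]}
    for pr in resp.json()
    if pr["merged_at"] is not None  # skip PRs that were closed without merging
]
df = pd.DataFrame(rows)
df["time_to_merge_hours"] = (
    pd.to_datetime(df["merged"]) - pd.to_datetime(df["created"])
).dt.total_seconds() / 3600

# Summary statistics like these are what you'd surface on a dashboard.
print(df["time_to_merge_hours"].describe())
```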

Session chairs: Andrei Veselov and Pavel Yadlouski

Speakers
avatar for Oindrilla Chatterjee

Oindrilla Chatterjee

Senior Data Scientist, Red Hat
Oindrilla is a Senior Data Scientist at Red Hat, in the Office of the CTO working on emerging trends and research in ML and AI. She spent the past year developing open source AI applications for CI data.
avatar for Karanraj Chauhan

Karanraj Chauhan

Data Scientist, Red Hat
I like math, machine learning, and deep learning. Big fan of CPUs, GPUs, FPGAs, and other such lightning-powered stones.



Friday January 28, 2022 5:00pm - 5:25pm CET
Session Room 1

5:30pm CET

Data Science + Cloud Native Development == Awesome
In this talk we will answer the questions on every engineering team’s mind: How do we seamlessly integrate data scientists and their machine learning models into our development workflow? How can data scientists collaborate effectively with each other in a reproducible fashion? How do we empower data scientists to be more like software engineers? Does a cloud native approach help make data science development any easier?

Learn how Red Hat’s AI Center of Excellence (AICoE) integrates CI/CD, GitOps, and other traditional cloud-native concepts into our data science workflows using tools like OpenShift, Kubeflow, and Prow, in a totally open source cloud environment that empowers teams of data scientists not only to become more integrated into the larger application development life cycle, but to own their models from data collection and exploration through deployment, monitoring, and refinement. Ultimately, this shortens the gap from a proof of concept locked in some Jupyter notebook to a deployed intelligent application.

By the end of this talk, attendees will see some real world implementations of these concepts and learn about the Operate First community cloud environment, which is free and open for everybody to try out all of the topics and techniques discussed in the talk.

Session chairs: Andrei Veselov and Pavel Yadlouski

Speakers
avatar for Michael Clifford

Michael Clifford

Data Science Manager, Red Hat - Boston
Data Science Manager at Red Hat working in the Office of the CTO on AI Ops.


Friday January 28, 2022 5:30pm - 5:55pm CET
Session Room 1

6:00pm CET

Building data pipelines for Anomaly Detection
Cloud-native applications. Multiple cloud providers. Hybrid cloud. Thousands of VMs and containers. Complex network policies. Millions of connections and requests in any given time window. This is the typical situation a Security Operations Center (SOC) analyst faces every single day. In this talk, the speaker describes the highly available and highly scalable data pipelines he built for the following use cases:

- Denial of Service: a device in the network stops working.
- Data Loss: for example, a rogue agent in the network transmitting IP data outside the network.
- Data Corruption: a device starts sending erroneous data.

The above can be addressed with anomaly detection models; the main challenge lies in the data engineering pipeline. With almost 7 billion events occurring every day, processing and storing them for further analysis is a significant challenge. The machine learning models (for anomaly detection) have to be updated every few hours, which requires the pipeline to build the feature store within a very short time window.

The core components of the data engineering pipeline are:
- Apache Zookeeper
- Apache Kafka
- Apache Flink
- Apache Pinot
- Apache Spark
- Apache Superset

The event logs are stored in Pinot through a Kafka topic; Pinot supports an Apache Kafka-based indexing service for real-time data ingestion. Pinot has primitive capabilities for creating sliding-time-window statistics, so more complex real-time statistics are computed using Flink, a stream-processing engine that provides high throughput and low latency. Spark jobs are used for batch processing, and Superset is used as the BI tool for real-time visualization.
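As a rough illustration of the ingestion edge of such a pipeline (a minimal sketch, not the speaker's production code), the snippet below publishes a JSON event log to a Kafka topic that a Pinot realtime table or a Flink job could then consume. The broker address, topic name, and event fields are assumptions for illustration.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Serialize events as JSON so downstream consumers (Pinot, Flink) can parse them.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

event = {
    "timestamp": int(time.time() * 1000),
    "src_ip": "10.0.0.12",               # hypothetical event fields
    "dst_ip": "10.0.0.57",
    "bytes_sent": 4096,
}
producer.send("network-events", event)   # assumed topic name
producer.flush()                         # block until the event is delivered
```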

The speaker talks through the architectural decisions and shows how to build a modern real-time stream processing data engineering pipeline using the above tools.

Outline
  • The problem: overview

  • Architecture

  • Real-Time Processing

  • Anomaly Detection

  • Visualization

  • Demo



Session chairs: Andrei Veselov and Pavel Yadlouski

Speakers
avatar for Tuhin Sharma

Tuhin Sharma

Senior Principal Data Scientist, Red Hat
Tuhin Sharma is a Senior Principal Data Scientist at Red Hat in the Corporate Development and Strategy group. Prior to that he worked at Hypersonix as an AI Architect. He also co-founded and has been CEO of Binaize, a website conversion intelligence product for Shopify. He received his master’s... Read More →


Friday January 28, 2022 6:00pm - 6:50pm CET
Session Room 1
 
Saturday, January 29
 

10:30am CET

Building Petabyte Scale ML Models with Python
Although building ML models on small or toy datasets is easy, most production-grade problems involve massive datasets that current ML practices don’t scale to. In this talk, we cover how you can drastically increase the amount of data your models can learn from, using distributed data/ML pipelines.

It can be difficult to figure out how to work with large datasets (which do not fit in your RAM), even if you’re already comfortable with the ML libraries and APIs within Python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task-graph”? What’s a partial fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…

In this talk, we’ll answer all those questions, and more!

We’ll start by walking through the current distributed analytics (out-of-core learning) landscape in order to understand the pain-points and some solutions to this problem.

Here is a sketch of a system designed to achieve this goal (of building scalable ML models):

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm

Then we’ll read a large dataset into Dask, TensorFlow (tf.data), and scikit-learn streaming, and immediately apply what we learned in the previous section. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model a few different ways, and then examine the model for greater insight into how the data is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
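To make the three-step sketch above concrete, here is a minimal out-of-core example (an illustration under assumed data, not the talk's exact code): pandas' chunked CSV reader acts as the instance stream, and scikit-learn's partial_fit serves as the incremental algorithm. The file name, column layout, and chunk size are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()  # linear model trained incrementally via partial_fit
classes = [0, 1]         # partial_fit must see the full label set up front

# 1. stream instances: read the (too-big-for-RAM) file in fixed-size chunks
for chunk in pd.read_csv("events.csv", chunksize=100_000):  # hypothetical file
    # 2. extract features: split each chunk into a feature matrix and labels
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    # 3. incremental algorithm: update the model one batch at a time
    model.partial_fit(X, y, classes=classes)
```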

Detailed Outline

1. Intro to out-of-core learning
2. Representing large datasets as instances
3. Transforming data (in batches) – live code [3-5]
4. Feature Engineering & Scaling
5. Building and evaluating a model (on entire datasets)
6. Practicing this workflow on another dataset
7. Benchmark other libraries/ for OOC learning
8. Questions and Answers

Key takeaway

By the end of the talk, participants will know how to build petabyte-scale ML models, beyond the shackles of conventional Python libraries.

Participants will also take away benchmarks and best practices for building such ML models at scale.

Session chairs: Justin Nixon and Michal Ruprich

Speakers
avatar for Vaibhav Srivastav

Vaibhav Srivastav

Data Scientist, Deloitte GmbH
I am a Data Scientist and a Master's candidate in Computational Linguistics at Universität Stuttgart. I am currently researching speech, language, and vision methods for extracting value out of unstructured data. In my previous stint with Deloitte Consulting LLP, I worked with Fortune... Read More →



Saturday January 29, 2022 10:30am - 10:55am CET
Session Room 4

11:30am CET

Preconditioners to scale Multi-physics Simulations
Preconditioners (PCs) are used to improve both the efficiency and the robustness of iterative techniques for solving very large linear systems over Krylov subspaces. However, determining which preconditioner to use with which equation or set of equations in a given multi-physics simulation requires a combination of knowledge of preconditioning, matrix techniques, types of matrices, Krylov subspaces, and iterative methods, among other linear algebra foundations. The present work provides a benchmark of the most popular preconditioners available today, emphasising their respective performance in terms of time to solution of the finite element problem, memory usage, number of iterations, and the value of |R| achieved at convergence. The performance evaluation is made for the Compute Finite Strain Elastic Stress problem in 3D, using the University of Cambridge Research Computing Service (CSD3) and Message Passing Interface (MPI) implementations that allow parallelisation. The benchmark and scaling were done with MOOSE, which uses the Finite Element Method, with millions of Degrees of Freedom (DoF). Along with the preconditioners and KSP types, a variety of options were tested to optimise performance.
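For a feel of what a preconditioner buys you (a toy sketch in SciPy under assumed sizes, far from the talk's MOOSE/MPI benchmark), the snippet below compares conjugate-gradient iteration counts on the same sparse symmetric positive definite system with and without an incomplete-LU preconditioner; the 1-D Poisson matrix stands in for a finite element operator.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg, spilu

n = 200
# 1-D Poisson matrix: sparse, symmetric positive definite, FEM-like
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

def cg_iterations(M=None):
    """Solve Ax = b with CG and count the iterations to convergence."""
    iters = 0
    def count(xk):
        nonlocal iters
        iters += 1
    _, info = cg(A, b, M=M, callback=count)
    return iters

ilu = spilu(A)                                 # incomplete LU factorization of A
M = LinearOperator(A.shape, matvec=ilu.solve)  # applied as M^{-1} each iteration

print("CG iterations, no preconditioner: ", cg_iterations())
print("CG iterations, ILU preconditioner:", cg_iterations(M))  # typically far fewer
```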

Session chairs: Justin Nixon and Michal Ruprich

Speakers
avatar for Julita Inca

Julita Inca

HPC Software Specialist, UKAEA
Education: Systems Engineering in Peru (Callao's university); Computer Science Master's in Peru (PUCP); High Performance Computing Master's in the UK (Edinburgh's university); Red Hat Certified Professional 140-100-496. Latest work experiences: Member of the GNOME Foundation... Read More →



Saturday January 29, 2022 11:30am - 12:20pm CET
Session Room 4

12:30pm CET

Build your own social media analytics with Apache
Apache Kafka is more than just a messaging broker. It has a rich ecosystem of different components: connectors for importing and exporting data, different stream-processing libraries, schema registries, and a lot more. The first part of this talk will explain the Apache Kafka ecosystem and how its different components can be used to load data from social networks and analyze them with stream processing and machine learning. The second part will show a demo running on Kubernetes, which uses Kafka Connect to load data from Twitter and analyzes it using the Kafka Streams API. After this talk, attendees should better understand the full advantages of the Apache Kafka ecosystem, with a particular focus on Kafka Connect and the Kafka Streams API, and they should be able to use these components on top of Kubernetes.
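The demo itself uses Kafka Connect plus the (Java) Kafka Streams API; as a rough Python analogue of the analysis side, the sketch below consumes tweets from a topic and keeps a running hashtag count. The broker address, topic name, and message schema are assumptions.

```python
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "twitter-feed",                      # assumed topic fed by Kafka Connect
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

counts = Counter()
for record in consumer:  # blocks, processing tweets as they arrive
    text = record.value.get("text", "")
    counts.update(word for word in text.split() if word.startswith("#"))
    print(counts.most_common(5))  # crude real-time "trending hashtags"
```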

Session chairs: Justin Nixon and Michal Ruprich

Speakers
avatar for Jakub Scholz

Jakub Scholz

Principal Software Engineer, Red Hat
Jakub is a Principal Software Engineer in the Messaging and IoT team. He has a long-term experience in messaging and lately focuses mainly on Apache Kafka. He is one of the core maintainers of the Strimzi project, which delivers several operators and tools for running Apache Kafka... Read More →



Saturday January 29, 2022 12:30pm - 1:20pm CET
Session Room 4

3:00pm CET

Explainable AI for Business Processing Models
Open source business automation (OSBA) is a useful tool to help orchestrate complex business workflows. But what if you could use artificial intelligence (AI) to help extend those automations even further?

Although AI and machine learning (ML) techniques can also greatly benefit OSBA, fairness and transparency are fundamental requirements when implementing or using AI/ML outcomes.
In this session we will focus on how we implemented different explainability techniques in the TrustyAI project to allow different aspects of opaque predictive models' outcomes to be better understood by both end users and ML practitioners.

We will discuss feature importance estimation using LIME and SHAP, as well as counterfactual explanations, and how they can benefit OSBA processes.

Attendees will leave this session familiar with why explainability is important, with the techniques implemented to achieve black-box model explainability, and with an example of a real-world service-oriented OSBA deployment incorporating these techniques.
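TrustyAI itself is a Java toolkit, so as an illustration of the underlying idea rather than the project's own API, the sketch below uses the Python shap library to estimate per-feature contributions to a black-box model's predictions; the "loan approval" data is synthetic and purely hypothetical.

```python
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # pretend columns: income, age, debt
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # "approval" driven by income vs. debt

model = RandomForestClassifier(random_state=0).fit(X, y)  # the opaque model

# SHAP assigns each feature a signed contribution to each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(shap_values)  # expect income and debt to dominate, age to contribute ~0
```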

Session chairs: Andrei Veselov and Richard Filo

Speakers
avatar for Rui Vieira

Rui Vieira

Senior Software Engineer, Red Hat
Rui is a Software Engineer at Red Hat working on Data Science, Business Automation, Apache Spark and streaming applications. He has a PhD in Bayesian Statistics, specifically Sequential Monte Carlo methods in long running streaming data and a MSc in Internet Technologies and Enterprise... Read More →



Saturday January 29, 2022 3:00pm - 3:25pm CET
Session Room 4

6:00pm CET

How data helps to make a better city
To flourish and provide citizens with a great quality of life, cities need to make better use of urban data. With the ever-changing nature of our world, that need becomes more urgent than ever. Cities across the globe are following this trend, and the city of Brno is no exception. The talk will showcase several use cases that are helping Brno bring about that change and become a better city for its people.


Session chairs: Andrei Veselov and Richard Filo

Speakers
avatar for Robert Spál

Robert Spál

GIS Specialist, data.Brno
GIS Specialist at the Data and Analytics department of Brno City Municipality. He manages an ongoing program of technical and functional development of the Brno City datastore, ensuring that its content, functionality, operation, and user experience meet the needs of its users.


Saturday January 29, 2022 6:00pm - 6:25pm CET
Session Room 4
 