Open Source Cloud Guide

Artificial Intelligence

What is artificial intelligence?

Artificial intelligence (AI) uses computers and machines to mimic the problem-solving and decision-making capabilities of the human mind. It combines computer science with robust datasets to do so. AI developers take the models that data scientists create and turn them into deployable models that applications can use.

The diagram below shows how machine learning and deep learning fit into the realm of AI.


Why is this important for hybrid cloud developers?

Integrating AI and machine learning technologies with cloud environments is an increasingly common scenario, driven by the use of microservices and the need to scale rapidly. Developers face the challenge not only of building machine learning applications, but also of ensuring that they run well in production in cloud-native and hybrid cloud environments.

Solution sketch

When developing AI-powered services and applications that run in cloud environments, there are many development areas to consider, including:

  • Data selection
  • Data preprocessing
  • Data visualization
  • Model development
  • Model deployment


  • Overfitting: Overfitting occurs when a statistical model fits its training data too closely, capturing noise along with the underlying pattern. The model then cannot perform accurately on unseen data, which defeats its purpose.

  • Underfitting: Underfitting occurs when a statistical model is not complex enough to capture the pattern in its training data. The model then cannot perform accurately on either the training data or unseen data, which defeats its purpose.
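The difference between the two failure modes can be made concrete with a small sketch. It uses only the Python standard library and rests on assumptions that are not part of this guide: synthetic data drawn from a hypothetical relationship y = 2x + 1 with noise, a mean predictor standing in for an underfit model, a least-squares line as a reasonably matched model, and a 1-nearest-neighbour predictor standing in for an overfit model. The helper names are illustrative only.

```python
import random


def make_data(n, rng):
    """Noisy samples of the assumed true relationship y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 2.0)) for x in xs]


def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def fit_mean(train):
    """Too simple: ignores x and always predicts the average target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Ordinary least squares for a straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(train):
    """Too flexible: memorizes every training point (1-nearest neighbour)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(40, rng), make_data(40, rng)

candidates = [
    ("underfit (mean)", fit_mean),
    ("balanced (line)", fit_line),
    ("overfit (1-NN)", fit_nearest),
]

results = {}
for name, fit in candidates:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    train_err, test_err = results[name]
    print(f"{name:16s} train MSE={train_err:7.2f}  test MSE={test_err:7.2f}")
```

The memorizing model reproduces its training data exactly, so its training error is zero while its error on unseen data is not. The mean predictor shows a large error on both sets. Comparing training error with held-out error in this way is a common first check for either problem.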

Key open source projects

Open datasets

Name | API | Description
Project CodeNet | N/A | Large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems
UC Irvine Machine Learning Repository | N/A | Collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms
Kaggle datasets | N/A | Over 50,000 public datasets and 400,000 public notebooks
Registry of Open Data on AWS | N/A | Discover and share datasets that are available via AWS resources
Awesome Public Datasets | N/A | A high-quality list of topic-centric public data sources
World Bank Open Data | World Bank Data Open API | Free and open access to global development data
WHO Open Data | GHO OData API | World Health Organization’s gateway to health-related statistics for its 194 Member States
Google Public Data Explorer | N/A | Large datasets that are easy to explore, visualize and communicate
U.S. Census Bureau | Microdata API | The leading source of quality data about the United States’ people and economy
 | CKAN API | Data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more
Yelp Open Dataset | N/A | Subset of Yelp’s businesses, reviews, and user data for use in personal, educational, and academic purposes
UNICEF | N/A | World’s leading source of data on children with databases of hundreds of internationally valid and comparable indicators

Data preprocessing - Wrangling, Cleaning, Analyzing, Computing

Name | Description | GitHub repo | Get Started guide
Pandas | A fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on Python | Source | Guide
scikit-learn preprocessing | The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators | Source | Guide
NumPy | An open source project that enables numerical computing with Python | Source | Guide
BeautifulSoup | A Python library designed for quick turnaround projects like screen-scraping | Source | Guide
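As a rough illustration of how these tools combine, the sketch below cleans a small table with Pandas and then rescales it with a transformer from sklearn.preprocessing. The column names and values are hypothetical and chosen only for the example; median imputation and standard scaling are one possible choice among many, not a recommendation from this guide.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: one missing value, features on very different scales.
raw = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [40_000.0, 52_000.0, 61_000.0, 98_000.0],
})

# Cleaning: fill the missing age with the column median (NaN is skipped).
clean = raw.fillna({"age": raw["age"].median()})

# Scaling: StandardScaler centers each column to zero mean and unit variance,
# a representation many downstream estimators handle better than raw values.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(clean), columns=clean.columns)

print(scaled.round(2))
```

In a real project the scaler would be fitted on the training split only and then reused to transform validation and production data, so that no information leaks from unseen data into the preprocessing step.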

Data visualization

Name | Description | GitHub Repo | Get Started guide
Matplotlib | Comprehensive library for creating static, animated, and interactive visualizations in Python | Source | Guide
Seaborn | High-level interface for drawing attractive and informative statistical graphics | Source | Guide
Plotly | A Python graphing library that makes interactive, publication-quality graphs | Source | Guide
Bokeh | Python library for creating interactive visualizations for modern web browsers | Source | Guide

Model development

Name | Description | GitHub Repo | Get Started guide
Scikit-Learn | Simple and efficient tools for predictive data analysis | Source | Guide
XGBoost | Optimized distributed gradient boosting library designed to be highly efficient, flexible and portable | Source | Guide
TensorFlow | The core open source library to help you develop and train ML models | Source | TensorFlow Get Started
Keras | A deep learning API written in Python that enables fast experimentation | Source | Keras Get Started
PyTorch | A machine learning framework that accelerates the path from research prototyping to production deployment | Source | PyTorch Get Started
PySpark | PySpark is an interface for Apache Spark in Python. PySpark supports a machine learning library known as MLlib that is used for model training | Source | Guide
NLTK | A platform for building Python programs to work with human language data | Source | Guide
Gensim | Python library for processing raw, unstructured digital texts | Source | Guide
Statsmodels | Python module that provides classes and functions for the estimation of many different statistical models | Source | Guide
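To give a feel for the model development step, here is a minimal sketch using Scikit-Learn. The dataset (the bundled Iris sample), the logistic regression model, and the 75/25 split are assumptions made for the example rather than choices prescribed by this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small sample dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to estimate performance on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the estimator so both are fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

A large gap between training and test accuracy would hint at overfitting, while low accuracy on both would hint at underfitting.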

Model deployment

Name | Description | GitHub Repo | Get Started guide
Kubeflow | The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable | Source | Guide
TensorFlow Extended | TensorFlow Extended is an end-to-end platform for preparing data, training, validating, and deploying models in large production environments | Source | Guide
MLflow | MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry | Source | Guide

Cloud comparison

SaaS Platforms | Cloud Pak for Data | Vertex AI | SageMaker | Azure ML

Additional resources