The GitHub community decided to dig deeper into machine learning and pulled data on contributions from January to December 2018. Contributions include pushing code, opening an issue or pull request, commenting on an issue, and reviewing a pull request. To find the most imported packages, the Octoverse report used data from the dependency graph, which covers all public repositories and any private repositories that have opted into it. The information in this article is drawn from the original documentation of each package, and the sources are cited.
In this article, we list the 10 most popular machine learning and data science packages on GitHub.
1| NumPy
NumPy is the fundamental package for scientific computing with Python. It contains a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which allows NumPy to integrate seamlessly and speedily with a wide variety of databases.
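As a rough illustration of the capabilities listed above, the following minimal sketch touches broadcasting, linear algebra, the Fourier transform, random numbers, and a user-defined record data type. All variable names and sample values are made up for the example.

```python
import numpy as np

# Broadcasting: a (3, 1) column combined with a length-4 row gives a (3, 4) grid.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col * 10 + row

# Linear algebra: solve 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = np.linalg.solve(A, b)

# Fourier transform: a sine with 5 cycles over 64 samples peaks at frequency bin 5.
t = np.arange(64) / 64
signal = np.sin(2 * np.pi * 5 * t)
peak_bin = int(np.argmax(np.abs(np.fft.rfft(signal))))

# Random numbers from a seeded generator, so the run is reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

# A custom (structured) data type, similar to a row in a database table.
person = np.dtype([("name", "U10"), ("age", "i4")])
people = np.array([("Ada", 36), ("Linus", 28)], dtype=person)
mean_age = people["age"].mean()
```

The structured array at the end is one way to read the "arbitrary data types" remark: each element behaves like a typed record, and a whole field can be pulled out by name.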
Click here to read more.
2| SciPy
SciPy is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. It depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization.
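The two routines named above, numerical integration and optimization, might be exercised as in this small sketch. The integrand and the objective function are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: find the minimum of (x - 3)^2 + 1, which lies at x = 3.
def objective(x):
    return (x - 3.0) ** 2 + 1.0

result = optimize.minimize_scalar(objective)
best_x = result.x
```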
Click here to read more.
3| Pandas
Pandas is a Python package which provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language.
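To give a feel for what "labeled" data means in practice, here is a minimal sketch with made-up city temperatures. It shows label-based grouping in a DataFrame and automatic alignment of two Series by their index labels.

```python
import pandas as pd

# A small table with named columns (hypothetical sample data).
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "temp_c": [4.0, 19.0, 6.0, 21.0],
    }
)

# Group rows by the "city" label and average the temperatures.
mean_by_city = df.groupby("city")["temp_c"].mean()

# Arithmetic aligns on index labels, not on position.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "x"])
total = a + b
```

Note that `total["x"]` combines the two values labeled `"x"` even though they sit at different positions in `a` and `b`.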
Click here to read more.
4| Matplotlib
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. With just a few lines of code, you can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, and more.
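The "few lines of code" claim can be sketched as follows. This example assumes a headless environment, so it selects the non-interactive Agg backend and writes the figure to an in-memory PNG instead of opening a window; the data is synthetic.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
noise = np.random.default_rng(0).normal(size=500)

# One figure with a line plot on the left and a histogram on the right.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_xlabel("x")
ax_line.legend()
ax_hist.hist(noise, bins=20)
ax_hist.set_title("Histogram")

# Render to PNG bytes; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```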
Click here to read more.
5| Scikit-learn
Scikit-learn is an open source machine learning library for Python, built on top of SciPy and distributed under the 3-Clause BSD license. Scikit-learn 0.20 is the last version to support Python 2.7; Scikit-learn 0.21 and later require Python 3.5 or newer. Scikit-learn also uses CBLAS, the C interface to the Basic Linear Algebra Subprograms library. If you already have a working installation of NumPy and SciPy, the easiest way to install Scikit-learn is with pip or conda:
pip install -U scikit-learn
or
conda install scikit-learn
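Once installed, a typical workflow might look like the sketch below. The choice of the bundled iris dataset and a logistic regression classifier is an assumption made purely for illustration; any other estimator follows the same fit and score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier and measure accuracy on the held-out samples.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
predictions = model.predict(X_test)
```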
Click here to read more.
6| Six
Six is a Python 2 and 3 compatibility library. It provides utility functions for smoothing over the differences between the two Python versions, with the goal of writing code that is compatible with both. Six is a utility package intended to support codebases that work on both Python 2 and 3 without modification, and it can be downloaded from PyPI.
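A few of the helpers Six offers are shown in this minimal sketch. The `describe` function and the sample dictionary are hypothetical names invented for the example.

```python
import six
from six.moves import range  # lazy range on both Python 2 and 3


def describe(value):
    """Classify a value using version-independent type tuples."""
    if isinstance(value, six.string_types):
        return "text"
    if isinstance(value, six.integer_types):
        return "integer"
    return "other"


counts = {"a": 1, "b": 2}

# six.iteritems picks dict.iteritems() on Python 2 and dict.items() on Python 3.
pairs = sorted(six.iteritems(counts))

first_three = list(range(3))
```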
Click here to read more.
7| TensorFlow
TensorFlow is an open source software library for numerical computation using data flow graphs. The graph nodes represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture enables you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code. TensorFlow also includes TensorBoard which is a data visualization toolkit.
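The idea of operations as graph nodes and tensors as edges can be sketched as below. This example assumes TensorFlow 2.x, where `tf.function` traces a Python function into a graph; the function name `affine` and the input values are made up for illustration.

```python
import tensorflow as tf


@tf.function  # traces the Python function into a data flow graph
def affine(x, w, b):
    return tf.matmul(x, w) + b


x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([0.5])

# Running the traced graph: 1*3 + 2*4 + 0.5
y = affine(x, w, b)

# Inspect the nodes (operations) of the traced graph.
concrete = affine.get_concrete_function(x, w, b)
op_types = [op.type for op in concrete.graph.get_operations()]
```

The `op_types` list includes a matrix multiplication node among others, and the tensors passed between those nodes are the graph's edges.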
Click here to read more.
8| Requests
Requests is an Apache2 Licensed HTTP library, written in Python and designed to be used by humans. It allows you to send organic, grass-fed HTTP/1.1 requests without the need for manual labor, and it is one of the most downloaded Python packages of all time, pulling in over 11,000,000 downloads every month.
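The sketch below shows how little code a request needs. To keep it runnable without network access, it only builds and inspects a request rather than sending it; the URL, query parameters, and the `fetch_status` helper are placeholders invented for the example.

```python
import requests

# Build a GET request; Requests encodes the query string for us.
request = requests.Request(
    method="GET",
    url="https://example.org/search",
    params={"q": "machine learning", "page": 2},
    headers={"Accept": "application/json"},
)
prepared = request.prepare()
final_url = prepared.url


def fetch_status(url, timeout=5.0):
    """Send a GET request and return the HTTP status code (needs network)."""
    response = requests.get(url, timeout=timeout)
    return response.status_code
```

In everyday use the one-liner inside `fetch_status` is usually all that is needed; the prepared request is shown only to make the encoded URL visible.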
Click here to read more.
9| Python-dateutil
The dateutil package provides powerful extensions to the standard datetime module available in Python. The features of python-dateutil include computing relative deltas between two given date or datetime objects, computing dates based on very flexible recurrence rules using a superset of the iCalendar specification, generic parsing of dates, and more.
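The three features mentioned above (parsing, relative deltas, and recurrence rules) can be tried out with the following minimal sketch. The dates are arbitrary examples.

```python
from datetime import datetime

from dateutil import parser, rrule
from dateutil.relativedelta import relativedelta

# Generic parsing of a date string.
start = parser.parse("31 Jan 2018 10:30")

# Adding one month clamps to the last valid day of February.
one_month_later = start + relativedelta(months=1)

# Relative delta between two datetimes, expressed in months and days.
delta = relativedelta(datetime(2018, 12, 25), datetime(2018, 1, 1))

# Recurrence rule: the first three Mondays starting from 1 January 2018.
mondays = list(
    rrule.rrule(
        rrule.WEEKLY,
        byweekday=rrule.MO,
        dtstart=datetime(2018, 1, 1),
        count=3,
    )
)
```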
Click here to read more.
10| Pytz
The pytz library allows accurate and cross-platform timezone calculations using Python 2.4 or higher and provides access to the Olson timezone database. It also solves the issue of ambiguous times at the end of daylight saving time, which you can read more about in the Python Library Reference.
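The ambiguous-time problem can be made concrete with a short sketch. It assumes the "America/New_York" zone, where clocks went back on 4 November 2018, so 01:30 local time occurred twice that night.

```python
from datetime import datetime, timedelta

import pytz

eastern = pytz.timezone("America/New_York")

# Convert a UTC moment to local time (daylight saving time is in effect in July).
utc_moment = datetime(2018, 7, 1, 12, 0, tzinfo=pytz.utc)
local = utc_moment.astimezone(eastern)

# 01:30 on 4 November 2018 is ambiguous; is_dst selects which occurrence is meant.
naive = datetime(2018, 11, 4, 1, 30)
first = eastern.localize(naive, is_dst=True)    # still on daylight time, UTC-4
second = eastern.localize(naive, is_dst=False)  # back on standard time, UTC-5

gap = second - first
```

Passing `is_dst=None` instead makes `localize` raise an error for ambiguous times, which can be preferable when guessing is not acceptable.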
Click here to read more.