dotlinux blog

How to Install and Use Scikit-Learn in Linux

Scikit-learn is one of the most powerful and popular machine learning libraries for Python. It provides simple and efficient tools for data mining and data analysis, built on top of NumPy, SciPy, and matplotlib. Whether you're a beginner looking to get started with machine learning or an experienced data scientist building complex models, scikit-learn offers a consistent and user-friendly API that makes implementing algorithms like classification, regression, clustering, and dimensionality reduction remarkably straightforward.

In this comprehensive guide, we'll walk through everything you need to know to get scikit-learn up and running on your Linux system, from basic installation to creating your first machine learning model.

2026-03

Table of Contents

  1. Prerequisites
  2. Installation Methods
  3. Verifying the Installation
  4. Your First Machine Learning Project: Iris Classification
  5. Key Concepts and Common Tasks
  6. Troubleshooting Common Issues
  7. Conclusion
  8. References

Prerequisites

Before installing scikit-learn, ensure your system meets these requirements:

  1. A Linux Distribution: Ubuntu, Debian, CentOS, Fedora, etc.
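  2. Python: A recent Python 3 release. The minimum depends on the scikit-learn version you install (for example, scikit-learn 1.3 requires Python 3.8 or newer, and 1.4 requires 3.9 or newer), so check the installation guide for your target release. Check your version with: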
    python3 --version
    If Python is not installed, you can install it using your distribution's package manager. For example, on Ubuntu/Debian:
    sudo apt update
    sudo apt install python3 python3-pip
  3. pip: The Python package installer. On Ubuntu/Debian it is provided by the python3-pip package installed above.
  4. (Optional) Virtual Environment: It's a best practice to use a virtual environment to manage dependencies for your projects. To install venv on Ubuntu/Debian:
    sudo apt install python3-venv
    Create and activate a virtual environment:
    python3 -m venv my_ml_env
    source my_ml_env/bin/activate
    Your terminal prompt should now show the environment name (my_ml_env).
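
If you would rather check these prerequisites from Python itself, here is a minimal sketch. The helper names and the default minimum of (3, 9) are assumptions chosen for illustration; set the minimum to whatever your target scikit-learn release documents.

```python
import sys


def python_is_recent(minimum=(3, 9)):
    """Return True if the running interpreter is at least `minimum` (assumed default)."""
    return sys.version_info[:2] >= tuple(minimum)


def in_virtualenv():
    """Return True inside a venv, where sys.prefix differs from the base installation."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Recent enough:", python_is_recent())
print("Inside a virtual environment:", in_virtualenv())
```

Run it with the same python3 you plan to use for installation; if the last line prints False after you activated my_ml_env, the activation probably did not take effect in the current shell.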

Installation Methods

Method 1: Using pip

This is the most straightforward method for most users.

  1. Update pip (recommended):

    pip3 install --upgrade pip
  2. Install scikit-learn and its core dependencies:

    pip3 install scikit-learn

    This command will automatically install NumPy and SciPy (along with the smaller dependencies joblib and threadpoolctl) if they are not already present.

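Pros: Simple, fast, and gets the latest stable version. Cons: pip installs precompiled wheels on common platforms, but it only manages Python packages. If you have complex scientific computing needs, conda's packages (which may link against different numerical libraries) might offer better performance or easier dependency handling.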

Method 2: Using conda (via Anaconda/Miniconda)

conda is a powerful package manager especially well-suited for data science and scientific computing, as it handles library dependencies very effectively.

  1. Install Miniconda (a lightweight alternative to Anaconda) or Anaconda from the official website.

  2. Create and activate a new conda environment (good practice):

    conda create -n my_ml_env python=3.9
    conda activate my_ml_env
  3. Install scikit-learn:

    conda install scikit-learn

Pros: Excellent dependency resolution, often provides optimized binaries for performance. Cons: Larger initial download (especially Anaconda).

Method 3: Installing from Source

This is typically for developers who want to contribute to scikit-learn or need the absolute latest, unreleased features.

  1. Clone the repository:

    git clone https://github.com/scikit-learn/scikit-learn.git
    cd scikit-learn
  2. Create a virtual environment and install build dependencies:

    python3 -m venv sklearn-dev
    source sklearn-dev/bin/activate
    pip install -U pip
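    pip install wheel numpy scipy cython meson-python ninja
  3. Install in development mode (a C/C++ compiler must be available; the exact packages and flags vary between releases, so check the project's advanced installation guide):

    pip install --editable . --no-build-isolation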

Pros: Access to the latest code. Cons: More complex setup, requires a build environment.

Verifying the Installation

After installation, verify that scikit-learn is correctly installed and check its version.

  1. Open a Python interpreter in your terminal:

    python3
  2. Run the following import command:

    import sklearn
    print(sklearn.__version__)

    If no error occurs and a version number is printed (e.g., 1.3.0), the installation was successful.

    You can also try importing a specific module to test further:

    from sklearn import datasets
    print("Scikit-learn is ready to use!")
    exit() # To leave the Python interpreter
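
To see the versions of scikit-learn and its dependencies in one go, a small script along these lines may help. It is only a sketch: the function name and the list of packages are assumptions, and it reads package metadata with the standard library rather than importing each package. (scikit-learn also ships its own sklearn.show_versions() helper, which prints a more detailed report.)

```python
from importlib import metadata


def installed_versions(packages=("scikit-learn", "numpy", "scipy", "joblib", "threadpoolctl")):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:15} {version or 'NOT INSTALLED'}")
```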

Your First Machine Learning Project: Iris Classification

Let's build a classic machine learning model to classify iris flowers into three species. This follows the standard scikit-learn workflow.

Step 1: Import Necessary Libraries

Create a new Python file, e.g., iris_classifier.py.

# Import datasets, model, and evaluation metric
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
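# For visualization (optional; not used in this script, and matplotlib
# must be installed separately with: pip3 install matplotlib)
# import matplotlib.pyplot as plt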

Step 2: Load the Dataset

Scikit-learn comes with several built-in datasets, including the famous Iris dataset.

# Load the iris dataset
iris = datasets.load_iris()
 
# The data (features) and target (labels) are stored as NumPy arrays
X = iris.data  # Features: sepal length, sepal width, petal length, petal width
y = iris.target # Target: 0=Setosa, 1=Versicolour, 2=Virginica
 
# Feature names and target names for reference
feature_names = iris.feature_names
target_names = iris.target_names

Step 3: Explore the Data

It's always good to understand your data before building a model.

print("Feature names:", feature_names)
print("Target names:", target_names)
print("Shape of data (samples, features):", X.shape)
print("First 5 samples of data:\n", X[:5])
print("First 5 targets:", y[:5])
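
Two more things worth checking at this stage are how many samples each class has and what scale each feature lives on. The following standalone sketch shows one way to do that with NumPy; the variable names class_counts and feature_means are just illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Count how many samples belong to each class label (0, 1, 2)
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")

# Per-feature mean, to get a feel for the scale of each measurement (in cm)
feature_means = X.mean(axis=0)
for name, mean in zip(iris.feature_names, feature_means):
    print(f"{name}: mean = {mean:.2f}")
```

Iris is perfectly balanced (50 samples per class), which is one reason plain accuracy works reasonably well as a metric here.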

Step 4: Split the Data

We split the data into a training set (to train the model) and a testing set (to evaluate its performance on unseen data).

# Split the data: 70% for training, 30% for testing
# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
 
print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])
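
A plain random split does not guarantee that each class is equally represented in the test set. If that matters for your data, train_test_split accepts a stratify argument. The sketch below is a variation on the split above, not a replacement for it, and shows the effect on the per-class counts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Samples per class in training set:", train_counts)
print("Samples per class in testing set:", test_counts)
```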

Step 5: Choose a Model and Train It

We'll use a Random Forest classifier, a powerful and versatile algorithm.

# Create a Random Forest Classifier object
model = RandomForestClassifier(n_estimators=100, random_state=42)
 
# Train the model using the training data
model.fit(X_train, y_train)
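
Once a Random Forest is fitted, it exposes a feature_importances_ attribute that gives a rough idea of which features the trees relied on. Here is a self-contained sketch that repeats the loading, splitting, and training steps so it can run on its own; the ranked variable is just an illustrative name.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, summing to 1
importances = model.feature_importances_
ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```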

Step 6: Make Predictions and Evaluate the Model

Now, we use the trained model to make predictions on the test set and see how well it performs.

# Make predictions on the test set
y_pred = model.predict(X_test)
 
# Calculate the accuracy: (correct predictions) / (total predictions)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}") # Should be very high (~1.00) for this simple dataset
 
# (Optional) Show a few actual vs. predicted values
print("\nSample Predictions:")
for i in range(5):
    print(f"Actual: {target_names[y_test[i]]} - Predicted: {target_names[y_pred[i]]}")

Run the script: python3 iris_classifier.py. You should see output showing the model's high accuracy.
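
A single accuracy number hides which classes get confused with each other. As a possible next step, the standalone sketch below prints a confusion matrix and a per-class report using sklearn.metrics; it repeats the earlier steps so it runs by itself.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

# Precision, recall, and F1 score for each species
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```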

Key Concepts and Common Tasks

Data Preprocessing

Real-world data is often messy. Scikit-learn provides tools in sklearn.preprocessing to clean it up.

from sklearn.preprocessing import StandardScaler, LabelEncoder
 
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Important: use the same scaling for the test set!
 
# Encode text labels into numbers (if your target is text like 'cat', 'dog')
# le = LabelEncoder()
# y_encoded = le.fit_transform(y_text_data)
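
One way to make sure the test set always receives the same scaling as the training set is to bundle the scaler and the model into a pipeline. The sketch below uses make_pipeline with a LogisticRegression (a swapped-in model chosen because it benefits from scaling; tree-based models such as Random Forest generally do not need it). The text labels passed to LabelEncoder are made-up example data.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit() learns the scaling from the training data only;
# score()/predict() reuse that scaling on new data automatically
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
pipeline_accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {pipeline_accuracy:.2f}")

# LabelEncoder on hypothetical text labels: classes are numbered in sorted order
le = LabelEncoder()
y_encoded = le.fit_transform(["cat", "dog", "dog", "cat"])
print("Encoded labels:", y_encoded)
print("Decoded again:", le.inverse_transform(y_encoded))
```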

Model Evaluation Techniques

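Accuracy on a single train/test split isn't always a reliable measure: it depends on which samples happen to land in the test set, and it can mislead when classes are imbalanced. Use cross-validation for a more robust estimate.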

from sklearn.model_selection import cross_val_score
 
# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print(f"Mean CV Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
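
If you want more than one metric per fold, cross_validate accepts a list of scorer names. The sketch below reports accuracy alongside macro-averaged F1; the choice of f1_macro is just an example of a second metric, not a recommendation for every problem.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each requested scorer shows up in the result under the key "test_<name>"
results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])

for metric in ("accuracy", "f1_macro"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: mean = {fold_scores.mean():.2f}, std = {fold_scores.std():.2f}")
```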

Saving and Loading Models

Once you have a trained model you're happy with, you can save it to disk for later use.

import joblib
 
# Save the trained model to a file
filename = 'final_iris_model.pkl'
joblib.dump(model, filename)
 
# ... Later, in another script ...
# Load the model from the file
loaded_model = joblib.load(filename)
# Use it to make predictions without retraining
# (pass features prepared the same way as the training data; this model was trained on unscaled X_train)
new_prediction = loaded_model.predict(X_test)
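
To convince yourself that a reloaded model behaves exactly like the original, you can compare their predictions. This is a self-contained sketch that writes to a temporary directory so it leaves no files behind; the file name iris_model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_model.joblib")  # hypothetical file name
    joblib.dump(model, path)          # save the fitted model
    loaded_model = joblib.load(path)  # load it back into memory

# The reloaded model should give the same predictions as the original
original_pred = model.predict(X)
loaded_pred = loaded_model.predict(X)
predictions_match = np.array_equal(original_pred, loaded_pred)
print("Predictions identical after reload:", predictions_match)
```

Saved models are generally only safe to reload with the same scikit-learn version that created them, and only from files you trust, since joblib relies on pickle.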

Troubleshooting Common Issues

  • ModuleNotFoundError: No module named 'sklearn': This means scikit-learn is not installed. Double-check that you are in the correct virtual environment (if using one) and that the installation command completed successfully.
  • Import errors related to NumPy or SciPy: Scikit-learn depends on these. Try installing them first: pip3 install numpy scipy.
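  • Permission Errors during pip install: Never use sudo with pip if you are in a virtual environment. Installing globally with sudo pip3 install scikit-learn is not recommended, and recent distributions (for example, Debian 12 and Ubuntu 23.04 or later) reject system-wide pip installs with an externally-managed-environment error. Using a virtual environment is a much safer practice.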
  • Slow Installation: On a Raspberry Pi or other hardware for which no matching pre-compiled wheel exists, pip falls back to building from source, which can be very slow. Consider using conda, which provides pre-compiled binaries, or tell pip to accept only pre-compiled wheels so that it fails quickly instead of compiling: pip3 install scikit-learn --only-binary=scikit-learn.
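
Many of these problems come down to running a different Python interpreter than the one scikit-learn was installed into. A quick diagnostic along these lines can help narrow that down; the diagnose function is a hypothetical helper written for this post, not part of scikit-learn.

```python
import importlib.util
import sys


def diagnose(module_name="sklearn"):
    """Report which interpreter is running and whether it can find `module_name`."""
    spec = importlib.util.find_spec(module_name)
    return {
        "python_executable": sys.executable,
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


for key, value in diagnose().items():
    print(f"{key}: {value}")
```

If module_found is False while pip3 reports scikit-learn as installed, pip3 and python3 are most likely pointing at different environments; running python3 -m pip install scikit-learn ties the install to the interpreter you are actually using.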

Conclusion

You have successfully installed scikit-learn on your Linux system and taken your first steps into the world of machine learning by building a simple classifier. The key to mastery is practice. Explore the scikit-learn documentation to learn about the dozens of other algorithms available, try different datasets (you can load CSV files using pandas and then use scikit-learn on them), and experiment with the various parameters of each model.

The consistent API design of scikit-learn makes it easy to swap out different models and preprocessing steps, so don't be afraid to experiment!
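
As a small illustration of that consistency, the sketch below trains three different classifiers on the same split using exactly the same fit and score calls. The particular models and their parameters are arbitrary choices made for this example.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Every estimator exposes the same fit/predict/score interface
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = estimator.score(X_test, y_test)  # accuracy on the test set
    print(f"{name}: {results[name]:.2f}")
```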

References

  1. Official Scikit-learn Website: https://scikit-learn.org/stable/
  2. Scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
  3. Scikit-learn API Reference: https://scikit-learn.org/stable/modules/classes.html
  4. Anaconda Distribution: https://www.anaconda.com/products/distribution
  5. Python Virtual Environments Documentation: https://docs.python.org/3/tutorial/venv.html