Data analyst agent: get your data’s insights in the blink of an eye ✨

Authored by: Aymeric Roucher

This tutorial is advanced. You should be familiar with the notions from this other cookbook first!

In this notebook we will make a data analyst agent: a Code agent armed with data analysis libraries that can load and transform dataframes to extract insights from your data, and even plot the results!

Let’s say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into this myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let’s set up this system.

Run the line below to install the required dependencies:

!pip install seaborn "transformers[agents]"

We first create the agent. We use a ReactCodeAgent (read the documentation to learn more about types of agents), so we do not even need to give it any tools: it can directly run its code.

We simply make sure to let it use data science-related libraries by passing these in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"].

In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries installed in your environment.
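A quick way to check this before creating the agent is to look each library up with the standard library's importlib. This is a minimal sketch, not part of the original notebook; the helper name missing_imports is hypothetical.

```python
import importlib.util


def missing_imports(module_names):
    """Return the names whose top-level package cannot be found locally."""
    missing = []
    for name in module_names:
        # Only the top-level package is looked up: "matplotlib.pyplot" -> "matplotlib"
        top_level = name.split(".")[0]
        if importlib.util.find_spec(top_level) is None:
            missing.append(name)
    return missing


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
print("Missing packages:", missing_imports(authorized))
```

Any name printed here would need to be installed with pip before the agent can import it.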

⚙ Our agent will be powered by meta-llama/Meta-Llama-3.1-70B-Instruct using the HfEngine class, which uses HF’s Inference API: the Inference API lets you quickly and easily run any open-source model.

from transformers.agents import HfEngine, ReactCodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

llm_engine = HfEngine("meta-llama/Meta-Llama-3.1-70B-Instruct")

agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_iterations=10,
)
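The setup above reads the token from the HUGGINGFACEHUB_API_TOKEN environment variable, and os.getenv returns None when that variable is unset. As a hedged sketch (the helper read_token is hypothetical, not part of huggingface_hub), you could fail early with a clear message instead:

```python
import os


def read_token(env=os.environ, name="HUGGINGFACEHUB_API_TOKEN"):
    """Return the token stored under `name`, or raise if it is unset or empty."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before calling login().")
    return token


# Demo with a stand-in mapping and a placeholder value (not a real token)
print(read_token({"HUGGINGFACEHUB_API_TOKEN": "hf_example"}))
```

In the notebook, the result of read_token() could then be passed to login().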

Data analysis 📊🤔

When running the agent, we provide it with additional notes taken directly from the competition, and pass these as a kwarg to the run method:

import os

os.makedirs("./figures", exist_ok=True)  # does not fail if the folder already exists
additional_notes = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
"""

analysis = agent.run(
    """You are an expert data analyst.
Please load the source file and analyze its content.
According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
Then answer these questions one by one, by finding the relevant numbers.
Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.

In your final answer: summarize these correlations and trends
After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
Your final answer should have at least 3 numbered and detailed parts.
""",
    additional_notes=additional_notes,
    source_file="titanic/train.csv",
)
>>> print(analysis)
Here are the correlations and trends found in the data:

1. **Correlation between age and survival rate**: The correlation is -0.0772, which suggests that as age increases, the survival rate decreases. This implies that older passengers were less likely to survive the Titanic disaster.

2. **Relationship between Pclass and survival rate**: The survival rates for each Pclass are:
   - Pclass 1: 62.96%
   - Pclass 2: 47.28%
   - Pclass 3: 24.24%
   This shows that passengers in higher socio-economic classes (Pclass 1 and 2) had a significantly higher survival rate compared to those in the lower class (Pclass 3).

3. **Relationship between fare and survival rate**: The correlation is 0.2573, which suggests a moderate positive relationship between fare and survival rate. This implies that passengers who paid higher fares were more likely to survive the disaster.

Impressive, isn’t it? You could also provide your agent with a visualizer tool to let it reflect upon its own graphs!
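Since the agent writes and runs its own code, it can be worth recomputing a couple of its numbers yourself. Below is a minimal sketch in plain Python of the two kinds of figures it reported: a survival rate per group and a Pearson correlation. The rows are made-up toy data that only reuse the Titanic column names, and the helpers survival_rate_by and pearson are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy rows (NOT the real dataset); swap in open("titanic/train.csv") to check real numbers
TOY_CSV = """Survived,Pclass,Fare
1,1,80.0
1,1,60.0
0,1,50.0
1,2,20.0
0,2,15.0
0,3,8.0
0,3,7.5
1,3,9.0
"""


def survival_rate_by(rows, column):
    """Map each value of `column` to the fraction of rows with Survived == 1."""
    counts = defaultdict(lambda: [0, 0])  # value -> [survivors, total]
    for row in rows:
        counts[row[column]][0] += int(row["Survived"])
        counts[row[column]][1] += 1
    return {key: survived / total for key, (survived, total) in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
rates = survival_rate_by(rows, "Pclass")
fare_corr = pearson(
    [float(r["Fare"]) for r in rows],
    [float(r["Survived"]) for r in rows],
)
print(rates)
print(round(fare_corr, 4))
```

With pandas, the same checks are df.groupby("Pclass")["Survived"].mean() and df["Fare"].corr(df["Survived"]). On the real file, the Age column has missing values, so those rows need to be filtered out before computing a correlation by hand.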

Data scientist agent: Run predictions 🛠️

👉 Now let’s dig further: we will let our model perform predictions on the data.

To do so, we also let it use sklearn in the additional_authorized_imports.

agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_iterations=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_notes=additional_notes + "\n" + analysis,
)
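The agent decides on its own how to write './output.csv', so it is worth checking the file before submitting it. The sketch below assumes the Kaggle Titanic submission format, a PassengerId column and a Survived column containing 0 or 1; the helper check_submission is hypothetical, and the sample rows are illustrative rather than real predictions.

```python
import csv
import io

EXPECTED_COLUMNS = ["PassengerId", "Survived"]


def check_submission(csv_file, expected_rows=None):
    """Return a list of problems found in a Titanic-style submission (empty if none)."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    # Predictions must be the literal labels 0 or 1
    bad_ids = [r["PassengerId"] for r in rows if r["Survived"] not in ("0", "1")]
    if bad_ids:
        problems.append(f"non-binary predictions for PassengerId: {bad_ids}")
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems


# Illustrative in-memory file standing in for './output.csv'
sample = io.StringIO("PassengerId,Survived\n892,0\n893,1\n")
print(check_submission(sample, expected_rows=2))
```

For the real file, you could open './output.csv' with open(..., newline="") and pass expected_rows=418, assuming the standard 418-row test set of the competition.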

The test predictions that the agent output above, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360, and is better than what I had painfully achieved when first trying the challenge years ago.

Your results will vary, but I find it very impressive to achieve this with an agent in a few seconds.

🚀 The above is just a naive attempt at a data analyst agent: it can certainly be improved a lot to fit your use case better!
