How to Generate Pandas Code with LLM Tools like ChatGPT
What Is Pandas?
Pandas is an open source Python package most commonly used for data analysis. Wes McKinney created Pandas while working at a hedge fund named AQR Capital Management, so it is no coincidence that Pandas is often used for data analysis in financial services. Pandas allows users to work with large datasets using syntax that is simple compared to other Python packages. Python has a whole host of packages for data analysis, but Pandas is certainly a foundational one, and most beginners start with it. Pandas has a ton of use cases. Let’s quickly go over a few:
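Statistical Analysis: In Pandas, you can run much of the analysis you would learn in the first few years of a statistics degree. This includes correlations, minimums, maximums, and other descriptive statistics. Regressions and tests of statistical significance are usually run by passing Pandas data to companion packages such as statsmodels or SciPy.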
Data Cleaning: Pandas is the best Python package for taking messy, unstructured data and putting it into a form where it can be analyzed or included in some production process. These tasks include removing null values, capitalizing strings, and formatting date values.
Data Transformations: This includes joining multiple datasets, creating pivot tables for a dataset, transposing dataframes, and much more!
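As a rough illustration of the use cases above, here is a minimal sketch that cleans a small dataset, computes a few statistics, and transposes a summary table. The dataset, its column names, and its values are all invented for this example.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "region": ["east", "west", None, "east", "west"],
    "date": ["2023-01-05", "2023-01-06", "2023-01-07", "2023-02-01", "2023-02-03"],
    "units": [10, 20, 5, 30, 40],
    "price": [2.0, 2.5, 3.0, 3.0, 3.5],
})

# Data cleaning: drop rows with null values, capitalize strings, parse dates.
clean = sales.dropna().copy()
clean["region"] = clean["region"].str.capitalize()
clean["date"] = pd.to_datetime(clean["date"])

# Statistical analysis: minimum, maximum, and a correlation.
units_min = clean["units"].min()
units_max = clean["units"].max()
units_price_corr = clean["units"].corr(clean["price"])

# Data transformation: transpose the summary so each column becomes a row.
summary = clean[["units", "price"]].describe().T

print(summary)
```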
Now let's do a deep dive into some of the best applications for AI Code Generation with Python and Pandas:
ChatGPT Web-app
To get started with ChatGPT, go to the OpenAI website and make an account. There you can access the ChatGPT chat interface, where you can ask questions about anything, such as "Who was the 10th president of the United States, and what were their major policy initiatives?"
Notably, the model does not have access to real time data. So if I ask "What is the weather going to be in London on Friday?" I will receive this response:
Let's see some examples with Python/Pandas code generation. All you need to do is type your prompt in the chat window.
Here is what it looks like to tell ChatGPT to make a Python pivot table:
One great feature is that the interface outputs code in a box separate from the text. This makes copying the code simple and keeps the comments distinguishable from the Python syntax.
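For reference, a pivot table prompt like the one above tends to produce code along the following lines. This is a hand-written sketch, not ChatGPT's actual output, and the dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the columns are invented for illustration.
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "Sales"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "revenue": [100, 150, 200, 250, 50],
})

# Rows are departments, columns are quarters, cells hold summed revenue.
pivot = pd.pivot_table(
    df,
    index="department",
    columns="quarter",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

print(pivot)
```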
Next we'll cover the ChatGPT API extension into Jupyter, and how to use it. By the way, I wrote the whole next section with ChatGPT.
ChatGPT API:
Getting started with ChatGPT for Python generation can be a great way to add some fun and creativity to your programming projects. Here's a 200-word guide on how to get started:
- Install the necessary packages: First, you'll need to install the `openai` package in Python. You can do this using `pip install openai`.
- Set up your OpenAI account: Next, you'll need to create an account on the OpenAI website, if you haven't already. Once you've done this, you'll need to generate an API key.
- Initialize the OpenAI API: Now, you're ready to start using ChatGPT. In your Python code, you'll need to initialize the OpenAI API by setting `openai.api_key = "YOUR_API_KEY"`.
- Generate text: To generate text using ChatGPT, you'll need to call the `openai.Completion.create()` method. This method takes several parameters, including the text prompt, the length of the output, and the model to use (in this case, "text-davinci-002").
- Display the output: Finally, you'll want to display the output generated by ChatGPT. This can be done by accessing the `choices` attribute of the response object returned by `openai.Completion.create()`, and then accessing the `text` attribute of the first element in the list.
That's it! With these steps, you'll be well on your way to generating creative and engaging text using ChatGPT in Python.
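The steps above can be sketched roughly as follows. This assumes the legacy module-level interface that the guide describes; releases of the `openai` package from 1.0 onward replaced it with a client object, and the "text-davinci-002" model may no longer be offered, so check your installed version first. The real request is left commented out because it needs an API key and network access. `build_prompt` and `first_choice_text` are hypothetical helpers, and the response shown is a stand-in with the same choices-then-text shape, not real model output.

```python
def build_prompt(task: str) -> str:
    """Wrap a plain-English task in a short instruction (hypothetical helper)."""
    return f"Write pandas code to {task}. Return only Python code."


def first_choice_text(response) -> str:
    """Return the text of the first choice, as described in the last step."""
    return response["choices"][0]["text"].strip()


prompt = build_prompt("concatenate two string columns")

# With the legacy interface described above, the request would look like this:
#
#   import openai
#   openai.api_key = "YOUR_API_KEY"
#   response = openai.Completion.create(
#       model="text-davinci-002", prompt=prompt, max_tokens=200
#   )

# Stand-in response with the same shape, for illustration only.
response = {"choices": [{"text": "\n\ndf['full'] = df['first'] + df['last']\n"}]}

generated_code = first_choice_text(response)
print(generated_code)
```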
Mito AI Code Generation interface in Jupyter
Mito is a spreadsheet interface for Python. Imagine Excel or Google Sheets, but inside your Python environment, where each edit you make in the spreadsheet generates the equivalent Python. So if you make a pivot table or merge two datasets together, Mito will generate the corresponding Pandas code in the following code cell.
Mito AI is a new enhancement to Mito that lets you access an LLM inside your Python environment. The interface allows you to write a prompt in the side menu and have the equivalent Pandas generated for you. You can then apply the code by clicking the "Execute Generated Code" button.
Mito also allows you to scroll through your recent prompts so you can see your recent work. One way Mito differs from other code generation tools is that its AI feature generates the code and edits the spreadsheet interface as well. This allows the user to see the code output and a visual representation of their edit at the same time. It is important to review the output of the LLM's work, and the Mito extension in Jupyter lets users see the effect of the output visually, which is a great way to tell whether there was an error or whether the output had the effect the user wanted.
Why would I want to use AI to generate code?
Generative AI has been used to write code for some time, but in the last year ChatGPT has made this way of writing Python dramatically more popular. Before using AI to write Pandas, the important questions are:
- Is this the right use case for AI code generation?
- Which AI tool should I use for Python generation?
- Will using AI Pandas generation make me a worse data analyst?
Is this the right use case for AI code generation?
Let’s dive into each of these questions. When it comes to deciding if your use case can be helped by AI programming, there’s no clear yes or no answer. There are some important rules of thumb to consider though. For starters, the longer the task you assign the AI, the more you will need to go in and clean up the code yourself. Telling the LLM (large language model, the type of model used for code generation) to “create a full ETL process including data ingestion, cleaning, transformations, and modeling” will lead to the output below. You can see it is more of a process guide than an actual Python script that you could put into production.
Shorter tasks are better for AI currently. This may change as the ability of LLMs to generate code advances in the future. A great prompt for the AI would be “what is the pandas syntax for concatenating two columns?” Below you can see a concise, copyable chunk of Pandas and a great description of how it works. Regardless of the size of your prompt and how much code is generated, it is important that you review the code yourself before sending it to a colleague or putting it into production. ChatGPT, like any other LLM, makes errors, and human review is still an important step in writing good Python or Pandas.
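For the column concatenation prompt, the answer usually comes down to a line or two of Pandas. The sketch below is hand-written rather than actual model output, and the dataframe and column names are invented for the example.

```python
import pandas as pd

# Hypothetical data; the column names are invented for illustration.
df = pd.DataFrame({
    "first_name": ["Ada", "Grace"],
    "last_name": ["Lovelace", "Hopper"],
    "id": [1, 2],
})

# String columns concatenate element-wise with the + operator.
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Non-string columns need to be cast to strings first.
df["label"] = df["id"].astype(str) + "-" + df["last_name"]

print(df[["full_name", "label"]])
```

The cast in the second case is the kind of detail worth checking during review, since adding a numeric column to a string column directly raises an error.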
Which AI tool should I use for Python generation?
This article also takes a deep dive into specific code generation tools that use AI. Here we’ll discuss how to decide which tool is right for your task. The first thing to think about is which environment you want to be working in. ChatGPT is most commonly used in the OpenAI web environment. This is great if you are using AI as a replacement for searching for Python/Pandas syntax online. Combing through Stack Overflow or documentation can be time consuming. Using AI allows you to access the answer you need much faster. But other tools are embedded right into your Python environment, so you don’t need to switch between your environment and an AI programming tool to write your code. Mito exists directly in your Jupyter environment, so you can access LLM code generation without leaving your work. This helps the user stay in flow and focus on completing their analysis.
There is also a difference between AI code generation and AI code completion. GitHub Copilot is a great tool that helps you finish lines of code inline. The user starts typing the line of Python/Pandas they want, and Copilot recommends the rest of the line based on its understanding of your previous work and of millions of repositories more generally. Code generation models like ChatGPT are prompted for full snippets of code instead of code completions. Something like Copilot is better for scripting: you are writing hundreds or thousands of lines of Python and you want a tool to help you stay in flow and write them faster. Something like Mito is better for data analysis: you want to merge two Pandas dataframes together, and you need to grab the command for that quickly.
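As an example of the merge command mentioned above, here is a minimal sketch of joining two dataframes on a shared key. The tables and column names are hypothetical, and it is not output from any particular tool.

```python
import pandas as pd

# Hypothetical tables sharing a customer_id key.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4], "amount": [20.0, 35.0, 15.0, 50.0]})

# Inner join: keep only customer_ids that appear in both tables.
merged = customers.merge(orders, on="customer_id", how="inner")

# Left join: keep every customer; those without orders get NaN amounts.
all_customers = customers.merge(orders, on="customer_id", how="left")

print(merged)
print(all_customers)
```

The `how` argument is the part to double-check in generated code, because the choice of join type changes which rows survive.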
Will using AI Pandas generation make me a worse data analyst?
It’s fair to say that this boils down to personal preference, but if you are looking for a career in data science, most experts would tell you that you need an understanding of Pandas/Python that you can’t reach if you only generate code with LLMs or other AI models. It is probably better to think of these models as an extension of your programming ability and not rely on them fully. A simple reason for this is communication. You want to be able to tell colleagues what your code does and why its structure makes sense. A lot of programming is subjective, meaning you can solve the same problem with code that looks very different. So while using AI code generation may lead to a great outcome for a single task, over time you will miss out on valuable domain expertise if you use it too much. A balance needs to be struck between relying on LLMs like ChatGPT and writing code yourself.
An important skill when using AI models to generate code is the ability to use them quickly. Obviously these models provide a lot of time savings out of the box, but there are some quick tricks that will help too:
- If you’re using a model that isn’t in the same window as your Python environment, such as ChatGPT, keep both windows open next to each other so you can see your code and the model's code at the same time. This will save time on context switching.
- This is mentioned above, but it can be faster to prompt the model with shorter commands, so that you can quickly check the output for correctness and relevance before building out a lot of code that relies on it. The one caveat here is that you may prefer to provide a larger, more encompassing prompt first, and then check the approach the model takes to complete the task to see if it lines up with how you'd like to tackle it.
- Use keyboard shortcuts!! It is obvious, but quite helpful. Command + C to copy an output and Command + V to paste it. Another great shortcut is the ability to switch tabs in your web browser between your Python environment and your code generation model: Ctrl + Tab.
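The second tip, checking short outputs for correctness, can be made concrete by running a generated snippet on a tiny dataframe whose answer you can work out by hand. In the minimal sketch below, `generated_total_by_group` is a hypothetical stand-in for whatever code the model returned, and the sample data is invented.

```python
import pandas as pd


def generated_total_by_group(df: pd.DataFrame) -> pd.Series:
    """Stand-in for model-generated code: total 'value' per 'group'."""
    return df.groupby("group")["value"].sum()


# Tiny input whose correct answer is easy to work out by hand.
sample = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 10]})
result = generated_total_by_group(sample)

# Compare against the hand-computed totals before building on the code.
expected = {"a": 3, "b": 10}
assert result.to_dict() == expected, f"unexpected output: {result.to_dict()}"
print("generated code matches the hand-computed answer")
```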