Using OpenAI “Codex” for Data Science Tasks

Topic Modelling the 20 Newsgroups Dataset

Andreas Stöckl
3 min read · Sep 3, 2021
Photo by Christina Morillo from Pexels

The “Codex” language model published by OpenAI for generating source code has attracted great interest and has been described in many reports, some of them very enthusiastic.

In this article, I want to check whether the system is suitable for typical data science tasks in Python. For this, I will use Jupyter notebooks as usual, but instead of writing the code by hand, I will give instructions in natural language. These instructions are then sent to OpenAI’s API (https://beta.openai.com) to generate the code.

The following code shows the call to “Codex” via the API.

The corresponding language model “davinci-codex” from the GPT-3 family is selected. The parameters (temperature, top_p) are chosen so that no random results are generated, but rather the script with the highest probability. Repetition of tokens should also not be suppressed, because repetition is quite normal in source code. Finally, the maximum number of tokens to be generated must be set.
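
Since the original script appears here only as an image, the following is a minimal sketch of such a call with the 2021 Python client; the API key handling, the prompt variable, and the value of max_tokens are placeholders of my choosing:

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, use your own key

prompt = "<natural-language instruction, see the prompts below>"

response = openai.Completion.create(
    engine="davinci-codex",   # the Codex model from the GPT-3 family
    prompt=prompt,
    temperature=0,            # no randomness: generate the most likely script
    top_p=1,
    frequency_penalty=0,      # do not suppress repeated tokens
    presence_penalty=0,
    max_tokens=500            # assumed upper bound on generated tokens
)

print(response["choices"][0]["text"])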

In what follows, I want to create a topic model with the LDA method for the well-known 20 newsgroups dataset (© Tom Mitchell, School of Computer Science, Carnegie Mellon University).

Command in natural language (“Prompt”):
“Generate the python code
Import pandas and load the 20 newsgroups dataset with scikit learn into a dataframe.
Label the column as ‘text’.
Print the first 10 rows”

Result:
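
The generated script is shown in the original as a screenshot; a sketch that matches the prompt and the behaviour described below (the document count is the unrequested addition discussed at the end) looks like this:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load all posts of the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all')

# Put the raw texts into a dataframe with a single column 'text'
df = pd.DataFrame(newsgroups.data, columns=['text'])

# Number of documents (not asked for in the prompt)
print(len(df))

# Print the first 10 rows
print(df.head(10))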

This script runs perfectly. Let us now describe the second part of our task.

Command in natural language (“Prompt”):
“Load the gensim package make a corpus of documents from the column ‘text’ of the dataframe, and a dictionary from the tokenized documents and train a LDA model on it.”

Result:
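
Again, the original shows the generated code only as a screenshot; a reconstruction that produces output of the form shown below (the whitespace tokenization and the number of passes are assumptions on my part) is:

import gensim
from gensim import corpora

# Tokenize each document by whitespace
texts = [doc.split() for doc in df['text']]

# Build a dictionary from the tokenized documents and a bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model with 10 topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10,
                                           id2word=dictionary, passes=2)

# Print the top words of each topic (not asked for in the prompt)
for idx, topic in ldamodel.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))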

This gives the following output:

Topic: 0 
Words: 0.007*"." + 0.004*":-" + 0.003*"*" + 0.002*"**" + 0.002*"<><" + 0.002*"*****" + 0.002*"Georgia" + 0.002*"Covington)" + 0.001*"mcovingt@aisun3.ai.uga.edu" + 0.001*"2.10"
Topic: 1
Words: 0.053*"the" + 0.029*"to" + 0.028*"of" + 0.020*"a" + 0.020*"and" + 0.017*"that" + 0.016*"is" + 0.016*"in" + 0.012*"I" + 0.011*">"
Topic: 2
Words: 0.034*"the" + 0.025*"and" + 0.024*"of" + 0.019*"to" + 0.015*"for" + 0.015*"a" + 0.013*"is" + 0.013*"in" + 0.011*"-" + 0.010*"|"
Topic: 3
Words: 0.025*"1" + 0.021*"0" + 0.013*"2" + 0.009*"-" + 0.008*"3" + 0.007*"4" + 0.006*"$" + 0.006*"25" + 0.005*"team" + 0.005*"7"
Topic: 4
Words: 0.003*"----" + 0.001*"rind@enterprise.bih.harvard.edu" + 0.001*"Stephenson)" + 0.000*"|*|" + 0.000*"Rind)" + 0.000*"Defensive" + 0.000*"Stephenson" + 0.000*"Boylan)" + 0.000*"enterprise.bih.harvard.edu" + 0.000*"Rind"
Topic: 5
Words: 0.028*"the" + 0.026*"I" + 0.024*"a" + 0.018*"to" + 0.013*">" + 0.012*"From:" + 0.012*"and" + 0.012*"Lines:" + 0.011*"Subject:" + 0.011*"Organization:"
Topic: 6
Words: 0.089*"MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'" + 0.003*"M"`@("`@("`@("`@("`@("`@("`@("`@("`@("`@("`@("`@("`@("`@("`@(" + 0.001*"MG9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=" + 0.001*"of" + 0.001*"(Cliff)" + 0.001*"pwiseman@salmon.usd.edu" + 0.001*"------------" + 0.001*"--------" + 0.001*"14" + 0.000*"Mel"
Topic: 7
Words: 0.014*"_/" + 0.002*"_/_/_/" + 0.001*"Gainey" + 0.001*"jbh55289@uxa.cso.uiuc.edu" + 0.001*"_/_/" + 0.001*"_/_/_/_/" + 0.001*"aws@iti.org" + 0.001*"Sherzer)" + 0.001*"+---------------------------------------------------------------------------+" + 0.001*"SSTO"
Topic: 8
Words: 0.067*"X" + 0.053*"*" + 0.011*"|" + 0.009*"=" + 0.008*"*/" + 0.007*"/*" + 0.007*"}" + 0.005*"|>" + 0.004*"entry" + 0.004*"{"
Topic: 9
Words: 0.003*"ground" + 0.003*"wire" + 0.002*"wiring" + 0.002*"outlet" + 0.002*"outlets" + 0.002*"neutral" + 0.002*"water" + 0.001*"ADL" + 0.001*"Canadiens" + 0.001*"cooling"

This worked fine. We created a topic model with two paragraphs of natural-language instructions, without writing the necessary Python code by hand.

What I have learned

The prompt must be chosen very carefully. Sometimes small changes in the prompt lead to inexplicable changes in the result. The instructions must precisely describe the process that is to be generated as code. This requires some experimentation, and no less know-how than writing the code by hand. The main advantage, and thus the time saving, is that one does not need to know all the details of the syntax.

The author must have mastered the algorithms and must know the required program packages. The language model does roughly the same as entering the right query into a Google search, taking the Stack Overflow results it finds, and putting the code snippets together, only faster and more comfortably.

As you can see in the examples above, with the output of the number of documents in the first part and the printing of the top words in the second part, parts are sometimes added to the script that are correct and useful but were not actually specified in the prompt. In this sense, the language model is creative.


Andreas Stöckl

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/