How to do Function Calling with any Chat LLM (that is clever enough)
Using C#, Reflection and Workflow-as-Text to create an Agent that can do everything you can code, autonomously
Introduction
Earlier this year, OpenAI introduced Function Calling, a way to have the LLM decide what to execute in your code based on predefined function definitions. A little later, Microsoft added the feature to their Azure OpenAI services. As is typical in this field, support for integration into a programming language is limited to "Python" and "everyone else who puts up with the REST API". This is bad news for us .NET devs, but it also opens up the opportunity to avoid vendor lock-in out of laziness: if we have to implement it anyway, why not do it in a generic way that covers all other LLMs as well? This blog post shows just that: using reflection for LLM-agnostic function calling. In theory, this works with every LLM that is good at logic and at understanding and carrying context. Unfortunately, almost all OSS LLMs are worse than GPT-3.5-Turbo right now, and 3.5-Turbo is not clever enough for this. The only one that works is GPT-4. But the field is changing so quickly that soon LLaMA (Meta), Bard (Google), Claude (Anthropic) and hundreds of open-source initiatives will catch up to GPT-4, making them viable for this.
The scenario
We want to make a bot that guides the user through the process of creating an image with Stable Diffusion, refining that image and finalizing it. The bot should be as responsive as a chatbot, triggering the creation, refinement and finalization of images while also talking naturally to the user to get their opinion and feedback.
This could be done with a lot of if-else cases, some loops and an implementation that is tightly coupled to the use case. Or we just code functions like "MakeImage", "RefineImage" and so on, and let the LLM decide what to do and when.
All we have to do is to write down a freely written workflow as a text and let the LLM figure out the rest.
How to make a chatbot do things
Chatbots are bots that chat. They receive text, do some "thinking", and reply with text. This is how ChatGPT, Claude and all the others operate. They are starting to be able to receive files and images in addition to text, and to reply with text, files and images. But there is no effect on anything outside of the conversation. This is where functions come in: they are a way of saying "Instead of replying to me with some text, reply to me with an idea of what to execute". For example, if the user asks "When will I see the sun again in Sydney?", which requires real-time information, the LLM can reply with "Lookup_sun_at_location(city: Sydney)", and it's up to the user's client or some system in between to execute the "Lookup_sun_at_location" function and hand the result back to the LLM, which can use it to create a text-based answer like "It is sunny right now in Sydney".
For our scenario, however, we can go one step further. Instead of only reacting to a user's query, we can define a whole workflow, which the LLM will read, understand, and execute functions for whenever it thinks it's necessary. In fact, we can drop all chat behaviour and restrict plain-text answers to calling a function named "AskQuestion". The LLM will no longer "chat" with the user; it will call a function that asks the user for feedback. Taking the chat out of the chatbot is what requires heavy-duty text understanding and logic, which is why only the most advanced models can do it at the moment. (One could probably fine-tune an LLM to do that, though.)
To achieve this restriction, this system prompt does the trick:
You are a computer system. You can not answer the user directly. instead, your only option is to call functions. Refer to the list of functions below to see what you can do.
For example, if you need input from the user, you do not ask directly. instead, you call the AskQuestion(questionText: Your question here) function. All parameters are required, always add the parameter name as well as the value.
Don't make assumptions about what the user wants. Ask for clarification if a user request is ambiguous. Only use the functions you have been provided with.
The Workflow
For this specific scenario, we need a workflow that covers all aspects of creating an image with Stable Diffusion: coming up with an initial idea, creating a prompt for that idea, generating a couple of sample images, picking the best one, refining the best one into more images, picking the best one again, and so on, until we arrive at a final image. In text, this is the workflow that the LLM is required to understand and follow:
1) Ask the user what they want in their picture
2) create 4 images with slightly different prompts with different artists and artstyles
3) Present all 4 pictures to the user and ask them which one they like most, and if they want to have something changed in the prompt for the picture.
4) refine that image 4 times, use the original prompt for the image, or a changed prompt, if the user wanted some change to it. Use a lower denoise strength and a random seed. Present all 4 refined images to the user
5) repeat steps 3 and 4 for 3 more times
6) upscale the final image, present the upscaled image and say goodbye.
This is of course a workflow for this specific scenario. The nice thing about this approach is that you can swap out the workflow without rewriting any logic code.
The Functions
As the workflow defines what to do, the functions define how to do it. To enable the LLM to follow the workflow and, for example, create an image, we need to have the "MakeImage" function coded out and make it known to the LLM in the system prompt. This is where reflection comes in handy. Thanks to the nature of .NET, we get information about our code while it's running. This means we can read all methods from a class, map them, convert them to JSON, and present that JSON to the LLM. To help the LLM understand what a method is for and how to use it, we should also provide meta-information, such as a description, an example of how to call it, and what it returns. This is the "MakeImage" method with the appropriate decoration:
[Description("Create an image")]
[Example("MakeImage(prompt: 'a picture of a whale flying, RAW glossy image, modern, in front of church', negativeprompt: 'ugly dark', imageName: img1.jpg )")]
[Returns("The URL to the created image")]
public string MakeImage(string prompt, string negativeprompt, string imageName )
{
//Your actual implementation here. This could be running stable diffusion
//locally, or using a webservice like Replicate
}
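For reference, a real body for this method might talk to a locally hosted Stable Diffusion instance. The following is only a sketch: it assumes an AUTOMATIC1111 web UI running locally with its --api flag enabled and its /sdapi/v1/txt2img endpoint; the endpoint path and field names may differ between versions, so treat it as illustrative rather than authoritative.

// Sketch only -- assumes AUTOMATIC1111's Stable Diffusion web UI is running locally with --api;
// requires System, System.IO, System.Net.Http, System.Text and System.Text.Json
public string MakeImage(string prompt, string negativeprompt, string imageName)
{
    using var http = new HttpClient();
    var request = new { prompt = prompt, negative_prompt = negativeprompt, steps = 20 };
    var content = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");

    // Toolbox methods are synchronous, so we block here for simplicity
    var response = http.PostAsync("http://127.0.0.1:7860/sdapi/v1/txt2img", content).Result;
    using var doc = JsonDocument.Parse(response.Content.ReadAsStringAsync().Result);

    // The API returns the generated image(s) as base64 strings
    byte[] imageBytes = Convert.FromBase64String(doc.RootElement.GetProperty("images")[0].GetString());
    File.WriteAllBytes(imageName, imageBytes);

    // Return something the LLM can hand on to the next step
    return Path.GetFullPath(imageName);
}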
The Description attribute already exists in System.ComponentModel. To make "Example" and "Returns" available for decorating our methods, we have to implement them ourselves:
public class ExampleAttribute : Attribute
{
    public string Example { get; }

    public ExampleAttribute(string example)
    {
        this.Example = example;
    }
}

public class ReturnsAttribute : Attribute
{
    public string Returns { get; }

    public ReturnsAttribute(string returns)
    {
        this.Returns = returns;
    }
}
With the method decorated like that, we can now use reflection to turn it, its parameters and its attributes into JSON that can be sent to the LLM as text:
private void FillFunctionDefinitions(Type type)
{
    // Get all public instance methods
    MethodInfo[] methods = type.GetMethods(BindingFlags.Public | BindingFlags.Instance);

    // Iterate over each method
    foreach (MethodInfo method in methods)
    {
        FunctionDefinition definition = new FunctionDefinition();
        definition.FunctionName = method.Name;

        // Check if the Description attribute is defined for this method
        if (Attribute.IsDefined(method, typeof(DescriptionAttribute)))
        {
            var descriptionAttribute = (DescriptionAttribute)Attribute.GetCustomAttribute(method, typeof(DescriptionAttribute));
            var exampleAttribute = (ExampleAttribute)Attribute.GetCustomAttribute(method, typeof(ExampleAttribute));
            var returnsAttribute = (ReturnsAttribute)Attribute.GetCustomAttribute(method, typeof(ReturnsAttribute));
            definition.Example = exampleAttribute.Example;
            definition.Description = descriptionAttribute.Description;
            definition.Returns = returnsAttribute.Returns;

            // Iterate over each parameter
            ParameterInfo[] parameters = method.GetParameters();
            definition.Parameters = new List<string>();
            foreach (ParameterInfo parameter in parameters)
            {
                definition.Parameters.Add(parameter.Name);
            }

            functionDefinitions.Add(definition);
        }
    }
}
This is now saved in a List of this class:
internal class FunctionDefinition
{
    public string FunctionName { get; set; }
    public List<string> Parameters { get; set; }
    public string Description { get; set; }
    public string Example { get; set; }
    public string Returns { get; internal set; }
}
A FunctionDefinition, converted to JSON, looks like this:
{
    "FunctionName": "MakeImage",
    "Parameters": [
        "prompt",
        "negativeprompt",
        "imageName"
    ],
    "Description": "Create an image",
    "Example": "MakeImage(prompt: \u0027a picture of a whale flying, RAW glossy image, modern, in front of church\u0027, negativeprompt: \u0027ugly dark\u0027, imageName: img1.jpg )",
    "Returns": "The URL to the created image"
}
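Producing that JSON and splicing it into the system prompt is then a small step. A sketch, assuming System.Text.Json and treating basePrompt and workflowText as hypothetical placeholders for the prompt text and the workflow text quoted in this article:

using System.Text.Json;

// functionDefinitions is the list filled by FillFunctionDefinitions above;
// basePrompt and workflowText are hypothetical placeholders for the prompt
// pieces quoted elsewhere in this article
string functionJson = JsonSerializer.Serialize(functionDefinitions);

string systemPrompt = basePrompt
    + "Functions you can use:" + functionJson
    + "Your workflow is this: " + workflowText;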
Next to creating images, the LLM needs to be able to refine images, finish an image (by upscaling, which is usually the last step), ask questions (like "What image do you want?"), show images to the user, and say goodbye to indicate that the work is done.
Adding those (with dummy implementations for now) as well, we end up with this "Toolbox" of options:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ToolboxAi
{
    internal class MakeImageToolbox : IToolbox
    {
        string IToolbox.InitialPrompt { get => "Lets make an Image"; }

        [Description("Ask a question to the user to clarify intent or get feedback for the next step")]
        [Example("AskQuestion(questionText: hello, how can i help you?)")]
        [Returns("The answer of the user")]
        public string AskQuestion(string questionText)
        {
            Console.WriteLine(questionText);
            string answer = Console.ReadLine();
            return answer;
        }

        [Description("Create an image")]
        [Example("MakeImage(prompt: 'a picture of a whale flying, RAW glossy image, modern, in front of church', negativeprompt: 'ugly dark', imageName: img1.jpg )")]
        [Returns("The URL to the created image")]
        public string MakeImage(string prompt, string negativeprompt, string imageName)
        {
            string answer = "http://images.created/" + imageName;
            return answer;
        }

        [Description("Refine an image. Prompt can be changed, but can also stay the same. Denoising strength has a range between 0 and 1")]
        [Example("RefineImage(denoisingStrength: 0.7, prompt: 'a picture of a whale flying, but with angel wings' imageName: img1_refined.jpg, originalImageURL: http://image.upload/img1.jpg, seed: 120)")]
        [Returns("The URL to the refined image")]
        public string RefineImage(string denoisingStrength, string prompt, string imageName, string originalImageURL, string seed)
        {
            string answer = "http://images.created/" + imageName;
            return answer;
        }

        [Description("Upscales an image to a final version")]
        [Example("UpscaleImage( imageName: img_1_refined_upscaled_finished.jpg, originalImageURL: http://image.upload/img1_refined.jpg)")]
        [Returns("The URL to the upscaled image")]
        public string UpscaleImage(string imageName, string originalImageURL)
        {
            string answer = "http://images.created/" + imageName;
            return answer;
        }

        [Description("Function to indicate that you understand the concept")]
        [Example("Start(understood: true)")]
        [Returns("true if the prompt is understood, otherwise false")]
        public bool Start(string understood)
        {
            return bool.Parse(understood);
        }

        [Description("Present an Image to the user")]
        [Example("PresentImage(imageUrl: http://image.upload/img1.jpg")]
        [Returns("true, if the presentation was successfull")]
        public string PresentImage(string imageURL)
        {
            Console.WriteLine("Look at this image: " + imageURL);
            return "true";
        }

        // Dummy goodbye, like the others; the "Finished" definition also appears in the final system prompt below
        [Description("Function to call when the final image is done and the user is happy")]
        [Example("Finished(goodbyemessage: 'bye user, i was happy to help you')")]
        [Returns("true, if the goodbye was successfull")]
        public string Finished(string goodbyemessage)
        {
            Console.WriteLine(goodbyemessage);
            return "true";
        }
    }
}
The IToolbox interface is a simple way to quickly exchange toolboxes without tightly coupling classes to the rest of the code.
public interface IToolbox
{
    // Prompt to kickstart the model into the workflow
    public string InitialPrompt { get; }
}
The initial prompt is used to make the LLM start thinking about the workflow, usually by making it ask its first question. In this scenario, the first question is always "What kind of image do you want to create?", as that is the first step of the workflow.
Executing the functions the LLM wants executed
Now we have told the LLM that it can only talk in functions, given it a list of functions, and given it a workflow to follow that requires it to use those functions. If the LLM is clever enough, it will call a function based on the example given for that function, which could look like this:
AskQuestion(questionText: "What scene, elements and theme would you like incorporated in your image?")
To run the "AskQuestion" Method in the "MakeImageToolbox" class with the "questionText" parameter, we need to turn this text into something more readable by our code. This could be done with a lot of "If-Else-Cases" as well, or we use the newest trick up our sleeves: ask an LLM. What we can not work with is plain Text, what we can work with are instances of classes that contain atomic information like "FunctionName" and a List of Parameters and their values.
This instance could be defined by this:
internal class FunctionCall
{
    public string FunctionName { get; set; }
    public List<FunctionCallParameter> Parameters { get; set; }
}

public class FunctionCallParameter
{
    public string Name { get; set; }
    public string Value { get; set; }
}
But how do we turn plain text into an instance of a FunctionCall class with all information neatly put into parameters? JSON. Thanks to serializers like System.Text.Json or Newtonsoft.Json, we can turn this:
{
    "FunctionName": "GetHotel",
    "Parameters": [
        {
            "Name": "nights",
            "Value": "5"
        },
        {
            "Name": "rooms",
            "Value": "2"
        }
    ]
}
into a populated FunctionCall instance by doing this:
FunctionCall exampleFunction = JsonSerializer.Deserialize<FunctionCall>(json);
To get the JSON from the plain text, we can simply ask the LLM to convert it for us. As an extra safety layer, we can provide a JSON Schema, in case the example alone is not enough for the LLM to do the job right. The prompt to the LLM looks like this:
You are a JSON generator, you only reply in JSON format. The only JSON you can generate has this schema: { "$schema": "http://json-schema.org/draft-04/schema#", "type": "array", "items": { "$ref": "#/definitions/Anonymous" }, "definitions": { "Parameter": { "type": "object", "properties": { "Name": { "type": "string" }, "Value": { "type": "string" } } }, "Anonymous": { "type": "object", "properties": { "FunctionName": { "type": "string" }, "Parameters": { "type": "array", "items": { "$ref": "#/definitions/Parameter" } } } } } }
For example, the input might look like this: 'GetHotel(nights: 5, rooms: 2)', and you create this JSON from it: [{"FunctionName":"GetHotel","Parameters":[{"Name":"nights","Value":"5"},{"Name":"rooms","Value":"2"}]}]. There might be multiple input Functions, or just one. Put them in an array, like the example. Input: Start(understood: true)
On second thought while writing this article, this last step could probably be done cheaper and faster with a generic parsing algorithm; a rough sketch of such a parser follows below.
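Such a parser only needs to peel off the function name and then split the argument list on commas that are followed by another "name:" token, so that commas inside prompt values survive. A rough sketch (not the approach used in this article's code, and assuming values contain no nested parentheses):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

internal static class FunctionCallParser
{
    // Parses e.g. "GetHotel(nights: 5, rooms: 2)" into a FunctionCall instance
    public static FunctionCall Parse(string text)
    {
        var match = Regex.Match(text.Trim(), @"^(?<name>\w+)\s*\((?<args>.*)\)\s*$", RegexOptions.Singleline);
        if (!match.Success)
            throw new FormatException("Not a function call: " + text);

        var call = new FunctionCall
        {
            FunctionName = match.Groups["name"].Value,
            Parameters = new List<FunctionCallParameter>()
        };

        string args = match.Groups["args"].Value;
        if (string.IsNullOrWhiteSpace(args))
            return call;

        // Split only on commas that are followed by "someName:", so commas
        // inside prompt values are kept intact
        foreach (string part in Regex.Split(args, @",\s*(?=\w+\s*:)"))
        {
            int colon = part.IndexOf(':');
            call.Parameters.Add(new FunctionCallParameter
            {
                Name = part.Substring(0, colon).Trim(),
                Value = part.Substring(colon + 1).Trim().Trim('\'', '"')
            });
        }
        return call;
    }
}

The LLM-based conversion remains the more forgiving option, though, as it can also recover from slightly malformed calls.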
Having an instance of FunctionCall, or a list of FunctionCalls (to make fewer round trips, which saves time and money), we can now use reflection again to call that function:
private string CallFunction(FunctionCall call)
{
    Type type = toolbox.GetType();
    MethodInfo methodInfo = type.GetMethod(call.FunctionName);

    // All toolbox parameters are strings, so the values can be passed through as-is
    List<object> parameters = new();
    foreach (var parameter in call.Parameters)
    {
        parameters.Add(parameter.Value);
    }

    // Invoke the method using the parameters
    var result = methodInfo.Invoke(toolbox, parameters.ToArray());

    // Start() returns a bool, so ToString() is safer than a hard cast to string
    return result?.ToString();
}
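All toolbox methods here take string parameters, so passing the raw values through works. If a toolbox method ever declared an int or bool parameter, the string values coming from the LLM would need converting first. A sketch, assuming the LLM supplies the parameters in declaration order:

// Hypothetical extension of CallFunction for non-string parameters
ParameterInfo[] methodParameters = methodInfo.GetParameters();
List<object> converted = new();
for (int i = 0; i < methodParameters.Length; i++)
{
    // Convert the LLM-supplied string to the declared parameter type
    converted.Add(Convert.ChangeType(call.Parameters[i].Value, methodParameters[i].ParameterType));
}
var result = methodInfo.Invoke(toolbox, converted.ToArray());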
Safety first
As a quick check of an LLM's capability, I added this method, whose only purpose is to acknowledge the instructions:
{
    "FunctionName": "Start",
    "Parameters": [
        "understood"
    ],
    "Description": "Function to indicate that you understand the concept",
    "Example": "Start(understood: true)",
    "Returns": "true if the prompt is understood, otherwise false"
}
When asking the LLM to run the Start method, we can determine whether the LLM writes a blob of text or actually just answers with "Start(understood: true)". To do the latter, the LLM has to have grasped the concept of being a function-calling-only bot and to have read and understood the list of functions provided. This is no guarantee that the LLM will handle the rest of the workflow correctly, but it is a first indicator; a minimal sketch of this check is shown below. When an LLM fails at this step, the rest is hopeless anyway.
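In code, that first round trip could be checked roughly like this. This is only a sketch: llm is an instance of the IChatCompletion interface introduced in the next section, history is the list of messages sent so far, and ConvertToFunctionCall is a hypothetical helper wrapping the text-to-JSON conversion described above.

// Sanity check: does the model answer with a function call at all?
string reply = await llm.ChatCompletion(history);
FunctionCall first = ConvertToFunctionCall(reply);   // hypothetical helper

bool ready = first.FunctionName == "Start"
          && first.Parameters.Any(p => p.Name == "understood" && p.Value == "true");

if (!ready)
{
    // The model replied with prose (or the wrong call) -- no point in continuing
    throw new InvalidOperationException("LLM did not pass the Start() capability check.");
}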
Use whatever LLM
To decouple the use case from OpenAI-based APIs (OpenAI, Azure OpenAI), we can use the one lowest common denominator that all LLMs share: they all work on a list of messages, each of which has a role (Assistant, User, System). Some separate the role from the message in the API call (OpenAI), and some put both into the same text without an atomic distinction (some OSS LLMs). We can use this common denominator to implement a simple interface:
public interface IChatCompletion
{
    public Task<string> ChatCompletion(List<ChatCompletionMessage> messages);
}

public class ChatCompletionMessage
{
    public string Role { get; set; }
    public string Message { get; set; }
}
Now all our workflow-with-a-toolbox system needs is an instance of IChatCompletion to run a list of System, User and Assistant messages through whatever LLM the implementing coder chooses. For OpenAI, the implementation could look like this:
public async Task<string> ChatCompletion(List<ChatCompletionMessage> messages)
{
    var chat = api.Chat.CreateConversation();
    chat.Model = "gpt-4";
    foreach (var msg in messages)
    {
        // Map our generic roles to the library's roles; anything else is treated as a user message
        ChatMessageRole role = ChatMessageRole.User;
        if (msg.Role == "Assistant") role = ChatMessageRole.Assistant;
        if (msg.Role == "System") role = ChatMessageRole.System;
        chat.AppendMessage(role, msg.Message);
    }
    string response = await chat.GetResponseFromChatbotAsync();
    return response;
}
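For a model whose API takes a single flat prompt string instead of role-separated messages (as some OSS LLMs do), the same interface can be satisfied by flattening the history. A sketch, assuming the model copes with a simple "Role: text" transcript format and that completeAsync wraps whatever raw text-completion call that model offers:

public class SinglePromptChatCompletion : IChatCompletion
{
    // completeAsync wraps the chosen LLM's raw text-completion call
    private readonly Func<string, Task<string>> completeAsync;

    public SinglePromptChatCompletion(Func<string, Task<string>> completeAsync)
    {
        this.completeAsync = completeAsync;
    }

    public Task<string> ChatCompletion(List<ChatCompletionMessage> messages)
    {
        var sb = new StringBuilder();
        foreach (var msg in messages)
        {
            sb.AppendLine(msg.Role + ": " + msg.Message);
        }
        sb.Append("Assistant: ");
        return completeAsync(sb.ToString());
    }
}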
Putting everything together
Using the initial system prompt priming the LLM to be a computer instead of a chat partner, the examples of what to do, and the workflow, we end up with a hefty 1000+ token system prompt. This costs around three US cents ($0.03) to process, is sent again with every new workflow step, and grows over time, as the function calls and function results need to be recorded as well for the LLM to figure out where in the workflow it currently is.
This is the final system prompt:
You are a computer system. You can not answer the user directly. instead, your only option is to call functions. Refer to the list of functions below to see what you can do. For example, if you need input from the user, you do not ask directly. instead, you call the AskQuestion(questionText: Your question here) function. All parameters are required, always add the parameter name as well as the value.Don't make assumptions about what the user wants. Ask for clarification if a user request is ambiguous. Only use the functions you have been provided with.Your goal is to help the user to generate an image. The Prompts you create are for an image generation AI like DAll-E or stable diffusion. Stick to the following rules and examples:
Examples for prompts:
A ghostly apparition drifting through a haunted mansion's grand ballroom, illuminated by flickering candlelight. Eerie, ethereal, moody lighting.
portait of a homer simpson archer shooting arrow at forest monster, front game card, drark, marvel comics, dark, smooth
pirate, deep focus, fantasy, matte, sharp focus
red dead redemption 2, cinematic view, epic sky, detailed, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful
a fantasy style portrait painting of rachel lane / alison brie hybrid in the style of francois boucher oil painting, rpg portrait
athena, greek goddess, claudia black, bronze greek armor, owl crown, d & d, fantasy, portrait, headshot, sharp focus
Rules for prompts: Prompt should always be written in English, regardless of the input language.
Each prompt should consist of a description of the scene followed by modifiers divided by commas.
When generating descriptions, focus on portraying the visual elements rather than delving into abstract psychological and emotional aspects. Provide clear and concise details that vividly depict the scene and its composition, capturing the tangible elements that make up the setting.
The modifiers should alter the mood, style, lighting, and other aspects of the scene.
Multiple modifiers can be used to provide more specific details.
Functions you can use:[{"FunctionName":"AskQuestion","Parameters":["questionText"],"Description":"Ask a question to the user to clarify intent or get feedback for the next step","Example":"AskQuestion(questionText: hello, how can i help you?)","Returns":"The answer of the user"},{"FunctionName":"MakeImage","Parameters":["prompt","negativeprompt","imageName"],"Description":"Create an image","Example":"MakeImage(prompt: \u0027a picture of a whale flying, RAW glossy image, modern, in front of church\u0027, negativeprompt: \u0027ugly dark\u0027, imageName: img1.jpg )","Returns":"The URL to the created image"},{"FunctionName":"RefineImage","Parameters":["denoisingStrength","prompt","imageName","originalImageURL","seed"],"Description":"Refine an image. Prompt can be changed, but can also stay the same. Denoising strength has a range between 0 and 1","Example":"RefineImage(denoisingStrength: 0.7, prompt: \u0027a picture of a whale flying, but with angel wings\u0027 imageName: img1_refined.jpg, originalImageURL: http://image.upload/img1.jpg, seed: 120)","Returns":"The URL to the refined image"},{"FunctionName":"UpscaleImage","Parameters":["imageName","originalImageURL"],"Description":"Upscales an image to a final version","Example":"UpscaleImage( imageName: img_1_refined_upscaled_finished.jpg, originalImageURL: http://image.upload/img1_refined.jpg)","Returns":"The URL to the upscaled image"},{"FunctionName":"Start","Parameters":["understood"],"Description":"Function to indicate that you understand the concept","Example":"Start(understood: true)","Returns":"true if the prompt is understood, otherwise false"},{"FunctionName":"PresentImage","Parameters":["imageURL"],"Description":"Present an Image to the user","Example":"PresentImage(imageUrl: http://image.upload/img1.jpg","Returns":"true, if the presentation was successfull"},{"FunctionName":"Finished","Parameters":["goodbyemessage"],"Description":"Function to call when the final image is done and the user is happy","Example":"Finished(goodbyemessage: \u0027bye user, i was happy to help you\u0027)","Returns":"true, if the goodbye was successfull"}]
Your workflow is this: 1) Ask the user what they want in their picture2) create 4 images with slightly different prompts with different artists and artstyles 3) Present all 4 pictures to the user and ask them which one they like most, and if they want to have something changed in the prompt for the picture. 4) refine that image 4 times, use the original prompt for the image, or a changed prompt, if the user wanted some change to it. Use a lower denoise strength and a random seed. Present all 4 refined images to the user5) repeat steps 3 and 4 for 3 more times 6) upscale the final image, present the upscaled image and say goodbye.Stick to your workflow. Do not mix the steps up.
After every round trip to the LLM, the function call produced by the LLM as well as the result of the corresponding method in our toolbox are added as chat messages to the history. This is technically a chat between two computer systems, with some user input in between; a minimal sketch of that loop is shown below.
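In this sketch, llm is any IChatCompletion implementation, systemPrompt is the prompt shown above, and ConvertToFunctionCalls is again the hypothetical helper around the JSON conversion step:

var history = new List<ChatCompletionMessage>
{
    new ChatCompletionMessage { Role = "System", Message = systemPrompt },
    new ChatCompletionMessage { Role = "User", Message = toolbox.InitialPrompt }
};

bool finished = false;
while (!finished)
{
    // Ask the LLM which function(s) it wants to call next
    string reply = await llm.ChatCompletion(history);
    history.Add(new ChatCompletionMessage { Role = "Assistant", Message = reply });

    // Turn the plain-text reply into FunctionCall instances and execute them
    foreach (FunctionCall call in ConvertToFunctionCalls(reply))
    {
        string result = CallFunction(call);

        // Record the result so the LLM knows where in the workflow it is
        // (the article does not prescribe a role for results; User is one option)
        history.Add(new ChatCompletionMessage { Role = "User", Message = call.FunctionName + " returned: " + result });

        if (call.FunctionName == "Finished") finished = true;
    }
}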
Running the code:
<< AskQuestion(questionText: Can you please describe what you would like to have in your image? Ideas can range from a specific scene, characters, objects or feelings, and even the mood or style.)
>> a picture of a motorcycle
<< MakeImage(prompt: A realistic style image of a motorcycle against a clear sky, negativeprompt: Cartoon, Abstract, imageName: motorcycle_1.jpg)
>> http://images.created/motorcycle_1.jpg
<< MakeImage(prompt: An image of a motorcycle in a night setting, glowing under neon lights, negativeprompt: Daylight, Vintage, imageName: motorcycle_2.jpg)
>> http://images.created/motorcycle_2.jpg
<< MakeImage(prompt: A vintage style image of a motorcycle parked near a retro diner, negativeprompt: Modern, Abstract, imageName: motorcycle_3.jpg)
>> http://images.created/motorcycle_3.jpg
<< MakeImage(prompt: An image of a motorcycle speeding on a desert road, dust trailing behind, negativeprompt: City, Peaceful, imageName: motorcycle_4.jpg)
>> http://images.created/motorcycle_4.jpg
The << lines are what the LLM produces, and the >> lines are what the functions return to the LLM as the result of the call. These are dummy functions to showcase the usage.
You can see that the first call is to ask a question on what to generate, which is answered by the user with "a picture of a motorcycle". The LLM then calls the "MakeImage" Function 4 times (which the workflow tells it to do) with varying details in the prompt, giving the user some choices.
<< PresentImage(imageURL: http://images.created/motorcycle_1.jpg)
Look at this image: http://images.created/motorcycle_1.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_2.jpg)
Look at this image: http://images.created/motorcycle_2.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_3.jpg)
Look at this image: http://images.created/motorcycle_3.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_4.jpg)
Look at this image: http://images.created/motorcycle_4.jpg
>> true
<< AskQuestion(questionText: Among these 4 images, which one do you like the most? Would you like to make any changes to the selected image? You can mention specific details or the mood you would like to alter.)
>> i like the second one, please add more dust to it. also make the driver look more badass
The images are next presented to the user (in accordance with the workflow). Since this is a demo, the "PresentImage" Function simply says "Look at this". The LLM then asks which one the user likes, to further refine their preferences. The user says "i like the second one, please add more dust to it. Also make the driver look more badass"
This is the beauty of LLMs: you do not have to click a specific image, you do not have to input a number from a given list, and you are not bound to textboxes that force information into a structured form. Instead, the LLM understands our casual "i like the second one" and automatically figures out that we are talking about the "motorcycle_2.jpg" image.
<< RefineImage(denoisingStrength: 0.2, prompt: An image of a motorcycle in a night setting, glowing under neon lights. Add more dust trailing behind and make the driver look more badass, imageName: motorcycle_2_refined_1.jpg, originalImageURL: http://images.created/motorcycle_2.jpg, seed: 123)
>> http://images.created/motorcycle_2_refined_1.jpg
<< RefineImage(denoisingStrength: 0.15, prompt: An image of a motorcycle in a night setting, glowing under neon lights. Add more dust trailing behind and make the driver look more badass, imageName: motorcycle_2_refined_2.jpg, originalImageURL: http://images.created/motorcycle_2.jpg, seed: 234)
>> http://images.created/motorcycle_2_refined_2.jpg
<< RefineImage(denoisingStrength: 0.3, prompt: An image of a motorcycle in a night setting, glowing under neon lights. Add more dust trailing behind and make the driver look more badass, imageName: motorcycle_2_refined_3.jpg, originalImageURL: http://images.created/motorcycle_2.jpg, seed: 345)
>> http://images.created/motorcycle_2_refined_3.jpg
<< RefineImage(denoisingStrength: 0.25, prompt: An image of a motorcycle in a night setting, glowing under neon lights. Add more dust trailing behind and make the driver look more badass, imageName: motorcycle_2_refined_4.jpg, originalImageURL: http://images.created/motorcycle_2.jpg, seed: 456)
>> http://images.created/motorcycle_2_refined_4.jpg
<< PresentImage(imageURL: http://images.created/motorcycle_2_refined_1.jpg)
Look at this image: http://images.created/motorcycle_2_refined_1.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_2_refined_2.jpg)
Look at this image: http://images.created/motorcycle_2_refined_2.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_2_refined_3.jpg)
Look at this image: http://images.created/motorcycle_2_refined_3.jpg
>> true
<< PresentImage(imageURL: http://images.created/motorcycle_2_refined_4.jpg)
Look at this image: http://images.created/motorcycle_2_refined_4.jpg
>> true
<< AskQuestion(questionText: Which refined image do you prefer? Do you wish to make more changes or shall we proceed to the next refining step?)
>> The last one is perfect. I want that one as final image
Next, a refining round happens. The same image is used as input, the prompt is slightly altered to match the new wishes, and the seed is randomized. A different seed in Stable Diffusion means a different outcome, while the denoisingStrength controls how different the new image will be from the original.
The user is then asked again to make a choice and request potential changes, and decides on a final image. At this point, the user is deviating from the workflow, which states that three refining rounds should be done before a final image is made. But since the main objective of the LLM is to help with the creation of an image until the user is happy, the LLM adheres to the user's wish, skips the additional refining rounds and finalizes the image:
<< UpscaleImage(imageName: motorcycle_2_refined_4_final.jpg, originalImageURL: http://images.created/motorcycle_2_refined_4.jpg)
Conclusion
LLMs are useful beyond simple text-in, text-out scenarios. You can give an appropriately clever LLM control over a workflow and a toolbox of functions to complete that workflow. It can do this while talking to users in natural language on one side, and being precise enough to invoke functions on the other.
I will share the complete Source Code on Github soon.