Video Summary and Transcription
This talk covers various aspects of artificial intelligence and user experience in software development. It explores the evolution and capabilities of large language models, the importance of prompt engineering, and the need to design AI applications with human users in mind. The talk also emphasizes the need to design defensively for AI failure, consider user happiness, and address the responsibility and risks of AI implementation. It concludes with recommendations for further reading and highlights the importance of trustworthiness in AI code tools.
1. Introduction to AI and User Experience
Hello, everyone. My name is Chris, and I'm here today to talk to you about artificial intelligence and user experience. In my past life, I used to be a software developer, and I've spent a lot of time developing open-source tools for developers. I work at GitHub Next, a group responsible for figuring out the future of software development using AI.
Hello, everyone. My name is Chris, and I'm here today to talk to you about artificial intelligence and user experience.
In my past life, I was just a software developer, like probably most of you in this room. I used to be an independent consultant, mostly in the .NET space, which sounds boring, but it paid quite well. So that was during the day. At night, I spent quite a lot of time developing open-source tools for developers.
And because of that, I was asked to join the Copilot team back in the very early days of developing Copilot. So I work at an organisation called GitHub Next. It's a special group inside GitHub. There are 20 of us, researchers and developers, and we are responsible for figuring out what the next crazy idea is, what the next GitHub Copilot is, what the future of software development looks like. We develop a lot of stuff with AI, because obviously AI is very hyped nowadays, but we also do different things, like fonts. We recently released a very nice family of fonts called Monaspace, if you are into that.
The most important thing about this talk is that it's a talk about what happens in the middle. I will not talk to you about model training or about data science. I know absolutely nothing about that part of the stack. I'm not a data scientist. I don't have a PhD in some fancy thing with math.
2. Introduction to Large Language Models
This talk is not about bringing a project to product or the challenges of scaling and earning money. It's about creating prototypes of cool ideas and introducing them as technical previews. We will discuss artificial intelligence, specifically large language models, and how they have evolved over the years. Large language models are trained on massive amounts of text and are designed to predict the next word in a given prompt. While they may seem capable, it's important to remember that they can provide unreliable answers, similar to a student bullshitting their way through a question.
Also, it's not really a talk about bringing a project to product, because as you probably all know, getting products to millions of people is very difficult. It's about scaling, it's about latency, it's about capacity, it's about how to observe results, it's about figuring out how to earn money on the project. That's not what we do.
What we do is create prototypes of cool ideas, and we throw those prototypes at people as technical previews. You may have seen multiple technical previews coming from GitHub. So, let me do a brief introduction to artificial intelligence, a brief introduction to large language models, just so we know what we are talking about.
Artificial intelligence is a term that has been around for years. It's not a new term. I know that the hype wave is here nowadays, but the term itself comes from the 50s, from the 60s, something like that. The space has had multiple cycles of hype and then so-called AI winters, which are the periods where the hype dies out, the funding dries up, and everyone is like, oh no, this AI thing doesn't make any sense.
The current wave of hype around AI has been mostly about large language models. This is the type of artificial intelligence system that was trained on millions and millions of lines of text found on the internet, in books, in all the sources you can imagine. And those models have been trained in an unsupervised way: the researchers just throw this text at the models, and the models learn something from it. Those models are designed to do one thing and one thing only. Given a prompt, an input, a beginning of a text, they try to figure out what the next most probable word in that text is. This is the only thing that those models do.
Every high-level ability that you see from those models, like chat, function calling, all those fancy things that OpenAI is putting into those models, is built on top of that basic functionality. The most important thing that you need to remember about AI is that these models have this cool capability that researchers call hallucinations. But really, I prefer to call it differently, which is bullshitting. Those large language models will always answer your question, even if they don't know the answer, and you have absolutely no way of knowing whether the answer is correct. They are the best student ever. We've all been in school and had that moment: there's a question from the teacher, you need to stand in front of the class and start answering, and you have absolutely no idea. So you start saying something, just to say something, and pretend that you know stuff. This is the large language model, just at scale. So you need to remember that you cannot really trust those models. They are probabilistic machines, and they try to figure out what is the most probable thing to say to you.
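To make the "next most probable word" idea concrete, here is a minimal sketch that asks a model for a single next token and prints the runner-up candidates with their probabilities. It assumes the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the model name is only illustrative.

```python
# A minimal sketch: ask for one next token and inspect the candidate
# probabilities. Assumes the OpenAI Python SDK and OPENAI_API_KEY are set;
# the model name is illustrative, not a recommendation.
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=1,        # we only want the single next token
    logprobs=True,
    top_logprobs=5,      # also return the runner-up candidates
)

# The model does not "know" the answer; it ranks candidate next tokens by
# probability, and the API lets us peek at that ranking.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.1%}")
```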
3. Understanding Chat GPT and Prompt Engineering
When talking with ChatGPT, the model's responses are based on probability, not emotions. Large language models are very general and can be steered by context through prompt engineering. The prompt has two parts: a constant part that describes the persona and task, and a dynamic part that incorporates user context through retrieval. Retrieval methods vary depending on the application, such as querying a database or using embeddings for unstructured search.
So whenever you talk with ChatGPT and the discussion results in ChatGPT saying, oh, I love you, or I hate you, the model doesn't really have those feelings. It's just that, in this conversation, that turned out to be the most probable word to come next.
One very cool thing about those large language models, and what is really unique about them, is that they are very general. Because they've been trained on everything, on the whole internet, they know a lot of stuff and they have a lot of capabilities, like reasoning capabilities. The really cool thing is that we can steer their behavior by putting context into our inputs, into prompts, and this process is called prompt engineering.
Let's focus first on the prompt itself. There are basically two different parts of the prompt when we think about it. One is what I like to call the constant part. It's all about describing the persona that the AI should take and the task that should be solved by the AI system.
The second part of the prompt is more dynamic. It's all about putting context from your particular user, from your particular session, from your particular, I don't know, database, into the prompt. And this process is called retrieval. Retrieval is very dependent on your particular application. In some cases, it may just be about querying your SQL database. In other cases, it may be about looking at your IDE and figuring out what files you have open, since maybe there's some interesting code in those files. In other cases, it will be using embeddings to do very unstructured search on the data. But there's no single solution; I guess that's what I want to tell you.
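As a small sketch of this two-part structure, here is what prompt assembly might look like: a constant persona-and-task block plus dynamically retrieved context. The helpers `fetch_open_files` and `search_similar_snippets` are hypothetical stand-ins for whatever retrieval your application needs, whether that's an IDE query, a SQL query, or an embeddings search.

```python
# A sketch of the two-part prompt described above. The retrieval helpers are
# hypothetical stand-ins; in a real system they might query an IDE, a SQL
# database, or an embeddings index.
SYSTEM_PROMPT = (
    "You are a careful coding assistant. "
    "Answer questions about the user's project using only the provided context."
)

def fetch_open_files(session_id: str) -> list[str]:
    # Placeholder: would ask the editor which files are currently open.
    return ["# utils.py (open in the IDE)\ndef add(a, b):\n    return a + b"]

def search_similar_snippets(question: str) -> list[str]:
    # Placeholder: would run an embeddings (vector) search over the codebase.
    return ["# billing.py\ndef total_with_tax(amount): ..."]

def build_prompt(question: str, session_id: str) -> list[dict]:
    context = "\n\n".join(fetch_open_files(session_id) + search_similar_snippets(question))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # constant part: persona and task
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},  # dynamic part
    ]

print(build_prompt("How do I compute an order total?", session_id="demo"))
```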
4. AI and User Experience Design
In some cases, retrieval involves looking at the IDE and finding interesting code in open files or using embeddings for unstructured search. It's important to design applications with human users in mind, as AI is meant to enhance human capabilities, not replace them. AI should never make decisions as it lacks the ability to understand the decision-making process. Designing with human users in mind involves understanding the parts of the workflow where AI can assist, such as helping with boilerplate code in software development.
Okay. So, that was kind of like a brief introduction, the first half of the talk. I was perfectly on time. Impressive. Now I will talk a bit about what is closer to my heart, which is user experience design for these applications. And there will be some hot takes in this part, so bear with me.
First of all, the hottest of the takes: you need to design your application with the human in mind. AI is great for enhancing human capabilities. AI shouldn't replace the human. It shouldn't replace your users, and in our case it shouldn't replace developers. It's just not good enough for that. It should never make any decisions. AI sucks at making decisions, because, as I've mentioned, it's just about predicting the next most probable word. That's not how the decision process works. A computer can never be responsible for a decision you make. That also means it cannot make any management decisions, because if nothing can take responsibility for a decision, what happens then? It breaks the whole chain of command in your company. So always design with the human in mind.
Following on that, you need to understand really well which parts of the workflow your users want help with. For example, in the software development process, we've done quite a lot of user research and user studies on that topic, and we've learned that developers are very happy if AI helps them with boilerplate code, with the boring parts of coding. But they don't want AI to make any decisions for them. They don't want AI to take away the very important things, like complex algorithmic problems, because developers believe that this is where their value lies. And I absolutely agree. So whenever you design an application using AI, please think, ask your users, do research about which part of the process the AI should help with. As I mentioned, those are probabilistic machines.
5. Designing for AI Failure and User Happiness
Design defensively for failure in AI systems. Inform users that some content is generated by AI and may be wrong. Allow users to regenerate answers or provide more details for AI to generate a better response. Always present answers for human acceptance and provide the capability to edit them. The accuracy of AI models may not always translate to more value. Consider the impact of model power and latency on user experience. User happiness can be achieved even with imperfect suggestions that push them forward. Latency importance depends on user experience design. Consider using streaming or generating responses all at once based on the use case.
They like bullshitting. So you need to design defensively for failure. AI systems will fail. Sometimes they will give you wrong answers, and you cannot do anything to solve that; it's simply not possible to solve that problem. So you need to design your user experience with this in mind. You need to inform the user that some content is generated by AI, because users need to be aware that it may be wrong. You need to design your user experience in a way that, for example, lets the user very easily regenerate the answer, ask again, or provide more details, so that the AI generates the answer again based on the additional context. The answer always needs to be accepted by a human, and you always need to give the human the ability to edit it. It's not like, oh, there is something from AI, let's execute it automatically on your SQL database. That's a really terrible idea, trust me. I've tried that.
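As a rough illustration of those rules, here is a minimal sketch of a propose, accept, edit, or regenerate loop; `generate_sql` is a hypothetical placeholder for the actual model call, and nothing runs without the human explicitly accepting it.

```python
# A minimal sketch of designing defensively for failure: the suggestion is
# labelled as AI-generated, the human can accept, edit, or regenerate it, and
# nothing is executed automatically. `generate_sql` is a hypothetical stand-in
# for the real model call.
def generate_sql(request: str, extra_details: str = "") -> str:
    return "SELECT * FROM orders WHERE created_at > now() - interval '7 days';"  # placeholder

def propose_query(request: str) -> str:
    details = ""
    while True:
        draft = generate_sql(request, details)
        print("AI-generated suggestion (may be wrong):\n" + draft)
        choice = input("[a]ccept / [e]dit / [r]egenerate with more details? ").strip().lower()
        if choice == "a":
            return draft                                          # human explicitly accepted
        if choice == "e":
            return input("Edit the query, then press Enter:\n")   # human owns the final text
        details = input("Add more details for the AI:\n")         # extra context for the next attempt

# The returned query is only a proposal; running it against the database is a
# separate, human-initiated step.
```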
The one really interesting thing to remember is that more accuracy doesn't necessarily mean more value. If we think about how to get the most accuracy from AI models, it's always: let's use as big a model as possible, the most powerful, latest, fanciest model from OpenAI or other vendors, and let's put as much context into the prompt as possible. But this has a huge impact on the user experience, because the more powerful the model is and the more stuff you put into the context, the higher the latency will be. Sometimes this is not really what you need.

Back in the original Copilot days, we decided to run Copilot with a smaller model, not the most powerful model available at the time. With the user experience we designed, inline suggestions in your IDE, it turned out that the more powerful model was indeed more accurate and the suggestions were better, but because of the higher latency you saw far fewer of them, so the value for you was smaller. Also, even if a suggestion is not perfect, that doesn't necessarily mean the user is not happy with it. Under this QR code there's a link to the research that my teammates did about user happiness when using Copilot. What we discovered is that users are very happy with imperfect suggestions as long as those suggestions push them forward. If you're stuck and you don't know what to do, and you get a suggestion that's maybe not perfect but at least unblocks you, that is a very valuable thing.

So latency may be important or it may not be; it really depends on how you design your user experience. All those models support streaming, so you can either ask the model to generate the whole response at once and wait, as with a normal HTTP request and response, or stream it over HTTP as it is generated. In some cases, you may need to do it all at once. In the case of Copilot, we decided to always do it all at once, because it would be super weird if the ghost text, the inline suggestion that you see in your IDE, were streaming in and adding new lines.
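For reference, here is roughly what the two response modes look like with a typical API, using the OpenAI Python SDK as an assumed example; the model name is only illustrative.

```python
# A sketch of the two response modes, assuming the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Write a haiku about code review."}]

# 1) All at once: one request, one complete response. This is what inline
#    ghost text needs, so the suggestion doesn't grow line by line in the editor.
full = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(full.choices[0].message.content)

# 2) Streaming: tokens arrive as they are generated. A natural fit for chat,
#    where perceived latency matters more than total completion time.
stream = client.chat.completions.create(model="gpt-4o-mini", messages=messages, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```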
6. Designing with the Human in Mind
Designing with the human in mind and defensively for failure. Provide optional suggestions that can be easily ignored. Keep the user in the flow. Use a more structured approach for specific tasks to increase accuracy. Consider different use cases and design the user experience accordingly. Copilot Workspace is introduced as an example of a structured exchange process.
That would be a very weird user experience. Chat applications, on the other hand, fairly often use streaming mode to pretend that they are faster than they really are. There are a couple of typical ways of designing with the human in mind and designing defensively for failure; these are the main patterns we've discovered over the years of building these applications.
The first one is providing the user with something very optional, like the inline suggestions in your IDE. You can very easily ignore them if something is wrong. If the Copilot suggestion is wrong, all you need to do is keep typing. Nothing has changed; it hasn't negatively impacted your user experience too much. Maybe you stopped to read it, but if it's obviously wrong, then you keep coding whatever you've been coding. It's all about keeping the human in the flow.
The second pattern is a bit more structured. It's about, say, selecting some code, or some range, or some picture, and running some kind of transformation on top of it. This is a very defined process, a very structured process, but because it is so structured, we can increase the accuracy of the responses. We can increase the likelihood that the solution will be better. And obviously there is chat, which is great for iterating: I don't know what to do, let's think through the process. So really, if you put it on a scale, something like ghost text is very much about doing things and staying in the flow, while something like chat is about planning. So depending on your use case and what you want to do with your application, the design of the user experience for your AI system will be different.

Let's talk about structured exchange for a moment. We recently announced a project called Copilot Workspace. It is a project that you can point at an issue, and it will try to solve that issue and generate a whole pull request with changes across multiple files in your repository. But we are not doing that directly. It's not like, oh, just read the issue and then generate a bunch of code, because that would be terrible: it doesn't give the user control, and the accuracy would be terrible. So we created a multi-step, structured process. Yes, the AI first reads the issue. Then it generates a specification of what the current state of the application is and what the new state of the application is supposed to be, based on the issue. You can change it.
7. Designing User Experiences with AI
You can edit the specifications and control the code generation process. Think outside the box when designing user experiences with AI. Avoid relying solely on chatbots. Experiment and understand your users. The journey to production of AI systems is long and requires considerations such as hosting the model and measuring results.
You can edit it. It's going to be a bunch of bullet points, right? For both of those specifications. Then, if you accept the specification, we ask the model to generate a specific implementation plan: step by step, which changes will be made in which file. Again, this is something that you can see before we start generating the code, so you can edit it easily. You can change it if the AI is wrong. And only then, if you accept the plan, do we generate the code for you. That gives you a lot of control over what is happening. It makes sure that the human is at the centre of the process.
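To illustrate the shape of that flow (not Copilot Workspace's actual implementation), here is a hedged sketch in which each intermediate artifact is shown to the human and can be edited before the next step runs; `ask_model` is a hypothetical placeholder for the LLM call.

```python
# A sketch of an issue -> specification -> plan -> code flow with a human
# checkpoint after every step. `ask_model` is a hypothetical placeholder for
# the model call; this is not Copilot Workspace's actual implementation.
def ask_model(instruction: str, material: str) -> str:
    return f"[model output for: {instruction}]"  # placeholder

def confirm_or_edit(label: str, text: str) -> str:
    print(f"\n--- {label} (AI-generated, editable) ---\n{text}")
    edited = input("Press Enter to accept, or type a revised version: ").strip()
    return edited or text

def issue_to_pull_request(issue: str) -> str:
    spec = ask_model("Describe the current and the desired behaviour as bullet points.", issue)
    spec = confirm_or_edit("Specification", spec)     # checkpoint 1: human edits the spec

    plan = ask_model("List the changes to make, file by file.", spec)
    plan = confirm_or_edit("Plan", plan)              # checkpoint 2: human edits the plan

    code = ask_model("Generate the code changes for this plan.", plan)
    return confirm_or_edit("Proposed changes", code)  # checkpoint 3: human reviews the code
```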
And there are other possible user experiences. You can imagine something like this, just a random example that we thought of a while ago but haven't really managed to do well: you have two different panels in your editor, you write code, and there's a natural-language description that automatically updates as you write the code; but also, if you edit the natural-language description, the code automatically updates. There's a lot of potential in those different user experiences, and I think the main thing is that you need to think about what your application is doing and do something interesting. Using AI is a new thing for people; it's a fairly new phenomenon. So you cannot just go back to the same user experiences that have been working for us for the last 30 years and pretend they will also work here. You need to think a bit outside of the box.
And that brings me to the last point: everyone nowadays builds chatbots when they think about AI, but maybe that's not the best idea. Whenever you provide the user with a text box, which is very unstructured, the user will type whatever they want there. No matter how much you try to scope the chat to your particular use case, users will ask weird questions about how to build a gun or whatever other illegal thing. So really, there's no single path to success. You need to experiment, and you need to understand your users really well. Frankly, it's not much different from how we already design user experiences for regular systems; you just need to remember to think about this stuff. And after everything I've talked about for these 20 minutes, remember that you're just at the start of your journey. The journey to production for these AI systems is long. You need to figure out how to host the model if you're not using OpenAI or another vendor directly. You need to figure out how to measure results. That's another very interesting topic that I would talk about if I had more time today.
8. Responsibility and Risks of AI Implementation
Consider the implications and potential risks before implementing AI solutions. Stories of AI being used incorrectly highlight the need for caution. Think about the impact on user trust and safety, such as replacing human specialists with language models. Don't blindly rely on AI for critical tasks like tax returns. Your responsibility is to carefully consider the consequences and protect user privacy and security.
So yeah. And really the last thing, and I'm running over time a bit, but I need to tell you this one last thing: if your boss, your CEO, your CTO, your manager comes to you as a developer and says, five of our competitors now have AI solutions, we also need an AI solution, it's your responsibility to ask yourself and to ask them: is it really a good idea? Should we really do it? You hear in the media all those stories about people using AI in very weird ways.

There was the story of the lawyer who generated a list of legal precedents in the US using ChatGPT. The judge looked at this list when it was presented to him, and it turned out that the list was completely false: none of the precedents were real, there were no actual cases there. The judge was not impressed, trust me. There was another story about a service that was built to put people in touch with mental health specialists through a web interface. It's a really great application, right? It helps people in smaller villages, smaller cities, or less developed countries get in touch with mental health specialists when that might not otherwise be possible for them, because there's no one like that living in their area. And this service, suddenly, after GPT-4 was released, decided to randomly replace some of their mental health specialists with language models. How would you feel if you realised that you're not talking with a medical doctor but with a large language model? Probably not great. And, you know, I'm fairly sure that people this year were using ChatGPT for their tax returns or something like that, and I'm fairly sure the IRS was also not impressed. So yeah, it's your responsibility to think about that: would I want AI to have access to my money, to my bank account? Never. Please, please think.
Books and Q&A on Copilot Experimentation
I have two books to recommend: one about prompt engineering for language models and another about observability for large language models. The Q&A session will address questions about Copilot's offline experimentation and metrics for user experience improvement.
And, yeah, thank you for watching. That was fun. (Applause) Thank you so much. I have one more thing to show you: to pitch two books to you. One is by one of my ex-colleagues from the team, about prompt engineering for large language models. It's a very interesting book, so I absolutely recommend it.
And another one is a report for O'Reilly by my very good friend, Phillip Carter, about observability for large language models. It's part of this whole question of how to bring a model to production, which is very, very important, because understanding the results of an AI system is very difficult. So this is a really good book.
Thank you. Thank you so much, Christophe, and you're staying with us; you're welcome to have a seat. We have the Q&A session now. We have a few questions, and we'll try to answer as many as possible. It depends how fast and how well you answer. Oh, no. That's difficult. That's the game. So we'll start with this one; you can also see the questions at the top. The first one is: how does Copilot experiment offline for different features? What offline metrics correlate with user experience improvement?

Oh, this is... I could give a whole talk about that topic. That's the challenge! So let's try to do that in one minute. Basically, we have an offline evaluation system where we clone thousands and thousands of Python repositories, I believe, and it's not about user experience, but about measuring the accuracy of prompt improvements especially. So we clone those repositories, we figure out if the code is well tested in terms of unit tests, we run the tests to capture the state of the application, and then we find the functions that were covered by the test suite, remove the body of each function, regenerate it with Copilot, and run the tests again to see if the behaviour is the same. This is helpful, though it's not super...
Trustworthiness of LLMs and AI Code Tools
There is no direct correlation between the offline evaluation results and the real user experience. A-B experiments are conducted on real users. Making LLMs more trustworthy can involve asking AI to generate code and validating it. However, for more generic answers, further research is needed.
It's not 100% accurate. There is no direct correlation between those results and the real results for users, or the user experience especially. So really, the only thing we can do is A-B experiments, and we do a lot of A-B experiments on real users.
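As a rough sketch of the kind of offline evaluation harness described above, with the details heavily simplified and the helper objects hypothetical: start from a repository whose test suite is green, blank out a tested function's body, let the model regenerate it, and re-run the tests; the pass rate becomes a crude accuracy proxy.

```python
# A rough, simplified sketch of such an offline evaluation loop. The
# `tested_functions` objects and `regenerate_body` callback are hypothetical;
# a real harness also has to handle parsing, sandboxing, and flaky tests.
import subprocess

def tests_pass(repo_path: str) -> bool:
    return subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_path).returncode == 0

def evaluate_repo(repo_path: str, tested_functions, regenerate_body) -> float:
    assert tests_pass(repo_path), "baseline test suite must be green"
    passed = 0
    for function in tested_functions:
        original_body = function.body                      # hypothetical accessor
        function.replace_body(regenerate_body(function))   # model fills the body back in
        passed += tests_pass(repo_path)
        function.replace_body(original_body)               # restore before the next case
    return passed / len(tested_functions)                  # pass rate as an accuracy proxy
```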
The next one is: are there ways to make LLMs more trustworthy? You talked about hallucination. I have no idea! No, so obviously you can... When we talk about whole AI systems, then yes. For example, one very good method: instead of asking the AI to do math calculations, because AI sucks at math, you ask the AI to generate code that performs those calculations, and then you run the code. That usually gives you way better results. You could probably also ask the AI to generate proofs for your code in the context of code generation, but code is in general fairly easy to validate, right? For more generic answers, I have no idea. I know that it is a research area and a lot of people are very interested in it, but I don't have a good answer for that right now.
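A small sketch of that math trick, assuming the OpenAI Python SDK with an `OPENAI_API_KEY` set; the model name is illustrative, and in real use the generated code should only ever run in a proper sandbox.

```python
# A sketch of "ask for code, not for the answer", assuming the OpenAI Python
# SDK; the model name is illustrative. Only run generated code in a sandbox.
from openai import OpenAI

client = OpenAI()

question = "What is 123456789 * 987654321, minus 42?"
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Write a single Python arithmetic expression that computes: {question} "
                   "Reply with the expression only, no prose.",
    }],
)
expression = reply.choices[0].message.content.strip()

# Evaluating the expression, rather than trusting the model's arithmetic,
# is what makes the final number reliable.
print(eval(expression, {"__builtins__": {}}))  # restricted builtins; sandbox properly in real use
```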