
Tesla AI Day: How Does Tesla's Autopilot Work

by Louis Bouchard, August 21st, 2021

Too Long; Didn't Read

A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through the car's eight cameras to the navigation process on the roads. Learn more in the short video below.

If you wonder how a Tesla car can not only see but navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through the car's eight cameras to the navigation process on the roads. This week, I cover Andrej Karpathy's talk in the short video below.

Learn more in this short video

Video Transcript

If you wonder how a Tesla car can not only see but navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through their eight cameras to the navigation process on the roads.

Tesla's cars have eight cameras, as in this illustration, allowing the vehicle to see its surroundings and far ahead. Unfortunately, you cannot simply take all the information from these eight cameras and send it directly to an AI that will tell you what to do, as this would be way too much information to process at once, and our computers aren't that powerful yet. Just imagine trying to do this yourself, having to process everything all around you. Honestly, I find it difficult to turn left when there are no stop signs and you need to check both sides multiple times before making a decision. Well, it's the same for neural networks, or more precisely for the computing devices that run them, like CPUs and GPUs.

To attack this issue, we have to compress the data while keeping the most relevant information, similar to what our brain does with the information coming from our eyes. To do this, Tesla transfers the data from the eight cameras into a much smaller space they call the vector space. This is a three-dimensional space that contains all the relevant information in the world, like road signs, cars, people, lines, etc. This new space is then used for the many different tasks the car has to perform, like object detection, traffic lights, lane prediction, etc.

But how do they go from eight cameras, meaning eight three-dimensional inputs composed of red, green, and blue images, to a single output in three dimensions? This is achieved in four steps, done in parallel for all eight cameras, making it super efficient.

At first, the images are sent into a rectification model, which calibrates them by translating them into a virtual representation. This step dramatically improves the Autopilot's performance because it makes the images look more similar to each other when nothing is happening, allowing the network to compare them more easily and focus on the essential components that aren't part of the typical background.

Then, these new versions of the images are sent into a first network called RegNet. RegNet is just an optimized version of the convolutional neural network (CNN) architecture. If you are not familiar with this kind of architecture, you should pause the video and quickly watch the simple explanation I made, appearing in the top-right corner right now. Basically, it takes these newly made images and compresses the information iteratively, like a pyramid: the start of the network is composed of a few neurons representing variations of the images, focusing on specific objects and telling us where they are spatially. The deeper we get, the smaller these image representations become, but they represent the overall image while still focusing on specific objects. So at the end of this pyramid, you end up with many neurons, each telling you general information about the overall picture, such as whether it contains a car, a road sign, etc.
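To make this pyramid idea concrete, here is a minimal PyTorch sketch of a multi-scale convolutional backbone. It is a toy stand-in, not Tesla's RegNet: the class name, widths, and image size are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyPyramidCNN(nn.Module):
    """Toy stand-in for a RegNet-style backbone: each stage halves
    the resolution and widens the channels, and we keep every
    stage's output as one level of the feature pyramid."""
    def __init__(self, in_channels=3, widths=(32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            prev = w

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)  # early = precise location, late = context
        return features

# One rectified camera image: (batch, RGB, height, width)
img = torch.randn(1, 3, 256, 256)
pyramid = TinyPyramidCNN()(img)
for f in pyramid:
    print(f.shape)  # 128x128 -> 64x64 -> 32x32 -> 16x16, channels growing
```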
In order to have the best of both worlds, we extract the information at multiple levels of this pyramid, which can also be seen as image representations at different scales, each focusing on specific features of the original image. We end up with both local and general information, all of it together telling us what the images are composed of and where things are.

Then, this information is sent into a model called BiFPN, which forces the information from different scales to talk together and extracts the most valuable knowledge from the general and specific information it contains. The output of this network is the most interesting and useful information from all the different scales of the eight cameras' data, so it contains both the general information about the images, meaning what they contain, and the specific information, such as where things are, their size, etc. For example, it will use the context coming from the general knowledge of the deep features extracted at the top of the pyramid to understand that, since two blurry lights are on the road between two lanes, they are probably attached to a specific object that one camera identified in the early layers of the network. Using both this context and knowing the lights are part of a single object, one can successfully guess that these blurry lights are attached to a car.

So now we have the most useful information from different scales for all eight cameras. We need to compress this information so we don't have eight different data inputs, and this is done using a transformer block. If you are not familiar with transformers, I invite you to watch my video covering them in vision applications. In short, this block takes the condensed information from the eight pictures and transfers it into the three-dimensional space we want: the vector space. It takes this general and spatial information, here called the keys, calculates the queries, which have the dimension of our vector space, and tries to find what goes where. For example, one of these queries could be seen as a pixel of the resulting vector space looking for a specific part of the car in front of us. The values then merge both of these accordingly, telling us what is where in this new vector space. This transformer can be seen as the bridge between the eight cameras and the new 3D space, understanding all the interrelations between the cameras.

Now that we have finally condensed our data into a 3D representation, the real work can start. This is the space where they annotate the data used to train their navigation network, as it is much less complex than eight cameras' worth of images and easier to annotate.
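Here is a minimal sketch of the "scales talking together" step, in the spirit of a BiFPN but heavily simplified: the plain additive top-down and bottom-up passes, the common width, and every name are assumptions for illustration, not Tesla's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPNFusion(nn.Module):
    """Toy BiFPN-like fusion: project every pyramid level to a
    common width, then let the scales exchange information with a
    top-down pass (context flows into the precise, shallow maps)
    and a bottom-up pass (detail flows back into the deep maps)."""
    def __init__(self, in_widths=(32, 64, 128, 256), width=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(w, width, 1) for w in in_widths)
        self.smooth = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in in_widths)

    def forward(self, pyramid):
        feats = [p(f) for p, f in zip(self.proj, pyramid)]
        # Top-down: upsample deep (general) features into shallow levels.
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                               mode="nearest")
            feats[i] = feats[i] + up
        # Bottom-up: pool refined shallow (local) features into deep levels.
        for i in range(1, len(feats)):
            feats[i] = feats[i] + F.max_pool2d(feats[i - 1], kernel_size=2)
        return [s(f) for s, f in zip(self.smooth, feats)]

fused = TinyFPNFusion()(pyramid)  # same resolutions; every scale now
                                  # mixes local detail with global context
```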
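And here is a hedged sketch of the transformer bridge itself: a learned query for each cell of the output grid attends over key/value tokens built from all eight cameras' fused features, deciding what goes where in the vector space. Again, every name, dimension, and the single attention layer are hypothetical simplifications.

```python
import torch
import torch.nn as nn

class CameraToBEV(nn.Module):
    """Toy transformer bridge: each cell of a top-down (bird's-eye
    view) grid is a learned query that attends over the tokens of
    all eight cameras and pulls in whatever is relevant there."""
    def __init__(self, feat_dim=64, bev_size=(20, 20)):
        super().__init__()
        self.bev_size = bev_size
        self.queries = nn.Parameter(
            torch.randn(bev_size[0] * bev_size[1], feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (batch, n_cameras, channels, h, w)
        b, n, c, h, w = cam_feats.shape
        # Flatten all cameras into one sequence of key/value tokens.
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, kv, kv)  # each query asks "what is at my spot?"
        return bev.reshape(b, *self.bev_size, c)

# Eight cameras' fused 64-channel feature maps at 16x16
cam_feats = torch.randn(1, 8, 64, 16, 16)
bev_grid = CameraToBEV()(cam_feats)
print(bev_grid.shape)  # torch.Size([1, 20, 20, 64])
```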
OK, so we now have an efficient way of representing all our eight cameras, but we still have a problem: single-frame inputs are not sufficient. If a car on the opposite side is occluded by another car, we need the Autopilot to know it is still there and that it hasn't disappeared just because another car went in front of it for a second. To fix this, we have to use time information, or in other words, use multiple frames. They chose to use a feature queue and a video module. The feature queue takes a few frames and saves them in a cache. Then, for every meter the car travels, or every 27 milliseconds, it sends the cached frames to the model. They use both a time and a distance measure to cover the cases where the car is moving and where it is stopped. These 3D representations of the frames we just processed are then merged with their corresponding positions and with kinematic data containing the car's acceleration and velocity, informing us how it is moving at each frame.

All this precious information is then sent into the video module, which uses it to understand the car itself and its environment in the present and past few frames. This understanding process is done with a recurrent neural network that processes all the information iteratively over the frames to better understand the context and finally build the well-defined map you can see. If you are not familiar with recurrent neural networks, I will again orient you to a video I made explaining them. Since it uses past frames, the network now has much more information to understand what is happening, which is necessary for temporary occlusions.

This is the final architecture of the vision process, with its output on the right. Below, you can see some of these outputs translated back into the images to show what the car sees in our representation of the world, or rather the eight cameras' representation of it. We finally have the video module's output, which can be sent in parallel to all the car's tasks, such as object detection, lane prediction, traffic lights, etc.

If we summarize this architecture: we first have the eight cameras taking pictures. These are calibrated and sent into a CNN that condenses the information, extracting it efficiently and merging everything before sending it into a transformer architecture that fuses the information coming from all eight cameras into one 3D representation. Finally, this 3D representation is cached over a few frames and then sent into an RNN architecture that uses all these frames to better understand the context and output the final version of the 3D space, which is sent to the tasks that can finally be trained individually and all run in parallel to maximize performance and efficiency.

As you can see, the biggest challenge for such a task is an engineering challenge: making a car understand the world around us as efficiently as possible through cameras and speed sensors, so that everything runs in real time and with close-to-perfect accuracy on many complicated human tasks. Of course, this was just a simple explanation of how Tesla's Autopilot sees our world. I strongly recommend watching the amazing video on Tesla's YouTube channel, linked in the references below, for more technical details about the models they use, the challenges they face, the data labeling and training process with their simulation tool, their custom software and hardware, and the navigation. It is definitely worth the time. Thank you for watching!
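As a recap of that temporal stage, here is a rough sketch of a feature queue implementing the "every meter or every 27 milliseconds" push rule, feeding a small RNN. It is a deliberate simplification with hypothetical names: the kinematic data is omitted and each frame is pooled down to a single vector.

```python
import torch
import torch.nn as nn
from collections import deque

class FeatureQueue:
    """Toy feature queue: cache frames and push a new one every
    1 m travelled OR every 27 ms, so the cache keeps filling both
    while the car is moving and while it is stopped."""
    def __init__(self, maxlen=12):
        self.frames = deque(maxlen=maxlen)
        self.dist_since_push = 0.0
        self.time_since_push = 0.0

    def step(self, bev_frame, dist_m, dt_ms):
        self.dist_since_push += dist_m
        self.time_since_push += dt_ms
        if self.dist_since_push >= 1.0 or self.time_since_push >= 27.0:
            self.frames.append(bev_frame)
            self.dist_since_push = 0.0
            self.time_since_push = 0.0
        return list(self.frames)

# Video module stand-in: a GRU digesting the queued frames in order.
gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
queue = FeatureQueue()
for _ in range(30):  # fake drive: ~0.3 m and 10 ms per tick
    frames = queue.step(torch.randn(64), dist_m=0.3, dt_ms=10.0)
seq = torch.stack(frames).unsqueeze(0)  # (1, n_frames, 64)
out, h = gru(seq)  # context built across frames, robust to occlusions
print(seq.shape, out.shape)
```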

References

►Read the full article:
►"Tesla AI Day", Tesla, August 19th, 2021,
►My Newsletter (A new AI application explained weekly to your emails!):
