I am working on (infant) human body detection in videos. I have read that the best way to track human movements is to exploit both spatial and temporal information, and every paper I have found uses a modified U-Net: an encoder built from iterative convolutions, followed by a decoder built from iterative deconvolutions (transposed convolutions).
Some details about my ground truth: I have access to grayscale depth frames and to the positions of the limb joints in each frame. From these data I have created bounding boxes (here called bboxes, containing the pixel positions of small circles or rectangles built around every joint) and masks (logical matrices of the same size as the frames, whose only non-zero entries are the ones defined by the bboxes). So in the end I have as many bboxes and masks per frame as joints I'm studying. I created the masks because I was planning to use a Mask R-CNN, but I couldn't find a way to do that.
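To make the ground-truth construction concrete, here is a minimal sketch of how one joint's mask and bbox are built. The frame size, joint position, and circle radius below are example values I made up for illustration, not the actual ones from my data:

```matlab
% Sketch: build a logical mask with a small circle around one joint,
% then derive its bounding box. H, W, jointXY and r are example values.
H = 240; W = 320;            % frame height and width (mine may differ)
jointXY = [160 120];         % [x y] pixel position of one joint
r = 5;                       % radius of the circle drawn around the joint

[cols, rows] = meshgrid(1:W, 1:H);
mask = (cols - jointXY(1)).^2 + (rows - jointXY(2)).^2 <= r^2;  % logical H-by-W

% The bbox is just the bounding rectangle of the non-zero region:
stats = regionprops(mask, 'BoundingBox');
bbox  = stats.BoundingBox;   % [x y width height]
```

With as many such mask/bbox pairs per frame as there are joints, each frame's full label is the set of these logical matrices (or their union, if a single binary mask is enough).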
I have built a network, although I can't guarantee that it works: my problem is that I have no idea how to train it! Whatever combination of (inputData, trainFunction) I have tried hasn't succeeded. The errors I receive are never about the net itself, so I think it is useless to give further information about it; I will only say that the first layer is an image3dInputLayer, since I intend to extract temporal features by stacking consecutive frames.
I would be glad if you could help me with this. I attach a workspace containing all the variables you need to know about. With those variables I am trying to create an imds, a gTruth, or whatever else could work as input for training. Do you have any idea what approach could work in my case?
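For reference, the general pattern of my attempts has looked roughly like the sketch below (the folder paths, class names, and label IDs are placeholders, not the actual contents of my workspace):

```matlab
% Sketch of the kind of training setup I have been attempting.
% Paths and names below are placeholders for illustration only.
imds = imageDatastore('depthFrames/');                    % depth frames on disk

classes = ["joint" "background"];                         % example class names
ids     = [1 0];                                          % example label IDs
pxds = pixelLabelDatastore('maskImages/', classes, ids);  % masks saved as images

trainingData = combine(imds, pxds);                       % paired (frame, mask)

% net = trainNetwork(trainingData, layers, options);      % this step fails for me
```

I am not sure this pairing of datastores is even the right shape of input for a network whose first layer is an image3dInputLayer, which is part of what I am asking about.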
Thank you very much.