PySceneDetect – A tool for detecting scenes in movies (pyscenedetect.readthedocs.io)
187 points by josefslerka on Sept 1, 2019 | 62 comments


My PhD (10+ years ago) was in video search, and one of the proposed methods for video comparison was using the shot durations of the video. Note that by shots I mean cuts in the camera flow; IIRC Hollywood movies have such a shot cut every 4 seconds on average (for example, when two people talk the camera will move from one person to the other).

I remember that I used two techniques for extracting scene cuts:

* Difference in the brightness (Y) histogram of the YUV video between frames; when that difference is more than a threshold there's a scene cut

* Counting the number of Intra macroblocks per frame on an H.264 encoded video; if that number was more than a threshold then there's a scene cut
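The first technique above can be sketched roughly as follows. This is a minimal illustration, not the code from the PhD work: frames are assumed to be already decoded into rows of 8-bit luma values, and the bin count, the L1 distance, and the 0.5 threshold are all placeholder choices.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # rows of 8-bit luma (Y) values


def luma_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized histogram of the 8-bit luma values in one frame."""
    hist = [0] * bins
    count = 0
    for row in frame:
        for y in row:
            hist[y * bins // 256] += 1
            count += 1
    return [h / count for h in hist]


def histogram_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Indices of frames whose luma histogram jumps relative to the previous frame."""
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = luma_histogram(frame)
        if prev is not None:
            # L1 distance between two normalized histograms lies in [0, 2].
            diff = sum(abs(a - b) for a, b in zip(hist, prev))
            if diff > threshold:  # assumed threshold, tune per source
                cuts.append(i)
        prev = hist
    return cuts


dark = [[20] * 8 for _ in range(8)]
bright = [[220] * 8 for _ in range(8)]
print(histogram_cuts([dark, dark, bright, bright]))  # [2]
```

A histogram ignores where pixels are, so camera motion within a shot changes it little, while a cut to a differently lit shot changes it a lot.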


Author of PySceneDetect here. The current implementation does exactly what you hint at, except instead of YUV, it considers deltas in the HSV domain (specifically differences in hue and color).

Other techniques being considered for future work include use of optical flow, background subtraction, and analyzing histograms.


From what I remember the Y (luma) component in a YUV video has more information than the other two components and it could also be extracted without the need to fully decompress the video (in mpeg compressed videos). Of course this info is more than 10 years old (I don't really do any video research any more) so I guess there should have been progress in that area.


This is indeed correct, I'm just using HSV instead of YUV, but the primary source of information is the luma/brightness component (although currently all 3 of the HSV components are averaged, so perhaps a better weighting may improve precision).
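As a rough sketch of the idea described here (not PySceneDetect's actual code): convert two frames to HSV, take the mean absolute difference per channel, and average the three channels. The `weights` parameter is hypothetical and only illustrates the "better weighting" mentioned above; frames are assumed to be flat lists of 8-bit RGB pixels.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB
Frame = Sequence[Pixel]       # flattened list of pixels


def to_hsv(frame: Frame) -> List[Tuple[float, float, float]]:
    """Convert 8-bit RGB pixels to HSV with every channel in [0, 1]."""
    return [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in frame]


def hsv_delta(a: Frame, b: Frame, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the per-channel mean absolute HSV differences."""
    hsv_a, hsv_b = to_hsv(a), to_hsv(b)
    n = len(hsv_a)
    # Simplification: hue is treated as linear, ignoring that it wraps around.
    channel_means = [
        sum(abs(pa[c] - pb[c]) for pa, pb in zip(hsv_a, hsv_b)) / n
        for c in range(3)
    ]
    return sum(w * m for w, m in zip(weights, channel_means)) / sum(weights)


black = [(0, 0, 0)] * 4
white = [(255, 255, 255)] * 4
print(hsv_delta(black, white))                     # equal weights
print(hsv_delta(black, white, weights=(0, 0, 1)))  # value/brightness only
```

A cut would be declared when the delta between consecutive frames exceeds some threshold. With equal weights, a pure brightness change is diluted by the two unchanged channels, which is one reason a different weighting might help.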


What if the shot is of two people having a conversation in a disco, with lots of lights flashing (stroboscope), etc?


Super interesting! Mind explaining what an Intra macroblock is?


The image for an MPEG compressed video is split into 16x16 blocks. For each frame (excluding the 1st of course), the compression algorithm tries to match each block with a block in the previous frame (it searches the previous frame to see where there are the fewest differences). If it can, it only encodes the differences and the position in the previous frame; this is called an inter (predicted) block. If however it can't match that block with the previous frame then it needs to re-encode it from scratch; that's the intra macroblock. As you can imagine, after a shot cut there will be many more intra macroblocks.

There is some more info on Wikipedia: https://en.wikipedia.org/wiki/Macroblock and there's also a nice article explaining some of the magic of H.264 compression: https://sidbala.com/h-264-is-magic/
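To make the block-matching idea concrete, here is a toy simulation. It is only a sketch: a real detector would read the macroblock types the encoder already wrote into the bitstream rather than redo the search, and real encoders choose block types by rate-distortion cost, not a fixed cutoff. The search range and the `max_sad_per_pixel` cutoff are assumed values.

```python
from typing import Sequence

Frame = Sequence[Sequence[int]]  # rows of 8-bit luma values


def block_sad(cur: Frame, prev: Frame, y: int, x: int,
              py: int, px: int, size: int) -> int:
    """Sum of absolute differences between a block of cur and a block of prev."""
    return sum(
        abs(cur[y + dy][x + dx] - prev[py + dy][px + dx])
        for dy in range(size)
        for dx in range(size)
    )


def count_intra_blocks(cur: Frame, prev: Frame, size: int = 16,
                       search: int = 2, max_sad_per_pixel: float = 10.0) -> int:
    """Count blocks of cur that have no acceptable match nearby in prev."""
    height, width = len(cur), len(cur[0])
    intra = 0
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            best = None
            # Exhaustive search in a small window around the block position.
            for py in range(max(0, y - search), min(height - size, y + search) + 1):
                for px in range(max(0, x - search), min(width - size, x + search) + 1):
                    sad = block_sad(cur, prev, y, x, py, px, size)
                    if best is None or sad < best:
                        best = sad
            if best > max_sad_per_pixel * size * size:
                intra += 1  # no good match: would be coded from scratch
    return intra


same_shot = [[10 * x for x in range(8)] for _ in range(8)]
new_shot = [[200] * 8 for _ in range(8)]
print(count_intra_blocks(same_shot, same_shot, size=4))  # 0
print(count_intra_blocks(new_shot, same_shot, size=4))   # 4
```

A cut is then flagged when the intra count for a frame exceeds a threshold, for instance a large fraction of all blocks.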


Intra search takes place in the same frame, using the MBs encoded so far in that frame.


Can the link be changed to the home page of the docs instead of the CLI params one? https://pyscenedetect.readthedocs.io/en/latest/


Am I stupid, or is it really hard to get from a readthedocs page to the github repo it is generated from?

It seems like the only way to do it is click edit, get an error message and then hit backspace in the url to get to the root of the repository.


Author of PySceneDetect here. Sorry about that, this is one thing I haven't figured out how to fix with the generated documentation yet.

Now that you mention it, I might ask the Readthedocs folks to see if they have any idea why this is happening.


Even if that is fixed, a link to the repo on the left column[1] would be nice. Maybe you don't want to edit that page, just encountered the docs first, e.g. from HN :D. This project sounds pretty interesting, I'll take a look!

[1]See e.g. docs of requests: https://requests.readthedocs.io/en/master/


Great idea, will give that a shot - thanks for the suggestion! :)


No need to apologise! Thank you for making such an awesome library - can't wait to try it out on some of my home videos :p


Funny you should mention that - initially home videos were what I was doing this with, as I was unable to find any existing programs that could handle them reliably (or had various shortcomings, or weren't hackable/extensible, etc...). Then I realized that what I came up with actually competed pretty well with the existing solutions (benchmarks below [1]), so decided to open source it and share it with the world.

Of course, the project is still very basic in its current form, but does offer a good platform for testing more advanced scene detection methodologies (and has also been used as a baseline in various research/academic contexts). There have been plenty of suggestions for other detection methods (e.g. using histograms) which I'm very interested in looking into adding for future releases.

Lastly, one thing I do want to address going forwards is the performance, to make the algorithms more suitable for real-time systems (have decided to rewrite the core algorithms in C++ when I have some time, and interface that with the Python library/CLI). That being said, I'm always open to feedback or different ways of approaching things. If you or anyone else has any feedback or suggestions, feel free to create a new issue on the Github issue tracker [2] and I would be happy to discuss the possibility of including it in a future release of PySceneDetect.

[1] https://github.com/albanie/shot-detection-benchmarks

[2] https://github.com/Breakthrough/PySceneDetect/issues


Tap the black and green widget that reads “v:latest” and it expands with a link to the repo.


I'm either stupid too, or it's hard, or both.


Author of PySceneDetect here: Thank you all for the thought-provoking discussions, and the attention you've given my side project. There are some specific cases where PySceneDetect achieves great accuracy (like fast cuts or fades), and some where it's currently not so good (like sudden flashes or large obstructions). That being said I do want to track these things and come up with solutions to improve the robustness of the content detection algorithm over time.

I'm most open to any feedback or feature requests/ideas/suggestions; feel free to check out the issue tracker on Github, or create a new entry:

https://github.com/Breakthrough/PySceneDetect/issues

Some ideas being considered/researched for future releases:

- looking at changes to image histograms

- using edge detection to improve robustness

- camera flash / foreground object suppression

- automatic threshold detection using statistical methods (currently is just a heuristic)
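For illustration, one common statistical approach to the last item is to compare each frame's change score against the mean plus a few standard deviations of the preceding scores. This is a generic sketch, not what PySceneDetect does or plans to do; `window`, `k`, and `min_score` are assumed parameters, and the scores are taken to be per-frame change values such as HSV deltas.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def adaptive_cuts(scores: Sequence[float], window: int = 5, k: float = 3.0,
                  min_score: float = 1.0) -> List[int]:
    """Flag frames whose score stands out from the preceding `window` scores."""
    cuts = []
    for i in range(window, len(scores)):
        recent = scores[i - window:i]
        # Threshold follows the recent activity level of the video.
        limit = mean(recent) + k * pstdev(recent)
        # min_score guards against flagging noise in perfectly static footage.
        if scores[i] > max(limit, min_score):
            cuts.append(i)
    return cuts


scores = [1.0, 1.2, 0.9, 1.1, 1.0, 30.0, 1.1, 1.0]
print(adaptive_cuts(scores))  # [5]
```

In a busy action sequence the recent scores are high, so the limit rises and fewer false cuts are reported; in a quiet scene the limit drops and subtle cuts are still caught.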


So this is likely beyond the scope of your project, but I've always thought a really good project would be a website to host scene indexes for movies and TV.

Eg. Let's say that you wanted to watch a prerecorded football game or baseball game without all the commercials, timeouts, commentators talking about the fans, etc.

Or... Let's say that you wanted to re-cut a movie in a certain way, by re-ordering the scenes, you could just generate a new scene data file and let the encoder/player use that.


This is still relevant I think :) What you mentioned is very similar to an edit decision list (EDL [1]), which I only learned of recently. I had a feature request [2] to support EDL as an output format, and upon further investigation, it seems like the format is very similar to what you're talking about. The Wikipedia page also indicates that VLC supports xspf files, but I haven't done much research into that yet ("XML Shareable Playlist Format").

[1] https://en.wikipedia.org/wiki/Edit_decision_list

[2] https://github.com/Breakthrough/PySceneDetect/issues/101


So I can look forward to a "Next scene" shortcut in my video player of choice soon? (Without needing embedded "DVD" chapters) While also getting nice thumbnails for them?

Pornhub et al often show scene markers, but I have assumed they're manual or extracted in the same way as DVD chapters.


Tested it out on an anime episode to see if it could detect accurately when the opening and closing credits play. It seemed to be able to detect the ending of both even with a high threshold, but it didn't really detect the start. And with a low threshold, it gave me like 350 chapters for a 22 minute video, which seemed a bit excessive.


Hi Hamuko;

Would you be able to share a small sub-set of the episode, in particular the area where you're unable to detect the starting segment? (If not, no worries!)

There's a few issues with PySceneDetect currently that may lead to false or missed detections, but these are things that I would like to solve in the long run:

- threshold is heuristic/fixed right now, but I would like to change it to an adaptive/statistical method which can dynamically change

- single-frame events can trigger false scene changes

Thanks for your feedback, and feel free to share any other suggestions you might have.


Little off topic..

I'm trying to help a friend in media industry. His requirement is ability to identify different voice in a movie and have the output time stamped - Ex. Voice A: 00.00.00sec - 00.05.30sec, Voice B: 00.05.31 - 00.06.30, etc.

It would be very helpful if anyone can point to any tools that exists that can do that (open source or otherwise).


If AWS gets you going, definitely go for it, but if you're looking for existing tools, search for 'python speaker diarization' on GitHub.

Most of what you'll find will require downsampling your audio to 16 kHz, and you'll find a combination of NN-based diarizers and pre-ML HMM-based models.

One thing to note: a lot of the systems will work well for interviews, broadcast media and footage taken from a camera because the audio will tend to be clean.

Film and movies will be a challenge because the background music can be identified as a separate voice. It tends to confuse the system.

I haven't used the cloud-based systems on audio with the kind of background sounds you tend to find in films.


AWS offers this for $1.44 per hour


Thanks. Are you referring to Transcribe? https://aws.amazon.com/transcribe/

Transcribe seems to be more for speech-to-text, irrespective of who is speaking.

Here the requirement is to identify the unique voices. Ex. if "Mary had a little lamb" is voiced by two different voices then the engine should identify Voice A said "Mary" at 00.00.00sec-00.00.01sec and then Voice B said "had a little" at 0.00.02sec-00.00.03sec, then Voice A again said "lamb".. etc.


From your link:

Recognize Multiple Speakers

Amazon Transcribe is able to recognize when the speaker changes and attribute the transcribed text appropriately. This can significantly reduce the amount of work needed to transcribe audio with multiple speakers like telephone calls, meetings, and television shows.

So they claim to be able to do it. However, having to do speaker diarization doesn't exactly make speech recognition easier, so you should adjust your expectations regarding the error rate accordingly.


Duh. It was right there.

But thanks; after a bit of searching for 'speech diarization' I saw that Google Cloud has that service. https://cloud.google.com/speech-to-text/docs/multiple-voices

And then I came across this comparison of various cloud transcription services, which should be helpful for evaluating the diarization aspect as well.

AI-POWERED TRANSCRIPTION SERVICES SHOWDOWN: AWS VS. GOOGLE VS. IBM WATSON VS. NUANCE https://www.armedia.com/blog/transcription-services-aws-goog...


Transcribe does “Recognize Multiple Speakers” — it’s on the page that you linked to, in the list of features.

I would also recommend comparing it to Google’s and maybe MS/Azure’s services.

Aside from the fidelity of the transcription itself, and the accuracy of disambiguating the different speakers, I’m also not convinced that all of these services will give you an end timestamp (start timestamps are sometimes there, but not necessarily for every word or sentence).

You could try a multi-pronged approach by doing the transcript to get text and speakers only (using the services mentioned above), and then using an aligner such as “gentle” [0] to find the start / end times.

You will have some gaps and wrongly transcribed words, but it may be a start..!

[0] https://github.com/lowerquality/gentle


I've used this tool for breaking up movie trailers and old Vines. I really didn't like the output quality.

I ended up using optical flow and key frame information instead.


Hi genp;

Just curious, what didn't you like about the output quality? What version of PySceneDetect were you using?

The latest version (v0.5.x) uses ffmpeg instead of mkvmerge for output by default now, which produces significantly more accurate and higher quality output than before.

That being said, you are correct in that optical flow and keyframe information is currently not being used during the detection phase. There are several proposals to incorporate this into a future release, however, along with several other techniques:

- histogram analysis

- edge detection

- background subtraction

- automatic or dynamic threshold detection

Thanks for your feedback!


> I ended up using optical flow

Optical flow for scene detection? How does this work?


This was my assignment in the "artificial intelligence" course I took at my University in Naples (Italy) in 2003. My NN could recognise scene cuts with about 50% probability :)


A film is a collection of shots, scenes and sequences. Shots are fairly easy to detect. Most editing software comes with built-in shot detection.

This approach promises to detect scenes. I’m not enough of a coder to know if it is valid, but I can certainly imagine that it is a solvable problem. Most scenes are defined by location, and movie makers like each location to be distinct from the preceding one.

To my mind, sequences are where the action is at. Sequences are defined by narrative development... These can usually be defined by a trained human, but are probably tough to define computationally.

There is also a huge difference between film genres. The scenes and sequences of Finding Nemo are pretty easy to define. But you try the same approach on an art house film like The Scent of Green Papaya, and see how far you get.

https://www.helpingwritersbecomeauthors.com/movie-storystruc...


Does this program really detect scenes by the film-making definition? In video encoding terminology, shots are commonly called "scenes". E.g. x264's "scenecut" option really sets the sensitivity for detecting new shots, so that I-frames can be inserted there to avoid the inefficiency of encoding differences from a different shot.


Author of the program here. You are correct, the program detects shots rather than scenes, but I didn't want to give the impression that this project was related to the existing ShotDetect program. I felt that the documentation explained this well enough, but I'm open to considering an alternative project name if anyone has a suggestion.


I had fun some years ago working on some sentiment analysis, and found if you took subtitles you could often see where general changes happened. Splitting that over where the histogram jumped you could more reasonably split things over camera cuts. Plot an exponentially smoothed sentiment and you could see patterns for types of content (sadness then uplift at the end of current affairs shows for example).

I built it for auto-summarisation of video and categorisation as a few day hack and it was OK (better than I expected, not good enough to use). Totally failed at sports though, as the commentary usually has more varied speech when there's little happening.


Detecting scenes seems incredibly difficult to me. Even in a basic edit you could have back to back shots looking at very different parts of a single set. Take the "Jack Rabbit Slims" scene from Pulp Fiction - Vincent Vega walking through the diner, the car table head to head scene about the milkshake with linking cuts around the diner when they discuss the two Marilyns, and the beginning of the Twist competition are three scenes that use the same location, actors, background, etc. Finding the boundaries would be very hard.


One approach might be to estimate the intrinsic parameters of the camera based on keypoint tracking. Assuming the camera was adjusted between shots, you might be able to get that to work.

Or you could make some kind of generic feature vector from each frame that includes contrast, etc. The interiors are the same, but the camera focus/fov will likely be different.

Nowadays just take an annotated movie and throw it into an encoder/decoder network to assess if a frame is from the same scene?


Surprisingly, this uses a fairly simple heuristic to accomplish just that (HSV deltas). I'm the author of the program, and it actually works pretty well with Pulp Fiction as a whole.

The one area where PySceneDetect does struggle with currently, however, is dealing with sudden/rapid flashes, or momentary obstructions to the viewing scene.


Yeah.

It would be kind of awesome if we could “hash tag” scenes such that at some point we could search for “remember that #action scene in movie X where the [something happens]” and it can pull up that clip of the scene...


You could get pretty close to that by allowing people to bookmark and tag timestamps in films. In fact I'm pretty sure we are close to being able to do it, but implementations lag because it would be a feature for services that do not generally compete on features but only on content.


Searching over subtitles works well for some of this; if you transcribed the audio description you'd probably have something pretty cool.


I've seen a proof of concept which combined the output of PySceneDetect with subtitle information and computer vision to allow you to do something like "go to the scene with the big castle" or something similar. I can't remember what it's called off the top of my head, but it seemed like a pretty cool concept.

Disclaimer: I'm the author of PySceneDetect.


God i love HN


If you mix pictures and audio the scene detection is bound to be more accurate since there are usually patterns there in most movies.


Author of the program here. This is definitely something that is worthwhile considering, although I'm not too sure if a correlation always exists between the audio and picture patterns (e.g. in a music video, scene transitions may not be exactly synchronized with the audio).

I've also seen people combine the cuts detected by PySceneDetect with subtitle (.srt) information to generate a more comprehensive list of scenes, which are sets of shots joined together by some context (to allow for jumping forward to a particular scene by its description).


Beyond all that, good filmmakers often use edits to manipulate or confound expectations about time and space.

However, there are scenes in the narrative sense, but also scenes in the way an AD making a schedule will see them, that is, scene headings (sluglines), which would be easier to differentiate and detect.


The example they show in the repo seems to indicate that it detects shots rather than scenes


Author of the program here. Yes, the program detects shots rather than scenes, but I didn't want to give the impression that this project was related to the existing ShotDetect program. I felt that the documentation explained this well enough, but I'm open to considering an alternative project name if anyone has a suggestion.


PyShotSegmentation?


PyShotSleuth PySpotShot PyShotFinder


PyShotFinder is good


Odd, scene differentiations often seemed like arbitrary slices just meant for trying to get to a certain point in the movie, long ago when I'd look at them on DVDs. Not something where you could objectively say "this is where one scene ends and another begins" and others would come to the same conclusion on their own.


Can it detect subliminal frames? Would an adequate threshold also lead to many false positives (triggering on any brutal massive change, for example a blast-inducing explosion)?


You could differentiate single-frame inserted "subliminal" images (like the ones in Fight Club) from other flashy things like explosions by the fact that the frames right before and after the flash should be nearly identical or very similar. After an explosion, by contrast, things have usually changed a bit.
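A minimal sketch of that check, assuming frames are flat lists of 8-bit luma values; the `jump` and `same` thresholds are made-up numbers for illustration:

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened 8-bit luma values


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def single_frame_inserts(frames: Sequence[Frame], jump: float = 30.0,
                         same: float = 5.0) -> List[int]:
    """Indices of frames that differ from both neighbours while the neighbours match."""
    found = []
    for i in range(1, len(frames) - 1):
        before, current, after = frames[i - 1], frames[i], frames[i + 1]
        if (mean_abs_diff(before, current) > jump
                and mean_abs_diff(current, after) > jump
                and mean_abs_diff(before, after) < same):
            found.append(i)
    return found


scene = [50] * 16
flash = [255] * 16
aftermath = [120] * 16
frames = [scene, scene, flash, scene, scene, flash, aftermath, aftermath]
print(single_frame_inserts(frames))  # [2]
```

Frame 2 is reported because the picture returns to what it was; frame 5 is not, because what follows the flash differs from what preceded it.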


This is definitely a good idea, and something that I'm most open to considering for a future release of PySceneDetect. Admittedly the current version does not handle single-frame "upsets" like this, but this does seem like a logical and reasonable approach to a first attempt at filtering them out.


I would do an exponential smoothing of pixel values over some timescale, say, 0.2 seconds, before further detection of scene changes. That should do the trick.
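Something along these lines, as a sketch: the smoothing factor is derived from the frame rate and the 0.2 second timescale, and frames are assumed to be flat lists of pixel values. The 25 fps default is an assumption.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # flattened pixel values


def smooth_frames(frames: Sequence[Frame], fps: float = 25.0,
                  timescale: float = 0.2) -> List[List[float]]:
    """Exponentially smooth each pixel over time with the given time constant."""
    # Per-frame blend factor so a step change decays with time constant `timescale`.
    alpha = 1.0 - math.exp(-1.0 / (fps * timescale))
    smoothed = []
    state = None
    for frame in frames:
        if state is None:
            state = list(frame)  # start from the first frame as-is
        else:
            state = [s + alpha * (p - s) for s, p in zip(state, frame)]
        smoothed.append(state)
    return smoothed


flash_clip = [[50.0]] * 3 + [[250.0]] + [[50.0]] * 3
print([round(f[0], 1) for f in smooth_frames(flash_clip)])
```

A one-frame flash of 200 levels is reduced to a bump of under 40 levels. The trade-off is that a real cut is also spread over several frames, so the detection threshold applied afterwards would need to be lowered accordingly.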


Currently, flashes and other single-frame events are not dealt with gracefully by PySceneDetect, although there is an open feature request for that: https://github.com/Breakthrough/PySceneDetect/issues/35

(Disclaimer: I'm the author of PySceneDetect.)


Testing with some older black and white shows, the results aren't accurate at catching fades-to-black. Any options I should be playing with?


Thanks for giving PySceneDetect a try. Can you share what you're using as the command line arguments?

For detecting fades-to-black, you want to make sure that you're using the detect-threshold command (not detect-content). For example:

    scenedetect -i somevideo.mp4 detect-threshold -t 12 list-scenes

Where -t specifies the threshold to use (the default being a value of 12). Full documentation for the detect-threshold command:

https://pyscenedetect-manual.readthedocs.io/en/latest/cli/de...



