War for honor
Here is how it all started. I wanted to find some classic movies to enjoy, so I asked an "old driver" (a seasoned hand at finding such things) for resources. To be clear, I wanted regular videos, definitely not the kind of resources he usually looks for. He tossed me a link to a video resource website that is said to be fairly well known. I took his word for it and excitedly opened the site to search for classic movies, which is what led to a classic battle with Baidu network disk.
Disclaimer: The tools in this article are for personal testing and research only. Please delete them within 24 hours of downloading. They must not be used for commercial or illegal purposes; otherwise, you bear the consequences yourself. The screenshots in this article are for demonstration only. Please do not use them illegally.
I have to say, this is a legitimate website with regular videos. Judging by the titles alone, I was reading too much into it.
Full of curiosity, I clicked a link and saw the links to the video resource at the bottom of the page.
There are two kinds of resources here: a Baidu network disk link and a Thunder (Xunlei) torrent. I have to say this website is fairly conscientious compared with sites that only post pictures and leave no torrents. By normal logic, at this point I should have opened the resource address and quietly enjoyed it (no wait, I am not that kind of person), so instead I silently saved the resource to my network disk. I then noticed a few more excellent works on the network disk and suddenly felt a lot better, but adding just a few works did not satisfy my urge to collect. So I began to explore how to add video resources to Baidu network disk quickly, and that set off a series of struggles with Baidu network disk.
First, after looking at the site's URL structure and the source code of its pages, I decided to use a crawler to collect the resource link addresses.
This went without any major problems. I used Python with coroutines for the collection and soon had part of the resource addresses:
Baidu network disk resource address:
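The collection step described above might be sketched roughly like this. It is only a minimal illustration, not the script I actually ran: `fetch` is a hypothetical coroutine that returns the HTML of a page, and the link pattern is an assumption based on the share URL format shown in this article.

```python
import asyncio
import re

# Assumption: share links look like http://pan.baidu.com/s/<id>.
PAN_LINK = re.compile(r"https?://pan\.baidu\.com/s/[A-Za-z0-9_-]+")


def extract_pan_links(html):
    """Return the unique share links found in html, in order of appearance."""
    links = []
    for link in PAN_LINK.findall(html):
        if link not in links:
            links.append(link)
    return links


async def collect_links(page_urls, fetch, limit=10):
    """Fetch all pages concurrently (at most `limit` at once) and gather links.

    `fetch` is a hypothetical coroutine function: url -> HTML text.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            html = await fetch(url)
        return extract_pan_links(html)

    per_page = await asyncio.gather(*(worker(url) for url in page_urls))
    # Flatten the per-page lists into one list, keeping page order.
    return [link for links in per_page for link in links]
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; any coroutine-based client could be plugged in.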
By the time I finished the data collection script it was 11 o'clock at night, and I should have washed up and gone to bed. However, the pull of technical exploration kept me going. I now had the resource addresses, but each Baidu network disk link still had to be opened by hand and then saved to my own network disk. That costs too much effort, so I decided to keep going and look for a way to add resources to Baidu network disk automatically.
Note: What follows is the key technical content of this article, and it decides the final outcome of my battle with Baidu network disk. Please stay with me; the best part is coming up.
First, I analyzed the characteristics of Baidu's share pages by capturing packets, viewing the source code, inspecting elements, and so on, to judge whether they are suitable for crawling.
After a series of tests, I found that although the process is a bit tortuous, a crawler can still automate adding resources to the network disk.
To implement this, I summarized the following steps:
- Get the user's cookies (log in manually and capture the packets).
- Fetch a share page such as http://pan.baidu.com/s/1o8LkaPc and get its source code.
- Parse the source code and extract the name of the shared resource, shareid, from (uk), bdstoken, and appid (app_id).
- Construct the POST request (used to add the resource to the network disk); it needs the 4 parameters above plus the cookies.
Many tools can capture cookies. I used Firefox's Tamper plugin. The effect is as follows:
Capture the login packet:
Inspect the request sent at login and you will find the account name and password. What we need here, though, is the cookie, which can be seen in the response.
The format of the cookie is as follows:
Since this cookie belongs to a personal account, I altered it, but the format is the same.
The request page is: http://pan.baidu.com/s/1o8LkaPc
With the cookie in hand, you can put its value in the headers when requesting a Baidu share page, so that the request is made as a logged-in user. I failed several times at this stage because other header parameters also need to be added (and if the cookie is not included, the result returned is "Page does not exist").
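As a rough sketch of this step, the request could be built with the standard library as follows. The `User-Agent` and `Referer` headers are my assumption of what the "other header parameters" might be, since the exact set is not listed here, and `build_share_request` is a name made up for illustration.

```python
import urllib.request


def build_share_request(share_url, cookie):
    """Build (but do not send) a GET request for a share page."""
    headers = {
        # The cookie string captured after a manual login.
        "Cookie": cookie,
        # Assumption: browser-like headers such as these are among the
        # extra headers the server checks; the exact list is not known.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) "
                      "Gecko/20100101 Firefox/50.0",
        "Referer": share_url,
    }
    return urllib.request.Request(share_url, headers=headers)
```

The page source can then be read with `urllib.request.urlopen(request).read()`, provided the cookie is still valid.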
Once the request succeeds, the page source contains the content we need: the name of the shared resource, shareid, from (uk), bdstoken, and appid (app_id).
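A minimal sketch of the parsing step, under a loud assumption: that the values appear in the page source as JSON-style `"key":value` pairs. The real layout may differ, so treat the pattern as a placeholder to adapt. The resource name is left out because the key it is stored under is not shown in this article.

```python
import re

# The four parameters needed later; "uk" is sent as "from" in the POST URL.
FIELDS = ("shareid", "uk", "bdstoken", "app_id")


def parse_share_page(html):
    """Extract the four parameters from the page source as strings."""
    found = {}
    for name in FIELDS:
        # Assumption: pairs look like "shareid":123 or "bdstoken":"abc".
        match = re.search(r'"%s"\s*:\s*"?([^",}\s]+)"?' % name, html)
        if match:
            found[name] = match.group(1)
    return found
```

A missing key is simply absent from the result, so the caller can check that all four were found before going on.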
First, look at how the POST request is constructed:
The URL of the POST request carries several parameters; fill them in with the values obtained above. There is also a logid parameter whose content can be written freely; it appears to be a random value that is then base64-encoded (base64 is an encoding, not encryption).
In the POST payload, filelist is the resource name, in the format filelist=["/name.mp4"], and path is the directory to save to, in the format path=/pathname
The cookie must be included; it is the cookie value obtained earlier.
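Putting the description above together, the request could be assembled along these lines. This is only a sketch: the `endpoint` is passed in as a parameter because the actual URL is not spelled out in the text, and `build_transfer_request` and the `fields` dictionary are names made up for illustration.

```python
import base64
import json
import os
from urllib.parse import urlencode


def build_transfer_request(endpoint, fields, filename, save_path, cookie):
    """Return (url, body, headers) for the "add to my disk" POST request.

    `fields` holds the values taken from the share page: shareid, uk,
    bdstoken and app_id.
    """
    # logid: a random value, base64-encoded, as described above.
    logid = base64.b64encode(os.urandom(16)).decode("ascii")
    query = urlencode({
        "shareid": fields["shareid"],
        "from": fields["uk"],
        "bdstoken": fields["bdstoken"],
        "app_id": fields["app_id"],
        "logid": logid,
    })
    body = urlencode({
        "filelist": json.dumps(["/" + filename]),  # e.g. ["/name.mp4"]
        "path": save_path,                         # e.g. /pathname
    })
    headers = {
        "Cookie": cookie,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return endpoint + "?" + query, body.encode("ascii"), headers
```

The three return values can be handed to any HTTP client; with the standard library that would be `urllib.request.Request(url, data=body, headers=headers)`.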
Finally, if you see the response above, the resource has been added to the network disk successfully. Any other errno value indicates an error; for example, 12 means the resource already exists.
Writing the code took nearly an hour, most of it spent debugging and studying the packets. I ran into a lot of pitfalls, but they were all solved in the end.
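Checking the result could look something like the sketch below. It assumes the response body is JSON with an `errno` field and that 0 means success; only the meaning of 12 is stated in the text, so the success value is my assumption.

```python
import json


def describe_result(response_text):
    """Turn the JSON response of the POST request into a short status."""
    errno = json.loads(response_text).get("errno")
    if errno == 0:  # assumption: 0 signals success
        return "added"
    if errno == 12:  # stated above: the resource already exists
        return "already exists"
    return "error (errno=%s)" % errno
```

Treating 12 as its own case is handy when a batch is run again, since resources that are already saved do not have to count as failures.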
Enjoy the thrill of running the program:
The result in Baidu network disk:
By the time I finished all this and wrote this article, it was almost midnight. I only ran a small part of the video resources; the rest will continue tomorrow. (Is it really this hard just to watch a video??)
I will release the source code tomorrow. For today, here is my network disk share: https://pan.baidu.com/s/1nvz74Vn
Project GitHub address: https://github.com/tengzhangchao/BaiDuPan